I have three datasets, A, B, and C, that were collected and processed separately but need to be merged into one master dataset. Each dataset will be updated at different time intervals, and the master dataset will be updated accordingly. Contributions from people who are unaffiliated with the project, such as open-source contributors, will also be accepted.
My datasets are all in CSV files and all the data processing is done using R.
Currently I have separate git repositories for datasets A, B, and C, each with the following project folder structure:
- source_data
- raw_data
- processed_data
- figure
- function
- markdown (the data-processing R Markdown file for each dataset)
Each dataset is currently copied manually into the master repo, where the datasets are merged into a master dataset, data summaries are produced, and the results are published on the GitHub Pages site.
From my Google searches, I found the following potential options:
- (a) Bring everything together in a single repo.
- (b) A single repo that stores all the datasets, with a separate branch for each dataset.
- (c) Git submodules (ideally not my preference; see https://medium.com/@uttamkini/sharing-code-and-why-git-submodules-is-a-bad-idea-1efd24eb645d).
- (d) Multiple repos (one for each dataset plus the master): but in this case, what would be the process for copying each dataset into the master repo each time a dataset is updated?
Which option do you think is best for me?
How would you best structure the git repo in the case of (a) or (b)?
Normally, git submodules remain a valid option: they allow your master dataset repository to reference a fixed version of each of your dataset repositories (see the sketch below).
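For illustration, here is a minimal sketch of how the master repo could pin each dataset repo as a submodule; the repository URLs and the `datasets/` paths are placeholders you would replace with your own.

```sh
# In the master repo: add each dataset repository as a submodule
# (URLs and paths below are placeholders)
git submodule add https://github.com/your-org/dataset-A.git datasets/dataset-A
git submodule add https://github.com/your-org/dataset-B.git datasets/dataset-B
git submodule add https://github.com/your-org/dataset-C.git datasets/dataset-C
git commit -m "Add dataset repositories as submodules"

# Later, when a dataset repo has new commits: pull them in and
# record the new pinned versions in the master repo
git submodule update --remote
git add datasets
git commit -m "Bump dataset submodules to the latest versions"
```

The master repo then always records exactly which commit of each dataset it was built from, which is the main advantage of submodules in this setup.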
Besides that, option (d) is the simplest: you could add a GitHub Action to each of your dataset repositories (not the master one) that updates the master dataset after each push to that dataset repository; a sketch of such a workflow follows.
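As a rough sketch only (the repository names, file paths, and the `MASTER_REPO_TOKEN` secret are assumptions you would need to adapt), a workflow in dataset A's repository could look something like this:

```yaml
# Hypothetical .github/workflows/update-master.yml in the dataset A repository
name: Update master dataset

on:
  push:
    branches: [main]
    paths:
      - "processed_data/**"   # only run when the processed data changes

jobs:
  copy-to-master:
    runs-on: ubuntu-latest
    steps:
      - name: Check out dataset A
        uses: actions/checkout@v4

      - name: Check out the master repo
        uses: actions/checkout@v4
        with:
          repository: your-org/master-dataset        # placeholder repo name
          token: ${{ secrets.MASTER_REPO_TOKEN }}    # PAT with write access, stored as a secret
          path: master

      - name: Copy the processed data and push
        run: |
          cp processed_data/dataset_A.csv master/source_data/   # placeholder file names
          cd master
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add source_data/dataset_A.csv
          git commit -m "Update dataset A" || echo "Nothing to commit"
          git push
```

The same workflow, with the file names changed, could go into repositories B and C; the master repo could then re-run the merge in R and republish the summaries whenever it receives a push.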