I have three datasets, A, B, and C, that were collected and processed separately but need to be merged into one master dataset. Each dataset will be updated at different time intervals, and the master dataset will be updated accordingly. Contributions from people who are unaffiliated with the project, such as open-source contributors, will also be accepted.
My datasets are all in CSV files and all the data processing is done using R.
Currently I have separate git repositories for datasets A, B, and C, each with the following project folder structure:
- source_data
- raw_data
- processed_data
- figure
- function
- markdown (the data-processing R Markdown file for each dataset)
Each dataset is currently copied manually into the master repo, where the datasets are merged into a master dataset, data summaries are produced, and the results are published on the GitHub Pages site.
From my Google searches, I found the following potential options:
- (a) Bring everything together in a single repo.
- (b) A single repo that stores all the datasets, with a separate branch for each dataset.
- (c) Git submodules (ideally not my preference; see https://medium.com/@uttamkini/sharing-code-and-why-git-submodules-is-a-bad-idea-1efd24eb645d).
- (d) Multiple repos (one for each dataset plus the master): but in this case, what would be the process for copying each dataset into the master repo each time a dataset is updated?
Which option do you think is best for me?
How would you best structure the git repo in the case of (a) or (b)?
Normally, git submodules remain a valid option: they allow your master dataset repository to reference a fixed version of each of your dataset repositories (see the sketch below).
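For illustration, here is a minimal sketch of how the master repo could pin each dataset repo as a submodule; the repository URLs and the `datasets/` paths are placeholders you would replace with your own.

```sh
# In the master repo: add each dataset repository as a submodule
# (URLs and paths below are placeholders)
git submodule add https://github.com/your-org/dataset-A.git datasets/dataset-A
git submodule add https://github.com/your-org/dataset-B.git datasets/dataset-B
git submodule add https://github.com/your-org/dataset-C.git datasets/dataset-C
git commit -m "Add dataset repositories as submodules"

# Later, when a dataset repo has new commits: pull them in and
# record the new pinned versions in the master repo
git submodule update --remote
git add datasets
git commit -m "Bump dataset submodules to the latest versions"
```

The master repo then always records exactly which commit of each dataset it was built from, which is the main advantage of submodules in this setup.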
Besides that, option (d) is the simplest: you could add a GitHub Action to each of your dataset repositories (not the master one) that updates the master dataset after each push to that dataset repository; a sketch of such a workflow follows.
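As a rough sketch only (the repository names, file paths, and the `MASTER_REPO_TOKEN` secret are assumptions you would need to adapt), a workflow in dataset A's repository could look something like this:

```yaml
# Hypothetical .github/workflows/update-master.yml in the dataset A repository
name: Update master dataset

on:
  push:
    branches: [main]
    paths:
      - "processed_data/**"   # only run when the processed data changes

jobs:
  copy-to-master:
    runs-on: ubuntu-latest
    steps:
      - name: Check out dataset A
        uses: actions/checkout@v4

      - name: Check out the master repo
        uses: actions/checkout@v4
        with:
          repository: your-org/master-dataset        # placeholder repo name
          token: ${{ secrets.MASTER_REPO_TOKEN }}    # PAT with write access, stored as a secret
          path: master

      - name: Copy the processed data and push
        run: |
          cp processed_data/dataset_A.csv master/source_data/   # placeholder file names
          cd master
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add source_data/dataset_A.csv
          git commit -m "Update dataset A" || echo "Nothing to commit"
          git push
```

The same workflow, with the file names changed, could go into repositories B and C; the master repo could then re-run the merge in R and republish the summaries whenever it receives a push.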