Git repository structure (multiple datasets merged into one master dataset)

I have different datasets A, B, and C that were collected and processed separately but need to be merged into one master dataset. Each dataset will be updated at a different interval, and the master dataset will be updated accordingly. Contributions from people who are unaffiliated with the project, such as open-source contributors, will also be accepted.

My datasets are all in CSV files and all the data processing is done using R.

Currently I have separate git repositories for dataset A, B, and C with the following project folder structure:

source_data  
raw_data  
processed_data  
figure  
function  
markdown (Data processing RMD file for each dataset)  

Each dataset is then copied manually into the master repo, where the datasets are merged into a master dataset, data summaries are produced, and the results are published on the GitHub page.

From my Google search, I found the following options potentially available:

Which option do you think is best for me?
How would you best structure the Git repo in case of (a) or (b)?

VonC On

Normally, Git submodules remain a valid option: they allow your master dataset repository to reference a fixed version of each dataset repository.
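As a rough illustration of what "a fixed version" means here, the sketch below builds two throwaway local repositories and pins one inside the other as a submodule. The names `dataset-A`, `master`, and the `datasets/A` path are assumptions made for the example, not part of the question, and the local directories stand in for the real GitHub repositories. The `protocol.file.allow=always` setting is only there because the demo clones from a local path; it should not be needed with an HTTPS or SSH remote.

```shell
#!/bin/sh
set -eu

# Throwaway workspace so the sketch runs without network access.
work=$(mktemp -d)
cd "$work"

# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for one dataset repository (hypothetical name: dataset-A).
git init -q dataset-A
mkdir dataset-A/processed_data
echo "id,value" > dataset-A/processed_data/A.csv
git -C dataset-A add .
git -C dataset-A commit -q -m "Dataset A: first release"

# Master repository that references dataset-A as a submodule.
git init -q master
git -C master -c protocol.file.allow=always \
    submodule --quiet add "$work/dataset-A" datasets/A
git -C master commit -q -m "Pin dataset A"

# The master repo records one exact commit of dataset-A.
pinned=$(git -C master rev-parse HEAD:datasets/A)

# A later change in dataset-A does not move that pin by itself.
echo "1,42" >> dataset-A/processed_data/A.csv
git -C dataset-A commit -q -am "Dataset A: add row"
latest=$(git -C dataset-A rev-parse HEAD)

# Move the pin deliberately, then record the new version in master.
git -C master -c protocol.file.allow=always \
    submodule --quiet update --remote datasets/A
git -C master commit -q -am "Update dataset A"
updated=$(git -C master rev-parse HEAD:datasets/A)

echo "pinned:  $pinned"
echo "latest:  $latest"
echo "updated: $updated"
```

With this layout, the master repository only picks up a new dataset version when someone (or an automated job) runs the update step and commits the result, so every master commit stays tied to known dataset commits.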

Besides that, option d is the simplest: you could add a GitHub Action to each of your dataset repositories (but not the master one) that updates the master dataset after each push to that dataset repository.