Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element, as recommended by the doc:
dvc add data
I do this because, in my experience, DVC is kinda slow when tracking that many files one by one.
I clone the repo into another workspace, and now I have the data.dvc file locally but none of the actual files yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally?
What I have tried so far:

1. Adding the data folder again:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc add data

   The data.dvc file is built again from the local state of data, which only contains newfile.txt, so this doesn't work.

2. Adding the file as a single element in the data folder:

    dvc add data/newfile.txt

   I get:

    Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'. To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'

3. Using dvc commit as suggested:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc commit data.dvc

   Similarly to 1., the data.dvc file is rebuilt from the local state of data.
Interesting question. I think there is no easy way to do this right now, because if you run dvc add data again on this other machine with only one file in there, DVC will think you deleted all the other files, create a new cached version of the data dir (containing only the new file), and update the .dvc file accordingly (as you discovered).

You could open a feature request in https://github.com/iterative/dvc.org/issues.
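To make the "rebuilt from local state" behaviour concrete, here is a toy Python model of a directory snapshot. It is not DVC's actual code: build_manifest and demo are hypothetical names, and the md5/relpath entries only loosely mirror how a tracked directory is listed. The point is just that a snapshot built by scanning the workspace can only contain what is on disk, so a fresh clone holding a single new file yields a one-entry listing.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(directory: Path) -> list:
    """Toy stand-in for a directory snapshot: one entry per file found locally."""
    entries = []
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            relpath = path.relative_to(directory).as_posix()
            entries.append({"md5": digest, "relpath": relpath})
    return entries


def demo() -> tuple:
    """Compare the snapshot of a full workspace with that of a fresh clone."""
    with tempfile.TemporaryDirectory() as tmp:
        # Original workspace: the whole dataset is on disk (3 files here, not 100k).
        original = Path(tmp) / "original" / "data"
        original.mkdir(parents=True)
        for i in range(3):
            (original / f"file{i}.txt").write_text(f"content {i}")
        before = build_manifest(original)

        # Fresh clone: nothing was pulled, only the new file exists locally.
        clone = Path(tmp) / "clone" / "data"
        clone.mkdir(parents=True)
        (clone / "newfile.txt").write_text("new content")
        after = build_manifest(clone)

    return [e["relpath"] for e in before], [e["relpath"] for e in after]


if __name__ == "__main__":
    before, after = demo()
    print("tracked before:", before)
    print("tracked after re-adding in the clone:", after)
```

In this model the second listing contains only newfile.txt, which matches what happens in attempts 1 and 3 of the question: the previous entries are not merged in, they are simply absent from the new snapshot.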