Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element, both because the docs recommend it and because, in my experience, DVC is kinda slow when tracking that many files one by one:
    dvc add data
I clone the repo into another workspace, and now I have the data.dvc file locally but none of the actual files inside yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally?
What I have tried so far:

1. Adding the data folder again:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc add data

   The data.dvc file is built again from the local state of data, which only contains newfile.txt, so this doesn't work.

2. Adding the file as a single element in the data folder:

    dvc add data/newfile.txt

   I get:

    Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'. To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'

3. Using dvc commit, as suggested:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc commit data.dvc

   As in 1., data.dvc is rebuilt from the local state of data.
Interesting question. I think there is no easy way to do this right now, because on this other machine, if you run dvc add data again with only one file in there, DVC will think you deleted all the other files, create a new cached version of the data dir (containing only the new file), and update the .dvc file accordingly (as you discovered).

You could open a feature request at https://github.com/iterative/dvc.org/issues.
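
To make the reasoning above concrete, here is a rough sketch of why a directory tracked as a single element gets "rebuilt" from whatever is present locally. It is a simplified model and not DVC's actual implementation: it assumes a directory is identified by a checksum of its file manifest, so a folder holding only newfile.txt can never produce the entry for "old files plus newfile.txt". The dir_hash helper and the three stand-in files are hypothetical, and cksum is used only for portability.

```shell
#!/bin/sh
# Simplified model, NOT DVC's actual code: identify a directory by a checksum
# of its manifest (one "checksum  path" line per file, sorted by path).
set -eu

# dir_hash is a hypothetical helper for this sketch, not a DVC command.
dir_hash() {
    (
        cd "$1"
        find . -type f | LC_ALL=C sort | while IFS= read -r f; do
            printf '%s  %s\n' "$(cksum < "$f" | cut -d' ' -f1)" "$f"
        done
    ) | cksum | cut -d' ' -f1
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Original workspace: the full dataset (3 files stand in for 100k).
mkdir -p "$work/origin/data"
for i in 1 2 3; do
    echo "content $i" > "$work/origin/data/file$i.txt"
done
full_hash=$(dir_hash "$work/origin/data")

# Fresh clone without a pull: data/ contains only the new file.
mkdir -p "$work/clone/data"
echo "new content" > "$work/clone/data/newfile.txt"
partial_hash=$(dir_hash "$work/clone/data")

# What was actually wanted: the full dataset plus the new file.
cp "$work/clone/data/newfile.txt" "$work/origin/data/newfile.txt"
wanted_hash=$(dir_hash "$work/origin/data")

echo "full dataset:          $full_hash"
echo "clone (new file only): $partial_hash"
echo "full + new file:       $wanted_hash"
```

In this model the "clone" value is computed from a manifest with a single entry, so recording it replaces the old directory entry instead of extending it, which matches what happens in attempts 1 and 3.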