If I have different versions of a file, e.g., in different branches, and I try to reconcile them, git has great mechanisms for that. However, in order to do the reconciliation, e.g., in a merge, git requires access to the "inside" of the file. Thus files should be text files.
If I change a version-controlled file, git does not save the delta between the two versions, but saves an entire snapshot of the file. If one makes a change, even a small one, to a large file, the entire file will be stored twice by git. Thus files should be small.
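To make the snapshot point concrete, here is a minimal sketch of how a file's content maps to a Git object id, assuming a default SHA-1 repository. The function name `git_blob_id` and the sample data are made up for illustration; the header layout is the one `git hash-object` uses for blobs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id Git would assign to a blob with this content (SHA-1 repos)."""
    # Git hashes a small header ("blob <size>\0") followed by the full content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical "large" file and a one-byte edit of it.
original = b"x" * 1_000_000
edited = original[:-1] + b"y"

# The ids differ, so the two versions are two separate objects,
# each representing the whole file rather than a delta.
print(git_blob_id(original) != git_blob_id(edited))  # True
```

This only illustrates that every version of a file gets its own object; how Git compresses objects on disk afterwards is outside this sketch.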
Files that are either large or binary (or both) should not be tracked by git. If I still need them in my project, I should use something like DVC, git-annex, or git-lfs.
As far as I understand, all three of those keep such files outside of git and keep a reference to them, which is tracked by git. I will use DVC as a stand-in, as I know even less about the other two.
1. In DVC, the reference is a text file, and thus git will not get confused. However, since it is only a reference, there is not much merging to be done by git anyway. So git's reconciliation capabilities are not really required. What, then, is the advantage of using DVC in this respect? Can't I just use git and simply not use those mechanisms?
2. In DVC, it seems that if I change a large file, a snapshot of that file is created (not a delta), just like in git. So how does this improve the situation compared to git? I still get lots of (near) copies of this big file.
I understand that git-lfs keeps most of the (near) copies of my file in remote storage. Only when I check out the respective version of the large file is that file downloaded. In that case, while I would be correct about my point 2, at least it is only a "problem" for the server (in terms of space), not for my local disk space or my internet bandwidth usage. This might be the same for DVC.
Are my "objections" or "caveats" regarding points 1 and 2 valid?
It's more of a need than just an advantage.
DVC in particular is nice because you don't need special servers to use it, just configure any storage provider you already own (e.g. some SSH box or an S3 bucket).
Re 2. DVC also makes sure no files are duplicated in your storage, based on their content (great for datasets organized as multiple small files in a directory structure).
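The content-based deduplication idea can be sketched with a toy content-addressed store. This is not DVC's implementation: the class `ContentStore`, its `add` method, and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: one entry per distinct content."""

    def __init__(self) -> None:
        # Maps content hash -> stored bytes (a stand-in for files in a cache).
        self.objects = {}

    def add(self, data: bytes) -> str:
        """Store data under its hash and return the hash as a small pointer."""
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.objects.setdefault(digest, data)
        return digest


store = ContentStore()
v1 = store.add(b"a" * 1000)          # first version of a file
copy = store.add(b"a" * 1000)        # same content under another name
v2 = store.add(b"a" * 999 + b"b")    # slightly edited version

print(v1 == copy, len(store.objects))  # True 2
```

Exact duplicates collapse into one stored object, while an edited version is still stored in full. That is why the benefit is largest for datasets made of many small files: changing one file adds one small object instead of a new copy of the whole dataset.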