What is the advantage of DVC, git-annex, git-lfs for large or binary files over git?

Question


1.5k Views Asked by Make42 At 29 March 2022 at 13:52

If I have different versions of a file, e.g., in different branches, and I try to reconcile them, git has great mechanisms for that. However, in order to do the reconciliation, e.g., in a merge, git requires access to the "inside" of the file. Thus files should be text files.

If I change a version-controlled file, git does not save the delta between the two versions, but saves an entire snapshot of the file. If one makes a change, even a small one, to a large file, the entire file will be stored twice by git. Thus files should be small.

Files that are either large or binary (or both) should not be tracked by git. If I still need them in my project, I should use something like DVC, git-annex, or git-lfs.

As far as I understand, all three of those keep these other files outside of git and keep a reference to them, which is tracked by git. I will use DVC as a stand-in, as I know even less about the other two.

1. In DVC, the reference is a text file and thus git will not get confused. However, since it is only a reference, there is not much merging to be done by git anyway. So git's reconciliation capabilities are not really required. What is the advantage of using DVC then regarding this aspect? Can't I just use git and simply not use those mechanisms?
2. In DVC, it seems that if I change a large file, just like in git, a snapshot of that file is created (not a delta saved). So how does this improve the situation compared to git? I still get lots of (near) copies of this big file.
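To make the two points concrete, here is a minimal sketch of what such a reference could look like. The `make_pointer` helper and its `sha256:`/`size:` layout are assumptions made up for illustration, not the actual file format of DVC, git-annex, or git-lfs. It only shows that the reference stays tiny and line-based, while a one-byte change to the large file yields a completely different reference (and hence a second full copy in the external storage).

```python
import hashlib


def make_pointer(data: bytes) -> str:
    """Build a small text reference for `data` (hypothetical format)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"sha256: {digest}\nsize: {len(data)}\n"


# Two versions of a "large" binary file that differ in a single byte.
big_v1 = b"\x00" * 1_000_000
big_v2 = big_v1[:-1] + b"\x01"

ptr_v1 = make_pointer(big_v1)
ptr_v2 = make_pointer(big_v2)

# git would only track these short text files, not the megabyte payloads.
print(len(ptr_v1), len(ptr_v2))
# The references differ, so both payloads are kept as separate snapshots.
print(ptr_v1 != ptr_v2)
```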

I understand from here that git-lfs keeps most of the (near) copies of my file in the remote storage. Only if I check out the respective version of the large file is the file downloaded. In that case, while I would be correct about my point 2, at least it is only a "problem" for the server (in terms of space), but not for my local disk space and also not for my internet bandwidth usage. This might be the same for DVC.

Are my "objections" or "caveats" of the points 1 and 2 valid?


**Jorge Orpinel Pérez** · Answer 1 · 2022-03-29T21:16:14.103000

It's more of a need than just an advantage.

- Git is not meant to handle binary files in the first place: their contents do not necessarily change incrementally (as text/code does), so there is little to gain from "delta saving" either.
- While Git can technically handle arbitrarily large files, it will be very slow at indexing them.
- Git hosting services like GitHub do have file size limits (even with LFS).

DVC in particular is nice because you don't need special servers to use it, just configure any storage provider you already own (e.g. some SSH box or an S3 bucket).

Re 2: DVC also makes sure no files are duplicated in your storage, based on their content (great for datasets organized as many small files in a directory structure).
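As a rough illustration of content-based deduplication, here is a minimal sketch of a content-addressed store. The `store` and `count_objects` helpers, the SHA-256 hash, and the directory layout are assumptions chosen for the example, not DVC's actual implementation or API. Two versions of a small dataset are added, and any file whose bytes were already seen is not written again.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache: Path, data: bytes) -> str:
    """Write `data` under a name derived from its content; skip if present."""
    digest = hashlib.sha256(data).hexdigest()
    target = cache / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest


def count_objects(cache: Path) -> int:
    """Count the files physically present in the cache."""
    return sum(1 for p in cache.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    # Hypothetical dataset: v2 changes one file and adds a copy of another.
    dataset_v1 = {"a.bin": b"alpha", "b.bin": b"beta"}
    dataset_v2 = {"a.bin": b"alpha", "b.bin": b"beta-changed", "c.bin": b"alpha"}

    refs_v1 = {name: store(cache, blob) for name, blob in dataset_v1.items()}
    refs_v2 = {name: store(cache, blob) for name, blob in dataset_v2.items()}

    # Five files were added across both versions, but only distinct contents
    # occupy space in the cache.
    stored = count_objects(cache)
    print(stored)
```

Note that this deduplicates whole files only: a large file with a small change still produces a second full object, which is why the benefit shows up mainly for datasets made of many small files.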
