I want to make a git repository public, but without including all of the history in the public version. The git documentation seems to suggest git checkout --orphan as a potential solution to this:
This can be useful when you want to publish the tree from a commit without exposing its full history. You might want to do this to publish an open source branch of a project whose current tree is "clean", but whose full history contains proprietary or otherwise encumbered bits of code.
However, I don’t want the histories of the two repositories to diverge (how could they be kept in sync?). I also don’t want to throw away the history entirely and only keep the new public version.
One (the?) way to prune old data from the history without rewriting it is to use a shallow clone. However, it seems like I can’t really push a shallow repository to hosting services like GitLab or GitHub – see e. g. these existing questions:
- Remote rejected (shallow update not allowed) after changing Git remote URL
- Shallow update not allowed (git > 1.9)
- Setting an option in a remote git repository
- How to push a shallow clone to a new repo?
- Push git to remote (Github) only one last commit, without history?
Is there no way to make a public repository that starts with the same commit as an existing, private repository? It seems like a common problem!
Update to clarify questions from the comments: There seems to be some confusion about what the setup looks like. I tried not to be too specific in order to not restrict options for solutions, but perhaps that wasn’t helpful.
Current situation: There is a single private git repository whose history contains commits which cannot be made public (I don’t see why the specific reasons would matter, but imagine files with a proprietary license). The HEAD tree of the repository does not contain any such files and can be made public.
Goal: Ideally, there should continue to be only a single repository with the existing history, but the content of the “tainted” commits in the past should not be accessible in any publicly uploaded files. In other words, there would be two “copies” of the existing repository:
- “Private copy”: Same as before; not uploaded to any public web page.
- “Public copy”: The same as the private copy, including history, but the contents of all ancestors of a specific commit are omitted.
For the purposes of development, both repositories should be interchangeable: It should be possible to push commits from “private” to “public”, but also to pull from “public” to “private” (e.g. pull requests). The only reason I mentioned “keeping them in sync” above was because the git documentation seems to imply that git checkout --orphan is a solution to this problem, but as far as I can tell it would require some additional mechanisms for synchronization because it would, by definition, introduce two separate divergent histories. As such, it doesn’t seem like a solution to the problem.
If I understand, you have...
Your proposed solution is...
The Solution, As Required
This solution leaves you with a lot of complexity going forward.
Conflicts. So many conflicts.
Let's say we find a solution to maintain and sync your orphaned public repo with your complete private repo. Regardless of the technical details, this involves maintaining two long-running peer branches and merging them together periodically.
Because commits in the private repo might have non-public commits in their history, content from the private repo must be merged using a squash merge or cherry pick to avoid accidentally pulling in private history. This further complicates merging and further distorts the public history.
There are sure to be conflicts. Always and forever. In each direction. This is on top of the normal conflicts you'd experience. No tool can handle conflicts for you; a human must be involved. This means every PR potentially comes with an extra level of merge complexity, and bugs, forever.
Second class public citizens.
Because people on the public side can't see the private history, you'll have to maintain two separate projects with two separate sets of PRs. That's a bunch of overhead.
The people on the private side will have a full view of history and can see and work on all PRs, public and private.
Any changes from the private side have to come in as squash merges or cherry-picks; the people on the public side will always have a truncated and distorted history.
This puts anyone working on the public side at a disadvantage. Commits mirrored from the private side might reference commits and PRs they have no access to. The public side can't review, comment on, or even see the private PRs (without a system for scrubbing out any private bits); changes will just appear on the public side, changes they had no say in and never saw before.
Lava Flow Anti-Pattern
This is a variation on the Lava Flow Anti-Pattern. You have dead code from the past (the old private code); because it was not cleaned up, it continues to have negative consequences. None of the headaches and complexity I've described above is necessary for the current situation. It's all because of that old dead code.
For this reason I'm not going to get into a solution as specified. It's very complicated. You have to make an orphaned clone. Then when pulling changes from private those have to be squash merges, else you risk pulling in private history.
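Purely to illustrate the shape of that workflow (not as a recommendation), here is a minimal sketch in a throwaway repository. All branch and file names are hypothetical. It uses a cherry-pick rather than a squash merge, because merging across two unrelated roots would need --allow-unrelated-histories and would likely conflict on every shared file.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository; every name below is hypothetical.
work=$(mktemp -d)
cd "$work"
git init -q private
cd private
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"
git config user.name "Demo"

# Private history: a tainted commit, then a commit that removes the file.
echo "proprietary" > secret.txt
echo "v1" > app.txt
git add .
git commit -q -m "initial (tainted)"
git rm -q secret.txt
git commit -q -m "remove proprietary file"

# Public line: an orphan branch whose root commit is the current clean tree.
git checkout -q --orphan public
git commit -q -m "public root"

# New work happens on the private side.
git checkout -q main
echo "v2" > app.txt
git commit -q -am "private-side change"

# Carry only that one change over; its ancestors stay behind.
git checkout -q public
git cherry-pick main >/dev/null

# The two branches share no ancestor, so a plain merge has no merge base.
if git merge-base public main >/dev/null; then shared=yes; else shared=no; fi

public_commits=$(git rev-list --count public)
leaked=$(git log public --oneline -- secret.txt | wc -l | tr -d ' ')
echo "shared ancestor: $shared"
echo "public commits: $public_commits, commits touching secret.txt: $leaked"
```

Every change has to be carried across by hand like this, in both directions, which is where the ongoing cost comes from.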
Better Solution: Filter out the private parts.
Since the private content is all dead, and no further private content will be coming into the repo (if you need private content going forward, there are better options), put in the effort now to clean up your history and avoid the ongoing complexity outlined above.
There are many tools to help clean up history, primarily git-filter-repo and BFG Repo Cleaner. These can remove files and directories based on their size, content, filename, directory, etc. They can search and replace content and commit messages. This might take some time, depending on how complex your history is, but it is a one-time process.
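As a rough sketch of what one such pass might look like with git-filter-repo, the script below builds a small demo repository and strips a hypothetical vendor/ directory from every commit. The repository layout and path are made up; git-filter-repo is installed separately from git, so the script skips the rewrite if it is missing. On a real project you would run this on a fresh clone or a backup, since it rewrites history.

```shell
#!/bin/sh
set -eu

# Hypothetical demo repository; file and directory names are made up.
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir vendor
echo "proprietary" > vendor/blob.txt
echo "v1" > app.txt
git add .
git commit -q -m "initial commit with encumbered file"
git rm -q -r vendor
git commit -q -m "drop vendor directory"

# Count commits that touch the encumbered path before the rewrite.
before=$(git log --all --oneline -- vendor | wc -l | tr -d ' ')

# git-filter-repo is a separate tool; skip the rewrite if it is not installed.
if git filter-repo --version >/dev/null 2>&1; then
    # --invert-paths keeps everything EXCEPT the listed paths.
    # --force is needed here because this is not a fresh clone.
    git filter-repo --quiet --force --invert-paths --path vendor/
    status=rewritten
else
    status=skipped
fi

after=$(git log --all --oneline -- vendor | wc -l | tr -d ' ')
echo "$status: commits touching vendor/ before=$before after=$after"
```

Real histories usually need several passes like this, one per kind of private content.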
When you're done, you'll have a complete history of the public parts of your repository. Any commits which contained only private content will either be gone or empty (your choice, probably gone). Any commits which contained mixed public and private parts will still exist, but only with the public parts. The commit IDs will be different, but this is no worse than your proposed solution; any critical commits can be referenced by tag.
This is now the repository.
You can keep the original repository as a private, read-only archive if you need to reference the private parts, but it's dead code, so that should be infrequent and become less frequent going forward. If you find you need to add private content later, it can use the "public" content as a dependency and be added via configuration management and plugins.
And you're done. Your project does not have to worry about the public/private split going forward.
Hybrid Solution: Private read-only, incremental cleanup.
If there's no more private content coming in, there's no need to commit to the private repo. Drop that requirement and leave the private repo as a read-only archive to be referenced as needed.
Now only the public repo and project are changing. Everyone works on the public project and public repo. Everyone is able to see and comment on all future PRs. Since only one repo is changing, no complex repo syncing is necessary.
The disadvantage is "public" participants will never be able to see the full history. As the project moves forward there will be less and less need to reference the private repository and this scar will heal.
Meanwhile, you can incrementally work on cleaning up the history and grafting this clean history onto the new repository.
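One way the grafting step could be done is with git replace --graft, which gives an existing commit a different apparent parent without changing its ID. The sketch below is an assumption about how you might wire it up, not the only option; the repositories "cleaned" and "public" and their contents are hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Helper: create a repo with a committer identity (names are hypothetical).
new_repo() {
    git init -q "$1"
    git -C "$1" config user.email "demo@example.com"
    git -C "$1" config user.name "Demo"
}

# "cleaned": the scrubbed old history, produced separately.
new_repo cleaned
cd cleaned
echo "v0" > app.txt
git add .
git commit -q -m "old: first public-safe commit"
echo "v1" > app.txt
git commit -q -am "old: second public-safe commit"
cd ..

# "public": started from a single snapshot, then moved on.
new_repo public
cd public
echo "v1" > app.txt
git add .
git commit -q -m "public root (snapshot)"
echo "v2" > app.txt
git commit -q -am "new public work"

root=$(git rev-list --max-parents=0 HEAD)
before=$(git rev-list --count HEAD)

# Bring the cleaned commits in, then graft them underneath the public root.
git fetch -q ../cleaned HEAD
cleaned_tip=$(git rev-parse FETCH_HEAD)
git replace --graft "$root" "$cleaned_tip"

after=$(git rev-list --count HEAD)
echo "commits visible from HEAD: before=$before after=$after"
```

The graft is stored as a ref under refs/replace/, which is not fetched by default, so anyone who wants the extended history has to fetch those refs explicitly. Existing public commit IDs stay the same either way.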