I have a project that uses Jupyter Notebook files for data analysis and I keep output and certain pieces of metadata out of git with a clean/smudge filter. I will sometimes make adjustments to the filter and I want to have these changes automatically applied instead of having to ask my collaborators to run a git config command each time.
How can I configure the filter to run scripts that are tracked by the repository?
For the purpose of this question, let's suppose the clean command is the one from How to clear Jupyter Notebook's output and metadata when using git commit?:
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR
This is typically configured as...
git config filter.<filter-name>.clean "<command>"
...and used with a .gitattributes that includes the following.
*.ipynb filter=<filter-name>
Related questions
- .gitattributes smudge and clean filters as a part of the repository is close, but it is about setting up filters on git clone instead of tracking changes in the filter with the repository.
- How can I track system-specific config files in a repo/project? is a more general question about tracking any configuration file in a repository; it does not address the specifics of this question.
- What are the security risks with allowing smudge and clean filters to be configured with git clone is about setting up a filter on git clone, not about setting up a filter after cloning that uses a script in the repository.
To set up a smudge or clean filter in Git that references a script tracked by the repository, use a path to the script relative to the project's root directory (actually, relative to your .gitattributes, where the filter is declared).

For your specific scenario with Jupyter Notebook files, place the clean script in your repository, e.g., scripts/clean_jupyter.sh. That script (which should be executable: chmod +x scripts/clean_jupyter.sh) contains the jupyter nbconvert command you mentioned.

Then configure your .gitattributes file to use this filter for your Jupyter Notebook files (the *.ipynb filter=<filter-name> line from your question), and set up the clean filter in .git/config so that filter.<filter-name>.clean references your script.

To have these changes applied for all collaborators, you can include instructions in your project's README.md, or provide an initialization script that they run once to configure their local repository settings. Git does not apply filter configuration from a repository automatically, for security reasons, so this one-time manual setup is necessary.

To answer your question on my old answer about "How can I track system-specific config files in a repo/project?":
The smudge or clean filter is a script which is:
- either referenced by its path (git/git/t/chainlint/token-pasting.test),
- or found in the %PATH% or $PATH of your shell session, in which case you can reference it by its name only (no path).
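The three steps described in this answer (tracked script, .gitattributes entry, local filter configuration) can be sketched as a small shell script. This is only an illustration under a few assumptions: it works in a throwaway repository created with mktemp so it can be tried safely, and "nbclean" is a hypothetical filter name standing in for <filter-name>. The script path scripts/clean_jupyter.sh and the nbconvert command are the ones from the question and answer.

```shell
#!/bin/sh
# Sketch only: builds a throwaway repository so the steps can be tried safely.
# "nbclean" is a hypothetical filter name; substitute your own.
set -eu

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# 1. Track the clean script in the repository.
mkdir -p scripts
cat > scripts/clean_jupyter.sh <<'EOF'
#!/bin/sh
# Reads a notebook on stdin and writes the cleaned notebook to stdout.
exec jupyter nbconvert \
  --ClearOutputPreprocessor.enabled=True \
  --ClearMetadataPreprocessor.enabled=True \
  --to=notebook --stdin --stdout --log-level=ERROR
EOF
chmod +x scripts/clean_jupyter.sh

# 2. Assign the filter to notebooks (this file is tracked too).
echo '*.ipynb filter=nbclean' >> .gitattributes

# 3. One-time local step per clone: point the filter at the tracked script.
git config filter.nbclean.clean scripts/clean_jupyter.sh

# Show which filter Git would apply to a notebook path.
git check-attr filter -- example.ipynb
```

Since .git/config stores only the path to the script, later edits to scripts/clean_jupyter.sh should take effect for collaborators as soon as they pull, without another git config command. Only step 3 has to be repeated in each new clone.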