How to set up a smudge and clean filter so that it references a script tracked by the repository


I have a project that uses Jupyter Notebook files for data analysis and I keep output and certain pieces of metadata out of git with a clean/smudge filter. I will sometimes make adjustments to the filter and I want to have these changes automatically applied instead of having to ask my collaborators to run a git config command each time.

How can I configure the filter to run scripts that are tracked by the repository?

For the purpose of this question, let's suppose the clean command is the one from "How to clear Jupyter Notebook's output and metadata when using git commit?":

jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR

This is typically configured as...

git config filter.<filter-name>.clean "<command>"

...and used with a .gitattributes that includes the following.

*.ipynb filter=<filter-name>

VonC On BEST ANSWER

To set up a smudge or clean filter in Git that references a script tracked by the repository, point the filter at the script using a path relative to the project's root directory. Git runs filter commands from the top level of the working tree, so the path resolves from there rather than from the directory of the file being filtered.

+--------------------------+
| Git Repository           |
|                          |
| +----------------------+ |
| | .git                 | |
| +----------------------+ |
| +----------------------+ |
| | scripts              | |
| | +------------------+ | |
| | | clean_jupyter.sh | | | <--- The script for the clean filter
| | +------------------+ | |
| +----------------------+ |
| +----------------------+ |
| | .gitattributes       | | <--- Specifies the filter for .ipynb files
| +----------------------+ |
+--------------------------+

For your specific scenario with Jupyter Notebook files, you can set up the clean filter by placing the clean script in your repository, e.g., scripts/clean_jupyter.sh. That script (which should be executable: chmod +x scripts/clean_jupyter.sh) will include the command you mentioned:

#!/bin/bash
# Git passes the notebook on stdin and stages whatever this script writes to stdout.
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR

Configure your .gitattributes file to use this script for your Jupyter Notebook files:

echo "*.ipynb filter=jupyterClean" >> .gitattributes

# result in your .gitattributes:
*.ipynb filter=jupyterClean
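
To confirm that the attribute is picked up, git check-attr can report which filter applies to a given path. The following is a minimal sketch in a throwaway repository; the file name analysis.ipynb and the temporary directory are placeholders, not part of your project:

```shell
# Create a scratch repository so the check does not touch a real project.
demo=$(mktemp -d)
git init -q "$demo"

# Same attribute line as above, written into the scratch repository.
echo "*.ipynb filter=jupyterClean" >> "$demo/.gitattributes"

# Ask Git which filter applies to a (hypothetical) notebook path.
attr_output=$(git -C "$demo" check-attr filter -- analysis.ipynb)
echo "$attr_output"
# prints: analysis.ipynb: filter: jupyterClean
```

The path does not have to exist for check-attr to answer, so this works before any notebook is added.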

Set up the clean filter in .git/config to reference your script:

git config filter.jupyterClean.clean "scripts/clean_jupyter.sh"
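
If you want to check the wiring without Jupyter installed, the same setup can be tried in a throwaway repository with a stand-in filter. Everything named here (demoClean, scripts/clean_demo.sh, analysis.ipynb) is hypothetical, and the stand-in simply drops lines containing "output" instead of calling nbconvert:

```shell
# Scratch repository with a tracked-style script under scripts/.
demo=$(mktemp -d)
git init -q "$demo"
mkdir "$demo/scripts"

# Stand-in clean script: reads stdin, writes stdout, removes "output" lines.
printf '#!/bin/sh\ngrep -v output\n' > "$demo/scripts/clean_demo.sh"
chmod +x "$demo/scripts/clean_demo.sh"

# Attribute plus filter definition, using a path relative to the repository root.
echo "*.ipynb filter=demoClean" > "$demo/.gitattributes"
git -C "$demo" config filter.demoClean.clean "scripts/clean_demo.sh"

# Stage a fake notebook and inspect what actually landed in the index.
printf 'source line\noutput line\n' > "$demo/analysis.ipynb"
git -C "$demo" add analysis.ipynb
staged=$(git -C "$demo" show :analysis.ipynb)
echo "$staged"
# prints: source line
```

The working copy of analysis.ipynb still contains both lines; only the staged version has been cleaned, which is the behavior you want for notebook output.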

Git does not apply filter configuration from a repository automatically, for security reasons, so each collaborator has to perform this one-time setup manually. You can put the instructions in your project's README.md, or provide an initialization script that collaborators run once to configure their local repository. After that, because the configured command points at a tracked script, any adjustment you commit to scripts/clean_jupyter.sh takes effect for collaborators as soon as they pull it, with no further git config command.
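
Such an initialization script might look like the sketch below. The function name configure_jupyter_filter and the idea of saving it as, say, scripts/setup_filters.sh are assumptions for illustration; the filter name and script path are the ones used above:

```shell
#!/bin/sh
# Hypothetical one-time setup helper for collaborators.
# Usage: configure_jupyter_filter [path-to-repository]   (defaults to ".")
configure_jupyter_filter() {
    repo_dir="${1:-.}"
    # Writes to the repository's local .git/config; the script path is
    # relative to the top of the working tree.
    git -C "$repo_dir" config filter.jupyterClean.clean "scripts/clean_jupyter.sh"
}
```

A collaborator would source the file and call configure_jupyter_filter once after cloning. Running it again is harmless, since git config simply overwrites the same key.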


To answer your question on my old answer about "How can I track system-specific config files in a repo/project?":

You say "That way, the script (managed with Git) referenced by the smudge". How exactly do you reference a script managed by git in the smudge filter? I'm relatively new to scripting and git filters but I would really like to get this working in my project.

The smudge or clean filter is a script which is: