Simple command line JSON tool equivalent of nbstripout for Zeppelin notebooks


Some background

Versioning notebooks can become very inefficient if the output is expected to vary a lot. I solved this problem with my Jupyter notebooks using nbstripout, but so far I've found no alternative for Zeppelin notebooks.

Because nbstripout uses nbformat to parse ipynb files, patching it to support Zeppelin is not easy. On the other hand, the goal is not that complex: simply empty out every "msg": "..." field.

Goal

Given a JSON file, empty out all 'paragraphs.result.msg' fields.

Sample (schema):

{"paragraphs": [{"result": {"msg": "Very long output..."}}]}

3 Answers

Accepted answer

Git Filter

The best solution (thanks to @steven-penny) is to run this:

git config filter.znbstripout.clean "jq '.paragraphs[].result.msg = \"\"'"

This sets up a filter called znbstripout that invokes the jq tool. Note that the assignment also adds an empty result.msg to any paragraph that has no result yet. Then, in your .gitattributes file, add:

*.json filter=znbstripout
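If jq is not installed, the same clean filter could be approximated in Python. This is only a minimal sketch, not part of the original answer: the function name strip_msgs and the script name znbstripout.py are hypothetical, and it assumes the note follows the schema from the question. Unlike the jq assignment, it leaves paragraphs without a result untouched.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for the jq clean filter (not from the original answer)."""
import json
import sys


def strip_msgs(note):
    """Set every existing paragraphs[].result.msg to "" and return the note."""
    for paragraph in note.get("paragraphs", []):
        result = paragraph.get("result")
        # Skip paragraphs without a result object so no new keys are created.
        if isinstance(result, dict) and "msg" in result:
            result["msg"] = ""
    return note


def main():
    # A clean filter receives the file on stdin and writes the version to stage on stdout.
    raw = sys.stdin.read()
    if not raw.strip():
        # Pass empty input through unchanged instead of failing to parse it.
        sys.stdout.write(raw)
        return
    json.dump(strip_msgs(json.loads(raw)), sys.stdout, indent=2)
    sys.stdout.write("\n")


# Run only when something is piped in, as Git does for filters.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Assuming the script is saved somewhere as znbstripout.py (the path below is a placeholder), it could be registered in place of the jq command:

git config filter.znbstripout.clean "python3 /path/to/znbstripout.py"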

Python Script (usable with Git Hooks)

The following can be used as a git hook:

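#!/usr/bin/env python3

from glob import glob
import json

# Find every note.json below the current directory.
files = glob('**/note.json', recursive=True)
for file in files:
    with open(file, 'r') as fp:
        nb = json.load(fp)
    # Blank the output of every paragraph that has a result.
    for p in nb['paragraphs']:
        if 'result' in p:
            p['result']['msg'] = ""
    # Write the note back with sorted keys and fixed indentation.
    with open(file, 'w') as fp:
        json.dump(nb, fp, sort_keys=True, indent=2)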

Answer

jq can do this:

jq '.paragraphs[].result.msg = ""' file

http://stedolan.github.io/jq

Answer

In (1) and (2) below, I'll assume that the incoming JSON looks like this:

{
  "paragraphs": [
    {
      "result": {
        "msg": "msg1"
      }
    },
    {
      "result": {
        "msg": "msg2"
      }
    }
  ]
}

1. To set the .result.msg values to "":

.paragraphs[].result.msg = ""

2. To remove the .result.msg fields altogether:

del(.paragraphs[].result.msg)

3. To remove "msg" fields in all objects, wherever they occur:

walk(if type == "object" then del(.msg) else . end)

(If your jq does not have walk, google: jq faq walk)

4. To remove "msg" fields wherever they occur in a .result object in a .paragraphs array:

walk(if type == "object" and (.paragraphs|type) == "array"
     then del(.paragraphs[].result?.msg?) else . end)
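
For readers who would rather stay in Python, the behaviour of (2) and (3) could be approximated as follows. This is a rough sketch only: the function names delete_result_msgs and drop_key_everywhere are made up for illustration, and (3) is emulated with a plain recursive function rather than jq's walk.

```python
import json


def delete_result_msgs(note):
    """Rough equivalent of del(.paragraphs[].result.msg); modifies the note in place."""
    for paragraph in note.get("paragraphs", []):
        result = paragraph.get("result")
        if isinstance(result, dict):
            # pop with a default does nothing when the key is absent.
            result.pop("msg", None)
    return note


def drop_key_everywhere(value, key="msg"):
    """Rough equivalent of walk(if type == "object" then del(.msg) else . end).

    Returns a new structure and leaves the input untouched.
    """
    if isinstance(value, dict):
        return {k: drop_key_everywhere(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [drop_key_everywhere(item, key) for item in value]
    return value


sample = json.loads(
    '{"paragraphs": [{"result": {"msg": "msg1"}}, {"result": {"msg": "msg2"}}]}'
)
print(json.dumps(drop_key_everywhere(sample)))
# prints {"paragraphs": [{"result": {}}, {"result": {}}]}
```

The first function only looks at .paragraphs[].result, while the second removes the key at any depth, which mirrors the difference between (2) and (3).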