How to write a JSONL file for Doccano sequence to sequence

Doccano needs a JSONL file in the format shown below.

I can't get this to work with json.dumps, at least not directly. The output either lacks double quotes (which are required) or comes out in some other format that Doccano doesn't accept.

{"text": "EU rejects German call to boycott British lamb.", "label": [ [0, 2, "ORG"], ... ]}
{"text": "Peter Blackburn", "label": [ [0, 15, "PERSON"] ]}
{"text": "President Obama", "label": [ [10, 15, "PERSON"] ]}

Any tips?

There is 1 solution below

Here is what worked for me:

import json

# df is assumed to be an existing pandas DataFrame with
# NOTE_TEXT_CONCATINATED and NOTE_ID columns.
# The empty label lists could be pre-filled with whatever labels you need to show up.
notes = zip(
    df.NOTE_TEXT_CONCATINATED, [[] for _ in range(df.shape[0])], df.NOTE_ID
)

fname = "/Users/apwork/Downloads/test_json.jsonl"

keys = ["text", "label", "NOTE_ID"]  # "NOTE_ID" is for the Metadata field in Doccano.

# Write one JSON object per line; json.dump emits double-quoted JSON.
with open(fname, "w") as jsonfile:
    for row in notes:
        json.dump(dict(zip(keys, row)), jsonfile)
        jsonfile.write("\n")
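
If you don't have a DataFrame, here is a minimal sketch of the same idea using plain dicts. The helper name write_jsonl and the temp file name are just illustrative, not anything Doccano defines. The key point is to call json.dumps once per record and write each result on its own line. One possible cause of the missing double quotes is writing str(record) instead, which gives Python's single-quoted repr, and dumping the whole list at once gives a single JSON array rather than one object per line.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not a Doccano API)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Records taken from the question; each label is [start, end, tag].
records = [
    {"text": "Peter Blackburn", "label": [[0, 15, "PERSON"]]},
    {"text": "President Obama", "label": [[10, 15, "PERSON"]]},
]

path = os.path.join(tempfile.gettempdir(), "doccano_example.jsonl")
write_jsonl(records, path)

# Read the file back to check that every line parses as standalone JSON.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

parsed = [json.loads(line) for line in lines]
```

Reading the file back like this is a cheap way to catch formatting problems before importing into Doccano. You can also slice the text with each label's offsets, e.g. parsed[1]["text"][10:15], to confirm the spans point at the right characters.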