How to export "Document with entities from spaCy" for use in doccano


I want to train my model with doccano or another open-source text annotation tool and continuously improve it.

For that, my understanding is that I can import annotated data into doccano in the format described here: doccano import

So as a first step I have loaded a model and created a doc:

text = "Test text that should be annotated for Michael Schumacher" 
nlp = spacy.load('en_core_news_sm')
doc = nlp(text)

I know I can export the JSONL format (with text and annotated labels) from doccano and train a model with it, but I want to know how to export that data from a spaCy doc in Python so that I can import it into doccano.

Thanks in advance.

There are 4 answers below.

Accepted Answer

Doccano and/or spaCy seem to have changed things, and there are now some flaws in the originally accepted answer (below). This revised version should be correct with spaCy 3.1 and Doccano as of 8/1/2021...

import spacy

nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            # shift entity offsets so they are relative to the sentence, not the document
            labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
        djson.append({'text': sent.text, "label": labels})  # note the singular "label" key
    return djson

The differences:

  1. labels becomes the singular label in the JSON (?!?)
  2. e.start_char and e.end_char are offsets within the whole document, not within the sentence, so when emitting one entry per sentence you have to shift them by the sentence's own start offset (sent.start_char).
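
Putting it together, a minimal sketch of writing the result to a doccano-importable .jsonl file (the file name doccano_import.jsonl is my own choice; doccano expects one JSON object per line):

import json

djson = text_to_doccano("Test text that should be annotated for Michael Schumacher.")
with open("doccano_import.jsonl", "w") as f:
    # one JSON object per line (JSONL), not a single JSON array
    f.write("\n".join(json.dumps(entry) for entry in djson))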
Answer

I used the Doccano annotation tool to generate annotations, exported the .jsonl file from Doccano, and converted it to the .spacy training format using the following customized code.

Steps to follow:

Step 1: Use the Doccano tool to annotate your data.

Step 2: Export the annotation file from Doccano; it is in .jsonl format.

Step 3: Pass that .jsonl file to the fillterDoccanoData("./root.jsonl") function in the code below. In my case the file is root.jsonl; use your own file name.

Step 4: Use the following code to convert your .jsonl file to a .spacy training file.

Step 5: You will find train.spacy in your working directory as the result.

Thanks
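
For reference, one line of the exported .jsonl might look roughly like this (a sketch only; the exact keys vary between Doccano versions, but it matches the 'data' and 'label' keys the code below reads):

{"id": 1, "data": "Test text that should be annotated for Michael Schumacher", "label": [[39, 57, "PERSON"]]}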

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import logging
import json

# filter the Doccano export and convert it to spaCy's training tuple format
def fillterDoccanoData(doccano_JSONL_FilePath):
    try:
        training_data = []
        with open(doccano_JSONL_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['data']  # depending on your Doccano version, this key may be 'text' instead
            entities = data['label']
            if len(entities) > 0:
                training_data.append((text, {"entities": entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + doccano_JSONL_FilePath + "\n" + "error = " + str(e))
        return None

# read the Doccano annotation file (.jsonl)
TRAIN_DATA = fillterDoccanoData("./root.jsonl")  # root.jsonl is the annotation file name

nlp = spacy.blank("en")  # create a new blank spaCy pipeline
db = DocBin()  # create a DocBin object
for text, annot in tqdm(TRAIN_DATA):  # data in the previous format
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # character offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity:", text[start:end])
        else:
            ents.append(span)
    try:
        doc.ents = ents  # label the text with the entities
        db.add(doc)
    except ValueError:  # e.g. overlapping entity spans
        print(text, annot)
db.to_disk("./train.spacy")  # save the DocBin object
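
Once train.spacy exists, you can train with spaCy 3's CLI. A minimal sketch, assuming you generate a config with init config first (here train.spacy doubles as the dev set purely for illustration; use a separate dev file in practice):

python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./train.spacy --output ./output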
Answer

I had a similar task recently; here is how I did it:

import spacy
nlp = spacy.load('en_core_web_sm')  # the English small model is en_core_web_sm

def text_to_doccano(text):
    """
    :text (str): source text
Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char, e.end_char, e.label_])
        djson.append({'text': sent.text, "labels": labels})
    return djson

Based on your example ...

text = "Test text that should be annotated for Michael Schumacher."
djson = text_to_doccano(text)
print(djson)

... would print out:

[{'text': 'Test text that should be annotated for Michael Schumacher.', 'labels': [[39, 57, 'PERSON']]}]

On a related note, when you save the results to a file, the standard json.dump approach won't work, as it would write the list as a single JSON array with entries separated by commas. AFAIK, doccano expects one entry per line and no trailing commas. The following snippet resolves this and works like a charm:

import json

with open(filepath, 'w') as f:
    f.write("\n".join([json.dumps(e) for e in djson]))

/Cheers

Answer

spaCy doesn't support this exact format out of the box, but you should be able to write a custom function fairly easily. Take a look at spacy.gold.docs_to_json(), which performs a similar conversion to JSON.
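
For instance, a minimal sketch with spaCy v2's API (in spaCy v3 the function moved to spacy.training); the output is spaCy's own JSON training format, which you would then remap to doccano's text/label fields:

import spacy
from spacy.gold import docs_to_json  # spaCy v2; in v3: from spacy.training import docs_to_json

nlp = spacy.load('en_core_web_sm')
doc = nlp("Test text that should be annotated for Michael Schumacher")
print(docs_to_json([doc]))  # inspect the structure, then remap it to doccano's format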