How to export "Document with entities from spaCy" for use in doccano


I want to train my model with doccano or another open-source text annotation tool and continuously improve it.

For that, my understanding is that I can import annotated data into doccano in the format described here: doccano import

So as a first step I have loaded a model and created a doc:

text = "Test text that should be annotated for Michael Schumacher" 
nlp = spacy.load('en_core_news_sm')
doc = nlp(text)

I know I can export the JSONL format (with text and annotated labels) from doccano and train a model with it, but I want to know how to export that data from a spaCy doc in Python so that I can import it into doccano.

Thanks in advance.

There are 4 answers below.

Accepted Answer

Doccano and/or spaCy seem to have changed things, and there are now some flaws in the originally accepted answer (below). This revised version should be correct with spaCy 3.1 and Doccano as of 8/1/2021...

import spacy

nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            # shift entity offsets so they are relative to the sentence, not the document
            labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
        djson.append({'text': sent.text, "label": labels})  # note the singular "label" key
    return djson

The differences:

  1. labels becomes the singular label in the JSON (?!?)
  2. e.start_char and e.end_char are offsets within the whole document, not within the sentence, so when emitting one entry per sentence you have to shift them by the sentence's own start offset (sent.start_char).
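
Putting it together, a minimal sketch of writing the result to a doccano-importable .jsonl file (the file name doccano_import.jsonl is my own choice; doccano expects one JSON object per line):

import json

djson = text_to_doccano("Test text that should be annotated for Michael Schumacher.")
with open("doccano_import.jsonl", "w") as f:
    # one JSON object per line (JSONL), not a single JSON array
    f.write("\n".join(json.dumps(entry) for entry in djson))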
Answer

I used the Doccano annotation tool to generate annotations, exported the .jsonl file from Doccano, and converted it to the .spacy training format using the following customized code.

Steps to follow:

Step 1: Use the Doccano tool to annotate your data.

Step 2: Export the annotation file from Doccano; it is in .jsonl format.

Step 3: Pass that .jsonl file to the fillterDoccanoData("./root.jsonl") function in the code below. In my case the file is root.jsonl; use your own file name.

Step 4: Use the following code to convert your .jsonl file to a .spacy training file.

Step 5: You will find train.spacy in your working directory as the result.

Thanks
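
For reference, one line of the exported .jsonl might look roughly like this (a sketch only; the exact keys vary between Doccano versions, but it matches the 'data' and 'label' keys the code below reads):

{"id": 1, "data": "Test text that should be annotated for Michael Schumacher", "label": [[39, 57, "PERSON"]]}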

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import logging
import json

# filter the Doccano export and convert it to spaCy's training tuple format
def fillterDoccanoData(doccano_JSONL_FilePath):
    try:
        training_data = []
        with open(doccano_JSONL_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['data']  # depending on your Doccano version, this key may be 'text' instead
            entities = data['label']
            if len(entities) > 0:
                training_data.append((text, {"entities": entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + doccano_JSONL_FilePath + "\n" + "error = " + str(e))
        return None

# read the Doccano annotation file (.jsonl)
TRAIN_DATA = fillterDoccanoData("./root.jsonl")  # root.jsonl is the annotation file name

nlp = spacy.blank("en")  # create a new blank spaCy pipeline
db = DocBin()  # create a DocBin object
for text, annot in tqdm(TRAIN_DATA):  # data in the previous format
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # character offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity:", text[start:end])
        else:
            ents.append(span)
    try:
        doc.ents = ents  # label the text with the entities
        db.add(doc)
    except ValueError:  # e.g. overlapping entity spans
        print(text, annot)
db.to_disk("./train.spacy")  # save the DocBin object
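
Once train.spacy exists, you can train with spaCy 3's CLI. A minimal sketch, assuming you generate a config with init config first (here train.spacy doubles as the dev set purely for illustration; use a separate dev file in practice):

python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./train.spacy --output ./output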
Answer

I had a similar task recently; here is how I did it:

import spacy
nlp = spacy.load('en_core_web_sm')  # the English small model is en_core_web_sm

def text_to_doccano(text):
    """
    :text (str): source text
Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char, e.end_char, e.label_])
        djson.append({'text': sent.text, "labels": labels})
    return djson

Based on your example ...

text = "Test text that should be annotated for Michael Schumacher."
djson = text_to_doccano(text)
print(djson)

... would print out:

[{'text': 'Test text that should be annotated for Michael Schumacher.', 'labels': [[39, 57, 'PERSON']]}]

On a related note, when you save the results to a file, the standard json.dump approach won't work, as it would write the list as a single JSON array with entries separated by commas. AFAIK, doccano expects one entry per line and no trailing commas. The following snippet resolves this and works like a charm:

import json

with open(filepath, 'w') as f:
    f.write("\n".join([json.dumps(e) for e in djson]))

/Cheers

Answer

spaCy doesn't support this exact format out of the box, but you should be able to write a custom function fairly easily. Take a look at spacy.gold.docs_to_json(), which performs a similar conversion to JSON.
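
For instance, a minimal sketch with spaCy v2's API (in spaCy v3 the function moved to spacy.training); the output is spaCy's own JSON training format, which you would then remap to doccano's text/label fields:

import spacy
from spacy.gold import docs_to_json  # spaCy v2; in v3: from spacy.training import docs_to_json

nlp = spacy.load('en_core_web_sm')
doc = nlp("Test text that should be annotated for Michael Schumacher")
print(docs_to_json([doc]))  # inspect the structure, then remap it to doccano's format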