Importing NER JSON into spaCy

I am trying to import my NER data into spaCy. I have it in a JSON format, which I made thinking this was what spaCy required. My goal is to train an entity recognition system that recognises a custom set of entities. Below is a small example of the data I am using.

test =[{'datapoint_id': 0,
  'text': 'westleigh lodge care home, nel pan lane, leigh (wn7 5jt)',
  'entities': [[0, 25, 'building_name'],
   [27, 39, 'street_name'],
   [41, 46, 'city'],
   [48, 55, 'postcode']]},
{'datapoint_id': 1,
  'text': 'land at 2a gerard street, ashton in makerfield, wigan (wn4 9aa)',
  'entities': [(0, 4, 'unit_type'),
   (11, 24, 'street_name'),
   (48, 53, 'city'),
   (55, 62, 'postcode')]},
 {'datapoint_id': 2,
  'text': 'unit 111, timber wharf, worsley street, manchester (m15 4nz)',
  'entities': [(0, 4, 'unit_type'),
   (5, 8, 'unit_id'),
   (10, 23, 'building_name'),
   (24, 38, 'street_name'),
   (40, 50, 'city'),
   (52, 59, 'postcode')]}]

The spaCy introduction course (https://course.spacy.io/en) shows how to turn individual examples into spaCy's format, but not how to handle large amounts of data.

Calling nlp(test[0]) raises an "Expected a string or 'Doc' as input, but got: <class 'dict'>." error.

Calling !python -m spacy convert "/home/test.json" "/home/ouput/" produces a "KeyError: 'paragraphs'" error.

I can do

from spacy.training import Example

doc = nlp(test[0]['text'])
gold_dict = test[0]['entities']
example = Example.from_dict(doc, {'entities': gold_dict})

example

and loop through creating a list of examples, but it doesn't seem like the right approach and I am not sure what to do with the list of results.

I tried doing

import spacy
from spacy.tokens import DocBin
from spacy.training import JsonlCorpus

corpus = JsonlCorpus("/home/test.json")
nlp = spacy.blank("en")
data = corpus(nlp)
doc_bin = DocBin(docs=data)

But this did not seem to load or save the dataset.

Loading data must be very straightforward, but I can't manage it; any help is appreciated.

Best answer:

For training data, spaCy just requires Docs annotated the way you want the output to look, saved in a DocBin. So for your case, looping through your data and creating Docs is the right thing to do. You can do that with your Example-creating code and pull out the example.reference Doc (an Example is basically just two Docs, one annotated and one not), though it's not the only way.
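For instance, here is a minimal sketch of that approach applied to your data, assuming your list is named test as in your question:

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.blank("en")
db = DocBin()
for record in test:
    doc = nlp(record["text"])
    # build an Example from the raw Doc plus the gold entity offsets
    example = Example.from_dict(doc, {"entities": record["entities"]})
    # the reference Doc carries the gold annotations
    db.add(example.reference)
db.to_disk("./train.spacy")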

See the sample code in the training data section of the docs. It's not exactly the same format as your data, but it's very similar.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None if the offsets don't line up with
        # token boundaries, so skip those rather than putting None
        # into doc.ents
        span = doc.char_span(start, end, label=label)
        if span is None:
            print(f"Skipping misaligned entity in: {text!r}")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
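
Once saved, that .spacy file is what you pass to training. Assuming you generate a config first and have a dev set saved the same way (./dev.spacy here), the commands look something like:

python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy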