I have around 7,000 sentences on which I have run a refined named-entity recognition (i.e., for specific entities) using spaCy. Now I want to do relationship extraction (basically causal inference), and I do not know how to use the NER output to build a training set.

As far as I have read, there are different approaches to relationship extraction:

  • 1) Handwritten patterns
  • 2) Supervised machine learning
  • 3) Semi-supervised machine learning

Since I want to use supervised machine learning, I need training data.

It would be nice if anyone could give me some direction; many thanks. Here is a screenshot of my data frame; the entities are provided by a customised spaCy model. I have access to the syntactic dependencies and part-of-speech tags of each sentence, as given by spaCy:

[Image: screenshot of the data frame]


There are 1 best solutions below


It seems that your dataset is some kind of very well-structured technical writing, so part-of-speech tags may be enough to do the extraction you want.

I would recommend reading the paper "Identifying Relations for Open Information Extraction" and understanding the POS-tag-based pattern it uses.

The code below tags a sentence with part-of-speech tags and then looks for sequences that match the so-called ReVerb pattern.

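import nltk

# Note: nltk.tag.pos_tag needs NLTK's English POS tagger model,
# which has to be downloaded once beforehand via nltk.download().

# ReVerb-style pattern: a verb group, optionally followed by words
# and ending in a preposition.
verb = "<ADV>*<AUX>*<VBN><IN|PART>*<ADV>*"
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP|ADJ>"

rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)

sent = "where the equation caused by the eccentricity is maximum"
# By default nltk.tag.pos_tag emits Penn Treebank tags (VBN, IN, ...), so the
# universal-style tags in the pattern (ADV, AUX, NOUN, ...) only match when
# the tags come from a tagger that emits them, such as spaCy.
sent_pos_tags = nltk.tag.pos_tag(sent.split())

# Walk the chunked tree and print every subtree labelled REL_PHRASE.
for x in reverb_pattern.parse(sent_pos_tags):
    if isinstance(x, nltk.Tree) and x.label() == 'REL_PHRASE':
        rel_phrase = " ".join(t[0] for t in x.leaves())
        print(rel_phrase)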

One bit is missing: finding the closest noun phrases to the left and right of the matched pattern, but I leave that as an exercise. I also wrote a blog post with a more detailed example. I hope it helps.
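For readers who want a starting point for that exercise, here is a minimal sketch of one possible way to pick the nearest noun phrases around a relation phrase. It is not taken from the paper or the blog post. The helpers `noun_phrases` and `closest_arguments` are hypothetical names, the "noun phrase" is simplified to a run of determiner/adjective/noun tags that contains at least one noun, and the Penn Treebank tags are written by hand for illustration, so a real tagger may label the tokens differently.

```python
# Hypothetical helpers for illustration; they are not part of NLTK or spaCy.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}  # tags allowed inside a simple noun phrase
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}            # a phrase must contain one of these


def noun_phrases(tagged):
    """Return (start, end, text) for each maximal run of NP_TAGS containing a noun."""
    phrases, start = [], None
    # The sentinel entry closes a run that reaches the end of the sentence.
    for i, (_, tag) in enumerate(tagged + [("", "END")]):
        if tag in NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            run = tagged[start:i]
            if any(t in NOUN_TAGS for _, t in run):
                phrases.append((start, i, " ".join(w for w, _ in run)))
            start = None
    return phrases


def closest_arguments(tagged, rel_start, rel_end):
    """Nearest noun phrase ending at or before rel_start, and nearest starting at or after rel_end."""
    nps = noun_phrases(tagged)
    left = [np for np in nps if np[1] <= rel_start]
    right = [np for np in nps if np[0] >= rel_end]
    return (left[-1][2] if left else None, right[0][2] if right else None)


# Hand-written Penn Treebank tags (an assumption; a real tagger may differ).
tagged = [("where", "WRB"), ("the", "DT"), ("equation", "NN"), ("caused", "VBN"),
          ("by", "IN"), ("the", "DT"), ("eccentricity", "NN"), ("is", "VBZ"),
          ("maximum", "JJ")]

# "caused by" sits at token positions 3 and 4, so the relation span is [3, 5).
triple_args = closest_arguments(tagged, 3, 5)
print(triple_args)
```

With the hand-tagged sentence above, the relation phrase "caused by" gets "the equation" as its left argument and "the eccentricity" as its right argument. Triples of this form, filtered to those whose arguments overlap your recognised entities, could then be reviewed and labelled to serve as training examples.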