Extract relationship concepts from sentences

Is there an existing model, or how could I train one, that takes a sentence involving two subjects like:

[Meiosis] is a type of [cell division]...

and decides if one is the child or parent concept of the other? In this case, cell division is the parent of meiosis.


Are the subjects already identified, i.e., do you know beforehand, for each sentence, which words or sequences of words represent the subjects? If you do, I think what you are looking for is relationship extraction.

Unsupervised approach

A simple unsupervised approach is to look for patterns over part-of-speech (PoS) tags.

First you tokenize each sentence and get its PoS tags:

import nltk

# Needs the NLTK tokenizer and tagger data, installed via nltk.download(...);
# the exact resource names depend on your NLTK version.
sentence = "Meiosis is a type of cell division."
tokens = nltk.word_tokenize(sentence)
tokens
['Meiosis', 'is', 'a', 'type', 'of', 'cell', 'division', '.']

token_pos = nltk.pos_tag(tokens)
token_pos
[('Meiosis', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('type', 'NN'), ('of', 'IN'),
 ('cell', 'NN'), ('division', 'NN'), ('.', '.')]

Then you build a chunk parser for a specific PoS-tag pattern, one that typically mediates a relationship between two subjects/entities/nouns:

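# Zero or more verb tags followed by zero or more adverb tags
verb = "<VB|VBD|VBG|VBN|VBP|VBZ>*<RB|RBR|RBS>*"
# Tags allowed between the verb and the preposition
word = "<NN|NNS|NNP|NNPS|JJ|JJR|JJS|RB|WP>"
preposition = "<IN>"
# verb | verb preposition | verb word* preposition, repeated one or more times
rel_pattern = "({}|{}{}|{}{}*{})+".format(verb, verb, preposition, verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
reverb_pattern = nltk.RegexpParser(grammar_long)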

NOTE: This pattern is based on the ReVerb paper: http://www.aclweb.org/anthology/D11-1142

You can then apply the parser to all the tokens/PoS-tags except the ones that are part of the subjects/entities, which here is the span between the two subjects:

reverb_pattern.parse(token_pos[1:5])
Tree('S', [Tree('REL_PHRASE', [('is', 'VBZ')]), ('a', 'DT'), ('type', 'NN'), ('of', 'IN')])
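The slice above is hard-coded for this one sentence. If your sentences mark the two subjects with square brackets, as in the question, a small helper can pull out the text between them. This is only a sketch: the name split_on_subjects and the bracket convention are assumptions for illustration, not part of NLTK.

```python
import re

def split_on_subjects(annotated):
    """Return (first_subject, between_text, second_subject) for a sentence
    whose two subjects are marked with square brackets (assumed convention)."""
    match = re.search(r"\[([^\]]+)\](.*?)\[([^\]]+)\]", annotated)
    if match is None:
        raise ValueError("expected two [bracketed] subjects")
    first, between, second = match.groups()
    return first.strip(), between.strip(), second.strip()

first, between, second = split_on_subjects("[Meiosis] is a type of [cell division].")
print(first, "|", between, "|", second)
```

The between text can then be tokenized and tagged before it is handed to the parser. Note that tagging the fragment on its own may give different tags than tagging the full sentence and slicing, so the latter is probably safer when you have token offsets.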

If the parser outputs a REL_PHRASE, then there is a relationship between the two subjects. You then need to analyse all these patterns and decide which ones represent a parent-of relationship. One way to achieve that is by clustering them, for instance.
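As a rough illustration of the clustering idea, here is a minimal sketch that groups relation phrases by word overlap. The choice of Jaccard similarity, the single-pass greedy grouping, the 0.5 threshold, and the example phrases are all assumptions made for this sketch; any clustering method over any phrase representation could be used instead.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_phrases(phrases, threshold=0.5):
    """Greedy one-pass clustering: a phrase joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []
    for phrase in phrases:
        tokens = phrase.lower().split()
        for cluster in clusters:
            if jaccard(tokens, cluster[0].lower().split()) >= threshold:
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])
    return clusters

# Made-up relation phrases, as they might come out of the extraction step
phrases = ["is a type of", "is a kind of", "is a form of",
           "is located in", "is found in"]
for cluster in cluster_phrases(phrases):
    print(cluster)
```

You would still have to inspect the resulting groups by hand and decide which of them express a parent-of relationship.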

Supervised approach

If your sentences are already tagged with the subjects/entities and with the type of relationship, i.e., a supervised scenario, then you can build a model whose features are the words between the two subjects/entities and whose label is the type of relationship.

sent: "[Meiosis] is a type of [cell division]."
label: parent of

You can build a vector representation of "is a type of" and train a classifier to predict the label "parent of". You will need many examples for this; how many also depends on how many different classes/labels you have.
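To make this concrete, here is a minimal sketch of such a classifier in plain Python: a bag-of-words multinomial Naive Bayes over the words between the two subjects. Naive Bayes is my choice for the sketch, not something the approach requires, and the training pairs and the "other" label are invented for illustration. With real data you would need far more examples.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (words_between_subjects, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set; "other" stands for any non parent-of relation
training = [
    ("is a type of", "parent of"),
    ("is a kind of", "parent of"),
    ("is a form of", "parent of"),
    ("is located in", "other"),
    ("was discovered by", "other"),
    ("is found in", "other"),
]
model = train(training)
print(predict(model, "is a subtype of"))
```

In practice a library classifier with richer features (n-grams, PoS tags, the subjects themselves) would likely do better than this word-count baseline.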