Return list of sentences with a particular subject


I am exploring a small corpus of texts, and one of the things I am doing is examining the actions associated with various subjects. I have already inventoried how many times, for example, "man" is the subject of a sentence in which the verb is "love": that work was done with subject-verb-object triplets using Textacy.
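For context, the counting step looked roughly like the sketch below. This is not my exact script: the texts list is a stand-in for the corpus, and I lemmatize the triple parts so that "loves" and "loved" both count toward "love".

import spacy
import textacy.extract
from collections import Counter

nlp = spacy.load("en_core_web_sm")
texts = ["A man walked down the street.", "The man loves his dog."]  # stand-in corpus

pair_counts = Counter()
for doc in nlp.pipe(texts):
    # Each triple unpacks into subject, verb, and object parts
    # (spans or token lists, depending on the textacy version).
    for subj_part, verb_part, obj_part in textacy.extract.subject_verb_object_triples(doc):
        subj = " ".join(tok.lemma_.lower() for tok in subj_part)
        verb = " ".join(tok.lemma_.lower() for tok in verb_part)
        pair_counts[(subj, verb)] += 1

print(pair_counts[("man", "love")])  # how often "man" is the subject of "love"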

As I work through the various statistics, I would like to be able to go back into the data and see the sentences with those subjects in their original context. NLTK has a concordance feature built right in, but it does not pay attention to part-of-speech tagging.
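To make that limitation concrete, the built-in concordance is plain string matching; a minimal sketch (with a stand-in string for the real corpus):

from nltk.text import Text
from nltk.tokenize import word_tokenize

corpus = "A man walked down the street. The man comes around."  # stand-in for the real corpus
tokens = word_tokenize(corpus)
Text(tokens).concordance("man")  # prints every occurrence of "man", subject or not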

What I am trying to do is write something like find_the_subject("noun", corpus), such that if I input "man" I get back a list of sentences in which "man" is the subject:

A man walked down the street and said why am I short in the middle?

The man comes around.

So far, here is the code I have. It grabs all the sentences containing "man", not just the ones where "man" is the subject.

import nltk
from nltk.tokenize import word_tokenize

def find_sentences_with_noun(subject_noun, sentences):
    # Start with two empty lists
    noun_subjects = []
    noun_sentences = []
    # Work through the sentences
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged_words = nltk.tag.pos_tag(words)
        # This works but doesn't get me the subject:
        # it matches any NN* occurrence of the word, wherever it appears
        for word, tag in tagged_words:
            if "NN" in tag and word == subject_noun:
                noun_subjects.append(word)
                noun_sentences.append(sentence)
                break  # don't append the same sentence more than once
    return noun_sentences

I cannot for the life of me figure out how to grab the noun in the subject position.

1 Answer

John Laudun

I believe I have a working solution to the problem. I don't know how efficient or scalable it is, but it works, and if it's useful to someone else, great. And if there is room for improvement, I'm always happy to learn.

In my case, I found it easier to work with spaCy than to stay with NLTK. YMMV.

import spacy
# from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# `texts` is the list of raw document strings from the corpus
docs = list(nlp.pipe(texts))

def find_subject(subject, docs):
    subject_sents = []
    for doc in docs:
        for sentence in doc.sents:
            # The root of the sentence is (usually) the main verb;
            # its nsubj child is the grammatical subject.
            root_token = sentence.root
            for child in root_token.children:
                if child.dep_ == 'nsubj' and child.text == subject:
                    subject_sents.append(sentence)
    return subject_sents

Usage is quite simple: find_subject("dog", docs)
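The returned items are spaCy Span objects, so if you want the sentences in their original form you can print their text, e.g.:

man_sentences = find_subject("man", docs)
for sent in man_sentences:
    print(sent.text)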