I'm playing with the Brown corpus, specifically the tagged sentences in the "news" category. I've found that "to" is the word with the most distinct tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of the tags; each one just needs to contain "to" with one of the associated POS tags.
    import nltk

    brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
    for sent in brown_sents:
        for word, tag in sent:
            if (word == 'to' and tag == "IN"):
                print(sent)
I tried the above code with just one of the POS tags to see if it worked, but it prints every matching sentence. I need it to print just the first sentence that matches the word and tag, and then stop. I tried this:
    for sent in brown_sents:
        for word, tag in sent:
            if (word == 'to' and tag == 'IN'):
                print(sent)
            if (word != 'to' and tag != 'IN'):
                break
This works with this pos-tag because it's the first one related to "to", but if I use:
    for sent in brown_sents:
        for word, tag in sent:
            if (word == 'to' and tag == 'TO-HL'):
                print(sent)
            if (word != 'to' and tag != 'TO-HL'):
                break
It returns nothing. I think I am SO close -- care to help?
You can keep building on your current code, but it does not handle a few cases, such as "to" appearing more than once in the same sentence.
If you want to stick with your code, try this:
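A minimal sketch of the "first match, then stop" idea, assuming the tagged sentences are lists of (word, tag) tuples as returned by `tagged_sents`. The helper name `first_sentence_with` and the tiny sample data are made up for illustration, so the snippet runs without the corpus installed. Putting the loops in a function and using `return` exits both loops at once, which a single `break` cannot do.

```python
def first_sentence_with(tagged_sents, target_word, target_tag):
    """Return the first sentence containing (target_word, target_tag), or None."""
    for sent in tagged_sents:
        for word, tag in sent:
            if word == target_word and tag == target_tag:
                return sent  # leaves both loops as soon as a match is found
    return None


# Hand-made sample in the same shape as Brown tagged sentences
# (hypothetical data, not copied from the corpus).
sample_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB")],
    [("They", "PPSS"), ("drove", "VBD"), ("to", "IN"), ("work", "NN")],
]

# One sentence per tag; tags with no match are skipped.
for tag in ["TO", "IN", "TO-HL"]:
    sent = first_sentence_with(sample_sents, "to", tag)
    if sent is not None:
        print(tag, sent)
```

With NLTK and the Brown data available, passing `nltk.corpus.brown.tagged_sents(categories="news")` in place of `sample_sents` should work the same way.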
I suggest storing the sentences in a defaultdict(list) so that you can retrieve them at any time.
To access the sentences for a specific POS tag:
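Here is a rough sketch of the `defaultdict(list)` approach, again on a small made-up sample rather than the real corpus. The function name `group_sentences_by_tag` is hypothetical; `sents_with_to` is keyed by POS tag, and each value is the list of sentences in which "to" carries that tag.

```python
from collections import defaultdict


def group_sentences_by_tag(tagged_sents, target_word):
    """Map each tag of target_word to the sentences where it occurs with that tag."""
    grouped = defaultdict(list)
    for sent in tagged_sents:
        for word, tag in sent:
            if word == target_word:
                grouped[tag].append(sent)
    return grouped


# Hypothetical sample data shaped like Brown tagged sentences.
sample_sents = [
    [("He", "PPS"), ("walked", "VBD"), ("to", "IN"), ("town", "NN"),
     ("and", "CC"), ("to", "IN"), ("school", "NN")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("go", "VB")],
    [("They", "PPSS"), ("drove", "VBD"), ("to", "IN"), ("work", "NN")],
]

sents_with_to = group_sentences_by_tag(sample_sents, "to")

# Print one sentence per tag: the first one recorded for each.
for pos in sents_with_to:
    print(pos, sents_with_to[pos][0])

# All sentences recorded for a specific POS tag:
print(len(sents_with_to["IN"]))
```

Unlike the early-exit loop, this makes a single pass over the data and keeps every match, so any tag can be looked up afterwards without rescanning.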
You'll notice that if 'to' appears twice in a sentence with the same POS tag, that sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:
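One possible way to drop the repeats while keeping the original order is sketched below. The helper `unique_sentences` is a made-up name, and the data is a small hand-written stand-in for `sents_with_to[pos]`. Sentences are lists, which are not hashable, so each one is converted to a tuple before being checked against a set.

```python
def unique_sentences(sentences):
    """Return the sentences with repeats removed, preserving first-seen order."""
    seen = set()
    unique = []
    for sent in sentences:
        key = tuple(sent)  # lists are unhashable; a tuple of (word, tag) pairs is fine
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique


# Hypothetical stand-in for sents_with_to["IN"]: the first sentence was
# recorded twice because "to" occurs in it twice with the same tag.
twice = [("He", "PPS"), ("walked", "VBD"), ("to", "IN"), ("town", "NN"),
         ("and", "CC"), ("to", "IN"), ("school", "NN")]
once = [("They", "PPSS"), ("drove", "VBD"), ("to", "IN"), ("work", "NN")]
recorded = [twice, twice, once]

deduped = unique_sentences(recorded)
print(len(recorded), len(deduped))
```

Note that this compares sentences by content, so two separate corpus sentences that happen to be identical token for token would also be collapsed into one.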