Python NLTK Inaugural Text Corpora hands-on solution needed


I am taking a course on NLTK in Python which has a hands-on problem (on Katacoda) on "Text Corpora", and it is not accepting my solution shown below. I have been stuck on this problem for a long time, and I need to complete this hands-on to proceed further in the course.

Problem definition:

Import the inaugural corpus. For each of the inaugural address texts available in the corpus, perform the following: convert all words into lower case, then determine the number of words starting with america or citizen.

Hint : Compute conditional frequency distribution, where condition is the year in which the inaugural address was delivered and event is either america or citizen. Store the conditional frequency distribution in variable ac_cfd.

Print the frequency of words ['america', 'citizen'] in year [1841, 1993].

Hint: Make use of tabulate method associated with a conditional frequency distribution.

For this I have written the solution below:

ac_cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
ac_cfd.tabulate(conditions=['america', 'citizen'], samples=['1841', '1993'])

which gives output:

          1841 1993 
american     7   14  
citizen     38    2

I was not able to find the same problem on other forums, though I did find a similar problem that asked to plot the conditional frequency distribution. Its solution was the same as mine with one difference: instead of the tabulate line it had plot (https://www.nltk.org/book/ch02.html). But Katacoda isn't accepting this solution, and I am not able to proceed further in the course because completing the hands-on is mandatory. Please help.


There are 2 best solutions below

import nltk
from nltk.corpus import inaugural

# Each pair is (condition, sample): the year comes first, the matched word second.
ac_cfd = nltk.ConditionalFreqDist(
    [(fileid[:4], target) for fileid in inaugural.fileids() for w in inaugural.words(fileid) for target in
     ['america', 'citizen'] if w.lower().startswith(target)])

ac_cfd.tabulate(conditions=['1841', '1993'], samples=['america', 'citizen'])

The question asks you to print the frequency of the words ['america', 'citizen'] in the years [1841, 1993], so the year has to be the condition and the word the sample. You were doing the reverse, hence it is not getting accepted.
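To see why the order of each pair matters: a conditional frequency distribution groups (condition, sample) pairs by their first element, and tabulate prints one row per condition. Here is a minimal sketch of that grouping using only the standard library, so it runs without NLTK or the corpus. The helper `build_cfd` and the `toy_speeches` word lists are made up for illustration and are not real corpus data.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the inaugural corpus (NOT real counts).
toy_speeches = {
    "1841-Harrison.txt": ["Citizens", "of", "America", "citizen"],
    "1993-Clinton.txt": ["America", "Americans", "citizens"],
}
targets = ["america", "citizen"]


def build_cfd(pairs):
    """Group (condition, sample) pairs: one Counter of samples per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Year as condition, word as sample (the orientation the exercise asks for).
year_first = build_cfd(
    (fileid[:4], target)
    for fileid, words in toy_speeches.items()
    for w in words
    for target in targets
    if w.lower().startswith(target)
)

# Word as condition, year as sample (the orientation used in the question).
target_first = build_cfd(
    (target, fileid[:4])
    for fileid, words in toy_speeches.items()
    for w in words
    for target in targets
    if w.lower().startswith(target)
)

print(sorted(year_first))    # conditions are years
print(sorted(target_first))  # conditions are words
```

Both structures hold the same counts, but the rows and columns are swapped, which is presumably what the grader checks.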


Use the code below. It works for me on Katacoda. The question asks for the words starting with america and citizen, hence I sliced each word to its first 7 characters.

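import nltk
from nltk.corpus import inaugural

# Condition = year (first 4 characters of the file id);
# sample = first 7 characters of each lowercased word.
ac_cfd = nltk.ConditionalFreqDist([(fileid[:4], word.lower()[:7])
                                   for fileid in inaugural.fileids()
                                   for word in inaugural.words(fileid)])

# tabulate() prints the table itself and returns None, so it is not wrapped in print().
ac_cfd.tabulate(conditions=['1841', '1993'], samples=['america', 'citizen'])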



     america citizen
1841       7      38
1993      33       2
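The slicing in this answer gives the same matches as a startswith test only because both target words happen to be exactly seven characters long. A small standard-library check of that equivalence is sketched below. The `sample_words` list and both helper functions are hypothetical, chosen only for illustration.

```python
targets = ["america", "citizen"]

# Hypothetical sample words (not taken from the corpus).
sample_words = ["America", "Americans", "citizens", "Citizenship", "amer", "city"]


def matches_by_prefix(word):
    """Targets that the lowercased word starts with."""
    return [t for t in targets if word.lower().startswith(t)]


def matches_by_slice(word):
    """Target equal to the first 7 characters of the lowercased word, if any."""
    key = word.lower()[:7]
    return [key] if key in targets else []


for w in sample_words:
    print(w, matches_by_prefix(w), matches_by_slice(w))
```

If a target of a different length were added, the fixed slice of 7 would no longer line up with it, so the startswith form is the more general of the two.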