I'm new to Python and need help with NLTK language modeling.
I'm trying to generate a sentence starting with "he said" using a trigram model, but I get the following error:
Traceback (most recent call last):
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 77, in <module>
    suffix = pick_word(d[prefix])
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 71, in pick_word
    return random.choice(sents)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2288.0_x64__qbz5n2kfra8p0\lib\random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
IndexError: list index out of range
I don't understand why it's complaining that the list index is out of range. What I think it should be doing is taking the Reuters sentences, picking a word from them at random, and passing it as the suffix.
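For reference, the same exception type can be triggered without NLTK at all. The following is a minimal sketch, assuming the list that reaches random.choice is empty; the toy `table` and the helper `try_choice` are hypothetical stand-ins and are not part of the script.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for the trigram table `d`:
# maps a two-word prefix to the list of candidate next words.
table = defaultdict(list)
table["he", "said"].append("that")

def try_choice(prefix):
    """Return a random continuation, or the exception name if the choice fails."""
    try:
        return random.choice(table[prefix])
    except IndexError as exc:
        return type(exc).__name__

known = try_choice(("he", "said"))     # prefix present: picks from ["that"]
missing = try_choice(("said", "zzz"))  # prefix absent: defaultdict creates an empty list
print(known, missing)
```

Note that looking up a missing key in a defaultdict(list) does not raise a KeyError; it silently inserts an empty list, and random.choice then fails on that empty list.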
Here's the whole code. Please focus only on the trigram portion, as the rest is incomplete:
# imports
import string
import random
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters, stopwords
from collections import defaultdict
from nltk import FreqDist, ngrams
# input the reuters sentences
sents = reuters.sents()
# write the removal characters such as : Stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '"' + '"' + '-' + '''+''' + '—'
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']
# generate unigrams bigrams trigrams
unigram = []
trigram = []
tokenized_text = []
for sentence in sents:
    sentence = list(map(lambda x: x.lower(), sentence))
    for word in sentence:
        if word == '.':
            sentence.remove(word)
        else:
            unigram.append(word)
    tokenized_text.append(sentence)
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))
# remove the n-grams with removable words
def remove_stopwords(x):
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in removal_list:
                count = count or 0
            else:
                count = count or 1
        if (count == 1):
            y.append(pair)
    return (y)
trigram = remove_stopwords(trigram)
# generate frequency of n-grams
freq_tri = FreqDist(trigram)
d = defaultdict(list)
#Trigrams
for a, b, c in freq_tri:
    if (a != None and b != None and c != None):
        d[a, b].extend([c] * freq_tri[a,b,c])
# print(" d[a, b].extend([c] * freq_tri[a,b,c]) ", d[a, b].extend([c] * freq_tri[a,b,c]))
#Next word prediction
s = ''
def pick_word(sents):
    "Chooses a random element."
    return random.choice(sents)
prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])
What am I doing wrong? Am I wrong to assume that I'm passing the Reuters sentences to pick_word so it can choose a word at random?
I thought maybe I was passing the wrong list to the pick_word function, so I tried tokenized_text instead. I get the same error, so I think my assumption or understanding is wrong, but I'm not sure which part.
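To make the intended behaviour concrete, here is a small sketch of the prefix-to-next-word lookup on a toy corpus. It is an illustration under stated assumptions, not a diagnosis of the script above: `toy_sents`, `table`, and `generate` are hypothetical names, plain `zip` stands in for `nltk.ngrams` so the example runs without downloading corpora, and the loop assumes the prefix is meant to slide forward by one word per step.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus standing in for reuters.sents().
toy_sents = [
    ["he", "said", "the", "bank", "would", "act"],
    ["he", "said", "the", "rate", "would", "rise"],
]

# Map each two-word prefix to the words observed right after it.
table = defaultdict(list)
for sent in toy_sents:
    for a, b, c in zip(sent, sent[1:], sent[2:]):
        table[a, b].append(c)

def generate(prefix, max_words=19, seed=0):
    """Extend `prefix` one word at a time; stop at a prefix with no continuation."""
    rng = random.Random(seed)
    words = list(prefix)
    for _ in range(max_words):
        candidates = table.get(prefix, [])  # .get avoids inserting empty entries
        if not candidates:
            break  # dead end, e.g. the last two words of a sentence
        suffix = rng.choice(candidates)
        words.append(suffix)
        prefix = (prefix[1], suffix)  # slide the window forward by one word
    return words

sentence = generate(("he", "said"))
print(" ".join(sentence))
```

In this sketch the list handed to the random choice is the list of continuations stored under a prefix, not the corpus sentences themselves, and the guard shows one place where such a list can turn out to be empty.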