Trigram model raises IndexError: list index out of range when choosing a random word


I'm new to Python and need help with NLTK language modeling.

I'm trying to generate a sentence starting with "he said" using a trigram model, but I get the following error:

Traceback (most recent call last):
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 77, in <module>
    suffix = pick_word(d[prefix])
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 71, in pick_word
    return random.choice(sents)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2288.0_x64__qbz5n2kfra8p0\lib\random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
IndexError: list index out of range

I don't understand why it says the list index is out of range. I expected it to take the Reuters sentences, pick a word from them at random, and pass that word on as the suffix.

Here's the whole code. Please focus only on the trigram portion, as the rest is incomplete.
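For context, the last frame of the traceback is random.choice indexing into its argument, which fails this way when the sequence is empty. A minimal sketch of that situation, assuming the dictionary is a defaultdict(list) as in the code below and that the looked-up prefix was never stored; the guard inside pick_word is a hypothetical addition, not part of the posted code:

```python
import random
from collections import defaultdict

d = defaultdict(list)

# Looking up a key that was never stored does not fail:
# defaultdict creates an empty list for it and returns that list.
missing = d["never", "seen"]

# random.choice on an empty sequence raises IndexError.
try:
    random.choice(missing)
    error = None
except IndexError as exc:
    error = exc

def pick_word(candidates):
    """Return a random element, or None when there is nothing to choose from."""
    # Hypothetical guard: an empty list means the prefix has no known successor.
    if not candidates:
        return None
    return random.choice(candidates)
```

So the argument actually passed to pick_word is the list of successors stored under the prefix, not the Reuters sentences, and the error suggests that list is empty for ("he", "said").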

# imports
import string
import random

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters, stopwords
from collections import defaultdict
from nltk import FreqDist, ngrams

# input the reuters sentences
sents = reuters.sents()

# write the removal characters such as : Stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '"' + '"' + '-' + '''+''' + '—'
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']

# generate unigrams bigrams trigrams
unigram = []
trigram = []
tokenized_text = []

for sentence in sents:
    sentence = list(map(lambda x: x.lower(), sentence))
for word in sentence:
    if word == '.':
        sentence.remove(word)
    else:
        unigram.append(word)

tokenized_text.append(sentence)
trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# remove the n-grams with removable words
def remove_stopwords(x):
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in removal_list:
                count = count or 0
            else:
                count = count or 1
        if (count == 1):
            y.append(pair)
    return (y)

trigram = remove_stopwords(trigram)

# generate frequency of n-grams
freq_tri = FreqDist(trigram)

d = defaultdict(list)

#Trigrams
for a, b, c in freq_tri:
    if (a != None and b != None and c != None):
        d[a, b].extend([c] * freq_tri[a,b,c])
#        print(" d[a, b].extend([c] * freq_tri[a,b,c]) ",  d[a, b].extend([c] * freq_tri[a,b,c]))

#Next word prediction
s = ''

def pick_word(sents):
    "Chooses a random element."
    return random.choice(sents)

prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])
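One possible cause, if the indentation above matches the real file: the lines from "for word in sentence:" down to the trigram.extend call are not indented under "for sentence in sents:", so they would run only once, for the last sentence, and d["he", "said"] would stay empty. The generation loop as posted also never updates prefix or s. The following is a minimal sketch of the intended flow under those assumptions. It uses a tiny hand-written corpus in place of reuters.sents(), builds trigrams with zip instead of nltk.ngrams, and skips the padding and stopword filtering; toy_sents, build_table and generate are hypothetical names.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for reuters.sents(): a list of tokenized sentences.
toy_sents = [
    ["He", "said", "the", "deal", "was", "final", "."],
    ["He", "said", "the", "bank", "would", "wait", "."],
]

def build_table(sentences):
    """Map each (w1, w2) prefix to the list of words seen after it."""
    table = defaultdict(list)
    for sentence in sentences:
        # Everything below runs once per sentence, inside the loop.
        # Build a new list instead of removing items while iterating.
        words = [w.lower() for w in sentence if w != "."]
        for a, b, c in zip(words, words[1:], words[2:]):
            table[a, b].append(c)
    return table

def generate(table, prefix, max_words=19):
    """Extend prefix one word at a time, stopping at an unseen prefix."""
    out = list(prefix)
    for _ in range(max_words):
        # .get returns None for unseen prefixes without creating empty entries.
        candidates = table.get(tuple(out[-2:]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)

table = build_table(toy_sents)
print(generate(table, ("he", "said")))
```

Repeating a successor once per occurrence, as the posted d[a, b].extend([c] * freq) line does, means random.choice samples in proportion to trigram frequency; appending during the scan has the same effect here.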

What am I doing wrong? Am I wrong to assume that I'm passing the Reuters sentences to pick_word so that it can choose a word at random?

I thought I might be passing the wrong list to pick_word, so I tried tokenized_text instead, but I get the same error. So I think my assumption or understanding is wrong, but I'm not sure which part.
