def inverted_index(doc):
    words = word_count(doc)
    ln = 0
    for word in words:
        temp = []
        with open(doc) as file:
            for line in file:
                ln += 1
                li = line.split()
                if word in li:
                    temp.append(ln)
        words[word] = temp
    return words
I am trying to create an inverted index from a text file. Here, words is a dictionary containing all 19,000 unique words in the file, and the file has more than 5,000 lines. I want to iterate through the file and the dictionary to build an inverted index that maps each word to the line numbers on which it appears, but the code takes too long to run because of the nested for loops: the file is re-read once for every word. Is there a more efficient way to do this?
Here is my approach to solving this. Please read the notes below the code for some pragmatic tips.
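A minimal sketch of a single-pass approach, assuming words are separated by whitespace as in the question's line.split(). The names build_inverted_index and inverted_index(path) are illustrative, not taken from the original post. The idea is to read the file once and record the line number under every word found on that line, rather than re-reading the file for each word.

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each word to the list of 1-based line numbers it appears on."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # set() records a word once per line even if it repeats on that line
        for word in set(line.split()):
            index[word].append(line_number)
    # Lines are visited in order, so each list is already sorted
    return dict(index)


def inverted_index(path):
    """Build the index from a text file, reading it a single time."""
    with open(path) as file:
        return build_inverted_index(file)
```

This does work proportional to the total number of words in the file, instead of opening and scanning the file once for each of the 19,000 unique words.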
I would suggest removing stopwords before creating the inverted index if you want any meaningful analysis. You can use the nltk package for that.
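A rough sketch of how stopword filtering could fit into the indexing step. The small EXAMPLE_STOP_WORDS set and the function name build_filtered_index are hypothetical and only keep the example self-contained. With nltk installed and its stopwords corpus downloaded, nltk.corpus.stopwords.words("english") could be passed in instead. Lowercasing the words is an added assumption, made so that "The" matches the lowercase stopword "the".

```python
from collections import defaultdict

# Hypothetical stand-in for a real stopword list, e.g.
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords")
EXAMPLE_STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "and", "to", "in"})


def build_filtered_index(lines, stop_words=EXAMPLE_STOP_WORDS):
    """Map each non-stopword (lowercased) to the line numbers it appears on."""
    stop_words = set(stop_words)  # accept a list as well; gives fast lookups
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # Assumption: compare case-insensitively by lowercasing every word
        words = {word.lower() for word in line.split()}
        for word in words - stop_words:
            index[word].append(line_number)
    return dict(index)
```

For example, build_filtered_index(["The cat is in the hat", "A cat and a dog"]) keeps only cat, hat, and dog as keys.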