I am trying to compute the inverse document frequency (idf) of the words in a collection of Sherlock Holmes stories; my code is below.
Inverse document frequency measures how common or rare a word is across a set of documents.
A word that appears in only a few of the documents gets a high idf, while a word that appears in every document gets an idf of 0.
The formula for idf is: idf(word) = log(Total_Documents / The_Number_Of_Documents_Containing(word))
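To make the formula concrete, here is a minimal sketch on a made-up three-document corpus. The names `toy_corpus` and `idf` are hypothetical and only for illustration; the natural logarithm is assumed, matching `math.log` in the script.

```python
import math

# Hypothetical toy corpus: each document is reduced to the set of words it contains.
toy_corpus = {
    "a.txt": {"holmes", "watson", "pipe"},
    "b.txt": {"holmes", "moor"},
    "c.txt": {"holmes", "watson"},
}


def idf(word, corpus):
    """Natural log of (total documents / documents containing the word)."""
    containing = sum(word in words for words in corpus.values())
    return math.log(len(corpus) / containing)


print(idf("holmes", toy_corpus))  # in all 3 documents: log(3 / 3)
print(idf("watson", toy_corpus))  # in 2 of 3 documents: log(3 / 2)
print(idf("moor", toy_corpus))    # in 1 of 3 documents: log(3 / 1)
```

The rarer the word is across documents, the larger the value, which is why idf is used to down-weight words that show up everywhere.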
main.py
import math
import nltk
import os
import sys


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python main.py corpus")
    print("Loading data...")
    corpus = load_data(sys.argv[1])

    words = set()
    for filename in corpus:
        words.update(corpus[filename])

    idfs = list()
    for word in words:
        f = sum(word in corpus[filename] for filename in corpus)
        idf = math.log(len(corpus) / f)
        idfs[word] = idf

    tfidfs = dict()
    for filename in corpus:
        tfidfs[filename] = []
        for word in corpus[filename]:
            tf = corpus[filename][word]
            tfidfs[filename].append((word, tf * idfs[word]))

    for filename in corpus:
        tfidfs[filename].sort(key=lambda tfidf: tfidf[1], reverse=True)
        tfidfs[filename] = tfidfs[filename][:5]

    print()
    for filename in corpus:
        print(filename)
        for term, score in tfidfs[filename]:
            print(f"    {term}: {score:.4f}")


def load_data(directory):
    files = dict()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:
            contents = [
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if word.isalpha()
            ]
            frequencies = dict()
            for word in contents:
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
            files[filename] = frequencies
    return files


if __name__ == "__main__":
    main()
But when I run python .\main.py .\shelock_holmes\ in PowerShell, I get this confusing error:
Loading data...
Traceback (most recent call last):
File ".\main.py", line 65, in <module>
main()
File ".\main.py", line 22, in main
idfs[word] = idf
TypeError: list indices must be integers or slices, not str
Can anybody please help me?
You define idfs as a list:

    idfs = list()

If idfs is a list, then in this assignment:

    idfs[word] = idf

word must be an integer, because it specifies an index or position within the list. But words is a set of str, and so inside the iteration:

    for word in words:

word is a str. Since a str is not an integer, the line

    idfs[word] = idf

causes the error you're getting, for exactly the reason that it explains. Maybe idfs should be a dict rather than a list, defined like this:

    idfs = dict()

Then the line:

    idfs[word] = idf

interprets word as a key in the dictionary, and assigns idf as the value of that key in the dict. Dictionary keys can be any hashable object, and are most often strings, so this makes good sense.
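The difference can be reproduced without NLTK or the story files. The sketch below is only an illustration: `corpus` is a hypothetical hand-written stand-in for what load_data() returns (filename mapped to word counts), and `idfs_list` exists purely to trigger the error.

```python
import math

# Hypothetical stand-in for the result of load_data(): filename -> {word: count}
corpus = {
    "a.txt": {"holmes": 3, "pipe": 1},
    "b.txt": {"holmes": 2, "moor": 4},
}

words = set()
for filename in corpus:
    words.update(corpus[filename])  # adds the dict's keys, i.e. the words

# With a list, indexing by a string raises the TypeError from the question.
idfs_list = list()
try:
    idfs_list["holmes"] = 0.0
except TypeError as exc:
    error_message = str(exc)
    print("list:", error_message)

# With a dict, the same assignment stores the value under a string key.
idfs = dict()
for word in words:
    f = sum(word in corpus[filename] for filename in corpus)
    idfs[word] = math.log(len(corpus) / f)

print("dict:", idfs["holmes"], idfs["moor"])
```

Because the later tf-idf loop looks values up with idfs[word], a dict keyed by word fits the rest of the script without further changes.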