TypeError: list indices must be integers or slices, not str on Windows 10


I am trying to compute the inverse document frequency of the words in a collection of Sherlock Holmes stories. Some background first, then the code:

Inverse document frequency is the measure of how common or rare a word is across multiple documents.

So a word that appears in only a few documents gets a high inverse document frequency (idf for short), while a word that appears in nearly every document gets an idf close to zero.

The formula for idf is: idf(word) = log(Total_Documents / Number_Of_Documents_Containing(word))
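
As a minimal sketch of that formula on a made-up three-document corpus (the corpus and the helper name compute_idfs are illustrative and not part of main.py), assuming the natural logarithm that math.log uses by default:

```python
import math


def compute_idfs(corpus):
    """Map each word to log(total documents / documents containing it)."""
    total = len(corpus)
    words = set().union(*corpus.values())
    return {
        word: math.log(total / sum(word in doc for doc in corpus.values()))
        for word in words
    }


# Toy corpus (made up): filename -> set of words in that file
toy_corpus = {
    "a.txt": {"holmes", "watson", "pipe"},
    "b.txt": {"holmes", "watson"},
    "c.txt": {"holmes"},
}

idfs = compute_idfs(toy_corpus)
print(idfs["holmes"])  # in all 3 documents: log(3/3) = 0.0
print(idfs["pipe"])    # in 1 of 3 documents: log(3/1), about 1.0986
```

A word found in every document scores 0, and the score grows as the word becomes rarer.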

main.py

import math
import nltk
import os
import sys


def main():

    if len(sys.argv) != 2:
        sys.exit("Usage: python main.py corpus")
    print("Loading data...")
    corpus = load_data(sys.argv[1])

    words = set()
    for filename in corpus:
        words.update(corpus[filename])

    idfs = list()
    for word in words:
        f = sum(word in corpus[filename] for filename in corpus)
        idf = math.log(len(corpus) / f)
        idfs[word] = idf

    tfidfs = dict()
    for filename in corpus:
        tfidfs[filename] = []
        for word in corpus[filename]:
            tf = corpus[filename][word]
            tfidfs[filename].append((word, tf * idfs[word]))

    for filename in corpus:
        tfidfs[filename].sort(key=lambda tfidf: tfidf[1], reverse=True)
        tfidfs[filename] = tfidfs[filename][:5]

    print()
    for filename in corpus:
        print(filename)
        for term, score in tfidfs[filename]:
            print(f"    {term}: {score:.4f}")


def load_data(directory):
    files = dict()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:

            contents = [
                word.lower() for word in
                nltk.word_tokenize(f.read())
                if word.isalpha()
            ]

            frequencies = dict()
            for word in contents:
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
            files[filename] = frequencies

    return files


if __name__ == "__main__":
    main()

But when I run python .\main.py .\shelock_holmes\ in PowerShell, I get this confusing error:

Loading data...
Traceback (most recent call last):
  File ".\main.py", line 65, in <module>
    main()
  File ".\main.py", line 22, in main
    idfs[word] = idf
TypeError: list indices must be integers or slices, not str

Can anybody please help me?


There are 2 best solutions below

Best answer

You define idfs as a list:

idfs = list()

If idfs is a list, then in this assignment:

idfs[word] = idf

word must be an integer, because it specifies an index or position within the list.

But words is a set of str, and so inside the iteration:

for word in words:

word is a str. Since a str is not an integer, the line

idfs[word] = idf

causes the error you're getting, for exactly the reason that it explains. Maybe idfs should be a dict rather than a list, defined like this:

idfs = dict()

Then the line:

idfs[word] = idf

interprets word as a key in the dictionary, and assigns idf as the value of that key in the dict. Dictionary keys can be any hashable object, and are most often strings, so this makes good sense.
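
A minimal sketch of the difference, using a made-up word and value rather than your real corpus:

```python
word, idf = "holmes", 0.4055  # made-up example values

idfs = list()
try:
    idfs[word] = idf  # a list accepts only integer or slice indices
except TypeError as error:
    message = str(error)
    print(message)  # list indices must be integers or slices, not str

idfs = dict()
idfs[word] = idf  # a dict stores the value under the key "holmes"
print(idfs)       # {'holmes': 0.4055}
```

The later lookup idfs[word] in your tf-idf loop then works unchanged, because it reads the value back by the same string key.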

Second answer

idfs is a list, but idfs[word] = idf tries to assign to it by key, as if it were a dictionary. So instead of idfs = list(), make it a dictionary with idfs = {}. Otherwise, if you really need a list, use .append() to add items to the end.
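
If a list is really what you need, one option is to append (word, idf) pairs. This is only a sketch with made-up values; the names computed and idf_pairs are illustrative and not from main.py. Note that a lookup by word such as idfs[word] would no longer work with this layout, so the dictionary is the better fit for the tf-idf loop in the question.

```python
# Made-up example values standing in for the computed idfs
computed = {"holmes": 0.0, "pipe": 1.0986}

idf_pairs = []
for word, idf in computed.items():
    idf_pairs.append((word, idf))  # append adds to the end of the list

print(idf_pairs)     # [('holmes', 0.0), ('pipe', 1.0986)]
print(idf_pairs[0])  # integer index works: ('holmes', 0.0)
```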