Is there a more efficient way to create an inverted index from a large text file?

def inverted_index(doc):
    # word_count returns a dict whose keys are the unique words in the file
    words = word_count(doc)
    for word in words:
        temp = []
        ln = 0  # the line counter has to restart for every word
        with open(doc) as file:
            for line in file:
                ln += 1
                if word in line.split():
                    temp.append(ln)
        words[word] = temp
    return words

I am trying to build an inverted index from a text file. Here words is a dictionary whose keys are the roughly 19,000 unique words in the file, and the file has about 5,000 lines. I want to map each word to the line numbers on which it appears, but the code takes too long to run because the nested loops rescan the entire file once per word, roughly 19,000 × 5,000 ≈ 95 million line checks instead of a single pass. Is there a more efficient way to do this?

1 Answer

Answered by Devesh:

Here is my approach to solving this; please read the notes below the code for some practical tips.

def inverted_index(doc):
    # read the whole file once to count lines, then rewind
    file = open(doc, encoding='utf8')
    f = file.read()
    file.seek(0)

    # get the number of lines by counting newline characters
    lines = 1
    for ch in f:
        if ch == '\n':
            lines += 1
    print("Number of lines in file is:", lines)  # just for debugging, please remove in the PROD version

    d = {}

    for i in range(lines):
        line = file.readline()
        # split() with no argument splits on any whitespace; split(' ')
        # would leave '\n' attached to the last word of each line
        for item in line.lower().split():
            if item not in d:
                d[item] = [i + 1]
            else:
                # 'else' (not a second 'if') keeps the first occurrence
                # from being appended twice
                d[item].append(i + 1)

    file.close()
    return d

print(inverted_index('file.txt'))
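
As a design note, a minimal sketch of the same single-pass index using collections.defaultdict, which removes the membership test entirely (the filename file.txt is just a placeholder):

from collections import defaultdict

def inverted_index(doc):
    # defaultdict(list) creates the empty list on first access,
    # so no 'if item not in d' check is needed
    d = defaultdict(list)
    with open(doc, encoding='utf8') as file:
        # enumerate yields line numbers starting at 1 in a single pass,
        # so the separate line-counting step is unnecessary
        for ln, line in enumerate(file, start=1):
            for word in line.lower().split():
                d[word].append(ln)
    return dict(d)

print(inverted_index('file.txt'))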

I would also suggest removing stopwords before building the inverted index if you want any meaningful analysis. You can use the nltk package for that.
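
For example, a small sketch of stopword filtering with nltk (this assumes the stopwords corpus is available; nltk.download fetches it on first use):

import nltk
from nltk.corpus import stopwords

# one-time download of the English stopword list
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def filter_stopwords(words):
    # keep only tokens that are not common English stopwords
    return [w for w in words if w.lower() not in stop_words]

print(filter_stopwords("this is an example of an inverted index".split()))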