Is there a more efficient way to create an inverted index from a large text file?

def inverted_index(doc):
    # word_count returns a dict whose keys are the unique words in the file
    words = word_count(doc)
    for word in words:
        temp = []
        ln = 0  # the line counter has to restart for every word
        with open(doc) as file:
            for line in file:
                ln += 1
                if word in line.split():
                    temp.append(ln)
        words[word] = temp
    return words

I am trying to build an inverted index from a text file. Here words is a dictionary whose keys are the roughly 19,000 unique words in the file, and the file has about 5,000 lines. I want to map each word to the line numbers on which it appears, but the code takes too long to run because the nested loops rescan the entire file once per word, roughly 19,000 × 5,000 ≈ 95 million line checks instead of a single pass. Is there a more efficient way to do this?

1 Answer

Answered by Devesh:

Here is my approach to solving this; please read the notes below the code for some practical tips.

def inverted_index(doc):
    # read the whole file once to count lines, then rewind
    file = open(doc, encoding='utf8')
    f = file.read()
    file.seek(0)

    # get the number of lines by counting newline characters
    lines = 1
    for ch in f:
        if ch == '\n':
            lines += 1
    print("Number of lines in file is:", lines)  # just for debugging, please remove in the PROD version

    d = {}

    for i in range(lines):
        line = file.readline()
        # split() with no argument splits on any whitespace; split(' ')
        # would leave '\n' attached to the last word of each line
        for item in line.lower().split():
            if item not in d:
                d[item] = [i + 1]
            else:
                # 'else' (not a second 'if') keeps the first occurrence
                # from being appended twice
                d[item].append(i + 1)

    file.close()
    return d

print(inverted_index('file.txt'))
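
As a design note, a minimal sketch of the same single-pass index using collections.defaultdict, which removes the membership test entirely (the filename file.txt is just a placeholder):

from collections import defaultdict

def inverted_index(doc):
    # defaultdict(list) creates the empty list on first access,
    # so no 'if item not in d' check is needed
    d = defaultdict(list)
    with open(doc, encoding='utf8') as file:
        # enumerate yields line numbers starting at 1 in a single pass,
        # so the separate line-counting step is unnecessary
        for ln, line in enumerate(file, start=1):
            for word in line.lower().split():
                d[word].append(ln)
    return dict(d)

print(inverted_index('file.txt'))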

I would also suggest removing stopwords before building the inverted index if you want any meaningful analysis. You can use the nltk package for that.
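
For example, a small sketch of stopword filtering with nltk (this assumes the stopwords corpus is available; nltk.download fetches it on first use):

import nltk
from nltk.corpus import stopwords

# one-time download of the English stopword list
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def filter_stopwords(words):
    # keep only tokens that are not common English stopwords
    return [w for w in words if w.lower() not in stop_words]

print(filter_stopwords("this is an example of an inverted index".split()))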