Postings list in Python


Hi, I need to create a postings dict from a list of unique words that have been tokenised and processed from multiple files. The final format of the postings dictionary would be: {wordid: [0, 1, ...], wordid2: [0, 1, ...]}.

I am really struggling with this. The only code I have right now is:

    for i in range(len(docids)):
      for word in vocab:
        if word not in postings.keys():
          postings[word] = []
        else:
          postings[word].append(i)    

This just outputs a dictionary keyed by term, where the docids repeat in strange patterns depending on the number of files I've asked it to index.

Example input and expected output:
Doc1 = "hello my name is john", Doc2 = "hi my second name is smith".
This would make a vocab list: ['hello', 'my', 'name', 'is', 'john', 'hi', 'second', 'smith'].
Each word has a wordid, which is just the index of the word in vocab.

And a docid: [0, 1]
(this just counts the documents and is used in creating the postings list to say: word w occurs in document doc)

The final output of this example would be:

postings = {0: [0], 1: [0,1], 2: [0,1], 3: [0,1], 4: [0], 5: [1], 6: [1], 7: [1]}

So this dict shows each wordid (the index of each word in vocab) and which document(s) it appears in.

Also, the program as a whole is supposed to be run from the terminal, taking the directory and number of files as arguments.


There are 2 best solutions below


Going line by line:

    for i in range(len(docids)):

You're iterating over a range of indices, but docids already contains [0, 1, ...], and what you want to add to the postings is the docid itself, not its index. So iterate over the docids directly:

    for docid in docids:

Next one:

      for word in vocab:

Since you want the wordid to be the word's index in vocab, not the word itself, you should use range here, or better, enumerate (also, please use the standard indent of 4 spaces in Python, as almost everyone else does):

    for wordid, word in enumerate(vocab):
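As a quick illustration of what enumerate gives you, here is a small sketch using the vocab list from the question's example. The name pairs is only there for the demonstration:

```python
# vocab taken from the example in the question
vocab = ['hello', 'my', 'name', 'is', 'john', 'hi', 'second', 'smith']

# enumerate yields (index, item) pairs, so the index can serve as the wordid
pairs = list(enumerate(vocab))

print(pairs[:3])
# [(0, 'hello'), (1, 'my'), (2, 'name')]
```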

And then:

        if word not in postings.keys():

You're looking to add a key if it doesn't exist, or append to it if it does. That's what defaultdict(list) is for, so initialise postings = defaultdict(list) (after from collections import defaultdict) and you won't need the next couple of lines.
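A minimal sketch of how defaultdict(list) behaves, with made-up wordids and docids just to show the mechanics:

```python
from collections import defaultdict

postings = defaultdict(list)

# No "is the key there yet?" check is needed: a missing key starts out as []
postings[1].append(0)
postings[1].append(1)
postings[4].append(0)

print(dict(postings))
# {1: [0, 1], 4: [0]}
```

Note that merely reading postings[some_key] also creates an empty list for that key, so use "in" or .get() if you only want to look something up.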

Finally, you need to check whether a word is in a document, but nothing in your code actually does that. Since you've already processed the entire set of documents to create vocab, it seems wasteful to look for every word individually in every document again.
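To see where the repeating pattern comes from, here is the loop from the question run on placeholder values. The three docids and the three-word vocab are assumptions, since the real values weren't posted:

```python
docids = [0, 1, 2]         # assumed: three documents
vocab = ['a', 'b', 'c']    # assumed: placeholder vocab

postings = {}
for i in range(len(docids)):
    for word in vocab:
        if word not in postings.keys():
            postings[word] = []        # first pass only creates the key
        else:
            postings[word].append(i)   # every later pass appends i for every word

print(postings)
# {'a': [1, 2], 'b': [1, 2], 'c': [1, 2]}
```

Every word ends up with the same list, because the document contents are never consulted, and document 0 is never recorded at all.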

You should construct your postings as you're constructing vocab.
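Something along these lines, as a rough sketch only. It assumes the documents are already available as plain strings and that splitting on whitespace stands in for your real tokenising; docs and word_to_id are names I made up:

```python
from collections import defaultdict

# Assumed input: the two example documents from the question
docs = ["hello my name is john", "hi my second name is smith"]

word_to_id = {}                # word -> wordid, for fast lookups
vocab = []                     # wordid -> word
postings = defaultdict(list)   # wordid -> list of docids

for docid, text in enumerate(docs):
    # dict.fromkeys drops repeated words but keeps first-seen order
    for word in dict.fromkeys(text.split()):
        if word not in word_to_id:
            word_to_id[word] = len(vocab)   # next free index becomes the wordid
            vocab.append(word)
        postings[word_to_id[word]].append(docid)

print(dict(postings))
# {0: [0], 1: [0, 1], 2: [0, 1], 3: [0, 1], 4: [0], 5: [1], 6: [1], 7: [1]}
```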

If you must do it this way, it makes more sense to iterate over whatever is holding your docs instead of the docids. It's hard to provide code that would work, though, because you haven't shared example values of the variables you're using.
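With that caveat, a sketch under assumed values might look like this. Here docs is a hypothetical list of document strings and vocab is the list from the question's example; the membership test against words_in_doc is the check that the original loop is missing:

```python
from collections import defaultdict

# Assumed values, based on the example in the question
docs = ["hello my name is john", "hi my second name is smith"]
vocab = ['hello', 'my', 'name', 'is', 'john', 'hi', 'second', 'smith']

postings = defaultdict(list)
for docid, text in enumerate(docs):
    words_in_doc = set(text.split())    # a set makes the "in" test cheap
    for wordid, word in enumerate(vocab):
        if word in words_in_doc:        # only record docs that contain the word
            postings[wordid].append(docid)

print(dict(sorted(postings.items())))
# {0: [0], 1: [0, 1], 2: [0, 1], 3: [0, 1], 4: [0], 5: [1], 6: [1], 7: [1]}
```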


I think I have what you're looking for. I have simplified the problem a bit by using a list of strings instead of a list of files, but hopefully you get where I'm coming from.

# This would be your list of files
Docs = [
    "Hello my name is John",
    "Hi my second name is Smith"
]

vocab = []
postings = {}
for i,doc in enumerate(Docs):
    # if using files for Docs this is 
    # where you read the text from them
    words = doc.split(" ")
    for word in words:
        if word not in vocab:
            vocab.append(word)
        wordId = vocab.index(word)
        if wordId not in postings:
            postings[wordId] = [i]
        elif postings[wordId][-1] != i:
            # skip repeats of a word within the same document
            postings[wordId].append(i)

print(postings)
#{0: [0], 1: [0, 1], 2: [0, 1], 3: [0, 1], 
# 4: [0], 5: [1], 6: [1], 7: [1]}
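If you do want to read from files, the same idea could be wrapped up roughly like this. It is only a sketch under several assumptions: build_postings and main are hypothetical names, the files are taken in sorted name order (so the docid is the position in that order), and splitting on whitespace stands in for whatever tokenising and processing you already do:

```python
from collections import defaultdict
from pathlib import Path


def build_postings(directory, n_files):
    """Return (vocab, postings) for the first n_files files in directory."""
    # Assumption: files are taken in sorted name order
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())[:n_files]

    word_to_id = {}
    vocab = []
    postings = defaultdict(list)
    for docid, path in enumerate(paths):
        # Assumption: whitespace splitting replaces the real tokenising step
        words = path.read_text(encoding="utf-8").split()
        for word in dict.fromkeys(words):   # drop repeats, keep order
            if word not in word_to_id:
                word_to_id[word] = len(vocab)
                vocab.append(word)
            postings[word_to_id[word]].append(docid)
    return vocab, dict(postings)


def main(argv):
    # Hypothetical usage: python index.py <directory> <number_of_files>
    vocab, postings = build_postings(argv[1], int(argv[2]))
    print(postings)
```

To run it from the terminal you would import sys and call main(sys.argv) under an if __name__ == "__main__": guard.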