Hi, I need to create a postings dict from a list of unique words, tokenised and processed from multiple files. The final format of the postings dictionary would be: {wordid: [0, 1, ...], wordid2: [0, 1, ...]}.
I am really struggling with this; the only code I have right now is:
for i in range(len(docids)):
  for word in vocab:
    if word not in postings.keys():
      postings[word] = []
    else:
      postings[word].append(i)
This just outputs a dictionary keyed by term, in which the docids repeat in strange patterns depending on the number of files I've asked it to index.
Example input and expected output:
Doc1 = "hello my name is john", Doc2 = "hi my second name is smith".
This would make a vocab list: ['hello', 'my', 'name', 'is', 'john', 'hi', 'second', 'smith'].
Each word has a wordid, which is just the index of the word in vocab.
And a docid: [0, 1]
(this just counts the documents and is used in creating the postings list to say: word w occurs in document doc)
The final output of this example would be:
postings = {0: [0], 1: [0,1], 2: [0,1], 3: [0,1], 4: [0], 5: [1], 6: [1], 7: [1]}
So this dict shows each wordid (the index of each word in vocab) and which document(s) it appears in.
Also, the program as a whole is supposed to be run from the terminal, with the directory and the number of files given as arguments.
Going line by line:

You're going over the range, but the content of docids is already [0, 1, ..] and you need to add the docids themselves to the postings, not their indices, so you should loop over docids directly, e.g. for docid in docids:

Next one:

Since you want the wordid to be its index in vocab and not the actual word itself, you should use the range here, i.e. for wordid in range(len(vocab)):, or something like enumerate(vocab) (also, please use a standard indent of 4 in Python, almost everyone else does).

And then:
You're looking to add a key if it doesn't exist, or append to it if it does. That's what defaultdict(list) (from collections) is for, so initialise postings = defaultdict(list) and you won't need the next couple of lines.

Finally, you need to check if a word is in a document, but nothing in your code actually checks that. Since you've already processed the entire set of documents creating vocab, it seems wasteful to look for every word individually in every document again. You should construct your postings as you're constructing vocab.

If you must do it this way, it makes more sense to iterate over whatever is holding your docs, instead of the docids - but it's hard to provide code that would work because you're not sharing example values of the variables you're using.
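
As a rough sketch of building postings at the same time as vocab: this assumes your documents are already available as lists of tokens, one list per document, with the position in the outer list serving as the docid. The names docs, wordids and build_index are made up for illustration, since the question doesn't show how the documents are actually held.

```python
from collections import defaultdict


def build_index(docs):
    """Build vocab and postings in a single pass over the documents.

    Assumption: docs is a list of token lists, one per document, and a
    document's position in that list is its docid.
    """
    vocab = []                    # wordid -> word
    wordids = {}                  # word -> wordid, so lookups stay cheap
    postings = defaultdict(list)  # wordid -> list of docids

    for docid, tokens in enumerate(docs):
        for word in tokens:
            if word not in wordids:
                # First time this word is seen: its index in vocab is its wordid.
                wordids[word] = len(vocab)
                vocab.append(word)
            wordid = wordids[word]
            # Record each docid only once per word, even if the word repeats.
            if not postings[wordid] or postings[wordid][-1] != docid:
                postings[wordid].append(docid)

    return vocab, dict(postings)


docs = [
    "hello my name is john".split(),
    "hi my second name is smith".split(),
]
vocab, postings = build_index(docs)
print(vocab)
print(postings)
```

With the two example documents from the question, this gives the vocab list and the postings dict shown there. The tokenising itself is reduced to split() here; your own processing would take its place.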