I am experimenting with the UCI Bag of Words dataset. I have read document IDs, word IDs, and word counts into three separate lists. The first 10 items of each list look like this:
['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count
I can't figure out how to build a term-document matrix from these lists without any redundancy. I'd like the rows to be docIDs, the columns to be wordIDs, and the corresponding cell values to be the word counts. What is an efficient way to do this in Python (pandas)?
I think this answers your question: put the three lists into a DataFrame, one list per column, then pivot it with "docIDs" as the index, "wordIDs" as the columns, and "count" as the values.
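A minimal sketch of those steps, using the sample lists from the question. Filling missing document/word pairs with 0 (rather than leaving them as NaN) is my assumption about the desired output:

```python
import pandas as pd

# The three lists from the question
docIDs = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
wordIDs = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

# DataFrame with each list in a separate column; counts cast to int
df = pd.DataFrame({'docIDs': docIDs,
                   'wordIDs': wordIDs,
                   'count': pd.to_numeric(counts)})

# Pivot: rows are docIDs, columns are wordIDs, values are counts.
# pivot() requires unique (docID, wordID) pairs; if your full data has
# duplicates, use pivot_table(..., aggfunc='sum') instead.
tdm = (df.pivot(index='docIDs', columns='wordIDs', values='count')
         .fillna(0)
         .astype(int))
print(tdm)
```

Each row is now one document and each column one word, so e.g. `tdm.loc['2', '7']` gives the count of word 7 in document 2.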
Alternatively, you can use unstack(): set both "docIDs" and "wordIDs" as the index, then unstack the "wordIDs" level into columns. This produces the same result and should use less memory.
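A sketch of the unstack() variant, under the same assumptions (sample lists from the question, missing pairs filled with 0):

```python
import pandas as pd

docIDs = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
wordIDs = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

df = pd.DataFrame({'docIDs': docIDs,
                   'wordIDs': wordIDs,
                   'count': pd.to_numeric(counts)})

# Make (docID, wordID) a MultiIndex, then unstack the wordIDs level
# into columns; fill_value=0 replaces missing pairs with zeros.
tdm = (df.set_index(['docIDs', 'wordIDs'])['count']
         .unstack(fill_value=0))
print(tdm)
```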