How to create an efficient term-document matrix from bag-of-words dataset


I am experimenting with the UCI Bag of Words dataset. I have read the document IDs, word IDs, and word counts into three separate lists. The first 10 items of each list look like this:

['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count

I can't figure out how to build a term-document matrix from these lists without redundancy. I'd like the rows to be docIDs, the columns to be wordIDs, and the cell values to be the word counts. What is an efficient way to do this in Python (pandas)?

1 Answer
I think this answers your question:

Lists:

docid = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
wordid = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
counted = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count

DataFrame with each list in a separate column:

import pandas as pd

df = pd.DataFrame([docid, wordid, counted],
                  index=["docIDs", "wordIDs", "count"]).T

Pivot this for index as "docIDs", columns as "wordIDs", values as "count":

df = df.pivot(index="docIDs", columns="wordIDs", values="count")

Output:

#wordIDs  118  129  168   20  285  529 6941    7  890
#docIDs                                              
#1          1    1    1    2    1  NaN  NaN  NaN  NaN
#2        NaN  NaN  NaN  NaN  NaN    1    1    5  NaN
#3        NaN  NaN  NaN  NaN    1  NaN  NaN  NaN    1
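Note that the counts above are still strings and missing (doc, word) pairs come out as NaN. If you want a proper integer matrix with zeros, one way (a minimal sketch of the same pivot, with an added cast and fill) is:

```python
import pandas as pd

docid = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
wordid = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counted = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

df = pd.DataFrame({"docIDs": docid, "wordIDs": wordid, "count": counted})
df["count"] = df["count"].astype(int)   # counts as integers, not strings

tdm = (df.pivot(index="docIDs", columns="wordIDs", values="count")
         .fillna(0)                     # zero-fill missing (doc, word) pairs
         .astype(int))
print(tdm)
```

With integer values you can then compute row sums, tf-idf weights, etc. directly on the matrix.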

Alternatively, you can use unstack(): set "docIDs" and "wordIDs" as a MultiIndex, select the "count" column, then unstack the "wordIDs" level:

df.set_index(["docIDs", "wordIDs"])["count"].unstack("wordIDs")

This produces the same result and can use less memory than pivot on larger inputs.
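For the full UCI bag-of-words files (millions of (doc, word) entries), a dense pandas matrix may not fit in memory at all. A sparse sketch, assuming scipy is available and using the fact that the UCI IDs are 1-based integers, would be:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same toy lists as above; with the real files these would be the full columns.
docid = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
wordid = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counted = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

rows = np.array(docid, dtype=int) - 1    # docIDs are 1-based in the UCI files
cols = np.array(wordid, dtype=int) - 1   # wordIDs are 1-based as well
data = np.array(counted, dtype=int)

# CSR matrix of shape (n_docs, vocab_size); only nonzero counts are stored,
# and duplicate (row, col) pairs would be summed automatically.
tdm = csr_matrix((data, (rows, cols)),
                 shape=(rows.max() + 1, cols.max() + 1))
```

Here `tdm[0, 19]` is the count of wordID 20 in document 1; the matrix stores only the 10 nonzero entries rather than all 3 × 6941 cells.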