Build a document-term matrix from a list of documents, each of which is in list form


I wonder if there is an elegant way to convert a list of documents, each itself a list of tokens, into a document-term matrix. The motivation is that I need to apply subtle transformations to the terms first, e.g., stemming. The input data looks like:

[['tom','want','apple'],['tom','love','pear']]

output data should be a matrix or whatever data type that can be easily converted to a numpy.array. Just like:

[[1,1,1,0,0],[1,0,0,1,1]]

What I have now is to join every inner list into a single string and then use CountVectorizer from sklearn.feature_extraction.text. But doing that for a large corpus is inefficient.
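For reference, the join-then-vectorize workaround described above can be sketched like this (assuming the tokens can safely be joined with whitespace):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [['tom', 'want', 'apple'], ['tom', 'love', 'pear']]

# Join each token list back into a single string, then re-tokenize with
# CountVectorizer -- this double handling is the inefficiency in question.
joined = [' '.join(doc) for doc in docs]
vec = CountVectorizer()
X = vec.fit_transform(joined)          # sparse document-term matrix
dense = X.toarray()                    # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 0]]
```

Note that CountVectorizer sorts its vocabulary alphabetically, so the columns here are apple, love, pear, tom, want.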

Any suggestions? Thank you.


Best answer:

Use MultiLabelBinarizer.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data = [['tom','want','apple'],['tom','love','pear']]
mlb.fit_transform(data)

This returns:

array([[1, 0, 0, 1, 1],
       [0, 1, 1, 1, 0]])

Note: the terms (columns) are sorted in alphabetical order, as you can see from classes_:

mlb.classes_
>>> array(['apple', 'love', 'pear', 'tom', 'want'])
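Since the question mentions a large corpus: MultiLabelBinarizer also accepts sparse_output=True, which returns a scipy sparse matrix instead of a dense array and may be a better fit at scale. A minimal sketch:

```python
from sklearn.preprocessing import MultiLabelBinarizer

data = [['tom', 'want', 'apple'], ['tom', 'love', 'pear']]

# sparse_output=True keeps the result as a scipy.sparse matrix,
# avoiding a dense allocation for a large vocabulary.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(data)

# Convert only when a dense numpy array is actually needed.
dense = X.toarray()
```

One caveat: MultiLabelBinarizer records presence/absence only, so repeated terms within a document still map to 1; use CountVectorizer if you need actual counts.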