I wonder if there is an elegant way to convert a list of documents to a document-term matrix. The motivation is that I need to apply subtle transformations to the terms of each document first, e.g., stemming. The input data looks like:
[['tom','want','apple'],['tom','love','pear']]
The output should be a matrix, or any data type that can be easily converted to a numpy.array, like:
[[1,1,1,0,0],[1,0,0,1,1]]
What I do now is join the elements of each inner list into a single string and then use CountVectorizer from sklearn.feature_extraction.text. But that is inefficient for a large corpus.
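For reference, the join-then-vectorize approach described above might look roughly like the sketch below. The variable names (`docs`, `joined`, `vectorizer`) are illustrative and not from the original post, and the default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Already-tokenized (and possibly stemmed) documents, as in the question.
docs = [['tom', 'want', 'apple'], ['tom', 'love', 'pear']]

# Join each token list back into a string so CountVectorizer can re-tokenize it.
joined = [' '.join(tokens) for tokens in docs]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(joined)  # scipy sparse matrix of term counts

# Terms ordered by their column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()  # numpy array, one row per document
```

The extra cost comes from building the joined strings and tokenizing them a second time, which is the step the question wants to avoid.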
Any suggestions? Thank you.
Use MultiLabelBinarizer from sklearn.preprocessing, which returns a binary document-term matrix.
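A minimal sketch of this suggestion, applied to the input from the question. The variable names (`docs`, `mlb`, `matrix`, `terms`) are illustrative, and the example assumes each term only needs a 0/1 presence indicator, since MultiLabelBinarizer does not count repeated terms within a document.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Already-tokenized (and possibly stemmed) documents, as in the question.
docs = [['tom', 'want', 'apple'], ['tom', 'love', 'pear']]

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(docs)  # numpy array, one row per document

# Column labels, in the same order as the matrix columns.
terms = list(mlb.classes_)

print(terms)
print(matrix)
```

For a large corpus, passing `sparse_output=True` to the constructor should return a scipy sparse matrix instead of a dense array, which may help with memory.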
Note: the columns (terms) are sorted in alphabetical order, so the column order differs from the example output in the question.