I am trying to convert a matrix to a type that gensim's AuthorTopicModel can accept, which means I need to convert the matrix to sparse vectors. I have already tried several gensim functions, such as gensim.matutils.full2sparse and gensim.matutils.any2sparse, but something goes wrong.
My code:
import numpy
from gensim.matutils import any2sparse
matrix = numpy.array([[1, 0, 1], [0, 1, 1]])
mycorpus = any2sparse(matrix)
print(matrix)
print(mycorpus)
The output:
[[1 0 1]
[0 1 1]]
[(0, 1.0), (0, 1.0), (1, 0.0), (1, 0.0)] #mycorpus
According to the tutorial, mycorpus should look like this:
[[(0, 1), (2, 1)],
 [(1, 1), (2, 1)]]
I have no idea what's wrong. I would really appreciate it if anyone could give me some advice.
The Gensim AuthorTopicModel docs describe its desired corpus format as an iterable of list of (int, float). Those int values would be word IDs, and would ideally be accompanied by the id2word dict, which identifies which int means which word.
What's the source of your matrix? Do you know whether it's the rows or the columns that represent words, and do you have a mapping of indexes to words? That will drive the conversion.
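As a rough illustration of that format, here is a minimal sketch that assumes each row of the matrix is one document and each column index is a word ID (if your matrix is the other way around, transpose it first). The helper name dense_rows_to_bow is made up for this example and is not part of Gensim. It also hints at why the output in the question looks odd: any2sparse and full2sparse appear to expect a single 1-D vector, not a whole 2-D matrix, so the conversion has to happen one document at a time.

```python
def dense_rows_to_bow(matrix):
    """Turn a dense document-term matrix into a list of sparse documents.

    Assumption: rows are documents and column indexes are word IDs.
    Works on nested lists or on a 2-D numpy array.
    """
    corpus = []
    for row in matrix:
        # Keep only the non-zero entries as (word_id, weight) pairs.
        doc = [(word_id, float(value))
               for word_id, value in enumerate(row)
               if value != 0]
        corpus.append(doc)
    return corpus


matrix = [[1, 0, 1],
          [0, 1, 1]]
corpus = dense_rows_to_bow(matrix)
print(corpus)
# [[(0, 1.0), (2, 1.0)], [(1, 1.0), (2, 1.0)]]
```

With Gensim installed, calling gensim.matutils.full2sparse on each row separately should give the same shape of result, and gensim.matutils.Dense2Corpus may also be worth a look, but check the docs for which axis it treats as documents.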
Also, as the docs mention, "The model is closely related to LdaModel. The AuthorTopicModel class inherits LdaModel, and its usage is thus similar." Have you reviewed guides to Gensim LDA usage, such as the multiple Usage Examples, to see how they prepare their corpus and whether that suggests the necessary steps and formats?
Or, is your corpus still available as texts, so you can directly use the examples there as a model to turn the text into the BoW format (rather than your already-processed matrix)?
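If you do still have the texts, the BoW step looks roughly like the sketch below. This is a plain-Python stand-in, with made-up example texts, that only illustrates the shape of the data. In the Gensim guides the same job is done by gensim.corpora.Dictionary and its doc2bow method, which may assign IDs in a different order.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for your real texts.
texts = [
    ["human", "computer", "interface"],
    ["computer", "survey", "computer"],
]

# Give every distinct token an integer ID, in order of first appearance.
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Reverse mapping, playing the role of the id2word dict.
id2word = {word_id: token for token, word_id in token2id.items()}

# Each document becomes a sorted list of (word_id, count) pairs.
corpus = [
    sorted((token2id[token], float(count))
           for token, count in Counter(doc).items())
    for doc in texts
]

print(corpus)
# [[(0, 1.0), (1, 1.0), (2, 1.0)], [(1, 2.0), (3, 1.0)]]
```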
If you're still having problems, you should expand your question text with more details, especially how the true corpus matrix that you have was created, and which errors you've encountered (& how you triggered them) that convince you things aren't working.