I'm wondering if, when I use CountVectorizer().fit_transform(), the output preserves the order of the input.
My input is a list of documents. I know that the output matches the input in terms of the length, but I'm not sure if they are ordered the same way.
I understand that I might not be explaining it very well, so here's an example.
Say I have:
from sklearn.feature_extraction.text import CountVectorizer
input = ["<text_1>", "<text_2>", "<text_3>"]
a = CountVectorizer().fit_transform(input)
Will the indexes correspond as though order is preserved?
For example, in:
(0, 33) 1
...
(0, 42) 8
...
(385, 58) 1
(385, 51) 6
Is (0, 33) 1 equivalent to input[0], or (385, 58) 1 to input[385]?
Yes, the row order is preserved. This must be true for all scikit-learn transformation methods, because a common workflow is to split your data into a feature matrix X and a target vector y, where each row of the matrix corresponds to one element of the vector. When you transform X, you must still be able to train the model on the transformed X paired with y, so the order must be preserved.
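If you want to convince yourself, here is a minimal sketch that checks it on a toy corpus. The three documents and the variable names (docs, vectorizer, X, vocab) are made up for illustration and are not from your data; the check assumes the default CountVectorizer settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: each document has a distinct mix of words,
# so a shuffled row order would be easy to spot.
docs = ["apple apple banana", "cherry", "banana cherry cherry"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index in X.
vocab = vectorizer.vocabulary_

# Row i of X should hold exactly the word counts of docs[i].
for i, doc in enumerate(docs):
    row = X[i].toarray().ravel()
    tokens = doc.split()
    for term in set(tokens):
        assert row[vocab[term]] == tokens.count(term)

print(X)  # entries are printed with their (row, column) index and count
```

In the printed sparse output, the first number of each pair is the row, which is the index of the document in the input list, and the second is the column, which is the index of the term in vocabulary_. So every entry starting with (0, ...) describes the first document, and so on.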