I have the following pandas structure:
col1  col2  col3  text
1     1     0     meaningful text
5     9     7     trees
7     8     2     text
I'd like to vectorise the text column using a TF-IDF vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). How can I add this information, with labels, to my original df? The target would look like:
col1  col2  col3  meaningful  text  trees
1     1     0     1           1     0
5     9     7     0           0     1
7     8     2     0           1     0
UPDATE:
The solution concatenates the data incorrectly, even after renaming the original columns.
Dropping columns with at least one NaN leaves only 7 rows, even though I call fillna(0) before starting to work with the data.
You can proceed as follows:
Load data into a dataframe:
Output:
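A minimal sketch of this step, assuming the sample data from the question. The variable name `df` is an assumption, and the printed frame is shown in the trailing comments:

```python
import pandas as pd

# Sample data copied from the question; `df` is an assumed variable name.
df = pd.DataFrame(
    {
        "col1": [1, 5, 7],
        "col2": [1, 9, 8],
        "col3": [0, 7, 2],
        "text": ["meaningful text", "trees", "text"],
    }
)
print(df)
#    col1  col2  col3             text
# 0     1     1     0  meaningful text
# 1     5     9     7            trees
# 2     7     8     2             text
```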
Tokenize the text column using sklearn.feature_extraction.text.TfidfVectorizer:
Convert the tokenized data into a dataframe:
Output:
Concatenate the tokenization dataframe to the original one:
Output:
If you want to drop the column text, you need to do that before the concatenation:
Output:
Here's the full code: