Having a corpus like this:
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?'
I'm using the vocabulary ["this", "document", "this document"]. After fitting the vectorizer, I get this result:
[[1 1 0]
[1 2 1]
[1 0 0]
[1 1 0]]
which is correct. Is there a way to use a regex (or something else) so that the "this document" feature is also counted in the first row of my corpus? More specifically, I want [1 1 1] rather than [1 1 0].
My row is this: ["This is the first document"]. Can I somehow "remove" the words "is the first" (or whatever words come in between) to get the "this document" feature? Maybe with token_pattern?
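For reference, here is a minimal sketch of a setup that reproduces the matrix above. The post does not show the vectorizer call, so the `ngram_range=(1, 2)` setting is an assumption: it is needed for the two-word entry "this document" to be counted at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Fixed vocabulary from the question; column order follows this list.
vocabulary = ["this", "document", "this document"]

# Assumption: bigrams are enabled so that "this document" can match.
vectorizer = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(X.toarray())
```

With contiguous bigrams, "this document" only matches when the two words are adjacent, which is why the first row stays at [1 1 0].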
Just figured it out. What I actually wanted was to create features from all word combinations in my corpus (unigrams and bigrams). For example, my row: This is the first document. Extracted features:
I did this by writing my own tokenizer and passing it to the tokenizer parameter of my CountVectorizer().
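The tokenizer itself is not shown in the post, so the following is only a sketch of what such a tokenizer might look like. The name `combo_tokenizer` is hypothetical, and it rests on the assumption that "all word combinations" means every single word plus every pair of words in their original order, adjacent or not.

```python
import re
from itertools import combinations


def combo_tokenizer(text):
    """Return unigrams plus every in-order pair of words (hypothetical helper)."""
    # Same word definition as CountVectorizer's default token_pattern:
    # runs of two or more word characters.
    words = re.findall(r"\b\w\w+\b", text.lower())
    # combinations() keeps the original word order, so "this ... document"
    # yields the pair "this document" even when other words sit in between.
    pairs = [" ".join(pair) for pair in combinations(words, 2)]
    return words + pairs


tokens = combo_tokenizer("This is the first document.")
print(tokens)
```

Passed as `CountVectorizer(tokenizer=combo_tokenizer, vocabulary=["this", "document", "this document"])`, this should turn the first row into [1 1 1]. Note that the number of pairs grows quadratically with document length, so this is only practical for short texts.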