scikit-learn CountVectorizer vocabulary with regex


I have a corpus like this:

'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?'

I'm using the vocabulary ["this", "document", "this document"]. After fitting the vectorizer, I get this result:

[[1 1 0]
[1 2 1]
[1 0 0]
[1 1 0]]

which is correct. Is there a way to use a regex (or something else) so that the "this document" feature is also counted for the first row of my corpus? More specifically, I want [1 1 1] rather than [1 1 0].

My row is this: ["This is the first document"]. Can I somehow "remove" the words "is the first" (or whatever words sit in between) to get the "this document" feature? Maybe with token_pattern?
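For reference, a minimal sketch that reproduces the matrix above. The question does not show the vectorizer call, so `ngram_range=(1, 2)` is an assumption here; without it the two-word vocabulary entry could never match.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vocabulary = ["this", "document", "this document"]

# Assumption: ngram_range=(1, 2) so that adjacent word pairs such as
# "this document" are generated alongside the single words.
vectorizer = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

# One row per document, one column per vocabulary entry.
print(X.toarray())
```

With the default analyzer, "this document" is only counted when the two words are adjacent, which is why the first row has a 0 in the last column.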

Answer by Paris Karipidis

Just figured it out. What I actually wanted was to create features from all word combinations in each row of my corpus (unigrams plus word pairs, where the two words need not be adjacent). For example, for my row "This is the first document." the extracted features are:

this, 
is, 
the, 
first, 
document, 
this is, 
this the, 
this first, 
this document, 
is the, 
is first, 
is document, 
the first, 
the document, 
first document

I did this by writing my own tokenizer and passing it as the tokenizer parameter of CountVectorizer().
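The tokenizer itself isn't shown here, so the following is only a rough sketch of the idea, not the exact code. The name `pair_tokenizer` and the word regex (borrowed from CountVectorizer's default token pattern) are assumptions made for illustration.

```python
import re
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer

# Same word pattern as CountVectorizer's default token_pattern
# (words of two or more characters).
WORD_RE = re.compile(r"(?u)\b\w\w+\b")


def pair_tokenizer(text):
    """Return the single words plus every ordered pair of words.

    Pairs keep document order and do not have to be adjacent, so
    "this is the first document" yields "this document" as a feature.
    """
    words = WORD_RE.findall(text)
    pairs = [f"{a} {b}" for a, b in combinations(words, 2)]
    return words + pairs


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# token_pattern=None avoids the warning that the pattern is ignored
# when a custom tokenizer is supplied. Lowercasing still happens
# before the tokenizer is called.
vectorizer = CountVectorizer(
    tokenizer=pair_tokenizer,
    token_pattern=None,
    vocabulary=["this", "document", "this document"],
)
X = vectorizer.fit_transform(corpus)
print(X.toarray())
```

With this tokenizer the first row becomes [1 1 1]. Note that a pair is counted once per combination, so the second row counts "this document" twice (once for each occurrence of "document"). Passing `binary=True` to CountVectorizer would be one way to cap counts at 1 if only presence matters.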