Logistic Regression Bigram Text Classification w/ Patsy

I'm working on upgrading a LogisticRegression text classifier from single-word features to bigrams (two-word features). However, when I include the two-word feature in the formula passed to patsy.dmatrices, I receive the following error:

y, X = dmatrices("is_host ~ dedicated + hosting + dedicated hosting", df, return_type="dataframe")

  File "<string>", line 1
    dedicated hosting
                ^
SyntaxError: unexpected EOF while parsing

I've looked around online for examples of how to approach this and haven't found anything. I tried a few different syntax options in the formula, and none of them work.

"is_host ~ dedicated + hosting + {dedicated hosting}"
"is_host ~ dedicated + hosting + (dedicated hosting)"
"is_host ~ dedicated + hosting + [dedicated hosting]"

What is the proper way to include multi-word features in the formula passed to dmatrices?

There are 1 best solutions below

You want:

y, X = dmatrices("is_host ~ dedicated + hosting + Q('dedicated hosting')", df, return_type="dataframe")

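Q is short for "quote". Patsy evaluates each term in the formula as Python code, so a bare name containing a space is a syntax error. Q('...') looks up the column by its exact name instead, which lets you reference columns whose names are not valid Python identifiers.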
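For reference, here is a minimal, self-contained sketch of the Q() approach. The DataFrame contents are made up for illustration (the question does not show the real data); only the column names and the formula come from the question. The renaming step at the end is an alternative I'm assuming would suit this case, not something the original code does.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data: one row per document, one column per n-gram count.
# The values are invented; only the column names come from the question.
df = pd.DataFrame({
    "is_host": [1, 0, 1, 0],
    "dedicated": [1, 0, 2, 1],
    "hosting": [1, 1, 2, 0],
    "dedicated hosting": [1, 0, 2, 0],  # bigram column; its name contains a space
})

# Q('...') quotes the column name so Patsy does not try to parse it as Python.
y, X = dmatrices(
    "is_host ~ dedicated + hosting + Q('dedicated hosting')",
    df,
    return_type="dataframe",
)
print(list(X.columns))

# Alternative (an assumption, not part of the original answer): rename the
# columns to valid identifiers so that no quoting is needed in the formula.
df_renamed = df.rename(columns=lambda name: name.replace(" ", "_"))
y2, X2 = dmatrices(
    "is_host ~ dedicated + hosting + dedicated_hosting",
    df_renamed,
    return_type="dataframe",
)
print(list(X2.columns))
```

With many bigrams, renaming the columns once may be easier than wrapping every term in Q(), but either way the resulting design matrix holds the same values.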