What R package is suited to identifying words that are positively correlated with a binary response variable?


I have a tibble that has three columns:

  1. wine - Name of the wine
  2. wine_description - Words describing wine (punctuation has been stripped out)
  3. target - Binary variable: 1 = top-rated wine, 0 = not top-rated wine

What R package might I use if I were interested in identifying words that tend to be present with top-rated wine (the target variable = 1)?

I came across Text Mining with R, but it appears to focus on sentiment analysis, which seems close to what I'm trying to achieve but perhaps a bit off the mark. Any suggestions would be welcome.

I am working under the assumption that once I've completed some basic analysis I will be able to incorporate that into a logistic regression.


Answer from PRZ:

A minimal working example would be nice. As far as I can see, all you need is a package that turns your data into a document-feature matrix (dfm), using your wine_description variable as the text field. I like quanteda for doing that.

Logistic regression with the dfm as predictors would then be one way to identify which words are used to describe top-rated wines.
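As a rough sketch of that idea (not necessarily how the answerer would do it), the code below builds a dfm with quanteda and fits a lasso-penalized logistic regression with glmnet. The toy data, the word lists, the `min_docfreq` threshold, and the fixed penalty `s = 0.05` are all assumptions made up for illustration; in practice the penalty would usually be chosen by cross-validation, for example with `cv.glmnet()`.

```r
library(quanteda)
library(glmnet)

# Hypothetical toy data standing in for the real tibble.
set.seed(1)
good_words <- c("elegant", "complex", "balanced")
bad_words  <- c("thin", "flat", "harsh")
neutral    <- c("cherry", "oak", "finish", "fruit")

make_desc <- function(top) {
  paste(c(sample(if (top) good_words else bad_words, 2),
          sample(neutral, 2)), collapse = " ")
}

target <- rep(c(1, 0), each = 20)
wines <- data.frame(
  wine = paste0("wine_", seq_along(target)),
  wine_description = vapply(target == 1, make_desc, character(1)),
  target = target,
  stringsAsFactors = FALSE
)

# Build the document-feature matrix from the description column.
corp  <- corpus(wines, text_field = "wine_description")
toks  <- tokens(corp, remove_punct = TRUE)
toks  <- tokens_remove(toks, pattern = stopwords("en"))
dfmat <- dfm(toks)
dfmat <- dfm_trim(dfmat, min_docfreq = 2)  # drop very rare words (assumed cutoff)

# Lasso logistic regression: one coefficient per word.
fit <- glmnet(x = dfmat, y = docvars(dfmat, "target"),
              family = "binomial", alpha = 1)

# Coefficients at an arbitrary, assumed penalty value.
coefs <- coef(fit, s = 0.05)
word_coefs <- data.frame(word = rownames(coefs),
                         estimate = as.numeric(coefs[, 1]))

# Words with positive coefficients are associated with target = 1.
positive <- word_coefs[word_coefs$word != "(Intercept)" &
                         word_coefs$estimate > 0, ]
positive <- positive[order(-positive$estimate), ]
print(positive)
```

The lasso penalty shrinks most word coefficients to exactly zero, so the words that remain with positive estimates are the candidates for "tends to appear with top-rated wines".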

Answer from Julia Silge:

You can use the tidymodels framework for this kind of modeling, with the textrecipes package handling the text preprocessing. You'll end up with a modeling workflow that looks something like this:

## ══ Workflow ══════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Check out this recent tutorial for more details.
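For readers who want to see how a workflow like the one printed above might be assembled, here is a minimal sketch. It is a reconstruction from the printed step names, not the code from the linked tutorial; the toy data, `max_tokens = 500`, the number of folds, and the penalty grid are all assumptions chosen for illustration.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data standing in for the real tibble.
set.seed(1)
good_words <- c("elegant", "complex", "balanced")
bad_words  <- c("thin", "flat", "harsh")
neutral    <- c("cherry", "oak", "finish", "fruit")
make_desc <- function(top) {
  paste(c(sample(if (top) good_words else bad_words, 2),
          sample(neutral, 2)), collapse = " ")
}
y <- rep(c(1, 0), each = 20)
wines <- tibble(
  wine = paste0("wine_", seq_along(y)),
  wine_description = vapply(y == 1, make_desc, character(1)),
  # The outcome must be a factor; with levels 0 then 1, glmnet models "1".
  target = factor(y, levels = c(0, 1))
)

# The five recipe steps listed in the printed workflow.
wine_rec <- recipe(target ~ wine_description, data = wines) %>%
  step_tokenize(wine_description) %>%
  step_stopwords(wine_description) %>%
  step_tokenfilter(wine_description, max_tokens = 500) %>%
  step_tfidf(wine_description) %>%
  step_normalize(all_predictors())

# Lasso logistic regression (mixture = 1) with a tunable penalty.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wine_wf <- workflow() %>%
  add_recipe(wine_rec) %>%
  add_model(lasso_spec)

# Tune the penalty with cross-validation, then refit on all the data.
folds <- vfold_cv(wines, v = 5, strata = target)
tuned <- tune_grid(wine_wf, resamples = folds,
                   grid = grid_regular(penalty(), levels = 20),
                   metrics = metric_set(roc_auc))
best <- select_best(tuned, metric = "roc_auc")
final_fit <- finalize_workflow(wine_wf, best) %>% fit(data = wines)

# Terms with positive estimates push predictions toward target = 1.
positive_terms <- final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate > 0) %>%
  arrange(desc(estimate))
print(positive_terms)
```

Each term name carries a prefix such as `tfidf_wine_description_` followed by the word, so the word itself can be recovered by stripping that prefix.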