What R package is suited to identifying words that are positively correlated with a binary response variable?

Asked by Mutuelinvestor on 10 May 2020 at 14:27

I have a tibble with three columns:

wine - Name of the wine
wine_description - Words describing wine (punctuation has been stripped out)
target - Binary variable (1 = top-rated wine, 0 = not top-rated wine)

What R package might I use to identify words that tend to appear in descriptions of top-rated wines (target = 1)?

I came across Text Mining with R, but it appears to focus on sentiment analysis, which is close to what I'm trying to achieve but perhaps a bit off the mark. Any suggestions would be welcome.

I am working under the assumption that once I've completed some basic analysis I will be able to incorporate that into a logistic regression.


There are 2 best solutions below

PRZ On 10 May 2020 at 15:09

A minimal working example would be nice. As far as I can see, all you need is a package that turns your data into a document-feature matrix (dfm), using your wine_description variable as the text field. I like quanteda for that.

Logistic regression with the dfm as predictors would then be one way to identify which words are used to describe top-rated wines.
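As a rough illustration of that idea, here is a minimal sketch that builds a dfm with quanteda and fits a lasso-penalised logistic regression with glmnet. The toy data, the penalty value `s = 0.05`, and the `min_docfreq = 2` threshold are all assumptions made up for the example, and glmnet is only one possible way to fit the regression; the answer above does not prescribe it.

```r
library(quanteda)
library(glmnet)

# Hypothetical toy data with the three columns described in the question
wines <- data.frame(
  wine = paste0("wine_", 1:8),
  wine_description = c(
    "rich dark cherry with elegant oak",
    "elegant structure rich plum finish",
    "thin sour watery finish",
    "rich elegant blackberry and spice",
    "flat watery simple fruit",
    "sour thin and simple",
    "elegant rich long finish",
    "simple flat watery"
  ),
  target = c(1, 1, 0, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Build the document-feature matrix from the description column
corp <- corpus(wines, text_field = "wine_description")
toks <- tokens(corp)
toks <- tokens_remove(toks, stopwords("en"))
wine_dfm <- dfm(toks)

# Drop words that occur in fewer than two descriptions (arbitrary threshold)
wine_dfm <- dfm_trim(wine_dfm, min_docfreq = 2)

# Lasso logistic regression: one predictor per word
x <- as.matrix(wine_dfm)
fit <- glmnet(x, wines$target, family = "binomial", alpha = 1)

# Coefficients at an arbitrary penalty; positive values point to words
# associated with target = 1
coefs <- as.matrix(coef(fit, s = 0.05))[, 1]
coefs <- coefs[names(coefs) != "(Intercept)"]
positive_words <- sort(coefs[coefs > 0], decreasing = TRUE)
print(positive_words)
```

On real data the penalty would normally be chosen by cross-validation (for example with `cv.glmnet()`) rather than fixed by hand, and with only a handful of rows glmnet is likely to warn about small class sizes.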

Julia Silge On 11 May 2020 at 15:55

You can use the tidymodels framework for this kind of modeling, using the textrecipes package for data preprocessing. You'll end up with modeling that looks something like this.

## ══ Workflow ══════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Check out this recent tutorial for more details.
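A workflow that prints along the lines of the output above could be assembled roughly as follows. This is a sketch, not the code from the tutorial: the `wines` tibble, the `max_tokens = 500` cap, and the object names are assumptions chosen for illustration, and only the step and model names are taken from the printed output.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical data in the shape described in the question;
# classification models in tidymodels expect a factor outcome
wines <- tibble(
  wine = c("wine_1", "wine_2", "wine_3", "wine_4"),
  wine_description = c(
    "rich dark cherry with elegant oak",
    "thin sour watery finish",
    "elegant structure rich plum finish",
    "flat watery simple fruit"
  ),
  target = factor(c(1, 0, 1, 0))
)

# Preprocessing: tokenize, drop stop words, keep the most frequent tokens,
# weight by tf-idf, then centre and scale the resulting columns
wine_rec <- recipe(target ~ wine_description, data = wines) %>%
  step_tokenize(wine_description) %>%
  step_stopwords(wine_description) %>%
  step_tokenfilter(wine_description, max_tokens = 500) %>%  # arbitrary cap
  step_tfidf(wine_description) %>%
  step_normalize(all_predictors())

# Lasso logistic regression (mixture = 1) with the penalty left to be tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wine_wf <- workflow() %>%
  add_recipe(wine_rec) %>%
  add_model(lasso_spec)

wine_wf
```

From here the penalty would typically be tuned on resamples with `tune_grid()`, the workflow finalized and fitted, and the coefficients inspected with `tidy()`; words with positive estimates are the ones associated with top-rated wines.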
