I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-square test.
I know many people have already asked this question, but I couldn't find the relevant code. (The answers only gave a brief outline of the concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)
I learned that I could use chi.squared in the FSelector package, but I don't know how to apply this function to a dfm class object (trainingtfidf below). (According to the manual, it is applied to the predictor variables.)
Could anyone give me a hint? I appreciate it!
Example code:
description <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)
code <- c(4, 3, 6)
example <- data.frame(description, code)
library(quanteda)
trainingcorpus <- corpus(example$description)
trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem = TRUE, toLower = TRUE,
                   removePunct = TRUE, removeSeparators = TRUE, language = "english",
                   ignoredFeatures = stopwords("english"), removeNumbers = TRUE,
                   ngrams = 2)
# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize = TRUE)
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.
Note that I am showing how to do this in the quanteda package, but the results should be general enough to work for other text package matrix objects. Here, I am using the data from the auxiliary quantedaData package that has all of the State of the Union addresses of US presidents.
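As a minimal base R sketch of the idea (not the quanteda code on the State of the Union data): for each feature, cross-tabulate its presence or absence against a binary class variable and take the statistic from chisq.test(). The small count matrix `counts` and the class vector `cls` are made-up illustrations, and `feature_chi2` is a hypothetical helper name.

```r
# Toy document-feature counts: 6 documents x 3 features (made-up data).
counts <- matrix(
  c(2, 0, 1,
    3, 0, 0,
    1, 1, 2,
    0, 2, 1,
    0, 3, 0,
    0, 1, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("alpha", "beta", "gamma"))
)

# Hypothetical class label for each document.
cls <- factor(c("a", "a", "a", "b", "b", "b"))

# Chi-squared statistic for one feature: presence/absence vs. class.
feature_chi2 <- function(x, cls) {
  present <- factor(x > 0, levels = c(FALSE, TRUE))
  tab <- table(present, cls)
  # correct = FALSE turns off Yates' continuity correction;
  # suppressWarnings() hides the small-expected-count warning on toy data.
  unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
}

# One score per column (feature), named by feature.
chi2 <- apply(counts, 2, feature_chi2, cls = cls)
print(sort(chi2, decreasing = TRUE))
```

A feature that occurs in every document (or in none) gives a table with an empty row, so its statistic is undefined; such features would need to be dropped before scoring.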
These can now be selected using the dfm_select() command. (Note that column indexing by name would also work.)

Added: With >= v0.9.9 this can be done using the textstat_keyness() function.

This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.
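The selection step described above can be sketched in base R as follows. The signed scores in `chi2_signed` are made-up numbers standing in for real keyness output, and `top_n` is an arbitrary cutoff chosen only for illustration.

```r
# Made-up signed chi-squared scores, one per feature (illustrative only).
chi2_signed <- c(tax = 12.4, war = -30.1, the = 0.2, peace = -8.7, jobs = 15.9)

# Drop the sign: the magnitude measures how discriminating a feature is,
# while the sign only says which class it is associated with.
chi2_abs <- abs(chi2_signed)

# Keep the top_n features with the largest absolute score.
top_n <- 3
selected <- names(sort(chi2_abs, decreasing = TRUE))[seq_len(top_n)]
print(selected)

# With a document-feature matrix `m` whose columns are named by feature,
# the selected columns could then be kept with m[, selected].
```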