Chi-squared feature selection using FSelector in R


I am a beginner in R and I have a data frame of binary values. The first 6000 columns are the attributes I want to select features from, and the last 10 columns (also binary) are the class labels I need to train on. I have learned that I can use the FSelector package to calculate the chi-squared value for each attribute, then rank the attributes and select my features. I found this example in the FSelector package:

# Use the HouseVotes84 data from the mlbench package
library(mlbench)    # for the data
library(FSelector)  # for the method
data(HouseVotes84)

# Calculate the chi-squared statistics
weights <- chi.squared(Class ~ ., HouseVotes84)

# Print the results
print(weights)

# Select the top five variables
subset <- cutoff.k(weights, 5)

# Print the final formula that can be used in classification
f <- as.simple.formula(subset, "Class")
print(f)

But when I run the same code on my own data, R reports that it cannot find the object Class at the line weights <- chi.squared(Class ~ ., HouseVotes84). The FSelector documentation says a formula should go there, but I don't know what kind of formula. Should I write the mathematical formula of the chi-squared test there? If so, what is the point of the package compared with using a for loop to calculate the X^2 statistic?

I am not going to use other packages like quanteda because I actually want to avoid typing in the whole formula of chi-square for feature selection. Do you have any suggestions for how to fix that line of code based on the structure of my data?

UPDATE: These are the first three rows of my data, showing 11 of the 6000 term columns. The last 10 columns are my classes.

   structure(list(rigid = c(0, 0, 0), sobaaox = c(0, 0, 0), intermittententsharpleft = c(0, 
0, 0), pnuemondayia = c(0, 0, 0), medport = c(0, 0, 0), assharp = c(0, 
0, 0), ambult = c(0, 0, 0), cmpliant = c(0, 0, 0), anlk = c(0, 
0, 0), scoliosi = c(0, 0, 0), espec = c(0, 0, 0), `290` = c(0L, 
0L, 0L), `320` = c(0L, 0L, 0L), `390` = c(1L, 0L, 0L), `460` = c(0L, 
0L, 0L), `520` = c(0L, 1L, 0L), `580` = c(0L, 0L, 0L), `710` = c(0L, 
0L, 0L), `780` = c(0L, 0L, 1L), `800` = c(0L, 0L, 0L), `100001` = c(0L, 
0L, 0L)), .Names = c("rigid", "sobaaox", "intermittententsharpleft", 
"pnuemondayia", "medport", "assharp", "ambult", "cmpliant", "anlk", 
"scoliosi", "espec", "290", "320", "390", "460", "520", "580", 
"710", "780", "800", "100001"), row.names = c(NA, 3L), class = "data.frame")
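
For what it's worth, the formula argument is an R model formula, not the chi-squared equation: in Class ~ . the left side names the response column and the dot stands for every other column of the data frame. The error most likely arises because this data has no column called Class. Below is a minimal sketch of one way to adapt the example, assuming the class columns are known by name and are scored one at a time. The toy data frame, class_cols, and target are hypothetical stand-ins, and the FSelector call is guarded so the base R part runs even if the package is not installed.

```r
# Toy stand-in for the real data: two term columns and two class columns.
# check.names = FALSE keeps numeric column names such as "390" unchanged.
toy <- data.frame(
  rigid   = c(0, 1, 0, 1, 1, 0, 1, 0),
  medport = c(1, 1, 0, 0, 1, 0, 0, 1),
  `390`   = c(0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
  `460`   = c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L),
  check.names = FALSE
)

class_cols <- c("390", "460")                  # assumed names of the class columns
term_cols  <- setdiff(names(toy), class_cols)  # every other column is a term

# Work on ONE class at a time: keep all terms plus that class, and rename
# the class column to "Class" so the formula Class ~ . applies unchanged.
target <- "390"
d <- toy[, c(term_cols, target)]
names(d)[names(d) == target] <- "Class"
d[] <- lapply(d, factor)  # treat 0/1 as categories, as in HouseVotes84

# Base R check of what the formula means: "." expands to the non-response columns.
f <- Class ~ .
predictors <- attr(terms(f, data = d), "term.labels")
print(predictors)

# Same calls as the package example, run only if FSelector is available.
if (requireNamespace("FSelector", quietly = TRUE)) {
  weights <- FSelector::chi.squared(f, d)
  top <- FSelector::cutoff.k(weights, 1)
  print(FSelector::as.simple.formula(top, "Class"))
}
```

Dropping the other class columns before calling chi.squared matters here, because with Class ~ . any class column left in the data frame would be ranked as if it were a term. Repeating the block for each entry of class_cols would give one ranking per class.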