R code to calculate the entropy for each level of a categorical variable

535 Views Asked by highclef At 12 December 2022 at 17:38

I have quite a few categorical variables in my dataset, each with more than two levels. I want an R function (or loop) that calculates the entropy and information gain for each level of each categorical variable and returns the lowest entropy and the highest information gain.

data <- list(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium"),
  age = c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44)
)
data <- as.data.frame(data)

Above is a sample data frame.

entropy_tab <- function(x) {
  tabfun2 <- prop.table(table(data[,x], training_credit_Risk[,13]) + 1e-6, margin = 1)
  sum(prop.table(table(data[,x])) * rowSums(-tabfun2 * log2(tabfun2)))
}

The function above calculates the entropy for each variable. I want a function that calculates each level's contribution to that entropy, i.e. the contribution of "excellent" and "fair" to the entropy of "credit".


There are 2 best solutions below

Mike On 12 December 2022 at 18:52

You have to modify your function to take two inputs: the variable you want and the level of that variable. Inside the function, subset the data on that level. I then use mapply to loop over the variable credit and each of its levels.

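entropy_tab <- function(x, y) {
  # Keep only the rows where variable x takes the level y.
  keep <- data[, x] == y
  # Column 5 is used as the outcome here (age in the sample data);
  # replace it with your class column if that is what you want.
  tabfun2 <- prop.table(table(data[keep, x], data[keep, 5]) + 1e-6, margin = 1)
  sum(prop.table(table(data[keep, x])) * rowSums(-tabfun2 * log2(tabfun2)))
}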


x <- mapply(entropy_tab, c("credit","credit"), unique(data$credit))

names(x) <- unique(data$credit)

#checks
entropy_tab("credit","excellent")
entropy_tab("credit","fair")
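If the outcome is meant to be buys rather than column 5, the same subsetting idea can be sketched with split(). This is only a sketch under that assumption; entropy_bits and level_entropy are names made up for illustration, and only the two columns needed are copied from the question.

```r
# Two columns from the sample data in the question.
buys <- c("no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no")
credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
            "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")

# Shannon entropy (in bits) of a vector of labels; hypothetical helper.
entropy_bits <- function(v) {
  p <- prop.table(table(v))
  p <- p[p > 0]               # drop empty levels so 0 * log2(0) never appears
  -sum(p * log2(p))
}

# Entropy of the outcome within each level of a predictor, the share of rows
# in that level, and their product (the level's weighted contribution).
level_entropy <- function(x, target) {
  h <- sapply(split(target, x), entropy_bits)
  w <- as.vector(prop.table(table(x))[names(h)])
  data.frame(level = names(h), weight = w, entropy = unname(h),
             contribution = w * unname(h))
}

res <- level_entropy(credit, buys)
res

# Information gain = entropy of the outcome minus the summed contributions.
info_gain <- entropy_bits(buys) - sum(res$contribution)
info_gain
```

With buys as the outcome, the "excellent" level splits 3/3 and so has an entropy of 1 bit, and the information gain of credit comes out at roughly 0.048 bits. Applying level_entropy to each predictor in turn and comparing the results would give the lowest-entropy level and the highest-gain variable.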

James_D On 12 December 2022 at 18:53

In measure-theoretic terms, the expected surprisal of an event A under a probability measure mu is

-mu(A)log(mu(A))

The entropy is then the sum of the expected surprisal over all events (here, the levels of a variable). So what you're looking for is the expected surprisal of each level of each variable.
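As a quick worked check of the formula, take the credit column from the question, which has 6 "excellent" and 8 "fair" values out of 14 rows, and use the natural logarithm:

```r
# Level counts of `credit` in the sample data: 6 "excellent", 8 "fair".
counts <- c(excellent = 6, fair = 8)
p <- counts / sum(counts)      # mu(A) for each level

contrib <- -p * log(p)         # expected surprisal of each level, in nats
contrib
sum(contrib)                   # entropy of `credit`, in nats
```

The two terms come to about 0.363 and 0.320, which sum to about 0.683 nats.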

Note that the per-level surprisals of a data frame cannot themselves be returned as a data frame with one column per variable, because the variables have different numbers of levels.

You can do

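exp_surprisal <- function(x, base = exp(1)) {
  # Relative frequency of each level of x.
  counts <- table(x)
  freq <- counts / sum(counts)
  # Treat 0 * log(0) as 0 for levels that never occur.
  ifelse(freq == 0, 0, -freq * log(freq, base))
}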

And then

lapply(data, exp_surprisal)

gives

$buys
x
       no       yes 
0.3677212 0.2840353 

$credit
x
excellent      fair 
0.3631277 0.3197805 

$student
x
       no       yes 
0.3465736 0.3465736 

$income
x
     high       low    medium 
0.3579323 0.3579323 0.3631277 

$age
x
       23        25        26        27        29        33        35        36        37        41        42        44        45        48 
0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041
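The list output can still be stacked into a long format with one row per (variable, level) pair, which makes it easier to search for the smallest or largest contribution. A minimal sketch, repeating exp_surprisal so the snippet runs on its own and using a shortened copy of the data; surprisal_long is a name made up for illustration.

```r
# Two columns from the sample data are enough to show the shape of the result.
data <- data.frame(
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
             "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  income = c("high", "high", "high", "medium", "low", "low", "low",
             "medium", "low", "medium", "medium", "medium", "high", "medium")
)

exp_surprisal <- function(x, base = exp(1)) {
  freq <- prop.table(table(x))
  ifelse(freq == 0, 0, -freq * log(freq, base))
}

# One row per (variable, level) pair.
surprisal_long <- do.call(rbind, lapply(names(data), function(v) {
  s <- exp_surprisal(data[[v]])
  data.frame(variable = v, level = names(s), surprisal = as.vector(s))
}))
surprisal_long

# Level with the smallest contribution overall.
surprisal_long[which.min(surprisal_long$surprisal), ]
```

On these two columns the smallest contribution belongs to the "fair" level of credit.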

Note you can also define

entropy <- function(x) sum(exp_surprisal(x))

to get the entropy.

Then

lapply(data, entropy)

gives

$buys
[1] 0.6517566

$credit
[1] 0.6829081

$student
[1] 0.6931472

$income
[1] 1.078992

$age
[1] 2.639057
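These values are in nats because the default base is e. If bits are wanted instead, one option is a variant of entropy() that forwards the base argument. This is a sketch of that variation, with exp_surprisal repeated so it runs on its own:

```r
exp_surprisal <- function(x, base = exp(1)) {
  freq <- prop.table(table(x))
  ifelse(freq == 0, 0, -freq * log(freq, base))
}

# Variant of entropy() that forwards `base`, so base = 2 gives bits.
entropy <- function(x, base = exp(1)) sum(exp_surprisal(x, base))

student <- c("no", "no", "no", "no", "yes", "yes", "yes",
             "no", "yes", "yes", "yes", "no", "yes", "no")

entropy(student)            # nats
entropy(student, base = 2)  # bits
```

An even split such as student gives 1 bit, which is the same quantity as log(2), about 0.693 nats.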
