How can I implement a R code for AUC in genetic algorithm

210 Views Asked by At

I am doing research on applying a genetic algorithm to binary logistic regression. I have a few questions to be clarified. Can you please help me?

  1. Can I use AIC or BIC as the fitness function in the GA? (I used them and results show that GA is more accurate than the traditional binary logistic model. However, I found that in most of the papers, they used AUC as the fitness function)

  2. I tried GA using AUC as the fitness function according to this paper (http://atm.amegroups.com/article/view/18292/html) and it gives following error. Can you create a small reproducible example to overcome this problem?

Error in model.frame.default(formula = as.numeric(tey) ~ predict.glm(trm, : variable lengths differ (found for 'predict.glm(trm, newdata = ted, type = "response")')

5

There are 5 best solutions below

6
On

In the galgo package, the cost function can be custom-defined. Can you run the program as described in the paper? For example, you can define AUC as your goal; and you use neural network for the prediction the following code can help:

reg.fitness <- function(chr, parent,tr,te,res) {
  try <- as.factor(parent$data$classes[tr])
  trd <-
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm <- nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,
                    size = 5)
  tey <- as.factor(parent$data$classes[te])
  ted <-
    data.frame(parent$data$data[te,as.numeric(chr)])
  pred

 <- predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,
        levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}

you can adapt the model by modify this chunk: trm <- nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F, size = 5)

4
On

@Z. Zhang

reg.fitness = function(chr, parent,tr,te,res) {
  try=as.factor(parent$data$classes[tr])
  trd =
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,
                    size = 3)
  tey = as.factor(parent$data$classes[te])
  ted =
    data.frame(parent$data$data[te,as.numeric(chr)])
 pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}




reg.bb = configBB.VarSel(data=t(data_set[,-ncol(data_set)]), 
                          classes=data_set$y, 
                          classification.method="user", 
                          classification.userFitnessFunc=reg.fitness, 
                          chromosomeSize=3, niches=1, maxSolutions=10,
                          goalFitness = 0.9, saveVariable="reg.bb",
                          saveFrequency=50, saveFile="reg.bb.Rdata", 
                          main="Logistic")
blast(reg.bb)

This is the output that I got for first 4 iterations

[Bb] Starting, Solutions=10
[Bb]    #bb Sol Last    Fitness %Fit    Gen Time    Elapsed Total   Remaining

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 9s    *** 11  1   111.11% FINISH: 1 2 1 

[Bb]    1   1   Sol Ok  1   111.11% 11  9.33s   9s  10s 42s (0h 0m 42s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 1 1 3 

[Bb]    2   2   Sol Ok  1   111.11% 11  10.35s  20s 22s 50s (0h 0m 50s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 3 1 1 

[Bb]    3   3   Sol Ok  1   111.11% 11  9.93s   30s 34s 50s (0h 0m 50s )

[e] Starting: Fitness Goal=0.9, Generations=(10 : 200)
[e] Elapsed Time    Generation  Fitness %Fit    [Next Generations]
[e] 0h 0m 0s    (m) 0   1   111.11% +GGGGGGGGGG
[e] 0h 0m 10s   *** 11  1   111.11% FINISH: 1 2 2 

[Bb]    4   4   Sol Ok  1   111.11% 11  10s 40s 45s 45s (0h 0m 45)

Plot for all 1000 iterations where it gave same fitness function value as 1

0
On

@ Z. Zhang, here is a reproducible example of my code.

library(pROC)
library(galgo) 
library(rtkore)
library(Rcpp)
library(aod)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
attach(mydata)


reg.fitness = function(chr, parent,tr,te,res) {
  try=as.factor(parent$data$classes[tr])
  trd =
    data.frame(parent$data$data[tr,as.numeric(chr)])
  trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,size = 2)
  tey = as.factor(parent$data$classes[te])
  ted =
    data.frame(parent$data$data[te,as.numeric(chr)])
 pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw")
  if(res){
    roc(tey,pred,levels=levels(tey),
        direction = "<")$auc
  }
  else{
    predict(trm,newdata=cbind(ted,tey=tey),type="class")
  }

}




reg.bb = configBB.VarSel(data=t(mydata[,-ncol(mydata)]), 
                          classes=admit ,
                          classification.method="user", 
                          classification.userFitnessFunc=reg.fitness, 
                          chromosomeSize=2 ,niches=1, maxSolutions=10,
                          goalFitness = 0.9, saveVariable="reg.bb",
                          saveFrequency=50, saveFile="reg.bb.Rdata", 
                          main="Logistic")
blast(reg.bb)

Plot(blast(reg.bb))
3
On

The problem is that you use the admin to predict admin, which will of course get a AUC of 1. here is the revised code which can run expectedly in my computer.

library(pROC) library(galgo)
library(rtkore) library(Rcpp) library(aod) mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

reg.fitness = function(chr, parent,tr,te,res) {
try=as.factor(parent$data$classes[tr]) trd = data.frame(parent$data$data[tr,as.numeric(chr)]) trm = nnet::nnet(try ~ ., data = cbind(trd,try=try),trace=F,size = 2) tey = as.factor(parent$data$classes[te]) ted = data.frame(parent$data$data[te,as.numeric(chr)]) pred=predict(trm,newdata = cbind(ted,tey=tey),type = "raw") if(res){ roc(tey,pred,levels=levels(tey), direction = "<")$auc } else{ predict(trm,newdata=cbind(ted,tey=tey),type="class") } }

reg.bb = configBB.VarSel(data=t(mydata[,-1]), classes=mydata$admit , classification.method="user", classification.userFitnessFunc=reg.fitness, chromosomeSize=2 ,niches=1, maxSolutions=10, goalFitness = 0.9, saveVariable="reg.bb", saveFrequency=50, saveFile="reg.bb.Rdata", main="Logistic")

blast(reg.bb)

Plot(blast(reg.bb))

1
On

for the ANN in the example, I use the nnet function, it has only one hidden layer, the input and output is determined by the data dimension. You can reference the manual help for the nnet function. Furthermore you can apply any structure of ANN by replace the function with any other functions