optim() for multi variable returns results on the starting position in R

1.8k Views Asked by At

I would like to use function optim() in R to minimise the target function. The two optimised parameters both have constrains.

I have created a test sampel data. Flow is a random series data separated by NAs. The function NAins() can be seen at the end of this question.

    flow = c(rep(NA,10),NAins(as.data.frame(runif(5000)), .1)$runif)
    rain = runif (length(flow))
    event = with(rle(!is.na(flow )),cbind(length=lengths[values],position=cumsum(c(1,lengths))[values])); 

This function is to calculate r2.

    test_function = function(ndays, event, flow, rain,upboundary){
      flowvolume = rainvolume = raininweek = raininmonth =NULL;
      for (i in 1:(length(event)/2)){
        if (upboundary < event[,'position'][i]){
          flowvolume[i] = sum(flow[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total flow during the non NA period
          rainvolume[i] = sum(rain[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total rain during the non NA period
          raininweek[i] = sum(rain[(event[,'position'][i]-ndays[1]):(event[,'position'][i]-1)], na.rm = TRUE) #total rain imediate before NA with a constrained period of nday[1]
          raininmonth[i] = sum(rain[(event[,'position'][i]-ndays[2]-ndays[1]):(event[,'position'][i]-ndays[1]-1)], na.rm = TRUE) #total rain iprior to nday[1] 
        } else {next}
      }
      -summary(lm(flowvolume ~ rainvolume + raininweek + raininmonth))$r.squared # to minimise R2
    }   

This is the optimisation with constrains.

    results= optim(par=c(2,20), lower=c(1,10), upper=c(10,30),method="L-BFGS-B",test_function, event=event, rain=rain, flow=flow,upboundary=30)

In this simulation, Results always converge to the staring position. If optim() is not a good choose in this question, could you recommend some other packages or function to use?

Here is the function used to create sample flow data with random NA in between.

    ################################################################
    # RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
    ################################################################
    NAins <-  NAinsert <- function(df, prop){
      n <- nrow(df)
      m <- ncol(df)
      num.to.na <- ceiling(prop*n*m)
      id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
      rows <- id %/% m + 1
      cols <- id %% m + 1
      sapply(seq(num.to.na), function(x){
        df[rows[x], cols[x]] <<- NA
      }
      )
      return(df)
    }
2

There are 2 best solutions below

2
On

It seems the optimization is never moving far enough away from the starting point because these parameters are implicitly integer. However optim does not know this. It just sees a flat gradient.

If your parameter space for ndays is small, as you indicate in your question, try enumerating over all these combinations instead. Here is a handy function. Hat tip to How to optimize for integer parameters (and other discontinuous parameter space) in R?.

library(NMOF)
grid<- gridSearch(test_function, list(ndays1=seq(1,10), ndays2=seq(10,22)),
                  event=event, rain=rain, flow=flow, upboundary=30)

grid$minfun
grid$minlevels

Notice I had to cut off part of the search space for ndays[2], because it resulted in a negative subscript error. You will need to add some checks in your function to test for negative subscripts.

2
On

I think enumeration is the best option especially if you have few variables and a very nonlinear function. Nelder Mead or Hooke Jeeves are bound to give you only local solutions. The function here appears to be fairly nonlinear and quite flat in certain areas.

You could get some speedup using parallel packages like foreach and doParallel from Revolution Analytics. In the below example, I do a pure parallel implementation of exhaustive search. I have modified the test_function to also return x variable.

test_function2 = function(ndays, event, flow, rain,upboundary){
  flowvolume = rainvolume = raininweek = raininmonth =NULL;
  for (i in 1:(length(event)/2)){
    if (upboundary < event[,'position'][i]){
      flowvolume[i] = sum(flow[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total flow during the non NA period
      rainvolume[i] = sum(rain[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total rain during the non NA period
      raininweek[i] = sum(rain[(event[,'position'][i]-ndays[1]):(event[,'position'][i]-1)], na.rm = TRUE) #total rain imediate before NA with a constrained period of nday[1]
      raininmonth[i] = sum(rain[(event[,'position'][i]-ndays[2]-ndays[1]):(event[,'position'][i]-ndays[1]-1)], na.rm = TRUE) #total rain iprior to nday[1] 
    } else {next}
  }
  rsq=-summary(lm(flowvolume ~ rainvolume + raininweek + raininmonth))$r.squared # to minimise R2
  return(c(ndays,rsq))
}

x1<-c(1:10)
x2<-c(10:30)
a<-expand.grid(x1,x2)

library(foreach)
library(doParallel)

cl <- makePSOCKcluster(4)
registerDoParallel(cl)

mymin <-function(z1,z2) {
  if (z1[[3]]<=z2[[3]]) {
    return(z1)
  } else {
    return(z2)
  }
}

ptm<-proc.time()
#c<-matrix(foreach(i=1:210) %dopar% test_function(as.numeric(a[i,]),event,flow,rain,30),10)
c<-foreach(i=1:210,.combine=mymin) %dopar% test_function2(as.numeric(a[i,]),event,flow,rain,30)
proc.time()-ptm

stopCluster(cl)

The run time for this was around 4.6s

> ptm<-proc.time()
> #c<-matrix(foreach(i=1:210) %dopar% test_function(as.numeric(a[i,]),event,flow,rain,30),10)
  > c<-foreach(i=1:210,.combine=mymin) %dopar% test_function2(as.numeric(a[i,]),event,flow,rain,30)
> proc.time()-ptm
user  system elapsed 
0.211   0.030   4.596 
> c
[1]  1.0000000 11.0000000 -0.9363349

For the NMOF implementation it was 11s

> ptm<-proc.time()
> grid<- gridSearch(test_function, list(ndays1=seq(1,10), ndays2=seq(10,30)),
                    +                   event=event, rain=rain, flow=flow, upboundary=30)
2 variables with 10, 21 levels: 210 function evaluations required.
> proc.time()-ptm
user  system elapsed 
10.963   0.004  10.974
> grid$minfun
[1] -0.9363349
> grid$minlevels
[1]  1 11

I hope this helps. Please see documentation of foreach, if you plan to take this approach.

Another option for speedup is to use faster ways to solve the lm so you can cut down on the individual function call evaluation time. I see a few options in the below link:

How to compute minimal but fast linear regressions on each column of a response matrix?