Speeding up a function involving mapply and integrate

1.2k Views Asked by At

I've inherited R some code and it runs incredibly slowly. Most of the time is spent evaluating the functions of the form (there are about 15 such functions with different integrands G):

TMin <- 0.5

F <- function (t, d) {
    result <- ifelse(((d > 0) & (t > TMin)),
                     mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d),
                     0)

    return(result)

}

For testing, I'm using the following dummy function, but in the real code the Gs are much more complicated involving exp(), log(), dlnorm(), plnorm() etc..

G <- function(x, t, d) {
    mean(rnorm(1e5))
    x + t - d
}   

F will be calculated around 2 million times in the worst case. The function gets called in 3 different ways, either:
t is a single number and d is a numeric vector or,
t is a numeric vector and d is a single number or,
t is a numeric vector and is a numeric vector

Is there a (simple) way to speed up this function?

For far I've tried variations along the lines of (to get rid of the ifelse loop):

F2 <- function (t,d) {
    TempRes <- mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d)
    TempRes[(d <= 0) | (t <= TMin)] <- 0
    result <- TempRes

    return(result)
}

and

F3 <- function (t,d) {
    result <- rep(0, max(length(t),length(d)))
    test <- ((d > 0) & (t > TMin))
    result[test] <- mapply(function(t, d) integrate(G, lower=0, upper=t, t, d)$value, t, d)[test]

    return(result)
}

but they take almost exactly the same time.

2

There are 2 best solutions below

5
On BEST ANSWER

You are performing a large number of independent integrations. You can speed things up by performing these integrations on separate cores simultaneously (if you have a multicore processor available). The problem is that R performs its calculations in a single threaded manner by default. However, there are a number packages available that allow multithreading support. I have recently answered a few similar questions here and here, with some additional info regarding the relevant packages and functions.

Additionally, As @Mike Dunlavey already mentioned, you should avoid performing the integrations for values of t and d that do not match your criteria. (You are currently performing unneeded function evaluations for these values and then you overwrite the result with 0 afterwards).

I have added a possible improvement below. Note that you will have to create a separate file with your function G included in order to evaluate it on the cluster nodes. In the code below it is assumed that this file is called functionG.R

The snippet:

library(doParallel)
F4 <- function(t,d) {
  results = vector(mode="numeric",max(length=length(t),length(d))) # Zero vector

  logicalVector <- ((d > 0) & (t > TMin))
  relevantT <- t[logicalVector]
  relevantD <- d[logicalVector] # when d is single element, NA values created

  if(length(relevantT) > 1 | length(relevantD) > 1)
  {
    if(length(d)==1) # d is only one element instead of vector --> replicate it
      relevantD <- rep(d,length(relevantT))
    if(length(t)==1) # t is only one element instead of vector --> replicate it
      relevantT <- rep(t,length(relevantD))

    cl <- makeCluster(detectCores()); 
    registerDoParallel(cl)
    clusterEvalQ(cl,eval(parse("functionG.R")))

    integrationResults <- foreach(i=1:length(relevantT),.combine="c") %dopar%
    {
      integrate(G,lower=0,upper=relevantT[i],relevantT[i],relevantD[i])$value;
    }
    stopCluster(cl)
    results[logicalVector] <- integrationResults
  }
  else if(length(relevantT==1)) # Cluster overhead not needd
  {
    results[logicalVector] = integrate(G,lower=0,upper=relevantT,relevantT,relevantD)$value;
  }

  return(results)
}

My CPU contains 6 physical cores with hyperthreading enabled (x2). These are the results:

> t = -5000:20000
> d = -5000:20000
> 
> start = Sys.time()
> testF3 = F3(t,d)
> timeNeededF3 = Sys.time()-start
> 
> start = Sys.time()
> testF4 = F4(t,d)
> timeNeededF4 = Sys.time()-start;

> timeNeededF3
Time difference of 3.452825 mins
> timeNeededF4
Time difference of 29.52558 secs
> identical(testF3,testF4)
[1] TRUE

It seems that the cores are constantly in use while running this code. However, you can potentially optimize this code further by presplitting the data more efficiently around the cores and then subsequently utilize an apply type functions on the separate cores.

If more optimization is required you could also take a deeper look at the integrate function. You can potentially play around with the settings and obtain a performance gain by allowing a less strict numerical approximation. As an alternative you could implement your own simple version of adaptive Simpson quadrature and play around with the discrete stepsizes. Most likely you could obtain massive performance increases like this (if you are able/willing to allow more error in the approximation).

EDIT: Updated code in order for it to work in all scenario's: d and/or t valid/invalid numbers or vectors

REPLY TO COMMENTS @mawir: you are correct. ifelse(test, yes, no) will return corresponding yes values for the rows of which test evaluates to TRUE, it will return the respective no values for the rows for which test evaluate to FALSE. However, it WILL first have to evaluate your yes expression in order to create the yes vector of length(test). This piece of code demonstrates this:

> t = -5000:5
> d = -5000:5
> 
> start = Sys.time()
> testF1 = F(t,d)
> timeNeededF1 = Sys.time()-start
> timeNeededF1
Time difference of 43.31346 secs
> 
> start = Sys.time()
> testF4 = F4(t,d)
> timeNeededF4 = Sys.time()-start
> timeNeededF4
Time difference of 2.284134 secs

Only the last 5 values of t and d are relevant in this scenario. However, inside the F1 function the ifelse evaluates the mapply on all d and t values first in order to create the yes vector. This is why the function execution takes so long. Next, it selects the elements for which the conditions are met, or 0 otherwise. The F4 function works around this issue.

Futhermore, you say that you obtain speedup in the scenario where t and d are non-vectors. However, in this scenario no parallelisation is used. You should normally obtain the maximum speedup in the senario's where one or both of t/d are vectors.

ANOTHER EDIT, in response to Roland's comment: You could potentially replace clusterEvalQ(cl,eval(parse("functionG.R"))) with clusterExport(cl,"G") if you prefer not to create a separate function file(s).

4
On

As a generality, the place to look is in the innermost loop, and you can speed it up either by making it take less time or by calling it fewer times. You have an inner loop running mapply, but then you extract element [test] from it. Does that mean all the other elements are discarded? If so, why bother spending time to calculate the extra elements?