Using optimize() to find the shortest interval that takes 95% area under a curve in R

537 Views Asked by At

Background:

I have a curve whose Y-values are produced by my small R function below (neatly annotated). If you run my entire R code, you see my curve (but remember, it's a function so if I changed the argument values, I could get a different curve):

enter image description here

Question:

Obviously, one can determine/assume many intervals that would cover/take 95% of the total area under this curve. But using, optimize(), how can I find the SHORTEST (in x-value units) of these many possible 95% intervals? What then would be the corresponding x-values for the the two ends of this shortest 95% interval?

Note: The idea of shortest interval for a uni-modal curve like mine makes sense. In reality, the shortest one would be the one that tends to be toward the middle where the height (y-value) is larger, so then x-value doesn't need to be so large for the intended interval to cover/take 95% of the total area under the curve.

Here is my R code (please run the entire code):

ppp <- function(f, N, df1, df2, petasq, alpha, beta) {

 pp <- function(petasq) dbeta(petasq, alpha, beta)
 ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )

 marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]

po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )

}
## @@@ END OF MY R FUNCTION.

# Now I use my function above to get the y-values for my plot:

petasq  <- seq(0, 1, by = .0001) ## These are X-values for my plot
f  <- 30       # a function needed argument
df1 <- 3       # a function needed argument
df2 <- 108     # a function needed argument
N  <- 120      # a function needed argument
alpha = 5      # a function needed argument
beta = 4       # a function needed argument


## Now use the ppp() function to get the Y-values for the X-value range above:
y.values <- ppp(f, N, df1, df2, petasq, alpha, beta)

## Finally plot petasq (as X-values) against the Y.values:
plot(petasq, y.values, ty="l", lwd = 3 )
3

There are 3 best solutions below

7
thc On BEST ANSWER

Based on your revised question, I found the optimization that minimizes the SHORTEST distance (in x-value units) between LEFT and RIGHT boundaries:

ppp <- function(petasq, f, N, df1, df2, alpha, beta) {

 pp <- function(petasq) dbeta(petasq, alpha, beta)
 ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )

 marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]

po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )
}

petasq  <- seq(0, 1, by = .0001) ## These are X-values for my plot
f  <- 30       # a function needed argument
df1 <- 3       # a function needed argument
df2 <- 108     # a function needed argument
N  <- 120      # a function needed argument
alpha = 5      # a function needed argument
beta = 4       # a function needed argument

optim_func <- function(x_left) {
    int_function <- function(petasq) {
        ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
    }

    # For every LEFT value, find the corresponding RIGHT value that gives 95% area.  

    find_95_right <- function(x_right) {
        (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
    }
    x_right_obj <- optimize(f=find_95_right, interval=c(0.5,1))
    if(x_right_obj$objective > .Machine$double.eps^0.25) return(100)

    #Return the DISTANCE BETWEEN LEFT AND RIGHT
    return(x_right_obj$minimum - x_left)
}

#MINIMIZE THE DISTANCE BETWEEN LEFT AND RIGHT
x_left <- optimize(f=optim_func, interval=c(0.30,0.40))$minimum
find_95_right <- function(x_right) {
    (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
}
    int_function <- function(petasq) {
        ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
    }
x_right <- optimize(f=find_95_right, interval=c(0.5,1))$minimum

See the comments in the code. Hopefully this finally satisfies your question :) Results:

> x_right
[1] 0.5409488
> x_left
[1] 0.3201584

Also, you can plot the distance between LEFT and RIGHT as a function of the left boundary:

left_x_values <- seq(0.30, 0.335, 0.0001)
DISTANCE <- sapply(left_x_values, optim_func)

plot(left_x_values, DISTANCE, type="l")

plot

1
MrFlick On

If we think of this as trying to calculate the interval with the smallest area, we can start calculating the areas of each of the regions we are plotting. We can then find the largest area (which presumably will be near the center) and start walking out till we found the area we are looking for.

Since you've already calculate the x and y values for the plot, i'll reuse those to save some calculations. Here's an implementation of that algorithm

pseduoarea <- function(x, y, target=.95) {
  dx <- diff(x)
  areas <- dx * .5 * (head(y,-1) + tail(y, -1))
  peak <- which.max(areas)
  range <- c(peak, peak)
  found <- areas[peak]
  while(found < target) {
    if(areas[range[1]-1] > areas[range[2]+1]) {
      range[1] <- range[1]-1
      found <- found + areas[range[1]-1]
    } else {
      range[2] <- range[2]+1
      found <- found + areas[range[2]+1]
    }   
  }
  val<-x[range]
  attr(val, "indexes")<-range
  attr(val, "area")<-found
  return(val)
}

And we call it with

pseduoarea(petasq, y.values)
# [1] 0.3194 0.5413

This does assume that all the values in petasq are equally spaced

8
IRTFM On

I don't think you need to use optimize (unless this were part of an unadmitted homework assignment). Instead just normalize a cumulative sum and figure out at which points your criteria are met:

> which(cusm.y >= 0.025)[1]
[1] 3163
> which(cusm.y >= 0.975)[1]
[1] 5375

You can check that these are reasonable indices to use for the pulling values from the petasq vector with:

abline( v= c( petasq[  c( which(cusm.y >= 0.025)[1], which(cusm.y >= 0.975)[1])]),
        col="red")

This is admittedly equivalent to constructing an integration function with a normalization constant across the domain of the "density" function. The fact that the intervals are all of equal dimension allows omitting the differencing of "x"-vector from the height times base calculation.

enter image description here

I suppose there is another interpretation possible. That would require that we discover how many values of an ascending-sorted version of petasq are needed to sum to 95% of the total sum. This gives a different strategy and the plot shows where a horizontal line would intersect the curve:

which( cumsum( sort( y.values, decreasing=TRUE) ) > 0.95* sum(y.values, na.rm=TRUE) )[1]
#[1] 2208
sort( y.values, decreasing=TRUE)[2208]
#[1] 1.059978
png()
  plot(petasq, y.values, ty="l", lwd = 3 )
  abline( h=sort( y.values, decreasing=TRUE)[2208], col="blue")
dev.off()

enter image description here

To get the petasq values you would need to determine the first y.values that exceeded that value and then the next y.values that dropped below that level. These can be obtained via:

order(y.values, decreasing=TRUE)[2208]
#[1] 3202
order(y.values, decreasing=TRUE)[2209]
#[1] 5410

And then the plot would look like:

png(); plot(petasq, y.values, ty="l", lwd = 3 )
      abline( v=  petasq[  c(3202, 5410)], col="blue", lty=3, lwd=2)
dev.off()

The area between the two dotted blue lines is 95% of the total area above the zero line:

enter image description here