A neat way in R to get mean and 5 and 95 percentiles given a probability distribution in a string format?

132 Views Asked by At

I would like to write a function that takes a string as an input. To be exact this input represents a probability distribution and the output would be 5 percentile, mean, and 95 percentile as a vector or list.

For example, if I input: "beta(0.2,0.3)" I would get: 2.793073e-06 0.4 0.9992341

First, thing to do, is read the distribution from the string, this can be done with regular expressions:

dist <- gsub("[^A-z]","","beta(0.2,0.3)")
params <- gsub("[^0-9.,]","","beta(0.2,0.3)")

And the parameters can be put in a numeric vector using a strsplit, e.g.

params <- as.numeric(unlist(strsplit(gsub("[^0-9.,]","","beta(0.2,0.3)"),split=",")))

Now, I need to declare a function for the mean (since to my understanding functions for random distribution means do not exist in R). For beta distribution this would be:

beta_mean <- function(alpha, beta) {
    return(alpha / (alpha + beta))
}

And percentiles I can get from qbeta function, i.e.:

qbeta(c(0.05,0.95), params[1], params[2])

Since, I want to deal with many different distributions, is there a more elegant way than:

meanvalue <- NA
percentiles <- NA

if(dist == "beta") {
    meanvalue <- beta_mean(params[1], params[2])
    percentiles <- qbeta(c(0.05,0.95), params[1], params[2])
} else if (dist == "gamma") {
    meanvalue <- gamma_mean(params[1], params[2])
    percentiles <- qgamma(c(0.05,0.95), params[1], params[2])
} #and so on and so on and so on...

return(c(percentiles[1],meanvalue,percentiles[2])

SO, what I want to do is to link distribution name string (e.g. "beta" or "gamma" or whatever) to the corresponding functions (DISTRIBUTIONNAME_mean and qDISTRIBUTIONNAME), so I don't have to use that extremely long if-else-structure, which contains too much (unnecessary?) repetition.

How can I accomplish this?

3

There are 3 best solutions below

0
Michael M On

I can only give a partial answer that covers the quantiles:

a_string <- "beta(0.2, 0.3)"
eval(parse(text = paste0("q", sub("\\(", "(c(0.05, 0.95), ", a_string))))
# 2.793073e-06 9.992341e-01

To get the mean, you would need to parametrize the expectation of all distributions, which is painful.

0
jpsmith On

You can use eval(parse(...)) along with paste0 here:

distfun <- function(x){
  par <- unlist(stringr::str_extract_all(a_string, "\\d+\\.\\d"))
  m <- paste0(gsub("[^a-zA-Z]", "", x), "_mean(", par[1], ",",  par[2], ")")
  setNames(c(
    eval(parse(text = m)), 
    eval(parse(text = paste0("q", sub("\\(", "(c(0.05, 0.95), ", x))))
  ), c("mean", "q05", "q95"))
  
}

strng <- "beta(0.2, 0.3)"
distfun(strng)

#         mean          q05          q95 
# 4.000000e-01 2.793073e-06 9.992341e-01 

Or if you wanted it in the exact order of your question:

distfun <- function(x){
  par <- unlist(stringr::str_extract_all(a_string, "\\d+\\.\\d"))
  res <- setNames(c(
    eval(parse(text = paste0(gsub("[^a-zA-Z]", "", x), "_mean(", par[1], ",",  par[2], ")"))), 
    eval(parse(text = paste0("q", sub("\\(", "(c(0.05, 0.95), ", x))))
    ), c("mean", "q05", "q95"))
  res[c(2,1,3)]
}

strng <- "beta(0.2, 0.3)"
distfun(strng)

#          q05         mean          q95 
# 2.793073e-06 4.000000e-01 9.992341e-01 
0
Aku-Ville Lehtimäki On

I came up with this solution today. First, we can grab each and every distribution there is in the stats package:

library(dplyr)

fnnames <- ls("package:stats") %>% #Get the list of function names in stats package

  grep("^d|^p|^q|^r", ., value = TRUE) %>% #Exclude names NOT starting with "d", "p", "q" or "r"

  substring(2) %>% #Remove first letter (e.g. rnorm, pnorm, qnorm, and rnorm become all just norm)

  table %>% #Create a frequency table, a vector values are frequencies

  .[. == 4] %>% #Since distribution related functions come in quadruplets, we keep only quadruplets
 
  names(.) #Finally, we are interested in the function names, not the frequencies any more.

Now, we know what functions are available. Let's create a lookup dictionary (this is R, so it's actually a named list although with same functionality as a dictionary):

distlist <- lapply(fnnames, function(fnname){
  list( #a list inside of dist list
    "q" = get(paste0("q",fnname)), #these sub-lists contain "key" q for quantile function. However, I am not sure whether or not "get" function is doing the same as eval(parse(text=...)) under the hood.
    "mean" = NULL #I'll return to this later.
  )
}) %>% setNames(fnnames) #Let's set names for the sub-lists

Now, we use our lookup dictionary like this:

distlist[["beta"]][["q"]](c(0.05, 0.95), params[1], params[2])
[1] 2.793073e-06 9.992341e-01

Now, the mean function. Most random distributions have analytical solution for mean, which would be easier to use. Unfortunately (to my understanding), no such functions exist (meannorm, meanbeta, meangamma etc, would be a nice addition). But of course we can integrate:

integrate(function(x)(x*dbeta(x, params[1], params[2])), lower = 0, upper = 1)
0.4 with absolute error < 1.9e-05

This is slow, and d functions do not like (i.e. they return an error), if a value outside their support is provided. However, this can be circumvented with try-catch, but since the analytical forms of expected values are mostly quite straight-forward, I would prefer them.

Maybe, I'll write and publish a package for different moments (mean is the first moment) for different random distributions, if no one hasn't already done so.