Sampling two correlated random variables using the gamma distribution


I have two random variables that follow gamma distributions and are negatively correlated, with a coefficient of -0.1. I'm not sure how to incorporate the correlation correctly. From looking at many different examples, I've boiled it down to the following code, but I don't think this is correct for my purposes, because it only changes one of the RVs and leaves the other exactly the same.

# Number of observations
n <- 10

# Parameters for the gamma distribution
shape1 <- 2
shape2 <- 3

# Generate two independent gamma-distributed random variables
X <- rgamma(n, shape = shape1)
Y <- rgamma(n, shape = shape2)

# Correlation coefficient
corrcoef <- -.1

# Create a correlation matrix
corr_matrix <- matrix(c(1, corrcoef, corrcoef, 1), nrow = 2)

# Cholesky decomposition (note: R's chol() returns the upper triangular factor)
cholesky_matrix <- chol(corr_matrix)

# Create a matrix of uncorrelated variables
uncorrelated_matrix <- cbind(X, Y)

# Transform the uncorrelated variables to have the desired correlation
correlated_matrix <- uncorrelated_matrix %*% cholesky_matrix

# Extract the correlated variables
correlated_X <- correlated_matrix[, 1]
correlated_Y <- correlated_matrix[, 2]

# Compare summaries of X, Y, correlated_X, and correlated_Y
summary(X)
summary(Y)
summary(correlated_X)
summary(correlated_Y)

I understand that this gets me what I "want", in that it reshapes Y so that it follows a gamma distribution and is negatively correlated with X as specified. But I don't want to leave X static and change only Y. Ideally both correlated_X and correlated_Y would change and have the same means as X and Y.
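
Indeed, because chol() returns the upper triangular factor, whose [1, 1] entry is 1 here and whose [2, 1] entry is 0, the first output column is exactly X. A quick check using the objects from the code above:

# the first column passes through untouched
all.equal(correlated_X, X)  # TRUE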

I guess what I'm looking for is a transformation of the samples that minimizes the sum of squares (mean(correlated_X) - mean(X))^2 + (mean(correlated_Y) - mean(Y))^2.

Is there another process that would change both random variables "equally" instead of leaving one static and changing the other?


There are 2 solutions below.

First solution:

Something like this, which I thought was clear in the comment:

uncorrelated_matrix <- cbind(X, Y)

# Transform the uncorrelated variables to have the desired correlation
correlated_matrix2 <- uncorrelated_matrix[, 2:1] %*% cholesky_matrix  # reverse X and Y
corr_both <- (correlated_matrix + correlated_matrix2)/2

I checked that the sum of two gamma variables with a common rate is again gamma, so the columns of the average of the two matrices should again be gamma-distributed.
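
A quick sanity check, assuming the question's code has been run (at n = 10 all of these statistics are very noisy, so use a much larger n for a stable comparison):

colMeans(cbind(X, Y))                # original means
colMeans(corr_both)                  # means after averaging
cor(corr_both[, 1], corr_both[, 2])  # realized correlation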

Second solution:

A general procedure to sample from any bivariate distribution for which you have the quantile functions of the marginal distributions and the desired correlation coefficient (rho):

  1. Initialize r = rho.
  2. Generate a bivariate Gaussian copula with a correlation coefficient equal to r.
  3. Use inverse transform sampling to convert the copula to one with the desired marginal distributions.
  4. Calculate the correlation coefficient of the bivariate distribution resulting from Step 3.
  5. Update r according to a stochastic root finding scheme. Check if r has converged. If it has, return r. If it has not, return to Step 2.

For Step 2, we can build the Gaussian copula from a bivariate log-normal distribution, since the log-normal distribution has the same support as your bivariate gamma distribution. Additionally, we can match moments.
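
To see where the moment matching comes from: a log-normal with parameters mu and sigma^2 has mean m = exp(mu + sigma^2/2) and variance s^2 = m^2*(exp(sigma^2) - 1). Inverting gives sigma^2 = log(s^2/m^2 + 1) and mu = log(m) - sigma^2/2; likewise, the covariance of two such log-normals is m1*m2*(exp(d) - 1), where d is the covariance of the underlying normals, so d = log(rho*s1*s2/(m1*m2) + 1), which is the anti-diagonal term used below. A quick numerical check of the inversion for the first marginal:

# gamma with shape 2 and rate 1 has mean 2 and variance 2
m1 <- 2; s1 <- sqrt(2)
sigma1sq <- log(s1^2/m1^2 + 1)
mu1 <- log(m1) - sigma1sq/2
exp(mu1 + sigma1sq/2)      # recovers m1 = 2
m1^2*(exp(sigma1sq) - 1)   # recovers s1^2 = 2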

I'll demonstrate the first four steps using a very large sample size so that the correlation coefficient of the sample is close to the actual correlation coefficient.

library(mvtnorm)

set.seed(976317713)
n <- 1e7 # number of samples
# parameters of the gamma marginal distributions
shape1 <- 2
shape2 <- 3
rate1 <- 1
rate2 <- 1
rho <- -0.1 # desired correlation coefficient

shapes <- c(shape1, shape2)
rates <- c(rate1, rate2)
m <- shapes/rates # means of the gamma marginal distributions
s <- sqrt(m/rates) # standard deviations of the gamma marginal distributions
# parameters of the moment-matched log-normal marginal distributions
sigma <- sqrt(log(s^2/m^2 + 1))
mu <- log(m) - sigma^2/2
d <- log(rho*prod(s)/prod(m) + 1) # covariance anti-diagonal
# bivariate normal whose corresponding log-normal has the desired
# correlation coefficient
y <- rmvnorm(n, mu, diag(sigma^2 - d) + d)
# check the correlation coefficient
cor(exp(y[,1]), exp(y[,2]))
#> [1] -0.1001247
# Step 2: bivariate Gaussian copula
u <- pnorm(y, rep(mu, each = n), rep(sigma, each = n))
# Step 3: inverse transform sampling
x <- qgamma(u, rep(shapes, each = n), rep(rates, each = n))
# check that the resulting marginal distributions are the desired gammas
ks.test(x[,1], pgamma, shape1, rate1)
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  x[, 1]
#> D = 0.00021873, p-value = 0.725
#> alternative hypothesis: two-sided
ks.test(x[,2], pgamma, shape2, rate2)
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  x[, 2]
#> D = 0.00025568, p-value = 0.5303
#> alternative hypothesis: two-sided
# Step 4: Correlation coefficient of `x`
cor(x[,1], x[,2])
#> [1] -0.1104764

The correlation coefficient is a bit off the target of -0.1. Using stochastic root finding, we can adjust rho to get much closer to the desired correlation coefficient. The following function performs the entire process.

rgammaCopula <- function(n, shape1, shape2, rate1, rate2, rho, k = 1e3, reps = 1e3) {
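  # k: sample size per root-finding iteration; reps: number of iterations
  # (k = 0 and reps = 0 skip root finding, so rho is used as-is)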
  shapes <- c(shape1, shape2)
  rates <- c(rate1, rate2)
  m <- shapes/rates
  s <- sqrt(m/rates)
  sigmasq <- log(s^2/m^2 + 1)
  sigma <- sqrt(sigmasq)
  mu <- log(m) - sigmasq/2
  mus <- rep(mu, each = k)
  sigmas <- rep(sigma, each = k)
  shapes <- rep(shapes, each = k)
  rates <- rep(rates, each = k)
  sm <- prod(s/m)
  num <- den <- 0
  r <- rho
  # stochastic root finding
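  # each iteration draws k samples at the current parameter r, then updates
  # r by a ratio estimate of the slope from r to the realized correlation,
  # so the realized correlation is pulled toward the target rho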
  for (i in seq_len(reps)) {
    d <- log(r*sm + 1)
    x <- matrix(
      qgamma(
        pnorm(rmvnorm(k, mu, diag(sigmasq - d) + d), mus, sigmas),
        shapes,
        rates
      ), k, 2
    )
    
    num <- num + r^2
    den <- den + r*cor(x[,1], x[,2])
    r <- rho*num/den
  }
  
  d <- log(r*sm + 1)
  list(
    x = matrix(
      qgamma(
        pnorm(
          rmvnorm(n, mu, diag(sigmasq - d) + d),
          rep(mu, each = n),
          rep(sigma, each = n)
        ),
        rep(c(shape1, shape2), each = n),
        rep(c(rate1, rate2), each = n)
      ), n, 2
    ),
    r = r
  )
}

Testing:

n <- 10
res <- rgammaCopula(n, shape1, shape2, rate1, rate2, rho)
colMeans(res$x)
#> [1] 1.988554 2.875326
apply(res$x, 2, var)
#> [1] 2.803401 2.270023
cor(res$x[,1], res$x[,2])
#> [1] 0.1274439

With only 10 samples the mean, variance, and correlation coefficient statistics are highly variable. Use res$r to generate 10 million samples and check the statistics again (passing k = 0 and reps = 0 skips the root-finding loop, so res$r is used as-is):

res <- rgammaCopula(1e7, shape1, shape2, rate1, rate2, res$r, 0, 0)
colMeans(res$x)
#> [1] 2.000603 3.000251
apply(res$x, 2, var)
#> [1] 2.000278 3.000842
cor(res$x[,1], res$x[,2])
#> [1] -0.1003074