Simulate unbalanced clustered data


I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would like to first create each cluster with 10% more observations than specified (i.e., 33 rather than 30) and then randomly exclude the appropriate number of observations (i.e., 60 out of 660) to arrive at the specified average of 30 observations per cluster. The probability of excluding an observation should not be uniform across clusters (i.e., some clusters end up with no cases removed and others with several), so in the end I still have 600 observations in total. Does anyone know how to do this in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the condition specified above; I just used it to convey the idea, and a rough sketch of the procedure appears after the examples.

> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
   id cluster           x           y
1   1       1  0.30003855  0.65325768
2   2       1 -1.00563626 -0.12270866
3   3       1  0.01925927 -0.41367651
4   4       1 -1.07742065 -2.64314895
5   5       1  0.71270333 -0.09294102
6   1       2  1.08477509  0.43028470
7   2       2 -2.22498770  0.53539884
8   3       2  1.23569346 -0.55527835
9   4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390

After randomly adding and deleting some observations, the unbalanced data looks like this:

            id   cluster   x     y
       1     1       1  0.895 -0.659 
       2     2       1 -0.160 -0.366 
       3     1       2 -0.528 -0.294 
       4     2       2 -0.919  0.362 
       5     3       2 -0.901 -0.467 
       6     1       3  0.275  0.134 
       7     2       3  0.423  0.534 
       8     3       3  0.929 -0.953 
       9     4       3  1.67   0.668 
      10     5       3  0.286  0.0872
      11     1       4 -0.373 -0.109 
      12     2       4  0.289  0.299 
      13     3       4 -1.43  -0.677 
      14     4       4 -0.884  1.70  
      15     5       4  1.12   0.386 
      16     1       5 -0.723  0.247 
      17     2       5  0.463 -2.59  
      18     3       5  0.234  0.893 
      19     4       5 -0.313 -1.96  
      20     5       5  0.848 -0.0613
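Conceptually, something like this rough sketch is what I am after (the sizes and the dropping rule are just placeholders; because rows are dropped completely at random, the number removed per cluster ends up unequal):

ncl <- 20; nobs <- 33                      # 10% more than the target of 30
big <- data.frame(id      = rep(seq_len(nobs), ncl),
                  cluster = rep(seq_len(ncl), each = nobs),
                  x = rnorm(ncl*nobs), y = rnorm(ncl*nobs))
df  <- big[-sample.int(nrow(big), 60), ]   # drop 60 of the 660 rows
table(df$cluster)                          # per-cluster sizes now vary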

EDIT This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on randomly selected clusters (e.g., via df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
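Roughly, the resampling and interval step I have in mind looks like this (a sketch only; y ~ x is a placeholder model, and cluster holds the row indices per cluster):

cluster <- split(seq_len(nrow(df)), df$cluster)   # row indices by cluster
k <- length(cluster)
coefs <- replicate(1000, {
  i <- unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE)
  coef(lm(y ~ x, data = df[i, ]))                 # fit on resampled clusters
})
apply(coefs, 1, quantile, probs = c(.025, .975))  # percentile 95% CIs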


BEST ANSWER

Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence within tolerance tol around the mean number of observations per cluster mnobs. From it we repeatedly draw a random sample of size 1 to obtain a list of clusters CL. If the sum of the cluster lengths equals ncl*mnobs, we break the loop, add random data to the clusters, and rbind the result.

FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: cluster sizes within `tol` of `mnobs` (27:33 by default)
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  ## draw one size per cluster until the total is exactly ncl*mnobs
  repeat({
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=TRUE)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  ## fill each cluster with ids and random x, y data
  L <- lapply(seq_along(CL), function(i) {
    id <- seq_along(CL[[i]])
    cbind(id, cluster=i,
          matrix(rnorm(max(id)*2), ncol=2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}

Usage

set.seed(42)
res <- FUN()  ## using the default arguments
dim(res)
# [1] 600   4

(res.tab <- table(res$cluster))
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30

table(res.tab)
# 27 28 29 30 31 32 33 
#  2  3  2  4  5  3  1

sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
#      mean        sd 
# 30.000000  1.747178 

Displayable example

set.seed(42)
FUN(4, 5, tol=.3)  ## tol needs to be adjusted for smaller samples
#    id cluster           x          y
# 1   1       1  1.51152200 -0.0627141
# 2   2       1 -0.09465904  1.3048697
# 3   3       1  2.01842371  2.2866454
# 4   1       2 -1.38886070 -2.4404669
# 5   2       2 -0.27878877  1.3201133
# 6   3       2 -0.13332134 -0.3066386
# 7   4       2  0.63595040 -1.7813084
# 8   5       2 -0.28425292 -0.1719174
# 9   6       2 -2.65645542  1.2146747
# 10  1       3  1.89519346 -0.6399949
# 11  2       3 -0.43046913  0.4554501
# 12  3       3 -0.25726938  0.7048373
# 13  4       3 -1.76316309  1.0351035
# 14  5       3  0.46009735 -0.6089264
# 15  1       4  0.50495512  0.2059986
# 16  2       4 -1.71700868 -0.3610573
# 17  3       4 -0.78445901  0.7581632
# 18  4       4 -0.85090759 -0.7267048
# 19  5       4 -2.41420765 -1.3682810
# 20  6       4  0.03612261  0.4328180

As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.

# First create an oversampled dataset:
library(dplyr)

y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z, cluster=w, x=x, y=y)

# Then use slice_sample() to cut it down to the desired sample size
df %>% slice_sample(n = 20) %>%
  arrange(cluster)

# Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
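Either way, a quick check of the resulting cluster sizes (my addition, not part of the original answer):

table(df2$cluster)  # per-cluster counts are now typically unequal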