Simulate unbalanced clustered data

Question

612 Views Asked by cliu At 17 August 2025 at 15:41

I want to simulate unbalanced clustered data in R. There are 20 clusters with an average of 30 observations per cluster. To get there, I would first generate 10% more observations per cluster than specified (i.e., 33 rather than 30), giving 660 observations in total. I then want to randomly exclude the surplus (i.e., 60 observations) to arrive at the specified average of 30 observations per cluster. The probability of excluding an observation should not be uniform across clusters: some clusters should have no cases removed and others should have more excluded. In the end I should therefore still have 600 observations in total. Does anyone know how to do this in R? Here is a smaller example dataset. Its number of observations per cluster doesn't follow the conditions specified above; I only use it to convey my idea.

> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
   id cluster           x           y
1   1       1  0.30003855  0.65325768
2   2       1 -1.00563626 -0.12270866
3   3       1  0.01925927 -0.41367651
4   4       1 -1.07742065 -2.64314895
5   5       1  0.71270333 -0.09294102
6   1       2  1.08477509  0.43028470
7   2       2 -2.22498770  0.53539884
8   3       2  1.23569346 -0.55527835
9   4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390

After randomly adding and deleting some data, the unbalanced data should look like this:

            id   cluster   x     y
       1     1       1  0.895 -0.659 
       2     2       1 -0.160 -0.366 
       3     1       2 -0.528 -0.294 
       4     2       2 -0.919  0.362 
       5     3       2 -0.901 -0.467 
       6     1       3  0.275  0.134 
       7     2       3  0.423  0.534 
       8     3       3  0.929 -0.953 
       9     4       3  1.67   0.668 
      10     5       3  0.286  0.0872
      11     1       4 -0.373 -0.109 
      12     2       4  0.289  0.299 
      13     3       4 -1.43  -0.677 
      14     4       4 -0.884  1.70  
      15     5       4  1.12   0.386 
      16     1       5 -0.723  0.247 
      17     2       5  0.463 -2.59  
      18     3       5  0.234  0.893 
      19     4       5 -0.313 -1.96  
      20     5       5  0.848 -0.0613

EDIT: This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but on a random selection of clusters, which can be drawn with df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]. In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
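The answers below only cover the data generation, so here is a minimal sketch of the follow-up, under several assumptions. `make_data()` is a hypothetical stand-in generator (any function returning a data frame with `cluster`, `x` and `y` columns, such as `FUN()` from the accepted answer, could be swapped in), the model `y ~ x` is assumed, one cluster resample is drawn per simulated dataset, and the interval is a simple percentile interval over the 1000 estimates. None of these choices come from the original posts.

```r
set.seed(2020)

# Hypothetical stand-in generator: cluster sizes are drawn so that they
# sum to ncl * mnobs; x and y are independent standard normal draws.
make_data <- function(ncl = 20, mnobs = 30) {
  sizes <- as.vector(rmultinom(1, size = ncl * mnobs, prob = rep(1, ncl)))
  cluster <- rep(seq_len(ncl), times = sizes)
  data.frame(cluster = cluster,
             x = rnorm(length(cluster)),
             y = rnorm(length(cluster)))
}

# One replication: simulate a dataset, resample whole clusters with
# replacement, and fit the regression on the resampled rows.
one_rep <- function() {
  df <- make_data()
  rows <- split(seq_len(nrow(df)), df$cluster)  # row indices per cluster
  k <- length(rows)
  picked <- rows[sample.int(k, k, replace = TRUE)]
  boot <- df[unlist(picked, use.names = FALSE), ]
  coef(lm(y ~ x, data = boot))
}

# 1000 replications: one row per replication, one column per coefficient
est <- t(replicate(1000, one_rep()))

# Percentile intervals across the 1000 estimates
ci <- apply(est, 2, quantile, probs = c(0.025, 0.975))
ci
```

Because whole clusters are drawn with replacement, a cluster can appear more than once in `boot`, which is what the `sample.int(k, k, replace = TRUE)` expression in the question implies.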

Original Q&A

cliu On 29 December 2020 at 01:35

As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.

library(dplyr)  # needed for %>%, slice_sample() and arrange()

# First create an oversampled dataset (6 rather than 5 observations per cluster):
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each = 6)
df <- data.frame(id = z, cluster = w, x = x, y = y)

# Then use slice_sample() to cut it down to the desired sample size
df %>%
  slice_sample(n = 20) %>%
  arrange(cluster)

# Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
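Note that sampling rows as above removes them uniformly at random, whereas the question asks for a removal probability that differs between clusters. A minimal base R sketch of that variant follows. The cluster weights `w` are a hypothetical choice (exponential draws, with five clusters set to zero so they never lose a row); any other non-negative weights would work the same way.

```r
set.seed(123)

ncl    <- 20   # number of clusters
mnobs  <- 30   # target average cluster size
n_over <- 33   # oversampled cluster size (10% more)

# Balanced, oversampled dataset: 20 clusters of 33 rows each
df <- data.frame(
  id      = rep(seq_len(n_over), times = ncl),
  cluster = rep(seq_len(ncl), each = n_over),
  x       = rnorm(ncl * n_over),
  y       = rnorm(ncl * n_over)
)

# Hypothetical cluster-level removal weights; weight 0 protects a cluster
w <- rexp(ncl)
w[sample(ncl, 5)] <- 0

# Draw the 60 rows to exclude, each row weighted by its cluster's weight
n_drop <- nrow(df) - ncl * mnobs
drop   <- sample(nrow(df), size = n_drop, prob = w[df$cluster])
df2    <- df[-drop, ]

# Renumber ids within each cluster after the removal
df2$id <- ave(df2$cluster, df2$cluster, FUN = seq_along)

table(df2$cluster)
```

With this scheme no cluster can end up with more than 33 rows, so the sizes are spread below the oversampled size rather than symmetrically around 30.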

**jay.sf** · Accepted Answer

Let ncl be the desired number of clusters. We generate a sampling space S, a sequence of cluster sizes within tolerance tol around the mean number of observations per cluster mnobs. From S we repeatedly draw a random sample of size 1, once per cluster, to obtain a list of clusters CL. Once the sum of the cluster lengths equals ncl*mnobs, we break the loop, add random data to the clusters and rbind the result.

FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: cluster sizes from mnobs*(1 - tol) to mnobs*(1 + tol)
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  ## redraw all cluster sizes until they sum to exactly ncl*mnobs
  repeat {
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=TRUE)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  }
  ## add ids and random x, y columns to each cluster
  L <- lapply(seq_along(CL), function(i) {
    id <- seq_along(CL[[i]])
    cbind(id, cluster=i, 
          matrix(rnorm(max(id)*2),,2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}

Usage

set.seed(42)
res <- FUN()  ## using defined `arg` defaults
dim(res)
# [1] 600   4

(res.tab <- table(res$cluster))
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30

table(res.tab)
# 27 28 29 30 31 32 33 
#  2  3  2  4  5  3  1

sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
#      mean        sd 
# 30.000000  1.747178

Displayable example

set.seed(42)
FUN(4, 5, tol=.3)  ## tol needs to be adjusted for smaller samples
#    id cluster           x          y
# 1   1       1  1.51152200 -0.0627141
# 2   2       1 -0.09465904  1.3048697
# 3   3       1  2.01842371  2.2866454
# 4   1       2 -1.38886070 -2.4404669
# 5   2       2 -0.27878877  1.3201133
# 6   3       2 -0.13332134 -0.3066386
# 7   4       2  0.63595040 -1.7813084
# 8   5       2 -0.28425292 -0.1719174
# 9   6       2 -2.65645542  1.2146747
# 10  1       3  1.89519346 -0.6399949
# 11  2       3 -0.43046913  0.4554501
# 12  3       3 -0.25726938  0.7048373
# 13  4       3 -1.76316309  1.0351035
# 14  5       3  0.46009735 -0.6089264
# 15  1       4  0.50495512  0.2059986
# 16  2       4 -1.71700868 -0.3610573
# 17  3       4 -0.78445901  0.7581632
# 18  4       4 -0.85090759 -0.7267048
# 19  5       4 -2.41420765 -1.3682810
# 20  6       4  0.03612261  0.4328180
