How to increase performance when randomly selecting clusters and adding observations?

Question

How to increase performance when randomly selecting clusters and adding observations?

91 Views Asked by cliu At 17 December 2020 at 05:32

In a clustered dataset, I want to randomly pick some clusters and then add some simulated observations to the selected clusters. Then I want to create a dataset that combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters. I would also like to repeat this process many times and thus create many (maybe 1000) new datasets. I managed to do this using for loop but would like to know if there is any more efficient and concise way to accomplish this. Here is an example dataset:

## simulate some data
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z, cluster=w, x=x, y=y)
#    id cluster           x           y
# 1   1       1  0.30003855  0.65325768
# 2   2       1 -1.00563626 -0.12270866
# 3   3       1  0.01925927 -0.41367651
# 4   4       1 -1.07742065 -2.64314895
# 5   5       1  0.71270333 -0.09294102
# 6   1       2  1.08477509  0.43028470
# 7   2       2 -2.22498770  0.53539884
# 8   3       2  1.23569346 -0.55527835
# 9   4       2 -1.24104450  1.77950291
# 10  5       2  0.45476927  0.28642442
# 11  1       3  0.65990264  0.12631586
# 12  2       3 -0.19988983  1.27226678
# 13  3       3 -0.64511396 -0.71846622
# 14  4       3  0.16532102 -0.45033862
# 15  5       3  0.43881870  2.39745248
# 16  1       4  0.88330282  0.01112919
# 17  2       4 -2.05233698  1.63356842
# 18  3       4 -1.63637927 -1.43850664
# 19  4       4  1.43040234 -0.19051680
# 20  5       4  1.04662885  0.37842390

cl <- split(dd, dd$cluster)  ## split the data based on clusters 
k <- length(dd$id)
l <- length(cl)

`%notin%` <- Negate(`%in%`)  ## define "not in" to exclude unselected clusters so
                             ## as to retain their original observations

A clsamp function in the following code is then created which includes two for loops. The first for loop is to exclude the unselected clusters and the second for loop is to simulate new observations and append them to the selected clusters. Note that I randomly sample 2 clusters (10% of the total number of observations), without replacement

clsamp <- function(cl, k) {
  a <- sample(cl, size=0.1*k, replace=FALSE)  
  jud <- (names(cl) %notin% names(a))  
  need <- names(cl)[jud] 
  T3 <- NULL
  for (k in need) {
    T3 <- rbind(T3, cl[[k]])  
  }
  subt <- NULL
  s <- a
  for (j in 1:2) {
    y <- rnorm(2)
    x <- rnorm(2)
    d <- cbind(id=nrow(a[[j]]) + c(1:length(x)), 
               cluster=unique(a[[j]]$cluster), x, y)
    s[[j]] <- rbind(a[[j]], d)
    subt <- rbind(subt, s[[j]])
  }
  T <- rbind(T3, subt)
  return(T)
}

Finally, this creates a list of 5 datasets each of which combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters

Q <- vector(mode="list", length=5)
for (i in 1:length(Q)) {
  Q[[i]] <- clsamp(cl, 20)
}

Anyone knows a shorter way to do this? Maybe use the replicate function? Thanks.

Original Q&A

There are 1 best solutions below

**jay.sf** · Accepted Answer · 2020-12-17T15:26:24.240000

This generates a sizeX2 matrix of random values and cbinds sampled cluster names and consecutive ids to it. It directly starts with dd and also works when you convert dd to a matrix mm, which might be slightly faster. Output is a data frame, though. Instead of your k I use f to directly calculate the number of rows that should be added to the two selected clusters. In case the size gets zero, the original data frame is returned.

clsamp2 <- function(m, f=.1) {
  size <- round(nrow(m)*f)
  if (size == 0) as.data.frame(m)
  else {
    ids <- unique(m[,1])
    cls <- unique(m[,2])
    rd <- matrix(rnorm(size * 4), ncol=2, dimnames=list(NULL, c("x", "y")))
    out <- rbind.data.frame(m, cbind(id=rep(max(ids) + 1:size, each=2), 
                                     cluster=sample(cls, 2), rd))
    `rownames<-`(out[order(out$cluster, out$id), ], NULL)
  }
}

Result

set.seed(42)  ## same seed also used for creating `dd`
clsamp2(dd, .1)

## or
mm <- as.matrix(dd)
clsamp2(mm, .1)

#    id cluster           x           y
# 1   1       1 -0.30663859  1.37095845
# 2   2       1 -1.78130843 -0.56469817
# 3   3       1 -0.17191736  0.36312841
# 4   4       1  1.21467470  0.63286260
# 5   5       1  1.89519346  0.40426832
# 6   1       2 -0.43046913 -0.10612452
# 7   2       2 -0.25726938  1.51152200
# 8   3       2 -1.76316309 -0.09465904
# 9   4       2  0.46009735  2.01842371
# 10  5       2 -0.63999488 -0.06271410
# 11  6       2  1.37095845  0.40426832
# 12  7       2  0.36312841  1.51152200
# 13  1       3  0.45545012  1.30486965
# 14  2       3  0.70483734  2.28664539
# 15  3       3  1.03510352 -1.38886070
# 16  4       3 -0.60892638 -0.27878877
# 17  5       3  0.50495512 -0.13332134
# 18  1       4 -1.71700868  0.63595040
# 19  2       4 -0.78445901 -0.28425292
# 20  3       4 -0.85090759 -2.65645542
# 21  4       4 -2.41420765 -2.44046693
# 22  5       4  0.03612261  1.32011335
# 23  6       4 -0.56469817 -0.10612452
# 24  7       4  0.63286260 -0.09465904

To create the list of five samples, you may use replicate.

replicate(5, clsamp2(dd, .1), simplify=FALSE)

Running time is negligible.

system.time(replicate(1000, clsamp2(dd, .1), simplify=FALSE))
# user  system elapsed 
# 0.44    0.03    0.44

How to increase performance when randomly selecting clusters and adding observations?

There are 1 best solutions below

Result

Related Questions in R

Related Questions in FOR-LOOP

Related Questions in CODING-EFFICIENCY

Related Questions in DATA-GENERATION

Trending Questions

Popular # Hahtags

Popular Questions