In a clustered dataset, I want to randomly pick some clusters and then add some simulated observations to the selected clusters. Then I want to create a dataset that combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters. I would also like to repeat this process many times and thus create many (maybe 1000) new datasets. I managed to do this using a for loop, but would like to know if there is a more efficient and concise way to accomplish this. Here is an example dataset:
## simulate some data
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z, cluster=w, x=x, y=y)
# id cluster x y
# 1 1 1 0.30003855 0.65325768
# 2 2 1 -1.00563626 -0.12270866
# 3 3 1 0.01925927 -0.41367651
# 4 4 1 -1.07742065 -2.64314895
# 5 5 1 0.71270333 -0.09294102
# 6 1 2 1.08477509 0.43028470
# 7 2 2 -2.22498770 0.53539884
# 8 3 2 1.23569346 -0.55527835
# 9 4 2 -1.24104450 1.77950291
# 10 5 2 0.45476927 0.28642442
# 11 1 3 0.65990264 0.12631586
# 12 2 3 -0.19988983 1.27226678
# 13 3 3 -0.64511396 -0.71846622
# 14 4 3 0.16532102 -0.45033862
# 15 5 3 0.43881870 2.39745248
# 16 1 4 0.88330282 0.01112919
# 17 2 4 -2.05233698 1.63356842
# 18 3 4 -1.63637927 -1.43850664
# 19 4 4 1.43040234 -0.19051680
# 20 5 4 1.04662885 0.37842390
cl <- split(dd, dd$cluster) ## split the data based on clusters
k <- length(dd$id)
l <- length(cl)
`%notin%` <- Negate(`%in%`) ## define "not in" to exclude unselected clusters so
## as to retain their original observations
A clsamp function is then created in the following code, which includes two for loops. The first for loop collects the unselected clusters, and the second for loop simulates new observations and appends them to the selected clusters. Note that I randomly sample 2 clusters (10% of the total number of observations), without replacement:
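clsamp <- function(cl, k) {
  ## randomly select 0.1 * k clusters, without replacement
  a <- sample(cl, size = 0.1 * k, replace = FALSE)
  ## stack the unselected clusters, keeping their original observations
  need <- names(cl)[names(cl) %notin% names(a)]
  T3 <- NULL
  for (nm in need) {  ## nm instead of k, so the argument k is not overwritten
    T3 <- rbind(T3, cl[[nm]])
  }
  ## simulate two new observations per selected cluster and append them
  subt <- NULL
  for (j in seq_along(a)) {  ## works for any number of selected clusters
    y <- rnorm(2)
    x <- rnorm(2)
    d <- cbind(id = nrow(a[[j]]) + seq_along(x),
               cluster = unique(a[[j]]$cluster), x, y)
    subt <- rbind(subt, rbind(a[[j]], d))
  }
  ## unselected clusters first, then the augmented selected clusters
  rbind(T3, subt)
}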
Finally, this creates a list of 5 datasets, each of which combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters:
Q <- vector(mode = "list", length = 5)
for (i in seq_along(Q)) {
  Q[[i]] <- clsamp(cl, 20)
}
Does anyone know a shorter way to do this? Maybe using the replicate function? Thanks.
This generates a size × 2 matrix of random values and cbinds sampled cluster names and consecutive ids to it. It starts directly with dd and also works when you convert dd to a matrix mm, which might be slightly faster. The output is a data frame, though. Instead of your k, I use f to directly calculate the number of rows that should be added to the two selected clusters. If the size becomes zero, the original data frame is returned.
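The approach described above could be sketched roughly as follows. This is not the original answer's code, only a minimal illustration under stated assumptions: the names add_sim, n_cl and n_add are hypothetical, and f is assumed to be the fraction of all rows that is added to each selected cluster (so f = 0.1 with 20 rows gives 2 new rows per cluster, as in the question).

```r
set.seed(42)  # only so the example is reproducible

# Example data in the same layout as the question: 4 clusters of 5 rows each
dd <- data.frame(id = rep(1:5, 4), cluster = rep(1:4, each = 5),
                 x = rnorm(20), y = rnorm(20))

# add_sim, n_cl and n_add are hypothetical names used for illustration
add_sim <- function(dd, f = 0.1, n_cl = 2) {
  n_add <- round(f * nrow(dd))   # assumption: rows added per selected cluster
  size  <- n_cl * n_add          # total number of simulated rows
  if (size == 0) return(dd)      # nothing to add: return the original data
  # pick n_cl clusters without replacement (assumes at least 2 distinct clusters)
  picked <- sample(unique(dd$cluster), n_cl)
  # largest existing id in each picked cluster, so new ids continue from there
  start <- vapply(picked, function(cl) max(dd$id[dd$cluster == cl]), numeric(1))
  # size x 2 matrix of random values, with ids and cluster labels bound to it
  new <- cbind(id = rep(start, each = n_add) + seq_len(n_add),
               cluster = rep(picked, each = n_add),
               matrix(rnorm(size * 2), ncol = 2,
                      dimnames = list(NULL, c("x", "y"))))
  out <- rbind(dd, as.data.frame(new))
  out <- out[order(out$cluster, out$id), ]  # keep each cluster's rows together
  rownames(out) <- NULL
  out
}

res <- add_sim(dd)
table(res$cluster)  # two of the four clusters now have 7 rows instead of 5
```

Because all new rows are generated in one call to rnorm and bound in one step, no loop over the selected clusters is needed.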
To create the list of five samples, you may use replicate. Running time is negligible.
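As a small illustration of the replicate step, the sketch below uses a hypothetical stand-in function, sim_once, that appends one simulated row to one randomly chosen cluster; any function returning one new dataset could take its place. The point to note is simplify = FALSE, which keeps the results as a list of data frames instead of letting replicate try to collapse them into an array.

```r
set.seed(1)  # only so the example is reproducible

dd <- data.frame(id = rep(1:5, 4), cluster = rep(1:4, each = 5),
                 x = rnorm(20), y = rnorm(20))

# sim_once is a hypothetical stand-in for a function that builds one new dataset
sim_once <- function(dd) {
  cl <- sample(unique(dd$cluster), 1)  # pick one cluster at random
  new <- data.frame(id = max(dd$id[dd$cluster == cl]) + 1,
                    cluster = cl, x = rnorm(1), y = rnorm(1))
  rbind(dd, new)
}

# The expression is re-evaluated on every repetition, so each dataset differs
Q <- replicate(5, sim_once(dd), simplify = FALSE)

length(Q)        # number of datasets in the list
sapply(Q, nrow)  # each dataset has one row more than dd
```

This replaces the preallocated list and the for loop from the question with a single call; changing 5 to 1000 is all that is needed for more repetitions.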