How to easily generate/simulate example data with different groups for modelling

Asked by st4co4 on 07 March 2022 at 13:25

How can I easily generate (simulate) meaningful example data for modelling? For instance, I would like to ask for n rows of data in 2 groups whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way to generate data like this?

groups: two groups: A, B
sex: different sex distributions: A 30%, B 70%
age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

library(dplyr) # bind_rows(), group_by(), summarise(), %>%; also re-exports tibble()

n <- 100
d <- bind_rows(
  #group A females
  tibble(group = rep("A"),
         sex = rep("Female"),
         age = rnorm(n*0.4, 50, 4)),
  #group B females
  tibble(group = rep("B"),
         sex = rep("Female"),
         age = rnorm(n*0.3, 45, 4)),
  #group A males
  tibble(group = rep("A"),
         sex = rep("Male"),
         age = rnorm(n*0.20, 60, 6)),
  #group B males
  tibble(group = rep("B"),
         sex = rep("Male"),
         age = rnorm(n*0.10, 55, 4)))

d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))


Allan Cameron On 07 March 2022 at 13:42 BEST ANSWER

There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:

set.seed(69) # Makes samples reproducible

df <- data.frame(groups = rep(c("A", "B"), each = 100),
                 sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                         sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                 age = c(runif(100, 25, 75), runif(100, 50, 90)))

And we can use the tidyverse to show it does what was expected:

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            # count of males; equals the percentage only because each group has 100 rows
            percent_male = sum(sex == "M"))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50
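
If you want to drive the simulation from a small table of parameters, as the question asks, one possible tidyverse sketch is below. It is only an illustration under stated assumptions: the `spec` table and its column names (`p_male`, `mean_age`, `sd_age`) are hypothetical, the 30% / 70% figures are read here as the proportion of males in each group, the standard deviation of 5 is made up, and ages are drawn from a normal distribution rather than the uniform one used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical parameter table: one row per group.
# p_male, mean_age and sd_age are illustrative assumptions.
spec <- tibble(
  group    = c("A", "B"),
  n        = c(100, 100),
  p_male   = c(0.3, 0.7),
  mean_age = c(50, 70),
  sd_age   = c(5, 5)
)

set.seed(42) # makes the draws reproducible

sim <- spec %>%
  uncount(n) %>% # repeat each parameter row n times (drops the n column)
  mutate(
    # each row is male with its own group's probability p_male
    sex = if_else(runif(n()) < p_male, "M", "F"),
    # rnorm() is vectorised over mean and sd, so each row uses its group's values
    age = rnorm(n(), mean = mean_age, sd = sd_age)
  ) %>%
  select(group, sex, age)

# Check the simulated data against the requested parameters
sim %>%
  group_by(group) %>%
  summarise(n = n(),
            mean_age = mean(age),
            prop_male = mean(sex == "M"))
```

Adding a group or changing a target then only means editing a row of `spec`; the sample means and proportions will vary around the requested values from seed to seed.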
