How to easily generate/simulate example data with different groups for modelling


How can I easily generate (simulate) meaningful example data for modelling? For example, I would like to say: give me n rows of data for 2 groups whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way to generate data like this?

  • groups: two groups: A, B
  • sex: different sex distributions: A 30%, B 70%
  • age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

library(tidyverse)  # provides tibble(), bind_rows() and %>%

n <- 100
d <- bind_rows(
  #group A females
  tibble(group = rep("A"),
         sex = rep("Female"),
         age = rnorm(n*0.4, 50, 4)),
  #group B females
  tibble(group = rep("B"),
         sex = rep("Female"),
         age = rnorm(n*0.3, 45, 4)),
  #group A males
  tibble(group = rep("A"),
         sex = rep("Male"),
         age = rnorm(n*0.20, 60, 6)),
  #group B males
  tibble(group = rep("B"),
         sex = rep("Male"),
         age = rnorm(n*0.10, 55, 4)))


d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))


Best answer:

There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:

set.seed(69) # Makes samples reproducible

df <- data.frame(groups = rep(c("A", "B"), each = 100),
                 sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                         sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                 age = c(runif(100, 25, 75), runif(100, 50, 90)))
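As a side note, sample() with prob draws each row independently, so the realised share of males in a group will usually land near the target rather than exactly on it. If exact counts matter, one option is to build the vector with rep() and shuffle it. The following is a minimal sketch under that assumption; exact_sex() and df_exact are hypothetical names used only for illustration, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the shuffle is reproducible

# Hypothetical helper: returns a shuffled sex vector of length n
# that contains exactly round(n * p_male) males.
exact_sex <- function(n, p_male) {
  n_male <- round(n * p_male)
  sample(rep(c("M", "F"), times = c(n_male, n - n_male)))
}

df_exact <- data.frame(
  groups = rep(c("A", "B"), each = 100),
  sex    = c(exact_sex(100, 0.3), exact_sex(100, 0.5))
)

# Cross-tabulate to confirm the counts per group
table(df_exact$groups, df_exact$sex)
```

With these settings group A always contains exactly 30 males and group B exactly 50, whatever the seed; only the row order is random.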

And we can use the tidyverse to show it does what was expected:

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            # count of males; equals the percentage because each group has 100 rows
            percent_male = sum(sex == "M"))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50
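
Since tidyverse solutions were requested, the same idea can also be written so that all settings live in one small parameter table, which makes it easy to add groups or change targets. This is only a sketch, not the one right way to do it. It assumes that the 30% and 70% in the question are the shares of males (the code above uses 50% for group B instead), that age is normally distributed, and that a standard deviation of 5 is acceptable; params, p_male, sd_age and sim are made-up names.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical parameter table: one row per group.
# p_male and sd_age are assumptions, not values fixed by the question.
params <- tibble(
  group    = c("A", "B"),
  n        = c(100, 100),
  p_male   = c(0.3, 0.7),
  mean_age = c(50, 70),
  sd_age   = c(5, 5)
)

sim <- params %>%
  uncount(n) %>%  # repeat each parameter row n times (drops the n column)
  mutate(
    # each row is male with its group's probability
    sex = if_else(runif(n()) < p_male, "M", "F"),
    # rnorm() is vectorised over mean and sd, so each row uses its group's values
    age = rnorm(n(), mean = mean_age, sd = sd_age)
  ) %>%
  select(group, sex, age)

# Same kind of check as above
sim %>%
  group_by(group) %>%
  summarise(age = mean(age),
            percent_male = 100 * mean(sex == "M"))
```

Adding a third group is then a matter of adding one more value to each column of params.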