R: mean, variance and standard deviation columns per subject

1.2k Views Asked by At

I need to add a mean, variance and standard deviation column (per subject), but my data is a little complex:

I have subject ID's, dates & times, weeks of the year, overall attendance count and attendances per week. Now what I need are 3 more columns, giving me the mean visit per week, the variance of attendance and the standard deviation.

To make it more clear, this is a snapshot of my data set:

df <- c(Contact.ID, Date.Time, Week, Attendance, WeeklyAT)

Contact.ID     Date      Time    Week    Attendance  WeeklyAT  *Mean      *v    *sd
1   A       2012-10-06 18:54:48   40          3          2    *0.214   *0.335  *0.579
2   A       2012-10-08 20:50:18   40          3          2    *0.214   *0.335  *0.579  
3   A       2012-11-24 20:18:44   47          3          1    *0.214   *0.335  *0.579  
4   B       2012-11-15 16:58:15   46          4          1 
5   B       2013-01-09 10:57:02    2          4          3
6   B       2013-01-11 17:31:22    2          4          3
7   B       2013-01-14 18:37:00    2          4          3
8   C       2013-02-22 17:46:07    8          2          1
9   C       2013-02-27 11:21:00    9          2          1
10  D       2012-10-28 14:48:33   43          1          1

To calculate the mean attendance, it needs to be considered, that the timeframe I am looking at is 14 weeks and the weekly attendance is repeated, thus needs to be bound to the week number. So, to calculate subject A and B's mean for example it would have to be:

meanA = (2+1+0+0+0+0+0+0+0+0+0+0+0+0)/14=0.214

meanB = (1+3+0+0+0+0+0+0+0+0+0+0+0+0)/14=0.286

(here the 14 weeks don't matter too much but for the variance and sd it does:

varianceA = ∑(x-µ)^2 = [(2-0.214)^2+(1-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2+(0-0.214)^2]/(14-1) = 4.357/13 = 0.335

sdA= √varianceA = √0.335 = 0.579

I cannot figure out how to do this in code. I have tried ifelse functions and general var and mean and tried to create new columns with these but failed at defining it per subject (Contact.ID) and for my n=14.

I greatly appreciate the help. Many Thanks!

2

There are 2 best solutions below

3
On BEST ANSWER

Data

df <- structure(list(Contact.ID = 1:10, Date = c("A", "A", "A", "B", 
"B", "B", "B", "C", "C", "D"), Time = c("2012-10-06 18:54:48", 
"2012-10-08 20:50:18", "2012-11-24 20:18:44", "2012-11-15 16:58:15", 
"2013-01-09 10:57:02", "2013-01-11 17:31:22", "2013-01-14 18:37:00", 
"2013-02-22 17:46:07", "2013-02-27 11:21:00", "2012-10-28 14:48:33"
), Week = c(40L, 40L, 47L, 46L, 2L, 2L, 2L, 8L, 9L, 43L), Attendance = c(3L, 

3L, 3L, 4L, 4L, 4L, 4L, 2L, 2L, 1L), WeeklyAT = c(2L, 2L, 1L, 
1L, 3L, 3L, 3L, 1L, 1L, 1L)), .Names = c("Contact.ID", "Date", 
"Time", "Week", "Attendance", "WeeklyAT"), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

tidyverse solution

library(tidyverse)
df1 <- df %>%
         group_by(Date) %>%
         nest(Week, WeeklyAT) %>%          # nest relevant data
         mutate(data = map(data, ~.x %>% filter(duplicated(Week)==F))) %>%    # filter out duplicated Weeks
         mutate(data = map(data, ~c(.x$WeeklyAT, rep(0, 14-length(.x$WeeklyAT))))) %>%      # make WeeklyAT into 14-element vector
         mutate(data = map(data, ~data.frame(Mean = mean(.x), sd = sd(.x), v = sd(.x)**2))) %>%    # calculate statistics and save as data frame
         unnest(data) %>%        # unnest results
         left_join(df, ., by="Date")       # combine with original data frame

Output

   Contact.ID Date                Time Week Attendance WeeklyAT       Mean
1           1    A 2012-10-06 18:54:48   40          3        2 0.21428571
2           2    A 2012-10-08 20:50:18   40          3        2 0.21428571
3           3    A 2012-11-24 20:18:44   47          3        1 0.21428571
4           4    B 2012-11-15 16:58:15   46          4        1 0.28571429
5           5    B 2013-01-09 10:57:02    2          4        3 0.28571429
6           6    B 2013-01-11 17:31:22    2          4        3 0.28571429
7           7    B 2013-01-14 18:37:00    2          4        3 0.28571429
8           8    C 2013-02-22 17:46:07    8          2        1 0.14285714
9           9    C 2013-02-27 11:21:00    9          2        1 0.14285714
10         10    D 2012-10-28 14:48:33   43          1        1 0.07142857
          sd          v
1  0.5789342 0.33516484
2  0.5789342 0.33516484
3  0.5789342 0.33516484
4  0.8254203 0.68131868
5  0.8254203 0.68131868
6  0.8254203 0.68131868
7  0.8254203 0.68131868
8  0.3631365 0.13186813
9  0.3631365 0.13186813
10 0.2672612 0.07142857
3
On

Here is a VERY rough solution (need to go). I assume that the calculations in the original questions were not entirely right, if I am wrong you should be able to adapt my code:

EDIT 1: update code - since the mean used in the variance calculations was wrong and added some comments.

# Set to data.table
setDT(df)

# Number of weeks in our data
nweeks <- df[, uniqueN(Week)] # 7 

# Calculate mean number of visits per week
df[, Mean := .N / nweeks, by = .(Contact.ID)]

# Add the rank of the week, this variable is used in the loop below
df <- merge(df,
            df[!duplicated(Week), .(Week, num_week = rank(Week))])

# Set key for tha data.table... makes syntax simpler
setkey(df, Contact.ID, num_week)

# Initalize variance variable
df[, v := 0]

# For each id go through every week and fill in vector of number of visits
# attendance_vector based on which we will calculate variance for each id.
for (id in unique(df$Contact.ID)) {
  attendance_vector <- integer(nweeks)
  mean <- df[id, Mean][1] # mean for this id...
  for (week in unique(df$num_week)) {
    attendance_vector[week] <- 
      df[.(id, week)][1, ifelse(!is.na(WeeklyAT), WeeklyAT, 0)]
  }
  df[id, v := sum((attendance_vector - mean)^2) / (nweeks - 1L)]
  cat("for", id, "the weekly attendance was: \n")
  print(cbind(unique(df$Week), attendance_vector, mean))
}

# Standard deviation
df[, sd := sqrt(v), by = Contact.ID]

# Drop num_week variable
df[, num_week := NULL]
df

    Week Contact.ID       Date     Time Attendance WeeklyAT      Mean         v        sd
 1:   40          A 2012-10-06 18:54:48          3        2 0.4285714 0.6190476 0.7867958
 2:   40          A 2012-10-08 20:50:18          3        2 0.4285714 0.6190476 0.7867958
 3:   47          A 2012-11-24 20:18:44          3        1 0.4285714 0.6190476 0.7867958
 4:    2          B 2013-01-09 10:57:02          4        3 0.5714286 1.2857143 1.1338934
 5:    2          B 2013-01-11 17:31:22          4        3 0.5714286 1.2857143 1.1338934
 6:    2          B 2013-01-14 18:37:00          4        3 0.5714286 1.2857143 1.1338934
 7:   46          B 2012-11-15 16:58:15          4        1 0.5714286 1.2857143 1.1338934
 8:    8          C 2013-02-22 17:46:07          2        1 0.2857143 0.2380952 0.4879500
 9:    9          C 2013-02-27 11:21:00          2        1 0.2857143 0.2380952 0.4879500
10:   43          D 2012-10-28 14:48:33          1        1 0.1428571 0.1428571 0.3779645