I have a pretty large longitudinal data set ranging from 2014 to 2021. Most of the variables are available for every year. However, there are a few variables that are available for 2014 and 2016, but not for 2015. In these cases, I want to calculate the value for 2015 as the mean of the values from 2014 and 2016.
So the data structure looks as follows. Note this is extremely simplified; the data set has way more variables and observations. Also, for each respondent there are rows for the other years as well (obviously), which I didn't write down here.
| PID | Year | Var 1 | Var 2 | Var 3 |
|---|---|---|---|---|
| 1 | 2014 | 10 | 2 | 2 |
| 1 | 2015 | 15 | 8 | NA |
| 1 | 2016 | 12 | 6 | 4 |
| 2 | 2014 | 11 | 7 | 5 |
| 2 | 2015 | 16 | 3 | NA |
| 2 | 2016 | 14 | 5 | 9 |
PID is the id/number that identifies each respondent. Var1 and Var2 are available for every year; Var3 is only available in 2014 and 2016.
What I want is this:
| PID | Year | Var 1 | Var 2 | Var 3 |
|---|---|---|---|---|
| 1 | 2014 | 10 | 2 | 2 |
| 1 | 2015 | 15 | 8 | 3 |
| 1 | 2016 | 12 | 6 | 4 |
| 2 | 2014 | 11 | 7 | 5 |
| 2 | 2015 | 16 | 3 | 7 |
| 2 | 2016 | 14 | 5 | 9 |
For Var3, instead of NA, the row for 2015 contains the mean of the values in 2014 and 2016. How can I achieve this?
My first idea was to address the missing values in 2015 with is.na(), but this would address all the NAs in the whole data set and not just the NAs in 2015 for Var3. How can I address these NAs specifically, so that it a) only calculates the value for 2015 as the mean of 2014 and 2016 for Var3, and b) only does so within rows sharing the same PID, so that values of different respondents do not get mixed up?
What you're asking about is imputation, where there are different methods for how to replace empty/null/`NA` values. One such method involves the regressed value based on one or more other non-empty variables. I'm actually taking a little liberty with this ... you said "mean of 2014 and 2016", which in this case is going to be the same, but if for some reason you have 2014 and 2017 and are missing 2015-2016, then the "mean" will be biased and be the same value for both 2015 and 2016. Linear approximation (interpolation over `Year`) works well here.

## dplyr
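A minimal sketch of that idea, assuming your data frame is named `quux` (a placeholder; see the Data section at the end) and that each `PID` has at least two non-missing values of `Var3` for `approx()` to interpolate between:

```r
library(dplyr)

quux %>%
  group_by(PID) %>%
  # interpolate Var3 over Year within each PID; keep existing values as-is
  mutate(Var3 = coalesce(Var3, approx(Year, Var3, xout = Year)$y)) %>%
  ungroup()
# The 2015 rows of Var3 become 3 (PID 1) and 7 (PID 2); all other values are unchanged.
```

If several columns need the same treatment, the same expression can be wrapped in `across()` inside the `mutate()`.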
The use of `coalesce` is logically the same as `ifelse(is.na(.x), approx(..)$y, .x)`, which means that if a value is not `NA` then the original value will always be used. Frankly, we could likely just use `approx(..)$y` on its own (removing the `coalesce`, always replacing all values), which still produces the same desired results. But since I don't know the ins and outs of your data, I thought I would be safer about the process.

## base R
While `dplyr` and friends have built-in grouping for transformations, in base R we need to use something like `ave` for non-aggregating by-group calculations.
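A sketch under the same assumptions (`quux` as in the Data section below, numeric `Var*` columns):

```r
# Impute every column except PID and Year, grouped by PID.
# Row indices are what we hand to ave(), so that both Year and the column
# being imputed are visible inside the per-group function.
quux[, -(1:2)] <- lapply(quux[, -(1:2)], function(z) {
  ave(as.numeric(seq_along(z)), quux$PID, FUN = function(i) {
    ifelse(is.na(z[i]),
           approx(quux$Year[i], z[i], xout = quux$Year[i])$y,
           z[i])
  })
})
```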
A few notes:

- The column selection `-(1:2)` simply skips the `PID` and `Year` columns; it can just as easily be done by specifying which columns you need to impute, such as `quux[,3:5] <- lapply(quux[,3:5], ...)`.
- `ave` only works on one column (vector) at a time, so it's not possible to directly use both `Year` and (say) `Var1` within it. Because of this, instead of telling `ave` the data is `Var1`, we use row indices as the grouped variable and use them internally to calculate on the correct indices of `Year` and the respective `Var` column.
- `ave` returns the same class as its first argument (regardless of what is returned), so we force the row indices to be `numeric` instead of `integer`. It isn't clear if your data is `numeric` or `integer`, so you may be able to skip `as.numeric`; over to you.
- We could define a `coalesce` function (similar to `dplyr::coalesce`, `data.table::fcoalesce`, and SQL's `COALESCE` function) and replace much of the `ifelse` statement, but internally it would be doing much the same thing; see the sketch below.
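For illustration, a minimal two-vector version of such a helper (hypothetical name `coalesce2`) could be:

```r
# Hypothetical minimal coalesce: keep x where present, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)
```

## Data

A reproducible version of the sample data from the question, with `quux` as a placeholder name and column names written without spaces:

```r
quux <- data.frame(
  PID  = c(1, 1, 1, 2, 2, 2),
  Year = c(2014, 2015, 2016, 2014, 2015, 2016),
  Var1 = c(10, 15, 12, 11, 16, 14),
  Var2 = c(2, 8, 6, 7, 3, 5),
  Var3 = c(2, NA, 4, 5, NA, 9)
)
```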