I have two files, one of which being a .csv file and the other an Excel document with a data dictionary in it. I want to label the values of the .csv file with the values from the data dictionary, so that the data frame looks like one loaded from a .dta/.sav file (in that the cells don't contain the values as a character/factor, but a labelled value). Is there a way of assigning value labels to cells from a data dictionary?"
Dummy data are below. I did work out how to do it using match_df
from {matchmaker}
, but this put all the values in as character rather than labelling them.
# where dd is the data dictionary and df the data frame
dd <- data.frame(variable.name = c(rep("sex", 2),
rep("age", 1),
rep("school", 3),
rep("parent", 2)),
value = c(1, 2,
-96,
1, 2, -96,
1, 2),
variable.label = c("male", "female",
NA,
"private", "state", "missing",
"mother", "father"))
df <- data.frame(id = 1:10,
sex = rep(c(1, 2), 5),
age = c(rep(5, 9), -96),
school = as.integer(c(rep(c(1, 2, 3), 3), 1)),
parent = c(rep(1, 2), rep(2, 8)))
# the following works to an extent, but it's not the desired outcome:
match_df(df,
dictionary = dd,
from = "value",
to = "variable.label",
by = "variable.name")
# desired outcome would look something like this:
dput()
structure(list(id = 1:10, sex = structure(c(1, 2, 1, 2, 1, 2,
1, 2, 1, 2), labels = c(male = 1, female = 2), class = c("haven_labelled",
"vctrs_vctr", "double")), age = structure(c(5, 5, 5, 5, 5, 5,
5, 5, 5, -96), labels = c(missing = -96), class = c("haven_labelled",
"vctrs_vctr", "double")), school = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L), labels = structure(1:3, names = c("private",
"state", "missing")), class = c("haven_labelled", "vctrs_vctr",
"integer")), parent = structure(c(1, 1, 2, 2, 2, 2, 2, 2, 2,
2), labels = c(mother = 1, father = 2), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I think the first thing to do is
split(dd)
into a list of data frames, each of which contains the appropriate factor levels for one column.You can then write a function to construct the
haven_labelled
class which takes one of these data frames anddf
and returns a labelled vector.This function labels any values not found as
"missing"
, except in the case where there are no labels other than missing values provided, e.g."age"
, where it assumes that no labels except"missing"
are required.We can then apply this to the relevant columns: