How to label values from a data dictionary in R

198 Views Asked by At

I have two files, one of which being a .csv file and the other an Excel document with a data dictionary in it. I want to label the values of the .csv file with the values from the data dictionary, so that the data frame looks like one loaded from a .dta/.sav file (in that the cells don't contain the values as a character/factor, but a labelled value). Is there a way of assigning value labels to cells from a data dictionary?"

Dummy data are below. I did work out how to do it using match_df from {matchmaker}, but this put all the values in as character rather than labelling them.

# where dd is the data dictionary and df the data frame
dd <- data.frame(variable.name = c(rep("sex", 2), 
                                   rep("age", 1), 
                                   rep("school", 3), 
                                   rep("parent", 2)), 
                 value = c(1, 2, 
                           -96, 
                           1, 2, -96, 
                           1, 2), 
                 variable.label = c("male", "female", 
                                    NA, 
                                    "private", "state", "missing", 
                                    "mother", "father"))


df <- data.frame(id = 1:10, 
                 sex = rep(c(1, 2), 5), 
                 age = c(rep(5, 9), -96), 
                 school = as.integer(c(rep(c(1, 2, 3), 3), 1)),
                 parent = c(rep(1, 2), rep(2, 8)))

# the following works to an extent, but it's not the desired outcome:
match_df(df, 
         dictionary = dd, 
         from = "value", 
         to = "variable.label",
         by = "variable.name")

# desired outcome would look something like this:
 dput()
structure(list(id = 1:10, sex = structure(c(1, 2, 1, 2, 1, 2, 
1, 2, 1, 2), labels = c(male = 1, female = 2), class = c("haven_labelled", 
"vctrs_vctr", "double")), age = structure(c(5, 5, 5, 5, 5, 5, 
5, 5, 5, -96), labels = c(missing = -96), class = c("haven_labelled", 
"vctrs_vctr", "double")), school = structure(c(1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L), labels = structure(1:3, names = c("private", 
"state", "missing")), class = c("haven_labelled", "vctrs_vctr", 
"integer")), parent = structure(c(1, 1, 2, 2, 2, 2, 2, 2, 2, 
2), labels = c(mother = 1, father = 2), class = c("haven_labelled", 
"vctrs_vctr", "double"))), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))
2

There are 2 best solutions below

0
On BEST ANSWER

I think the first thing to do is split(dd) into a list of data frames, each of which contains the appropriate factor levels for one column.

levels_list <- split(dd, ~variable.name)

# This looks like:
# $age
#   variable.name value variable.label
# 3           age   -96           <NA>

# $parent
#   variable.name value variable.label
# 7        parent     1         mother
# 8        parent     2         father
# <etc>

You can then write a function to construct the haven_labelled class which takes one of these data frames and df and returns a labelled vector.

create_haven_labelled <- function(col, dat = df, ldf = levels_list[[col]]) {
    levels <- unique(dat[[col]])
    labels <- ldf[, "variable.label"]

    if (all(is.na(labels))) {
        # Assume no labels if none except NA provided
        levels <- ldf[, "value"]
        labels <- "missing"
    } else { 
        # Add "missing" to labels not found
        labels <- labels[match(levels, ldf$value)]
        labels[is.na(labels)] <- "missing"
    }

    dat[[col]] |>
        haven::labelled() |>
        `attr<-`("labels", setNames(levels, labels))
}

This function labels any values not found as "missing", except in the case where there are no labels other than missing values provided, e.g. "age", where it assumes that no labels except "missing" are required.

We can then apply this to the relevant columns:

out_df <- dplyr::as_tibble(df) # just so it prints nicely
out_df[names(levels_list)] <- lapply(names(levels_list), create_haven_labelled)

out_df
# # A tibble: 10 × 5
#       id sex        age           school      parent    
#    <int> <dbl+lbl>  <dbl+lbl>     <int+lbl>   <dbl+lbl> 
#  1     1 1 [male]     5           1 [private] 1 [mother]
#  2     2 2 [female]   5           2 [state]   1 [mother]
#  3     3 1 [male]     5           3 [missing] 2 [father]
#  4     4 2 [female]   5           1 [private] 2 [father]
#  5     5 1 [male]     5           2 [state]   2 [father]
#  6     6 2 [female]   5           3 [missing] 2 [father]
#  7     7 1 [male]     5           1 [private] 2 [father]
#  8     8 2 [female]   5           2 [state]   2 [father]
#  9     9 1 [male]     5           3 [missing] 2 [father]
# 10    10 2 [female] -96 [missing] 1 [private] 2 [father]

identical(out_df, desired) # TRUE
1
On

Maybe the reason for the error there is no label for school == 3 in dd dataset. instead -96 was used.

I usually prefer, "long format-> left join -> wide format" approach.

        dd <- data.frame(variable.name = c(rep("sex", 2), 
                                           rep("age", 1), 
                                           rep("school", 3), 
                                           rep("parent", 2)), 
                         value = c(1, 2, 
                                   -96, 
                                   1, 2, -96, 
                                   1, 2), 
                         variable.label = c("male", "female", 
                                            NA, 
                                            "private", "state", "missing", 
                                            "mother", "father"))
        
        
        df <- data.frame(id = 1:10, 
                         sex = rep(c(1, 2), 5), 
                         age = c(rep(5, 9), -96), 
                         school = as.integer(c(rep(c(1, 2, 3), 3), 1)),
                         parent = c(rep(1, 2), rep(2, 8)))
        
   
        library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
        library(tidyr)
        
        df |> 
                mutate(school = if_else(school == 3, -96, school)) |>  # Ignore if 3 should be stay as 3
                mutate(across(everything(), ~as.character(.))) |> 
                pivot_longer(
                        -id,
                        names_to = "names",
                        values_to = "values"
                        ) |> 
                left_join(dd |> 
                                  mutate(value = as.character(value)), 
                          by = c("names" = "variable.name", "values" = "value")) |> 
                mutate(new = if_else((names == "age" & values != "-96"), values, variable.label)) |> 
                select(id, names, new) |>  
                pivot_wider(
                        names_from = "names",
                        values_from = "new"
                        )
#> # A tibble: 10 × 5
#>    id    sex    age   school  parent
#>    <chr> <chr>  <chr> <chr>   <chr> 
#>  1 1     male   5     private mother
#>  2 2     female 5     state   mother
#>  3 3     male   5     missing father
#>  4 4     female 5     private father
#>  5 5     male   5     state   father
#>  6 6     female 5     missing father
#>  7 7     male   5     private father
#>  8 8     female 5     state   father
#>  9 9     male   5     missing father
#> 10 10    female <NA>  private father

Created on 2023-12-19 with reprex v2.0.2