Restructuring data (for IRR-analysis)

I have the following data frame df (fictitious data) with several variables var1, var2, ..., var_n:

var1<-c("A","A","A","B","A","C","C","A", "A", "E", "E", "B")
var2<-c(NA,"1","1","5","6","2","3","1", "1", "3", "3", "2")
id<-c(1,2,2,3,3,4,4,5,5,6,6,7)

df<-data.frame(id, var1, var2)
df

   id var1 var2
   1    A <NA>
   2    A    1
   2    A    1
   3    B    5
   3    A    6
   4    C    2
   4    C    3
   5    A    1
   5    A    1
   6    E    3
   6    E    3
   7    B    2

The data come from a document analysis in which several coders extracted the values from physical files. Each file has a unique id, so two entries with the same id mean that two different coders coded the same document. For example, in document no. 4 both coders agreed that var1 has the value C, whereas in document no. 3 they disagreed (B vs. A).
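
To check which documents were actually coded twice, the entries per id can be counted. This is only a minimal base R sketch using the example data above; the names n_coders and double_coded are illustrative and not part of the original question.

```r
# example data from the question
id   <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7)
var1 <- c("A", "A", "A", "B", "A", "C", "C", "A", "A", "E", "E", "B")
var2 <- c(NA, "1", "1", "5", "6", "2", "3", "1", "1", "3", "3", "2")
df   <- data.frame(id, var1, var2)

# number of coded entries per document id (illustrative name)
n_coders <- table(df$id)

# ids that were coded by exactly two coders (illustrative name)
double_coded <- as.numeric(names(n_coders)[n_coders == 2])
double_coded
```

Only these double-coded documents can contribute to an inter-rater reliability estimate.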

To calculate inter-rater reliability (IRR), I need to restructure the data frame so that each document is one row, with the values of both coders side by side:

id var1  var1_coder2 var2 var2_coder2
2  A     A           1    1
3  B     A           5    6
4  C     C           2    3
5  A     A           1    1
6  E     E           3    3

Can anyone tell me how to get this done? Thanks!

Best answer

You can transform your data with functions from dplyr (group_by, mutate) and tidyr (gather, spread, unite):

library(tidyr)
library(dplyr)

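new_df <- df %>% 
  group_by(id) %>% 
  # number the coders within each document: coder_1, coder_2, ...
  mutate(coder = paste0("coder_", row_number())) %>% 
  # long format: one row per id, coder and variable
  gather("variables", "values", -id, -coder) %>% 
  # build the new column names, e.g. coder_1_var1
  unite(column, coder, variables) %>% 
  # wide format: one row per id
  spread(column, values) 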

new_df
# A tibble: 7 x 5
# Groups:   id [7]
#      id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
#   <dbl> <chr>        <chr>        <chr>        <chr>       
# 1     1 A            NA           NA           NA          
# 2     2 A            1            A            1           
# 3     3 B            5            A            6           
# 4     4 C            2            C            3           
# 5     5 A            1            A            1           
# 6     6 E            3            E            3           
# 7     7 B            2            NA           NA 
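
In recent tidyr versions gather and spread are superseded, and the same reshaping can probably be written with pivot_wider alone. The following is a sketch under that assumption, not part of the original answer; wide_df is an illustrative name, and the column order differs from the spread output above.

```r
library(dplyr)
library(tidyr)

# example data from the question
df <- data.frame(
  id   = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7),
  var1 = c("A", "A", "A", "B", "A", "C", "C", "A", "A", "E", "E", "B"),
  var2 = c(NA, "1", "1", "5", "6", "2", "3", "1", "1", "3", "3", "2")
)

wide_df <- df %>%
  group_by(id) %>%
  # number the coders within each document
  mutate(coder = paste0("coder_", row_number())) %>%
  ungroup() %>%
  # spread every variable across the coders in one step
  pivot_wider(
    names_from  = coder,
    values_from = c(var1, var2),
    names_glue  = "{coder}_{.value}"   # e.g. coder_1_var1
  )

wide_df
```

Because the variables are never stacked into a single column, each one keeps its own type, which may matter if the real data mix numeric and character variables.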

If you only want to keep the rows where all coders have entered values, you can use filter_all:

new_df %>% 
  filter_all(all_vars(!is.na(.)))

# A tibble: 5 x 5
# Groups:   id [5]
#      id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
#   <dbl> <chr>        <chr>        <chr>        <chr>       
# 1     2 A            1            A            1           
# 2     3 B            5            A            6           
# 3     4 C            2            C            3           
# 4     5 A            1            A            1           
# 5     6 E            3            E            3
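
As a possible next step towards the IRR analysis, the complete rows can be compared column by column. The sketch below repeats the reshaping from above, uses tidyr's drop_na as an alternative to filter_all, and computes simple percent agreement; complete_df, agree_var1 and agree_var2 are illustrative names, and percent agreement is only one rough measure, not a chance-corrected statistic.

```r
library(dplyr)
library(tidyr)

# example data from the question
df <- data.frame(
  id   = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7),
  var1 = c("A", "A", "A", "B", "A", "C", "C", "A", "A", "E", "E", "B"),
  var2 = c(NA, "1", "1", "5", "6", "2", "3", "1", "1", "3", "3", "2")
)

# same reshaping as in the answer above
new_df <- df %>%
  group_by(id) %>%
  mutate(coder = paste0("coder_", row_number())) %>%
  gather("variables", "values", -id, -coder) %>%
  unite(column, coder, variables) %>%
  spread(column, values) %>%
  ungroup()

# keep only documents for which both coders entered all values
complete_df <- drop_na(new_df)

# share of documents on which both coders agree (percent agreement)
agree_var1 <- mean(complete_df$coder_1_var1 == complete_df$coder_2_var1)
agree_var2 <- mean(complete_df$coder_1_var2 == complete_df$coder_2_var2)

agree_var1  # 0.8: the coders disagree on var1 only for document 3
agree_var2  # 0.6: the coders disagree on var2 for documents 3 and 4
```

For chance-corrected measures such as Cohen's kappa, the coder columns of complete_df could be passed to a dedicated package, for example irr.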