Inter-rater reliability of multiple raters responding to (some subset of) multiple questions (in R)

221 Views Asked by Nick Byrd At 28 June 2025 at 00:26

I have data from 5 raters who rated transcripts by answering up to a dozen questions about each transcript. Each question used a different rating system (e.g., yes vs. no, 1-7, or this vs. that vs. indeterminate).

A toy example of the data can be made with this code.

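library(data.table)

# Vectors shorter than 10 are recycled to fill all 10 rows.
# Note: "NA" is entered here as a literal string, not as a true missing value.
data.table(Rater = c("A","B","C","D","E"),
           Content = c("I","I","I","I","I","II","II","II","II","II"),
           Question1 = c("Yes","No","Yes","No","NA"),
           Question2 = c("1","3","5","7","NA"),
           Question3 = c("This","That","Indeterminate","This","Indeterminate"))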

This produces the following:

    Rater Content Question1 Question2     Question3
 1:     A       I       Yes         1          This
 2:     B       I        No         3          That
 3:     C       I       Yes         5 Indeterminate
 4:     D       I        No         7          This
 5:     E       I        NA        NA Indeterminate
 6:     A      II       Yes         1          This
 7:     B      II        No         3          That
 8:     C      II       Yes         5 Indeterminate
 9:     D      II        No         7          This
10:     E      II        NA        NA Indeterminate

I need to compute the inter-rater reliability for the raters.

The kappa2 function of the irr package would need the data reshaped so that each rater is a column and each question-by-transcript combination is a row (if I understand correctly), something like:

Rater                   A     B  ...     E
Question1_Content_I   Yes    No  ...    NA
Question2_Content_I     1     3  ...    NA
Question3_Content_I  This  That  ...  Ind.
Question1_Content_II  Yes    No  ...    NA
...
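A minimal sketch of how that layout might be produced with data.table's melt and dcast. The object names (ratings, long, wide) and the column names Question and Rating are my own choices for illustration, and the missing ratings are entered as true NA values rather than the string "NA".

```r
library(data.table)

# Toy data from above, with true missing values instead of the string "NA"
ratings <- data.table(
  Rater     = rep(c("A", "B", "C", "D", "E"), times = 2),
  Content   = rep(c("I", "II"), each = 5),
  Question1 = rep(c("Yes", "No", "Yes", "No", NA), times = 2),
  Question2 = rep(c("1", "3", "5", "7", NA), times = 2),
  Question3 = rep(c("This", "That", "Indeterminate", "This", "Indeterminate"),
                  times = 2)
)

# Long format: one row per rater x transcript x question
long <- melt(ratings,
             id.vars       = c("Rater", "Content"),
             variable.name = "Question",
             value.name    = "Rating")

# One row per question x transcript, one column per rater
wide <- dcast(long, Question + Content ~ Rater, value.var = "Rating")
print(wide)
```

Because the questions use different scales, the Rating column has to stay character in this combined table. Subsetting long by Question before computing anything is probably safer than treating all rows as one scale.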

How can I (re)format the data to compute IRR scores (with kappa2 or another function)? (Would melt do the trick?)
What functions would compute IRR scores for each kind of question/rating? (And, if applicable, what data (re)formatting would they require?)
Must there be separate IRR scores for each question/rating or is there a way to compute an overall IRR (across the questions)?
What needs to be done to accommodate the fact that some raters didn't respond to every question?
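To make these questions concrete, here is a rough sketch of one possible approach, not a confirmed answer: compute one coefficient per question with kripp.alpha from the irr package, which takes a rater-by-unit matrix and accepts missing ratings. The scale assigned to each question (scale_of) and the helper alpha_for are assumptions made for illustration, and Question2 is stored as numeric here so that it can be treated as ordinal.

```r
library(data.table)
library(irr)  # assumed to be installed; provides kripp.alpha()

ratings <- data.table(
  Rater     = rep(c("A", "B", "C", "D", "E"), times = 2),
  Content   = rep(c("I", "II"), each = 5),
  Question1 = rep(c("Yes", "No", "Yes", "No", NA), times = 2),
  Question2 = rep(c(1, 3, 5, 7, NA), times = 2),
  Question3 = rep(c("This", "That", "Indeterminate", "This", "Indeterminate"),
                  times = 2)
)

# Assumed measurement level for each question (my reading of the scales)
scale_of <- c(Question1 = "nominal", Question2 = "ordinal", Question3 = "nominal")

# Hypothetical helper: Krippendorff's alpha for a single question
alpha_for <- function(question) {
  # kripp.alpha() expects raters in rows and rated units (transcripts) in columns
  wide <- dcast(ratings, Rater ~ Content, value.var = question)
  m <- as.matrix(wide[, setdiff(names(wide), "Rater"), with = FALSE])
  if (scale_of[[question]] == "nominal") {
    # Recode category labels as integer codes; NA stays NA
    m <- matrix(as.integer(factor(m)), nrow = nrow(m))
  }
  kripp.alpha(m, method = scale_of[[question]])$value
}

# One coefficient per question
alphas <- sapply(names(scale_of), alpha_for)
print(alphas)
```

With only two transcripts the toy values mean little; the point is the shape of the input. As far as I can tell, kappa2 compares exactly two raters and kappam.fleiss drops incomplete rows, which is why this sketch uses kripp.alpha for five raters with gaps.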

Thank you for your advice!
