I have an annotation matrix with the following description: 3 annotators, 3 categories, 206 subjects.
The data is stored in a numpy.ndarray variable z, where each row is a subject and each entry is the number of annotators who assigned that category (so each row sums to 3):
array([[ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 1.,  1.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.],
       ...,
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.]])
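In full, rows 7 through 206 are all [ 0., 3., 0.]; the matrix can be rebuilt compactly as:

import numpy as np

# Each row is one subject; the columns hold how many of the 3 annotators
# picked each category, so every row sums to 3.
z = np.array(
    [[0., 2., 1.]] * 4 + [[1., 1., 1.]] + [[0., 2., 1.]]
    + [[0., 3., 0.]] * 200
)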
As can be seen, for 200 out of the 206 subjects all three annotators chose the same category. Now, computing Fleiss' kappa:
from statsmodels.stats.inter_rater import fleiss_kappa
fleiss_kappa(z)
0.062106000466964177
Why is the score so low even though for the majority of subjects (200/206) all annotators agree on the same category?
I think the statsmodels score is perfectly fine. The problem with your example is that the second category is picked almost all of the time, which by the definition of Fleiss' kappa means the probability that two random raters both pick the second category purely by chance is very high. Mathematically, following the notation of the Wikipedia article (which matches Fleiss' original paper exactly), Fleiss' kappa is defined as

\kappa = \frac{\bar{P} - \bar{P_e}}{1 - \bar{P_e}}

where

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right), \qquad \bar{P_e} = \sum_{j=1}^{k} p_j^2, \qquad p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}

for N subjects, n raters, k categories, and n_{ij} the number of raters who assigned subject i to category j.

In your case, \bar{P} and (and that's the problem) \bar{P_e} are both close to 1, so the excess of observed agreement over chance agreement is tiny relative to the maximum possible excess.
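To make this concrete, here is a minimal sketch (assuming z is the 206x3 count matrix from the question) that computes \bar{P} and \bar{P_e} by hand:

import numpy as np

N, k = z.shape                  # 206 subjects, 3 categories
n = int(z[0].sum())             # 3 raters per subject

# Per-subject observed agreement P_i and its mean P_bar
P_i = (np.sum(z**2, axis=1) - n) / (n * (n - 1))
P_bar = P_i.mean()              # ~0.979

# Chance agreement P_e_bar from the overall category proportions p_j
p_j = z.sum(axis=0) / (N * n)   # ~[0.002, 0.989, 0.010]
P_e_bar = np.sum(p_j**2)        # ~0.978

kappa = (P_bar - P_e_bar) / (1 - P_e_bar)  # ~0.0621, matching fleiss_kappa(z)

Both means come out around 0.98, so the numerator is roughly 0.0014 against a denominator of roughly 0.022, which is exactly where the 0.062 comes from.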
A solution to your problem would be for the raters to also agree on the other two categories. For example, change your setup so that you have 306 subjects, still with 3 categories and 3 raters. Assume the annotations of the first 6 subjects are the same as in your example. Then for the next 100 subjects all 3 raters agree on category 1, for the next 100 subjects all agree on category 2, and for the last 100 subjects all agree on category 3. Now the probability that two raters end up with the same rating by chance is much lower, since the overall number of ratings per category is much more balanced. For this exact example, Fleiss' kappa is 0.9787.
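You can verify this with statsmodels; here is a sketch that builds that balanced table (the name z_balanced is chosen here just for illustration):

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# First 6 subjects exactly as in the question, then 100 unanimous
# subjects for each of the three categories.
z_balanced = np.array(
    [[0., 2., 1.]] * 4 + [[1., 1., 1.]] + [[0., 2., 1.]]
    + [[3., 0., 0.]] * 100   # all raters pick category 1
    + [[0., 3., 0.]] * 100   # all raters pick category 2
    + [[0., 0., 3.]] * 100   # all raters pick category 3
)

fleiss_kappa(z_balanced)  # ~0.9787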