I have an annotation matrix with the following description: 3 annotators, 3 categories, 206 subjects.
The data is stored in a numpy.ndarray variable z, where each row is a subject and each entry is the number of annotators who assigned that category (so each row sums to 3):
array([[ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  2.,  1.],
       [ 1.,  1.,  1.],
       [ 0.,  2.,  1.],
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.],
       ...,
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.],
       [ 0.,  3.,  0.]])
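In full, rows 7 through 206 are all [ 0., 3., 0.]; the matrix can be rebuilt compactly as:

import numpy as np

# Each row is one subject; the columns hold how many of the 3 annotators
# picked each category, so every row sums to 3.
z = np.array(
    [[0., 2., 1.]] * 4 + [[1., 1., 1.]] + [[0., 2., 1.]]
    + [[0., 3., 0.]] * 200
)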
As can be seen, for 200 out of the 206 subjects all three annotators chose the same category. Now, computing Fleiss' kappa:
from statsmodels.stats.inter_rater import fleiss_kappa
fleiss_kappa(z)
0.062106000466964177
Why is the score so low even though for the majority of subjects (200/206) all annotators agree on the same category?
I think the statsmodels score is perfectly fine. The problem with your example is that the second category is picked almost all of the time, which by the definition of Fleiss' kappa means the probability that two random raters both pick the second category purely by chance is very high. Mathematically, following the notation of the Wikipedia article (which matches Fleiss' original paper exactly), Fleiss' kappa is defined as

\kappa = \frac{\bar{P} - \bar{P_e}}{1 - \bar{P_e}}

where

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right), \qquad \bar{P_e} = \sum_{j=1}^{k} p_j^2, \qquad p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}

for N subjects, n raters, k categories, and n_{ij} the number of raters who assigned subject i to category j.

In your case, \bar{P} and (and that's the problem) \bar{P_e} are both close to 1, so the excess of observed agreement over chance agreement is tiny relative to the maximum possible excess.
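To make this concrete, here is a minimal sketch (assuming z is the 206x3 count matrix from the question) that computes \bar{P} and \bar{P_e} by hand:

import numpy as np

N, k = z.shape                  # 206 subjects, 3 categories
n = int(z[0].sum())             # 3 raters per subject

# Per-subject observed agreement P_i and its mean P_bar
P_i = (np.sum(z**2, axis=1) - n) / (n * (n - 1))
P_bar = P_i.mean()              # ~0.979

# Chance agreement P_e_bar from the overall category proportions p_j
p_j = z.sum(axis=0) / (N * n)   # ~[0.002, 0.989, 0.010]
P_e_bar = np.sum(p_j**2)        # ~0.978

kappa = (P_bar - P_e_bar) / (1 - P_e_bar)  # ~0.0621, matching fleiss_kappa(z)

Both means come out around 0.98, so the numerator is roughly 0.0014 against a denominator of roughly 0.022, which is exactly where the 0.062 comes from.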
A solution to your problem would be for the raters to also agree on the other two categories. For example, change your setup so that you have 306 subjects, still with 3 categories and 3 raters. Assume the annotations of the first 6 subjects are the same as in your example. Then for the next 100 subjects all 3 raters agree on category 1, for the next 100 subjects all agree on category 2, and for the last 100 subjects all agree on category 3. Now the probability that two raters end up with the same rating by chance is much lower, since the overall number of ratings per category is much more balanced. For this exact example, Fleiss' kappa is 0.9787.
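You can verify this with statsmodels; here is a sketch that builds that balanced table (the name z_balanced is chosen here just for illustration):

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# First 6 subjects exactly as in the question, then 100 unanimous
# subjects for each of the three categories.
z_balanced = np.array(
    [[0., 2., 1.]] * 4 + [[1., 1., 1.]] + [[0., 2., 1.]]
    + [[3., 0., 0.]] * 100   # all raters pick category 1
    + [[0., 3., 0.]] * 100   # all raters pick category 2
    + [[0., 0., 3.]] * 100   # all raters pick category 3
)

fleiss_kappa(z_balanced)  # ~0.9787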