Review Article

Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Table 3

Statistics for calculating interrater agreement.

Method | Description
Joint-probability of agreement (percentage agreement) [20] | The simplest measure: the number of times each rating is assigned by each assessor is divided by the total number of ratings (a worked sketch follows the table).
Cohen’s kappa [21] | A statistical measure of interrater agreement between two raters. It is more robust than percentage agreement because it corrects for agreement occurring by chance (see the same sketch after the table).
Fleiss’ kappa [22] | An extension of Cohen’s kappa that measures agreement among any number of raters, not only two (a second sketch follows the table).
Krippendorff’s alpha [23] | A measure based on the overall distribution of the judgments, regardless of which assessor produced each judgment.
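To make the first two measures concrete, the following minimal Python sketch (our own illustration, not drawn from the reviewed studies) computes the joint probability of agreement and Cohen’s kappa for two hypothetical crowd workers judging binary document relevance. The worker labels, data, and function names are assumptions introduced only for this example.

```python
from collections import Counter

def percentage_agreement(rater_a, rater_b):
    """Joint probability of agreement: fraction of items rated identically."""
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    p_observed = percentage_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Expected chance agreement: probability that both raters independently
    # assign the same category, summed over all categories.
    categories = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary relevance judgments (1 = relevant, 0 = not relevant)
# from two crowd workers over the same ten documents.
worker_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
worker_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(f"Percentage agreement: {percentage_agreement(worker_1, worker_2):.2f}")
print(f"Cohen's kappa:        {cohens_kappa(worker_1, worker_2):.2f}")
```

On this toy data the two workers agree on 8 of 10 documents (percentage agreement 0.80), while Cohen’s kappa drops to about 0.58 once chance agreement is discounted, which illustrates why kappa is considered the more robust of the two.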
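The second sketch, again purely illustrative, extends the idea to Fleiss’ kappa for more than two raters. It assumes that every item is judged by the same number of raters and that the input is an item-by-category count matrix; both assumptions are made only for this example.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each judged by the same number of raters.

    `ratings` is an N x k matrix: ratings[i][j] is the number of raters who
    assigned item i to category j.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])          # raters per item (assumed constant)
    n_total = n_items * n_raters
    n_categories = len(ratings[0])

    # p_j: overall proportion of all judgments falling in category j.
    p_j = [sum(row[j] for row in ratings) / n_total for j in range(n_categories)]

    # P_i: extent of agreement among the raters on item i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 documents, each judged by 5 workers, with
# categories [not relevant, relevant].
matrix = [
    [1, 4],
    [0, 5],
    [3, 2],
    [1, 4],
]
print(f"Fleiss' kappa: {fleiss_kappa(matrix):.2f}")
```

Krippendorff’s alpha is not sketched here because it additionally handles missing judgments and different levels of measurement, which would make the example considerably longer.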