Review Article

Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Table 3

Statistics for calculating interrater agreement.

Method | Description
Joint-probability of agreement (percentage agreement) [20] | The simplest measure: the number of times each rating is assigned by each assessor is divided by the total number of ratings (a worked sketch follows the table).
Cohen’s kappa [21] | A statistical measure of interrater agreement between two raters. It is more robust than percentage agreement because it corrects for agreement occurring by chance (see the same sketch after the table).
Fleiss’ kappa [22] | An extension of Cohen’s kappa that measures agreement among any number of raters, not only two (a second sketch follows the table).
Krippendorff’s alpha [23] | A measure based on the overall distribution of the judgments, regardless of which assessor produced each judgment.
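To make the first two measures concrete, the following minimal Python sketch (our own illustration, not drawn from the reviewed studies) computes the joint probability of agreement and Cohen’s kappa for two hypothetical crowd workers judging binary document relevance. The worker labels, data, and function names are assumptions introduced only for this example.

```python
from collections import Counter

def percentage_agreement(rater_a, rater_b):
    """Joint probability of agreement: fraction of items rated identically."""
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    p_observed = percentage_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Expected chance agreement: probability that both raters independently
    # assign the same category, summed over all categories.
    categories = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary relevance judgments (1 = relevant, 0 = not relevant)
# from two crowd workers over the same ten documents.
worker_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
worker_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(f"Percentage agreement: {percentage_agreement(worker_1, worker_2):.2f}")
print(f"Cohen's kappa:        {cohens_kappa(worker_1, worker_2):.2f}")
```

On this toy data the two workers agree on 8 of 10 documents (percentage agreement 0.80), while Cohen’s kappa drops to about 0.58 once chance agreement is discounted, which illustrates why kappa is considered the more robust of the two.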
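The second sketch, again purely illustrative, extends the idea to Fleiss’ kappa for more than two raters. It assumes that every item is judged by the same number of raters and that the input is an item-by-category count matrix; both assumptions are made only for this example.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each judged by the same number of raters.

    `ratings` is an N x k matrix: ratings[i][j] is the number of raters who
    assigned item i to category j.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])          # raters per item (assumed constant)
    n_total = n_items * n_raters
    n_categories = len(ratings[0])

    # p_j: overall proportion of all judgments falling in category j.
    p_j = [sum(row[j] for row in ratings) / n_total for j in range(n_categories)]

    # P_i: extent of agreement among the raters on item i.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 documents, each judged by 5 workers, with
# categories [not relevant, relevant].
matrix = [
    [1, 4],
    [0, 5],
    [3, 2],
    [1, 4],
]
print(f"Fleiss' kappa: {fleiss_kappa(matrix):.2f}")
```

Krippendorff’s alpha is not sketched here because it additionally handles missing judgments and different levels of measurement, which would make the example considerably longer.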