Fixed-Effects Modeling of Cohen's Weighted Kappa for Bivariate Multinomial Data: A Perspective of Generalized Inverse

Yang, Jingyun; Chinchilli, Vernon M.

doi:https://doi.org/10.1155/2011/603856

Journal of Probability and Statistics

Research Article | Open Access

Volume 2011 | Article ID 603856 | https://doi.org/10.1155/2011/603856

Jingyun Yang¹ and Vernon M. Chinchilli²

Academic Editor: Man Lai Tang

Received 27 Aug 2011

Accepted 18 Oct 2011

Published 10 Dec 2011

Abstract

Cohen's kappa and weighted kappa statistics are the conventional methods used frequently in measuring agreement for categorical responses. In this paper, through the perspective of a generalized inverse, we propose an alternative general framework to the fixed-effects modeling of Cohen's weighted kappa proposed by Yang and Chinchilli (2011). Properties of the proposed method are provided. Small-sample performance is investigated through bootstrap simulation studies, which demonstrate good performance of the proposed method. When there are only two categories, the proposed method reduces to Cohen's kappa.

1. Introduction

Measurement of agreement is used widely in diverse areas of scientific research to assess the reproducibility of a new assay, instrument, or method, the acceptability of a new or generic process and methodology, as well as in method comparison. Examples include the agreement when two or more methods or raters simultaneously assess a response [1, 2] or when one rater makes the same assessment at two times [3], the agreement of a newly developed method with a gold standard method [4], and the agreement of observed values with predicted values [5].

Traditionally, kappa [6] is used for measurement of agreement for categorical responses and weighted kappa [7] for ordinal responses. The concordance correlation coefficient [8, 9] is often used when the responses are continuous [10, 11]. In this paper, we focus on the measurement of agreement for categorical responses, that is, when the responses are either nominal or ordinal.

Suppose there is a bivariate response $(X, Y)$, where each of $X$ and $Y$ yields a categorical response. For convenience, the categories are denoted as $0, 1, \ldots, k$. Suppose the bivariate distribution is defined as in Table 1.

Cohen [6] proposed a coefficient of agreement, called the kappa coefficient, for nominal scales of responses, which is defined by

$$\kappa = \frac{p_o - p_c}{1 - p_c},$$

where $p_o$ is the observed proportion of agreement and $p_c$ is the proportion of agreement expected by chance.

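To deal with ordinal responses, Cohen [7] proposed a weighted version of the kappa statistic:

$$\kappa_w = 1 - \frac{q_o}{q_c},$$

with

$$q_o = \frac{\sum_i \sum_j v_{ij}\, p_{oij}}{v_{\max}}, \qquad q_c = \frac{\sum_i \sum_j v_{ij}\, p_{cij}}{v_{\max}},$$

where $v_{ij}$ is the nonnegative weight assigned to the disagreement for the cell $ij$, $v_{\max}$ is the weight assigned to the maximum disagreement, $p_{oij}$ is the proportion of joint judgement in the $ij$ cell, and $p_{cij}$ is the proportion expected by chance in the same cell.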
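The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names `cohen_kappa` and `weighted_kappa` and the example table are hypothetical, and the input is assumed to be a square table of joint proportions that sums to one.

```python
from typing import Sequence

Table = Sequence[Sequence[float]]


def _margins(p: Table):
    """Row and column marginal proportions of a square table."""
    n = len(p)
    row = [sum(p[i]) for i in range(n)]
    col = [sum(p[i][j] for i in range(n)) for j in range(n)]
    return row, col


def cohen_kappa(p: Table) -> float:
    """Cohen's kappa: (p_o - p_c) / (1 - p_c)."""
    row, col = _margins(p)
    p_o = sum(p[i][i] for i in range(len(p)))   # observed agreement
    p_c = sum(r * c for r, c in zip(row, col))  # chance agreement
    return (p_o - p_c) / (1 - p_c)


def weighted_kappa(p: Table, v: Table) -> float:
    """Cohen's weighted kappa, 1 - q_o / q_c, with disagreement weights v."""
    n = len(p)
    row, col = _margins(p)
    v_max = max(max(r) for r in v)
    q_o = sum(v[i][j] * p[i][j] for i in range(n) for j in range(n)) / v_max
    q_c = sum(v[i][j] * row[i] * col[j]
              for i in range(n) for j in range(n)) / v_max
    return 1 - q_o / q_c


# Hypothetical 2 x 2 table of joint proportions (illustration only).
p2 = [[0.4, 0.1], [0.1, 0.4]]
print(cohen_kappa(p2))
```

With disagreement weights equal to 0 on the diagonal and 1 elsewhere, `weighted_kappa` returns the same value as `cohen_kappa`, which is a convenient sanity check on any implementation.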

Despite the fact that they may fail to work well in certain situations [12, 13], Cohen’s kappa and weighted kappa coefficients have been widely used in various areas as measures of agreement for categorical responses, partly due to their ease of calculation; see, for example, Justice et al. [14], Kerosuo and Ørstavik [15], Landers et al. [16], and Suebnukarn et al. [17]. Meanwhile, researchers have been exploring alternative methods for measuring agreement for categorical data; see Agresti [18], Barlow [19], Broemeling [20], Carrasco and Jover [21], Dou et al. [22], Fanshawe et al. [23], Graham and Jackson [24], King and Chinchilli [25], Kraemer [26], Landis and Koch [27], Laurent [28], Lin et al. [29], and Svensson [30], to name only a few. It is worth noting that, for ordinal or binary data, the concordance correlation coefficient (CCC) [8] based on the squared function of distance reduces to Cohen’s weighted kappa when the Fleiss-Cohen weight function is used, and the CCC based on the absolute value function of distance to the power 1 is equivalent to Cohen’s weighted kappa if the Cicchetti-Allison weight function is used [25]. In addition, the variance of the CCC is identical to the one given by Fleiss et al. [31] if the Fleiss-Cohen weight function and the GEE approach are used [29].

Yang and Chinchilli developed a fixed-effects modeling of kappa and weighted kappa [32, 33]. Their simulation studies and illustrative examples show good performance of the proposed methods. In this paper, we propose another version of fixed-effects modeling of Cohen’s kappa and weighted kappa through the perspective of a generalized inverse. We stress that the aim of this paper is not to demonstrate the advantage of the proposed method in constructing an agreement coefficient. Rather, it provides an alternative form of a general framework and a direction under which new and better matrix functions could be explored.

In Section 2 we provide the derivation and properties of the proposed method. In Section 3, we compare the small sample performance of the proposed method with that of three other agreement coefficients through bootstrap simulation studies. Examples are given in Section 4 followed by a conclusion.

2. The Generalized Inverse Version of Fixed-Effects Modeling of Kappa and Weighted Kappa

2.1. Derivation

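Using notation similar to that of Yang and Chinchilli [32], we let $X$ and $Y$ be categorical variables, with the categories being designated as $0, 1, \ldots, k$, as displayed in Table 1. Let $p_{ij} = \Pr(X = i, Y = j)$, $i, j = 0, 1, \ldots, k$, denote the bivariate probability, and let $p_{i\cdot} = \Pr(X = i)$ and $p_{\cdot j} = \Pr(Y = j)$ denote the marginal probabilities. Let $I(\cdot)$ denote the indicator function, and define the $(k+1) \times 1$ vectors as

$$Z_X = [I(X = 0), I(X = 1), \ldots, I(X = k)]^T, \qquad Z_Y = [I(Y = 0), I(Y = 1), \ldots, I(Y = k)]^T. \tag{2.1}$$

Define the $(k+1) \times (k+1)$ matrices $P_D$ and $P_I$ as

$$P_D = E\bigl[(Z_X - Z_Y)(Z_X - Z_Y)^T\bigr], \qquad P_I = E_{\mathrm{Ind}}\bigl[(Z_X - Z_Y)(Z_X - Z_Y)^T\bigr], \tag{2.2}$$

where $E_{\mathrm{Ind}}$ denotes expectation under the assumption that $X$ and $Y$ are independent. It easily can be verified that $P_D$ and $P_I$ are nonnegative definite matrices, denoted by $P_D \ge 0$ and $P_I \ge 0$.

Yang and Chinchilli [32] show that

$$0 \le P_D \le 2P_I. \tag{2.3}$$

$P_D$ and $P_I$ are $(k+1) \times (k+1)$ matrices of rank $k$ because $P_D 1 = 0$ and $P_I 1 = 0$, where 1 is a $(k+1) \times 1$ vector of unit values. Thus, $P_I 1 = 0 \cdot 1$, indicating that 0 is an eigenvalue for $P_I$ with corresponding (standardized) eigenvector $(k+1)^{-1/2} 1$. The same is true for $P_D$ as well. Let $E_I = [E_{I1}, (k+1)^{-1/2} 1]$ denote the $(k+1) \times (k+1)$ matrix of eigenvectors for $P_I$. By definition, $E_I$ is orthogonal, so $E_I E_I^T = I$, which yields that $E_{I1} E_{I1}^T = I - (k+1)^{-1} 1 1^T$. Because $P_I$ is not of full rank, its inverse does not exist, but it does have a Moore-Penrose generalized inverse [34, 35], denoted by $P_I^{+}$. If $\Lambda_I$ denotes the $k \times k$ diagonal matrix of nonzero eigenvalues for $P_I$, then the Moore-Penrose generalized inverse of $P_I$ is $P_I^{+} = E_{I1} \Lambda_I^{-1} E_{I1}^T$ and the Moore-Penrose generalized inverse of $P_I^{1/2}$ is $(P_I^{1/2})^{+} = E_{I1} \Lambda_I^{-1/2} E_{I1}^T$.

It easily can be verified that $(P_I^{1/2})^{+} P_I (P_I^{1/2})^{+} = E_{I1} E_{I1}^T$, which implies that

$$0 \le (P_I^{1/2})^{+} P_D (P_I^{1/2})^{+} \le 2 E_{I1} E_{I1}^T, \tag{2.4}$$

where $A \le B$ means that $B - A$ is nonnegative definite.

Since $E_{I1} E_{I1}^T = I - (k+1)^{-1} 1 1^T$, (2.4) can be rewritten as

$$0 \le (P_I^{1/2})^{+} P_D (P_I^{1/2})^{+} \le 2\bigl(I - (k+1)^{-1} 1 1^T\bigr). \tag{2.5}$$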
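The matrix construction in this derivation can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the disagreement matrix is $E[(Z_X - Z_Y)(Z_X - Z_Y)^T]$ built from indicator vectors, so that it equals $\mathrm{diag}(p_{i\cdot}) + \mathrm{diag}(p_{\cdot j}) - P - P^T$ for a joint probability table $P$, with the independence version obtained by replacing $P$ with the outer product of the margins. The names `disagreement_matrices` and `pinv_sqrt` are hypothetical.

```python
import numpy as np


def disagreement_matrices(p):
    """Return (P_D, P_I) for a square table p of joint probabilities."""
    p = np.asarray(p, dtype=float)
    px, py = p.sum(axis=1), p.sum(axis=0)   # marginal probabilities
    base = np.diag(px) + np.diag(py)
    P_D = base - p - p.T                    # observed joint distribution
    indep = np.outer(px, py)                # joint distribution under independence
    P_I = base - indep - indep.T
    return P_D, P_I


def pinv_sqrt(S, tol=1e-10):
    """Moore-Penrose inverse of the symmetric square root of a
    nonnegative definite matrix S, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    keep = vals > tol                       # drop the (numerically) zero eigenvalues
    return (vecs[:, keep] / np.sqrt(vals[keep])) @ vecs[:, keep].T


# Hypothetical 3 x 3 table (illustration only).
p = np.array([[0.30, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.20]])
P_D, P_I = disagreement_matrices(p)
R = pinv_sqrt(P_I)
m = p.shape[0]
print(np.allclose(P_I @ np.ones(m), 0))                           # rows sum to zero
print(np.allclose(R @ P_I @ R, np.eye(m) - np.ones((m, m)) / m))  # centering matrix
```

Using `numpy.linalg.eigh` keeps the computation aligned with the eigenvector argument in the text; `numpy.linalg.pinv` applied to the second matrix gives the same generalized inverse as `R @ R`.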

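Suppose $g(A)$ is a function of the symmetric matrix $A$ and satisfies the following two definitions [32].

Definition 2.1. Suppose $g(\cdot)$ is a $\mathbb{R}^{(k+1) \times (k+1)} \to \mathbb{R}$ function. $g(\cdot)$ is said to be nondecreasing if for all nonnegative definite matrices $A$, $B$, where $A \le B$, one has $g(A) \le g(B)$.

Definition 2.2. Suppose $g(\cdot)$ is a $\mathbb{R}^{(k+1) \times (k+1)} \to \mathbb{R}$ function. $g(\cdot)$ is said to be a scale-equivariant function [36] if $g(cA) = c\,g(A)$ for all matrices $A$ and constant $c > 0$.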

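Then a class of agreement coefficients is given by

$$\kappa_g^{*} = 1 - \frac{g\bigl((P_I^{1/2})^{+} P_D (P_I^{1/2})^{+}\bigr)}{g\bigl(I - (k+1)^{-1} 1 1^T\bigr)}. \tag{2.6}$$

If $g(\cdot) = \mathrm{tr}(\cdot)$, then

$$\kappa_{\mathrm{tr}}^{*} = 1 - \frac{\mathrm{tr}(P_I^{+} P_D)}{k}. \tag{2.7}$$

If we set $g(A) = \mathrm{tr}(WA)$, then

$$\kappa_{\mathrm{tr}}^{*}(W) = 1 - \frac{\mathrm{tr}\bigl(W (P_I^{1/2})^{+} P_D (P_I^{1/2})^{+}\bigr)}{\mathrm{tr}\bigl(W (I - (k+1)^{-1} 1 1^T)\bigr)}, \tag{2.8}$$

where $W$ is a nonnegative definite symmetric matrix of agreement weights with $w_{ij} = 1$ for $i = j$ and $0 \le w_{ij} < 1$ for $i \ne j$. Two frequently used weighting schemes are the Cicchetti-Allison weights [37] and the Fleiss-Cohen weights [38]:

$$w_{ij} = 1 - \frac{|i - j|}{k}, \qquad w_{ij} = 1 - \frac{(i - j)^2}{k^2}, \qquad i, j = 0, 1, \ldots, k. \tag{2.9}$$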
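For concreteness, the two weighting schemes can be generated as follows. This is a small sketch that assumes categories are labeled $0, 1, \ldots, k$, so that the largest possible distance between two categories is $k$; the function names are hypothetical.

```python
def cicchetti_allison(k):
    """Linear agreement weights: w_ij = 1 - |i - j| / k."""
    return [[1 - abs(i - j) / k for j in range(k + 1)] for i in range(k + 1)]


def fleiss_cohen(k):
    """Quadratic agreement weights: w_ij = 1 - (i - j)^2 / k^2."""
    return [[1 - (i - j) ** 2 / k ** 2 for j in range(k + 1)]
            for i in range(k + 1)]


# Three categories (k = 2): adjacent categories get partial credit.
for row in cicchetti_allison(2):
    print(row)
for row in fleiss_cohen(2):
    print(row)
```

Both schemes give full credit on the diagonal and none at the maximum distance; the quadratic scheme penalizes near misses less heavily than the linear one. With two categories (`k = 1`) both reduce to the identity matrix.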

We can also use $\lambda_1(A)$, the largest eigenvalue of $A$, as another function for $g(A)$, which leads to a new agreement coefficient $\kappa_{\lambda_1}^{*}$.
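A numerical sketch of the trace and largest-eigenvalue choices is given below. It rests on an assumption that should be read as such: the coefficient is taken to be one minus the ratio of the chosen matrix function applied to the standardized disagreement matrix (the disagreement matrix pre- and post-multiplied by the generalized inverse of the square root of the independence matrix) and to the centering matrix $I - 11^T/(k+1)$, without weights. The function name `gi_kappa` and its arguments are hypothetical.

```python
import numpy as np


def gi_kappa(p, g="tr", tol=1e-10):
    """Generalized-inverse agreement coefficient (unweighted sketch).

    p : square table of joint probabilities
    g : "tr" for the trace, "lambda1" for the largest eigenvalue
    """
    p = np.asarray(p, dtype=float)
    m = p.shape[0]                              # number of categories, k + 1
    px, py = p.sum(axis=1), p.sum(axis=0)
    base = np.diag(px) + np.diag(py)
    P_D = base - p - p.T
    indep = np.outer(px, py)
    P_I = base - indep - indep.T

    # Generalized inverse of the square root of P_I from its eigendecomposition.
    vals, vecs = np.linalg.eigh(P_I)
    keep = vals > tol
    R = (vecs[:, keep] / np.sqrt(vals[keep])) @ vecs[:, keep].T

    A = R @ P_D @ R                             # standardized disagreement matrix
    A = (A + A.T) / 2                           # remove rounding asymmetry
    M = np.eye(m) - np.ones((m, m)) / m         # centering matrix

    if g == "tr":
        num, den = np.trace(A), np.trace(M)
    elif g == "lambda1":
        num = np.linalg.eigvalsh(A)[-1]         # eigvalsh returns ascending order
        den = np.linalg.eigvalsh(M)[-1]
    else:
        raise ValueError("g must be 'tr' or 'lambda1'")
    return 1 - num / den


# Hypothetical 2 x 2 table: both choices should agree with Cohen's kappa.
p2 = [[0.4, 0.1], [0.1, 0.4]]
print(gi_kappa(p2, "tr"), gi_kappa(p2, "lambda1"))
```

The sketch assumes the independence matrix has rank $k$, which excludes tables with an empty category in both margins. Under that assumption, a table with independent margins gives a value of zero and a purely diagonal table gives a value of one for either choice of `g`.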

Therefore, including the coefficients that we propose in this paper, there are four novel coefficients to assess agreement for categorical data. In practice, all these indices can be estimated using their sample counterparts. In this paper, a hat (“$\hat{\ }$”) is used to denote the estimate of an index.

2.2. Properties

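In general, if $g(\cdot)$ satisfies the properties in the two aforementioned definitions, then $\kappa_g^{*}$ has the following properties:
(1) $-1 \le \kappa_g^{*} \le 1$,
(2) $\kappa_g^{*} = 1$ if $X = Y$ with probability one, that is, $p_{ij} = 0$ for all $i \ne j$,
(3) $\kappa_g^{*} = -1$ if and only if $p_{ii} = 0$ for each $i = 0, 1, \ldots, k$ and $p_{ij} = p_{ji} = 0.5$ for one choice of $i \ne j$, excluding the set of degenerate cases,
(4) $\kappa_g^{*} = 0$ if $X$ and $Y$ are independent.

Proof. Without loss of generality, we set $W = I$. The proofs for other weight matrices are slightly more complex.
(1) From the previous proof, we know that $0 \le (P_I^{1/2})^{+} P_D (P_I^{1/2})^{+} \le 2(I - (k+1)^{-1} 1 1^T)$. Since $g(\cdot)$ is a nondecreasing and scale-equivariant function, we have $0 \le g\bigl((P_I^{1/2})^{+} P_D (P_I^{1/2})^{+}\bigr) \le 2\,g\bigl(I - (k+1)^{-1} 1 1^T\bigr)$. It is then straightforward to see that $-1 \le \kappa_g^{*} \le 1$.
(2) If $X = Y$ with probability one, that is, $p_{ij} = 0$ for all $i \ne j$, then $P_D = 0$ [32]. It is then straightforward to see that $\kappa_g^{*} = 1$.
(3) Excluding the set of degenerate cases, if $p_{ii} = 0$ for each $i = 0, 1, \ldots, k$ and $p_{ij} = p_{ji} = 0.5$ for one choice of $i \ne j$, then $P_D = 2P_I$ [32]. Hence, $(P_I^{1/2})^{+} P_D (P_I^{1/2})^{+} = 2(P_I^{1/2})^{+} P_I (P_I^{1/2})^{+} = 2(I - (k+1)^{-1} 1 1^T)$ and, by scale equivariance, $g\bigl((P_I^{1/2})^{+} P_D (P_I^{1/2})^{+}\bigr) = 2\,g\bigl(I - (k+1)^{-1} 1 1^T\bigr)$. Hence, $\kappa_g^{*} = -1$.
(4) If $X$ and $Y$ are independent, then $P_D = P_I$ [32]. It is straightforward to see that $\kappa_g^{*} = 0$.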

Lemma 2.3. When $k = 1$, that is, when there are only two categories, the four coefficients are equivalent and they all reduce to Cohen’s kappa.

For proof of the lemma, please see the appendix.

3. Small Sample Performance

3.1. Simulation Design

In this section, we compare the performance of Cohen’s weighted kappa ($\kappa_w$) with that of the three other coefficients discussed in Section 2.1. We used the bootstrap to estimate the sample variances of these three coefficients because the asymptotic distributions of their estimates are too complex.

We used Matlab (version 7.10.0.499, The MathWorks Inc., Natick, Mass) to generate samples following multinomial distributions. Twelve different distributions were employed, representing different levels of agreement (from poor to excellent) and different numbers of categories in the responses (from 3 to 5). For details about the distributions and the design of the simulations, see Yang and Chinchilli [32]. We calculated the true values of the four coefficients as well as their estimates, biases, mean square errors, empirical variances, and bootstrap variances. The weights we used are the Cicchetti-Allison weights.

We ran 5000 simulations for each sample size under each distribution, with 1000 bootstrap replicates for each simulation. The observed variances of the proposed method and the mean bootstrap variances are compared. We follow Yang and Chinchilli [32] in the interpretation of the results. That is, agreement measures greater than 0.75 or so may be taken to represent excellent agreement beyond chance, values between 0.40 and 0.75 may be taken to represent fair to good agreement beyond chance, and values below 0.40 or so may be taken to represent poor agreement.
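One replicate of such a simulation design can be sketched in plain Python. This is not the authors' Matlab code: for brevity it uses unweighted Cohen's kappa as the statistic, a nonparametric bootstrap that resamples from the observed cell proportions, and a hypothetical cell-probability table; all function names are illustrative. The `interpret` helper encodes the interpretation rule quoted above, with the boundaries at 0.40 and 0.75 assigned to the middle band as an assumption.

```python
import random
from statistics import variance


def kappa_from_counts(counts):
    """Cohen's kappa estimated from a square table of cell counts.
    Assumes the table is not concentrated in a single cell (p_c < 1)."""
    m = len(counts)
    n = sum(map(sum, counts))
    row = [sum(counts[i]) / n for i in range(m)]
    col = [sum(counts[i][j] for i in range(m)) / n for j in range(m)]
    p_o = sum(counts[i][i] for i in range(m)) / n
    p_c = sum(row[i] * col[i] for i in range(m))
    return (p_o - p_c) / (1 - p_c)


def draw_table(probs, n, rng):
    """Draw one multinomial sample of size n and return it as cell counts."""
    m = len(probs)
    cells = [(i, j) for i in range(m) for j in range(m)]
    weights = [probs[i][j] for i, j in cells]
    counts = [[0] * m for _ in range(m)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        counts[i][j] += 1
    return counts


def bootstrap_variance(counts, n_boot, rng):
    """Bootstrap variance of the kappa estimate for one observed table."""
    n = sum(map(sum, counts))
    p_hat = [[c / n for c in r] for r in counts]
    reps = [kappa_from_counts(draw_table(p_hat, n, rng)) for _ in range(n_boot)]
    return variance(reps)


def interpret(value):
    """Interpretation rule used in the text (boundary handling assumed)."""
    if value > 0.75:
        return "excellent"
    if value >= 0.40:
        return "fair to good"
    return "poor"


rng = random.Random(2011)
true_p = [[0.30, 0.05, 0.05], [0.05, 0.20, 0.05], [0.05, 0.05, 0.20]]
sample = draw_table(true_p, 100, rng)
est = kappa_from_counts(sample)
print(est, interpret(est), bootstrap_variance(sample, 1000, rng))
```

Repeating the last four lines many times and comparing the variance of the estimates across replicates with the average bootstrap variance reproduces the kind of comparison described in the text.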

3.2. Simulation Results and Discussion

Table 2 gives the simulation results for the four coefficients with the weighting scheme being Cicchetti-Allison weights (the simulation results are split into two tables because they are too unwieldy to fit on one page).

In almost all the cases, the four agreement coefficients give the same classification in terms of the magnitude of the degrees of agreement. An interesting case is case 6 where the degree of agreement is around the boundary of poor and medium agreement. Cohen’s weighted kappa and indicate close to medium agreement while and indicate poor agreement, according to our rule of interpretation of the results. Biases of all the four coefficients are small, and, as sample sizes increase, they become almost negligible. The bootstrap turns out to be a good alternative as an estimation of the sample variances, which in many cases gives the estimates of the variances equivalent to the observed values.

It also should be noted that the true values of coefficients within Table 2(a) can vary, especially for cases 5–9 () and cases 10–12 (). This suggests that, as table size increases, the coefficients could lead to different interpretations.

4. Examples

In this section, we use two empirical examples to illustrate the application and compare the results of the four agreement coefficients. The first example is from Brenner et al. [39], who examined the agreement of cause-of-death data for breast cancer patients between the Ontario Cancer Registry (OCR) and classification results from Mount Sinai Hospital (MSH). OCR, established in 1964, collects information about all new cases of cancer in the province of Ontario, while MSH has a systematic and rigorous monitoring and follow-up of confirmed node-negative breast cancer patients. Due to its relatively complete and accurate data, MSH also provides specialist-determined cause of death. A total of 1648 patients entered the analysis through linking the OCR data to the MSH study patients via the OCR standard procedure. The purpose of the study is to examine the degree of agreement between OCR, which is often used in cancer studies, and MSH, which was taken as the reference standard. A missed cancer-related death is considered of greater importance. The data are summarized in Table 3.

The second example is from Jiménez-Navarro et al. [40], who studied the concordance of the oral glucose tolerance test (OGTT), proposed by the World Health Organization, in the diagnosis of diabetes mellitus (DM), which may possibly lead to a future epidemic of coronary disease. OGTT classifies a diagnosis result as DM, glucose intolerance, or normal. Eighty-eight patients admitted with an acute coronary syndrome who had no previous diagnosis of DM underwent percutaneous coronary revascularization and received OGTT the day after revascularization and one month after that. Researchers are interested in the reproducibility of OGTT performed at these two different time points. The data are summarized in Table 4.

We calculated all the four agreement coefficients as well as their corresponding 95% confidence intervals (CI). The 95% CI for Cohen’s weighted kappa was based on the large sample variance derived by Fleiss et al. [31]. 5000 bootstrap replicates were used in the calculation of the CI for the other three coefficients. Table 5 gives the results for the above two examples. In all the calculations, the Cicchetti-Allison weights were used (see (2.9)).
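A bootstrap confidence interval of this kind can be sketched as follows. The text does not say which bootstrap interval was used, so the percentile method here is an assumption, as are the function names and the count table, which is made up for illustration and is not the data of Table 3 or Table 4. The statistic is Cohen's weighted kappa written with Cicchetti-Allison agreement weights.

```python
import random


def weighted_kappa_counts(counts):
    """Weighted kappa from cell counts with agreement weights
    w_ij = 1 - |i - j| / k, where k + 1 is the number of categories."""
    m = len(counts)
    k = m - 1
    n = sum(map(sum, counts))
    row = [sum(r) / n for r in counts]
    col = [sum(counts[i][j] for i in range(m)) / n for j in range(m)]
    pairs = [(i, j) for i in range(m) for j in range(m)]
    p_o = sum((1 - abs(i - j) / k) * counts[i][j] / n for i, j in pairs)
    p_c = sum((1 - abs(i - j) / k) * row[i] * col[j] for i, j in pairs)
    return (p_o - p_c) / (1 - p_c)


def percentile_ci(counts, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling subjects with replacement."""
    rng = random.Random(seed)
    m = len(counts)
    cells = [(i, j) for i in range(m) for j in range(m)]
    weights = [counts[i][j] for i, j in cells]
    n = sum(weights)
    reps = []
    for _ in range(n_boot):
        boot = [[0] * m for _ in range(m)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            boot[i][j] += 1
        reps.append(weighted_kappa_counts(boot))
    reps.sort()
    lower = reps[int((1 - level) / 2 * n_boot)]
    upper = reps[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical 3 x 3 table of counts (illustration only).
table = [[30, 5, 1], [6, 25, 4], [2, 5, 22]]
print(weighted_kappa_counts(table), percentile_ci(table, n_boot=2000))
```

Resampling cells with probability proportional to their counts is equivalent to resampling the individual subjects, which is why the observed counts can be passed directly as weights.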

As can be seen from Table 5, there seems to be no substantial difference among all the four coefficients using the aforementioned interpretation rule. For example 1, all coefficients indicate that there is very strong agreement between OCR and MSH studies in classifying the cause of death for the 1648 cancer patients. The second example, however, indicates that OGTT, when taken at two different time points with a span of one month, has very poor reproducibility in the diagnosis of DM.

5. Conclusion

We developed a fixed-effects modeling of Cohen’s kappa and weighted kappa for bivariate multinomial data through the perspective of a generalized inverse, which provides an alternative framework to the one proposed by Yang and Chinchilli [33]. The proposed method also allows the application of different matrix functions, as long as they satisfy certain conditions. When there are only two categories in the response, the coefficient reduces to Cohen’s kappa. The proposed method is new and promising in that it allows the use of other matrix functions, such as the largest eigenvalue of a matrix, that might also yield reasonable, and possibly better, results. Exploration of such matrix functions is our future research topic.

Appendix

Proof of Lemma 2.3

Proof. (1) When , the Cicchetti-Allison weights become an identity matrix. Therefore,
(2) Since the weighting matrix is a matrix, the two eigenvalues of can be denoted by 0 and . Therefore, the largest eigenvalue is equal to . Similarly, it can be shown that the largest eigenvalue of is also equal to . Thus, .
(3) From the previous proof, we know that . When , is and is Because , where is the eigenvector of associated with the largest eigenvalue of .
But for the case , because , also is the eigenvector of associated with the largest eigenvalue of . Thus,
(4) From the derivation of , where , we know that it can be written as But the largest eigenvalue for the case is equal to the trace, so .
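As a hedged check of the two-category case, the following worked computation assumes that the trace-based generalized-inverse coefficient has the form $1 - \mathrm{tr}(P_I^{+} P_D)/k$, with categories labeled 0 and 1 (so $k = 1$), $P_D$ and $P_I$ the disagreement matrices under the observed and the independence distributions, and $p_o$, $p_c$ the observed and chance proportions of agreement. The symbols $J$ and $u$ are introduced here only for illustration.

```latex
\begin{align*}
J &= \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} = 2uu^T,
   \qquad u = \tfrac{1}{\sqrt{2}}(1, -1)^T, \\
P_D &= (p_{01} + p_{10})\,J = (1 - p_o)\,J, \\
P_I &= (p_{0\cdot}\,p_{\cdot 1} + p_{1\cdot}\,p_{\cdot 0})\,J = (1 - p_c)\,J, \\
P_I^{+} &= \frac{1}{2(1 - p_c)}\,uu^T, \\
\operatorname{tr}(P_I^{+} P_D)
  &= \frac{2(1 - p_o)}{2(1 - p_c)}\,\operatorname{tr}(uu^T uu^T)
   = \frac{1 - p_o}{1 - p_c}, \\
1 - \frac{\operatorname{tr}(P_I^{+} P_D)}{1}
  &= \frac{p_o - p_c}{1 - p_c}.
\end{align*}
```

The last line is Cohen's kappa. Because the standardized disagreement matrix is a multiple of $uu^T$ in this case, it has a single nonzero eigenvalue, so its largest eigenvalue equals its trace and the largest-eigenvalue version gives the same value.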

References

[1] S. M. Gregoire, U. J. Chaudhary, M. M. Brown et al., “The Microbleed Anatomical Rating Scale (MARS): reliability of a tool to map brain microbleeds,” Neurology, vol. 73, no. 21, pp. 1759–1766, 2009.
[2] A. Riaz, F. H. Miller, L. M. Kulik et al., “Imaging response in the primary index lesion and clinical outcomes following transarterial locoregional therapy for hepatocellular carcinoma,” Journal of the American Medical Association, vol. 303, no. 11, pp. 1062–1069, 2010.
[3] J. Johnson and J. A. Kline, “Intraobserver and interobserver agreement of the interpretation of pediatric chest radiographs,” Emergency Radiology, vol. 17, no. 4, pp. 285–290, 2010.
[4] J. F. Hamel, D. Foucaud, and S. Fanello, “Comparison of the automated oscillometric method with the gold standard doppler ultrasound method to access the ankle-brachial pressure index,” Angiology, vol. 61, no. 5, pp. 487–491, 2010.
[5] K. Y. Bilimoria, M. S. Talamonti, J. S. Tomlinson et al., “Prognostic score predicting survival after resection of pancreatic neuroendocrine tumors: analysis of 3851 patients,” Annals of Surgery, vol. 247, no. 3, pp. 490–500, 2008.
[6] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
[7] J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[8] L. I. Lin, “A concordance correlation coefficient to evaluate reproducibility,” Biometrics, vol. 45, no. 1, pp. 255–268, 1989.
[9] L. I. K. Lin, “Assay validation using the concordance correlation coefficient,” Biometrics, vol. 48, no. 2, pp. 599–604, 1992.
[10] P. B. Barbosa, M. M. Franco, F. O. Souza, F. I. Antônio, T. Montezuma, and C. H. J. Ferreira, “Comparison between measurements obtained with three different perineometers,” Clinics, vol. 64, no. 6, pp. 527–533, 2009.
[11] X. C. Wang, P. Y. Gao, J. Xue, G. R. Liu, and L. Ma, “Identification of infarct core and penumbra in acute stroke using CT perfusion source images,” American Journal of Neuroradiology, vol. 31, no. 1, pp. 34–39, 2010.
[12] D. V. Cicchetti and A. R. Feinstein, “High agreement but low kappa: II. Resolving the paradoxes,” Journal of Clinical Epidemiology, vol. 43, no. 6, pp. 551–558, 1990.
[13] A. R. Feinstein and D. V. Cicchetti, “High agreement but low kappa: I. The problems of two paradoxes,” Journal of Clinical Epidemiology, vol. 43, no. 6, pp. 543–549, 1990.
[14] A. C. Justice, J. A. Berlin, S. W. Fletcher, R. H. Fletcher, and S. N. Goodman, “Do readers and peer reviewers agree on manuscript quality?” Journal of the American Medical Association, vol. 272, no. 2, pp. 117–119, 1994.
[15] E. Kerosuo and D. Ørstavik, “Application of computerised image analysis to monitoring endodontic therapy: reproducibility and comparison with visual assessment,” Dentomaxillofacial Radiology, vol. 26, no. 2, pp. 79–84, 1997.
[16] S. Landers, W. Bekheet, and L. C. Falls, “Cohen's weighted kappa statistic in quality control-quality assurance procedures: application to network-level contract pavement surface condition surveys in British Columbia, Canada,” Transportation Research Record, no. 1860, pp. 103–108, 2003.
[17] S. Suebnukarn, S. Ngamboonsirisingh, and A. Rattanabanlang, “A systematic evaluation of the quality of meta-analyses in endodontics,” Journal of Endodontics, vol. 36, no. 4, pp. 602–608, 2010.
[18] A. Agresti, “A model for agreement between ratings on an ordinal scale,” Biometrics, vol. 44, no. 2, pp. 539–548, 1988.
[19] W. Barlow, “Measurement of interrater agreement with adjustment for covariates,” Biometrics, vol. 52, no. 2, pp. 695–702, 1996.
[20] L. D. Broemeling, Bayesian Methods for Measures of Agreement, Chapman & Hall/CRC, Boca Raton, Fla, USA, 2009.
[21] J. L. Carrasco and L. Jover, “Concordance correlation coefficient applied to discrete data,” Statistics in Medicine, vol. 24, no. 24, pp. 4021–4034, 2005.
[22] W. Dou, Y. Ren, Q. Wu et al., “Fuzzy kappa for the agreement measure of fuzzy classifications,” Neurocomputing, vol. 70, no. 4–6, pp. 726–734, 2007.
[23] T. R. Fanshawe, A. G. Lynch, I. O. Ellis, A. R. Green, and R. Hanka, “Assessing agreement between multiple raters with missing rating information, applied to breast cancer tumour grading,” PLoS ONE, vol. 3, no. 8, Article ID e2925, 2008.
[24] P. Graham and R. Jackson, “The analysis of ordinal agreement data: beyond weighted kappa,” Journal of Clinical Epidemiology, vol. 46, no. 9, pp. 1055–1062, 1993.
[25] T. S. King and V. M. Chinchilli, “A generalized concordance correlation coefficient for continuous and categorical data,” Statistics in Medicine, vol. 20, no. 14, pp. 2131–2147, 2001.
[26] H. C. Kraemer, “Measurement of reliability for categorical data in medical research,” Statistical Methods in Medical Research, vol. 1, no. 2, pp. 183–199, 1992.
[27] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
[28] R. T. S. Laurent, “Evaluating agreement with a gold standard in method comparison studies,” Biometrics, vol. 54, no. 2, pp. 537–545, 1998.
[29] L. Lin, A. S. Hedayat, and W. Wu, “A unified approach for assessing agreement for continuous and categorical data,” Journal of Biopharmaceutical Statistics, vol. 17, no. 4, pp. 629–652, 2007.
[30] E. Svensson, “A coefficient of agreement adjusted for bias in paired ordered categorical data,” Biometrical Journal, vol. 39, no. 6, pp. 643–657, 1997.
[31] J. L. Fleiss, J. Cohen, and B. S. Everitt, “Large sample standard errors of kappa and weighted kappa,” Psychological Bulletin, vol. 72, no. 5, pp. 323–327, 1969.
[32] J. Yang and V. M. Chinchilli, “Fixed-effects modeling of Cohen's kappa for bivariate multinomial data,” Communications in Statistics—Theory and Methods, vol. 38, no. 20, pp. 3634–3653, 2009.
[33] J. Yang and V. M. Chinchilli, “Fixed-effects modeling of Cohen's weighted kappa for bivariate multinomial data,” Computational Statistics and Data Analysis, vol. 55, no. 2, pp. 1061–1070, 2011.
[34] E. H. Moore, “On the reciprocal of the general algebraic matrix,” Bulletin of the American Mathematical Society, vol. 26, pp. 394–395, 1920.
[35] R. A. Penrose, “A generalized inverse for matrices,” Proceedings of the Cambridge Philosophical Society, vol. 51, pp. 406–413, 1955.
[36] E. L. Lehmann, Theory of Point Estimation, Springer, New York, NY, USA, 1998.
[37] D. V. Cicchetti and T. Allison, “A new procedure for assessing reliability of scoring EEG sleep recordings,” American Journal of EEG Technology, vol. 11, pp. 101–109, 1971.
[38] J. L. Fleiss and J. Cohen, “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability,” Educational and Psychological Measurement, vol. 33, pp. 613–619, 1973.
[39] D. R. Brenner, M. C. Tammemägi, S. B. Bull, D. Pinnaduwaje, and I. L. Andrulis, “Using cancer registry data: agreement in cause-of-death data between the Ontario Cancer Registry and a longitudinal study of breast cancer patients,” Chronic Diseases in Canada, vol. 30, no. 1, pp. 16–19, 2009.
[40] M. F. Jiménez-Navarro, J. M. Garcia-Pinilla, L. Garrido-Sanchez et al., “Poor reproducibility of the oral glucose tolerance test in the diagnosis of diabetes during percutaneous coronary intervention,” International Journal of Cardiology, vol. 142, no. 3, pp. 245–249, 2010.

Copyright

Copyright © 2011 Jingyun Yang and Vernon M. Chinchilli. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
