Abstract

Cohen's kappa and weighted kappa are the conventional statistics for measuring agreement between categorical responses. In this paper, from the perspective of a generalized inverse, we propose an alternative to the general framework for fixed-effects modeling of Cohen's weighted kappa proposed by Yang and Chinchilli (2011). Properties of the proposed method are provided. Small-sample performance is investigated through bootstrap simulation studies, which demonstrate good performance of the proposed method. When there are only two categories, the proposed method reduces to Cohen's kappa.

1. Introduction

Measurement of agreement is used widely in diverse areas of scientific research to assess the reproducibility of a new assay, instrument, or method, the acceptability of a new or generic process and methodology, as well as in method comparison. Examples include the agreement when two or more methods or raters simultaneously assess a response [1, 2] or when one rater makes the same assessment at two times [3], the agreement of a newly developed method with a gold standard method [4], and the agreement of observed values with predicted values [5].

Traditionally, kappa [6] is used for measurement of agreement for categorical responses and weighted kappa [7] for ordinal responses. The concordance correlation coefficient [8, 9] is often used when the responses are continuous [10, 11]. In this paper, we focus on the measurement of agreement for categorical responses, that is, when the responses are either nominal or ordinal.

Suppose there is a bivariate response , where each of and yields a categorical response. For convenience, the categories are denoted as . Suppose the bivariate distribution is defined as in Table 1.

Cohen [6] proposed a coefficient of agreement, called the kappa coefficient, for nominal scales of responses, which is defined by $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance.
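As a minimal sketch of the kappa coefficient just described, the function below computes the observed proportion of agreement, the proportion expected by chance from the marginals, and their chance-corrected ratio. The function name `cohen_kappa` and the 2 x 2 table of counts are hypothetical and serve only as an illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: first rating, columns: second)."""
    c = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]            # joint proportions
    row_m = [sum(p[i]) for i in range(c)]                        # row marginals
    col_m = [sum(p[i][j] for i in range(c)) for j in range(c)]   # column marginals
    p_o = sum(p[i][i] for i in range(c))                         # observed agreement
    p_e = sum(row_m[i] * col_m[i] for i in range(c))             # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: 35 of 50 pairs agree.
counts = [[20, 5], [10, 15]]
kappa = cohen_kappa(counts)
```

For this table the observed agreement is 0.7 and the chance agreement is 0.5, so the coefficient is 0.4.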

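To deal with ordinal responses, Cohen [7] proposed a weighted version of the kappa statistic: $\kappa_w = 1 - q_o/q_e$, with $q_o = \sum_{i,j} v_{ij} p_{oij} / v_{\max}$ and $q_e = \sum_{i,j} v_{ij} p_{eij} / v_{\max}$, where $v_{ij}$ is the nonnegative weight assigned to the disagreement for the cell $(i, j)$, $v_{\max}$ is the weight assigned to the maximum disagreement, $p_{oij}$ is the proportion of joint judgement in the $(i, j)$ cell, and $p_{eij}$ is the proportion expected by chance in the same cell.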
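The weighted coefficient can be sketched in the same way, assuming disagreement weights that are zero on the diagonal and that the chance proportion in a cell is the product of the corresponding marginals. The names `weighted_kappa`, `perfect`, and `independent`, as well as the linear weights and both tables, are hypothetical choices made for illustration.

```python
def weighted_kappa(table, v):
    """Weighted kappa from a square table of counts and disagreement weights v."""
    c = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]
    row_m = [sum(p[i]) for i in range(c)]
    col_m = [sum(p[i][j] for i in range(c)) for j in range(c)]
    v_max = max(max(row) for row in v)  # weight of the maximum disagreement
    # Weighted observed and chance-expected disagreement.
    q_o = sum(v[i][j] * p[i][j] for i in range(c) for j in range(c)) / v_max
    q_e = sum(v[i][j] * row_m[i] * col_m[j] for i in range(c) for j in range(c)) / v_max
    return 1 - q_o / q_e


# Linear disagreement weights |i - j| for three ordered categories.
v = [[abs(i - j) for j in range(3)] for i in range(3)]

perfect = [[5, 0, 0], [0, 7, 0], [0, 0, 3]]      # all mass on the diagonal
independent = [[2, 1, 1], [4, 2, 2], [6, 3, 3]]  # outer product of the marginals

kw_perfect = weighted_kappa(perfect, v)
kw_indep = weighted_kappa(independent, v)
```

The two tables are chosen so that the coefficient takes its benchmark values: 1 under perfect agreement and 0 (up to rounding) under independence.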

Although they may fail to work well in certain situations [12, 13], Cohen’s kappa and weighted kappa coefficients have been widely used in various areas as measures of agreement for categorical responses, partly because they are easy to calculate; see, for example, Justice et al. [14], Kerosuo and Ørstavik [15], Landers et al. [16], and Suebnukarn et al. [17]. Meanwhile, researchers have been exploring alternative methods for measuring agreement for categorical data; see Agresti [18], Barlow [19], Broemeling [20], Carrasco and Jover [21], Dou et al. [22], Fanshawe et al. [23], Graham and Jackson [24], King and Chinchilli [25], Kraemer [26], Landis and Koch [27], Laurent [28], Lin et al. [29], and Svensson [30], to name only a few. It is worth noting that, for ordinal or binary data, the concordance correlation coefficient (CCC) [8] based on the squared function of distance reduces to Cohen’s weighted kappa when the Fleiss-Cohen weight function is used, and the CCC based on the absolute value function of distance to the power is equivalent to Cohen’s weighted kappa if the Cicchetti-Allison weight function is used [25]. In addition, the variance of the CCC is identical to the one given by Fleiss et al. [31] if the Fleiss-Cohen weight function and the GEE approach are used [29].

Yang and Chinchilli developed a fixed-effects modeling of kappa and weighted kappa [32, 33]. Their simulation studies and illustrative examples show good performance of the proposed methods. In this paper, we propose another version of fixed-effects modeling of Cohen’s kappa and weighted kappa from the perspective of a generalized inverse. We want to stress that the aim of this paper is not to demonstrate an advantage of the proposed method in constructing an agreement coefficient. Rather, it provides an alternative form of a general framework and a direction in which new and better matrix functions could be explored.

In Section 2 we provide the derivation and properties of the proposed method. In Section 3, we compare the small sample performance of the proposed method with that of three other agreement coefficients through bootstrap simulation studies. Examples are given in Section 4 followed by a conclusion.

2. The Generalized Inverse Version of Fixed-Effects Modeling of Kappa and Weighted Kappa

2.1. Derivation

Using similar notation as in Yang and Chinchilli [32], we let and be categorical variables, with the categories being designated as , as displayed in Table 1. Let , , denote the bivariate probability, and let and denote the marginal probabilities. Let denote the indicator function, and define the vectors as

Define the matrices and as where denotes expectation under the assumption that and are independent. It easily can be verified that and are nonnegative definite matrices, denoted by and .

Yang and Chinchilli [32] show that

and are matrices of rank because and , where 1 is a vector of unit values. Thus, , indicating that 0 is an eigenvalue for with corresponding (standardized) eigenvector . The same is true for as well. Let denote the matrix of eigenvectors for . By definition, is orthogonal, so , which yields that . Because is not of full rank, its inverse does not exist, but it does have a Moore-Penrose generalized inverse [34, 35], denoted by . If denotes the diagonal matrix of eigenvalues for , then the Moore-Penrose generalized inverse of is and the Moore-Penrose generalized inverse of is .
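The following sketch illustrates the Moore-Penrose generalized inverse of a singular symmetric matrix whose only null direction is the vector of unit values, which is the situation described above. It does not use the eigen-decomposition from the text. Instead it relies on a shortcut that holds under that assumption: adding the projection onto the unit vector makes the matrix invertible, and subtracting the same projection after inversion restores the zero eigenvalue. All function names and the 3 x 3 example matrix are hypothetical.

```python
def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(x) for x in a[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [x / piv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def pinv_unit_null(a):
    """Moore-Penrose inverse of a symmetric matrix whose null space is spanned by ones."""
    n = len(a)
    shifted = [[a[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    return [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical symmetric matrix with zero row sums: eigenvalues 3, 3, 0.
A = [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
A_plus = pinv_unit_null(A)
A_check = matmul(matmul(A, A_plus), A)  # first Penrose condition: A A+ A = A
```

For this example the nonzero eigenvalues are both 3, so the generalized inverse equals the matrix divided by 9, and the unit vector remains in its null space.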

It easily can be verified that , which implies that where .

Since , (2.4) can be rewritten as

Suppose is a function of the symmetric matrix and satisfies the following two definitions [32].

Definition 2.1. Suppose is a function. is said to be nondecreasing if for all nonnegative definite matrices , , where , one has .

Definition 2.2. Suppose is a function. is said to be a scale-equivariant function [36] if for all matrices and constant .
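As an illustration of the two definitions, the sketch below checks numerically that the trace and the largest eigenvalue behave as nondecreasing and scale-equivariant functions on one example. It assumes the usual readings of the definitions: the function does not decrease when a nonnegative definite matrix is added to its argument, and multiplying the argument by a positive constant multiplies the value by the same constant. The function names, the matrices, and the use of power iteration are all illustrative choices, and a single example is of course not a proof.

```python
def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))


def largest_eigenvalue(a, iters=500):
    """Largest eigenvalue of a nonnegative definite matrix by power iteration."""
    n = len(a)
    x = [float(i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        y = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(val * val for val in y) ** 0.5
        if norm == 0.0:
            return 0.0
        x = [val / norm for val in y]
    ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(ax[i] * x[i] for i in range(n))  # Rayleigh quotient of a unit vector


A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 and 1
B = [[3.0, 1.0], [1.0, 2.0]]   # A plus a nonnegative definite matrix
c = 2.5
cA = [[c * x for x in row] for row in A]

checks = {}
for name, g in (("trace", trace), ("largest_eigenvalue", largest_eigenvalue)):
    checks[name] = {
        "nondecreasing": g(A) <= g(B),
        "scale_equivariant": abs(g(cA) - c * g(A)) < 1e-9,
    }
```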

Then a class of agreement coefficients is given by

If , then

If we set , then where is a nonnegative definite symmetric matrix of agreement weights with for and for . Two frequently used weighting schemes are the Cicchetti-Allison weights [37] and the Fleiss-Cohen weights [38]
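The two weighting schemes can be sketched as follows, using their standard agreement-weight forms: linear in the category distance for Cicchetti-Allison and quadratic for Fleiss-Cohen. The function names and the parameter `c` (the number of categories, whatever labels the categories carry) are hypothetical.

```python
def cicchetti_allison(c):
    """Linear agreement weights: 1 - |i - j| / (c - 1) for c ordered categories."""
    return [[1 - abs(i - j) / (c - 1) for j in range(c)] for i in range(c)]


def fleiss_cohen(c):
    """Quadratic agreement weights: 1 - (i - j)^2 / (c - 1)^2 for c ordered categories."""
    return [[1 - (i - j) ** 2 / (c - 1) ** 2 for j in range(c)] for i in range(c)]


ca3 = cicchetti_allison(3)
fc3 = fleiss_cohen(3)
ca2 = cicchetti_allison(2)
```

Both schemes give weight 1 on the diagonal and weights below 1 elsewhere. With two categories the Cicchetti-Allison weights form the identity matrix, which is the fact used in the appendix.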

We can also use , the largest eigenvalue of as another function for , which leads to a new agreement coefficient .

Therefore, including the coefficients that we propose in this paper, there are four novel coefficients to assess agreement for categorical data: , , , and . In practice, all these indices can be estimated using their sample counterparts. In this paper, a “” is used to denote the estimate of an index.

2.2. Properties

In general, if satisfies the properties in the two aforementioned definitions, then has the following properties:(1), (2) if with probability one, that is, for all , (3) if and only if for each and for one choice of , excluding the set of degenerate cases, (4) if and are independent.

Proof. Without loss of generality, we set . The proofs for other weight matrices are slightly more complex.
(1) From the previous proof, we know that Since is a nondecreasing and scale-equivariant function, we have It is then straightforward to see that .
(2) If with probability one, that is, for all , then [32]. It is then straightforward to see that .
(3) Excluding the set of degenerate cases, if for each and for one choice of , then [32]. Hence, and . Therefore, and . Hence, .
(4) If and are independent, then [32]. It is straightforward to see that .

Lemma 2.3. When , that is, when there are only two categories, the four coefficients are equivalent and they all reduce to Cohen’s kappa.

For proof of the lemma, please see the appendix.

3. Small Sample Performance

3.1. Simulation Design

In this section, we compare the performance of Cohen’s weighted kappa (), , , and . We used the bootstrap to estimate the sample variances of the last three coefficients because the asymptotic distributions of their estimates are too complex.

We used Matlab (version 7.10.0.499, The MathWorks Inc., Natick, Mass) to generate samples following multinomial distributions. Twelve different distributions were employed, representing different levels of agreement (from poor to excellent) and different numbers of categories in the responses (from 3 to 5). For details about the distributions and the design of the simulations, see Yang and Chinchilli [32]. We calculated the true coefficient values for , , , and as well as their estimates, biases, mean square errors, empirical variances, and bootstrap variances. The weights we used are Cicchetti-Allison weights.
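The bootstrap variance of an agreement coefficient can be sketched as below. This is not the Matlab code used in the study. It is a minimal Python illustration that assumes a nonparametric bootstrap, drawing each replicate as a multinomial sample of the same size from the observed cell proportions. Unweighted kappa stands in for the coefficient, and the table, seed, function names, and number of replicates are hypothetical.

```python
import random


def cohen_kappa(table):
    """Cohen's kappa for a square table of counts."""
    c = len(table)
    n = sum(sum(row) for row in table)
    row_m = [sum(table[i]) / n for i in range(c)]
    col_m = [sum(table[i][j] for i in range(c)) / n for j in range(c)]
    p_o = sum(table[i][i] for i in range(c)) / n
    p_e = sum(row_m[i] * col_m[i] for i in range(c))
    return (p_o - p_e) / (1 - p_e)


def resample_table(table, rng):
    """Draw one multinomial table of the same size from the observed cell proportions."""
    c = len(table)
    cells = [(i, j) for i in range(c) for j in range(c)]
    weights = [table[i][j] for i, j in cells]
    new = [[0] * c for _ in range(c)]
    for i, j in rng.choices(cells, weights=weights, k=sum(weights)):
        new[i][j] += 1
    return new


def bootstrap_variance(table, stat, n_boot=1000, seed=0):
    """Sample variance of stat over n_boot bootstrap replicates of the table."""
    rng = random.Random(seed)
    values = [stat(resample_table(table, rng)) for _ in range(n_boot)]
    mean = sum(values) / n_boot
    return sum((val - mean) ** 2 for val in values) / (n_boot - 1)


counts = [[20, 5], [10, 15]]  # hypothetical observed table
boot_var = bootstrap_variance(counts, cohen_kappa, n_boot=1000, seed=0)
```

Any other coefficient that maps a table of counts to a number can be passed in place of `cohen_kappa`. A replicate in which chance agreement equals 1 would make the coefficient undefined; the sketch does not guard against that degenerate case.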

We ran 5000 simulations for each sample size under each distribution, with 1000 bootstrap replicates for each simulation. The observed variances of the proposed method are compared with the mean bootstrap variances. We follow Yang and Chinchilli [32] in the interpretation of the results. That is, agreement measures greater than 0.75 or so may be taken to represent excellent agreement beyond chance, values between 0.40 and 0.75 may be taken to represent fair to good agreement beyond chance, and values below 0.40 or so may be taken to represent poor agreement.
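The interpretation rule can be written as a small helper. Since the cut-offs are stated as "or so", the treatment of values exactly at 0.40 and 0.75 below is an assumption, and the function name is hypothetical.

```python
def interpret_agreement(value):
    """Map an agreement coefficient to the three bands of the interpretation rule."""
    if value > 0.75:
        return "excellent"
    if value >= 0.40:
        return "fair to good"
    return "poor"


labels = [interpret_agreement(x) for x in (0.82, 0.55, 0.31)]
```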

3.2. Simulation Results and Discussion

Table 2 gives the simulation results for the four coefficients with the weighting scheme being Cicchetti-Allison weights (the simulation results are split into two tables because they do not fit on one page).

In almost all the cases, the four agreement coefficients give the same classification of the degree of agreement. An interesting exception is case 6, where the degree of agreement lies near the boundary between poor and medium agreement. Cohen’s weighted kappa and indicate close to medium agreement while and indicate poor agreement, according to our rule of interpretation of the results. The biases of all four coefficients are small and become almost negligible as the sample size increases. The bootstrap turns out to be a good way to estimate the sample variances: in many cases the bootstrap estimates of the variances are equivalent to the observed values.

It also should be noted that the true values of coefficients within Table 2(a) can vary, especially for cases 5–9 () and cases 10–12 (). This suggests that, as table size increases, the coefficients could lead to different interpretations.

4. Examples

In this section, we use two empirical examples to illustrate the application and compare the results of the four agreement coefficients. The first example is from Brenner et al. [39], who examined the agreement of cause-of-death data for breast cancer patients between the Ontario Cancer Registry (OCR) and classification results from Mount Sinai Hospital (MSH). OCR, established in 1964, collects information about all new cases of cancer in the province of Ontario, while MSH carries out systematic and rigorous monitoring and follow-up of confirmed node-negative breast cancer patients. Because its data are relatively complete and accurate, MSH also provides specialist-determined cause of death. A total of 1648 patients entered the analysis through linking the OCR data to the MSH study patients via the OCR standard procedure. The purpose of the study was to examine the degree of agreement between OCR, which is often used in cancer studies, and MSH, which was taken as the reference standard. A missed cancer-related death is considered of greater importance. The data are summarized in Table 3.

The second example is from Jiménez-Navarro et al. [40], who studied the concordance of the oral glucose tolerance test (OGTT), proposed by the World Health Organization, in the diagnosis of diabetes mellitus (DM), which may lead to a future epidemic of coronary disease. OGTT classifies a diagnosis result as DM, glucose intolerance, or normal. Eighty-eight patients admitted with an acute coronary syndrome who had no previous diagnosis of DM underwent percutaneous coronary revascularization and received OGTT the day after revascularization and again one month later. Researchers are interested in the reproducibility of OGTT performed at these two time points. The data are summarized in Table 4.

We calculated all four agreement coefficients as well as their corresponding 95% confidence intervals (CI). The 95% CI for Cohen’s weighted kappa was based on the large sample variance derived by Fleiss et al. [31]. For the other three coefficients, 5000 bootstrap replicates were used in the calculation of the CI. Table 5 gives the results for the above two examples. In all the calculations, the Cicchetti-Allison weights were used (see (2.9)).

As can be seen from Table 5, there seems to be no substantial difference among the four coefficients under the aforementioned interpretation rule. For example 1, all coefficients indicate that there is very strong agreement between the OCR and MSH studies in classifying the cause of death for the 1648 cancer patients. The second example, however, indicates that OGTT, when taken at two time points one month apart, has very poor reproducibility in the diagnosis of DM.

5. Conclusion

We developed a fixed-effects modeling of Cohen’s kappa and weighted kappa for bivariate multinomial data from the perspective of a generalized inverse, which provides an alternative to the framework proposed by Yang and Chinchilli [33]. The proposed method accommodates different matrix functions, as long as they satisfy certain conditions. When there are only two categories in the response, the coefficient reduces to Cohen’s kappa. The proposed method is new and promising in that other matrix functions, such as the largest eigenvalue of a matrix, might also yield reasonable, and possibly better, results. Exploration of such matrix functions is a topic for our future research.

Appendix

Proof of Lemma 2.3

Proof. (1) When , the Cicchetti-Allison weights become an identity matrix. Therefore,
(2) Since the weighting matrix is a matrix, the two eigenvalues of can be denoted by 0 and . Therefore, the largest eigenvalue is equal to . Similarly, it can be shown that the largest eigenvalue of is also equal to . Thus, .
(3) From the previous proof, we know that . When , is and is Because , where is the eigenvector of associated with the largest eigenvalue of .
But for the case , because , also is the eigenvector of associated with the largest eigenvalue of . Thus,
(4) From the derivation of , where , we know that it can be written as But the largest eigenvalue for the case is equal to the trace, so .