Abstract

Decisions regarding competing risks are usually based on a continuous-valued marker toward predicting a cause-specific outcome. The classification power of a marker can be summarized using the time-dependent receiver operating characteristic curve and the corresponding area under the curve (AUC). This paper proposes a Gaussian copula-based model to represent the joint distribution of the continuous-valued marker, the overall survival time, and the cause-specific outcome. Then, it is used to characterize the time-varying ROC curve in the context of competing risks. Covariate effects are incorporated by linking linear components to the skewed normal distribution for the margin of the marker, a parametric proportional hazards model for the survival time, and a logit model for the cause of failure. Estimation is carried out using maximum likelihood, and a bootstrap technique is implemented to obtain confidence intervals for the AUC. Information-criteria strategies are employed to find a parsimonious model. The performance of the proposed model is evaluated in simulation studies, considering different sample sizes and censoring distributions. The methods are illustrated with the reanalysis of a prostate cancer clinical trial. The joint regression strategy produces a straightforward and flexible model of the time-dependent ROC curve in the presence of competing risks, enhancing the understanding of how covariates may affect the discrimination of a marker.

1. Introduction

In clinical medical practice, decisions about personalized treatments are generally guided by markers that can discriminate between different levels of risk of a specific type of death. In time-to-event data, an individual can be exposed to two types of failure, the so-called competing risks; here, the phenomenon of interest is to assess the classification capacity of the marker to predict a cause-specific outcome within a certain period, leading to a more objective choice of treatment of a condition. In this context, Saha and Heagerty [1] defined cumulative/dynamic discriminatory measures to evaluate the prediction accuracy of a marker to distinguish between the subjects who experience the cause-specific event within survival time and those who do not. Specifically, they define the cumulative true positive rate as the probability of a high marker value among those subjects who experience the event within time and the dynamic false positive rate as the probability of a high marker value among subjects who are event-free through time . The cumulative/dynamic curve for each event type can be obtained by plotting versus at time , for all possible values of the marker. The corresponding area under the curve is then used as a global summary of the classification power of the marker.

Inference methods have mainly used nonparametric approaches for the estimation of [13]. The primary limitation of such formulations is the difficulty that occurs when it comes to incorporating covariates into the analysis. Zheng et al. [4] obtained expressions of and in terms of the cumulative incidence functions (CIF’s), which are represented using cause-specific regression hazard functions models for the different failure types; however, it has been documented that standard specifications for those models, including the proportional hazards assumption they employ, lead to model selection issues since testing for covariate effects on the CIF’s is not possible [5], thus hampering the comparison between different ROC curves.

In this article, we employ the joint regression analysis using Gaussian copulas introduced in [6] to model the joint distribution of . This approach represents the joint model in terms of a trivariate Gaussian copula distribution, subsequently utilized to characterize the curve. The regression model is constructed by specifying the marginal distribution of the marker with a skewed normal regression model, the marginal distribution of the survival time with a parametric proportional hazards regression model, and the marginal distribution of the cause of failure with a logistic regression model. Maximum likelihood is employed with a constraint that ensures that the dependence parameters in the Gaussian copula yield a positive-definitive correlation matrix, which must be satisfied. Also, information-criterion metrics are employed for variable selection, leading to the formulation of a parsimonious model. The joint regression modeling yields a straightforward and flexible model for the time-dependent ROC curve in the presence of competing risks. This direct approach enhances comprehension regarding covariates’ impact on a marker’s discriminatory ability, representing the paper’s principal contribution.

The rest of the paper is organized as follows. Section 2 defines the parametric Gaussian copula-based model, the copula relation to time-dependent discrimination metrics, and the marginal specifications. In Section 3, the maximum likelihood estimation procedure is described. Section 4 provides some simulation results to explore the properties of the proposed estimators under different sample sizes and censoring scenarios. Section 5 illustrates the method with data from a well-known prostate cancer dataset [7]. The paper culminates with a conclusion.

2. Methods

2.1. Gaussian Copula-Based Models and the Relation with ROC and AUC

In applied research, the copula model represents a convenient and flexible way to construct multivariate distributions that embody wide ranges of dependence structures while allowing for the specification of the underlying marginal distributions [8]. Sklar’s theorem provides the decomposition of any continuous -variate distribution function in terms of the marginal cumulative distribution functions (CDF’s) and the associated copula , which is a -variate distribution function with uniform margins [9], and thus the representation of the multivariate distribution function (MDF) is given by ; also, the theorem states that given , the copula is unique on , implying that the copula is unique if the marginals are continuous.

Given its flexibility, analytical tractability, and capacity to characterize rich dependence structures, similar to the multivariate Gaussian distribution, this study adopted the trivariate Gaussian copula, which is defined bywhere is the standard normal trivariate distribution function (zero means and unit variances), is the quantile function corresponding to the univariate standard normal distribution, and is a nonnegative symmetric positive definite correlation matrix given bywhere each dependence parameter takes values in .

Following the Radon–Nikodym derivative methodology introduced in [6], the joint density function of corresponding to the copula model in equation (1) is represented bywhere and are, respectively, the density and CDF functions of the random variable , , , is the CDF of the event type,and .

Given the cumulative true positive rate for cause , defined by , and the dynamic false positive rate, defined by , the time-dependent cumulative/dynamic ROC curve at time for cause is expressed as follows:where . The corresponding area under the curve at time for cause , , represents the probability that an individual with a positive diagnosis has a higher value of the marker than an individual with a negative diagnosis [10].

The discrimination metrics can be expressed in terms of the joint model equation (4):

Here, and represent the marginal bivariate Gaussian copula functions associated with the trivariate Gaussian copula, following the marginalization property of the multivariate Gaussian copula [11]. The derivation of equations (6) and (7) is detailed in Appendix A.

2.2. Marginal Specifications

In practice, markers tend to exhibit skewed distributions [12]. Thus, in this study, the margin of follows the skewed-normal distribution, denoted here by , whose density function is given by [13]where and are, respectively, the density and CDF of a standard normal random variable, is the location parameter, is the scale parameter, and is the slant parameter. Explanatory variables are incorporated by linking location parameter to a linear component, namely, , where and are vectors of parameters and covariates, respectively.

For modeling the margin of , this study employed the proportional hazards model, which is given by , where is a baseline distribution with vector of parameters , and and are vectors of parameters and covariates, respectively; here, does not contain the intercept and does not necessarily contain the same variables in . Since it has shown good fits to survival data in various applied analyses [14], this study used the Weibull distribution as the baseline; that is, , where , and and are, respectively, the shape and scale parameters. The incorporation of covariates to the margin was specified with the following which will be referred to as the survival component:where is the baseline survival function (which follows a Weibull distribution) and is the parameter vector (which does not include intercept).

Similar to various competing risk regression formulations for the cause of the event [15, 16], the marginal model for was specified with a logistic model, which is given bywhere is the vector of covariates, not necessarily containing the same variables in and , and is a vector representing the model parameters when .

In the proposed fully parametric Gaussian copula regression model, the cumulative true positive rate for cause and the dynamic false positive rate are now expressed conditionally on the covariates: and . Consequently, the time-dependent cumulative/dynamic ROC curve for event type and the corresponding AUC are also conditional expressions, and , respectively.

3. Estimation

Consider the vector of marker, survival time, and cause of death for the -th subject in the study, . The observed dataset is , where is the observed value of , is the time to either failure or censoring, identifies the type of failure, taking values 1 and 2 for the observed survival times and 0 for the censored times, and is a vector of covariates. The full likelihood function is given bywhere is a vector containing all parameters, , and ; here can be obtained by marginalizing the trivariate joint model [11].

Due to the complexity of the likelihood function formulation, we propose a numerical procedure for the corresponding estimation process, which involves employing an optimization algorithm called PRAXIS (Principal Axis), developed by Richard Brent [17]. The optimization algorithm is implemented in the R package using the function .

Maximizing the log-likelihood function in equation (11) requires setting lower and upper bound constraints for each dependence parameter, , , and , to ensure the positive definiteness of the correlation matrix . To achieve this, we developed a heuristic algorithm to create box constraints for each dependence parameter, ensuring , , and . A detailed explanation of the heuristic algorithm can be found in Appendix B. Log links are used for the strictly positive parameters.

4. Simulation Experiments

To evaluate the methods’ performance and robustness, we generate 200 samples of and from the trivariate Gaussian and Student copulas with margins . The dependence parameters are set to , and . The copulas are implemented using the functions and from the package.

To generate variates following the skew-normal distribution with parameters , , and , we employ the inverse distribution function method [8]. For the proportional hazards model, we follow the approach proposed in [18] to generate variates conforming to the proportional hazards model with a Weibull baseline survival function, setting parameter values as , , and . These marginals share the same covariate , which is assumed to follow a normal distribution . We set the logistic model to , with a dichotomous covariate that is generated from the binomial distribution .

The censoring time is simulated from a uniform distribution , where takes values of 200 and 85. The observation is considered censored if a simulated failure time exceeds the corresponding simulated censoring time. The average percent censored for and was around 11% and 26%, respectively.

Tables 1 and 2 present empirical bias, standard error (SE), and mean square error (MSE) of the coefficients obtained from the Gaussian copula regression model for simulations from both copulas. Generally, empirical biases and standard errors are relatively small compared to the actual parameters, resulting in a low mean square error, which suggests that the estimation process is appropriate for both sample sizes.

Overall, all three performance measures tend to approach zero with an increased sample size. Notably, an increase in censoring has minimal impact on the estimation process, as evidenced by the only marginal increase in mean square error. Additionally, although there is a slight increase in the mean square error for estimators derived from simulations of a trivariate Student copula, this increase is not drastic. Consequently, the Gaussian copula regression model exhibits favorable estimation properties in the considered moderate sample sizes and appears robust concerning the Student copula.

5. Prostate Cancer Study Data

This study utilizes the dataset described in [7] to exemplify the proposed methods’ use. The dataset comprises an approximate five-year follow-up of a cohort consisting of 506 patients diagnosed with prostate cancer of stages III and IV. These patients were recruited in a clinical trial conducted between 1967 and 1969. A total of 4.54% of patients were excluded from the analysis due to incomplete covariate information. Among the remaining cohort, 125 patients (25.88%) succumbed to prostate cancer , 219 patients (45.34%) died from other causes , and the remaining 139 patients (28.78%) had censored survival times.

The underlying research problem consists of identifying a marker to then measure its discriminatory power. French et al. [19] suggested the utilization of a Cox regression model to formulate a composite marker, which is obtained as a weighted combination of both biomarkers and clinical variables, wherein the estimated regression coefficients serve as the weighting factors. In this study, the marker is derived as the linear predictor of a cause-specific parametric Cox regression model [20], with the event of interest defined as death from prostate cancer. Specifically, the linear component of the cause-specific hazard for is estimated by treating the event times of all individuals who failed due to cause as censored. The fitting process is carried out utilizing the function from the package in R, considering a Weibull distribution.

A backward elimination approach that is guided by the Bayesian information criterion (BIC) was employed to systematically eliminate the least significant covariate in each iteration, aiming to derive the most parsimonious model. The final regression model incorporates the following covariates: PF (performance rating: 0, standard; 1, limitation of activity), HG (serum hemoglobin in g/100 ml), SZ (size of the primary lesion in ), and SG (Gleason stage-grade category). The resulting composite marker is defined as . Figure 1 shows the histogram of the marker , which exhibits a skewed shape. Superimposed are the Gaussian kernel-based nonparametric density and the fitted skewed-normal density.

5.1. Discriminatory Performance of the Composite Marker

The discrimination power of the composite marker is initially examined without considering covariates. Following the model fitting, it is observed that the sole parameter lacking individual significance is , with an estimated value of . A likelihood ratio test assesses the null hypothesis of employing the chi-squared distribution approximation with one degree of freedom. With a value of 0.2441, the null hypothesis cannot be rejected. Consequently, the simpler model with is chosen on the grounds of parsimony. The parameter estimates of the joint simpler model and their standard errors are detailed in Table 3.

Analyzing the dependence parameters, the negative correlation indicates that high values of are associated with the early occurrence of any event ( represents overall survival time). Furthermore, because , high values of are more strongly associated with cause 1 than cause 2. Thus, the estimated correlations suggest a marker with desirable classification properties.

To assess the appropriateness of the parametric model’s fit, we perform a comparative analysis between the ROC curves obtained from the Gaussian copula regression model and the IPCW nonparametric ROC curves introduced by Blanche et al. [2]. The IPCW nonparametric approach possesses the advantage of not requiring any tuning parameters. Additionally, it can be easily implemented using the function within the package in R. The plots of the parametric and nonparametric estimators of at time horizons 12, 36, and 60 months are depicted in Figure 2. It appears that the two curves are relatively close, suggesting that the fit of the parametric model is adequate.

5.2. Inclusion of Covariates

A Gaussian copula regression model is fitted to evaluate how the classification power of changes on including covariates. We consider potential predictors as the remaining variables not included in the composite marker: RX (drug treatment), HX (history of cardiovascular disease: 0, no; 1, yes), AGE (age in years), and WT (weight index: weight in kg–height in cm + 200). A backward elimination process based on the BIC is employed to remove the least significant covariate sequentially, resulting in the most parsimonious regression model. Table 4 displays BIC values from various regression models, indicating the predictors included in each marginal regression model during the elimination process. The chosen model is Model 9, as it exhibits the lowest BIC.

Table 5 presents the best-fitting parametric Gaussian copula regression model coefficients. Similar to the case without covariates, only one parameter is not statistically significant: . A likelihood ratio test is conducted, considering a nested model with as the null hypothesis. Based on a chi-squared distribution with one degree of freedom, the null hypothesis of a model with cannot be rejected.

To assess the validity of the proportionality assumption for the factor AGE in the survival component, Figure 3 illustrates the plot of the empirical estimator of as a function of . As AGE is a continuous variable, two balanced groups are created: Group 1 comprises individuals with ages less than 74, and Group 2 comprises individuals with ages greater than or equal to 74. The log-minus-log plot shows approximate parallelism for most of the time, with some sparsity observed toward the end due to a limited number of observations in both groups at extended time points. This visual examination provides evidence in favor of the validity of the proportionality assumption in the Cox regression model.

Figure 4 illustrates based on the best-fitting parametric Gaussian copula regression model for the two levels of the variable HX (history of cardiovascular disease) and ages 60, 73, and 85 years old at a time horizon of 60 months. In this instance, the IPCW estimators of the time-dependent ROC curves are not considered due to the impracticality of incorporating covariates in this nonparametric approach. The corresponding for the six combinations of predictors at the 60-month horizon are shown in Table 6. 95% bootstrap confidence intervals (with biased correction [21]) are estimated using 500 numbers of runs. Notably, insights reveal that older age and a history of cardiovascular disease enhance the classification performance of the composite marker .

6. Conclusion

This study introduces a parametric Gaussian copula regression model for estimating time-dependent predictive accuracy metrics subject to right-censoring in the presence of competing risks. The marginal specifications encompass a normal asymmetric distribution for the marker, wherein a linear component is linked to the location parameter. Additionally, the model includes a parametric Cox regression model with a Weibull distribution for survival time and a logistic regression model for event type. The dependencies among these variables are captured using a trivariate Gaussian copula.

Simulation studies demonstrate that the Gaussian copula regression model exhibits favorable properties in moderate sample sizes and appears robust to slight deviations from the data-generating process. To validate the utility of our approach in a real-world scenario, we apply our proposed methodology to a well-known prostate cancer dataset. Based on the linear predictor of a cause-specific parametric Cox regression model, the composite biomarker demonstrates moderate predictive performance in identifying patients at risk of dying from prostate cancer. Interestingly, older individuals with cardiovascular disease contribute to an increased discriminatory power of .

The primary contribution of this paper is the straightforward incorporation of covariates into the estimation of ROC curves and AUCs through joint regression modeling. Considering predictors in the initial Gaussian copula regression model eliminates the need for stratified analyses or adjustments through indirect methodologies, enabling a better understanding of how covariates may affect the classification power of a marker. The computation of the models relies predominantly on vector and matrix operations, along with readily available functions in the R language. The R code that implements the proposed methodology is available upon request. Furthermore, the introduced modeling strategy possesses additional favorable properties, demonstrating proficiency under moderate sample sizes and an ability to identify parsimonious models through information-criterion metrics.

Despite the compelling properties of the proposed methodology, some open questions remain. The linear dependence structure enforced by the Gaussian copula may be broadened by utilizing alternative copula models. Vine copula models are grounded in decomposing dependence into a series of interdependencies among (conditional) pairs of variables, affording the capability to manage heavy tails or asymmetric dependence [22]. We intend to investigate this approach in forthcoming research work.

Appendix

A. Discrimination Metrics in terms of Gaussian Copulas

This appendix shows how the expressions for the cumulative true positive rate for cause and the dynamic false positive rate are obtained using the trivariate and bivariate Gaussian copulas.

Using the definition of conditional probability, and can be rewritten as

Next, employing the law of total probability, the numerator term in (A.1) can be expressed as

Substituting this last expression in (A.1) leads to

As the random variable is discrete, the expression for the term in (A.4) can be formulated as follows:where the Sklar theorem [9] is used to represent the joint distribution function in terms of a trivariate Gaussian copula .

The expression in (A.4) can be rewritten as follows:

The bivariate Gaussian copula arises through the marginalization of the trivariate Gaussian copula, leveraging the property that the multivariate Gaussian copula is closed under marginalization [11].

Using equations (A.5) and (A.6), can be defined in terms of copulas:where and .

Focusing on , the numerator in (A.2) corresponds to a bivariate survival distribution function. This function can be expressed in terms of marginal distribution functions and the associated bivariate Gaussian copula [8]:where represents the marginal bivariate Gaussian copula .

Finally, expressing the denominator in (A.2) as , the dynamic false positive fraction becomes

B. Bound Constraints to Guarantee a Positive Definite Matrix

Given the random vector , the correlation matrix is characterized by three dependency parameters, namely, , , and . A fundamental criterion for a matrix to be positive definite is that all of its principal minors are positive [23], implying that each upper-left submatrix must have a positive determinant. Concerning the correlation matrix, the upper and upper submatrices consistently exhibit positive determinants, assuming . Consequently, the focus shifts to ensuring that the determinant of the matrix, denoted as , remains positive. The expression for is as follows:

To construct a three-dimensional region guaranteeing a positive determinant for all conceivable values of , , and , we propose the following heuristic algorithm:(1)Calculate the initial values , and .(2)Add a small amount (for instance 0.001) to each initial value: , , and . If one of these values is greater than 1, the algorithm stops. Else, calculate the eight possible combinations of determinants: , , , , , , , and .If all the determinants are positive, the functional relationship of equation (B.1) guarantees positive determinants for any point in the cubic region .(3)Subtract a small amount (for instance 0.001) to each initial value: , , and . If one of these values is less than , the algorithm stops. Else, calculate the eight possible combinations of determinants: , , , , , , , and .If all the determinants are positive, the functional relationship of equation (B.1) guarantees positive determinants for any point in the cubic region .(4)Add another small amount to each initial value: , , and . If one of these values is greater than 1, the algorithm stops. Else, calculate the eight possible combinations of determinants: , , , , , , , and .If all the determinants are positive, equation (B.1) guarantees positive determinants for any point in the cubic region .(5)Subtract another amount to each initial value: , , and . If one of these values is less than , the algorithm stops. Else, calculate the eight possible combinations of determinants: , , , , , , , and .If all the determinants are positive, equation (B.1) guarantees positive determinants for any point in the cubic region .(6)In this manner, the algorithm iterates times, incrementally expanding the region centered on the initial values until either of the eight determinants becomes nonpositive or any of the points exceeds 1 or falls below −1.

As the arguments above are heuristic, a series of simulation experiments were conducted to verify whether the points within three-dimensional region (centered at the initial values , , and ) exhibit a positive determinant. The simulation results unequivocally illustrated that all points within the cubic region yielded positive determinants.

Data Availability

The prostate cancer dataset used to support the findings of this study has been deposited in the Vanderbilt Biostatistics repository (https://hbiostat.org/data).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors wish to thank Benemérita Universidad Autónoma de Puebla for all the support provided to achieve this research project.