Abstract
This study examines the application of flexible copula regression models to analyze the complex interdependencies among clinical variables in breast cancer data. As the most commonly diagnosed cancer and the second leading cause of cancer-related deaths among women worldwide, breast cancer presents both clinical and analytical challenges. Unlike traditional multivariate approaches, copulas offer greater flexibility in capturing complex, nonlinear, and asymmetric dependencies between mixed-type outcomes. We explored multiple copula families to jointly model overall survival (binary) and age at diagnosis (continuous) in the METABRIC dataset. Goodness-of-fit metrics guided model comparison and selection, with the Gumbel copula demonstrating superior performance in capturing the upper-tail dependence associated with favorable outcomes, such as younger age and improved survival. Formal model comparison against an independent-margins baseline confirmed that accounting for dependence via a copula significantly improves model fit (likelihood ratio test, df = 1, p < 0.0001), and probability integral transform (PIT) diagnostics validated the adequacy of both marginal specifications. The findings support the integration of copula models into clinical research, facilitating a more nuanced understanding of cancer progression and enabling more accurate risk assessment and data-driven decision-making in oncology.
Citation: Rani H, Mehmood T, Aslam M, Al-Essa LA (2026) Unveiling the Multifaceted Dynamics of Breast Cancer: A Copula Regression Approach to Modeling and Predicting Outcomes. PLoS One 21(4): e0346495. https://doi.org/10.1371/journal.pone.0346495
Editor: Yajie Zou, Tongji University, CHINA
Received: October 29, 2025; Accepted: March 19, 2026; Published: April 10, 2026
Copyright: © 2026 Rani et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data that support the findings of this study is available in Figshare at https://figshare.com/s/1347711e0e6d2a2bc6fe, and will be publicly available after the acceptance of this article.
Funding: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R443), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Copula regression has emerged as a powerful tool for modelling dependence structures in multivariate data, providing a flexible framework for constructing joint distributions when outcomes follow non-normal or mixed distributions [1,2]. Traditional correlation measures often fail to capture complex dependencies, particularly in non-Gaussian settings or in the presence of tail dependence [3], making copula theory essential for accurate multivariate analysis. An especially attractive feature of copula modeling is that parameter estimation and inference can be performed using standard likelihood procedures [4].
Copula methods have seen widespread adoption across diverse scientific fields due to their flexibility in capturing complex dependence structures. In finance and economics, they model asymmetric exchange rate dependencies [5] and income-consumption relationships [6]. Environmental scientists employ copulas for analyzing concurrent extreme weather events [7,8] and hydrological droughts [9]. Transportation researchers use them to jointly analyze incident clearance and response times [10], while agricultural economists model crop yield dependencies [11]. Copulas have also been applied in health economics [12] and extended to correlated survival data [13]. In genomics, they have been used to model gene dependencies [14] and to construct biological networks [15]. Recent biomedical applications include modeling breast cancer metastasis patterns [16], survival prediction under dependent censoring [17], and joint analysis of mixed discrete-continuous outcomes [18]. General treatments of copula models are available [19,20], but their application to clinical biostatistics, especially in cancer research, remains underdeveloped.
This gap is particularly salient in breast cancer studies, where key outcomes and predictors may exhibit complex dependence patterns that traditional regression approaches (e.g., logistic or generalized linear regression) fail to capture. Breast cancer remains the most commonly diagnosed malignancy and the second leading cause of cancer-related mortality in women worldwide, with approximately 2.3 million new cases and 685,000 deaths annually [21]. Age at diagnosis is a critical prognostic factor [22], with survival patterns showing significant variation across age groups and geographical regions [23,24]. Classical regression models fitted separately to each outcome assume independence between outcomes and can obscure clinically important dependence structures. Although advances in screening, early diagnosis, treatment, and survival prediction have improved outcomes, further methodological improvements in capturing complex dependencies are still needed. This article therefore explores the copula approach to modeling breast cancer data.
In the present study, we address this methodological challenge in two ways: first, by properly handling mixed discrete and continuous outcomes, and second, by using copula models to incorporate the interpretable dependence measures required for clinical decision-making. Previous approaches either relied on restrictive distributional assumptions [25] or modeled the outcomes separately, thereby losing dependence information. Our approach builds on recent advances in copula regression for mixed outcomes [18,26] while addressing the specific needs of cancer research.
Our work makes two key statistical contributions:
- We demonstrate how copula regression models with latent variable probit margins can effectively model the dependence between binary and continuous outcomes in breast cancer data, thereby overcoming the limitations of previous methods.
- We provide a reproducible framework for testing and comparing alternative dependence structures through information criteria in cancer research.
Using data from the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) cohort (n = 1,904), we show how copula methods reveal clinically meaningful dependence between a binary–continuous pair of response variables, overall survival (y1) and age at diagnosis (y2), while accounting for the influence of covariates and comparing several copula families. Importantly, we model age at diagnosis and survival as joint manifestations of the disease process rather than causally linked variables, allowing estimation of their residual association after covariate adjustment, an approach particularly valuable for identifying patient subgroups with unusual age-survival patterns. The models are implemented with the R package GJRM (Generalized Joint Regression Modeling) [27].
We develop a template for cancer research by characterizing the interplay between treatment response and age stratification.
The rest of the paper is organized as follows. Section II describes the data and study variables included in the analysis. Section III presents the mathematical framework of the bivariate copula models with mixed binary–continuous margins and associated likelihood formulations. Section IV presents the empirical results, along with a discussion and conclusions. Additional materials and technical derivations are provided in the Supporting Information Appendix section.
Materials and methods
Data source
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) is a large-scale, retrospective study jointly conducted by research teams in Canada and the United Kingdom. It provides comprehensive genomic and clinical profiles for over 2,000 breast cancer patients [28]. The dataset used in this study was obtained from Kaggle, with the original source being cBioPortal for Cancer Genomics. For the present analysis, only clinical and demographic variables were retained, while genomic features were excluded to focus on clinical predictors of breast cancer outcomes. Data preprocessing involved imputing missing values using the mean for continuous variables and the mode for categorical variables. Redundant or irrelevant features were removed prior to analysis, resulting in a final dataset comprising 1,904 patients and 30 clinical variables.
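The imputation rule described above (mean for continuous variables, mode for categorical variables) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `impute_column` and the toy columns are hypothetical and do not reproduce the authors' preprocessing code or any METABRIC values.

```python
from statistics import mean, mode


def impute_column(values, kind):
    """Replace None entries with the column mean (continuous) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mean(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns, used only to illustrate the rule.
tumor_size = [22.0, None, 30.0, 26.0]
cellularity = ["high", "moderate", None, "high"]

imputed_size = impute_column(tumor_size, "continuous")
imputed_cellularity = impute_column(cellularity, "categorical")
print(imputed_size)
print(imputed_cellularity)
```

Single-value imputation of this kind is simple but ignores imputation uncertainty, which is worth keeping in mind when interpreting standard errors.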
Study variables
The analysis focused on two primary response variables: 1) a binary overall survival indicator, and 2) a continuous age at diagnosis. Clinical covariates were selected based on their known prognostic relevance in breast cancer (Fig 1).
- Overall Survival (y1): A binary survival indicator as provided in the METABRIC dataset. The variable is coded as 0 if the patient was deceased and 1 if the patient was alive at the last clinical follow-up, irrespective of cause of death. This coding was verified by cross-referencing with the descriptive variable death_from_cancer. The binary formulation was adopted to facilitate the application of the copula framework to mixed binary–continuous outcomes, which is the primary methodological focus of this study.
- Age at Diagnosis (y2): A continuous variable (in years) representing the patient’s age at the time of breast cancer diagnosis. Age is a well-established prognostic factor in breast cancer, although its relationship with disease aggressiveness and survival is complex and population-dependent [22,23]. In this study, age is treated as a marginal response solely to enable joint modeling with survival status and to quantify their dependence structure within the METABRIC cohort. No causal or temporal interpretation is implied by this specification.
- Clinical Covariates: The model adjusted for the following prognostic factors:
- Type of breast surgery: breast-conserving vs mastectomy.
- Cellularity: low, moderate, high.
- ER, PR, and HER2 status: positive vs negative.
- Neoplasm histologic grade (NHG): ordinal (grades 1–4, plus x).
- Other histologic subtype (OHS): ductal/NST, lobular, mixed, mucinous, medullary, metaplastic, other.
- Tumor laterality: left vs right.
- Integrative cluster (IC): subgroups 1–10.
- Treatment variables: chemotherapy, radiotherapy, and hormone therapy (all binary).
- Mutation count: numeric.
- Nottingham Prognostic Index (NPI): numeric index combining tumor size, grade, and lymph node status.
- Menopausal state: pre vs post.
- Tumor size: continuous, measured in mm.
Methodology
Copula-based modeling framework
The joint modeling framework in this study is based on Sklar’s theorem [29], which states that any multivariate distribution can be decomposed into its marginal distributions and a copula function that captures the dependence structure. This makes copula models particularly suitable for analyzing outcomes with different distributional forms.
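We focus on a bivariate setting where the two outcomes of interest are: (i) a binary response representing overall survival (Y1), and (ii) a continuous response representing age at diagnosis (Y2). The binary outcome Y1 represents overall survival status and is defined as

$$Y_1 = \begin{cases} 1, & \text{if the patient was alive at the last clinical follow-up}, \\ 0, & \text{if the patient was deceased}. \end{cases}$$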
Although time-to-event information is available in the METABRIC cohort, a binary formulation is adopted here to align with the primary methodological objective of this study, namely, copula-based joint modeling of mixed binary–continuous outcomes rather than hazard-based survival analysis.
Age at diagnosis (Y2) is included as a continuous marginal outcome solely to enable joint modeling and to quantify its dependence with survival status. It is not interpreted as being predicted by survival, nor does its inclusion imply any temporal or causal direction between the two outcomes.
The joint distribution of these outcomes can be expressed as

$$F(y_1, y_2) = C\big(F_1(y_1), F_2(y_2); \theta\big), \tag{1}$$

where F1 and F2 are the marginal cumulative distribution functions (CDFs) of Y1 and Y2, respectively, $C(\cdot, \cdot; \theta)$ is a copula function, and $\theta$ is the parameter capturing the strength and shape of dependence.
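To make the decomposition concrete, the sketch below builds a joint CDF for a binary survival indicator and a normally distributed age from their margins and a Gumbel copula. It is an illustration under stated assumptions, not the authors' implementation: the function names and the numerical values (survival probability 0.6, mean age 61, standard deviation 13, dependence parameter 2) are hypothetical.

```python
import math
from statistics import NormalDist


def gumbel_copula(u, v, theta):
    """Gumbel copula C(u, v; theta) for theta >= 1 and u, v in [0, 1]."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def binary_cdf(y1, p_alive):
    """CDF of a 0/1 outcome with P(Y1 = 1) = p_alive."""
    if y1 < 0:
        return 0.0
    return 1.0 - p_alive if y1 < 1 else 1.0


def joint_cdf(y1, y2, p_alive, mu, sigma, theta):
    """P(Y1 <= y1, Y2 <= y2) assembled from the margins and the copula."""
    u = binary_cdf(y1, p_alive)
    v = NormalDist(mu, sigma).cdf(y2)
    return gumbel_copula(u, v, theta)


# Hypothetical inputs: probability of being deceased and aged 55 or younger.
print(joint_cdf(0, 55.0, p_alive=0.6, mu=61.0, sigma=13.0, theta=2.0))
```

Setting `theta = 1.0` reduces the Gumbel copula to the independence copula, so the joint CDF becomes the product of the two marginal CDFs.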
Modeling mixed binary and continuous outcome.
Directly applying copulas to a mixed binary-continuous case is not straightforward because the copula function is not unique when a marginal distribution is discrete. To address this, we adopted a latent variable approach following [30]. We assume that the binary survival outcome Y1 is determined by an underlying, continuous latent variable $Y_1^*$ such that:

$$Y_1 = \mathbb{1}(Y_1^* > 0),$$

where $\mathbb{1}(\cdot)$ is the indicator function. We assume the error term of $Y_1^*$ follows a standard normal distribution, which inherently leads to a probit link for the probability of survival.
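This formulation allows us to derive the joint likelihood for an observation $(y_{1i}, y_{2i})$. The resulting joint probability mass and density function is:

$$f(y_1, y_2) = \left[\frac{\partial C\big(F_1(0), F_2(y_2); \theta\big)}{\partial F_2(y_2)}\right]^{1 - y_1} \left[1 - \frac{\partial C\big(F_1(0), F_2(y_2); \theta\big)}{\partial F_2(y_2)}\right]^{y_1} f_2(y_2), \tag{2}$$

where $F_1(0) = P(Y_1^* \le 0) = P(Y_1 = 0)$, $f_2$ is the marginal density of Y2 (age at diagnosis), and the partial derivative represents the conditional distribution of the latent variable $Y_1^*$ given the observed age Y2, evaluated at zero.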
Here, Equation (2) defines the joint probability structure implied by the copula and characterizes a symmetric dependence between the latent survival process and age at diagnosis. It is critical to note that this formulation, and the copula model in general, is symmetric with respect to Y1 and Y2. The model estimates the stochastic dependence between the two outcomes after accounting for covariates, and the copula parameter quantifies this association without implying that survival “explains” or “predicts” age, or vice versa. The conditional representation appearing in the likelihood is a mathematical device required for estimation in mixed discrete–continuous settings and does not denote any causal, temporal, or directional relationship.
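The conditional construction behind Equation (2) can be illustrated numerically. The sketch below assumes a Gumbel copula, a probit margin for survival with linear predictor `eta1`, and a normal margin for age; the function names and the input values are hypothetical, and the code is an illustration of the latent-variable construction rather than the GJRM implementation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gumbel_h(u, v, theta):
    """Partial derivative dC(u, v; theta)/dv of the Gumbel copula, for 0 < u, v < 1."""
    a, b = -math.log(u), -math.log(v)
    s = a ** theta + b ** theta
    c = math.exp(-(s ** (1.0 / theta)))
    return c * s ** (1.0 / theta - 1.0) * b ** (theta - 1.0) / v


def joint_density(y1, y2, eta1, mu, sigma, theta):
    """Joint mass/density of (Y1, Y2): P(Y1 = y1 | Y2 = y2) * f2(y2)."""
    age = NormalDist(mu, sigma)
    u = STD_NORMAL.cdf(-eta1)       # F1(0) = P(Y1* <= 0) = P(Y1 = 0) under a probit margin
    v = age.cdf(y2)                 # F2(y2)
    p_dead = gumbel_h(u, v, theta)  # P(Y1 = 0 | Y2 = y2)
    p = p_dead if y1 == 0 else 1.0 - p_dead
    return p * age.pdf(y2)


# Hypothetical values: linear predictor 0.3, age margin N(61, 13^2), theta = 1.8.
for y1 in (0, 1):
    print(y1, joint_density(y1, 55.0, eta1=0.3, mu=61.0, sigma=13.0, theta=1.8))
```

Summing the two values over `y1` recovers the marginal age density at `y2`, and with `theta = 1.0` the conditional probability no longer depends on age, which is a quick check that the construction behaves as expected.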
Specification of marginal distributions.
- For the binary survival outcome Y1, the latent variable approach with a probit link was used, as described above.
- For the continuous age outcome Y2, several candidate distributions were evaluated, including the normal, gamma, log-normal, logistic, and inverse gamma distributions. Model comparison based on the Akaike and Bayesian Information Criteria (AIC/BIC) indicated that the normal distribution provided the best fit for our data (see S1 Table and S1 Fig in Supporting Information).
Copula functions and dependence measurement.
To capture the dependence between survival and age, we considered both elliptical and Archimedean copula families. Specifically, we examined the Gaussian, Clayton, Gumbel, and Frank copulas, which represent distinct forms of dependence, including symmetric, lower-tail, upper-tail, and central association [31,32]. The copula parameter was transformed into Kendall’s $\tau$, a rank-based correlation coefficient ranging from −1 to 1. This provides an intuitive, standardized measure of the association strength between survival and age, facilitating interpretation and comparison across copulas.
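The mapping from each copula parameter to Kendall's tau follows standard closed-form results, which the sketch below collects for the four families considered here. The function names are hypothetical, and the Frank case evaluates the first-order Debye function with a simple midpoint rule as a numerical convenience.

```python
import math


def tau_gaussian(rho):
    """Kendall's tau for a Gaussian copula with correlation rho in [-1, 1]."""
    return 2.0 / math.pi * math.asin(rho)


def tau_clayton(theta):
    """Kendall's tau for a Clayton copula with theta > 0."""
    return theta / (theta + 2.0)


def tau_gumbel(theta):
    """Kendall's tau for a Gumbel copula with theta >= 1."""
    return 1.0 - 1.0 / theta


def tau_frank(theta, steps=10_000):
    """Kendall's tau for a Frank copula with theta != 0."""
    h = theta / steps
    # Midpoint rule for the integral of t / (exp(t) - 1) over [0, theta].
    integral = sum(
        ((k + 0.5) * h) / math.expm1((k + 0.5) * h) for k in range(steps)
    ) * h
    debye1 = integral / theta
    return 1.0 - 4.0 / theta * (1.0 - debye1)


for name, value in [
    ("gaussian", tau_gaussian(0.5)),
    ("clayton", tau_clayton(2.0)),
    ("gumbel", tau_gumbel(2.0)),
    ("frank", tau_frank(5.736)),
]:
    print(f"{name}: tau = {value:.3f}")
```

Because tau is on a common scale, a Clayton parameter of 2 and a Gumbel parameter of 2 both correspond to tau = 0.5 even though the two copulas place the dependence in opposite tails.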
Regression framework and estimation.
We embedded the joint model within a regression framework to relate covariates to all parameters of the distribution. Specifically, we defined additive predictors for each parameter:
- The probability of survival ($P(Y_1 = 1)$) was linked to covariates via a probit function.
- The mean ($\mu$) of the age distribution was modeled with a linear predictor.
- The variance ($\sigma^2$) of the age distribution was modeled with a log-link to ensure positivity.
- The copula dependence parameter ($\theta$) was modeled as a constant parameter within each copula family, capturing the overall strength of dependence between the two outcomes. Although the general modeling framework allows the copula dependence parameter $\theta$ to be linked to covariates, all results reported in this study correspond to models with a constant $\theta$ within each copula family. Covariate-dependent dependence structures were explored only in preliminary analyses and are beyond the scope of the present paper.
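More formally, let $\boldsymbol{\vartheta}_i = \big(P(Y_{1i} = 1), \mu_i, \sigma^2, \theta\big)$ denote the parameter vector for patient i. The model is specified through the following regression structures, where $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$ denote the covariate vectors entering the survival and age equations, respectively:

For the binary survival margin ($P(Y_{1i} = 1)$):

$$\Phi^{-1}\big(P(Y_{1i} = 1)\big) = \eta_{1i} = \mathbf{x}_{1i}^\top \boldsymbol{\beta}_1$$

For the continuous age mean ($\mu_i$):

$$\mu_i = \eta_{2i} = \mathbf{x}_{2i}^\top \boldsymbol{\beta}_2$$

For the age variance ($\sigma^2$):

$$\log \sigma^2 = \beta_{\sigma,0}$$

where the log link ensures positivity, and only an intercept is included.

For the copula dependence ($\theta$):

$$g(\theta) = \beta_{\theta,0}$$

where $g$ is a family-specific link that keeps $\theta$ within its admissible range.

This formulation makes explicit which components of the model depend on covariates and confirms that the copula dependence parameter is constant (i.e., no covariates) across all reported models.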
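The link functions in the regression structure above can be sketched as simple inverse transformations from unconstrained predictors to model parameters. The probit and log links are taken from the text; the family-specific transformations for the copula parameter (tanh for Gaussian, exponential for Clayton, one plus exponential for Gumbel, identity for Frank) are common choices for keeping the parameter in range and are an assumption here, not a statement of what the authors' software uses. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def survival_probability(eta1):
    """Inverse probit link: P(Y1 = 1) = Phi(eta1)."""
    return NormalDist().cdf(eta1)


def age_variance(eta3):
    """Inverse log link: the variance is exp(eta3), hence always positive."""
    return math.exp(eta3)


def copula_parameter(eta4, family):
    """Map an unconstrained value to an admissible copula parameter (assumed transforms)."""
    if family == "gaussian":
        return math.tanh(eta4)        # correlation in (-1, 1)
    if family == "clayton":
        return math.exp(eta4)         # theta > 0
    if family == "gumbel":
        return 1.0 + math.exp(eta4)   # theta > 1
    if family == "frank":
        return eta4                   # any nonzero real value
    raise ValueError(f"unknown copula family: {family}")


# Hypothetical predictor values, for illustration only.
print(survival_probability(0.3))
print(age_variance(math.log(169.0)))
print(copula_parameter(0.0, "gumbel"))
```

Working on the unconstrained scale lets a generic optimizer search over all real numbers while every candidate value still maps to a valid parameter.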
The model parameters were estimated simultaneously by maximizing the log-likelihood function, which for a sample of n independent observations is:

$$\ell = \sum_{i=1}^{n} \Big\{ \log P(Y_{1i} = y_{1i} \mid Y_{2i} = y_{2i}) + \log f_2(y_{2i}) \Big\},$$

where $P(Y_{1i} = y_{1i} \mid Y_{2i} = y_{2i})$ is the conditional probability from Equation (2). Parameter estimation was performed using numerical optimization. Full details of the likelihood derivatives and the regression formulation are provided in S1 Appendix.
Ethics statement
This study used publicly available, de-identified data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) project [28]. The dataset was accessed on August 24, 2023, via Kaggle (originally from the cBioPortal for Cancer Genomics). As this analysis utilized previously collected and fully anonymized data, no direct patient contact or intervention occurred, and no identifiable personal information was accessed. Therefore, institutional review board (IRB) approval and informed consent were not required. The original METABRIC study received ethics approval from relevant institutional review boards in the United Kingdom and Canada, as reported in [28].
Results and discussion
Descriptive analysis of clinical covariates
To provide an overview of the study population, we first conducted a descriptive analysis of the METABRIC cohort to characterize the demographic, clinical, and molecular features of patients with breast cancer. This summary lays the groundwork for understanding the variability within the dataset and supports the subsequent modeling and copula-based analyses. Table 1 summarizes the clinical and molecular characteristics of the METABRIC breast cancer cohort (N = 1,904). Patients were diagnosed at a mean age of 61 years (±13), with tumors typically measuring around 26 mm in size. On average, two lymph nodes were positive, though the distribution was highly skewed, reflecting a wide range of disease progression.
Most tumors were stage 2 (68.3%) and classified as high-grade (grade 3 in 52.5% of cases), indicating a tendency toward aggressive disease biology in this cohort. Cellularity was most frequently reported as high (52.2%), further supporting this pattern.
In terms of molecular markers, ER-positive status was the most common (76.6%), followed by PR-positive (53.0%) and HER2-positive (12.4%) tumors. These distributions are consistent with typical hormone-receptor-driven breast cancer. Treatment data showed that nearly two-thirds of patients received hormone therapy (61.7%), and a similar proportion underwent radiotherapy (59.7%). Chemotherapy was administered to approximately one-fifth of patients (20.8%).
Finally, integrative molecular clustering revealed substantial heterogeneity. IntClust 8 (15.2%) and IntClust 3 (14.8%) were the most frequent subtypes, reflecting the complex genomic architecture of breast cancer. Selected covariates are also visualized with bar graphs to examine how they vary by survival status and age group. Fig 2 displays both response variables, age at diagnosis (grouped) and overall survival, in a single graph. The graph shows that the proportion of deceased patients increases with age, most markedly in the 60–79 and 80–100 year groups, reflecting the impact of age on breast cancer prognosis.
Fig 2. Older patients show higher mortality, while survival is more common among younger groups, underscoring the strong association between advanced age and poorer survival outcomes.
The comparative analysis of clinical and molecular features by survival status is illustrated in Figs 3–6.
Fig 3. Chemotherapy, radiotherapy, and hormone therapy exhibit distinct patterns across deceased and surviving patients.
Fig 4. Higher cellularity appears more common among non-survivors, suggesting aggressive tumor phenotypes.
Fig 5. Invasive ductal/NST carcinoma dominates both groups.
Fig 6. Specific clusters show a higher mortality association, reflecting molecular heterogeneity.
Fig 3 illustrates the distribution of the major treatment types (chemotherapy, hormone therapy, and radiotherapy) by survival status. A larger proportion of patients who received chemotherapy or hormone therapy did not survive, while many survivors did not undergo radiotherapy. These patterns suggest that treatment response and survival outcomes vary across therapies, but they do not imply a direct causal relationship.
Fig 4 shows survival status across the three cellularity levels (low, moderate, and high). The tallest bars correspond to deceased patients with high and moderate cellularity, while the same two levels also account for the next-largest groups of surviving patients.
According to Fig 5, ductal/NST is by far the most frequent histologic subtype, both among patients who did not survive and among those who did.
Fig 6 shows that patients in integrative clusters “3” and “8” had lower survival rates, while those in cluster “4ER+” showed noticeably better survival outcomes.
Additional visualizations are provided in the Supporting Information (S2–S5 Figs) to complement the main descriptive findings. S2 Fig explores surgical type and tumor laterality by survival. S3 Fig highlights that higher tumor cellularity tends to occur among older patients, whereas S4 Fig illustrates age-related variation in integrative molecular clusters. Finally, S5 Fig presents histologic subtype distribution by age, confirming that ductal/NST carcinoma is the predominant histologic subtype across all groups.
Correlation structure of clinical covariates
The dependency structure among the METABRIC clinical variables was characterized using a mixed-type correlation matrix (Fig 7). Since the dataset included continuous, binary, and categorical variables, appropriate statistical measures were applied for each pair: Pearson’s correlation for continuous pairs, Cramér’s V for categorical pairs, and point-biserial correlations for continuous-binary pairs. For multi-level categorical variables paired with continuous measures, ANOVA-based approximations were used. The resulting symmetric matrix allows direct visual comparisons across variable types.
Fig 7. Correlations are computed using Pearson’s r, Cramér’s V, and point-biserial or ANOVA-based methods, depending on variable type. Strong associations are shown in red, while weaker or negative associations are displayed in purple or white.
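The pairwise measures used in the mixed-type correlation matrix can be sketched with the standard library alone. The example below covers Pearson's r, the point-biserial correlation (Pearson's r with a 0/1 variable), and Cramér's V without bias correction; the ANOVA-based approximation is not shown. Function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cramers_v(a, b):
    """Cramér's V from the chi-square statistic of the two-way contingency table."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    chi2 = 0.0
    for i, n_i in count_a.items():
        for j, n_j in count_b.items():
            expected = n_i * n_j / n
            chi2 += (joint.get((i, j), 0) - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical toy data, not METABRIC values.
age = [45.0, 52.0, 61.0, 70.0, 78.0, 83.0]
alive = [1, 1, 1, 0, 1, 0]
er_status = ["pos", "pos", "neg", "pos", "neg", "pos"]
surgery = ["bcs", "mast", "mast", "bcs", "mast", "bcs"]

print("point-biserial(age, alive):", round(pearson(age, alive), 3))
print("Cramér's V(er_status, surgery):", round(cramers_v(er_status, surgery), 3))
```

Note that Cramér's V is bounded between 0 and 1 and carries no sign, so cells based on it are not directly comparable with signed correlations when reading the matrix.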
As shown in Fig 7, several patterns emerged. Lymph node status and the Nottingham Prognostic Index (NPI) were strongly associated, as were tumor grade and NPI, which is expected since both contribute to the NPI calculation. Tumor size showed a moderate correlation with both NPI and lymph node status, reinforcing its role in disease progression.
In contrast, treatment-related variables (chemotherapy, hormone therapy, radiotherapy) exhibited weak correlations with prognostic indicators. Age at diagnosis showed minimal associations, with a weak inverse correlation with NPI and with overall survival, suggesting that age alone is not a strong driver of survival differences in this cohort. This correlation analysis supports the inclusion of both prognostic and biological factors in the joint model. It also demonstrates how a mixed correlation matrix helps identify overlapping variables and guide the selection of key predictors for copula-based regression. As part of this exploratory analysis, several candidate outcome pairings were examined. Joint associations between overall survival and tumor size, as well as between overall survival and the Nottingham Prognostic Index, were weak and close to zero. In contrast, the association between age at diagnosis and overall survival, while modest, was consistently non-negligible. This observation motivated the selection of the age–survival pairing for subsequent copula-based joint modeling.
Interpretation of copula model results
To jointly model the binary response variable overall survival (Y1) and the continuous response variable age at diagnosis (Y2), we fitted copula-based regression models using Gaussian, Gumbel, Frank, and Clayton copulas. The models account for dependencies between the two outcomes while allowing distinct marginal specifications: a probit model for Y1 and an identity link for Y2. The joint densities and contour plots of the fitted copulas are shown in Fig 8. As outlined in the Methods, this symmetric formulation is intended to quantify statistical association rather than causal relationships between the outcomes. Accordingly, age at diagnosis is treated as a joint outcome to characterize its empirical dependence with survival, not as a temporally causal response.
Fig 8. (A) Joint density surface of the Gaussian copula. (B) Contour plot of the Gaussian copula. (C) Joint density surface of the Gumbel copula. (D) Contour plot of the Gumbel copula. (E) Joint density surface of the Frank copula. (F) Contour plot of the Frank copula. (G) Joint density surface of the Clayton copula. (H) Contour plot of the Clayton copula. The copula parameter $\theta$ determines the shape of the joint density surface.
Table 2 summarizes the results of the bivariate regression models with Gaussian and Gumbel copula functions for the clinical variables of the METABRIC data. The Gaussian copula model revealed that age at diagnosis was negatively associated with survival probability (Estimate = −0.085, p < 0.001), indicating that younger patients were more likely to survive. In addition, ER-positive status was associated with a significant increase in survival (Estimate = 0.765, p < 0.001). However, PR status, HER2 status, and treatment-related covariates were not significant predictors. In the model for age, overall survival showed a strong inverse association (Estimate = −7.898, p < 0.001), with survivors being nearly 8 years younger on average. This association reflects the joint dependence structure captured by the copula and should not be interpreted as survival explaining or causing age. The Kendall’s $\tau$ implied by the Gaussian copula parameter suggests weak symmetric dependence between survival and age. The corresponding joint density and contour plots of the Gaussian copula (Fig 8, Panels A and B) confirmed an elliptical, symmetric structure without strong tail dependence.
The Gumbel model revealed a much stronger dependence structure in terms of Kendall’s $\tau$, with the dependence concentrated in the upper tail. This indicates that exceptionally favorable outcomes, being alive at follow-up and having a younger age, tend to co-occur more frequently than expected under symmetric dependence. Statistically, the model improved fit over the Gaussian (AIC: 11,927 vs. 12,458). Certain covariates, such as ER and PR status, tumor size, and the Nottingham Prognostic Index, were statistically significant for both response variables. In particular, tumor size emerged as a stronger predictor in the Gumbel model (Estimate = 0.005, p = 0.008).
Given the coding of the survival outcome (1 = alive), the positive tumor size coefficient indicates a higher modeled survival probability conditional on the joint dependence structure. This effect was not consistently observed across copula families (Tables 2–3), suggesting that it represents a model-specific conditional association rather than a robust marginal effect. The result is likely influenced by adjustment for correlated prognostic factors, such as treatment variables and the Nottingham Prognostic Index, together with the strong upper-tail dependence captured by the Gumbel copula. The joint density plot of the Gumbel copula (Fig 8, Panel C) and the contour plot (Panel D) revealed a concentration of mass in the upper-right quadrant, supporting the presence of upper-tail dependence. To investigate the dependence between overall survival and age at diagnosis further, we fitted Frank and Clayton copula models, both members of the Archimedean family, which is useful for modeling nonlinear and asymmetric dependencies.
In Table 3, the Frank copula model yielded a Kendall’s $\tau$ suggesting moderate symmetric dependence between survival and age. Unlike the Gaussian copula, Frank is more flexible around the mode, enabling better modeling of the joint distribution when the dependence is strongest in the central range of the data.
In the survival model, age at diagnosis was negatively associated with survival (Estimate = −0.0819, p < 0.001), reaffirming the strong clinical relationship between younger age and better survival. ER-positive status had a strong positive association (Estimate = 0.700, p < 0.001), while PR and HER2 statuses were significant but showed weak predictive value. In the second equation, survival status again showed a strong inverse relationship with age (Estimate = −8.53, p < 0.001), which should be interpreted as a symmetric association rather than reverse causality. Histologic grade 2 and ER/PR statuses remained significant. The density plot of the Frank copula (Fig 8, Panel E) revealed moderate clustering around the diagonal, consistent with central dependence. The contour plot (Panel F) showed symmetric contour lines that broaden from the centre, supporting the model’s flexibility across the joint range.
The Clayton copula yielded a Kendall’s $\tau$ indicating very strong lower-tail dependence. This suggests that unfavorable outcomes, namely older age at diagnosis and poor survival, tend to co-occur.
In the survival sub-model, age at diagnosis and ER status were again statistically significant. However, other covariates (PR status, HER2, and treatments) did not show significant associations. In the second equation, overall survival and ER status were significantly associated with age, whereas tumor size and lymph nodes did not contribute meaningfully. The density plot of the Clayton copula (Fig 8, Panel G) showed mass concentrated in the lower-left quadrant, highlighting dependence in the lower tail (older patients with poorer survival). The contour plot (Panel H) displayed asymmetric contour lines, with tighter spacing in the lower tail.
In summary, the Gumbel copula best captures favorable outcomes, not only in terms of goodness-of-fit and the lowest AIC and BIC, but also by exhibiting strong upper-tail dependence. The Clayton copula highlights joint risks in unfavorable outcomes, the Frank copula provides a balanced view of moderate symmetric dependence, and the Gaussian copula serves as a baseline model with limited ability to represent tail-dependent clinical patterns.
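The tail behavior contrasted in this summary can be quantified with the standard tail dependence coefficients of each family. The sketch below uses the well-known closed forms (Gumbel upper tail, Clayton lower tail, and zero for the Gaussian copula with correlation strictly inside (−1, 1) and for the Frank copula); the function names and the parameter values are hypothetical and are not the fitted estimates from this study.

```python
def upper_tail_dependence(family, theta):
    """Upper tail dependence coefficient for the four families considered."""
    if family == "gumbel":
        return 2.0 - 2.0 ** (1.0 / theta)   # theta >= 1
    if family in ("gaussian", "frank", "clayton"):
        return 0.0
    raise ValueError(f"unknown copula family: {family}")


def lower_tail_dependence(family, theta):
    """Lower tail dependence coefficient for the four families considered."""
    if family == "clayton":
        return 2.0 ** (-1.0 / theta)        # theta > 0
    if family in ("gaussian", "frank", "gumbel"):
        return 0.0
    raise ValueError(f"unknown copula family: {family}")


# Hypothetical parameter values, chosen only to show the contrast between families.
for family, theta in [("gaussian", 0.3), ("gumbel", 2.0), ("frank", 5.0), ("clayton", 2.0)]:
    print(
        f"{family}: lambda_U = {upper_tail_dependence(family, theta):.3f}, "
        f"lambda_L = {lower_tail_dependence(family, theta):.3f}"
    )
```

These coefficients make explicit why a symmetric Gaussian or Frank fit cannot represent joint extremes, regardless of how large the overall association is.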
Model diagnostics and comparison.
To evaluate whether the copula specification provides meaningful improvement over independent modeling, we compared the selected Gumbel copula model to separate marginal models fitted independently (probit for survival and Gaussian for age). The joint copula model demonstrated substantially better fit, with a likelihood ratio test strongly rejecting independence (df = 1, p < 0.0001). The AIC decreased from 17,080 under the independent specification to 14,911 for the Gumbel copula ($\Delta$AIC = 2,168), and the BIC decreased from 17,142 to 15,058 ($\Delta$BIC = 2,084). These large information-criterion differences provide overwhelming evidence that explicitly modeling dependence between survival and age improves overall model fit. Goodness-of-fit of the marginal distributions was further evaluated using Probability Integral Transform (PIT) diagnostics. Fig 9 presents the PIT results for both margins. For the continuous age margin, the PIT histogram displayed approximate uniformity (Kolmogorov–Smirnov test: D = 0.027, p = 0.126; Panel B), supporting the adequacy of the Gaussian marginal specification.
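The likelihood ratio and information-criterion comparison in this subsection follows a standard recipe, sketched below for a test with one degree of freedom. The log-likelihood values, parameter counts, and function names are hypothetical placeholders, not the fitted values from this study; the p-value uses the identity that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math


def lr_test_df1(loglik_null, loglik_alt):
    """Likelihood ratio statistic and chi-square (df = 1) p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    if stat <= 0.0:
        return stat, 1.0
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def aic(loglik, n_params):
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical log-likelihoods: independent margins (null) vs copula model (one extra parameter).
ll_independent, k_independent = -1250.0, 10
ll_copula, k_copula = -1240.0, 11
n_obs = 1904

stat, p_value = lr_test_df1(ll_independent, ll_copula)
print(f"LR = {stat:.2f}, p = {p_value:.2e}")
print(f"delta AIC = {aic(ll_independent, k_independent) - aic(ll_copula, k_copula):.2f}")
print(f"delta BIC = {bic(ll_independent, k_independent, n_obs) - bic(ll_copula, k_copula, n_obs):.2f}")
```

With a single extra parameter, the AIC difference equals the likelihood ratio statistic minus 2, so the two criteria necessarily point in the same direction when the statistic is large.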
Fig 9. (A) Randomized PIT residuals for the binary survival margin. The histogram shows approximate uniformity (Kolmogorov–Smirnov test: D = 0.022, p = 0.301), indicating adequate probit specification. (B) PIT residuals for the continuous age margin (KS test: D = 0.027, p = 0.126), supporting the Gaussian marginal assumption. (C) QQ plot against the uniform distribution for the survival margin. (D) QQ plot against the uniform distribution for the age margin. The dashed red line represents the uniform reference distribution in all panels.
For the binary survival margin, a randomized PIT approach was used to account for discreteness [33]. The resulting randomized PIT values were approximately uniform (KS test: D = 0.022, p = 0.301; Panel A), indicating adequate specification of the probit marginal model. The corresponding QQ plots (Panels C and D) show close agreement with the uniform reference distribution.
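The PIT checks for both margins can be sketched as follows. For a continuous outcome the PIT is the fitted CDF evaluated at the observation. For a binary outcome the sketch draws a uniform value between F(y - 1) and F(y), which is the general idea behind randomized PIT. The helper names, the simulated data, and the constant-variance Gaussian margin are assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def norm_cdf(x: float, mu: float, sigma: float) -> float:
    """Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def pit_continuous(y, mu, sigma):
    """PIT values F(y_i) under a Gaussian margin with fitted means mu_i."""
    return [norm_cdf(v, m, sigma) for v, m in zip(y, mu)]


def randomized_pit_binary(y, p1, rng):
    """Randomized PIT for a binary outcome with fitted P(Y = 1) = p1_i.

    The CDF steps are F(-1) = 0, F(0) = 1 - p1, F(1) = 1; each value is
    drawn uniformly on [F(y - 1), F(y)].
    """
    out = []
    for yi, pi in zip(y, p1):
        lo, hi = (0.0, 1.0 - pi) if yi == 0 else (1.0 - pi, 1.0)
        out.append(lo + rng.random() * (hi - lo))
    return out


def ks_uniform(u):
    """Kolmogorov-Smirnov distance D between the sample and Uniform(0, 1)."""
    s = sorted(u)
    n = len(s)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(s))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 2000
    # Simulated data from correctly specified margins (hypothetical values).
    p1 = [0.2 + 0.6 * rng.random() for _ in range(n)]
    surv = [1 if rng.random() < p else 0 for p in p1]
    mu = [60.0] * n
    age = [rng.gauss(m, 13.0) for m in mu]

    print("KS D, survival margin:", ks_uniform(randomized_pit_binary(surv, p1, rng)))
    print("KS D, age margin:", ks_uniform(pit_continuous(age, mu, 13.0)))
```

When the margins are correctly specified, both sets of PIT values are uniform on (0, 1) and D is small. A large D would point to a misspecified margin.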
Collectively, these diagnostics confirm that both marginal models are appropriately specified and that the copula framework captures additional dependence structure beyond what independent margins can represent.
Conclusion
This study applied a copula-based modeling framework to jointly analyze survival status and age at diagnosis in breast cancer patients from the METABRIC cohort. The approach flexibly captured dependencies between the chosen pair of outcomes that traditional models fail to capture. Crucially, the copula formulation is symmetric and estimates statistical association rather than causal effects, addressing a common misinterpretation in joint models of temporally ordered outcomes. Among the copulas applied, the Gumbel copula showed the best fit to the data (as supported by AIC and BIC), effectively capturing the upper-tail dependence between younger age and better survival, while the Clayton copula highlighted the co-occurrence of poorer survival and older age through lower-tail dependence. Formal model comparison against an independent-margins baseline confirmed that accounting for dependence via a copula significantly improves model fit (likelihood ratio test: df = 1, p < 0.0001), and PIT diagnostics validated the adequacy of both marginal specifications.
The findings highlight that key clinical covariates, such as ER status, hormone therapy, and the Nottingham Prognostic Index, have a significant influence on the joint distribution of outcomes, whereas others (e.g., HER2 status or chemotherapy) showed limited or no association.
This study makes two key contributions: (i) it demonstrates the practical application of copula regression for mixed binary–continuous outcomes in a large, real-world cancer cohort, and (ii) it provides a reproducible template for model selection, diagnostic evaluation, and interpretation that can be adapted to other clinical settings.
Overall, this framework provides a robust and interpretable approach to exploring complex relationships in clinical data, and can be extended to include more outcomes or high-dimensional health studies. The analytical code and model specification details are provided in the Supporting Information to facilitate replication and extension by other researchers.
Supporting information
S1 Table. Comparison of AIC and BIC values for candidate marginal distributions of Age.
Normal, Gamma, Log-normal, Logistic, and Inverse-Gamma distributions were fitted to the data. The Normal distribution provided the best fit based on AIC and BIC.
https://doi.org/10.1371/journal.pone.0346495.s001
(PDF)
S1 Fig. Goodness of fit plots for fitted distributions to Age (Y2).
Normal, Gamma, Log-normal, Logistic, and Inverse-Gamma distributions were fitted to the data. Visual inspection confirmed the Normal distribution as the most appropriate choice.
https://doi.org/10.1371/journal.pone.0346495.s002
(TIFF)
S2 Fig. Type of breast surgery and tumor laterality by survival status.
Combined bar plots depict the distribution of breast-conserving surgery versus mastectomy and tumor laterality (left vs. right breast) among surviving and deceased patients. Distinct patterns suggest potential associations between surgical choice, tumor location, and patient survival.
https://doi.org/10.1371/journal.pone.0346495.s003
(TIFF)
S3 Fig. Tumor cellularity across age groups.
Grouped bar charts show how cellularity (low, moderate, high) varies across age categories. Older patients more frequently exhibited higher cellularity levels, indicative of more aggressive disease profiles.
https://doi.org/10.1371/journal.pone.0346495.s004
(TIFF)
S4 Fig. Integrative molecular clusters across age groups.
Bar plots demonstrate how integrative molecular cluster frequencies vary across age categories. Differences in cluster prevalence with age suggest underlying biological and genomic heterogeneity within the cohort.
https://doi.org/10.1371/journal.pone.0346495.s005
(TIFF)
S5 Fig. Histologic subtype distribution across age groups.
The distribution of ductal/NST, lobular, and mixed histologic subtypes is shown by age category. Ductal/NST carcinoma remains predominant in all age groups.
https://doi.org/10.1371/journal.pone.0346495.s006
(TIFF)
S1 Appendix. Comprehensive Technical Details for the Copula Models.
This appendix provides the full technical derivation and computational details for the mixed binary-continuous copula models described in the main text.
https://doi.org/10.1371/journal.pone.0346495.s007
(PDF)
References
- 1. Sklar M. Fonctions de répartition à n dimensions et leurs marges. Annales de l’ISUP. 1959;8(3):229–31.
- 2. Nelsen RB. Methods of constructing copulas. An introduction to copulas. 2006. p. 51–108.
- 3. Embrechts P, McNeil A, Straumann D. Correlation and dependence in risk management: properties and pitfalls. Risk management: value at risk and beyond. 2002. p. 176–223.
- 4. Frees EW, Valdez EA. Understanding Relationships Using Copulas. North Am Act J. 1998;2(1):1–25.
- 5. Patton AJ. Modelling asymmetric exchange rate dependence. Int Econ Rev. 2006;47(2):527–56.
- 6. Clementi F, Gianmoena L. Modeling the joint distribution of income and consumption in Italy: A copula-based approach with κ-generalized margins. Introduction to Agent-Based Economics. Elsevier. 2017. p. 191–228.
- 7. Salvadori G, De Michele C. On the Use of Copulas in Hydrology: Theory and Practice. J Hydrol Eng. 2007;12(4):369–80.
- 8. Zhang Q, Li J, Singh VP, Xu C. Copula‐based spatio‐temporal patterns of precipitation extremes in China. Int J Climatol. 2013;33(5):1140–52.
- 9. Zhang D, Chen P, Zhang Q, Li X. Copula-based probability of concurrent hydrological drought in the Poyang lake-catchment-river system (China) from 1960 to 2013. J Hydrol. 2017;553:773–84.
- 10. Zou Y, Ye X, Henrickson K, Tang J, Wang Y. Jointly analyzing freeway traffic incident clearance and response time using a copula-based approach. Transp Res Part C Emerg Technol. 2018;86:171–82.
- 11. Larsen R, Mjelde JW, Klinefelter D, Wolfley J. The use of copulas in explaining crop yield dependence structures for use in geographic diversification. Agric Finance Rev. 2013;73(3):469–92.
- 12. Brown PH, Theoharides C. Health-seeking behavior and hospital choice in China’s New Cooperative Medical System. Health Econ. 2009;18 Suppl 2:S47-64. pmid:19551751
- 13. Song PXK, Li M, Yuan Y. Joint modeling of correlated survival data using copulas. Biometrics. 2009;65(3):688–96.
- 14. Kim J-M, Jung Y-S, Sungur EA, Han K-H, Park C, Sohn I. A copula method for modeling directional dependence of genes. BMC Bioinform. 2008;9:225. pmid:18447957
- 15. Farnoudkia H, Purutcuoglu V. Vine copula graphical models in the construction of biological networks. Hacet J Math Stat. 2021;50(4):1172–84.
- 16. Gasparini A, Humphreys K. A natural history and copula-based joint model for regional and distant breast cancer metastasis. Stat Methods Med Res. 2022;31(12):2415–30. pmid:36120891
- 17. Emura T, Chen YH. Analysis of survival data with dependent censoring: copula-based approaches. Vol. 450. Springer; 2018.
- 18. de Leon AR, Wu B. Copula-based regression models for a bivariate mixed discrete and continuous outcome. Stat Med. 2011;30(2):175–85. pmid:20963753
- 19. Kolev N, Paiva D. Copula-based regression models: A survey. J Stat Plann Infer. 2009;139(11):3847–56.
- 20. Trivedi PK, Zimmer DM. Copula Modeling: An Introduction for Practitioners. Found Trends Econom. 2007;1(1):1–111.
- 21. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49.
- 22. Anders CK, Hsu DS, Broadwater G, Acharya CR, Foekens JA, Zhang Y, et al. Young age at diagnosis correlates with worse prognosis and defines a subset of breast cancers with shared patterns of gene expression. J Clin Oncol. 2008;26(20):3324–30. pmid:18612148
- 23. Allemani C, Matsuda T, Di Carlo V, Harewood R, Matz M, Nikšić M, et al. Global surveillance of trends in cancer survival 2000-14 (CONCORD-3): analysis of individual records for 37 513 025 patients diagnosed with one of 18 cancers from 322 population-based registries in 71 countries. Lancet. 2018;391(10125):1023–75. pmid:29395269
- 24. Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73(1).
- 25. Joe H. Multivariate models and multivariate dependence concepts. CRC Press; 1997.
- 26. Marra G, Radice R. Bivariate copula additive models for location, scale and shape. Computat Stat Data Analys. 2017;112:99–113.
- 27. Marra G, Radice R. Generalised joint regression modelling. R Package. 2017.
- 28. Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–52. pmid:22522925
- 29. Nelsen RB. An Introduction to Copulas. Springer Science & Business Media; 2007.
- 30. Klein N, Kneib T, Marra G, Radice R, Rokicki S, McGovern ME. Mixed binary-continuous copula regression models with application to adverse birth outcomes. Stat Med. 2019;38(3):413–36. pmid:30334275
- 31. Nelsen RB. Archimedean copulas. An introduction to copulas. 2006. p. 109–55.
- 32. Frank MJ. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aeq Math. 1978;18(1–2):266–7.
- 33. Czado C, Gneiting T, Held L. Predictive model assessment for count data. Biometrics. 2009;65(4):1254–61. pmid:19432783