A streamlined model for use in clinical breast cancer risk assessment maintains predictive power and is further improved with inclusion of a polygenic risk score

Five-year absolute breast cancer risk prediction models are required to comply with national guidelines regarding risk reduction regimens. Models including the Gail model are under-utilized in the general population for various reasons, including difficulty in accurately completing some clinical fields. The purpose of this study was to determine if a streamlined risk model could be designed without substantial loss in performance. Only the clinical risk factors that were easily answered by women were retained and combined with an objective, validated polygenic risk score (PRS), with the ultimate aim of improving overall compliance with professional recommendations. We first reviewed a series of 2,339 Caucasian, African American and Hispanic women from the USA who underwent clinical testing. We used deidentified test request forms to identify the clinical risk factors that were best answered by women in a clinical setting and then compared the 5-year risks for the full model and the streamlined model in this clinical series. We used OPERA analysis on previously published case-control data from 11,924 Gail model samples to determine the clinical risk factors to include in a streamlined model, first-degree family history and age, which could then be combined with the PRS. Next, to ensure that the addition of the PRS to the streamlined model was indeed beneficial, we compared risk stratification using the Streamlined model with and without PRS for the existing case-control datasets comprising 1,313 cases and 10,611 controls of African-American (n = 7,421), Caucasian (n = 1,155) and Hispanic (n = 3,348) women, using the area under the curve to determine model performance. The improvement in risk discrimination from adding the PRS to the Streamlined model was 52%, 46% and 62% for African-American, Caucasian and Hispanic women, respectively, based on changes in log OPERA.
There was no statistically significant difference in mean risk scores between the Gail model plus PRS and the Streamlined model plus PRS. This study demonstrates that a validated PRS can be used to streamline a clinical test for primary care practice without diminishing test performance. Importantly, by eliminating risk factors that women find hard to recall or that require obtaining medical records, this model may facilitate increased clinical adoption of 5-year breast cancer risk prediction tests in keeping with national standards and guidelines for breast cancer risk reduction.


Introduction
Apart from non-melanoma skin cancer, breast cancer is the most common form of cancer affecting women, and approximately one in eight women in the United States of America (U.S.A) will develop the disease in their lifetime [1]. In 2019, an estimated 268,000 U.S. women were diagnosed with invasive breast cancer and approximately 41,000 died as a result. There is, therefore, a need to identify which women are more likely to develop sporadic disease, so as to best apply measures to prevent it.
For women who are not initially identified as at high risk based on previous personal history of breast cancer, or family history suggestive of germline pathogenic mutations, the U.S. Preventive Services Task Force (USPSTF) [2], the American Society of Clinical Oncology (ASCO) [3], and the National Comprehensive Cancer Network (NCCN) [4] all have guidelines stating that these women should be screened to determine their five-year risk of developing breast cancer and offered risk-reducing medications, if appropriate. This is a responsibility that falls upon the woman's primary care health professional, as these patients would not fall under the initial 'high risk' rubric. The USPSTF uses a 5-year high-risk threshold of 3% and gives a grade B recommendation ('offer or provide this service') that providers offer tamoxifen or raloxifene to women above this 3% threshold [2], while ASCO and NCCN use a lower 5-year high-risk threshold of 1.67% and, in addition to tamoxifen and raloxifene, provide the option of aromatase inhibitors [3,4].
Approximately 10 million women in the U.S.A are eligible for breast cancer risk-reducing medication [5]. Although uptake of risk-reducing medications has reportedly been low [6,7], the larger problem is that the majority of eligible women are not being assessed for their risk. Multiple tools are available to provide the risk assessment, including the Gail model [8,9] and the Breast Cancer Surveillance Consortium Risk Calculator [10], but they tend to be underutilized for various reasons [11], including lack of routine risk assessment and time constraints. Furthermore, there is often a failure to complete risk factor questionnaires, especially for the Gail model [12]. The overall result is that many women (and their physicians) may be unaware of their risk of developing breast cancer and that preventive options are available. These options are not just risk-reducing medications, but also include increased surveillance and lifestyle modifications, such as reducing alcohol consumption, increasing exercise, and maintaining a healthy body weight.
Indeed, whilst 96% of physicians agree that assessing breast cancer risk is a primary care provider's responsibility, 76% never calculate a Gail score [13]. Surprisingly, over 70% of internal medicine residents reported no knowledge of the Gail model [14]. In real-world clinical practice, a "streamlined model" relying only on a woman's age and family history is likely being used subconsciously by the physician in the absence of a full model. Use of the Gail model can also be problematic because of its reliance on women remembering clinical information that may have occurred many years prior. In addition, the question of whether the woman has had at least one breast biopsy with atypical hyperplasia requires a level of understanding of medical terminology that most patients would not have. Notably, women with atypical hyperplasia are known to be at increased risk of breast cancer based on their biopsy status alone, and their risk score is actually underestimated by the Gail model [15].
With the advent of genome wide association studies (GWAS), researchers have identified single nucleotide polymorphisms (SNP) that are risk markers for breast cancer that are independent of clinical risk factors [16,17]. In the same paper that validated the SNP set used for risk assessment, the authors found that just using two easily accessible clinical risk factors, that of family history and age, combined with this SNP set provided a superior risk assessment model [18]. However, this paper focused on a 10-year score, while the US guidelines are built around a 5-year risk score assessment. Based on the above, a compelling case could be made for developing a more streamlined 5-year risk model that would help providers be compliant with national standards, using a similar model of limited but proven risk factors and risk SNP.
Therefore, the purpose of this study was twofold: (1) To develop a streamlined 5-year risk assessment tool based on validated risk SNP (PRS) that incorporates clinical risk factors from the Gail model that are readily available and important to risk prediction and (2) to evaluate the strength of this test's 5-year risk prediction capabilities.

Clinical review sample
De-identified test request forms from 2,339 African American, Caucasian, and Hispanic U.S. women who had been tested with the BREVAGenplus (Phenogen Sciences) commercial breast cancer risk assessment test between October 2014 and October 2016 were reviewed to determine how many of the Gail model questions were not answered or answered as "unknown". This analysis was reviewed by an independent institutional review board and deemed exempt (Quorum Review IRB). For model comparison, we removed the patient samples with unknown family history results (n = 57), for a final sample size of 2,282. Sample characteristics and summary can be found in S1 and S2 Tables, respectively.

Model validation sample
An independent dataset consisting of a total of 1,313 cases and 10,611 controls from two different cohorts was used to compare the performance of the two models. Details of the 7,421 African American (416 cases/7,005 controls), 1,155 Caucasian (750 cases/405 controls), and 3,348 Hispanic (147 cases/3,201 controls) women used in the risk discrimination analyses are described elsewhere [19,20]. Briefly, the Caucasian women were identified from the Australian site of the Breast Cancer Family Registry and the African American and Hispanic women were identified from the Women's Health Initiative (WHI) SNP Health Association Resource (SHARe). Women with unknown family history were removed from the analysis. Our SNP were validated in African American and Hispanic women [19]; however, population-specific SNP improvements need to be made in future studies for a more robust model, because the majority of the SNP panels were discovered in European ancestry populations. The Gail model, or the NCI's Breast Cancer Risk Assessment Tool (BCRAT), is a well-established risk prediction tool that incorporates age, age at menarche, age at first live birth, first-degree family history, and biopsy status including presence of atypical hyperplasia. This clinical gold standard was the model upon which the initial commercial clinical test was built: Gail+PRS [19,20].

Polygenic risk score and combined risk score
Using the approach of Mealiffe et al. [21], we calculated a PRS, a SNP-based (relative) risk score, using previously published estimates of the odds ratio (OR) per allele and risk allele frequency (p), assuming independent and additive risks on the log OR scale (S3-S5 Tables). For each SNP, we calculated the unscaled population average risk as $\mu = (1 - p)^2 + 2p(1 - p)\,\mathrm{OR} + p^2\,\mathrm{OR}^2$. Adjusted risk values (with a population average risk equal to 1) were calculated as $1/\mu$, $\mathrm{OR}/\mu$ and $\mathrm{OR}^2/\mu$ for the three genotypes defined by number of risk alleles (0, 1, or 2). The overall SNP-based risk score was then calculated by multiplying the adjusted risk values for each of the 70+ SNP.
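The per-SNP scaling and the multiplicative combination described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the example odds ratios and allele frequencies are hypothetical, and only the formulas for the population average risk and the adjusted risk values come from the text.

```python
from math import prod

def adjusted_risk(odds_ratio, risk_allele_freq, n_risk_alleles):
    """Adjusted risk value for one SNP, scaled so the population average is 1.

    n_risk_alleles is the genotype: 0, 1, or 2 copies of the risk allele.
    """
    p = risk_allele_freq
    # Unscaled population average risk under Hardy-Weinberg proportions.
    mu = (1 - p) ** 2 + 2 * p * (1 - p) * odds_ratio + p ** 2 * odds_ratio ** 2
    return odds_ratio ** n_risk_alleles / mu

def polygenic_risk_score(snps):
    """Multiply adjusted risk values over SNPs.

    snps is an iterable of (odds_ratio, risk_allele_freq, n_risk_alleles).
    """
    return prod(adjusted_risk(o, p, g) for o, p, g in snps)

# Hypothetical three-SNP panel, for illustration only.
example_panel = [(1.26, 0.38, 2), (1.10, 0.25, 1), (0.95, 0.52, 0)]
example_prs = polygenic_risk_score(example_panel)
```

Because each SNP is scaled by its own population average, a score of 1 corresponds to average polygenic risk, and the product form reflects the assumption of independent, additive effects on the log OR scale.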
We created a five-year absolute clinical risk of breast cancer based on published relative risks for having an affected first-degree relative [22], taking into account the competing risk of dying from other causes. Ethnic-specific breast cancer incidence and competing mortality data were derived from the U.S.A SEER database (SEER 2013 Research Data). Absolute 5-year risk is calculated from the following quantities:
cumul_b is the cumulative risk at baseline, $\text{cumul\_b} = 1 - e^{-\text{fh} \times \text{snp} \times \text{incid\_b}}$;
cumul_b_5 is the cumulative risk at baseline plus 5 years, $\text{cumul\_b\_5} = 1 - e^{-\text{fh} \times \text{snp} \times \text{incid\_b\_5}}$;
incid_b is the cumulative incidence of breast cancer from birth to baseline;
incid_b_5 is the cumulative incidence of breast cancer from birth to baseline plus 5 years;
mortsurv_5 is survival from baseline age to baseline age plus 5 years;
fh is the family history relative risk;
snp is the SNP-based relative risk score calculated using the method of Mealiffe et al. [21].
In developing a streamlined Gail model, we elected to retain only a patient's age and first-degree family history of breast cancer, which do not require recall of events over long periods of time.
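As an illustration of how these quantities fit together, the following Python sketch computes the two cumulative risks exactly as defined above. The final combining step is an assumption: the expression used here (risk accrued between baseline and baseline plus 5 years, conditional on being unaffected at baseline, multiplied by survival from other causes) is one plausible form and should not be read as the formula used in this study. The function names and input values are hypothetical.

```python
from math import exp

def cumulative_risk(fh, snp, incid):
    """Cumulative risk 1 - exp(-fh * snp * incid), as defined in the text."""
    return 1.0 - exp(-fh * snp * incid)

def five_year_risk(fh, snp, incid_b, incid_b_5, mortsurv_5):
    """Sketch of an absolute 5-year risk.

    ASSUMPTION: the combining step below is illustrative only; the
    definitions of cumul_b and cumul_b_5 are taken from the text.
    """
    cumul_b = cumulative_risk(fh, snp, incid_b)
    cumul_b_5 = cumulative_risk(fh, snp, incid_b_5)
    # Risk accrued over the 5-year window among women unaffected at baseline,
    # scaled by the probability of surviving competing causes of death.
    return (cumul_b_5 - cumul_b) / (1.0 - cumul_b) * mortsurv_5

# Hypothetical inputs: average polygenic risk, one affected first-degree relative.
example_risk = five_year_risk(fh=2.0, snp=1.0, incid_b=0.020,
                              incid_b_5=0.030, mortsurv_5=0.98)
```

Under this assumed form, the family history relative risk and the SNP-based score both act multiplicatively on the underlying incidence, so a woman with fh = 1 and snp = 1 receives the population-average risk for her age and ethnicity.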
Combined absolute five-year risk scores based on Gail model plus PRS or the Streamlined model plus PRS, were calculated as previously described [19,20].

Statistical analysis
For the series of test request forms, we used descriptive statistics to assess the completeness of the data (Table 1). Comparative analyses of five-year risk estimates between the Gail model plus PRS and the Streamlined model plus PRS were performed using the log transformation of the risk scores from the aforementioned commercial clinical samples (n = 2,282).
For the case-control data, we used logistic regression to estimate the change in log odds per adjusted standard deviation (OPERA) for log-transformed age-adjusted five-year risks [23]. We used the log OPERA and the area under the receiver operating characteristic curve (AUC) to assess risk discrimination. All tests were two-sided and p-values <0.05 were considered nominally statistically significant. Stata Release 14 [24] was used for all statistical analyses.
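For readers less familiar with the AUC, the short Python sketch below computes it from its rank-based definition: the probability that a randomly chosen case has a higher risk score than a randomly chosen control, with ties counted as one half. It is an illustration only; the analyses in this study were run in Stata, and the function name and scores here are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0      # case correctly ranked above control
            elif case == control:
                wins += 0.5      # ties contribute one half
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical 5-year risk scores (as proportions) for two cases and two controls.
example_auc = auc([0.2, 0.6], [0.1, 0.4])
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in sample size, which is acceptable for illustration but slower than a rank-based implementation on datasets of the size analysed here.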

Clinical commercial sample study discovery
The extent to which Gail model questions were not answered or answered as "unknown" was assessed to determine how often data were missing from the test inputs. Our data set of the aforementioned 2,339 women indicates that approximately 16% of all answers relating to the Gail model were not answered, or answered as "unknown", as part of their risk testing. The most commonly unanswered question was age at menarche, with 4.4% of women being unable to provide an answer (Table 1). The second most commonly unanswered question (or answered "unknown") related to whether the patient had at least one biopsy with atypical hyperplasia (Table 1). There was no missing information for age and ethnicity.
Based on the above, we developed a Streamlined Model, requiring only the patient's age and first-degree family history of breast cancer, with the assumption that these do not require long term recall (for example, age of first menses), nor access to medical records (number of biopsies and/or diagnosis of atypical hyperplasia).
A comparative analysis of five-year risk estimates between the Gail model plus PRS and the Streamlined model plus PRS was performed using the 2,339 commercial samples (excluding patients for whom the first-degree relative response was missing or unknown, n = 57) (Fig 1). A two-tailed t-test of the log-transformed absolute 5-year risk scores indicated no significant difference in mean risk scores between the Gail model plus PRS and the Streamlined model plus PRS (n = 2,282; P = 0.8441) (Fig 1).

Case-control study validation
Using a combined multi-ethnic case-control dataset, we assessed the extent to which adding a PRS to the Streamlined model improved breast cancer risk prediction compared with predictions using the Streamlined model alone. [...] [26,27]. The next step is to integrate these markers into a risk model and cross-validate it in an independent cohort. As collaborative consortia continue to expand, it will be imperative to include other and mixed genetic ancestries more accurately. The increments in AUCs are similar to the increments in log OPERAs, and showed that the improvements in risk prediction from including the PRS were approximately 7.5%, 8.5% and 9.1% for African Americans, Caucasians, and Hispanics, respectively, for the AUCs and 52%, 46% and 62% for the log OPERAs.

Conclusions
Breast cancer PRS is an underutilized risk factor that can add value to the clinical implementation of risk assessment in the general population. In an effort to improve the clinical application of breast cancer risk scores in the general population, we have streamlined a clinical risk model (Gail) by retaining just age and whether a first-degree relative has breast cancer, and then including PRS, consistent with the approach taken by Mavaddat et al. [18]. These clinical variables are important risk factors for breast cancer, as cancer risk increases substantially with age [1] and the presence of a first-degree relative with breast cancer is associated with an approximate doubling of a woman's risk [22]. Given the lack of full breast cancer risk assessment at the primary care level [13], physicians are presumably relying on only age and first-degree family history of breast cancer as a subconscious measure of risk for their patient, similar to this Streamlined model. Importantly, we focus on the 5-year risk score and not the lifetime risk score with this Streamlined model because it is well established in clinical recommendations that the Gail model does not include enough family history to appropriately determine lifetime risk for screening guidance [4,28,29]. Furthermore, the majority of women diagnosed with sporadic breast cancer typically have little or no family history. Therefore, most women in the general population will have a lifetime risk score that will never surpass the 20% threshold of actionable risk based on PRS and age alone [30]. We have added the polygenic risk from over 70 SNP to the Streamlined model to provide an absolute five-year breast cancer risk prediction that improves performance beyond these simple clinical risk factors alone.
In terms of differentiating women who will develop breast cancer from those who will not, adding a PRS to the Streamlined model is on average 53% better than the Streamlined model alone across the three ethnicities in this study. Our OPERA and ROC analyses indicate an average 8.4% increase in AUC and 50% increase in log OPERA values with the addition of PRS to the Streamlined model. Clearly, the more information that is incorporated into a risk model, the more accurate that model will be. However, model accuracy is dependent on the accuracy of the input. Of interest, our data suggest that questions such as age at menarche and age at first live birth make a low relative contribution to the overall risk score. When looking at 5-year risk scores, the predictive ability of two important, common clinical risk factors plus PRS (Streamlined model plus PRS) is similar to that of the Gail model plus PRS, with a reduction in the mean AUC of only 0.02. Thus, our data indicate that reducing the Gail questionnaire to only two clinical variables maintains the integrity of a breast cancer risk prediction algorithm in a clinical setting when PRS is included. This streamlined questionnaire could make it easier for physicians to administer an absolute five-year risk assessment for the majority of women who do not meet other high-risk criteria.
Our Streamlined model plus PRS is designed for assessing 5-year breast cancer risk in women who are not yet categorized as "high risk." Because the Gail model has previously been shown to underestimate risk in women with atypia [15,31], we did not include that clinical factor in our Streamlined model plus PRS. Furthermore, we acknowledge that NCCN guidelines suggest women with atypical hyperplasia are categorized as high risk based on biopsy-confirmed atypia alone [4]. Interestingly, we observed a statistically significant difference (p<0.005) between PRS from commercial patients with atypical hyperplasia (n = 112) versus patients with no biopsy history (n = 1503; S1 and S2 Tables). This suggests a possible modest association between PRS and atypical hyperplasia that could be further exploited to improve risk assessment prior to the point of biopsy.
As PRS continue to improve, so will the breast cancer risk assessment models. There exists solid evidence on additional SNP that could further improve the Streamlined model plus PRS [30,32]. Unfortunately, due to the clinical restrictions on our commercial samples, we did not have the ability to retrospectively assess alternative SNP to the 77 initially included in our interrogation.
By increasing breast cancer risk assessment of the general population, physicians can increase patient breast cancer awareness and identify those patients at increased risk of breast cancer enabling a more proactive breast health management, potentially improving compliance with current guidance on risk reduction [2].
Supporting information S1 Table. These