Peripheral blood transcriptome identifies high-risk benign and malignant breast lesions

Background Peripheral blood transcriptome profiling is a potentially important tool for disease detection. We utilize this technique in a case-control study to identify candidate transcriptomic biomarkers able to differentiate women with breast lesions from normal controls. Methods Whole blood samples were collected from 50 women with high-risk breast lesions, 57 with breast cancers and 44 controls (151 samples). Blood gene expression profiling was carried out using microarray hybridization. We identified blood gene expression signatures using AdaBoost, and constructed a predictive model differentiating breast lesions from controls. Model performance was then characterized by AUC sensitivity, specificity and accuracy. Biomarker biological processes and functions were analyzed for clues to the pathogenesis of breast lesions. Results Ten gene biomarkers were identified (YWHAQ, BCLAF1, WSB1, PBX2, DDIT4, LUC7L3, FKBP1A, APP, HERC2P2, FAM126B). A ten-gene panel predictive model showed discriminatory power in the test set (sensitivity: 100%, specificity: 84.2%, accuracy: 93.5%, AUC: 0.99). These biomarkers were involved in apoptosis, TGF-beta signaling, adaptive immune system regulation, gene transcription and post-transcriptional protein modification. Conclusion A promising method for the detection of breast lesions is reported. This study also sheds light on breast cancer/immune system interactions, providing clues to new targets for breast cancer immune therapy.


Materials and methods
This study was approved by the Ethics Committee of the Qingdao Central (Tumor) Hospital (IRB no. KY-P201803601) on January 30 th 2019. Participants were recruited to this study from January 31 st 2019 to June 30 th 2019. Sample acquisition was conducted between January 31 st 2019 and June 30 th 2019 at the Qingdao Central (Tumor) Hospital. 151 participants were enrolled, including 44 healthy controls and 107 patients with breast lesions (50 high risk lesions and 57 breast cancer). Written informed consent was obtained from all study participants and approved by the Ethics Committee of Qingdao Central (Tumor) Hospital. All authors in this manuscript had access to individual participants' information and medical records, and data was scrubbed after information collection.
A total of 107 blood samples from patients with breast lesions was obtained. The study population comprised 107 female adult patients (age range, 23-78 years; mean age: 50.6 ± 11.2 years), including 50 women with high-risk breast lesions and 57 breast cancer patients. All patients were recruited before they had undergone any form of treatment, including endocrinotherapy, radio/chemo-therapy, targeted therapy or surgery. The breast lesion cohorts were categorized according to pathological examination. All patients underwent mammography or ultrasound, and the results were analyzed and categorized according to the Breast Imaging Reporting and Data System (BI-RADS) Grades [22]. In cases where the grades of mammography and ultrasound were inconsistent, the higher grade was adopted. High-risk lesions were defined as BI-RADS Grades 3 to 5 with no evidence of cancer at biopsy.

Blood collection, RNA isolation and RNA quality control
Blood samples (2.5 ml) were drawn using PaxGene Blood RNA tubes (PreAnalytix GmbH, Hombrechtikon, Switzerland) and total RNA was then isolated as described in a previous publication [11]. The integrity of the purified RNA was accessed by 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Inc., Santa Clara, CA, USA) and the quantity of RNA was assessed by NanoDrop 1000 UV-Vis spectrophotometer (Thermo Fisher Scientific, Inc. Waltham, MA, USA). All RNA samples were assessed by RNA integrity number �7�0 and 28S:18S rRNA�1.0.

Microarray hybridization and microarray data analysis
The gene expression profiles of all 151 samples, including 44 normal controls, 50 high-risk breast lesions and 57 breast cancer, were characterized by microarray hybridization as per the manufacturer's protocol (Gene Profiling Array cGMP U133 P2 [Affymetrix; Thermo Fisher Scientific, Inc.]). Blood total RNA (200 ng for each sample) was labeled and hybridized onto Affymetrix microarray according to the manufacturer's protocol. Gene expression profiles were accessed using Affymetrix Expression Console software (version 1.4.1; Affymetrix; Thermo Fisher Scientific, Inc.). The raw gene expression data were normalized using the MAS5 method to make it possible to compare the profiling variations among microarrays.
The data mining method utilized for this study mostly follows the strategy described in our previous report [23]. In brief, to identify gene biomarkers for distinguishing breast lesions (high-risk benign and cancer) from normal controls, the probe sets of interest were selected from the 54,675 probe sets on the Affymetrix Gene Profiling cGMP U133 P2 microarray, by filtration according to the following series criteria: the probe sets could be detected reliably ("present" call) in all the samples; the sets were present within the MAQC list as reported by MAQC Consortium; and the stably expressed probe sets, also deemed as internal reference genes, were removed. The microarray data was transformed by a logarithmic intensity to satisfy Gaussian distribution requirements. All sample data were randomly divided into a training set and a test set in a proportion of 7:3.
To accelerate the screening of breast lesion-specific gene expression signatures, an ensemble learning strategy called AdaBoost was executed. Instead of making restrictive assumptions regarding the training set as in traditional data mining methods, this boosting method first creates a set of weak classifiers by assigning them appropriate extra weights and then combines these weak classifiers into a strong classifier. AdaBoost has important and significant advantages in both accuracy and training time as compared with other data mining methods [24]. The transcriptomic features of the breast lesions were identified and used to construct the predictive model by AdaBoost. To classify the breast lesion group and the normal control group, the area under the receiver operating characteristic curve (AUC) sensitivity, specificity and accuracy were estimated in both the training and the test groups.

Bioinformatics analysis
The GO and KEGG annotations of the selected transcriptomic genes were queried from the COXPRESdb v7 database [25]. The protein-protein interactions between each transcriptomic feature and its first neighbouring protein counterpart with number less than 20 were downloaded from the STRING database with a total confidence greater than or equal to 0.7. Geneannotation enrichment analysis using the cluster Profiler R package was performed on signature genes and their correlative proteins. Gene Ontology (GO) terms were identified with a strict cutoff of adjusted p < 0.05 corrected with the Benjamini-Hochberg (BH) method and a false discovery rate (FDR) of less than 0.05. Reactome pathways were also identified, with a strict cutoff of p < 0.05 corrected with the BH method and a false discovery rate (FDR) of less than 0.05. The protein-protein interaction network and gene network with the final biomarkers was carried out with Cytoscape software.

Results
For this study a total of 151 blood samples was collected, including 44 controls and 107 breast lesions (50 high-risk breast lesions and 57 breast cancer lesions). Patients with breast cancer were older than the controls and older than those with high-risk lesions. Most subjects in the control group were aged less than 60 years, whereas about half (49/107) of the patients in the breast lesion cohort were older than age 60 ( Table 1). The BI-RADS Grades of patients in the breast lesion group are also summarized: for high-risk lesions, the number of lesions Grade 3 and 4 was similar; for breast cancer lesions, most of the patients were Grade 5 ( Table 1).
The histopathology of the breast lesions is shown in Table 2. In the category of high-risk lesions, the main two types were hyperplasia-related disease and fibroadenoma. In the category of breast cancer, invasive breast cancer accounted for about 81% (46/57) of all histological types. Most of the samples were histological Grade II (26/40), 17 were unknown.

Transcriptome profiling of peripheral blood samples from normal controls and breast lesions
Transcriptome profiling of peripheral blood samples taken from women in the two cohorts (normal controls 44, breast lesions 107), were generated using Affymetrix GeneChip U133Plus2.0. The profiles were then analyzed comparing breast lesions and normal control samples. A final ten transcriptomic gene biomarkers were identified (YWHAQ, BCLAF1, WSB1, PBX2, DDIT4, LUC7L3, FKBP1A, APP, HERC2P2, FAM126B) and were able to distinguish blood samples from patients with breast lesions from normal control samples. The corresponding gene symbols and fold changes of the final ten probe sets are listed in Table 3.

Model selection and performance evaluation
Based on the ten candidate biomarkers we identified, a predictive model was constructed for discriminating breast lesions from normal controls using AdaBoost.
Fig 1 demonstrates using hierarchical cluster diagrams the performance of each single gene and the ten-gene panel for distinguishing breast lesions from controls for the entire 151 samples. The ten-gene panel exhibited a better performance than any of the single genes alone in clustering breast lesion samples from normal control samples.
To construct the predictive model, we divided the total data into a training set and a test set in proportions of 7:3. The predictive model built on the training set that contained a total of 105 samples included 80 breast lesions and 25 normal controls. The performance of the predictive model was then evaluated by the completely independent samples in the test set, which contained a total of 46 samples, including 27 breast lesions and 19 normal controls. The performances of the training set and the test set are shown in Table 4. In terms of specificity and accuracy both training set and test set performed well; the test set sensitivity was 100%, and specificity and accuracy were 84.2% and 93.5%, respectively. Three of the 19 normal control samples in the test set were predicted as positive results; the reason for these false-positive results requires further study in a larger cohort. The ten-gene biomarker panel also exhibited a higher ROC AUC as compared with any single biomarker, in both the training set and the test set, as shown in

Protein networks and functional enrichment analysis
The proteins interacting with the ten candidate biomarkers used for the model construction were downloaded from the STRING database, and a total of 147 proteins were identified with a confidence greater or equal to 0.7. The detailed interaction of these proteins is shown in Fig  4. Functional enrichment analysis was conducted and pathways were identified with a strict

Discussion
In this study we report a method for differentiating breast lesions-including high-risk benign breast lesions and malignant breast lesions-from normal controls using blood transcriptomic gene expression analysis. We collected blood samples from healthy control women with no breast disease and from breast lesion patients, and focused on identifying blood transcriptomic features that can distinguish the two groups. We identified ten genes that can detect breast lesions with an accuracy higher than 90%. These preliminary results are encouraging, but further research is needed for validation. As breast cancer is the leading cause of cancer death in women, early detection has played a critical role in the management of this disease, especially for those many women whose breast cancer has no symptoms [26]. High-risk breast lesions represent a group of lesions, which clinically, morphologically, and biologically heterogeneous carry an increased risk of breast cancer, albeit to various degrees [27]. The threat of high-risk though benign breast lesions should not be underestimated. High-risk breast lesions convey a high relative risk for a later breast cancer with a cumulative incidence of 29% within 25 years [28][29][30]. Since high risk lesions are frequently also asymptomatic, we should explore new strategies for the detection of all breast lesions, including both breast cancer and high risk lesions not yet malignant.
In current clinical practice the most common tool used for the early detection of breast lesions is mammographic screening with complementary ultrasound. Definitive diagnosis requires biopsy. Since mammography carries high false positive rates and biopsy is traumatically invasive, the development of a novel, sensitive, non-invasive approach for early detection of breast lesions is essential to complement existing methods of detection.
To develop such an approach, we have utilized methods for cancer detection described in our blood transcriptome study 'and our previous reports [17,31,32], and identified a ten-gene panel (Table 3) from peripheral blood gene expression profiles. The predictive model we developed based in the ten-gene panel performed well both in the training set and test set (Figs 1 and  2). In the independent test set, the ten-gene panel differentiated breast lesions from normal controls with sensitivity of 100%, specificity of 84.2%, accuracy of 93.5% (Table 4). We are planning to follow these patients over the next few years to confirm whether those 3 false positive samples are true negative samples. Since it is essential to predict breast lesions at early stages for prevention and optimal treatment, we are interested to know whether the biomarkers identified in the present retrospective study are effective in predicting high-risk lesions or breast cancer. We also expect to further evaluate the blood based biomarkers in a future prospective study.
Among the ten candidate biomarkers we identified (YWHAQ, BCLAF1,WSB1, PBX2, DDIT4, LUC7L3, FKBP1A, APP,HERC2P2,FAM126B), five genes (DDIT4, APP, FKBP1A, PBX2, YWHAQ) were upregulated in breast lesion patients as compared with normal controls, and the other five genes were downregulated (FAM126B, BCLAF1, WSB1, LUC7L33, HERC2P2.) There were a total of 147 proteins interacting with the ten transcriptomic genes (Fig 4), and functional enrichment analysis of these proteins showed they were mainly associated with apoptosis, TGF-beta signaling, adaptive immune system regulation, gene transcription and post-transcriptional protein modification (Fig 5). The gene involved in apoptosis was YWHAQ and the gene involved in TGF-beta signaling was FKBP1A. YWHAQ also joined the process of gene transcription with DDIT4. In adaptive immune system regulation, FKBP1A

PLOS ONE
Blood transcriptome identifies high-risk benign and malignant breast lesions participates in the calcineurin activation of NFAT and WSB1 and plays a role in antigen processing involving ubiquitination and proteasome degradation. WSB1 is also involved in the post-transcriptional protein modification process, neddylation.

Blood transcriptome identifies high-risk benign and malignant breast lesions
The most over-expressed biomarker in the breast lesion group was DDIT4 (for DNA-damage-inducible transcript 4), also known as REDD1 or RTP801. The major function of the protein encoded by DDIT4 is to inhibit mTORC1, which is induced by various stress stimulus in the hypoxia inducible factor (HIF) family [33,34]. Pinto et al reported that high levels of DDIT4 were significantly associated with a worse prognosis (recurrence-free survival, time to progression and overall survival) in several cancer types, including breast cancer [35]. Their previous work indicated that high DDIT4 expression was also an independent factor for a shorter disease-free survival in chemotherapy-resistant triple negative breast tumors [36]. In another report, the dysregulation of basal DDIT4 gene expression in several cancer types (e.g. lung, breast, prostate) can be altered by promyelocytic leukemia (PML) and lead to mTOR activation and cancer progression [37]. DDIT4 also acts as a pro-death transcript in the calcitriol inducing endoplasmic reticulum -stress-like response in breast cancer [38]. Consistent with these reports, in our study DDIT4 was also upregulated in breast lesions, therefore it might serve as a novel prognostic biomarker and is a potential candidate for the development of targeted therapy for breast cancer. Another upregulated gene, YWHAQ encodes the 14-3-3 proteins, which belong to a group of highly conserved proteins that are essential components of key signaling pathways involved in apoptosis and cell proliferation. These proteins interact with proteins such as Raf, BAD, protein kinase C (PKC), and phosphatidylinositol 3-kinase [39]. The products of YWHAQ (14-3-3ε) regulate TP53 through protein-protein interactions and post-translational modifications [40], and the germline variation in the TP53 network genes PRKAG2, PPP2R2B, CCNG1, PIAS1 and YWHAQ, might affect prognosis and treatment outcome in breast cancer patients [41]. TP53 is closely associated with breast cancer; women who have germline TP53 mutations have a very high risk of breast cancer of up to 85% by age 60 [42]. Combining these reports with our results suggests the TP53 network gene YWHAQ may act as a predictor and new therapy target for breast cancer.
In the present study, FKBP1A participated in both the TGF-beta signaling and calcineurin activation of NFAT. FKBP1A, also named FKBP12, is a member of the FK-506-binding protein (FKBP) family, and its expression in cells is ubiquitous [43,44]. FKBP1A mediates the immunosuppressive and antitumor effects of rapamycin [45], widely used in the treatment of breast cancer [46,47]. One study on Eph receptors and invasive breast carcinoma suggested that the level of FKBP1A was significantly affected by EphB6, which was a target mRNA of miR-100, the changes in miRNAs and the target mRNA may have a role in PI3K/Akt/mTOR pathways [48]. FKBP1A has also been shown to inhibit TGF-beta type 1 receptor [49] and it was found overexpressed in childhood astrocytomas, which presented as the EGFR/FKBP12/HIF-2alpha pathway [50]. While an aberration of TGF-beta type 1 receptor is associated with a significantly increased risk of breast cancer [51], FKBP1A may also be associated with an elevated risk of breast cancer, as our study indicated.
Among the downregulated genes, WSB1 is associated with antigen processing, specifically: ubiquitination and proteasome degradation and the post-transcriptional protein modification process, neddylation. WSB-1 (WD-40 repeat-containing SOCS Box protein), is the substrate recognition element of an Elongin Cullin SOCS (ECS box) E3 ubiquitin ligase complex [52] and it was identified as a transcriptional target of HIF [53]. In the only study on the role of WSB1 in breast cancer, Poujade et al found that WSB-1 plays an important role in breast cancer metastasis. By knocking down the WSB-1 gene in breast cancer cell lines, these investigators found that the downregulation of WSB-1 gene expression levels could significantly decrease the metastatic potential of breast cancer [54].
Our results were inconsistent with the above report, however, since WSB1 was decreased in our breast lesion group. The role of WSB1 in other types of cancer is also controversial; this gene was involved in pancreatic cancer progression [55] and metastatic potential of osteosarcoma [53], but its high expression was associated with good prognosis and favorable outcome of neuroblastoma [56]. So the definite function of WSB1 in breast cancer remains unclear.
The gene mutations related to carcinogenesis, such as p53, BRCA1 / BRCA2, have been widely observed in breast tumor cells; however they have not been identified in our study with significant expression variation between breast lesion and healthy control group in peripheral blood. There are several possible reasons for this. Although tumor cells could be released into a patient's peripheral blood, the proportion of such cells as compared with white blood cells would be very low, even for patients with advanced disease. White blood cells predominate in the cell spectrum of peripheral blood, and therefore blood gene expression signatures would largely reflect these abundant blood white cells rather than the rare circulating tumor cells. In addition, as blood white cells and tumor cells play different biological roles in the process of carcinogenesis their gene expression profiles also differ. Gene expression variations in blood white cells, for example, more likely reflect interactions between the immune system and the tumor rather than reflecting intrinsic changes within the tumor cells themselves. These differences might be an important reason why the driver genes that have been observed in tumor cells did not show abnormal signals in the gene expression profile of peripheral blood in this study. Further study is required in order to identify the signaling pathways of blood cells and their interaction with cancer cells to better understand the roles of blood cells in carcinogenesis.
Our study has several limitations. First, the sample size was relatively small and different genes or more genes that have better discriminatory power may be validated among a larger independent cohort of patients. For example, our samples show some age variation among the healthy controls, the women with high risk lesions and the breast cancer patients. Age has been regarded as an important risk factor for cancer, as the incidence of most cancers increases with age. In this study, which is restricted by a limited sample size, we tried to optimize the algorithm to eliminate the interference of age factors as much as possible. However, it is hard to confirm that the biomarkers derived are completely unrelated to age.
We intend to confirm the effectiveness of our data mining method in further studies, using a larger sample size with age-matched patients. Second, the nature of the mechanisms driving the different transcriptomic biomarkers in peripheral blood is not yet clear, and the function of some biomarkers requires further study. We are currently exploring the expression differences of the ten candidate biomarkers between high-risk breast lesions and breast cancer, which study may be helpful for the differential diagnosis of high risk lesions and breast cancer.
Finally, RNA sequencing (RNAseq) has been proven an efficient tool for transcriptome analysis, especially for exploring expression signatures of unknown transcript fragments and revealing the signaling pathways beneath, An interesting subject for future study would be to compare the variations in gene expression signatures between RNAseq and the microarray method.
Using peripheral blood gene expression profiles we identified ten transcriptomic biomarkers that could distinguish women with high-risk breast lesions and breast cancer from normal controls. Our model, based in the ten transcriptomic biomarkers identified, has shown good discriminatory power between breast lesion and control subjects. Our functional enrichment analysis suggested that our candidate biomarkers were mainly involved in apoptosis, TGF-beta signaling, adaptive immune system regulation, gene transcription and post-transcriptional protein modification. This study has therefore established a promising methodology for the non-invasive detection of breast lesions, and we have also shed light on the pathogenic mechanisms of breast cancer and provided clues to new targets for breast cancer therapy, especially therapies related to immune treatment.