Abstract
Lung cancer is the second most diagnosed cancer and the leading cause of cancer-related death for men and women in the United States. Early detection is essential as patient survival is not optimal and the recurrence rate is high. Copy number (CN) changes in cancer populations have been broadly investigated to identify CN gains and deletions associated with the cancer. In this research, the similarities between cancer and paired peripheral blood samples are identified using the maximal information coefficient (MIC), and the spatial locations with substantially high MIC scores in each chromosome are used for clustering analysis. The results showed that a sizable reduction of the feature set can be obtained using only a subset of locations with high MIC values. The clustering performance was evaluated using both true rate and normalized mutual information (NMI). Clustering using the reduced feature set outperformed clustering using the entire feature set in several chromosomes that are highly associated with lung cancer with several identified oncogenes.
Citation: N. Kachouie N, Deebani W, Shutaywi M, Christiani DC (2024) Lung cancer clustering by identification of similarities and discrepancies of DNA copy numbers using maximal information coefficient. PLoS ONE 19(5): e0301131. https://doi.org/10.1371/journal.pone.0301131
Editor: Chien-Feng Li, National Institute of Cancer Research, TAIWAN
Received: August 9, 2023; Accepted: March 11, 2024; Published: May 13, 2024
Copyright: © 2024 N. Kachouie et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The dataset is available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0DFPOE.
Funding: This research was partially funded by NIH Grant # U01CA209414 to David C. Christiani. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Lung cancer is the leading cause of cancer death in the United States [1–4]. It is the second most diagnosed cancer in both men and women in the US [1]. Over half of patients diagnosed with lung cancer die within one year of diagnosis and the 5-year survival is around 17.8% [3, 4]. The two main sub-types of lung cancer are non-small-cell lung carcinoma (NSCLC) and small-cell lung carcinoma, where NSCLC accounts for about 85% of all lung cancers [2–4]. Depending on the NSCLC stage, different treatments including surgery, radiation, chemotherapy, or targeted therapy might be considered. With the advancement of somatic genetics and biomarker testing, specific mutations have been identified to better target treatment for individual patients [5].
An estimated 238,340 adults including 117,550 men and 120,790 women in the US will be diagnosed with lung cancer in 2023 [2, 3]. Cigarette smoking is a major risk factor for developing NSCLC. After increasing for decades, lung cancer rates are decreasing nationally as fewer people smoke cigarettes [1]. Overall incidence rates among men and women have dropped by around 2% each year since the mid-2000s. The number of new lung cancer cases diagnosed in men each year has been dropping annually since the mid-1980s, while the number of new cases has been dropping in women since the mid-2000s [2]. Incidence rates are dropping faster in men than in women.
Moreover, lung cancer makes up around 25% of cancer deaths. It is estimated that 127,070 deaths from lung cancer including 67,160 men and 59,910 women will occur in the US in 2023. However, death rates for the disease have declined by 54% since 1990 in men and 30% in women since 2002 [2]. The death rates for men and women with lung cancer dropped by 5% and 4% respectively each year from 2014 to 2018. Research indicates that these declines are due to medical advances in diagnosis and treatment, fewer people smoking, and more people quitting smoking. The average age of diagnosis is 70 and people aged 65 and older are more likely to develop the disease. In comparison with white males, black males are about 15% more likely to get lung cancer. In contrast, black females are 14% less likely to get lung cancer in comparison with white females [2, 3].
The rapid progress in machine learning technologies, along with advancements in information technology infrastructure such as the graphics processing unit (GPU) and the development of public databases, has made it possible to make use of large-scale data and has motivated a great deal of interest in using machine learning and artificial intelligence (AI) technologies in cancer diagnosis and clinical oncology [6–8]. Nearly 350 AI-equipped medical devices have been approved in the US by the FDA [9]. Among them, imaging and diagnostic technologies lead the integration of algorithms to drive clinical decision-making in healthcare. Several AI-equipped medical devices have already been used for clinical applications such as diagnostic imaging. Moreover, machine learning techniques have been developed for Precision Medicine to customize and optimize medical care for each individual, and Precision Oncology is taking advantage of advancements in machine learning for choosing treatment options [10–12]. Machine learning algorithms have also been used in Genomic Medicine to use the genomic information of individuals as part of their clinical care [12, 13].
Some blood tests are performed to detect signs of cancer such as CBC (Complete Blood Count), Electrophoresis blood test, and Tumor marker test. Moreover, some tests are performed to detect proteins or other substances made by the cancer in blood and are performed after cancer diagnosis such as Cancer antigen tests, Circulating tumor cell tests, and Genetic tests. These blood tests have been somewhat successful in diagnosing some types of cancer like prostate cancer, colon cancer, and ovarian cancer by detecting various proteins or chemicals made by cancer cells. However, these tests do not always help with cancer diagnosis as some healthy cells also make these proteins and chemicals, and some non-cancer conditions can also cause high levels of tumor markers.
The premise of future advancements in cancer research is to use blood samples to test DNA changes to detect signs of cancer in healthy people with no symptoms. The research conducted here is in line with an active area of research on using a blood sample to test DNA changes for cancer diagnosis. In our previous works, we investigated common CN changes among the cancer population [14]. We then studied correlations between cancer and matched blood samples [15] by computing the maximal information coefficient (MIC) at each spatial location (locus) of each chromosome to quantify and identify similarities between tumor and blood samples. In contrast with Pearson’s correlation coefficient, which quantifies the strength of linear correlations, MIC can be employed to quantify either linear or non-linear correlations. MIC assumes values between zero and one, where values above 0.5 demonstrate substantial correlation and values close to one reveal strong correlations. We showed in [15] that a few chromosomes with a large set of CN changes (loci) can potentially be used to identify early signs of NSCLC. In contrast, another group of chromosomes with several CN changes (loci) are potential candidates to develop biomarkers for separating cancer from the matched blood sample. Also, we divided patients into healthy and cancer groups using 240,000 paired CN’s collected for each individual who participated in the study [16].
In this work, we extend our previous studies in [15] and [16] to separate tumor samples from non-involved tissue (blood samples) using a reduced feature-set extracted from a large set of significant identified loci. The reduced set can potentially be used for separating cancer patients from healthy individuals. Our research using paired tumor-blood samples taken from lung cancer patients is an essential step towards future advancements in cancer research that use blood samples to test DNA changes and detect signs of cancer in people with no symptoms. We demonstrated potential differences and similarities in DNA copy numbers between tumor and blood samples that can potentially be used in the future to develop tests for lung cancer diagnosis using a patient’s blood sample.
2. Data description
A set of 63 early stage (stage 1 and 2) non-small cell lung cancer (NSCLC) patients were prospectively enrolled in the Boston Lung Cancer Study at the Massachusetts General Hospital (MGH), Boston, MA [17, 18]. A snap-frozen tumor sample was gathered for each patient during biopsy or surgery in addition to a blood sample. Table 1 shows a snippet of the CN scores ranging from one to three. Deleted CNs have values below two while values above two indicate gained CNs.
3. Methods
The Mass General Brigham IRB committee approved this study and participants provided written informed consent. Lung cancer clustering was performed in the previous work [16] using chromosome-wide spatial copy number variations among paired cancer and non-involved (blood) samples. Kernel K-means, a nonlinear clustering method, was applied to separate cancer from normal samples in each chromosome using DNA CN’s. A snippet of CN values for chromosome one obtained using paired cancer-blood samples for 63 patients is shown in Table 2. It should be noted that copy numbers are obtained for all spatial locations (loci) on a chromosome, and in turn several thousand CN’s are collected for each chromosome. Therefore, the corresponding feature space consisting of all copy numbers for each chromosome is high-dimensional. For example, there are 19,873 CN’s collected for chromosome one, which form a 19,873-dimensional copy number space [16].
It is essential to reduce the dimensions of the chromosomal feature space to improve the discriminant power. Hence, for feature reduction, the significant features must be identified and extracted from the feature space. To this end, we quantify the similarities and discrepancies between cancer and paired peripheral blood samples using a correlation measure, the so-called maximal information coefficient (MIC) [19]. In the previous work [15], correlations between cancer and matched blood samples were studied by quantifying MIC at each spatial location of each chromosome. In this work, an upper and a lower threshold will be set for identifying significant chromosomal features with high discriminant power for distinguishing cancer from healthy samples. The reduced feature set will then be used for grouping individuals into the two groups of cancer and healthy.
MIC takes a value between zero and one and can identify linear and nonlinear associations [19]. MIC values above 0.5 demonstrate substantial correlation (similarities) while MIC values below 0.2 indicate low or no correlation (discrepancies) between the samples. MIC values that exceed the upper threshold represent high correlation between cancer and non-involved tissue and can be considered as early indicators of NSCLC. MIC values that fall below the lower threshold represent low correlation between cancer and non-involved samples and are relevant for developing biomarkers to distinguish cancer from matched blood samples [15].
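MIC for the jth location of a typical chromosome is calculated using [20]:
$$\mathrm{MIC}_j = \max_{n_{\mathrm{blood}_j} \times n_{\mathrm{cancer}_j} < B(n)} \frac{I(\mathrm{blood}_j, \mathrm{cancer}_j)}{\log_2 \min\left(n_{\mathrm{blood}_j}, n_{\mathrm{cancer}_j}\right)}, \qquad j = 1, \dots, m \tag{1}$$
where m is the total number of locations and
$$I(\mathrm{blood}_j, \mathrm{cancer}_j) = \sum \sum p(\mathrm{blood}_j, \mathrm{cancer}_j) \log_2 \frac{p(\mathrm{blood}_j, \mathrm{cancer}_j)}{p(\mathrm{blood}_j)\, p(\mathrm{cancer}_j)} \tag{2}$$
where the sums run over the cells of the partition, $p(\mathrm{blood}_j, \mathrm{cancer}_j)$ is the joint probability distribution of the jth blood and jth cancer samples, $p(\mathrm{blood}_j)$ and $p(\mathrm{cancer}_j)$ are the marginal distributions of the jth blood sample and jth cancer sample respectively, $n_{\mathrm{blood}_j}$ and $n_{\mathrm{cancer}_j}$ are the number of bins in the partitions associated with the blood and cancer samples respectively, and $n_{\mathrm{blood}_j} \times n_{\mathrm{cancer}_j} < B(n)$, where $B(n) = n^{0.6}$ and n is the sample size (n = 63). After calculating MIC values at each locus of every chromosome, we identify spatial locations with considerably high correlation scores between cancer and blood samples. We locate MIC values greater than a threshold γ.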
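The grid search behind the MIC score can be illustrated with a simplified sketch. It assumes equal-frequency (rank-based) bins for every grid size, whereas the full MIC estimator also optimizes where the bin boundaries are placed, so the values it returns are only a lower-bound style approximation. The function names `grid_mutual_information` and `approx_mic` are hypothetical and are not taken from the study.

```python
import numpy as np

def grid_mutual_information(x, y, nx, ny):
    """Mutual information (bits) of x and y on an nx-by-ny equal-frequency grid."""
    n = len(x)
    # Rank-based bin index: about n/nx points per bin (ties broken by position).
    xb = (np.argsort(np.argsort(x, kind="stable"), kind="stable") * nx) // n
    yb = (np.argsort(np.argsort(y, kind="stable"), kind="stable") * ny) // n
    joint = np.zeros((nx, ny))
    np.add.at(joint, (xb, yb), 1.0)       # count points per grid cell
    joint /= n                            # joint probabilities
    px = joint.sum(axis=1, keepdims=True) # marginal of x
    py = joint.sum(axis=0, keepdims=True) # marginal of y
    nz = joint > 0                        # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

def approx_mic(x, y, alpha=0.6):
    """Maximum normalized grid MI over all grids with nx * ny < n**alpha."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** alpha              # B(n) = n^0.6 in the text
    best = 0.0
    for nx in range(2, int(budget) + 1):
        for ny in range(2, int(budget) + 1):
            if nx * ny < budget:
                mi = grid_mutual_information(x, y, nx, ny)
                best = max(best, mi / np.log2(min(nx, ny)))
    return best
```

With n = 63 the grid budget is about 12 cells, so only small grids such as 2×2 up to 3×4 are searched. A monotone relationship between the two inputs scores close to one, while unrelated inputs score much lower.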
Let v be the number of selected locations with correlation values greater than γ in a given chromosome, where v < m. Then, we use the kernel K-means clustering method, which is a nonlinear extension of K-means clustering and can separate both linearly and non-linearly separable clusters [21, 22]. Kernel K-means clustering projects the data into a higher-dimensional feature space using a nonlinear function, with the aim that the projected points become linearly separable in the transformed space. Let $\{x_1, x_2, \dots, x_n\}$ be a set of n data points (n = 126) of dimension v, K be the number of clusters, $\pi_k$ be cluster k, $\{\pi_k\}_{k=1}^{K}$ be a partitioning of the points into K groups, and φ be a non-linear function. Each element of the kernel matrix is
$$\kappa(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$
where $\varphi(x_i)$ and $\varphi(x_j)$ denote the data points $x_i$ and $x_j$ in the transformed space respectively. The Euclidean distance from each data point to a cluster center $\mu_k$ is computed in the transformed space for all K clusters by:
$$\|\varphi(x_i) - \mu_k\|^2 = \kappa(x_i, x_i) - \frac{2}{|\pi_k|} \sum_{x_j \in \pi_k} \kappa(x_i, x_j) + \frac{1}{|\pi_k|^2} \sum_{x_j \in \pi_k} \sum_{x_l \in \pi_k} \kappa(x_j, x_l) \tag{3}$$
where $|\pi_k|$ is the number of elements in cluster $\pi_k$. Each data point is assigned to the closest cluster based on the calculated distances between the point and the cluster centers. A new cluster center $\mu_k$ is obtained for cluster k by averaging, in the transformed space, all elements that were assigned to cluster $\pi_k$ in the previous iteration:
$$\mu_k = \frac{1}{|\pi_k|} \sum_{x_i \in \pi_k} \varphi(x_i) \tag{4}$$
This process is repeated until $\|\mu_{k,t} - \mu_{k,t-1}\| < \varepsilon$ for all K clusters, where $\mu_{k,t}$ is the center of cluster k in the current iteration t and $\mu_{k,t-1}$ is the center of cluster k in the previous iteration t − 1.
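The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the RBF kernel and its `gamma` value are assumptions (the text does not name the kernel), the function names are hypothetical, and the loop stops when the assignments no longer change rather than testing the change in the cluster centers, which are never formed explicitly in a kernel method.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Kernel matrix exp(-gamma * ||x_i - x_j||^2) (assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, n_clusters=2, init_labels=None, max_iter=100, seed=0):
    """Cluster points given only their kernel matrix K (n x n)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    if init_labels is None:
        labels = rng.integers(n_clusters, size=n)   # random initial partition
    else:
        labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)
        for k in range(n_clusters):
            members = labels == k
            size = members.sum()
            if size == 0:
                continue                            # empty cluster stays unreachable
            # Squared feature-space distance to the center of cluster k,
            # written with kernel values only.
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)            # assign to closest center
        if np.array_equal(new_labels, labels):
            break                                   # assignments are stable
        labels = new_labels
    return labels
```

For the data in this study, `X` would be the 126 × v matrix of selected copy numbers for one chromosome and `n_clusters` would be 2. Because the result depends on the initial partition, several restarts with different seeds are usually advisable.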
4. Results
The lung cancer data contain copy numbers for paired cancer-blood (non-involved) samples for 22 chromosomes (excluding the sex chromosomes) for 63 patients. The correlation measure, i.e., MIC, was obtained at each location for each chromosome. To improve the discriminant power of chromosomes, MIC values above the upper threshold were located. A high MIC score at an identified location indicates high similarity between blood (non-involved) and cancer samples. This reduces the dimensionality of the copy number feature space, and hence the unlabeled copy numbers in each chromosome form a 126×T feature matrix, where 126 is the total number of samples (63 paired samples), and T is the number of identified spatial locations with MIC score above the threshold in the given chromosome. Notice that the MIC score at each locus (chromosome location) shows the similarity between CN’s in the two groups of cancer and control (63 paired samples).
To cluster 126 unlabeled samples into two groups, kernel K-means is applied to the reduced feature matrix of each chromosome separately. Ideally, the 63 cancer samples would be grouped together in one cluster and the 63 non-involved samples would be grouped together in the other. In practice, however, each cluster contains a combination of cancer and non-involved samples. The clustering performance is evaluated by two methods: 1) quantifying the true rate of recognized cancer samples, and 2) computing normalized mutual information (NMI). NMI provides a value between zero and one, where higher values of NMI indicate better clustering results. A high NMI value means that a substantial proportion of data points are grouped in the right clusters.
Clustering results for each chromosome are summarized in Table 3. As can be seen in the table, the 63 paired samples are clustered once using chromosome-wide CN’s (entire feature-set), and then using only selected CN’s (reduced feature-set) at the identified spatial locations with high MIC scores. The table shows the results for three different clustering set-ups using: (a) all spatial locations of a chromosome, (b) selected features with MIC values greater than 0.65, and (c) selected features with MIC values greater than 0.52. The number of selected features in each set-up, along with NMI and true rate, are shown in the table.
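Both evaluation measures can be computed directly from the true and assigned labels, as in the following sketch. Two points are assumptions rather than details given in the text: the true rate is taken to be the fraction of samples placed in the correct group after matching the two cluster ids to the two classes in the more favorable way, and NMI is normalized by the arithmetic mean of the two label entropies. The function names are hypothetical.

```python
import numpy as np

def true_rate(true_labels, cluster_labels):
    """Fraction of correctly grouped samples for two clusters (labels 0/1).
    Cluster ids are arbitrary, so the better of the two matchings is used."""
    t = np.asarray(true_labels)
    c = np.asarray(cluster_labels)
    acc = float(np.mean(t == c))
    return max(acc, 1.0 - acc)

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def normalized_mutual_information(true_labels, cluster_labels):
    """NMI in [0, 1], normalized by the mean of the two entropies (assumed)."""
    t = np.asarray(true_labels)
    c = np.asarray(cluster_labels)
    _, ti = np.unique(t, return_inverse=True)
    _, ci = np.unique(c, return_inverse=True)
    joint = np.zeros((ti.max() + 1, ci.max() + 1))
    np.add.at(joint, (ti, ci), 1.0)       # contingency table
    joint /= joint.sum()
    pt = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = float(np.sum(joint[nz] * np.log(joint[nz] / (pt @ pc)[nz])))
    denom = 0.5 * (_entropy(t) + _entropy(c))
    # If both labelings are constant they are identical up to renaming.
    return mi / denom if denom > 0 else 1.0
```

Under these definitions a clustering that swaps the two cluster ids but otherwise matches the true groups still receives a true rate of 1 and an NMI of 1, while a clustering unrelated to the true groups receives a true rate near 0.5 and an NMI near zero.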
4.1. Clustering analysis: Using all CNs of a chromosome
In this set-up, the clustering analysis is performed using the 126 × L feature matrix, where L is the chromosome length (number of spatial locations in the chromosome) shown in the second column of Table 3. For instance, to separate cancer from normal samples in chromosome one, the entire feature-set holding 19,873 features (chromosome-wide copy numbers) of paired cancer-blood subjects is used for clustering. Clustering yields a true rate of 56% and an NMI value of 0.012. Considering the very high-dimensional feature space of each chromosome, a subset of features in each chromosome is used to discriminate cancer from blood samples, as described in the following subsection.
4.2. Clustering analysis: Using selected CNs with MIC values above a set threshold
To reduce the dimensionality of the feature set, clustering of cancer and blood samples is performed using a 126 × T feature matrix in each chromosome, where T is the number of identified spatial locations in the chromosome with MIC score above a set threshold. First, a threshold of 0.65 is set to identify CN locations with strong correlations between cancer and paired blood subjects. The kernel K-means clustering analysis is then performed using the reduced feature set with corresponding MIC values higher than 0.65. In this way, a very small number of highly selective CN locations is identified, demonstrating strong correlations between cancer and blood subjects at the given locations. For example, only 3 locations out of 4854 that have MIC score above 0.65 in chromosome 17 are used for clustering, which provides a 99.9% dimensionality reduction. Notice that in the previous work [14], amplified copy number regions that were associated with new oncogenes were found in chromosome 17. Therefore, identifying locations with high MIC scores in a chromosome could potentially improve the diagnostic power of CN’s.
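The feature selection step amounts to keeping the columns of the copy number matrix whose MIC score exceeds the threshold. The sketch below illustrates this with hypothetical function names; the layout of one row per sample and one column per location follows the 126 × T matrix described above.

```python
import numpy as np

def reduce_features(cn_matrix, mic_scores, threshold):
    """Keep only the locations (columns) with MIC strictly above the threshold.
    cn_matrix: samples x locations; mic_scores: one score per location."""
    keep = np.flatnonzero(np.asarray(mic_scores) > threshold)
    return np.asarray(cn_matrix)[:, keep], keep

def reduction_percent(n_selected, n_total):
    """Percentage of the original feature set that is discarded."""
    return 100.0 * (1.0 - n_selected / n_total)
```

For the chromosome 17 example, `reduction_percent(3, 4854)` evaluates to roughly 99.94, which matches the 99.9% reduction quoted in the text.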
4.3. Performance-difference and adjusted performance-difference
Performance difference, PD, is computed using:
$$PD = TR_{RFS} - TR_{EFS} \tag{5}$$
where $TR_{EFS}$ is the true rate (TR) obtained using the entire feature-set (EFS), and $TR_{RFS}$ is the true rate obtained using the reduced feature-set (RFS). Adjusted performance difference, APD, is calculated by:
$$APD = \frac{PD}{TR_{EFS}} = \frac{TR_{RFS} - TR_{EFS}}{TR_{EFS}} \tag{6}$$
In this way, positive values of PD and APD represent improved performance using RFS in comparison with using EFS, while negative values represent declined performance using RFS.
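As a small worked illustration, the sketch below assumes that PD is the difference between the two true rates and that APD is PD expressed relative to the true rate of the entire feature-set. This reading is consistent with the PD and APD pairs reported for chromosomes 18 and 19 in this paper, but it is an assumption, and the true rates used in the example are hypothetical.

```python
def performance_difference(tr_efs, tr_rfs):
    """Return (PD, APD) for true rates given as fractions in (0, 1].
    Assumes PD = TR_RFS - TR_EFS and APD = PD / TR_EFS."""
    pd_value = tr_rfs - tr_efs
    apd_value = pd_value / tr_efs
    return pd_value, apd_value

# Hypothetical true rates: 50% with the entire set, 60% with the reduced set.
pd_value, apd_value = performance_difference(0.50, 0.60)
print(f"PD = {pd_value:+.1%}, APD = {apd_value:+.1%}")
```

Because APD divides by the baseline true rate, the same absolute gain counts for more when the entire feature-set performs close to chance level.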
4.4. Comparing clustering analysis performance: Reduced feature-set vs. entire feature-set
Setting the high threshold of 0.65 leads to the selection of very few features. For example, there were only 4 (out of 19873), 16 (out of 22215), 5 (out of 18381), and 2 (out of 17147) locations with MIC scores above 0.65 in chromosomes 1, 2, 3, and 6, respectively. This provides a sizable reduction of the feature space with clustering results comparable to those obtained using the entire feature-set (Table 3). However, in several chromosomes, including chromosomes 5, 8, 9, 10, 14, 15, 16, 18, 19, 20, 21, and 22, no location with MIC score above 0.65 was identified. Hence, a new threshold was set to ensure that at least two locations with MIC value above the threshold are identified on each chromosome. Using an ad-hoc learning approach, the threshold of 0.52 was determined to satisfy this constraint. As can be seen in Table 3, all 22 chromosomes have two or more identified locations with MIC score above 0.52. The number of identified locations with MIC above the threshold varies from 2 (in chromosome 19) to 392 (in chromosome 2). For instance, 56, 21, 93, and 52 locations with MIC score above 0.52 were identified in chromosomes 5, 8, 9, and 10, while no location with MIC score above 0.65 was identified in these chromosomes.
As shown in Table 3, clustering using the reduced feature-set of CN locations with MIC score above 0.52 provides comparable performance for 9 out of 22 chromosomes (about 41% of chromosomes), summarized in Table 4, and substantially better performance (10% or higher) for 3 out of 22 chromosomes (about 14% of chromosomes), summarized in Table 5. Using the MIC score to identify CN locations with high correlation between cancer and control produces a reduced feature-set of considerably smaller size. As depicted in Fig 1, the proportional size of the reduced feature-set ranges from 0.06% to 1.76% of that of the entire feature-set. As we can see in Figs 1 and 2, and Table 5, only 2 (out of 2693) locations are identified in chromosome 19 and used for clustering, while the adjusted performance-difference and performance-difference are +29% and +15.1% respectively. Similarly, using only 5 (out of 8149) locations in chromosome 18, RFS outperforms the EFS in clustering performance, yielding +14% APD and +8% PD. Overall, for six chromosomes (Table 6) clustering performance has improved, with APD from +1% to +29%. Clustering performance using RFS has declined for 10 chromosomes, with PD from -2% to -10% (APD from -4% to -16%). Even so, the RFS offers a significant size reduction (Figs 1 and 2, and Table 3) that can justify using selected CN locations for clustering.
Chromosome number (first column) and proportional size of reduced feature-set in comparison with the size of the entire feature-set (second column). Clustering performance using entire feature-set (third column) and reduced feature-set (fourth column). True rate (blue) and false identification (orange) based on clustering result (columns 3 and 4).
Chromosome number (first column) and proportional size of reduced feature-set in comparison with the size of the entire feature-set (second column). Clustering performance using entire feature-set (third column) and reduced feature-set (fourth column). True rate (blue) and false identification (orange) based on clustering result (columns 3 and 4).
Fig 3 displays the original groups with true labels, clustering results using EFS, and clustering results using RFS for chromosomes 12, 18, and 19. The overlap of the cancer and control clusters under the true labels (column 1) is clearly visible. In all three chromosomes, the cancer cluster is contained inside the control group, which makes the discrimination between cancer and control groups very challenging. Although clustering using EFS could not separate the clusters, clustering using RFS could separate a subset of the control group as the cancer group in these chromosomes. Notice that these are the same chromosomes with the highest improvement of clustering performance using RFS, with APD of +29% in chromosome 19, +14% in chromosome 18, and +10% in chromosome 12.
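The constraint that every chromosome keeps at least two locations can be made concrete with a short sketch. The procedure shown here is only one hypothetical way to find such a threshold and is not the ad-hoc approach used in the study; the chromosome names and MIC values in the usage example are invented for illustration.

```python
import numpy as np

def largest_feasible_threshold(mic_by_chromosome, min_locations=2):
    """Upper bound on thresholds t for which every chromosome keeps at least
    `min_locations` loci with MIC strictly above t. Any t below the returned
    value satisfies the constraint. Each chromosome must have at least
    `min_locations` scores."""
    bounds = []
    for scores in mic_by_chromosome.values():
        ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending
        bounds.append(ordered[min_locations - 1])
    return float(min(bounds))

def count_above(mic_by_chromosome, threshold):
    """Number of loci with MIC strictly above the threshold, per chromosome."""
    return {name: int(np.sum(np.asarray(scores) > threshold))
            for name, scores in mic_by_chromosome.items()}

# Invented toy scores for two chromosomes.
toy = {"chrA": [0.90, 0.70, 0.30], "chrB": [0.60, 0.55, 0.10]}
bound = largest_feasible_threshold(toy)
counts = count_above(toy, 0.52)
```

In the toy data the bound is set by the second-highest score of the chromosome with the weakest correlations, which mirrors how chromosome 19, with only two locations above 0.52, limits the threshold in Table 3.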
Control (green) vs. cancer (red). True groups (column 1), clustering results using entire feature-set (column 2), and clustering results using reduced feature-set (column 3) for chromosome 12 (row 1), chromosome 18 (row 2), and chromosome 19 (row 3).
Figs 4–6 show the true clustering labels for cancer and blood groups (left) vs. clustering results using the reduced feature-set (right) for chromosomes 1 to 9, with the true rate ranging from 53% in chromosome 1 to 65% in chromosome 3 and the APD ranging from -9% in chromosome 5 to +3% in chromosome 2. The clustering performance with a TR of about 50% (NMI about zero) is due to the fact that the cancer group is contained inside the control group, which makes it very challenging to separate the two groups. Nevertheless, among the chromosomes depicted in Figs 4–6, the overall performance using RFS has declined by at most 5.6%, while for several chromosomes the clustering performance using RFS is comparable with the results achieved using EFS, and for chromosomes 2 and 3 the performance has improved.
Control (green) vs. cancer (red) groups for chromosomes 1 to 3. True labels (left column); the assumed labels obtained by clustering using the reduced feature-set (right column).
Control (green) vs. cancer (red) groups for chromosomes 4 to 6. True labels (left column); the assumed labels obtained by clustering using the reduced feature-set (right column).
Control (green) vs. cancer (red) groups for chromosomes 7 to 9. True labels (left column); the assumed labels obtained by clustering using the reduced feature-set (right column).
Further, we compared the proposed Kernel K-means with K-means and Fuzzy c-means. K-means is the most widely used clustering algorithm. It is an efficient centroid-based algorithm that is suitable for clustering large datasets, but it is sensitive to outliers. In contrast with K-means, which is a hard-clustering technique, Fuzzy c-means is a soft-clustering method. In soft clustering, a data point may belong to different clusters with different likelihoods, rather than being assigned a hard cluster label. For most chromosomes, the proposed method outperformed both K-means and Fuzzy c-means. As listed in Table 7, K-means performed better for five chromosomes, and Fuzzy c-means outperformed the other methods for three chromosomes. The performance of all three clustering methods for the remaining 14 chromosomes is shown in Table 8. For the chromosomes listed in this table, Kernel K-means using the entire feature-set outperformed one or both of K-means and Fuzzy c-means. We should point out that, for two chromosomes in Table 7, i.e., chromosomes 18 and 19, the proposed Kernel K-means method using the reduced feature-set outperformed the other methods with true rates of 0.651 and 0.675 respectively.
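The difference between hard and soft assignment can be seen in the membership step of Fuzzy c-means, sketched below. This shows only the standard membership update for fixed cluster centers, not a full clustering run, and the fuzzifier value m = 2 and the function name are assumptions chosen for illustration.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Soft membership of each point in each cluster for fixed centers.
    Membership is proportional to distance**(-2 / (m - 1)); rows sum to 1."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_clusters).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # avoid division by zero at a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# A point at distance 1 from one center and 3 from the other.
u = fuzzy_memberships([[1.0, 0.0]], [[0.0, 0.0], [4.0, 0.0]])
hard_label = int(u[0].argmax())           # what a hard method would report
```

Here the point receives memberships of 0.9 and 0.1, whereas a hard-clustering method such as K-means would report only the label of the closer center.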
5. Discussion
Overall, thresholding MIC to obtain a reduced feature-set provides a sizable reduction in the number of CN locations used to distinguish cancer from control. The number of CN locations ranges from a minimum of 2,520 in chromosome 22 to a maximum of 22,215 in chromosome 2. The proportion of the reduced set depends on the number of CN locations with MIC above 0.52, a threshold that was learned using an ad-hoc approach to identify a minimum of two CN locations in a given chromosome. The size of the RFS is only a small fraction of the EFS and varies from a minimum of 0.06% for chromosome 18 to a maximum of 1.76% in chromosome 2. This means that only 5 CN locations from 8,149 in chromosome 18 and 392 locations from 22,215 in chromosome 2 are identified and used for clustering. Among all, clustering results of chromosomes 19, 3, and 18 achieved the highest true rates of 67%, 65%, and 65%. Our results are notable because previous works have shown that 3q is among the most commonly cited amplifications and that 3p and 19p are among the most common deletions [23–28]. Partial deletion of 3p has been reported in almost all analyzed NSCLCs [29, 30]. This region contains numerous genes including FHIT (3p14.2), RASSF1 (3p21.3), TUSC2 (FUS1, 3p21.3), SEMA3B (3p21.3), SEMA3F (3p21.3) and MLH1 (3p22.3): allelic imbalance of FHIT is associated with chromosomal deletions [31, 32]; RASSF1 and MLH1 are inactivated by promoter hypermethylation [33–35]; TUSC2 [36], SEMA3F and SEMA3B transcripts [37] are recurrently underrepresented in lung cancers; and the SEMA3s, which were found to be targets of TP53 [38], could potentially be activated during DNA damage. Moreover, 3p, 18p, and 19p are among the chromosome arms on which loss of heterozygosity has been most frequently reported [26, 39, 40].
The clustering performance using RFS has improved in chromosome 12, yielding a 60% true rate with an APD of +10%, corresponding to a PD of +6%. In lung cancer, KRAS at 12p12.1 is frequently mutated. Further, among the most important factors for lung tumor growth and proliferation is the ERBB family, coded by genes including ERBB3 at 12q13 [30, 41].
To further improve the clustering performance, our future research will focus on implementing an ensemble clustering method that combines the proposed method with Fuzzy c-means, since the latter, as a soft clustering technique, performed better on a few chromosomes.
6. Conclusion
The survival rate of lung cancer, the second most diagnosed cancer and the leading cause of cancer-related death in the US, is low. Early detection is critical as the patient survival rate is low and the recurrence rate is high. Copy number (CN) changes have been broadly investigated to identify CN amplifications and deletions associated with the cancer and can potentially be used for cancer diagnosis in the future. The lung cancer data used in this project contain CN pairs for cancer and blood (non-involved) samples at each location of each chromosome for 63 subjects. In this research, the similarities between cancer and paired peripheral blood samples are identified using the maximal information coefficient (MIC), and the spatial locations with substantially high MIC scores in each chromosome are used for clustering analysis. Identifying the locations with high similarities between cancer and healthy tissues in each chromosome can potentially help with early diagnosis, treatment, and prevention of cancer. The outcomes of this research can be summarized as:
- Identifying CN locations with high correlations between cancer and control with MIC above a set threshold.
- Substantial feature reduction using a subset of CN locations with high MIC score.
- Improved clustering performance in several chromosomes that are associated with oncogenes.
Separating cancer from control is a very challenging clustering task because the cancer group is contained inside the control group. In several chromosomes, reducing the dimensionality of the copy number feature space yielded clustering results comparable to those obtained using the entire feature set. Moreover, using a small set of CN locations with high MIC exhibited improved discrimination power in some chromosomes. The highest true rate was achieved in chromosomes 19 and 18, where the reduced feature set contained only 2 and 5 CN locations respectively. The results suggest that identifying a handful of CN locations in each chromosome may improve the power to discriminate cancer from healthy tissue. Therefore, our future research will focus on identifying a customized MIC threshold for each chromosome to adaptively limit the number of identified locations with high MIC scores in each chromosome.
Some blood samples are tested for cancer diagnosis by looking for signs of cancer including [42–44]:
- Complete blood count (CBC): It measures the amount of each type of blood cell in a blood sample to diagnose blood cancer.
- Test for blood proteins: An electrophoresis blood test looks at the various proteins in the blood sample to find those made by the immune system. This test is helpful in diagnosing multiple myeloma.
- Tumor marker: Use a blood sample to look for chemicals made by cancer cells. These tests do not always help with cancer diagnosis as some healthy cells also make these chemicals. Moreover, some non-cancer conditions can also cause high levels of tumor markers.
Some other blood tests might find proteins or other substances made by the cancer and are performed after cancer diagnosis such as:
- Cancer antigen tests: To assess if the treatment is working. Examples of cancer antigens include prostate-specific antigen (PSA) for prostate cancer and cancer antigen CA-125 for ovarian cancer, carcinoembryonic antigen (CEA) for colon cancer, and alpha-fetoprotein for testicular cancer.
- Circulating tumor cell tests: To look for cancer cells that might be in the blood when cancer cells are broken away from where they started and are spreading to other parts of the body. Often used in breast cancer, colon cancer, and prostate cancer.
- Genetic tests: These tests look for small pieces of cancer cells’ DNA that make their way into the blood. Genetic tests are used in cancer patients to understand the DNA changes present in the cancer cells and the results can be considered to select the best treatment.
Cancer diagnosis has been swiftly improving due to advancements in technology and progress in our understanding of the disease. As a result, a reliable diagnostic approach could be feasible in the near future. The premise of future advancements in cancer research is to use blood samples to test DNA changes to detect signs of cancer in healthy people with no symptoms. A cancer blood test can screen the blood of a cancer patient for traces of DNA released by dying tumor cells. Therefore, a highly desired and ideal diagnostic approach is a blood test that could detect cancer at its early onset with high accuracy. A cancer blood test that detects the early signals of cancer would be an alternative to invasive procedures like tissue biopsies and would provide patients with major benefits, including receiving treatment earlier with a higher chance of success in case of a positive test, or ruling out cancer with no need for invasive procedures. A cancer blood test with high accuracy will also allow targeted diagnostic evaluations at the onset of cancer. The research conducted here is in line with an active area of research on using a blood sample to test DNA changes for cancer diagnosis. Our research using paired tumor-blood samples taken from lung cancer patients is an essential step towards this goal. We demonstrated potential differences and similarities in DNA copy numbers between tumor and blood samples that can potentially be used in the future to develop tests for lung cancer diagnosis using a patient’s blood sample.
References
- 1.
American Society of Clinical Oncology: https://www.cancer.net/es/node/19149.
- 2. American Cancer Society, Cancer statistics 2023, https://seer.cancer.gov/statfacts/html/common.html#.
- 3.
Cancer.org. Key statistics for lung cancer.
- 4.
National Cancer Institute. Annual report to the nation on the status of cancer.
- 5. Cecilia Zappa and Shaker A Mousa. Non-small cell lung cancer: current treatment and future advances. Translational lung cancer research, 5(3):288, 2016. pmid:27413711
- 6. Elemento Olivier, Leslie Christina, Lundin Johan, and Tourassi Georgia. Artificial intelligence in cancer research, diagnosis and therapy. Nature Reviews Cancer, 21(12):747–752, 2021. pmid:34535775
- 7. Benjamin H Kann, Reid Thompson, Thomas Charles R Jr Adam Dicker, and Sanjay Aneja. Artificial intelligence in oncology: Current applications and future directions. Oncology (Williston Park, NY), 33(2):46–53, 2019.
- 8. Benjamin H Kann, Ahmed Hosny, and Aerts Hugo JWL. Artificial intelligence for clinical oncology. Cancer Cell, 39(7):916–927, 2021. pmid:33930310
- 9. Zhu Simeng, Gilbert Marissa, Chetty Indrin, and Siddiqui Farzan. The 2021 landscape of fda-approved artificial intelligence/machine learning-enabled medical devices: An analysis of the characteristics and intended use. International journal of medical informatics, 165:104828, 2022. pmid:35780651
- 10. Sotoudeh Houman, Shafaat Omid, Bernstock Joshua D, Brooks Michael David, Elsayed Galal A, Chen Jason A, et al. Artificial intelligence in the management of glioma: era of personalized medicine. Frontiers in oncology, 9:768, 2019. pmid:31475111
- 11. Hamamoto Ryuji, Suvarna Kruthi, Yamada Masayoshi, Kobayashi Kazuma, Shinkai Norio, Miyake Mototaka, et al. Application of artificial intelligence technology in oncology: Towards the establishment of precision medicine. Cancers, 12(12):3532, 2020. pmid:33256107
- 12. Hamamoto Ryuji, Komatsu Masaaki, Takasawa Ken, Asada Ken, and Kaneko Syuzo. Epigenetics analysis and integrated analysis of multiomics data, including epigenetic data, using artificial intelligence in the era of precision medicine. Biomolecules, 10(1):62, 2020.
- 13. Iftikhar Pulwasha, Marcela V Kuijpers Azadeh Khayyat, Iftikhar Aqsa, and Maribel DeGouvia De Sa. Artificial intelligence: a new paradigm in obstetrics and gynecology research and clinical practice. Cureus, 12(2), 2020.
- 14. Nezamoddin N KachouieXihong Lin, Christiani David C, and Armin Schwartzman. Detection of local dna copy number changes in lung cancer population analyses using a multi-scale approach. Communications in Statistics: Case Studies, Data Analysis and Applications, 1(4):206–216, 2015. pmid:31489360
- 15. Nezamoddin N Kachouie, Wejdan Deebani and Christiani David C Identifying similarities and disparities between DNA copy number changes in cancer and matched blood samples. Cancer investigation, 37(10):535–545, 2019. pmid:31584296
- 16. Kachouie Nezamoddin N., Shutaywi Meshal & Christiani David C. Discriminant Analysis of Lung Cancer Using Nonlinear Clustering of Copy Numbers, Cancer Investigation, 38:2, 102–112, 2020. pmid:31977287
- 17. Huang Yen-Tsung, Lin Xihong, Liu Yan, Lucian R Chirieac Ray McGovern, Wain John, et al. Cigarette smoking increases copy number alterations in nonsmall-cell lung cancer. Proceedings of the National Academy of Sciences, 108(39):16345–16350, 2011. pmid:21911369
- 18. Yen-Tsung Huang, Xihong Lin, Chirieac Lucian R, McGovern Ray, Wain John C, Heist Rebecca S, et al. Impact on disease development, genomic location and biological function of copy number alterations in non-small cell lung cancer. PloS one, 6(8):e22961, 2011. pmid:21829676
- 19. Reshef David N, Reshef Yakir A, Finucane Hilary K, Grossman Sharon R, Gilean McVean, Turnbaugh Peter J, et al. Detecting novel associations in large data sets. science, 334(6062):1518–1524, 2011. pmid:22174245
- 20. Zhang Yi, Jia Shili, Huang Haiyun, Qiu Jiqing, and Zhou Changjie. A novel algorithm for the precise calculation of the maximal information coefficient. Scientific reports, 4(1):1–5, 2014. pmid:25322794
- 21. Campbell Colin. An introduction to kernel methods. Studies in Fuzziness and Soft Computing, 66:155–192, 2001.
- 22. Inderjit S Dhillon Yuqiang Guan, and Kulis Brian. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556, 2004.
- 23. Chmara M, Wozniak A, Ochman K, Kobierska G, Dziadziuszko R, Sosinska-Mielcarek K, et al. Loss of heterozygosity at chromosomes 3p and 17p in primary non-small cell lung cancer. Anticancer Res 2004; 24: 4259–63. pmid:15736482
- 24. Marsit CJ, Hasegawa M, Hirao T, Kim DH, Aldape K, Hinds PW, et al. Loss of heterozygosity of chromosome 3p21 is associated with mutant TP53 and better patient survival in non-small-cell lung cancer. Cancer Res 2004; 64: 8702–7. pmid:15574780
- 25. Woenckhaus M, Stoehr R, Dietmaier W, Wild PJ, Zieglmeier U, Foerster J, et al. Microsatellite instability at chromosome 8p in non-small cell lung cancer is associated with lymph node metastasis and squamous differentiation. Int J Oncol 2003; 23: 1357–63. pmid:14532977
- 26. Garnis C., Lockwood W.W., Vucic E., Ge Y., Girard L., Minna J.D., et al., 2006. High resolution analysis of non‐small cell lung cancer cell lines by whole genome tiling path array CGH. International Journal of Cancer, 118(6), pp.1556–1564. pmid:16187286
- 27. Hong Chen, Thiele Robin, Feuerbach Lars, GenomeTornadoPlot: a novel R package for CNV visualization and focality analysis, Bioinformatics, , 38, 7, (2036–20), (2022). pmid:35099519
- 28. Gurjit Kaur Bhatti Paras Pahwa, Gupta Anshika, Navik Umashanker, Jasvinder Singh Bhatti, Therapeutic Strategies Targeting Signaling Pathways in Lung Cancer, Targeting Cellular Signaling Pathways in Lung Diseases, , (217–239), (2021).
- 29. Kok K, Naylor SL, Buys CH. Deletions of the short arm of chromosome 3 in solid tumors and the search for suppressor genes. Adv Cancer Res. 1997; 71:27–92. pmid:9111863
- 30. Varella-Garcia M. Chromosomal and genomic changes in lung cancer. Cell Adh Migr. 2010 Jan-Mar;4(1):100–6. Epub 2010 Jan 7. pmid:20139701; PMCID: PMC2852566.
- 31. Croce CM, Sozzi G, Huebner K. Role of FHIT in human cancer. J Clin Oncol. 1999; 17:1618–1624. pmid:10334551
- 32. Sard L, Accornero P, Tornielli S, Delia D, Bunone G, Campiglio M, et al. The tumor-suppressor gene FHIT is involved in the regulation of apoptosis and in cell cycle control. Proc Natl Acad Sci USA. 1999; 96:8489–8492. pmid:10411902
- 33. Burbee DG, Forgacs E, Zöchbauer-Müller S, Shivakumar L, Fong K, Gao B, et al. Epigenetic inactivation of RASSF1A in lung and breast cancers and malignant phenotype suppression. J Natl Cancer Inst. 2001; 93:691–699. pmid:11333291
- 34. Kaira K, Sunaga N, Tomizawa Y, Yanagitani N, Ishizuka T, Saito R, et al. Epigenetic inactivation of the RAS-effector gene RASSF2 in lung cancers. Int J Oncol. 2007; 31:169–173. pmid:17549418
- 35. Wang YC, Lu YP, Tseng RC, Lin RK, Chang JW, Chen JT, et al. Inactivation of hMLH1 and hMSH2 by promoter methylation in primary non-small cell lung tumors and matched sputum samples. J Clin Invest. 2003; 111:887–895. pmid:12639995
- 36. Ji L, Roth JA. Tumor suppressor FUS1 signaling pathway. J Thorac Oncol. 2008; 3:327–330. pmid:18379348
- 37. Potiron VA, Roche J, Drabkin HA. Semaphorins and their receptors in lung cancer. Cancer Lett. 2009; 273:1–14. pmid:18625544
- 38. Futamura M, Kamino H, Miyamoto Y, Kitamura N, Nakamura Y, Ohnishi S, et al. Possible role of semaphorin 3F, a candidate tumor suppressor gene at 3p21.3, in p53-regulated tumor angiogenesis suppression. Cancer Res. 2007; 67:1451–1460. pmid:17308083
- 39. Girard L, Zochbauer-Muller S, Virmani AK, Gazdar AF, Minna JD. Genome-wide allelotyping of lung cancer identifies new regions of allelic loss, differences between small cell lung cancer and non-small cell lung cancer, and loci clustering. Cancer Res 2000; 60: 4894–906. pmid:10987304
- 40. Janne PA, Li C, Zhao X, Girard L, Chen TH, Minna J, et al. High-resolution single-nucleotide polymorphism array and clustering analysis of loss of heterozygosity in human lung cancer cell lines. Oncogene 2004; 23: 2716–26. pmid:15048096
- 41. Hirsch FR, Varella-Garcia M, Cappuzzo F, McCoy J, Bemis L, Xavier AC, et al. Combination of EGFR gene copy number and protein expression predicts outcome for advanced non-small-cell lung cancer patients treated with gefitinib. Ann Oncol. 2007; 18:752–760. pmid:17317677
- 42. Schrag D, Beer TM, McDonnell CH 3rd, Nadauld L, Dilaveri CA, Reid R, et al. Blood-based tests for multicancer early detection (PATHFINDER): a prospective cohort study. Lancet. 2023 Oct 7;402(10409):1251–1260. pmid:37805216.
- 43. Ye M, Tong L, Zheng X, Wang H, Zhou H, Zhu X, et al. A Classifier for Improving Early Lung Cancer Diagnosis Incorporating Artificial Intelligence and Liquid Biopsy. Front Oncol. 2022 Mar 2; 12:853801. pmid:35311112; PMCID: PMC8924612.
- 44. Katz RL, Zaidi TM, Pujara D, Shanbhag ND. Identification of Circulating Tumor Cells Using 4-Color Fluorescence in Situ Hybridization: Validation of a Noninvasive Aid for Ruling Out Lung Cancer in Patients With Low-Dose Computed Tomography-Detected Lung Nodules. Cancer Cytopathol (2020) 128:553–62. pmid:32320527