The authors have declared that no competing interests exist.
Conceived and designed the experiments: SCC CLH KHY YJC KPW. Performed the experiments: SCC. Analyzed the data: SCC KPW. Contributed reagents/materials/analysis tools: SCC CLH YJC KPW. Wrote the paper: SCC YJC KPW.
Cancer marker discovery is an emerging topic in high-throughput quantitative proteomics. However, omics technology usually generates a long list of marker candidates that requires labor-intensive filtering to screen for potentially useful markers. Specifically, parameters such as the level of overexpression of the marker in the cancer type of interest, which is related to sensitivity, and the specificity of the marker among cancer groups are the most critical considerations. Protein expression profiling based on immunohistochemistry (IHC) staining images is a technique commonly used during such filtering. For the systematic investigation of protein expression in cancer versus normal tissues and cell types, the Human Protein Atlas is a highly comprehensive resource because it includes millions of high-resolution IHC images with expert-curated annotations. To facilitate the filtering of potential biomarker candidates from large-scale omics datasets, we propose a scoring approach for quantifying the IHC annotation of paired cancerous/normal tissues and cancerous/normal cell types. We calculated the scores of all 17219 tested antibodies deposited in the Human Protein Atlas based on their accumulated IHC images and obtained 457110 scores covering 20 different types of cancers. Statistical tests demonstrate the ability of the proposed scoring approach to prioritize cancer-specific proteins. The top 100 potential marker candidates were prioritized for the 20 cancer types with statistical significance. In addition, a model study was carried out on 1482 membrane proteins identified from a quantitative comparison of paired cancerous and adjacent normal tissues from patients with colorectal cancer (CRC). The proposed scoring approach successfully prioritized and identified four CRC markers, including two of the most widely used, namely CEACAM5 and CEACAM6.
These results demonstrate the potential of this scoring approach in terms of cancer marker discovery and development. All the calculated scores are available at
Quantitative proteomics has been used widely in cancer marker discovery with a certain degree of success
A method that can define the cancer-specificity of a protein to the cancer of interest is therefore indispensable. To create such a cancer-specificity index, we need expression information on the various proteins in healthy individuals and in patients with different types of cancer. Acquiring such proteomic data, however, is resource- and time-consuming for small-scale academic research groups. Fortunately, the Human Protein Atlas (HPA) is available; it comprehensively annotates a large number of genes and proteins expressed in various types of normal and cancer tissues
In this study, we proposed a scoring approach based on the annotation of the IHC images from the HPA. The scoring approach takes into account a protein's expression levels in normal/cancer tissues and the significance/specificity of any overexpression of the protein in the cancer tissue. On the basis of the proposed scoring mechanism, we comprehensively prioritized all the tested antibodies in the HPA (17219 antibodies in the HPA version 10.0) for 20 different types of cancers. A statistical analysis of the results was carried out by the one-sample
In this study, immunohistochemistry (IHC) staining images of the HPA version 10.0 released on 12 September 2012 (
A cohort of 1482 membrane proteins expressed in paired tumor and adjacent normal tissues from 28 patients diagnosed with colorectal cancer was used as our validation dataset
The proposed scoring approach is primarily based on using protein expression differences between cancer and normal tissues. Therefore there was a need to map the relationship between the various types of cancer and their paired normal tissues. These mappings, which were extracted from the HPA, are listed in
Cancer | Normal Tissue (Cell Type) | Mapping ID |
Breast cancer | breast (glandular cells) | Breast |
Carcinoid | pancreas (islets of Langerhans) | Carcinoid |
Cervical cancer | cervix, uterine (glandular cells) | Cervical-A |
Cervical cancer | cervix, uterine (squamous epithelial cells) | Cervical-B |
Colorectal cancer | colon (glandular cells) | Colorectal-A |
Colorectal cancer | rectum (glandular cells) | Colorectal-B |
Endometrial cancer | uterus, pre-menopause (glandular cells) | Endometrial-A |
Endometrial cancer | uterus, post-menopause (glandular cells) | Endometrial-B |
Glioma | cerebral cortex (glial cells) | Glioma |
Head and neck cancer | oral mucosa (squamous epithelial cells) | Head & neck-A |
Head and neck cancer | salivary gland (glandular cells) | Head & neck-B |
Cholangiocarcinoma | liver (bile duct cells) | Cholangio |
Hepatocellular carcinoma | liver (hepatocytes) | Hepato |
Lung cancer | bronchus (respiratory epithelial cells) | Lung-A |
Lung cancer | lung (pneumocytes) | Lung-B |
Lymphoma | lymph node (germinal center cells) | Lymphoma-A |
Lymphoma | lymph node (non-germinal center cells) | Lymphoma-B |
Melanoma | skin (melanocytes) | Melanoma |
Ovarian cancer | * | N/A |
Pancreatic cancer | pancreas (exocrine glandular cells) | Pancreatic |
Prostate cancer | prostate (glandular cells) | Prostate |
Renal cancer | kidney (cells in tubules) | Renal |
Skin cancer | skin (keratinocytes) | Skin |
Stomach cancer | stomach, lower (glandular cells) | Stomach-A |
Stomach cancer | stomach, upper (glandular cells) | Stomach-B |
Testis cancer | testis (cells in seminiferous ducts) | Testis |
Thyroid cancer | thyroid gland (glandular cells) | Thyroid |
Urothelial cancer | urinary bladder (urothelial cells) | Urothelial |
*Ovarian cancer was not available because most of the antibodies in HPA database were not evaluated against normal ovary tissues.
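For readers who want to work with these mappings programmatically, the table can be held as a simple lookup keyed by mapping ID. The following is a minimal sketch rather than the authors' implementation; the names `MAPPINGS` and `mappings_for` are hypothetical, and only the table contents come from the text.

```python
# Mapping ID -> (cancer, normal tissue, normal cell type), transcribed from the table.
# MAPPINGS and mappings_for are illustrative names, not part of the HPA or the study.
MAPPINGS = {
    "Breast": ("Breast cancer", "breast", "glandular cells"),
    "Carcinoid": ("Carcinoid", "pancreas", "islets of Langerhans"),
    "Cervical-A": ("Cervical cancer", "cervix, uterine", "glandular cells"),
    "Cervical-B": ("Cervical cancer", "cervix, uterine", "squamous epithelial cells"),
    "Colorectal-A": ("Colorectal cancer", "colon", "glandular cells"),
    "Colorectal-B": ("Colorectal cancer", "rectum", "glandular cells"),
    "Endometrial-A": ("Endometrial cancer", "uterus, pre-menopause", "glandular cells"),
    "Endometrial-B": ("Endometrial cancer", "uterus, post-menopause", "glandular cells"),
    "Glioma": ("Glioma", "cerebral cortex", "glial cells"),
    "Head & neck-A": ("Head and neck cancer", "oral mucosa", "squamous epithelial cells"),
    "Head & neck-B": ("Head and neck cancer", "salivary gland", "glandular cells"),
    "Cholangio": ("Cholangiocarcinoma", "liver", "bile duct cells"),
    "Hepato": ("Hepatocellular carcinoma", "liver", "hepatocytes"),
    "Lung-A": ("Lung cancer", "bronchus", "respiratory epithelial cells"),
    "Lung-B": ("Lung cancer", "lung", "pneumocytes"),
    "Lymphoma-A": ("Lymphoma", "lymph node", "germinal center cells"),
    "Lymphoma-B": ("Lymphoma", "lymph node", "non-germinal center cells"),
    "Melanoma": ("Melanoma", "skin", "melanocytes"),
    "Pancreatic": ("Pancreatic cancer", "pancreas", "exocrine glandular cells"),
    "Prostate": ("Prostate cancer", "prostate", "glandular cells"),
    "Renal": ("Renal cancer", "kidney", "cells in tubules"),
    "Skin": ("Skin cancer", "skin", "keratinocytes"),
    "Stomach-A": ("Stomach cancer", "stomach, lower", "glandular cells"),
    "Stomach-B": ("Stomach cancer", "stomach, upper", "glandular cells"),
    "Testis": ("Testis cancer", "testis", "cells in seminiferous ducts"),
    "Thyroid": ("Thyroid cancer", "thyroid gland", "glandular cells"),
    "Urothelial": ("Urothelial cancer", "urinary bladder", "urothelial cells"),
}


def mappings_for(cancer):
    """Return the sorted mapping IDs that pair the given cancer with a normal cell type."""
    return sorted(mid for mid, (name, _, _) in MAPPINGS.items() if name == cancer)


if __name__ == "__main__":
    cancers = {name for name, _, _ in MAPPINGS.values()}
    print(len(MAPPINGS), "mappings covering", len(cancers), "cancer types")
    print(mappings_for("Lung cancer"))
```

Keying on the mapping ID rather than on the cancer name keeps the cancers that are paired with two normal cell types (for example, Lung-A and Lung-B) as separate entries, which matches how the scores are reported per mapping.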
For a given mapping and a given antibody, our aim was to determine the expression difference (
For the target protein, its expression in tissues is characterized by the annotations
(A) Initially, the protein expression levels and the expression difference (
For the normal cell type, no matter how many times the antibody is used to perform the IHC staining, HPA only reports one pair of
In contrast to the situation for normal tissue, for a given cancer type, the HPA reports a pair of
Finally, the expression difference,
For a given antibody and a given mapping, the antibody is expected to receive a high score if (1) the target protein is overexpressed in the cancer tissue, and (2) the degree of the overexpression is significant and specific to the mapping. The score of the antibody to the mapping is therefore determined using the following steps (
We have comprehensively calculated the scores for all the antibodies used in the HPA for each of the 27 mappings and this resulted in 457110 scores. Instead of summarizing these into a huge flat supplementary file, all the calculated scores are available on a web site that allows queries to be made (
(A) The result of querying by gene name. (B) The result of querying by the mapping of a cancer type.
For each mapping, we select the top 100 antibodies according to their
Mapping ID | Mean (all tested antibodies) | Mean (top 100 antibodies) | Standard deviation (top 100 antibodies) | p-value |
Breast | 86.967 | 210.927 | 12.894 | <0.001 |
Carcinoid | 78.322 | 213.083 | 13.889 | <0.001 |
Cervical-A | 70.833 | 207.705 | 13.866 | <0.001 |
Cervical-B | 70.833 | 211.265 | 12.488 | <0.001 |
Colorectal-A | 95.679 | 210.295 | 12.865 | <0.001 |
Colorectal-B | 95.549 | 210.435 | 12.72 | <0.001 |
Endometrial-A | 76.765 | 205.513 | 13.47 | <0.001 |
Endometrial-B | 76.731 | 208.591 | 12.639 | <0.001 |
Glioma | 61.147 | 212.212 | 11.275 | <0.001 |
Head & neck-A | 83.165 | 218.938 | 9.166 | <0.001 |
Head & neck-B | 83.162 | 219.875 | 8.492 | <0.001 |
Cholangio | 86.078 | 222 | 5.871 | <0.001 |
Hepato | 75.282 | 211.352 | 13.172 | <0.001 |
Lung-A | 65.822 | 178.636 | 25.491 | <0.001 |
Lung-B | 66.014 | 207.776 | 10.676 | <0.001 |
Lymphoma-A | 53.113 | 200.519 | 15.218 | <0.001 |
Lymphoma-B | 53.113 | 202.852 | 14.863 | <0.001 |
Melanoma | 79.367 | 210.641 | 9.016 | <0.001 |
Pancreatic | 83.807 | 207.797 | 12.52 | <0.001 |
Prostate | 79.458 | 206.901 | 12.988 | <0.001 |
Renal | 59.437 | 200.705 | 15.034 | <0.001 |
Skin | 62.645 | 207.807 | 16.649 | <0.001 |
Stomach-A | 75.516 | 202.007 | 14.891 | <0.001 |
Stomach-B | 75.467 | 207.048 | 14.199 | <0.001 |
Testis | 75.369 | 210.936 | 11.128 | <0.001 |
Thyroid | 96.606 | 218.875 | 9.891 | <0.001 |
Urothelial | 77.656 | 194.347 | 17.655 | <0.001 |
The 100 antibodies were selected on the basis of their
The
In order to make sure the proposed scoring approach is capable of identifying proteins that are significantly overexpressed in cancer tissues, we perform a one-sample
Mapping ID | Mean (all tested antibodies) | Mean (top 100 antibodies) | Standard deviation (top 100 antibodies) | p-value |
Breast | −11.035 | 97.927 | 37.331 | <0.001 |
Carcinoid | −3.655 | 113.333 | 37.197 | <0.001 |
Cervical-A | −13.116 | 125.455 | 45.193 | <0.001 |
Cervical-B | −2.811 | 121.015 | 41.49 | <0.001 |
Colorectal-A | −30.496 | 92.995 | 35.012 | <0.001 |
Colorectal-B | −33.668 | 80.685 | 31.201 | <0.001 |
Endometrial-A | −14.956 | 75.013 | 31.704 | <0.001 |
Endometrial-B | −11.921 | 88.341 | 34.803 | <0.001 |
Glioma | 13.024 | 121.612 | 40.476 | <0.001 |
Head & neck-A | 4.52 | 143.588 | 41.322 | <0.001 |
Head & neck-B | 1.333 | 148.475 | 37.688 | <0.001 |
Cholangio | 37.899 | 185.75 | 32.201 | <0.001 |
Hepato | −5.828 | 110.852 | 44.067 | <0.001 |
Lung-A | −58.02 | 67.486 | 40.632 | <0.001 |
Lung-B | 13.475 | 145.776 | 37.781 | <0.001 |
Lymphoma-A | −8.229 | 94.219 | 35.837 | <0.001 |
Lymphoma-B | −11.273 | 88.602 | 31.782 | <0.001 |
Melanoma | −0.421 | 162.141 | 39.514 | <0.001 |
Pancreatic | −21.58 | 116.147 | 36.815 | <0.001 |
Prostate | −13.088 | 88.051 | 33.274 | <0.001 |
Renal | −59.582 | 77.754 | 36.312 | <0.001 |
Skin | −12.319 | 93.407 | 38.835 | <0.001 |
Stomach-A | −42.253 | 91.407 | 38.16 | <0.001 |
Stomach-B | −44.785 | 94.598 | 40.061 | <0.001 |
Testis | −28.357 | 105.686 | 34.966 | <0.001 |
Thyroid | −15.33 | 107.125 | 44.123 | <0.001 |
Urothelial | −36.808 | 68.547 | 36.583 | <0.001 |
The 100 antibodies were selected on the basis of their
The
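As a rough check on the magnitudes in the table, the test statistic can be recomputed from the reported summary values. The sketch below assumes that the one-sample test is a Student's t-test of the top 100 antibodies against the mean of all tested antibodies, and that the reported standard deviation is the sample standard deviation; neither assumption is confirmed by the text, and the function name `one_sample_t` is illustrative.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, reference_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test from summary values."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - reference_mean) / standard_error, n - 1


# Breast row of the table: top-100 mean 97.927, standard deviation 37.331,
# tested against the mean of all tested antibodies, -11.035 (assumed reference value).
t_stat, dof = one_sample_t(97.927, 37.331, 100, -11.035)
print(f"t = {t_stat:.2f}, df = {dof}")  # t = 29.19, df = 99
```

With 99 degrees of freedom, a two-sided p-value below 0.001 needs an absolute t of only about 3.4, so a statistic of this size is consistent with the reported "<0.001".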
The top 100 antibodies of each mapping were also used to verify whether the proposed scoring approach is capable of identifying proteins whose overexpression is specific to the cancer of interest. For the top 100 antibodies of a specific mapping, their average
In this heat map, large
In summary, the proposed scoring approach shows great potential as a means of identifying abundant and cancer-specific proteins in tissues.
In this section we use an evaluation cohort to demonstrate how the proposed scoring approach can be used to screen possible markers for cancers. The cohort consists of 1482 up-regulated membrane proteins from 28 patients who had been diagnosed with colorectal cancer
The proteins selected by these criteria were then further analyzed using the
Eight combinations of filtering criteria were evaluated. Each of the combinations takes into consideration different combinations of the various filtering rules. The filtering results are shown in
(A) The rules that are used to screen genes are marked with a plus sign and otherwise there is a minus sign. For each combination, the numbers of filtered genes, genes with biomarker annotation, and genes with disease annotation are listed. (B) The proportions of annotated biomarkers and disease-related genes to filtered genes of each combination are shown. (C) The proportion of the filtering results to our sample population is shown. This figure is a panel chart that has two panels; the upper one has an axis that covers the full range of data, while the lower one has an axis that focuses on data within the range 0%–25%.
We then applied Combinations 2, 3, and 4 to evaluate the effect of Rule 1, Rule 2, and Rule 3, respectively. Combination 2, namely Rule 1 alone, allowed a certain degree of success in biomarker discovery; the proportion of the annotated biomarkers to the filtered genes is increased from 21.9% to 29.8% (
We also evaluated the performance of combinations that use two filtering rules together. Combination 5 applies Rules 1 and 2, Combination 6 applies Rules 1 and 3, and Combination 7 applies Rules 2 and 3. All three combinations dramatically shrink the sample size to a scale that is suitable for wet-lab validation; applying Combinations 5, 6, and 7 generates 13, 8, and 14 filtered genes, respectively (
Finally, we applied Combination 8, which combines all three rules, to select potential biomarkers for colorectal cancer. This approach identified four filtered genes, of which three have biomarker annotation and all four have disease-related annotation from the IPA. Information on the four proteins is listed in
Gene | Fold Change | Number of Patients | HPA score | IPA annotated Biomarker |
CEACAM5 | 6.41 | 24 | 200.5 | Yes |
CEACAM6 | 4.32 | 21 | 155.2 | Yes |
ANXA4 | 2.07 | 15 | 138.84 | No |
CAMP | 4.29 | 20 | 132.92 | Yes |
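The eight combinations evaluated above correspond to every subset of three filtering rules, which can be enumerated mechanically. The sketch below is illustrative only. The definitions of Rules 1 to 3 are not reproduced in this section, so the three predicates and their thresholds are placeholders chosen to make the combinations behave differently; they are not the study's criteria. Treating Combination 1 as "no rule applied" is inferred from the numbering of Combinations 2 to 8. The candidate rows are the four proteins from the table.

```python
from itertools import combinations

# Gene -> (fold change, number of patients, HPA score), copied from the table.
CANDIDATES = {
    "CEACAM5": (6.41, 24, 200.5),
    "CEACAM6": (4.32, 21, 155.2),
    "ANXA4": (2.07, 15, 138.84),
    "CAMP": (4.29, 20, 132.92),
}

# PLACEHOLDER predicates with arbitrary thresholds; NOT the rules used in the study.
RULES = {
    1: lambda row: row[0] >= 4.0,    # hypothetical fold-change cutoff
    2: lambda row: row[1] >= 20,     # hypothetical patient-count cutoff
    3: lambda row: row[2] >= 150.0,  # hypothetical score cutoff
}


def rule_combinations(rule_ids):
    """List every subset of the rule IDs, ordered by size and then by rule number."""
    subsets = []
    for size in range(len(rule_ids) + 1):
        subsets.extend(combinations(rule_ids, size))
    return subsets


def apply_rules(candidates, rule_ids):
    """Return the genes whose rows satisfy every rule in rule_ids."""
    return [gene for gene, row in candidates.items()
            if all(RULES[rule_id](row) for rule_id in rule_ids)]


COMBOS = rule_combinations(sorted(RULES))

if __name__ == "__main__":
    for number, rule_ids in enumerate(COMBOS, start=1):
        print(f"Combination {number}: rules {rule_ids} -> {apply_rules(CANDIDATES, rule_ids)}")
```

Ordering the subsets by size and then by rule number reproduces the numbering used in the text: Combinations 2 to 4 apply a single rule, Combinations 5 to 7 apply two rules, and Combination 8 applies all three.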
Taken together, all four filtered genes have experimental evidence that supports their potential as biomarkers for colorectal cancer. The filtering results for this model disease suggest that the proposed scoring approach, based on the IHC annotation provided by the HPA, is effective. Although the HPA has been criticized for the unreliable quality of the IHC images and antibodies used, our proposed score appears to provide useful additional information that assists the filtering of cancer marker candidates obtained from high-throughput omics experiments. As the antibody and IHC imaging data are continuously being improved and optimized through the efforts of the HPA, we believe that the reliability issue can be gradually resolved in the future.
Expression fold change of 1482 proteins identified from 28 colorectal patients
(XLS)
Clinical information of the 28 patients in colorectal cancer cohort
(XLS)
We thank Dr. Hsuan-Cheng Huang at the Institute of Biomedical Informatics, National Yang Ming University, for support and inspiring discussions.