Finding Protein-Coding Genes through Human Polymorphisms

Human gene catalogs are fundamental to the study of human biology and medicine. But they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously-identified genes, but no research has been done on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome. We systematically test the effect of known polymorphisms with this procedure. Polymorphisms can not only disrupt ORFs, they can also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference, whose protein-coding status is supported by homology to known proteins. On average 10% of these genes are located in the genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.


Introduction
Compilation of an accurate catalog of protein-coding genes encoded in human genomes is a critical step to fully understand the functional elements in human genomes. Many annotations of protein-coding genes have been published [1] and a plethora of gene finding software has been introduced [2]. Nevertheless, the task has remained as a great challenge [3]. As stated by Brent [4] the difficulty lies in the limitations of sequencing protocols, ways to combine the predicted genes and finally the limitations of human curators.
In response to these findings we also believe that there is an increasing amount of genomic evidence that may affect proteincoding gene detection, which has not been taken into account. Human polymorphism is one example of such evidence [5]. It has been suggested that such polymorphism affects protein-coding genes, and that they are responsible for various human diseases [6][7][8].
Several studies have tried to account for damaging polymorphism in known protein-coding genes. Such polymorphism could affect the amino acid sequence, alter protein function and contribute to disease. For example Ng and Henikoff [8] discovered that two known genes were mistakenly classified as pseudogenes because of mutations. Others analyzed predicted protein-coding genes and found known SNPs and insertions and deletions in them [9,10]. However, to our knowledge no work has been done on the influence of polymorphism on the finding of novel genes.
In this article we provide a framework for finding new unannotated genes based on human polymorphism. We examine several types of polymorphism: SNPs, insertions, deletions, multiple nucleotide polymorphisms, and microsatellites. With this information we reconstruct the mRNAs and finally define the new open reading frames (ORFs). Figure 1a depicts the standard model for general transfers in biological sequence information. Figure 1b further illustrates the effect of polymorphism on the mature mRNA and its effect in defining the new ORF.
Demonstrating definitively that a transcript encodes a protein is a difficult, time-consuming and expensive task. One approach is to synthesize the protein artificially, raise an antibody against it and use the antibody to test whether it is expressed in vivo. Even this does not discriminate functional proteins from translation noise caused by stochastic nonfunctional translation by ribosomes [11,12]. On the other hand, proving that a transcript does not encode protein is an impossible task because the protein might only be expressed in very rare circumstances [13]. Thus we apply several bioinformatics criteria to improve the reliability of our findings: the ORF length, absence of repeats, E-value of homology to Swiss-Prot and shortness of 59 untranslated region (UTR). Figure 2 illustrates two pipelines in parallel, those with and without polymorphism.
We identified 5,737 putative protein-coding genes that result from mRNA modified by human polymorphisms and have significant homology to known proteins. On average 10% of these genes are located in genomic regions unannotated by 12 other gene catalogs. A genomic coordinate list of these proteincoding genes is available as Table S1.
Among our other findings, we also discover that some of these novel genes are orthologous to genomic regions of other species where genes are annotated. Furthermore, we exhibit three examples of longest ORFs found after modification of the mRNA. They show significant homology to Swiss-Prot proteins and were not predicted as coding regions by any other gene finders. We cannot rule out that these mRNAs are fragments of longer mRNAs (i.e. artifacts), however the presence of supporting ESTs suggests that they are not.

Results
We will first highlight the results by three examples. The genomic positions of the mRNA in these examples are provided in Table 1. There we also indicate the genomic positions of the longest ORF, the polymorphism that causes the modification, the length of 59 UTR after modification, the E-value of homology to known proteins and the sequence orientation. Figure 3 shows a modification by polymorphism of mRNA AK124706. The new ORF is located upstream of the initial ORF. A repeating element overlaps the initial ORF, but none is found for the new ORF. There is also supporting evidence for transcription with the presence of EST (DA187884) at the 59 end of the mRNA. It is aligned to Swiss-Prot Integrin beta-5 protein (Acc: P18084) [14]. Figure 4 shows modification of mRNA AK127273. The mRNA and ORFs are located on the forward strand of the chromosome. We note that the predicted longest ORF after modification does overlap a pseudogene (HIT00004716) predicted by H-Inv, however this pseudogene is on the reverse strand of the chromosome. The new ORF is 59 of the initial ORF. We verified that an EST (DA317450) is present at the 59 end of the mRNA. None of the existing gene finders predict a coding region in the same location as our new ORF, which aligns to Swiss-Prot protein capicua homolog (Acc: Q96RK0) [15]. Figure 5 shows modification of mRNA AY129028. Although there is an existing coding region annotation in the mRNA, this just overlaps the initial ORF. The mRNA and ORFs are located on the reverse strand of the chromosome. Moreover, the initial ORF is located 39 of the new ORF, and it overlaps a repeat element. This suggests that the ribosome would have to scan over the new ORF without translating it in order to reach the known ORF. Even though AY129028 looks like a fragment of pre-spliced mRNA, we found two ESTs (A1290869 and GD144663) at its 59 end, evidence for a real transcript starting here. It appears in this figure that GD144663 starts at chr14:93,406,483-93,406,489 instead of chr14:93,406,109-93,406,489, however we believe the latter is correct. The alignment in the former position lacks a standard splice acceptor sequence, and alignment in the latter  Figure S1). The new ORF also aligns to Swiss-Prot protein FLJ4630 (Acc: Q6ZRH3) [14].

Population Differences in ORF changes
Human populations from different parts of the world are surprisingly similar genetically. When averaged over all genes, the majority of genetic variation present in the human species can be found in any human ethnic group [16]. However the small level of differentiation among these groups is large enough for geneticists to make estimates on human diseases related to their ethnicity [17]. Results from large scale undertakings such as 1000 Genome [5] or HapMap Projects [18] for identifying human polymor-     The normalized cumulative frequency of allele percentage can be seen in Figure 6a. We observed that there are slight differences in the cumulative frequency on seven populations: ASW, CHD, GIH, LWK, MEX, MKK, TSI. It is interesting to note that some populations are closely grouped together, e.g. ASW-MKK and CHB-JPT.
Four of the populations (CEU, CHB, JPT, and YRI) stand out in Figure 6a. In these populations around 40% of the new ORFs are extremely rare. In order to understand this, we plot the     (Figure 6b) directly from the UCSC HapMap data, regardless whether they cause new ORFs or not. The same four populations also stand out as having rare alleles. The reason presumably is an artifact of how the data was collected. These four populations belong to the earlier phase of HapMap project than the other seven populations [19].
For all the ORFs that underwent change, we discover that the majority of them (54%) appear in all 11 populations and around 23% in 4 populations (CEU, CHB, JPT and YRI). In Supplementary Material S1 Section 2 we listed the detailed breakdown of the occurrences of ORFs in every population and additionally in Table S2 we have included lists of ORFs and their occurrence in every population.

Genes and mRNA Counts Affected By Polymorphism
In this work, an ORF is said to be new if it shares no sameframe codons with the initial ORF before modification in mRNA. This implies that these proteins could be completely distinct from one another. In Table 2 we list the numbers of human ORFs and genes affected by polymorphism.
We also looked at the types of polymorphism that cause ORF modification. The table in Supplementary Material S1 Section 3 indicates five types of polymorphisms: deletion (DLT), insertion (INS), insertion/deletion (IND) microsatellite (MIC), multiple nucleotide polymorphism (MNP), and single nucleotide polymorphism (SNP). We found that the primary source of modification that causes ORF change is SNP (79%) followed by insertion (3%), deletion (6%), insertion/deletion (1%), MNP (0.01%) and microsatellite (0.01%).This is consistent with the fact that SNP is the most common in the dbSNP database. Among all the ORFs and genes, there are 8% a of them are affected by polymorphism. In order to verify whether the new ORF could be significantly affected by other polymorphisms, we perform a further experiment. From the new mRNA after initial modification we re-apply polymorphisms onto it (2nd modification). Although the second modification does change the counts of the disrupted ORFs, the effect is small (Supplementary Material S1 Section 5). One possible future direction following this result is    Ranked top 20 terms according to the P-value of overrepresentation against the background set. Last column with 'Enrichment (N, B, n, b)' is defined as follows: N is the total number of genes, B is the total number of genes associated with the corresponding GO term (description), n is the number of genes in the target set, b is the number of genes in the intersection. Enrichment~(b=n)=(B=N): Note that the total number of target genes (n) in the last column could be less or equal to the number of input genes. This is because GOrilla normalised the input gene names with its gene database. doi:10.1371/journal.pone.0054210.t006

Frequency of Random ORFs
We performed a control procedure where the chromosomal positions of polymorphism are randomized. The mRNAs are then modified with these false polymorphisms. Then we applied the same software and parameterization as described in the Method section for determining the longest ORF. Table 3 shows that the number of changes in ORF and gene after modification by random polymorphism is smaller than those caused by real polymorphism. In Supplementary Material S1 Section 1, we detail further the differences of these ORF predictions with respect to their overlap with known genes of other species.

Comparison With Other Gene Catalogs
In general gene prediction approaches can be classified into two categories: sequence similarity based searches and structure or signal based searches. The similarity search approach is based on finding similarity of ESTs (expressed sequence tags) and proteins to the input genome. On the other hand signal based searches rely on information such as motifs, splice sites, start and stop codons to identify the genes. Our method belong to both categories, where we treat the polymorphisms as signal and validation through Swiss-Prot proteins as similarity based approach. We examine the discrepancy of the genes found by our methods with the existing gene finders.
There are fourteen gene finders used by the UCSC Genome Browser to annotate the human genome. They are: Acembly [20], CCDS [21], Ensembl [22], GeneID [23], Genscan [24], KnownGene [25], H-InvDB [26], NSCAN [27], RefGene [28], SGP [29], Vega [30] and Xenoref [28]. We compile the number of our predicted genes that were not found and found by these gene finders in Table 4. The majority of our genes do have overlap, i.e. partially share genomic location, with those genes from other finders. Our finding should complement the results from existing gene finders. On average 10% of our novel coding genes have zero overlap with any given one of these gene finders. It was with H-InvDB genes where we found the greatest number of gene overlaps, around 5% of our genes have no overlap with it.

Gene Comparison With Other Species
Recently there is a growing interest in the origin of human protein coding genes [31]. It was suggested that novel genes regularly appear from messenger RNAs of ancestral genes, and such novel genes significantly affect the evolution of cellular, physiological, morphological, behavioral and reproductive phenotypic traits [32][33][34].
We compared our novel protein coding human genes subject to polymorphism with their counterpart genes in 25 other genomes. For these genomes we find their genes based on the annotation given by Ensembl gene prediction software [22]. Table 5 shows the number of new protein coding genes discovered by our method that overlap and do not overlap with genes in each of the species. Surprisingly, we discovered that there are some overlaps between predicted genes and the known genes of these species. Moreover, we can distinguish between mammals and nonmammals from the table. The mammals have more overlaps ( §81) than non-mammals.

Gene Ontology
Discovery of new genes will be useful if we can elucidate their roles in various biological domains. Gene ontology (GO) provides such a framework. We use Uniprot to estimate gene ontologies of our newly defined ORFs based on their homologs in Swiss-Prot, because Uniprot provides high-quality gene ontology annotation for proteins from multiple species [35]. We tabulate three categories of gene product (molecular function, cellular function and biological process) where the ORFs are over-represented (Table 6, 7,8). The ontologies are assigned using the web based application GOrilla [36].
In the initial experiment we used a target set of gene names from Uniprot that have homology to our predicted ORFs, and a background set of gene names from Uniprot that have homology with longest ORFs before modification by polymorphism. In this experiment we wish to show what is overrepresented in the new ORFs with respect to old ORFs. We found that in the molecular function category, ATP binding is the most overrepresented term with significant P-value (1.49E-33). For the biological process category we found that macromolecule modification is the most overrepresented term with significant P-value (1.23E-13). Finally for the cellular component category we found terms with significant P-value (2.57E-32) related to cell part.
We also performed another experiment where we apply gene names from Uniprot that have homology to old ORF and our newly predicted ORFs as target sets separately. All human proteins from Uniprot are used as background set. With this experiment we want to show if the polymorphism does change the GO assignment, and whether we can find more significant terms compared to the old ORFs. The results (Supplementary Material S1 Section 6) showed that polymorphism does indeed change the GO assignment. Moreover, the P-values of GO terms for ORFs after modification are more significant than those before modification.

Conclusion
In this work we describe our findings on the importance of human polymorphism in determining protein-coding genes. Here we indicate that the human population harbors a number of genes that have been missed by previous analyses focusing on one reference genome alone. In our approach we employ strict bioinformatics criteria: ORF length, absence of repeats, E-value of ORF homology to Swiss-Prot, and length of 59 UTR to verify our findings. We demonstrate that from the new protein-coding genes there were some not found by other genefinder algorithms, but located in genomic regions where genes are annotated in other species.
The UCSC Genome Browser also continuously updates their data especially with regards to newer human genome assemblies, introduction of new genomes, new gene predictors and annotation of polymorphism. Along with this information, in the short-term when more complete SNP catalogs, frequency data and more distant species will be made available; we hope to explore further the impact of human variations in gene finding. It would be useful for example, to probe for the differences between newly predicted genes and known annotated genes by looking at their sequence similarity in addition to their genomic regions.
And eventually there will be data that include haplotype combinations, more RNAs as well as full-length transcripts. To adapt to these future changes, we believe it would be helpful to make available a gene database based on our approach. A dynamic browsing function for the location of ORFs before and after modification would intuitively convey our results.

Source of sequence and polymorphism data
For consistency, all our information described below is obtained from a single resource: the UCSC Genome Browser. First of all, the human mature mRNA sequences from GenBank [37] were used as the source for deriving ORFs. The sequences were obtained from http://hgdownload.cse.ucsc.edu/goldenPath/ hg19/bigZips/mrna.fa.gz. In total there are 917,725 mRNA sequences. To construct the genes we utilize the annotation of mRNA location in the human genome. It can be downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/ database/all_mrna.txt.gz.
We applied human polymorphism from the dbSNP build131 database [38] for reconstructing the mRNAs. In total we considered 25,877,929 variations. The annotation of these variants on the human genome (hg19) was obtained from http:// hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp131. txt.gz. The variants consist of: SNP -single nucleotide polymorphism, MNP-multiple nucleotides polymorphism, microsatellitevariation in forms of short tandem repeats, insertion -the polymorphism as insertions relative to the reference assembly, deletion -the polymorphism as deletion relative to the reference assembly and in-del -insertion/deletion. In-del is a special class defined by dbSNP. We compare the length of the reference allele to the length(s) of observed alleles; if the reference allele is shorter than all other observed alleles, then 'in-del' will be considered as 'insertion'. Likewise, if the reference allele is longer than all other observed alleles, it will be considered as 'deletion'. The table in Supplementary Material S1 Section 4 shows the overall frequencies of the above mentioned polymorphisms.

Defining mRNAs and ORFs from polymorphism
The reconstruction of mRNA is done by mutating, inserting or deleting the nucleotides in the mRNA according to the variant alleles in dbSNP, based on the positions of the variants and mRNA in the human genome (hg19). Lacking information on haplotypes (i.e. combinations of polymorphisms), we applied each polymorphism on its own. Thus, it is possible that some haplotypes that we considered do not exist in any individual. For each newly reconstructed mRNA we obtained the longest ORF using getorf from EMBOSS [39], with options -min 300 -norev -find 1. We used 300 bp nucleotide as minimum length for ORFs prediction as suggested by many studies that this is an appropriate threshold [40,41]. Our approach is also based on the standard view of mRNA that it only encodes one distinct protein.
The sequence of triplet codons in ORFs beginning with ATG and ending with a stop codon represents the protein. Here we only look at ORFs with w~100 codons, because three out of sixty-four codons encode stops and ORFs greater than 100 codons are unlikely to appear by chance in non-coding sequences of average composition [13]. We define genes by merging the ORFs from different mRNAs that overlap in the same strand of a chromosome. Figure 2 exhibits the procedure for identifying novel coding genes through polymorphism. Notice that in our procedure, there is no re-alignment of the modified mRNA to the human genome involved. For the purpose of determining the genomic loci of exons of new mRNAs, we incorporate the mRNA annotation provided by the UCSC Genome Browser. In this annotation the coordinates of the mRNAs in the genome are specified. Such information enables us to convert the exon locations in new mRNAs into exon locations in the genome.
We also find supporting evidence to show that our newly predicted ORFs are translated into proteins. As a first criterion, we selected ORFs that aligned to the manually curated Swiss-Prot protein database with alignment score §130. This scoring threshold corresponds to significant homology (E-value ƒ3.07e-6). At the end of this step we obtained 5,737 protein-coding genes. The majority of the ORFs have low E-values (0) as can be seen Finding Protein-Coding Genes through Polymorphisms from their cumulative distribution in Supplementary Material S1 Section 4. This procedure will allow us to reduce erroneous translation of frameshifted or non-coding nucleotide sequences. The alignment was done using LAST [42]. It contains two subprograms: lastdb and lastal. We indexed the protein database with lastdb with parameter -p to indicate that the sequences are amino acids and -c to soft-mask lower case letters. The actual alignment was done using lastal with parameter -e 130 which sets the minimum score for gapped alignments to 130. The E-value was computed using the lastex program that comes in the package. Both the translated ORFs and Swiss-Prot proteins were repeat masked using TANTAN [43] before performing the alignment.
As a final criterion we select ORFs based on proximity to the mRNA 59 end. It has been suggested that the ribosome scanning mechanism favors short 59 UTRs, whereas long 59 UTRs are likely to contain encumbrances, e.g. upstream start codons or secondary structure elements [44,45].

Allele Frequency
The percentage of alleles is obtained from the count of individuals who are homozygous and heterozygous for the observed alleles. This information is available from the UCSC Genome Browser HapMap annotation: http://hgdownload.cse. ucsc.edu/goldenPath/hg19/database/hapmapSnps*.txt.gz. Following their nomenclature, the zygote counts are defined as follows: homoCount1 and homoCount2 are counts of individuals who are homozygous for the first allele and second allele respectively; heteroCount is the count of individuals who are heterozygous.
Subsequently we define counts of first of and second alleles (Allele1Count and Allele2Count): Finally the allele percentage can be obtained as: : 100% These percentages are then computed with respect to our predicted ORFs, by looking at the correspondence of modifying variants in our method and the HapMap annotation above.

Comparison with Other Gene Catalogs
There are three sources of information involved in comparing our result to other gene catalogs: 1) list of the new ORFs with their positions in mRNA, 2) alignment of human mRNA with the human genome, and 3) fourteen gene annotations of the human genome. The primary information we employed from these annotations are the predicted starting and ending positions of the exons. We obtained the last two sources from the UCSC Genome Browser.
The key idea in these steps is to compare the overlaps with respect to the genomic location of our predicted ORFs and those predicted by other finders. Initially, we located the positions of the new ORFs in the human genome. We can do that by transferring the positions of new ORFs in mRNA with respect to alignment of human mRNA to the human genome. Since the annotations of all other catalogs already provide the genomic location of the exons, we can easily find their overlap or non-overlap with the modified new ORFs.

Gene Comparison with Other Species
For this task we use the following data sources: 1) list of new ORFs obtained after modification and their corresponding position in the mRNA, 2) alignment of human mRNA with the human genome, 3) gene prediction for human genome (hg19) and all other 25 species provided by Ensembl software [22], 4) human genome alignment with other species. We obtained data sources 2 to 4 from the UCSC genome browser.
Similarly to the previous section, we defined overlap based on the genomic location of new ORFs and gene location of other species aligned to human genome. Hence, the first step we did was to find the positions of the new ORFs in the human genome. This can be done by transferring the positions of data source (1) using data source (2). The next crucial step is to find the location of exons of all other species predicted by Ensembl that align to the human genome. The key data source for this step is (3) and (4). Finally, the overlap and non-overlap of new ORFs in human and all other species can be determined.

Gene Ontology
In finding over-represented gene ontology terms resulting from the newly predicted genes after polymorphism, we applied GOrilla (http://cbl-gorilla.cs.technion.ac.il/), a web-based application. We used two unranked lists of genes as the option. We performed two types of experiments for the analysis: 1. For the first experiment, the target list contains 5,630 gene names from Uniprot that have homology to our 18,726 predicted ORFs. The background set consists of 46,961 gene names from Uniprot that have homology with ORFs before modification by polymorphism. 2. In the second experiment, we use all 71,124 human gene names from Uniprot as background set. Then separately we use 5,630 gene names from ORFs after and 46,961 gene names from ORFs before modification by polymorphisms as target set.
In both of the above experiments when an ORF has multiple hits to the Uniprot gene names, we only took gene name with highest E-value.
GOrilla uses the minimum hypergeometric score (mHG) statistic for computing the P-value. In total there are 11,109 GO terms used in our computation. To correct the P-value for multiple testing, we further compute the FDR q-value [46]. It is computed as:   Supporting Information S1 Gene Ontology data used for experiment as described in the method section. (ZIP) Supplementary Material S1 Alignment of EST GD144663 and chr14. (PDF)