Characterization of mammalian Lipocalin UTRs in silico: Predictions for their role in post-transcriptional regulation

The Lipocalin family is a group of homologous proteins characterized by its big array of functional capabilities. As extracellular proteins, they can bind small hydrophobic ligands through a well-conserved β-barrel folding. Lipocalins evolutionary history sprawls across many different taxa and shows great divergence even within chordates. This variability is also found in their heterogeneous tissue expression pattern. Although a handful of promoter regions have been previously described, studies on UTR regulatory roles in Lipocalin gene expression are scarce. Here we report a comprehensive bioinformatic analysis showing that complex post-transcriptional regulation exists in Lipocalin genes, as suggested by the presence of alternative UTRs with substantial sequence conservation in mammals, alongside a high diversity of transcription start sites and alternative promoters. Strong selective pressure could have operated upon Lipocalins UTRs, leading to an enrichment in particular sequence motifs that limit the choice of secondary structures. Mapping these regulatory features to the expression pattern of early and late diverging Lipocalins suggests that UTRs represent an additional phylogenetic signal, which may help to uncover how functional pleiotropy originated within the Lipocalin family.


Introduction
Lipocalins are extracellular proteins that share an ability to bind small hydrophobic ligands and a highly conserved β-barrel folding [1], though their primary sequences diverge greatly among paralogous groups [2]. Proteins in this family also show a wide functional diversity and moonlighting properties [3] that parallel their heterogeneous tissue expression patterns.
Mechanisms controlling gene expression have been studied in a handful of Lipocalins, mainly focused on their promoter regions [4,5,6,7,8]. The post-transcriptional control of gene expression exerted by the upstream and downstream untranslated regions (5' UTR and 3' UTR) has gained importance in recent years [9]. UTRs influence translation efficiency, mRNA PLOS  molecule stability and its export outside the cell nucleus [10], to the extent that mutations in these regions are associated to severe diseases [9]. Nucleotide sequence motifs found in UTRs interact with RNA-binding proteins thanks to hairpin-like secondary structures, and non-coding RNAs like miRNAs can bind to targets in UTRs, especially in 3' UTR [9]. Scarce information is available about UTR regulatory roles in Lipocalin gene expression and a relationship between post-transcriptional control mechanisms and the Lipocalins pleiotropic potential has not been examined. The Lipocalin evolutionary history stands out for its vast branching across different taxa [11]. Metazoans could have inherited an ancestral prokaryotic Lipocalin gene, which after successive duplication rounds gave rise to the tens of paralogs that can be currently found in chordates. The evolutionary process followed by chordate Lipocalin genes has been studied using phylogenetic signals derived from both the gene coding sequence (CDS, namely amino acid sequence alignments) and the exon-intron architecture [12].
In this work, we analyze in silico the UTR regulatory regions of Lipocalins, which might represent an additional phylogenetic signal to uncover how functional diversity originated within the Lipocalin family given their aforementioned characteristics. We focus on mammalian Lipocalins because abundant information of gene orthologs is available and facilitates direct comparisons. The existence of alternative UTRs is examined, as it represents a frequent phenomenon in eukaryotic genomes that would allow a finer and more flexible gene expression control [13].

Selection and collection of 5' and 3' UTRs of mammalian Lipocalin sequences
Sequences from rodent and human Lipocalin orthologs were selected as representative members of the mammalian Lipocalins from the AceView database [14]. The selection was based on their position in a gene phylogeny tree [2,3,11,12] so that both early diverging (ED) and late diverging (LD) Lipocalins are represented in the study sample. We selected Lipocalins for which we found sufficient information of orthologous mammalian genes in the databases used in this work. The Lipocalin α1-microglobulin was not included in our sample because their particular gene fusion to Bikunin could uniquely affect their UTR evolutionary history.
Only transcripts with coincidence with the predicted CDS annotated in RefSeq (NCBI) were chosen. Nucleotide sequences obtained from AceView were present in ASPIcDB [15], which also allowed to include alternative transcripts. Both annotations were confirmed with NCBI RefSeq at the time of sequence selection for our catalog. When comparisons expand to species from other mammalian orders, the UTRs of the genes annotated in RefSeq were chosen.
Sequences and alignments used in this work will be available in S1-S5 Files.
Target regions for micro RNAs (miRNA) were predicted using the PITA algorithm (https://genie.weizmann.ac.il/pubs/mir07/mir07_prediction.html) using 8 as the minimum seed size, allowing single G:U and mismatch, and using flanks to calculate site accessibility [23]. Although other miRNA prediction algorithms exist, we chose PITA due to its consideration of sequence base-pairing, free energy target accessibility and flanking sequences to test whether the existence of potential miRNA target sites is an evolutionary trait in Lipocalin diversity.

Organization and origin of alternative 5' UTRs
EMBOSS ESIM4 [24] was used to align alternative 5' UTR sequences with the corresponding genomic region. AceView database annotations were used to map exon-intron organization into the alignment. 5' UTR genomic regions were additionally examined with ExonScan [25] to predict potential exons. The presence and category of constitutive, alternative or cryptic splicing sites flanking exons were predicted with ASSP [26].
Promoter regions were identified as those annotated by the ENCODE project [27], and predicted by the NNPP algorithm [28]. We also confirmed the NNPP predictions in two Lipocalins (The ED-Lipocalin Rbp4, and the LD-Lipocalin Lcn2) with predictions of the different algorithms FPROM [29], and GPMiner [30]. FPROM predictions coincide with those NNPP of higher probability. Likewise, GPMiner predictions also show results compatible with NNPP for both Lipocalins (S1 Table). The 5' UTR and 2 kb-upstream sequences were used for each selected Lipocalin to detect possible alternative promoters.

UTR exon genomic conservation
Predicted exons were mapped to the genome of different mammalian orders (primates, rodents, artiodactyls and carnivores) using BLAT [31]. Retrieved sequences with percent identity >60% and presumably located in correct positions were marked as potential UTR exon orthologues. We chose the 60% identity as a stringent criterion to maximize homology, because the conservation of human and mouse orthologous sequences ranges 60-80% [32] and the~60% conservation in the 3rd position of orthologous coding sequences. The presence of selected sequences in transcript UTRs of expression datasets was assessed using BLAST [33].

UTR secondary structure prediction
To predict the minimal folding energy (MFE), as well as the suboptimal structures of Lipocalin UTRs, we used the RNAshape algorithm (http://bibiserv.techfak.uni-bielefeld.de/rnashapes) [34] selecting a range of free energy of +5 Kcal/mol for the suboptimal structures. Native structures show energy values closed to the MFE, and RNAshape uses 5 Kcal/mol as a default to predict alternative forms because native structures of structural RNAs show similar energy values.

Post-transcriptional regulation of Lipocalin expression
Protein abundance levels were obtained from PaxDb 4.1 (https://pax-db.org/) in human and mouse whole-integrated proteomes. Ranking and percent normalization to the overall protein abundance were estimated.

Characterization of UTRs in mammalian Lipocalins
Length and composition. A sample of eleven human and murine Lipocalins were chosen according to their position in the family tree ( Fig 1A) based on our previous phylogenetic analyses [2,3,11,12]. Early-diverging (ED) Lipocalins are represented by APOD, APOM, RBP4 and PTGDS, and Late-diverging (LD) Lipocalins by LCN2, LCN8, LCN12, LCN1, C8G, ORM2 and OBP2A. Overall, Lipocalin 5' UTRs possess length and G+C content values similar to the global average found in the UTR database in both species, whereas Lipocalin 3' UTRs tend to diverge from average values ( Fig 1B). Mammalian 3' UTRs are over three times longer than 5' UTRs on average [37], a larger proportion than that of Lipocalins.
The G+C content of gene UTRs and third codon position of CDS are known to correlate [37,38], which holds true for Lipocalin 5' UTRs ( Fig 1C). However, no significant correlation was found for Lipocalin 3' UTRs ( Fig 1D), with a G+C content higher than expected for their length [39]. These results suggest that Lipocalin 3' UTRs G+C content does not properly reflect the features of their genomic context and support the idea that mammalian Lipocalin 3' UTRs have adapted along their evolutionary history to specific gene expression regulatory needs.
Repetitive elements. Some eukaryotic UTRs appear enriched in repetitive elements (STR, LINE, SINE, LTR), mostly found in the 3' UTR, with frequencies associated to functional roles [38]. Repetitive motifs are found in some human and murine Lipocalin UTRs ( Fig 1E). The most common elements are SINE/ALU and STR, in agreement with the expected mammalian UTRs [38]. There are clear differences in the 5' and 3' distribution of repetitive elements between human and mouse orthologues for some Lipocalins, suggesting that their contribution to regulate Lipocalin gene expression is species-specific. Since some repetitive elements span over a hundred nucleotides (Fig 1E), and they even give origin to new alternative exons, they could likely play a role in generating UTR variability during Lipocalin evolution.
Alternative UTRs. Lipocalin UTRs display sequence variation, and many genes selected for this work show alternative 5' UTRs both in mouse and human (Fig 2A). Furthermore, we find a tendency to present high number of alternative 5' UTRs in ED-Lipocalins such as APOD, PTGDS and RBP4. In contrast, alternative 3' UTRs ( Fig 2B) are not so common in Lipocalins, but also appear to be more frequent in ED-genes. In general, human Lipocalins tend to have more alternative UTRs than murine ones.
Considering the mechanisms underlying alternative UTR forms, we compiled the number of UTR exons found in Lipocalins (Fig 2C and 2D). RNA Alternative splicing explains the origin of alternative forms in most cases. However, among Lipocalins with a single 5' UTR exon, human APOM, LCN1 and LCN2 still possess alternative forms (asterisks in Fig 1A), suggesting the existence of alternative transcription start sites (see below).
In relation to 3' UTRs, the two exons detected in human and murine PTGDS ( Fig 2D) support a splicing mechanism for the predicted alternative forms. All other Lipocalins in the set studied have single exon 3' UTRs ( Fig 2D). However, some of them (APOD, RBP4, APOM and LCN2) bear alternative forms ( Fig 2B) that can be originated by variable cleavage at different polyadenylation sites.

5' UTR evolution.
A set of features found in the different alternative 5' UTRs of human and mouse Lipocalins are compiled in Table 1, where each alternative form is denoted by a letter suffix. To learn about the evolution of mammalian Lipocalin 5' UTRs, we first analyzed the genomic architecture of exons/introns for human and murine genes that show alternative and multiexonic 5' UTRs in both species.  these 5' UTRs. The NNPP algorithm and the ENCODE project predict alternative gene promoters that are coherent with several transcription start sites in some Lipocalins such as human and murine APOD, RBP4, PTGDS, and human LCN12. In Lipocalins not showing 5' UTR variability (Fig 2A), ExonScan and ENCODE detected neither additional upstream exons nor candidate promoter regions. Interestingly, the ED-Lipocalins APOD and RBP4 show clear similarities between murine and human exon/intron structure (Fig 3), as well as alternative gene promoters and transcription start sites. However, PTGDS shows species-specific 5' UTR exon-intron structures, quite dissimilar between human and mouse genes.
We then calculated the degree of similarity between exons of human Lipocalin 5' UTRs versus selected species of different mammalian orders (primates, rodentia, artiodactyla and carnivora) (Fig 4). Orthologous pairs of exons were compared. Pairwise alignments reveal that some of the human 5' UTRs exons of APOD, RBP4 and PTGDS (Fig 4A-4C) show significant sequence similarity (>60% identity), indicating conservation along the mammalian orders studied. However, other exons in the same UTRs show no significant similarity with other species, which could be considered hominidae synapomorphies. As for APOM (Fig 4D), its unique 5' UTR exon also shows significant similarity (72-89% identity) with those of other mammalian orders. However, the single 5' UTR exons of LD-Lipocalins display no significant similarity with other mammals.
We also compared average percent identities of orthologous 5' UTRs exons with those obtained when analyzing the corresponding coding sequences (CDS) in the mammalian orders shown in  Table 2 shows that values of percent identity in 5' UTRs are similar to those for the third position of CDS codons in ED-Lipocalins, but much lower in  LD-Lipocalins. This result indicates the existence of a strong selective pressure operating in the 5' UTRs of early diverging mammalian Lipocalins. Considering the RefSeq 5' UTRs of the Lipocalins studied in this work (bold letters in Table 1), we performed a multiple sequence alignment (MSA) in a set of 16 mammalian orders belonging to three Eutherian taxonomic ranks that cover 120 My of mammalian evolution. The result of the pairwise percent identities (distance matrices) are graphically shown in Fig 5. The pattern supports that ED-Lipocalins display a strong sequence conservation of their 5' UTR throughout mammalian evolution, while LD-Lipocalins show high variability in their sequence even among species of the same order. 3' UTR evolution. Overall, the genomic architecture of Lipocalin 3' UTRs is simpler than that of 5' UTRs (Fig 2). Only PTGDS present a single intron. Lipocalin 3' UTRs seem fairly conserved within primates, with identities in the range of 88-96%, and a fair degree of conservation (>60%) in most other cases (Table 3). However, the lack of complete 3' UTR sequences in the databases for some Lipocalins precluded a broad analysis. With the data available so far,  these results provide evidence for an important regulatory function of 3' UTRs in Lipocalin expression.

Properties of mammalian Lipocalin 5' UTR sequences influencing regulatory complexity of protein expression
Because of the different prevalence of alternative forms and the differences in sequence conservation of Lipocalin 5' UTRs depending of their evolutionary history, variations are also expected in the regulatory elements present in these gene regions. Length, G+C content, several sequence motifs and secondary structure are 5' UTR features that could play an important role in gene expression regulation. Short 5' UTRs, with low G+C content and low degree of secondary structure allow efficient translation, while the contrary  holds for genes showing low translation levels [40,41]. Similarly, the existence of upstream initiation codons (uAUG) and upstream open reading frames (uORF) is generally assumed to involve a negative regulation of translation [42,43,44], whose strength relies on properties such as an appropriate sequence context [45], enough distance (>19 nucleotides) to the 5' cap, the presence of multiple uORFS, and their evolutionary conservation. Overrepresented sequence elements in 5' UTRs can be considered regulatory motifs. A low incidence rate categorize 6-8 nucleotide oligonucleotides as significant. Moreover, an overlap of different oligonucleotides and their evolutionary conservation favor their regulatory role [46].
We searched for the features above in our set of human and murine Lipocalin genes 5' UTRs, and these data were used to categorize the translation efficiency of our UTRs according to the classification and regression tree (CART) method [47]. The overall results are compiled in Table 1.
We also found that human and murine Lipocalins uAUG/uORFs are abundant in other species, and many of them show an optimal/adequate context for translation (Table 1). Translation inhibition of uORFs was also predicted by measuring distances between the 5' cap and each Lipocalin uORF (Fig 6). Together these results suggest that translated uORFs are common and efficient in Lipocalins, mainly in ED-genes (Table 1). Moreover, some Lipocalin 5' UTR variants bearing uORFs show significant sequence conservation in several mammalian orders. Particularly, two uORFs of human APOD_a and APOM_d variants and its orthologous sequences show Ka/Ks values above one (1.587 for APOD and 1.309 for APOM) which suggests a positive selection for the peptides putatively translated from those uORFs.
Finally, the features above contributed to categorize translation efficiency as CART Class I genes (low translation), more abundant in ED-Lipocalins such as APOD and RBP4, and those with efficient translation (Class III) that correspond to LD-Lipocalins (Table 1).
In summary, more variation in terms of alternative 5' UTRs, more sequence conservation found across evolutionarily divergent mammalian orders, as well as sequence motifs compatible with a stringent translational control, suggest that ED-Lipocalins amply present in chordates are limitedly translated.

Properties of mammalian Lipocalin 3' UTR sequences influencing regulatory complexity of protein expression
The sequence conservation observed in Lipocalin 3' UTRs led us to explore whether some known regulatory features of this gene region could underlie the functional evolutionary diversity of the Lipocalin gene family.
Polyadenylation signals (PAS) are involved in mRNA cytoplasmic export and stability [48]. We analyzed the number, position, type (canonical vs. non-canonical) of PAS of human and murine Lipocalin 3' UTRs and estimated their polyadenylation efficiency [49,50]. Table 4 shows that ED-Lipocalins APOD, RBP and PTGDS (both in human and mouse) bear long 3' UTRs with more alternative forms. Longer variants with multiple polyadenylation sites (PAS) are predicted to have potentially complex regulation, depending on the efficiency of their PAS. In contrast, LD-Lipocalins show short 3' UTRs with single PAS that suggests less complexity in their translation regulation.
3' UTRs are a common target for miRNAs, well-known regulators of gene expression [9]. We evaluated the miRNA accessibility of 3' UTRs (Table 4), and found that human Lipocalins show more miRNA potential targets than those in the mouse, suggesting a stronger role of 3' UTR miRNA in gene regulation of primate Lipocalins. A different strategy to assess the biological relevance of the predicted miRNA targets is to compare them among different vertebrate species. Table 5 shows a list of potential miRNA targets in human and mouse Lipocalins. Several miRNAs show 3' UTR targets in different human Lipocalins, and miR-125a-3p is the only common miRNA predicted for an orthologous Lipocalin (Obp2a) in mouse and human.  Table 4. Features of alternative 3' UTR of human and murine Lipocalins. In the past few years, a number of miRNA have been found to alter experimentally the expression of some Lipocalins. miRNAs 299-3p, 423-3p and 490-3p were associated to ApoD expression in rat [51]; miRNAs 18b-5p, 19b-3p, 99a-5p, 100-5p, 145-5p, 214-3p and 138 alter Lcn2 expression [52,53], and miRNA 573 affects ApoM expression [54]. Some of these miR-NAs were detected by the PITA algorithm [23], but they were below the ΔΔG threshold of -10 Kcal/mol to be considered accessible.

Properties of mammalian Lipocalin 5' and 3' UTR secondary structures influencing regulatory complexity of protein expression
The secondary structure of 5' and 3' UTRs are known to be a key factor for their regulatory function in gene expression [13,38]. Among the possible folds of a given UTR, the native structure not always represents the one with a minimal folding energy (MFE) [34,55]. Moreover, structural RNAs show a more reduced repertoire of potential secondary structures than those of non-structural RNAs [34].
Therefore, we believe it is very important to study the predicted catalogue of secondary structures of the Lipocalin UTRs in order to make informative hypotheses about their regulatory role. We analyzed the MFE and suboptimal (±5 Kcal/mol) structures of the 5' and 3' UTRs of our selected human and mouse Lipocalins predicted by the RNAshape algorithm (see Methods). We first compared the number of alternative UTR secondary structures of Lipocalins with those of structural RNAs (tRNAs and rRNAs) of similar length present in the Rfam database. The number of alternative secondary structures grow exponentially with the sequence length of structural RNAs (Fig 7A), and a similar relationship found in 3' UTR Lipocalins. However, the average number of alternative secondary structures of Lipocalin 5' UTR is significantly lower in sequences over 150 nucleotides length. Moreover, we assessed the degree of similarity among human 5' UTR alternative structures (over 150 nucleotides) through alignments with RNAforester (see Methods) and found slight differences between MFE and suboptimal structures (Fig 7B).
A restricted range of secondary structures suggests a high conservation of functional elements, and highlights the relevant role of 5' UTR in Lipocalin gene regulation.

UTR properties and post-transcriptional regulation of Lipocalin expression
An apparent contrast in mRNA regulatory stringency led us to consider whether evolutionary divergence might underlie actual differences in translation efficiencies. This idea was tested by assaying protein abundance in the PaxDb 4.1 (https://pax-db.org/) for our Lipocalin set in human and mouse whole-integrated proteomes. Following ranking and percent normalization to the overall protein abundance, a general finding is that Lipocalins show high protein abundance levels in mammals (Fig 8A). These results can be explained by a substantial production of Lipocalin mRNAs that would ensure adequate protein levels despite a stringent post-transcriptional regulation. Also, a positive correlation is evident among orthologous Lipocalins ( Fig 8B), in agreement with overall results when comparing human and mouse proteomes [37]. High protein levels are clear for ED-Lipocalins in mouse and human proteomes (Fig 8A), while only immune system-related acute phase LD-Lipocalins Lcn2, C8g and Orm2 show high abundance. The remaining LD-Lipocalins show scarce or even unnoticeable protein levels.
In contrast, an analysis of RNA-Seq of Human tissues (Illumina Body Map; https://www. ebi.ac.uk/arrayexpress/experiments/E-MTAB-513/) show that ED-Lipocalin transcripts are broadly present in human tissues (Fig 8C; underlined genes), while LD-Lipocalins appear more restricted to certain tissues. Similar results are obtained in a RNA-seq study Lipocalin UTRs in silico analysis (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2801/) of nine mouse tissues (not shown). ED-Lipocalins broad distribution across many different tissues possibly reflects evolutionary traits that result in an increased variability and tight regulation, as suggested by alternative splicing being more common in UTR regions than in their CDS. A complex translational regulation might be responsible for a given ED-Lipocalin mRNA to be differentially expressed in diverse cellular contexts.
On the contrary, LD-Lipocalin genes display UTRs less constricted by selective pressure, with more divergent sequences across orthologs and sequence motifs usually associated with an efficient translation, alongside simpler post-transcriptional regulation mechanisms. This contrasts to their relatively low levels of protein abundance, but a plausible explanation is their tissue-specific expression pattern, which could have led to a lesser need of innovative posttranscriptional regulatory solutions.
Overall, there is an apparent "evolutionary distance/complexity" trade-off in Lipocalin gene UTR-dependent expression regulation, with ED-Lipocalins displaying tight translational regulatory mechanisms under high selective pressure, and LD-Lipocalins having tissue expression patterns loosely regulated at the post-transcriptional level.

Conclusions
The results of our in silico study point to mammalian Lipocalins as a group of paralogous genes, heterogeneous in the context of expression regulation, with UTRs playing a critical role. A strong selective pressure operating upon UTRs (mainly 5' UTR), reflecting a relevant and complex regulation of translation, is suggested by: 1) the presence of alternative UTRs accompanied by a predicted diversity of transcription start sites and alternative promoters; 2) a fair sequence conservation in different mammalian orders; 3) the existence of particular sequence motifs and other regulatory features; 4) a limited choice of secondary structures. This is especially clear in some Lipocalins present early in vertebrate evolution that we have called ED-Lipocalins. These genes show UTR features compatible with complex regulatory mechanisms apparently motivated by the need to accommodate gene expression levels to many different cellular environments, as shown by their high abundance and ubiquitous presence in human and mouse tissues. The opposite seems to occur for LD-Lipocalins, which presumably reflects their role as functional specialists that originated as niche solutions to concrete physiological needs.
Overall, there is an apparent "evolutionary distance/complexity" trade-off in Lipocalin gene UTR-dependent expression regulation, with ED-Lipocalins displaying tight translational regulatory mechanisms under high selective pressure, and LD-Lipocalins having tissue expression patterns loosely regulated at the post-transcriptional level.