Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

In eukaryotes, polyadenylation (poly(A)) is an essential step in mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is key to understanding the mechanisms of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to identify PAS computationally, their need for large amounts of annotated data hinders their application to species that lack experimental PAS data. Cross-species PAS identification, which makes it possible to predict PAS in species not seen during training, is therefore a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolutional Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. Moreover, we test our method under insufficient-data and imbalanced-data conditions and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy with a smaller or imbalanced training set.


Architecture & Details of Implementation
A TensorFlow implementation of Poly(A)-DG is available on GitHub.

MLP
The inputs of the MLP are shuffled, one-hot encoded DNA sequences. The hidden layers of the MLP have 128 nodes, and we use ReLU to activate their outputs. The architecture of our MLP module is shown in Figure [
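As a rough illustration of the input encoding and a single 128-node ReLU layer, the following NumPy sketch may help. The channel order `ACGT`, the all-zero encoding of unknown bases, the toy sequence, and the random weights are assumptions made for this example only; the shuffling step and the exact number of hidden layers are not shown because they are not specified here.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order; not specified in the text


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array.

    Bases outside ALPHABET (for example N) become all-zero rows (assumption).
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def dense_relu(x, weights, bias):
    """Fully-connected layer followed by ReLU."""
    return np.maximum(x @ weights + bias, 0.0)


rng = np.random.default_rng(0)
seq = "ACGTN" * 4                      # toy 20-nt sequence, illustration only
x = one_hot(seq).reshape(1, -1)        # flatten to shape (1, 80)
w = rng.normal(scale=0.1, size=(x.shape[1], 128)).astype(np.float32)
b = np.zeros(128, dtype=np.float32)
hidden = dense_relu(x, w, b)           # shape (1, 128), non-negative
```

In a trained model the weights would of course be learned rather than drawn at random; the sketch only shows the shapes involved.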

CNN
The inputs of the CNN are one-hot encoded DNA sequences. In the convolution layer, we apply 16 1-D convolution kernels of size 10×1 with stride 1 and use ReLU to activate the feature maps. After the ReLU, we place a 1-D max-pooling layer with a 10×1 kernel and a stride of 10. We then flatten the output of the max-pooling layer and feed it to a fully-connected layer with 128 hidden nodes, whose output is also activated by ReLU. Dropout is applied to alleviate over-fitting. The architecture of our CNN module is shown in Figure [
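To make the tensor shapes in this module concrete, here is a minimal NumPy forward pass under stated assumptions: the input length of 206, "valid" (no) padding, and the random weights are all choices made for this sketch and are not taken from the text. Dropout is omitted because it only acts during training.

```python
import numpy as np


def conv1d_relu(x, kernels, bias, stride=1):
    """1-D convolution with 'valid' padding (assumed), followed by ReLU.

    x: (length, channels); kernels: (width, channels, n_filters).
    """
    width, _, n_filters = kernels.shape
    n_out = (x.shape[0] - width) // stride + 1
    out = np.empty((n_out, n_filters), dtype=np.float64)
    for i in range(n_out):
        window = x[i * stride : i * stride + width]   # (width, channels)
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)


def max_pool1d(x, size=10, stride=10):
    """1-D max pooling along the length axis."""
    n_out = (x.shape[0] - size) // stride + 1
    return np.stack([x[i * stride : i * stride + size].max(axis=0)
                     for i in range(n_out)])


rng = np.random.default_rng(0)
length = 206                                           # assumed input length
x = np.eye(4)[rng.integers(0, 4, size=length)]         # random one-hot input
kernels = rng.normal(scale=0.1, size=(10, 4, 16))      # 16 kernels of width 10
feature_maps = conv1d_relu(x, kernels, np.zeros(16))   # (197, 16)
pooled = max_pool1d(feature_maps)                      # (19, 16)
flat = pooled.reshape(-1)                              # (304,)
w_fc = rng.normal(scale=0.1, size=(flat.size, 128))
hidden = np.maximum(flat @ w_fc, 0.0)                  # (128,)
```

With these assumptions, the convolution yields 206 − 10 + 1 = 197 positions, and pooling with size 10 and stride 10 yields 19 positions per feature map.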

Domain generalization & Prediction
We concatenate the representations extracted by the MLP and the CNN and feed them into a fully-connected layer with 2 hidden nodes. We then use the HEX block to process the concatenated features and a softmax classifier to output the prediction for each sample. The operation of HEX and the classifier is shown in Figure [
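The following NumPy sketch illustrates one way the concatenation, the 2-node layer, a HEX-style projection, and the softmax could fit together. It follows the general HEX idea of projecting out, along the batch axis, the part of the output explained by a "superficial" branch. Treating the MLP as that branch, the zero-masking of the CNN features, the small ridge term, and all random inputs and weights are assumptions of this example, not a description of the released implementation.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def project_out(f_all, f_sup, ridge=1e-6):
    """Remove from f_all its component in the column space of f_sup.

    Both arrays have shape (batch, 2); the projection acts along the batch
    axis. The ridge term is an added assumption for numerical stability.
    """
    gram = f_sup.T @ f_sup + ridge * np.eye(f_sup.shape[1])
    return f_all - f_sup @ np.linalg.solve(gram, f_sup.T @ f_all)


rng = np.random.default_rng(0)
batch = 256
cnn_repr = rng.normal(size=(batch, 128))   # stand-in for CNN features
mlp_repr = rng.normal(size=(batch, 128))   # stand-in for MLP features
w = rng.normal(scale=0.1, size=(256, 2))   # fully-connected layer, 2 nodes
b = np.zeros(2)

# Output from both branches, and from the assumed superficial branch alone.
f_all = np.concatenate([cnn_repr, mlp_repr], axis=1) @ w + b
f_sup = np.concatenate([np.zeros_like(cnn_repr), mlp_repr], axis=1) @ w + b

f_proj = project_out(f_all, f_sup)         # (256, 2)
probs = softmax(f_proj)                    # class probabilities per sample
```

After the projection, `f_proj` is (up to the ridge term) orthogonal to the columns of `f_sup`, which is the property the projection is meant to enforce.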

Hyperparameters
The batch size in our experiments is 256. Following [1], we use a hyperparameter search to find the most appropriate hyperparameters for our experiments. To sample a real number from the interval [a, b] log-uniformly, we first sample a real number x ∈ [0, 1] uniformly and then return $10^{(\log_{10} b - \log_{10} a)\,x + \log_{10} a}$. The search range and sampling method of each hyperparameter are presented in Table [
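The log-uniform sampling rule above can be sketched in a few lines of Python. The function name `log_uniform`, the optional `rng` argument, and the example range 1e-5 to 1e-1 are hypothetical and chosen only for illustration; the actual search ranges are those given in the table.

```python
import math
import random


def log_uniform(a, b, rng=random):
    """Sample from [a, b] log-uniformly: 10 ** ((log10 b - log10 a) * x + log10 a).

    x is drawn uniformly from [0, 1) by rng.random(); a and b must be positive.
    """
    if not 0 < a <= b:
        raise ValueError("log-uniform sampling requires 0 < a <= b")
    x = rng.random()
    return 10 ** ((math.log10(b) - math.log10(a)) * x + math.log10(a))


# Hypothetical search range, for illustration only.
rng = random.Random(0)
samples = [log_uniform(1e-5, 1e-1, rng) for _ in range(1000)]
```

Because the exponent is uniform, each decade of the range (for example 1e-5 to 1e-4 and 1e-2 to 1e-1) receives roughly the same share of samples, which is the usual reason for sampling scale-like hyperparameters this way.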

Build pseudo data set
Using the Rat BodyMap database, we select genes that are expressed in the rat brain. To ensure the reliability of our pseudo data set, we keep only gene segments that are not annotated as containing PAS. We then use the get fasta function to extract the selected DNA sequences and search them for motifs identical to PAS patterns. For each such motif, we crop a 200-nt genomic sequence flanking the pseudo-motif, 100 nt upstream and 100 nt downstream, as one pseudo sample. The total number of pseudo PAS sequences equals the number of true PAS sequences, and for each PAS motif the number of pseudo sequences equals the number of true ones.
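The motif search and cropping step can be sketched as follows. The function name `extract_pseudo_samples`, the two example hexamers, and the toy sequence are hypothetical. The text does not say whether the motif itself counts toward the 200 nt; this sketch keeps the motif between the two 100-nt flanks, so a 6-nt motif gives a 206-character sample. Balancing the counts per motif against the true PAS set is not shown.

```python
def extract_pseudo_samples(sequence, motifs, flank=100):
    """Return (motif, window) pairs for every motif hit with full flanks.

    Each window is flank nt upstream + the motif + flank nt downstream.
    Hits too close to either end of the sequence are skipped.
    """
    sequence = sequence.upper()
    samples = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            end = start + len(motif)
            if start >= flank and end + flank <= len(sequence):
                samples.append((motif, sequence[start - flank : end + flank]))
            start = sequence.find(motif, start + 1)   # allow overlapping hits
    return samples


# Toy sequence with one motif-like hexamer, for illustration only.
toy = "C" * 120 + "AATAAA" + "G" * 150
hits = extract_pseudo_samples(toy, ["AATAAA", "ATTAAA"])
motif, window = hits[0]
```

Here `window` has the motif at positions 100 to 105 (0-based), with 100 nt of context on each side.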

Experimental results of training with imbalanced species
To investigate the performance of Poly(A)-DG on imbalanced source domains, we conduct experiments in which we fix the number of DNA sequences from one species in the source domain and vary the number of DNA sequences from the other species. We have six cross-species source domains: Human-Mouse, Human-Rat, Human-Bovine, Mouse-Rat, Mouse-Bovine and Rat-Bovine. Each source domain is labeled with the names of its two species, and the species named first is the first species of that source domain. For example, in the Human-Mouse dataset, the first species is human and the second species is mouse. The experimental results are shown in Figure 5 and Figure 6: Figure 5 shows the experiments in which the number of DNA sequences from the first species is fixed, and Figure 6 shows those in which the number of sequences from the second species is fixed.

Table 3. The accuracy of Poly(A)-DG in many source domains is close or equal to random (50%); in the other source domains, Poly(A)-DG achieves relatively low accuracy.
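As a sketch of how an imbalanced two-species source domain of this kind could be assembled, the snippet below keeps a fixed number of sequences from one species and a varying number from the other. The function name `build_source_domain`, the placeholder sequence identifiers, and the sample counts are hypothetical; the actual counts used in the experiments are not stated here.

```python
import random


def build_source_domain(first, second, n_first, n_second, seed=0):
    """Draw n_first and n_second sequences without replacement and mix them.

    Returns a shuffled list of (sequence, species_tag) pairs.
    """
    rng = random.Random(seed)
    chosen = [(s, "first") for s in rng.sample(first, n_first)]
    chosen += [(s, "second") for s in rng.sample(second, n_second)]
    rng.shuffle(chosen)
    return chosen


# Placeholder identifiers standing in for real sequences (illustration only).
human = [f"human_seq_{i}" for i in range(1000)]
mouse = [f"mouse_seq_{i}" for i in range(1000)]

# Fix the first species at 1000 sequences and vary the second.
domains = {n: build_source_domain(human, mouse, 1000, n) for n in (100, 500, 1000)}
```

Swapping which argument is held constant gives the complementary setting in which the second species is fixed.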