Representation learning of genomic sequence motifs with convolutional neural networks

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent that sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs—assembling partial features into whole features in deeper layers—tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Introduction
Deep convolutional neural networks (CNNs) have recently been applied to predict transcription factor binding motifs from genomic sequences (Zhou and Troyanskaya, 2015; Quang and Xie, 2016; Kelley et al., 2016; Hiranuma et al., 2017). Despite the promise that CNNs bring in replacing methods that rely on k-mers and position weight matrices (PWMs) (Ghandi et al., 2016; Foat et al., 2006), there remains a large gap in our understanding of why CNNs perform well.
A convolutional layer of a CNN is comprised of a set of filters, each of which can be thought of as a PWM. Each filter scans across the inputs and outputs a non-linear similarity score at each position, a so-called feature map. The filters are parameters of the CNN that are trained to detect relevant patterns in the data. Deep CNNs are constructed by feeding the feature maps of a convolutional layer as input to another convolutional layer. This can be repeated to create a network of any depth. CNNs typically employ max-pooling after each convolutional layer, which down-samples the feature maps by summarizing non-overlapping windows with a single maximum score, separately for each filter. Max-pooling enables deeper layers to detect features hierarchically across a larger spatial scale of the input. CNN predictions are typically made by feeding the feature map of the final convolutional layer through a fully-connected hidden layer followed by a linear classifier.
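The convolution scan, ReLU activation, and non-overlapping max-pooling described above can be sketched in a few lines of numpy (a minimal illustration for one filter, not the paper's implementation; the toy sequence and the `GTG`-matching filter are hypothetical):

```python
import numpy as np

def conv_scan(onehot, conv_filter):
    """Cross-correlate a (4 x L) one-hot sequence with a (4 x k) filter."""
    L, k = onehot.shape[1], conv_filter.shape[1]
    return np.array([np.sum(onehot[:, i:i + k] * conv_filter)
                     for i in range(L - k + 1)])

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(fmap, size):
    """Down-sample with non-overlapping windows (stride == window size)."""
    n = len(fmap) // size
    return fmap[:n * size].reshape(n, size).max(axis=1)

# toy example: scan a 12-nt sequence for the pattern GTG
seq = "ACGTGCACGTGA"
onehot = np.array([[1.0 if s == a else 0.0 for s in seq] for a in "ACGT"])
gtg = np.array([[0, 0, 0],   # A
                [0, 0, 0],   # C
                [1, 0, 1],   # G
                [0, 1, 0]],  # T
               dtype=float)  # a filter whose maximal match is GTG
fmap = relu(conv_scan(onehot, gtg))
pooled = max_pool(fmap, 2)
```

The feature map peaks (score 3) wherever GTG occurs, and pooling halves its length while keeping the peaks.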
In genomics, it is commonly thought that the first convolutional layer filters learn PWMs of sequence motifs, while deeper layers learn combinations of these motifs, so-called regulatory grammars (Alipanahi et al., 2015; Angermueller et al., 2016; Zeng et al., 2016; Quang and Xie, 2016; Kelley et al., 2016). Thus, first convolutional layer filter sizes are usually chosen to be larger than the expected motifs, i.e. 6-19 nucleotides (nts) (Mathelier et al., 2016). A quantitative motif comparison search typically reveals that less than 50% of the first layer filters find any statistical match against a database of motifs (Kelley et al., 2016; Quang and Xie, 2016). Unmatched filters have been suggested to be either partial representations of known motifs or novel motifs, i.e. motifs not included in the database. It remains unclear to what extent we should expect first layer filters to learn whole-motif representations in the first convolutional layer.
Learning whole-motif representations by first layer filters is not indicative of a deep CNN's classification performance. For instance, a deep CNN that employs a small first layer filter, i.e. 8 nts (Zhou and Troyanskaya, 2015), which is shorter than many common motifs found in vivo, has demonstrated comparable performance to CNNs that employ larger filters, i.e. ≥19 nts (Quang and Xie, 2016; Kelley et al., 2016). In principle, smaller filters that capture partial motif representations can be combined in deeper layers to assemble whole-motif representations, thereby allowing the CNN to make accurate predictions.
Here we perform systematic experiments to demonstrate that a CNN's architecture, specifically max-pooling and filter size, is indicative of how internal representations of motifs are learned. We focus our study on the representations learned by first convolutional layer filters using a synthetic dataset with a known ground truth. By systematically modifying the architecture, we learn general principles that are predictive of the extent that first layer filters learn representations that resemble whole-motifs versus partial-motifs. We then demonstrate that the same principles learned from synthetic sequences generalize to in vivo sequences.

Internal representations of motifs depend on architecture
We conjecture that motif representations learned in first layer filters are largely influenced by a CNN's ability to assemble whole-motif representations in deeper layers, which is determined by architectural constraints set by: 1. the convolutional filter size, 2. the stride of the filter, which is the number of steps the filter takes (usually set to 1), 3. the max-pool size, and 4. the max-pool stride, which is the number of steps for each max-pool window (usually set to the max-pool size to create non-overlapping max-pool windows).
Assuming that accurate classification can only be made if the correct motifs are detected, a CNN that learns partial-motif representations in the first layer must assemble whole-motif representations at some point in deeper layers. To help explain how architecture can influence representation learning in a given layer, we introduce the concept of a receptive field, which is the sensory space of the data that affects a given neuron's activity. For the first convolutional layer, each neuron's receptive field has a size equal to the filter size at a particular region of the data. Each neuron's receptive field is only activated to an extent that depends on the similarity of a sequence and a given filter. Since there are typically many filters in a convolutional layer, there are many neurons whose receptive fields share the same spatial region. However, each neuron's activation is determined by a different filter. Max-pooling combines multiple neurons of a given filter within a specified window into a single max-pooled neuron, thereby augmenting the size of its receptive field. In doing so, max-pooling obfuscates the exact positioning of the max-activation within each window. Thus the location of the max-activation has spatial invariance within its receptive field by an amount equal to the max-pool size.
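The growth of a receptive field through convolution and pooling layers follows a standard recursion: at each layer the field widens by (kernel size − 1) times the cumulative stride of the preceding layers. A small sketch, plugging in this paper's layer sizes (an illustration of the recursion, not code from the paper):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns the input-space receptive field of one neuron after the last layer."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # widen by (k-1) positions at the current jump
        jump *= s             # stride compounds the spacing between neurons
    return rf

# conv(19, stride 1) -> max-pool(p, stride p) -> conv(5, stride 1)
small_pool = receptive_field([(19, 1), (2, 2), (5, 1)])    # CNN-2 style
large_pool = receptive_field([(19, 1), (25, 25), (5, 1)])  # CNN-25 style
```

With a pool size of 2, a second-layer neuron sees 28 nts; with a pool size of 25 it sees 143 nts, but with correspondingly coarse positional information.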
Although max-pooling creates spatial uncertainty of the max-activation within a max-pooled neuron's receptive field, we surmise that neighboring max-pooled neurons of different filters, which share significantly overlapping receptive fields, can help to resolve the spatial positioning of an activation. To illustrate, Figure 1A shows a toy example of two convolutional filters, each 7 nts long, which have learned partial-motifs: `GTG' and `CAC'. An example sequence contains three embedded patterns (highlighted in green): `CACGTG', `GTGCAC', and `CACNNNGTG', where `N' represents any nucleotide with equal probability. The resulting max-pooled, activated convolutional scans for each filter are shown above the sequence, with a blue shaded region highlighting the receptive field of select max-pooled neurons. Even though the first convolutional layer filters have learned partial-motifs, the second convolutional layer filters can still resolve each of the three embedded patterns by employing filters of length 3. Of course, situations may arise where the three second convolutional layer filters are unable to fully resolve the embedded patterns with fidelity. For instance, `CACNGTG' could be activated by the same filter as `CACGTG'. A CNN can circumvent these ambiguous situations either by learning more information about each pattern within each filter or by dedicating additional filters to help discriminate the ambiguous patterns.
It follows that by creating a situation where partial-motif representations cannot be assembled into whole-motifs in deeper layers, learning whole-motifs by first layer filters becomes obligatory for accurate classification. One method to limit the information flow through a CNN is to employ large max-pool sizes relative to the filter size. The max-pooled neurons then have large receptive fields with a large spatial uncertainty and only a small overlap in receptive fields with neighboring neurons of different filters. A deeper layer would be unable to resolve the spatial ordering of partial motifs to assemble whole-motifs with fidelity. To exemplify, Figure 1B shows a toy example of a CNN that employs a larger pool size of 20. Importantly, there are large spatial regions within a receptive field that a neighboring neuron cannot help to resolve due to a lack of overlap in receptive fields. As a result, deeper convolutional layer filters dedicated to each pattern would yield the same signature, unable to resolve any of the three patterns.
More technically, the extent of motif information that each filter learns is guided by the gradients of the objective function, which serves as a measure of the classification error. Assuming accurate classification can only be achieved upon discriminating the underlying motifs of each class, once whole-motifs for each class are learned, the objective function is minimized and the training gradients go to zero. If a CNN can build whole-motifs from partial-motifs in deeper layers, then there is no more incentive to learn additional
information to build upon the partial-motif representations already learned. As a result, the first layer filters will maintain a distributed representation of motifs (Hinton et al., 1986). On the other hand, if architectural constraints limit the ability to build whole-motifs from partial-motifs in deeper layers, then accurate predictions cannot be made. Hence, gradients will persist because the objective function is not yet minimized, encouraging first layer filters to learn whole-motifs, also known as a localist representation of motifs (Hinton et al., 1986). Once the first layer filters have learned sufficient information about whole-motifs to discriminate each class, the objective function can be minimized, signaling the end of training.

Max-pooling influences the ability to build hierarchical motif representations
To test this idea, we created a synthetic dataset for the task of predicting which transcription factors bind to a given sequence. Briefly, synthetic sequences, each 200 nts long, were implanted with 1 to 5 known motifs, randomly selected with replacement from a pool of 12 transcription factor motifs, embedded in random DNA (see Methods for details). The motifs were manually selected from the JASPAR database to represent a diverse, non-redundant set. The goal of this computational task is to simultaneously make 12 binary predictions for the presence or absence of each transcription factor motif in the sequence. Since we have ground truth for all of the relevant TF motifs and where they are embedded in each sequence, we can test the efficacy of the representations learned by a trained CNN model. We note that the ground truth covers only embedded motifs and not motifs that occasionally arise by chance; the latter effectively creates false negative labels in this dataset.
A CNN model that employs at least two convolutional layers is required to test our hypotheses of representation learning. We constructed a CNN with 3 hidden layers: two convolutional layers, each followed by max-pooling, and a fully-connected hidden layer. Specifically, our CNN takes as input one-hot encoded sequences, processes them with the hidden layers, and outputs a prediction of the binding probability for each of the 12 classes. The number of filters in each convolutional layer, the number of units in the fully-connected hidden layer, and the dropout probabilities are fixed (see Methods). The filter sizes, the max-pool window sizes, and the max-pool strides are the hyperparameters that can be varied. For a given hyperparameter setting, we trained the CNN as a multi-class logistic regression (see Methods for training details). All reported metrics are strictly drawn from the held-out test set using the model parameters that yielded the best performance on the validation set.
To explore how the spatial uncertainty within receptive fields set by max-pooling influences the representations learned by first layer filters, we systematically altered the max-pool sizes while keeping all other hyperparameters fixed, including a first and second layer filter size of 19 and 5, respectively. To minimize the influence of architecture on classification performance, we coupled the max-pool sizes between the first and second layer, such that their products are equal, which makes the inputs into the fully-connected hidden layer the same size across all CNNs. The max-pool sizes we employed are (first layer, second layer): (2, 50), (4, 25), (10, 10), (25, 4), (50, 2), and (100, 1). For brevity, we denote each CNN by only the first max-pool window size, e.g. CNN-2 for (2, 50).
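This coupling can be checked with a few lines of arithmetic (a sketch; the 200 nt sequence length and the 128 second-layer filters are taken from the Methods, and the convolutions are assumed zero-padded to preserve length):

```python
# coupled (first, second) max-pool sizes used in the paper; product is 100
pool_pairs = [(2, 50), (4, 25), (10, 10), (25, 4), (50, 2), (100, 1)]
seq_len, n_filters_layer2 = 200, 128

def dense_inputs(p1, p2):
    # length after each non-overlapping pooling stage, times the
    # number of second-layer feature maps
    return (seq_len // p1 // p2) * n_filters_layer2

sizes = {pair: dense_inputs(*pair) for pair in pool_pairs}
```

Every pair feeds the same number of inputs to the fully-connected hidden layer, so differences between the models are attributable to the pooling geometry rather than to model capacity at the dense layer.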
We first verified that the performance of each model is similar as measured by the average area under the receiver-operating-characteristic (AU-ROC) curve across the 12 classes (Table 1), which is in the range of previously reported values for a similar task using experimental ChIP-seq data (Zhou and Troyanskaya, 2015; Quang and Xie, 2016). Next, we converted each filter to a sequence logo to visually compare the motif representations learned by the first layer filters across the different architectures (Fig. 2). As expected, we found that CNNs that employ large max-pool sizes (≥10) learn representations that qualitatively resemble the ground truth motifs. On the other hand, CNNs that employ a small max-pool size (≤4) do not seem to qualitatively capture any ground truth motif in its entirety, perhaps learning, at best, parts of a motif.
To quantify the number of filters that have learned motifs, we employed the Tomtom motif comparison search tool (Gupta et al., 2007) to compare the similarity of each filter against all motifs in the JASPAR 2016 vertebrate database (Mathelier et al., 2016) using an E-value cutoff of 0.1. In agreement with our qualitative observation, we found that CNNs that employ a small max-pool size (≤4) have, at best, 33% of their filters match any known motifs. Of these, only 1 filter in CNN-4 matches a ground truth motif. In contrast, CNNs that employ a large max-pool size yield, at worst, a 90% match to ground truth motifs.

Motif representations are not very sensitive to filter size

Because it has been widely thought that first convolutional layer filters learn motifs, deep learning practitioners have traditionally employed CNN architectures with large first layer filters to capture motif patterns in their entirety. However, we have shown that employing a large filter does not necessarily lead to whole-motif representations. To test the sensitivity of representation learning to filter size, we created two new CNN models that employ a first layer filter size of 9 (CNN9), in contrast to the filter size of 19 used previously, with max-pool combinations of 4 and 25, i.e. CNN9-4 and CNN9-25. Since the combination of a filter size of 9 with a max-pool size of 4 creates overlapping receptive fields with a small spatial uncertainty, we expect that this architecture setting will lead to partial-motif representations. On the other hand, a filter size of 9 is insufficient to resolve spatial positions when employing a max-pool size of 25. Hence, we predict that this architecture setting will yield whole-motif representations. As expected, CNN9-25 learns representations that qualitatively better reflect the ground truth motifs compared to CNN9-4 (Fig. 3, A-B). Interestingly, CNN9-25 also learns partial motif representations of larger motifs, i.e. MEF2A, SRF, STAT1, CEBPB, but in a more visually identifiable way compared to CNN9-4. By quantifying the percentage of filters that statistically match ground truth motifs, CNN9-25 yields an 80% match, compared to CNN9-4, which yields only a single match (Table 1).
As a control, we created CNN models with a filter size of 3 and max-pool size combinations of 2 and 50, i.e. CNN3-2 and CNN3-50. Since a max-pool size of 2 is smaller than the filter size, we expect that CNN3-2 will still be able to assemble whole motifs to some extent in deeper layers, despite having a very small filter size. On the other hand, since CNN3-50 has only one chance to learn whole motifs, we expect that the small filter size will lead to poor classification performance. Indeed, CNN3-50 yields a mean AU-ROC of 0.652±0.060 across the 12 classes, compared to CNN3-2, which yields 0.968±0.039 (error is the standard deviation across the 12 classes).

Spatial uncertainty within receptive fields determines motif representations
One aspect of max-pooling that we did not consider in our toy model is the max-pool stride, which is typically set to the max-pool size. Employing a large max-pool size with a small max-pool stride can create a situation where the receptive fields of max-pooled neurons overlap significantly, which should improve the spatial resolution of partial-motifs. However, a deeper convolutional filter would still be unable to assemble whole motifs, because each receptive field has a large spatial uncertainty, making it challenging to discriminate partial-motifs that assemble into whole motifs from partial-motifs that are spatially distant.
To test this, we created a new CNN model which employs a large max-pool size of 50 with a max-pool stride of 2 (CNN-50-2). Consequently, the feature maps after the first convolutional layer are half the length of the input sequence, which is the same feature map shape as CNN-2, which employs a max-pool size of 2 with a stride of 2. Similar to CNN-2, CNN-50-2 employs a max-pool size and stride of 50 after the second convolutional layer. Strikingly, CNN-50-2 learns whole-motif representations, with 70% of its filters matching ground truth motifs in the synthetic dataset (Table 1). Moreover, the motifs learned by CNN-50-2 qualitatively better resemble whole-motif representations (Fig. 3C) compared to CNN-2 (Fig. 2). Together, this result further supports that architecture, specifically the ability to assemble whole-motifs in deeper layers, plays a major role in how CNNs learn genomic representations in a given layer.
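Max-pooling with a window larger than the stride, as in CNN-50-2, can be sketched as follows (a minimal numpy illustration using 'valid' windows, so the output lengths differ slightly from the zero-padded model in the paper):

```python
import numpy as np

def max_pool_strided(fmap, size, stride):
    """Max-pool a 1D feature map; overlapping windows when stride < size."""
    n = (len(fmap) - size) // stride + 1
    return np.array([fmap[i * stride: i * stride + size].max()
                     for i in range(n)])

x = np.arange(200, dtype=float)                   # stand-in feature map
small = max_pool_strided(x, size=2, stride=2)     # CNN-2 style: disjoint windows
wide = max_pool_strided(x, size=50, stride=2)     # CNN-50-2 style: heavy overlap
```

Both settings step through the feature map two positions at a time, so neighboring outputs stay finely spaced, but each output of the wide pooling summarizes a 50-position window, which is what creates the large per-neuron spatial uncertainty discussed above.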

Distributed representations build whole-motif representations in deeper layers
The high overall classification performance of the CNNs suggests that they must have learned whole-motif representations of each embedded TF at some point. Thus, CNN-2, whose first layer filters did not match any relevant motifs, must be assembling whole-motif representations in deeper layers. To verify that CNN-2 eventually learns whole-motif representations, we visualized the representations learned throughout the entire network with saliency analysis, specifically guided-backpropagation (Springenberg et al., 2014), a technique to probe the independent importance of each nucleotide in a sequence towards a given prediction (see Methods). By visualizing a representative sequence logo of a saliency map generated by CNN-2 and CNN-25 for sequences associated with different TF classes, we confirm that the underlying motif representations are indeed learned irrespective of whether the first layer learns whole-motifs or partial-motifs (Fig. 4). We note that the saliency maps generated by CNN-2 can occasionally exhibit noisier importance scores for nucleotide variants in the motif compared to CNN-25, which tends to better reflect the embedded motifs. Interestingly, we also show that CNN3-2 is largely able to learn representations of whole-motifs, despite employing a very small first layer filter size of 3 (Fig. 4).

Generalization to in vivo sequences
To test whether the same representation learning principles generalize to in vivo sequences, we modified the DeepSea dataset (Zhou and Troyanskaya, 2015) to include only in vivo sequences that have a peak called across 12 ChIP-seq experiments, each of which corresponds to a TF in the synthetic dataset (see Supplemental Table S1). Thus, the truncated-DeepSea dataset is similar to the synthetic dataset, with sequences that have a corresponding label for the presence or absence of a peak across the 12 ChIP-seq experiments. The truncated-DeepSea dataset consists of 1,000 nt sequences, in contrast to the 200 nt sequences in the synthetic dataset.
We trained each CNN model on the in vivo dataset following the same protocol as for the synthetic dataset. Similar to CNN models trained on the synthetic dataset, a qualitative comparison of the first layer filters of different CNN models shows that employing a larger pool size yields representations that better reflect whole-motifs (Fig. 5). By employing the Tomtom motif comparison search tool, we quantified the percentage of significant hits between the first layer filters and the JASPAR database (see Table 2). Similar to the synthetic dataset, CNNs that employ a smaller max-pool size (≤4) yield a percent match that is, at best, 57% (Table 2). In contrast, CNNs that employ a larger max-pool size (≥10) yield a percent match that is, at worst, 83% (Table 2). Since in vivo sequences contain many additional signals compared to the synthetic sequences, we were unable to reliably quantify the percentage of filters that learn relevant motifs. Interestingly, we found that each CNN was consistently unable to identify known motifs for ARID3A, MEF2A, SP1, and STAT1. However, it is unclear whether this arises from experimental or post-processing errors that create label noise in the sequences we assign as having a ChIP-seq peak, the large variance in the numbers of sequences for different classes (class imbalance), and/or an inability of the CNNs to learn the correct motif, among many other possible explanations. Notwithstanding, the same trends in the amount of motif information learned by first layer filters in vivo suggest that we have identified a general principle for representation learning by CNNs.

Conclusion
By exploring different CNN architectures on a synthetic dataset with a known ground truth, we were able to reveal principles of how architecture design influences representation learning of sequence motifs. Typical deep CNN architectures currently employed in genomics, which employ large filters and small max-pool sizes, tend to learn distributed representations of sequence motifs. However, localist representations, i.e. whole motifs, can be learned by constraining the ability of deeper layers to assemble hierarchical representations of motifs. While we explored the role of architecture, we note that there may be other factors that contribute to the quality of the learned representations, including regularization and optimization algorithms.
Interpreting the representations learned by convolutional filters should be approached with skepticism. Even though a CNN can be designed to preferentially learn whole-motifs in the first convolutional layer, not all filters will learn motifs. Moreover, we showed that using a motif comparison tool does not necessarily provide a reliable way of identifying whether a CNN learns relevant motifs. Another aspect of convolutional filters that is often misleading is that the number of filters dedicated to a motif may not be a reliable measure of the importance of the motif. The variation in the number of filters dedicated to a motif on the synthetic sequences, which do not have any class imbalance, suggests that the observed difference is more likely due to random initialization and the difficulty of finding that motif, not the importance of the motif.
The similar performance across the CNNs explored here suggests that motif discovery does not require complicated architectures that learn distributed representations. Nevertheless, we posit that building distributed representations may be more beneficial in more complicated tasks, because a wider array of representations can be constructed through combinatorics of partial representations. Moreover, there is less dependence on convolutional filter lengths and numbers of filters as long as there exist deeper layers that can build representations hierarchically. In contrast, building localist representations means that the CNN is subject to harder constraints set by the numbers of filters and the filter lengths, limiting the number and sizes of representations that can be learned. However, when the main features in the dataset are simple, such as whether or not a motif is present, CNN architectures that learn localist representations yield an easier-to-interpret model that still performs competitively.

Synthetic dataset
The synthetic dataset consists of sequences with known motifs embedded in random DNA sequences to mimic a typical multi-class binary classification task for ChIP-seq datasets. We acquired a pool of 24 PWMs from 12 unique transcription factors (forward and reverse complements) from the JASPAR database (Mathelier et al., 2016): Arid3a, CEBPB, FOSL1, Gabpa, MAFK, MAX, MEF2A, NFYB, SP1, SRF, STAT1, and YY1. For each sequence, we generated a 200 nt random DNA sequence model with equal probability for each nucleotide. 1 to 5 TF PWMs were randomly chosen with replacement and randomly embedded along the sequence model, such that each motif has a buffer of at least 1 nucleotide from other motifs and the ends of the sequence. We generated 25,000 sequence models and simulated a single synthetic sequence from each model. A corresponding label vector of length 12, one for each unique transcription factor, was generated for each sequence, with a one representing the presence of a TF's motif or its reverse complement along the sequence model and a zero otherwise. The 25,000 synthetic sequences and their associated labels were then randomly split into a training, validation, and test set according to the fractions 0.7, 0.1, and 0.2, respectively.
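The generation procedure can be sketched as follows (a simplified illustration, not the authors' code; the `ebox` PWM below is a hypothetical deterministic placeholder rather than an actual JASPAR matrix, and the rejection-sampling placement is one possible way to enforce the 1 nt buffer):

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = np.array(list("ACGT"))

def sample_motif(pwm, rng):
    """Draw one instance from a (4 x w) PWM of column-wise probabilities."""
    return [rng.choice(4, p=pwm[:, j]) for j in range(pwm.shape[1])]

def simulate_sequence(pwms, seq_len=200, max_motifs=5, rng=rng):
    """Embed 1-5 randomly chosen motifs (with replacement) in random DNA,
    keeping at least 1 nt between motifs and from the sequence ends."""
    seq = list(rng.integers(0, 4, size=seq_len))
    labels = np.zeros(len(pwms), dtype=int)
    occupied = np.zeros(seq_len, dtype=bool)
    for _ in range(rng.integers(1, max_motifs + 1)):
        idx = rng.integers(len(pwms))
        w = pwms[idx].shape[1]
        for _ in range(100):  # rejection-sample a non-overlapping start
            start = rng.integers(1, seq_len - w - 1)
            if not occupied[start - 1:start + w + 1].any():
                seq[start:start + w] = sample_motif(pwms[idx], rng)
                occupied[start - 1:start + w + 1] = True
                labels[idx] = 1
                break
    return "".join(ALPHABET[seq]), labels

# hypothetical 6-nt E-box-like PWM with deterministic columns
ebox = np.eye(4)[["ACGT".index(c) for c in "CACGTG"]].T
seq, y = simulate_sequence([ebox])
```

Because the placeholder PWM is deterministic, the embedded instance is exactly CACGTG; with real JASPAR PWMs each embedding would be a sample from the motif's nucleotide distribution.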

In vivo dataset
Sequences which contain ENCODE ChIP-seq peaks were downloaded from the DeepSea dataset (Zhou and Troyanskaya, 2015). The human reference genome (GRCh37/hg19) was segmented into non-overlapping 200 nt bins. A vector of binary labels for ChIP-seq peaks and DNase-seq peaks was created for each bin, with a 1 if more than half of the 200 nt bin overlaps with a peak region, and 0 otherwise. Adjacent 200 nt bins were then merged to 1,000 nt lengths and their corresponding labels were also merged. Chromosomes 8 and 9 were excluded from training to test chromatin feature prediction performance, and the rest of the autosomes were used for training and validation. We truncated the DeepSea dataset to include only the sequences which contain labels for 12 transcription factors: Arid3a, CEBPB, FOSL1, Gabpa, MAFK, MAX, MEF2A, NFYB, SP1, SRF, STAT1, and YY1 (see Supplementary Table S1 for ENCODE filenames and class indices from the original DeepSea dataset). 270,382 (92%) sequences comprise the training set and 23,768 (8%) sequences comprise the test set. Each 1,000 nt DNA sequence is one-hot encoded into a 4x1000 binary matrix, where rows correspond to A, C, G and T.
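The one-hot encoding into a 4 x L binary matrix described above might look like this (a minimal sketch; treating ambiguous bases such as N as all-zero columns is an assumption, not stated in the paper):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (rows: A, C, G, T)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((4, len(seq)), dtype=np.uint8)
    for i, base in enumerate(seq.upper()):
        if base in mapping:   # leave unknown bases (e.g. N) all-zero
            x[mapping[base], i] = 1
    return x

x = one_hot("ACGTN")
```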

CNN Models
All CNN models take as input a 1-dimensional one-hot-encoded sequence with 4 channels (one for each nucleotide: A, C, G, T), then process the sequence with two convolutional layers, a fully-connected hidden layer, and a fully-connected output layer with 12 output neurons that have sigmoid activations for binary predictions. Each convolutional layer consists of a 1D cross-correlation operation, which calculates a sliding dot product between the convolution filters and the inputs to the layer, followed by batch normalization (Ioffe and Szegedy, 2015), which independently scales the features learned by each convolution filter, and a non-linear activation with a rectified linear unit (ReLU), which replaces negative values with zero.
The first convolutional layer employs 30 filters, each with a size of 19 and a stride of 1. The second convolutional layer employs 128 filters, each with a size of 5 and a stride of 1. All convolutional layers incorporate zero-padding to achieve the same output length as the inputs. Each convolutional layer is followed by max-pooling with a window size and stride that are equal, unless otherwise stated. The product of the two max-pooling window sizes is equal to 100. Thus, if the first max-pooling layer has a window size of 2, then the second max-pooling window size is 50. This constraint ensures that the number of inputs to the fully-connected hidden layer is the same across all models. The fully-connected hidden layer employs 512 units with ReLU activations.
Dropout (Srivastava et al., 2014), a common regularization technique for neural networks, is applied during training after each convolutional layer, with the dropout probability set to 0.1 for convolutional layers and 0.5 for fully-connected hidden layers. During training, we also employed L2-regularization with a strength equal to 1e-6. The parameters of each model were initialized according to (He et al., 2015), more commonly known as He initialization.
All models were trained with mini-batch stochastic gradient descent (mini-batch size of 100 sequences) for 100 epochs, updating the parameters after each mini-batch with Adam updates (Kingma and Ba, 2014), using the recommended default parameters with a constant learning rate of 0.0003. Training was performed on an NVIDIA GTX Titan X Pascal graphics processing unit with acceleration provided by cuDNN libraries (Chetlur et al., 2014). All reported performance metrics and saliency logos are drawn strictly from the test set using the model parameters which yielded the lowest binary cross-entropy loss on the validation set, a technique known as early stopping.
Visualizing saliency analysis and first layer filters

Saliency analysis is performed by calculating the gradients of a neuron-of-interest with respect to the input one-hot representation. We use a variant of saliency analysis, called guided-backpropagation (Springenberg et al., 2014), which rectifies negative gradients through each ReLU activation. To generate a saliency logo, we calculated the saliency map using guided-backpropagation from the logits of a given class to the inputs. We then normalized the saliency map by dividing by the maximum absolute value across the saliency map and applied an exponential filter according to: Ŝ = exp(λ · S / max|S|), where Ŝ is the filtered saliency map, S is the saliency map generated by guided-backprop, and λ is a scaling factor that we set to 3 for all saliency logos in this paper. We then separately normalized each position by dividing by the sum of the filtered saliency map across nucleotides, thereby providing a probability for each nucleotide at each position. To generate a sequence logo, each nucleotide a at each position i is scaled according to: Ŝ_{a,i} × H_i, where H_i = 2 + Σ_a Ŝ_{a,i} log2 Ŝ_{a,i}. First layer convolution filters were normalized and visualized following the same procedure, with the exception that the filter was used instead of the guided-backprop saliency map.
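The exponential filtering, per-position normalization, and information-content height described above can be sketched in numpy (a minimal illustration for a 4 x L saliency map; the random input stands in for a real guided-backprop map):

```python
import numpy as np

def saliency_logo(S, lam=3.0):
    """Convert a (4 x L) saliency map into per-position nucleotide
    probabilities P and height-scaled logo values, following the
    normalization described in the text."""
    S_exp = np.exp(lam * S / np.max(np.abs(S)))    # exponential filter
    P = S_exp / S_exp.sum(axis=0, keepdims=True)   # per-position probabilities
    H = 2.0 + (P * np.log2(P)).sum(axis=0)         # height in bits, in [0, 2]
    return P * H, P, H

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 10))       # stand-in guided-backprop saliency map
logo, P, H = saliency_logo(S)
```

Each column of P sums to one, and H shrinks toward 0 where the four nucleotides are equally salient, so uninformative positions contribute little to the logo.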

Figure 1: Toy model for representation learning of sequence motifs. (A,B) An example 60 nt one-hot encoded sequence contains 3 patterns (shown in green): CACGTG, GTGCAC, and CACNNNGTG. Two filters, each of length 7 (7 columns and 4 rows, one for each nucleotide), are shown to the right. A partial-motif representation has been captured by each filter: GTG for filter 1 (top) and CAC for filter 2 (bottom). The max-pooled feature maps are shown above the sequence. The feature maps have the same size as the sequence by adding 3 zero-padding units to each end of the sequence prior to convolution (not shown in diagram). (A) Shows the feature maps when employing a small max-pooling size of 3, which creates overlapping receptive fields, highlighted in blue. 3 second layer convolutional filters, shown above, demonstrate a feature map pattern that can resolve each embedded sequence pattern. (B) Shows the feature maps when employing a larger pooling size of 20 using the same filters as (A). The larger receptive fields have a large spatial uncertainty along with a small overlap in receptive fields from neighboring neurons. Each of the 3 second layer convolutional filters, shown above, is unable to find a unique feature map pattern that can resolve any embedded sequence pattern.

Figure 2: Comparison of first layer filters for CNNs with different max-pool sizes. Sequence logos for normalized first convolutional layer filters are shown for CNN-2 (left), CNN-4 (middle), and CNN-25 (right). The sequence logo of the ground truth motifs and their reverse complements for each transcription factor is shown at the bottom. The y-axis label on select filters represents a Tomtom match to a ground truth motif.

Figure 3: Representations learned by first layer filters for alternative CNN architectures. Sequence logos for normalized first convolutional layer filters are shown for (A) CNN9-4, (B) CNN9-25, and (C) CNN-50-2. (D) shows the ground truth motifs and their reverse complements for each transcription factor. The y-axis label on select filters represents a Tomtom match to a ground truth motif.

Figure 5: Comparison of the first layer filters for CNN models trained on in vivo sequences. Sequence logos for normalized first convolutional layer filters are shown for CNN-2 (left), CNN-4 (middle), and CNN-25 (right). The sequence logos of reference motifs and their reverse complements for each transcription factor from the JASPAR database are shown at the bottom. The y-axis label on select filters represents a Tomtom match to a reference motif.

Table 2: Performance of deep learning models on the in vivo dataset. The table shows the percentage of matches between the first convolutional layer filters and the JASPAR database (JASPAR) and the percentage of filters that match known motifs for the 12 transcription factors. Motif matches were determined by the Tomtom motif comparison search tool using an E-value cutoff of 0.1.