
Large-scale multi-omic biosequence transformers for modeling protein–nucleic acid interactions

  • Sully F. Chen ,

    Contributed equally to this work with: Sully F. Chen, Robert J. Steele

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

    sully.chen@duke.edu (SFC); eric.oermann@nyulangone.org (EKO)

    Affiliation Duke University School of Medicine, Department of Neurosurgery, Durham, North Carolina, United States of America

  • Robert J. Steele ,

    Contributed equally to this work with: Sully F. Chen, Robert J. Steele

    Roles Data curation, Investigation, Methodology, Software, Validation

    Affiliation NYU Langone Health, Department of Neurological Surgery, New York, New York, United States of America

  • Glen M. Hocky,

    Roles Formal analysis, Methodology, Software, Visualization

    Affiliation Department of Chemistry and Simons Center for Computational Physical Chemistry, New York University, New York, New York, United States of America

  • Beakal Lemeneh,

    Roles Data curation

    Affiliation NYU Langone Health, Department of Neurological Surgery, New York, New York, United States of America

  • Shivanand P. Lad,

    Roles Project administration, Supervision, Validation

    Affiliation Duke University School of Medicine, Department of Neurosurgery, Durham, North Carolina, United States of America

  • Eric K. Oermann

    Roles Conceptualization, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

    sully.chen@duke.edu (SFC); eric.oermann@nyulangone.org (EKO)

    Affiliations NYU Langone Health, Department of Neurological Surgery, New York, New York, United States of America, NYU Langone Health, Department of Radiology, New York, New York, United States of America, NYU Center for Data Science, New York University, New York, New York, United States of America

Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data—either proteins or nucleic acids—and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pretraining limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein–nucleic acid binding interaction. Compared to single-omic controls trained with identical compute, OmniBioTE also demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks. Together, these results highlight the power of a unified modeling approach for biological sequences and establish OmniBioTE as a foundation model for multi-omic discovery.

Introduction

It has long been a fundamental goal of bioinformatics to derive functional and structural insights directly from primary biomolecular sequences. High-throughput sequencing technologies now enable routine acquisition of vast quantities of nucleic acid and protein data, yet translating these linear sequences into mechanistic understanding remains challenging. Recent breakthroughs in natural language processing (NLP), particularly the transformer architecture [1], have demonstrated exceptional capacity to model complex sequential dependencies in text. Despite these advances, cellular biology is inherently multi-omic, with proteins and nucleic acids engaging in dynamic and reciprocal interactions underpinning gene regulation, replication, and repair. Single-omic transformers, by design, lack cross-modal dependencies in their fundamental representations, limiting their ability to model tasks such as transcription factor binding, RNA-mediated translational control, and chromatin remodeling.

Here, we introduce the OmniBioTE series of models and the first exploration of scaling laws in multi-omic transformers. Additionally, we contribute the largest open-source multi-omic transformer, pretrained on 250 billion tokens drawn from GenBank nucleic-acid entries and UniRef100 protein sequences (Fig 1). We explore four model sizes (88M–2.3B parameters) and compare performance against matched single-omic controls trained with identical compute but on only nucleic acid data (NucBioTE) or only protein data (ProtBioTE). Notably, because total token budgets were fixed, each single-omic control is exposed to more unique single-omic data than the multi-omic model. We train four additional models that operate at the per-residue/nucleotide level (as opposed to tokenized chunks) to investigate the effects of tokenization on task-specific performance. For the multi-omic models, sequences of different modalities were never concatenated within the same context window during pre-training, so no cross-attention occurred across protein and nucleic-acid tokens at pre-training time. We evaluate on tasks spanning: (1) predicting binding free energies (ΔG) for protein–nucleic acid complexes on ProNAB [2], (2) emergent contact prediction via attention-based probing, (3) nucleic acid specificity assessment on JASPAR [3], and (4) state-of-the-art performance on standard single-omic benchmarks (GUE [4], TAPE [5]). Our results demonstrate that multi-omic pretraining yields embeddings that inherently align gene and protein modalities, outperform single-omic models in both multi-omic and single-omic tasks, and exhibit emergent structural knowledge without explicit supervision. OmniBioTE sets a new paradigm for foundation modeling in biology by unifying sequence modalities within a single transformer framework.

Fig 1. Multi-omic pretraining and task-specific fine-tuning.

(A) First, we gather large-scale datasets consisting of proteomic data, nucleic acid modalities as DNA, many types of RNA, synthetic constructs, and more. (B) Next, we employ large-scale pretraining over these sequences via an encoder transformer and the masked language-modeling objective. (C) Finally, we fine-tune this foundation model with a task-specific head to tackle a wide variety of tasks. Created in BioRender. Chen, S. (2025) https://BioRender.com/ydhbam8.

https://doi.org/10.1371/journal.pone.0341501.g001

Our main contributions are as follows. First, we introduce OmniBioTE, a family of open-source multi-omic encoder transformers (88M–2.3B parameters; BPE and per-residue variants) jointly pretrained on 250 billion nucleic acid and protein tokens from GenBank and UniRef100, and release all models and code under a permissive open-source license. Second, we show that OmniBioTE learns modality-invariant gene–protein representations. Third, we develop multi-omic protein–nucleic acid interaction benchmarks—including a rigorously homology-filtered regression task on ProNAB, JASPAR-based mutation scans, and PDB-derived contact prediction—and demonstrate that OmniBioTE outperforms single-omic baselines, specialized models (DeePNAP), and an AlphaFold3-plus-molecular-dynamics pipeline. Fourth, we demonstrate that the attention maps of our multi-omic models trained to predict binding energy emergently encode latent structural information. Finally, we perform a comprehensive scaling study on GUE, TAPE, and ProteinGLUE, showing that multi-omic pretraining improves performance per FLOP and establishes a new compute Pareto frontier on many single-omic tasks, despite reduced per-modality data.

Related work

The majority of research applying transformers to biosequences has focused on single omics, typically nucleic acid distributions (genomics, transcriptomics, epigenetics, etc.) or proteomics. These efforts have yielded astonishing successes in several tasks, most notably the prediction of the 3D structure of proteins from their primary sequences [6–14]. Other work has focused on developing models that produce useful representations of single-omic biosequences for various downstream tasks. There exist numerous protein foundation models [15–27], and we find the greatest variety of model architectures in this class. Notably, there are many generative models [28–30], encoder–decoder models [22,23], and even a diffusion model [28].

Several genomics foundation models have been trained as well, primarily on human genomics data [31–34]. Other genomic foundation models have been trained on human and murine data [35], multi-species genomes [4], prokaryotic genomes [36], and even metagenomic scaffolds [37]. Notably, very few models integrate broad, multi-species training data, with the exception of DNABERT-2 [4], though its dataset notably lacks genomes from the domain Archaea and consists of only 32 billion nucleotides. The largest DNA foundation model trained to date has 40 billion parameters [38]; it was trained on multi-species genomes and proved successful at multiple downstream tasks. Genomic models augmented with epigenetic data have also demonstrated great success in downstream tasks such as predicting epigenetic markers [39–42], detecting splice sites and promoter regions [34], modeling the histone code [43], and modeling the phosphorylation of protein kinases [44].

Other foundation models target transcriptomics, primarily single-cell RNA (scRNA) [45–49]. Foundation models for mRNA [50] and general RNA [51] have also been trained. Transcriptomic foundation models have successfully predicted transcriptome-to-proteome translations [52], gene rankings [53], cell type annotation [54], and drug response [49,54].

Only three existing models incorporate both nucleic acid and protein information in a unified framework: AlphaFold3 [8], a closed-source proprietary model; RoseTTAFoldNA [10]; and LucaOne [55]. The former two models focus primarily on structure prediction rather than general learning from multi-omic sequences, while the latter draws its nucleic acid data primarily from RefSeq [56]. RefSeq provides a sparse, curated subset: a single representative genome per organism, a reduced set of mature transcript and protein models, and virtually none of the underlying high-throughput data such as partial transcripts, genomic survey reads, metagenomic contigs, rare isoforms, immune V(D)J recombination products, or engineered sequences [57]. As a result, large classes of biologically meaningful variation and sequence diversity present in GenBank are absent from RefSeq, potentially making it challenging for the model to learn robust representations of these classes (e.g., immunoglobulins or T-cell receptors). Furthermore, the largest open-source multi-omic model to date is LucaOne, with 1.8 billion parameters; in this work, we train a 2.3 billion parameter model, nearly 28% larger. None of these models are open-source multi-omic sequence encoders trained at the scale and breadth of OmniBioTE, nor do they systematically study multi-omic scaling behavior across both single-omic and explicitly multi-omic benchmarks. A summary of the sizes of the models evaluated in this work can be found in Table 1.

Table 1. Parameter counts for OmniBioTE and the external models evaluated in this work.

https://doi.org/10.1371/journal.pone.0341501.t001

Methods

Broadly, we train dense, non-causal encoder transformer models of varying sizes using the masked-language-modeling (MLM) objective [60] on 250 billion tokens of nucleic acid and protein sequences of varying types. We additionally train control models on only nucleic acid or only protein sequences with equal compute budgets to evaluate the effect of training on additional sequence types. We demonstrate that our multi-omic models (MOMs) emergently learn joint representations between nucleic acid and protein sequences by showing that there exist meaningful features roughly invariant to sequence modality, and that such features do not exist in single-omic models.

We evaluate our suite of models by fine-tuning on several single-omics datasets that assess performance on various downstream tasks relevant to molecular biology, structural biology, and biochemistry. Additionally, we design two novel multi-omic tasks that require inference on both protein and nucleotide sequences simultaneously. Lastly, we show via simple convolutional probes that the models’ attention maps encode structural information that is learned without any a priori structural training.

Training data

We source our nucleic acid data from GenBank [61], a collection compiled by the National Center for Biotechnology Information. We preprocessed the entire GenBank archive by first removing all metadata from each sequence, with the exception of sequence type (DNA, mRNA, tRNA, etc.). This produced 242,855,368 sequences with a total of 312,190,748,151 nucleotides, primarily composed of general DNA, general RNA, mRNA, cRNA, and single-stranded RNA. A full breakdown of nucleic acid sequence data can be found in S1 Table. We source our protein data from UniRef100 [62], a dataset maintained by UniProt. Similarly to the nucleic acid data, we remove all metadata from each sequence, yielding 369,597,671 sequences with a total of 1,739,747,047 residues.

We take a subset of 10⁵ nucleotides and protein residues in total to train a byte-pair encoding tokenizer [63] using the SentencePiece library [64], with a vocabulary size of 2¹¹ each for protein sequences and nucleic acid sequences (2¹² unique tokens total, since the vocabularies are disjoint). Our tokenizer and vocabulary size were chosen based on previous work [4]. Additionally, we train a multi-omic per-residue/nucleotide model at each size to investigate the effects of tokenization on downstream performance, where each token is simply a single nucleotide or residue. In each case, we use a separate tokenizer for protein sequences and nucleic acid sequences. For example, the sequence “ACGT” is both a valid nucleic acid and a valid peptide, and its tokenized representation will differ depending on the modality.
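As a toy illustration of the disjoint-vocabulary design (not the actual SentencePiece tokenizers; the single-character alphabets and ID assignment below are hypothetical), the same string tokenizes to entirely different, non-overlapping IDs depending on the declared modality:

```python
# Toy modality-aware tokenizer: nucleotide and amino acid vocabularies
# occupy disjoint ID ranges, so "ACGT" gets different IDs per modality.
NUC_ALPHABET = "ACGTU"
PROT_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

NUC_IDS = {ch: i for i, ch in enumerate(NUC_ALPHABET)}
PROT_IDS = {ch: len(NUC_ALPHABET) + i for i, ch in enumerate(PROT_ALPHABET)}

def tokenize(seq: str, modality: str) -> list[int]:
    table = NUC_IDS if modality == "nucleic" else PROT_IDS
    return [table[ch] for ch in seq]

print(tokenize("ACGT", "nucleic"))  # [0, 1, 2, 3]
print(tokenize("ACGT", "protein"))  # different IDs, disjoint from the above
```

The same principle holds for the BPE tokenizers, where the two learned subword vocabularies simply never share token IDs.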

Architecture and training

OmniBioTE is based on the LLaMA-2 architecture [65] with minimal modifications: we substitute learned positional embeddings [1] with rotary positional embeddings (RoPE) [66] and replace the causal self-attention mechanism [1,67] with a full, non-causal attention operation [60]. We additionally scale the pre-SoftMax attention scores by 1/d rather than 1/√d, in accordance with maximal update parameterization (μP) [68]. We use an aspect ratio (the ratio of model width to depth) of 128. We modify Karpathy’s NanoGPT [69] for a lightweight and simple model implementation. For a detailed description of the architecture, see S1 File. We train four OmniBioTE variants: OmniBioTE-small (88 million non-embedding parameters), OmniBioTE-medium (675 million), OmniBioTE-large (1.3 billion), and OmniBioTE-XL (2.3 billion). Additionally, we train controls for each model size on only nucleic acid data or only protein data (henceforth referred to as “NucBioTE-[size]” and “ProtBioTE-[size]”). For experiments requiring fine-grained, single-nucleotide/residue inference, we also train an OmniBioTE model of each size that uses a single-character tokenizer rather than byte-pair encoding (BPE). In total, we train 16 models: OmniBioTE-small/medium/large/XL, OmniBioTE (single-char)-small/medium/large/XL, ProtBioTE-small/medium/large/XL, and NucBioTE-small/medium/large/XL. Notably, the single-omic models and the multi-omic models have the same token budget but different data mixtures; thus, each single-omic model is trained on more unique data for its respective modality than the multi-omic models are.
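A minimal sketch of the modified attention scaling, with NumPy standing in for the actual model code and a single attention head assumed; the only difference from standard attention is scoring by 1/d (μP) instead of 1/√d:

```python
import numpy as np

def attention(q, k, v, use_mup=True):
    """Full (non-causal) single-head self-attention.
    Under muP, pre-softmax scores are scaled by 1/d_head
    rather than the standard 1/sqrt(d_head)."""
    d = q.shape[-1]
    scale = 1.0 / d if use_mup else 1.0 / np.sqrt(d)
    scores = (q @ k.T) * scale                    # (T, T); no causal mask
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (T, d)

rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = rng.standard_normal((3, T, d))
out = attention(q, k, v)
```

Because every position attends to every other, the same code serves the MLM objective without any masking of future tokens.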

We train each model for 250 billion tokens with a context length of 1024 tokens for the BPE-tokenized models and a context length of 2048 characters for the single-character models (to accommodate the decreased amount of data per token). We train at a batch size of 786,432, 1,032,192, or 1,048,576 tokens (chosen based on available compute and memory and to maximize throughput) with the masked language modeling objective [60]. We use AdamW [70] with weight decay 10⁻², employing μP for stable hyperparameter transfer. For the parameters with fixed learning rate under μP (the embedding and unembedding parameters), we set the learning rate to 0.05, and we scale the learning rates of the remaining parameters inversely with model width, per μP. These hyperparameters were determined empirically with sweeps at the 10⁶-parameter scale. Finally, all learning rates are decayed with PyTorch’s OneCycleLR [71], with a warmup period of 1 billion tokens and a starting and ending learning rate scale of 10⁻⁵.
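The learning-rate assignment above can be sketched as follows; the fixed embedding rate of 0.05 and the inverse-width scaling come from the text, while the base width and base hidden learning rate below are hypothetical stand-ins (the exact constants are not given here):

```python
# muP-style learning-rate assignment: embedding/unembedding parameters get a
# fixed LR, while hidden-weight LRs shrink proportionally to 1/width.
BASE_WIDTH = 256       # hypothetical proxy width used in small-scale sweeps
EMBED_LR = 0.05        # fixed under muP (from the text)
BASE_HIDDEN_LR = 6e-4  # hypothetical base value for illustration only

def hidden_lr(width: int) -> float:
    """Scale the hidden-parameter learning rate inversely with model width."""
    return BASE_HIDDEN_LR * BASE_WIDTH / width

for width in (1024, 2048, 4096):
    print(width, EMBED_LR, hidden_lr(width))
```

Doubling the width halves the hidden learning rate while the embedding rate stays fixed, which is what allows hyperparameters tuned at small scale to transfer to the larger models.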

Evaluations

We design our own multi-omic benchmark to assess the models’ ability to accurately characterize protein-nucleic acid interactions, along with several novel benchmarks to assess performance and interpretability on protein-nucleic acid tasks. In addition, we evaluate on several popular single-omic benchmarks covering a variety of nucleic acid and protein-based tasks to assess the baseline single-omic capabilities of our models before multi-omic task-specific fine-tuning. All fine-tuning optimization is performed via AdamW [70] with hyperparameters identical to those described in the pretraining step unless otherwise specified.

Protein-nucleic acid binding evaluation.

To showcase the native multimodality of our generalist model, we designed a novel evaluation task using the ProNAB dataset [2]. ProNAB consists of 20,090 samples comprising 14,606 protein-DNA complexes, 5,323 protein-RNA complexes, and 161 protein-DNA-RNA complexes. These samples involve 798 unique DNA-binding proteins and 340 unique RNA-binding proteins. We refer to the original work for a detailed description of the dataset composition [2]. The objective of our task is as follows: given the primary sequence of a nucleic acid-binding protein and a nucleic acid sequence, predict the ΔG of the binding interaction. This task is of particular interest for predicting unknown DNA/RNA-binding protein interactions with the human genome.

We assemble our dataset by first filtering the ProNAB dataset, rejecting any nucleic acid or protein sequences with non-standard residues (we use only the standard 20 amino acids and the 5 standard nucleotide bases), leaving 850 unique proteins and 15,994 protein-nucleic acid complexes. We then split the data into 10 cross-validation sets. Ultimately, we end up with 752 unique proteins and 12,282 total protein-nucleic acid interactions.

The ProNAB dataset often has multiple nucleic acid sequences per protein; thus, the number of unique proteins is vastly outweighed by the number of unique nucleic acids. To avoid data leakage between the train and test sets, we group samples by protein sequence and assign whole protein groups to folds at random, so that no protein appears in more than one fold. Furthermore, we conduct sequence similarity analysis on the protein sequences in the train and test sets via sequence alignment with the BLOSUM62 substitution matrix [72] to ensure minimal train/test leakage. Identical protein sequences may receive different normalized alignment scores due to length normalization and BLOSUM62 scoring; in our train/test sets, over 99.4% of pairwise comparisons had an alignment score below 0.0, and 99.9% had a score below 1.0, suggesting that our results are not purely a result of sequence homology. As an extra precaution, we keep any protein with a sequence similarity score over 1.5 against any other protein sequence in the dataset strictly in the train set of all cross-validation splits, guaranteeing no significant sequence homology in any cross-validation fold. As a result, 13 unique proteins and 232 protein-nucleic acid interactions were always kept in the train set.
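The protein-grouped fold construction can be sketched as below; the sample tuples and fold count are illustrative, and the homology-based pinning of similar proteins to the train set is omitted for brevity:

```python
import random

def protein_grouped_folds(samples, n_folds=10, seed=0):
    """Split (protein, nucleic_acid, dG) samples into folds such that all
    samples sharing a protein sequence land in the same fold, so no protein
    appears in both train and test of any cross-validation split."""
    proteins = sorted({p for p, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    fold_of = {p: i % n_folds for i, p in enumerate(proteins)}
    folds = [[] for _ in range(n_folds)]
    for s in samples:
        folds[fold_of[s[0]]].append(s)
    return folds

# Tiny usage example with hypothetical sequences and dG values.
samples = [("MKV", "ACGT", -8.1), ("MKV", "AAAA", -6.2), ("GHL", "CCGG", -7.5)]
folds = protein_grouped_folds(samples, n_folds=2)
```

Grouping by protein before splitting is what prevents the many-nucleic-acids-per-protein structure of ProNAB from leaking a protein across the train/test boundary.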

To compute a ΔG value, we first concatenate a primary protein sequence and nucleic acid sequence pair and run a forward pass through OmniBioTE. We then take the embedding produced by the first token and apply a linear projection that produces a single value. If a complex is composed of a protein and a double-stranded DNA or RNA molecule, we append the second nucleic acid sequence as well. We fine-tune our model to predict ΔG from the protein-nucleic acid pairs in the train set, with mean-squared error (MSE) as our loss target. As a single-omic control, we compute the embeddings of the protein and nucleic acid sequences separately with the corresponding ProtBioTE and NucBioTE models, then concatenate these embeddings and use a linear projection head to produce the ΔG value.
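A minimal sketch of the ΔG head, with NumPy arrays standing in for OmniBioTE's per-token hidden states and a hypothetical hidden size; the prediction is simply a linear projection of the first token's embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical hidden size for illustration

def predict_dG(hidden_states, w, b):
    """Given per-token hidden states (T, D) from the encoder over the
    concatenated protein + nucleic acid input, take the first token's
    embedding and linearly project it to a single scalar dG estimate."""
    first = hidden_states[0]  # (D,) embedding of the first token
    return float(first @ w + b)

hidden = rng.standard_normal((32, D))  # stand-in for encoder outputs
w, b = rng.standard_normal(D), 0.0     # projection head parameters
dG_hat = predict_dG(hidden, w, b)
```

During fine-tuning, `w` and `b` would be trained jointly with the encoder against the MSE loss described above.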

Our primary evaluation metrics are the Pearson correlation coefficient between the predicted and ground-truth measured ΔG values, as well as the mean absolute error of the predictions. We begin with a pretrained OmniBioTE model, then train for 64 epochs with a batch size of 256 on the ΔG prediction task. The projection head learning rate is initialized to 10⁻², the embedding learning rate to 10⁻³, and the non-embedding parameter learning rates are scaled with model width per μP, as in pretraining. All learning rates are decayed with PyTorch’s OneCycleLR, an implementation of the learning rate schedule first described in [71].

As a baseline, we train a recent deep-learning-based architecture, DeePNAP [59], on the identical cross-validation dataset as our model. We train the DeePNAP architecture for 64 epochs with a batch size of 256, using AdamW with weight decay 10⁻², starting at a learning rate of 10⁻³ and decaying linearly to 0.0. Additionally, we fine-tune a recently released genome-protein model, LucaOne [55], in a similar manner: we set the embedding learning rate to 10⁻⁴, the projection head learning rate to 10⁻², and the non-embedding parameter learning rates to a correspondingly scaled value. We train LucaOne with identical AdamW hyperparameters, batch size, and epochs.

Lastly, we compare against a baseline more representative of current computational methods: we predict the structure of the protein-nucleic acid complex with AlphaFold3 [8] and then use molecular dynamics simulations to estimate the ΔG of the binding interaction.

Nucleic acid binding specificity.

To further validate the robustness of the OmniBioTE models fine-tuned to predict binding affinity, we evaluate whether the models can correctly predict the specificity of various DNA-binding proteins (DBPs) for their consensus sequences. First, we gather a set of 2,145 DBPs and their position-frequency matrices (PFMs) from JASPAR [3]. Using the same sequence similarity rejection technique described in the ProNAB experiment, we filter out all DBPs from the JASPAR dataset that have any significant overlap with the ProNAB dataset used in the cross-validation evaluation. We then use our fine-tuned OmniBioTE model to compute the ΔG for each DBP-nucleic-acid pair, where the consensus sequence is defined by the most frequent nucleotide at each position of the PFM. Next, we mutate each consensus sequence by randomly substituting each nucleotide with probability 5%. This produces a mutated nucleic acid sequence that would have a reduced binding affinity to the DBP, as empirically known from the PFM, but would still be “in distribution” among plausible binding nucleic acids with high sequence similarity. We generate 8 unique mutated nucleic acid sequences per DBP. We predict the ΔG for these mutated interactions and compute the difference relative to the predicted ΔG of the consensus sequence. If the fine-tuned model has learned to model the specificity of the binding interaction correctly, we should expect the predicted ΔG to increase after the consensus sequence is mutated.
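The consensus extraction and 5% mutation scan can be sketched as follows; the PFM dictionary format and the example frequencies are hypothetical (the real pipeline draws PFMs from JASPAR):

```python
import random

def consensus(pfm):
    """Consensus sequence: most frequent nucleotide at each PFM position.
    pfm maps each base to a list of per-position frequencies (toy format)."""
    bases = "ACGT"
    length = len(pfm["A"])
    return "".join(max(bases, key=lambda b: pfm[b][i]) for i in range(length))

def mutate(seq, p=0.05, seed=0):
    """Independently substitute each nucleotide with probability p,
    always choosing a different base when a substitution occurs."""
    rng = random.Random(seed)
    out = []
    for ch in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in "ACGT" if b != ch]))
        else:
            out.append(ch)
    return "".join(out)

# Toy two-position PFM: position 0 favors A, position 1 favors C.
pfm = {"A": [0.7, 0.1], "C": [0.1, 0.6], "G": [0.1, 0.2], "T": [0.1, 0.1]}
cons = consensus(pfm)          # "AC"
mut = mutate(cons, p=0.05, seed=0)
```

Repeating `mutate` with different seeds yields the 8 mutated variants per DBP whose predicted ΔG is compared against that of the consensus.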

Protein-nucleotide contact prediction.

We gather all structures from the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) [73] that contain strictly one protein chain and either one or two nucleic acid chains. For each residue in the protein-nucleic acid complex, we compute the distance to the nearest nucleotide and label a residue as “contacting a nucleotide” if it is within a given distance threshold of a nucleotide. We test distance thresholds of 4 Å, 6 Å, and 8 Å. Next, we group the data by primary protein sequence and create 10 cross-validation splits by protein grouping to avoid data leakage. To fine-tune OmniBioTE, we concatenate the protein and nucleic acid sequences together and compute a forward pass through the model as usual. Instead of unembedding the hidden states of the final layer, we compute a linear projection to a single scalar per residue, to which a sigmoid function is applied to yield a contact prediction. Although the nucleic acid sequence is included in the forward pass, contact prediction is only computed for the protein residues. We train the model against a binary cross-entropy loss for 32 epochs on each fold with a batch size of 256, with a training setup otherwise identical to the protein-nucleic acid binding runs. We additionally run the same training procedure on LucaOne, with the embedding learning rate set to 10⁻⁴, the projection head learning rate set to 10⁻², the non-embedding parameter learning rates scaled analogously, and identical AdamW hyperparameters.
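A sketch of the contact-labeling step, under the simplifying assumption of one representative 3D coordinate per residue and per nucleotide (the actual labeling operates on full atomic structures):

```python
import numpy as np

def contact_labels(prot_coords, nuc_coords, threshold=8.0):
    """Label each protein residue 1 if its representative point lies within
    `threshold` angstroms of any nucleotide point, else 0.
    prot_coords: (N, 3) array; nuc_coords: (M, 3) array."""
    diff = prot_coords[:, None, :] - nuc_coords[None, :, :]  # (N, M, 3)
    dists = np.linalg.norm(diff, axis=-1)                    # (N, M)
    return (dists.min(axis=1) <= threshold).astype(int)

# Toy complex: residue 0 sits 5 A from the nucleotide, residue 1 sits 15 A.
prot = np.array([[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
nuc = np.array([[5.0, 0.0, 0.0]])
labels = contact_labels(prot, nuc, threshold=8.0)  # residue 0 contacts, 1 does not
```

Sweeping `threshold` over 4, 6, and 8 Å reproduces the three labeling regimes evaluated in the task.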

Genome understanding evaluation.

To evaluate OmniBioTE’s generalizability to a variety of domain-specific nucleic acid tasks, we employ the Genome Understanding Evaluation (GUE) suite [4]. GUE consists of several genetic and epigenetic classification tasks over human, mouse, yeast, and Coronaviridae genomes, spanning core promoter detection, transcription factor prediction, promoter detection, splice site detection, epigenetic mark prediction, and COVID variant classification. The promoter detection task is a binary classification task, where the goal is to determine whether a sequence of DNA is or is not a promoter. The promoter task is divided into several subcategories: proximal promoter detection, core promoter detection, and TATA/non-TATA motif promoter detection. The proximal promoter task includes the entire promoter sequence (including the core promoter), while the core promoter task only includes the sequence in close proximity to the transcription start site. The TATA class is composed of promoters that contain a TATA motif, while the non-TATA class does not. Transcription factor detection is another binary classification task, where the goal is to determine whether a DNA sequence is the binding site of a transcription factor; this task is divided into human and murine datasets. Splice site detection is a classification task where the goal is to determine whether a DNA sequence contains a splice donor or acceptor site. The epigenetic tasks’ goal is to determine whether a nucleic acid sequence taken from a yeast genome is likely to contain a given epigenetic modification. Lastly, the COVID variant task is a multi-class classification task where the goal is to predict which variant (Alpha, Beta, Delta, Eta, Gamma, Iota, Kappa, Lambda, or Zeta) a 1000 base pair snippet was sequenced from. We refer to the original work for a full characterization of the evaluation set. All tasks use the Matthews correlation coefficient as the primary metric, with the exception of the COVID variant classification task, which uses F1-score.

For each classification task, we fine-tune a base OmniBioTE or NucBioTE model. A class prediction is generated by taking the first token’s final embedding and applying a linear projection down to the number of classes in place of the original final projection head, followed by a SoftMax operation. We set the embedding parameter learning rate to 10⁻³ and the projection head learning rate to 10⁻² for all model sizes, with the transformer weight matrix learning rates scaled with model width per μP. Hyperparameters were determined with sweeps over the validation sets. All learning rates are decayed with PyTorch’s OneCycleLR. The small and medium models are trained for 15,000 steps with a batch size of 32 over the training data, while the large and XL models are trained for 30,000 steps with a batch size of 32. We find that final validation performance is relatively robust to the number of epochs over each dataset; these training parameters were therefore chosen to yield a reasonable training time. The model that performs best on the validation set is evaluated on the test set. We additionally fine-tune LucaOne as a further multi-omic baseline, with exactly the same optimizer hyperparameters described for LucaOne in the protein-nucleic acid binding evaluation above, training with batch size 32 for 30,000 iterations on each task.

Tasks assessing protein embeddings.

We employ the Tasks Assessing Protein Embeddings (TAPE) suite [5] to evaluate OmniBioTE’s ability to generalize to unseen protein-based tasks. TAPE consists of five challenges: secondary structure prediction, residue contact prediction, remote homology detection, fluorescence prediction, and stability prediction. Secondary structure prediction is a per-residue classification challenge, where the goal is to determine what type of secondary structure each residue composes; the secondary structures are split into either 3 or 8 classes, depending on the task. Residue contact prediction involves generating an N × N mask, where N is the length of the protein, with each element of the mask predicting the probability that a residue pair is within 8 Å of each other. Remote homology detection involves mapping a primary protein sequence to one of 1,195 homology classes, with the aim of classifying primary sequences into meaningful structural families. Fluorescence prediction is a regression task, where the goal is to predict the log fluorescence intensity of a protein from a given primary structure. Finally, stability prediction is a regression task that aims to predict the maximum concentration at which a protein is still structurally stable. All classification tasks are measured in accuracy, while all regression tasks are measured via Spearman’s correlation coefficient. We train each task (excluding the contact evaluation, which is discussed below) for 64 epochs over the dataset with a batch size of 32, with initial learning rate parameters and schedule identical to the GUE tasks [4], except that we initialize the embedding learning rate to 10⁻⁴ and the projection head learning rate to 10⁻² for all model sizes, with the non-embedding model parameter learning rates scaled with model width.

The residue contact evaluation task involves predicting an N × N matrix of values between 0 and 1, with each element (i, j) representing the probability that residue i in the primary sequence is within 8 Å of residue j. To generate this prediction matrix, embeddings are generated from a transformer model [1], and a learned linear projection head transforms each embedding into a 128-dimensional vector. Inspired by previous work [74], a tensor of shape N × N × 256 is constructed, where entry (i, j) corresponds to the ith 128-dimensional vector concatenated with the jth 128-dimensional vector. This tensor is transformed via an 8-layer ResNet [75] to yield a final N × N matrix, which, after transformation by the sigmoid function, produces the desired probability matrix. Binary cross-entropy is used as the loss target, with masks applied so that the loss is computed only on residue pairs separated by at least 12 residues (excluding “short” contacts). Fine-tuning is performed for 128 epochs with a batch size of 128. The learning rates of the non-embedding transformer parameters are scaled with model width, with the projection head and ResNet [75] using a learning rate of 10⁻³. Learning rates are warmed up and decayed via the PyTorch OneCycleLR [71] learning rate scheduler as mentioned previously.
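The pairwise feature construction and the 12-residue separation mask can be sketched as follows, using a smaller projection dimension than the paper's 128 for illustration:

```python
import numpy as np

def pairwise_features(proj):
    """Build the (N, N, 2*d) tensor where entry (i, j) is the concatenation
    of projected embeddings i and j (d = 128 in the paper; smaller here)."""
    n, d = proj.shape
    left = np.broadcast_to(proj[:, None, :], (n, n, d))
    right = np.broadcast_to(proj[None, :, :], (n, n, d))
    return np.concatenate([left, right], axis=-1)

def long_range_mask(n, min_sep=12):
    """Select residue pairs separated by at least min_sep positions,
    matching the loss masking that excludes short-range contacts."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) >= min_sep

proj = np.random.default_rng(0).standard_normal((20, 4))  # stand-in projections
feats = pairwise_features(proj)  # shape (20, 20, 8)
mask = long_range_mask(20)
```

In the full pipeline this tensor feeds the 8-layer ResNet, and the binary cross-entropy loss is evaluated only where `mask` is true.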

We fine-tune a series of ESM2 models [13] to compare both absolute and scaling performance against a state-of-the-art single-omic protein model. Specifically, we fine-tune the 8 million, 35 million, 150 million, 650 million, and 3 billion parameter ESM2 models in the same fashion as the OmniBioTE models above. For brevity, we hereafter refer to the ESM models as ESM2-XS (8 million), ESM2-S (35 million), ESM2-M (150 million), ESM2-L (650 million), and ESM2-XL (3 billion). We use the same embedding and head learning rates as the OmniBioTE fine-tuning runs, with the non-embedding parameter learning rate set separately. Additionally, we evaluate LucaOne via the same hyperparameters described in the protein-nucleic acid binding evaluation, with the same number of iterations and batch size for each task. We use AdamW (weight decay = 0.01) as the optimizer for all models.

Protein general language of life evaluation.

To explore per-residue tasks (i.e., tasks that require a prediction for every residue in the protein), we employ the Protein General Language of Life Evaluation (ProteinGLUE) [76]. We refer to the original work for a full description of ProteinGLUE, but briefly, ProteinGLUE consists of several tasks:

Secondary structure prediction: the task is identical to the TAPE secondary structure task discussed above [5]. Accuracy is the primary metric.

Solvent accessibility: this comprises a binary classification task to predict whether a residue has less than 7% solvent accessibility, as well as a regression task to predict the actual solvent accessibility value. Accuracy is the primary metric for the classification task, and the Pearson correlation coefficient is the primary metric for the regression task.

Protein-protein interaction: the task is to predict which residues interact in either homodimers or heterodimers. Area under the receiver operating characteristic curve (AUCROC) is used as the primary metric.

Epitope region detection: the task is to predict which regions of a protein are antigenic epitopes. The performance of this task is measured in AUCROC.

Hydrophobic patch prediction: the goal of this task is to predict the rank of the largest hydrophobic patch to which a residue belongs. This task is measured via the Pearson correlation coefficient.

Each task was trained with a batch size of 32 for 16 epochs, except for protein-protein interaction, for which 64 epochs were used owing to the smaller dataset size. We used the same initial learning rates and schedules as in the TAPE evaluation above. We compare against ESM models in a similar manner as the TAPE evaluations, namely with an embedding learning rate of 10−4, a projection head learning rate of 10−2, and the same non-embedding parameter learning rate as in the TAPE evaluations. We use the same optimizers and hyperparameters as described in the TAPE evaluations, and evaluate LucaOne on these tasks with hyperparameters identical to the TAPE evaluation.

Per-residue evaluations

Because the protein and nucleic acid datasets were tokenized with byte-pair encoding, most tokens contain several nucleotides or residues. Evaluations that require a per-residue prediction, such as secondary structure, are therefore not directly compatible with this tokenization scheme. To solve this issue, we apply two simple strategies at train and test time. At train time, we compute the target of a single token as the mode of the labels of the residues it contains (for classification tasks) or the mean of their values (for regression tasks); this makes the input and target sequence lengths identical. At test time, we simply duplicate the predicted value at each token by the number of residues that token contains, allowing us to construct a prediction with the same length as the ground truth. This method places an upper bound on the performance our model can achieve on any per-residue task, but in practice this upper bound is higher than previously reported state-of-the-art results, likely because nearby residues often share similar values in per-residue prediction tasks (e.g., if a residue is in a beta strand, its adjacent residues are likely to be in the same beta strand). We note that our evaluation results are still directly comparable to previous per-residue methods, as we duplicate our predictions to match the ground-truth dimensionality rather than decreasing the ground-truth dimensionality to match the token sequence length (as is done at train time).
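The train-time label collapsing and test-time duplication described above might look like the following (a sketch with our own function names, assuming each token’s residue count is known from the BPE segmentation):

```python
import numpy as np
from collections import Counter

def collapse_labels(residue_labels, token_lengths, classification=True):
    """Train time: one target per token (mode for classification, mean for regression)."""
    targets, i = [], 0
    for n in token_lengths:
        chunk = residue_labels[i:i + n]
        targets.append(Counter(chunk).most_common(1)[0][0] if classification
                       else float(np.mean(chunk)))
        i += n
    return targets

def expand_predictions(token_preds, token_lengths):
    """Test time: duplicate each token's prediction over the residues it covers."""
    return [p for p, n in zip(token_preds, token_lengths) for _ in range(n)]
```

For example, labels ["H", "H", "E", "E", "E"] with token lengths [2, 3] collapse to ["H", "E"] at train time, and token predictions ["H", "E"] expand back to a length-5 residue prediction at test time.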

For the contact evaluations, the non-uniform number of residues encoded by each token presented a significant challenge. We remedy this issue by transforming prediction targets from residue to token space for training and transforming predictions from token to residue space for evaluation. Transformation of prediction maps from residue space to token space was accomplished by assigning the (i,j)-token pair as a true contact if any of the residues contained within token i contact any of the residues within token j. Similarly, the (i,j)-token pair of the contact mask, used to ignore short-range contacts in the loss function, was assigned a positive value if any of the residues contained within token i are at least 12 residues apart from any of the residues contained in token j. Transforming from token space to residue space for evaluation is simpler: residue pair (n,m) is assigned the value of the token pair (i,j), where i is the token containing residue n and j is the token containing residue m. The per-residue/nucleotide models were evaluated without these transformations.
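The two transforms above can be sketched as follows (hypothetical helper names; `tok_of_res[r]` is assumed to give the index of the token containing residue r):

```python
import numpy as np

def contacts_to_token_space(res_contacts, tok_of_res, n_tokens):
    """Token pair (i, j) is a true contact if ANY residue in token i
    contacts ANY residue in token j (train-time target transform)."""
    tok = np.zeros((n_tokens, n_tokens), dtype=bool)
    rows, cols = np.nonzero(res_contacts)
    for a, b in zip(rows, cols):
        tok[tok_of_res[a], tok_of_res[b]] = True
    return tok

def preds_to_residue_space(tok_preds, tok_of_res):
    """Residue pair (n, m) inherits the prediction of its token pair
    (test-time prediction transform)."""
    idx = np.asarray(tok_of_res)
    return tok_preds[np.ix_(idx, idx)]
```

The same residue-to-token rule applies to the short-range contact mask, with “any residue pair at least 12 residues apart” in place of “any residue pair in contact.”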

Interpretability

Protein-nucleic acid interactions.

To show that OmniBioTE learns semantically meaningful features, we demonstrate that when trained to predict the binding affinity between a nucleic acid and a protein sequence, OmniBioTE implicitly learns structural information despite being trained exclusively on primary sequence data. We fine-tune one OmniBioTE model of each size, in the same fashion as described for the protein-nucleic acid binding evaluation, though we use all available data rather than cross-validation splits, as the goal is to fine-tune OmniBioTE models to be highly capable of predicting binding interactions and then investigate their mechanics.

Next, we gather all structures from the Research Collaboratory for Structural Bioinformatics Protein Data Bank [73] that contain strictly one protein chain and either one or two nucleic acid chains. For each residue in the protein-nucleic acid complex, we classify the residue as making contact with a nucleotide if it is within 8 Å of any nucleotide (in the same manner as described in the Protein-nucleic acid Contact Prediction task). We then compute a forward pass through either the OmniBioTE model fine-tuned to predict binding affinity or through the base OmniBioTE model (control) and collect the attention maps produced by each head in each layer (this results in N² attention maps, where N is the number of layers). Next, we concatenate these attention maps along the channel dimension to produce an N² × L × L tensor, where L is the length of the input sequence. We then train a small convolutional network consisting of four layers. The first layer takes the N² channels and applies a convolution to produce 64 channels, the next two layers apply convolutions producing 64 channels each, and the final layer applies a convolution producing a single channel. The output of the convolutional network is an L × L tensor, and we average across the last dimension to produce L logits that, after a sigmoid operation, yield the predicted probability that a given residue makes contact with a nucleotide (this task is identical to the Protein-Nucleic acid Contact Prediction task described above). We train this convolutional network via AdamW with a learning rate of 10−3 and weight decay of 10−2 for 1,000 steps with a batch size of 256, linearly decaying the learning rate to zero over the course of training. Critically, the weights of the underlying OmniBioTE model remain frozen throughout training, meaning that the convolutional network must extract this structural information strictly from the attention maps produced by the underlying model.
We compare the F1-score on each of the 10 folds for the attention maps produced by the base OmniBioTE model and those produced by the OmniBioTE model fine-tuned to predict binding affinity. If the fine-tuned model has learned meaningful structural information from the fine-tuning process, we would expect the F1-score for convolutional networks trained on these attention maps to be higher than those of the base model.
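In terms of shape bookkeeping, the attention-map probe reduces to the following sketch (the kernel sizes were not specified in the text, so 1×1 convolutions, implemented as channel mixing, are used purely to illustrate the data flow; the weights here are random stand-ins for the learned filters):

```python
import numpy as np

def attention_probe(attn_maps, weights):
    """attn_maps: (C, L, L) stacked attention maps (C = N^2 channels).
    weights: list of (out_ch, in_ch) channel-mixing matrices standing in
    for conv filters. Returns L per-residue contact probabilities."""
    x = attn_maps
    for W in weights[:-1]:
        x = np.maximum(np.einsum("oc,clm->olm", W, x), 0.0)  # 1x1 conv + ReLU
    x = np.einsum("oc,clm->olm", weights[-1], x)             # final single-channel conv
    logits = x[0].mean(axis=-1)                              # (L, L) -> L logits
    return 1.0 / (1.0 + np.exp(-logits))                     # sigmoid -> probabilities
```

With channel widths (C → 64 → 64 → 64 → 1), this mirrors the four-layer structure described above; only the probe’s weights are trained, while the transformer producing `attn_maps` stays frozen.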

Shared representations between modalities.

We aim to test whether OmniBioTE effectively learns a joint representation space between nucleic acid and protein sequences rather than simply learning to represent both modalities separately. In this case, we want to test whether OmniBioTE has learned representations of gene sequences (DNA, both coding and non-coding regions) and their corresponding protein sequences that reflect shared functional or structural properties, independent of the sequence modality.

We first formalize the notion of invariance under transcription and translation. Let x be a gene (DNA) sequence, and let y = G(x) be the corresponding protein sequence produced by a mapping G, such as the standard transcription and translation process. Suppose that our pretrained multimodal model outputs embeddings ex for x and ey for y, where ex, ey ∈ ℝd. We define a feature extractor ϕ that maps an embedding to a scalar feature value. A feature is called invariant under the mapping G if

ϕ(ex) = ϕ(ey)

for all x and y = G(x). In practical terms, such an invariant feature may correspond to the biological function or identity of a gene–protein pair, that is, a characteristic that remains constant regardless of the modality.

To test whether the model has indeed learned such invariant features, we conduct a contrastive learning experiment employing a strictly linear transformation. In this experiment, we first obtain pairs of gene sequences (including both intronic and exonic regions) and their corresponding translated protein sequences. Using our pretrained multimodal model, we compute the embeddings ex and ey for each gene and protein sequence, respectively. We then introduce a learnable low-rank linear transformation W to project the embeddings into a shared subspace, yielding Wex and Wey. The transformation W is optimized via a contrastive objective that simultaneously maximizes the cosine similarity between corresponding pairs (Wex, Wey) while minimizing the similarity between non-corresponding pairs.

Specifically, we employ a contrastive loss function similar to the CLIP framework [77] to learn our feature extractor: let X ∈ ℝN×d and Y ∈ ℝN×d denote two batches of embeddings (with N samples and embedding dimension d), where each row xi of X is a gene’s feature vector and each row yi of Y is the feature vector of the corresponding protein sequence. Any given pair xi and yj are unrelated if i ≠ j. To compute the contrastive loss, each embedding in X and Y is normalized to unit length. The normalized embeddings x̂i and ŷj are then used to compute a similarity matrix S whose entries are given by

Sij = (x̂i · ŷj) / τ,

where τ is a temperature parameter that controls the scaling of the cosine similarities.

In this setup, the diagonal elements Sii represent the cosine similarity between corresponding pairs, while the off-diagonal elements Sij for i ≠ j represent the similarities between non-corresponding pairs. Our final loss is composed of two terms: the first term treats each row of S as logits for a classification task in which the correct label for xi is i; the second term is computed by treating each column as logits for the corresponding yi. The two terms are simply averaged to compute the final scalar loss. This approach is identical to the original CLIP loss proposed by Radford et al. [77]. For our experiments, we set the rank k = 16.
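The symmetric objective above can be written compactly as follows (a NumPy sketch; τ = 0.07 is an illustrative value, as the paper’s temperature setting was not recoverable here):

```python
import numpy as np

def clip_loss(E_gene, E_prot, W, tau=0.07):
    """Symmetric CLIP-style contrastive loss over low-rank projections W (k x d)."""
    def normalize(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)

    def logsumexp(a, axis):
        m = a.max(axis=axis, keepdims=True)
        return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

    X = normalize(E_gene @ W.T)            # (N, k) unit-norm gene features
    Y = normalize(E_prot @ W.T)            # (N, k) unit-norm protein features
    S = (X @ Y.T) / tau                    # (N, N) scaled cosine similarities
    diag = np.arange(S.shape[0])
    row_term = (S - logsumexp(S, axis=1))[diag, diag]  # rows: genes -> proteins
    col_term = (S - logsumexp(S, axis=0))[diag, diag]  # columns: proteins -> genes
    return -0.5 * (row_term.mean() + col_term.mean())
```

With perfectly aligned pairs the loss approaches zero; with indistinguishable pairs it approaches log N. Only W is trained; the model embeddings E_gene and E_prot stay frozen.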

We minimize this loss via the AdamW optimizer, with a learning rate of 0.01 linearly decayed to 0.0 over 10,000 steps. We optimize strictly over the projection matrix W and leave the model parameters frozen, as the goal is to test whether joint features have already been learned, not whether they can be learned.

After learning ϕ, we apply this transformation to a held-out set of gene-protein pairs and compute the dot product between their feature representations. If ϕ is a generalizable feature extractor, we should see high dot product scores between corresponding held-out pairs and low dot product scores between non-corresponding held-out pairs.

Critically, we assess the generalization capability of the invariant features under very strict conditions; we train on only 5% of the available paired data and test on the remaining 95%. Strong performance in this setting indicates that the model’s embeddings encode a shared subspace that captures the desired invariances.

For further validation, we perform a control experiment using two separately trained single-omic models—one trained solely on genes and the other solely on proteins. In this case, the embedding spaces of these models are learned independently, and there is no inherent guarantee of alignment between them. We attempt to learn two distinct feature extractors, Wgene and Wprotein, for the gene and protein modalities, respectively, with the goal of minimizing the same contrastive loss.

Results

Emergent joint representations

We first tested whether OmniBioTE embeddings encode modality-invariant features linking genes and proteins. First, we generated embeddings for the primary sequences of a set of proteins, as well as for the genes that encode them (both non-coding and coding regions). Next, a low-rank linear projector was trained on these frozen embeddings via a contrastive loss objective (with matching protein-gene pairs serving as positives) using only 5% of the ground-truth data. This simple linear probe is best thought of as a transform that narrows the embeddings into a small subspace of the overall embedding space, rather than as a feature extractor. Remarkably, we find that small linear probes trained on only 5% of the gene-protein pairs generalize well to the remaining 95% of held-out data (Fig 2a, 2b). In comparison, two separate low-rank linear probes trained with identical objectives and data splits on the single-omic models fail to generalize. Despite never being explicitly (or even implicitly) taught a correspondence between genes and their corresponding translated protein sequences, OmniBioTE naturally learns these associations from the underlying distributions. Furthermore, the failure of the single-omic models to generalize, despite the looser constraint of two separate linear probes, demonstrates that the generalization is due to joint embeddings rather than to matching independently extracted features.

Fig 2. Emergent alignment of gene and protein embeddings and latent structural information.

(a) The distribution of cosine similarity between feature vectors produced by OmniBioTE via a low-rank feature extractor on the 95% held-out data. (b) The analogous plot produced by NucBioTE and ProtBioTE with two separate feature extractors and identical methodology. (c) The increase in F1-score on the contact-prediction task using frozen attention maps from OmniBioTE models fine-tuned to predict binding affinity compared to frozen attention maps from the base models. (d) An example of predicted contact probability for zinc-finger and BTB domain-containing protein 7A (ZBTB7A) bound to a DNA duplex, computed from the attention maps produced by the fine-tuned OmniBioTE models. Darker red colors indicate a stronger predicted probability of contact. All box-and-whisker plots show the median as the central line, the interquartile range (IQR) as the box, and the minimum and maximum values as the whiskers. Outliers were excluded from (a) for clarity. P-values for the marked comparisons (***1, ***2, **3) were computed via one-sided Welch’s t-test with Holm-Bonferroni correction for multiple comparisons. Significance testing in (a) and (b) is omitted because the extremely large sample size leads to trivially high significance.

https://doi.org/10.1371/journal.pone.0341501.g002

Performance on multi-omic tasks

We demonstrated OmniBioTE’s potential as a foundation model for natively multi-omic tasks by fine-tuning each OmniBioTE model to predict the ΔG of protein-nucleic acid binding interactions. We measured the Pearson correlation coefficient between the laboratory-measured ΔG and the value predicted by OmniBioTE, as well as the mean absolute error (MAE) between these values. We found that our largest model achieved a Pearson correlation coefficient of 0.41 and MAE = 1.56 kcal/mol, exceeding single-omic controls (ΔPCC = +0.33) (Fig 3a, 3b). In addition to our single-omic controls, we compared against a recently developed binding affinity regression model, DeePNAP [59], as well as a computationally intensive molecular dynamics-based simulation run on structures predicted by AlphaFold3 [8]. We find that after rigorously partitioning the train and test sets by sequence homology (via alignment scores generated with BLOSUM62 substitution matrices), our largest model considerably outperforms both DeePNAP and the AlphaFold3 + molecular dynamics predictions. The AlphaFold3-based simulations were notably more computationally intensive (S1 File). The full results of the evaluation can be found in S3 Table. As a performance ceiling, we note that empirical work has found the maximum achievable Pearson correlation coefficient to be around 0.81 and the minimum achievable mean absolute error to be around 0.6 kcal/mol [78].

Fig 3. Multi-omic pretraining facilitates state-of-the-art results on protein-nucleic acid complex regression.

(A) Performance on 10-fold cross-validation over the ProNAB dataset as measured by the Pearson correlation coefficient (PCC) as a function of pretraining compute. (B) Mean absolute error in prediction over the 10-fold cross-validation set. (C) The predicted ΔΔG of mutated consensus sequences as a function of pretraining compute. Error bars represent the standard error of the mean over all 10 folds. LucaOne and DeePNAP baselines are omitted for clarity, as both achieve performance similar to random chance. (D) Performance on the supervised contact evaluation task trained at various contact thresholds. The positive-to-negative ratio of the dataset is 0.29, 0.16, and 0.09, and the maximum F1-score achievable with random guessing is 0.37, 0.247, and 0.157, for 8 Å, 6 Å, and 4 Å, respectively. (*) represents the top-performing model in each evaluation. P-values for the marked comparisons (***1, **2, ***3, **4) were computed via one-sided Welch’s t-test with Holm-Bonferroni correction for multiple comparisons.

https://doi.org/10.1371/journal.pone.0341501.g003

Next, we aimed to evaluate whether our fine-tuned models could capture the specificity of DNA-binding proteins. We introduced small mutations into the consensus sequences of DNA-binding proteins from the JASPAR dataset [3], yielding highly similar sequences that should have strictly lower binding affinity. These mutation scans confirmed that the predicted ΔΔG increases, on average, upon subtle disruption of the consensus sequence, with the effect scaling with model size (Fig 3c). Additionally, we find that the magnitude of the predicted ΔΔG increase grows significantly with scale (Spearman’s ρ). This result demonstrates that our fine-tuned models are sensitive to fine-grained changes in sequences, rather than modeling a rough distribution of binding affinities belonging to families of proteins or consensus sequences. Furthermore, the generalization of our method to sequences from JASPAR, a dataset that is out-of-distribution with respect to the training set, demonstrates the validity and robustness of our methodology.

We found that the multi-omic approach is considerably more performant and compute-efficient than using two identically trained single-omic models (Fig 3a, 3b). We found a clear trend of increasing performance with model scale, as opposed to over-fitting with greater parameter count, indicating the robustness of the approach and potential for further performance gains with greater scale in both compute and data.

As another multi-omic structural task, we fine-tuned OmniBioTE models to take as input the primary sequence of both the protein and nucleic acid in a given protein-nucleic acid complex (sourced from the RCSB Protein Data Bank [73]) and predict which residues in the complex make contact with the nucleic acid it binds to, with contact defined as a residue and nucleotide residing within a given distance threshold. We find that on the protein-nucleic acid contact prediction task (measured in F1-score), our per-residue/nucleotide OmniBioTE-XL model outperforms a genomic/proteomic baseline, LucaOne, which had considerably more pretraining compute invested (Fig 3d) and that results improve with model scale. We hypothesize that this advantage stems from training OmniBioTE on a wide variety of nucleic acid data, in addition to genomics. We find that the byte-pair encoded OmniBioTE model underperforms compared to the LucaOne baseline and the per-residue/nucleotide OmniBioTE models, which we attribute to lower-resolution predictions (each token predicts the contact for multiple residues at a time). Additionally, we find similar improvements with scale on the contact prediction task (S2 Table).

Attention-based structural interpretability

We assessed whether attention maps extracted from OmniBioTE models fine-tuned to predict binding affinity implicitly encode structural information, despite the models having no explicit structural training data. A simple convolutional probe was trained on frozen attention maps from OmniBioTE models fine-tuned to predict binding affinity and compared to an identical convolutional probe trained on frozen attention maps produced by the corresponding base models. Critically, all model parameters were frozen while training the probes, ensuring that no structural information leaked into either model’s attention maps. If the probes trained on frozen attention maps from the fine-tuned models consistently yield better prediction performance than identical probes trained on attention maps from the base models, then the fine-tuned models must implicitly encode richer structural information. We found that the probe trained on attention maps from the fine-tuned OmniBioTE models yielded consistently higher F1-scores on the contact prediction task at larger model scales (Fig 2c), indicating that more latent structural information is present in the attention maps produced by models trained to predict binding affinity. This is particularly striking because this structural information is not explicitly present in the binding affinity task and must instead be inferred. Additionally, the difference in F1-score increases with model size (Spearman’s ρ), suggesting that larger pretrained models may be better at inferring structural information. An example of contact predictions projected onto a zinc-finger protein is shown in Fig 2d.

Performance on single-omic benchmarks

We hypothesized that our multi-omic model may be more performant on single-omic benchmarks, especially from the perspective of performance-per-FLOP or performance per dataset size, two metrics broadly recognized as critical for scaling large transformer models. For each benchmark across all tasks, multi-omic pretraining demonstrates superior or comparable performance to single-omic pretraining in terms of performance-per-FLOP, even with vastly different compute budgets, for the GUE, TAPE, and ProteinGLUE benchmarks (Fig 4a, 4c, 4e). This improvement in performance-per-FLOP is even more striking considering that the multi-omic training runs exposed the model to significantly less data per modality, since the total token budget was fixed in all training runs regardless of modality. In the GUE benchmarks (Fig 4b), OmniBioTE models set a new state-of-the-art in all categories except human transcription factor classification. All sizes of the OmniBioTE models lie well above the previous compute-to-performance Pareto frontier, with the exception of the RandomMask model, indicating strong scaling across over an order of magnitude of compute. The full results of the GUE evaluations can be found in S4 Table, S5 Table, S6 Table, S7 Table, S8 Table, and S9 Table. In the TAPE evaluations (Fig 4d), OmniBioTE does not achieve any state-of-the-art results in terms of absolute performance, but the per-residue OmniBioTE models begin to trend above the previous compute Pareto frontier set by ESM, with only the smallest OmniBioTE model lying below it. The results of the TAPE evaluation can be found in S10 Table, S11 Table, and S12 Table. Results on ProteinGLUE are mixed across all models (Fig 4f), with the Pareto frontier difficult to ascertain; more scaling experiments are likely needed to elucidate the true frontier. The full results of the ProteinGLUE evaluation can be found in S13 Table and S14 Table.
The new compute Pareto frontier highlights the benefits of multi-omic data for efficient model scaling.

Fig 4. Performance and scaling across single-omic benchmarks.

Aggregate benchmark performance for each model plotted as a function of pretraining FLOPs for the (a, b) GUE, (c, d) TAPE, and (e, f) ProteinGLUE benchmarks demonstrating superior performance per pretraining FLOPs of multi-omic pretraining compared to single-omic pretraining. GUE epigenetic mark prediction benchmarks were averaged to form a single category. Each point in the OmniBioTE series represents small, medium, large, and XL from least to most parameters in panels (a,c,e). (*) represents the top-performing model in each evaluation.

https://doi.org/10.1371/journal.pone.0341501.g004

Notably, results on protein evaluation tasks differed depending on whether per-residue/nucleotide tokenization or a byte-pair encoding tokenizer was used. This difference likely stems from the higher-resolution predictions that per-residue tokenization enables on per-residue tasks.

Discussion

Implications, limitations, and outlook

OmniBioTE is a series of multi-omic models (MOMs) pretrained jointly on a diverse set of nucleic acid sequences and proteomic data. We analyzed the properties of these models across a wide range of scales and tasks. We found that these models not only achieve state-of-the-art performance on single-omic tasks measured in performance-per-FLOP, but also unlock novel multi-omic tasks such as modeling protein-nucleic acid interactions by predicting the change in Gibbs free energy between a protein and nucleic acid. We also showed that as a result of this fine-tuning process, OmniBioTE learns meaningful structural information without any explicit structural training, allowing one to estimate how strongly a given protein residue or nucleotide participates in binding interactions. Although our model is able to implicitly learn some structural information, the quality of this information pales in comparison to that of models explicitly trained to predict the structure of protein-nucleic acid complexes. For example, AlphaFold3 [8] reports an interface local distance difference test (iLDDT) score of over 55 for protein-DNA complexes, loosely meaning that, on average, AlphaFold3 predicts the interface structure between proteins and DNA to within 4 Å over 55% of the time. This value is a lower bound, as the iLDDT is computed as an average over distance cutoffs of 0.5, 1, 2, and 4 Å, meaning that AlphaFold3 significantly outperforms the contacts predicted from the information our models implicitly learn during fine-tuning. Multi-omic modeling of this sort is of great interest in the development of new pharmacologic therapies; many notable pharmaceutical drugs and candidate drugs that function via nucleic acid-protein interaction have already shown great promise, such as pegaptanib [79], an RNA aptamer targeting vascular endothelial growth factor, as well as RNA sequences that target nucleolin [80], coagulation factors [81–84], CCL2 [85], CXCL12 [86], and hepcidin [87].
While our methodology does not explore aptamer design or property prediction, we believe that this methodology could be extended to aptamers with the right dataset and leave this to future research.

We found that OmniBioTE emergently learned a joint representation between protein sequences and their corresponding genes despite never being explicitly trained on a joint objective, demonstrating that biosequence transformer models trained on multi-omic data can learn non-trivial representations across sequence types even with a simple masked language modeling objective. We attribute this emergent property of self-supervised pretraining to the efficient coding hypothesis [88].

A natural question is how OmniBioTE could acquire joint, modality-invariant features in the absence of explicit cross-attention between protein and nucleic-acid sequences during pretraining. We hypothesize that this arises largely because the model must use the same parameters to model all sequence types. Nucleic-acid sequences and their translated proteins can be considered two views of the same underlying “latent sequences” generated from shared underlying biological factors. When such a model is trained to optimize the masked-language-modeling objective on the joint data distribution under finite capacity, the most efficient solution is to allocate some representation dimensions to the latent sequence representation and reuse them across modalities, while also learning the modality-specific information. This is the behavior predicted by efficient-coding perspectives on representation learning, in which optimization favors codes that capture common latent structure. Importantly, this mechanism does not require cross-attention between protein and nucleic acid tokens during pretraining; the coupling arises from the shared parameters, a common unsupervised objective over the joint data distribution, and the constraints of a finite parameter count. Analogous phenomena have been observed in multilingual language models such as mBERT [89], which show evidence of transfer learning and shared representations across multiple languages despite lacking any inter-language cross-attention. Our results, in which a low-rank linear projector trained on only 5% of gene–protein pairs generalizes to the remaining 95%, are consistent with this view and suggest that OmniBioTE’s backbone already encodes a modality-invariant subspace that is approximately stable under transcription and translation.

We hypothesize that considerably richer representations could be learned if auxiliary training objectives were introduced, such as structure/property prediction, cross-attention between different modalities, or the addition of multiple sequence alignment data. Beyond additional learning objectives, we note that there has been a considerable amount of research into multi-modal vision-language modeling using novel model architectural components including cross-attention and cross-modal projectors [9093], and that many of these approaches may be of interest in multi-modal biosequence modeling as well.

We additionally found that multi-omic pretrained models are superior or comparable at scale to identical models trained on single-omic data with identical compute budgets and smaller per-omic data budgets. Furthermore, we find that our multi-omic models set a new compute Pareto frontier across the GUE and TAPE benchmarks, even before factoring in the lower amount of per-modality data each model sees during training. Despite the difference in datasets, we found no downsides to mixing in other modalities during pretraining for our biosequence foundation models. In fact, our MOMs set new state-of-the-art performance numbers for several of the downstream nucleic acid tasks. Our MOMs also considerably outperformed a combination of single-omic models on the multi-omic task of binding affinity prediction, and outperformed molecular dynamics methods applied to structural predictions from AlphaFold3, despite the latter being a considerably more computationally intensive baseline. Lastly, we showed that these results transfer robustly to completely unseen and unrelated datasets by testing our models on the JASPAR dataset.

There are several notable limitations to this work that deserve mention. Most notably, we have only scratched the surface of multi-omic biosequence modeling: there are many popular ways of training multi-omic sequence models, and we opted for a simple approach using a masked-language-modeling task rather than the more popular autoregressive training framework, leaving the latter open as a viable future research direction. We additionally investigated scaling over only roughly two orders of magnitude of compute, and leave the training of larger models on larger datasets as a future research direction that seems reasonably likely to yield performance benefits consistent with the scaling results found in this work. Additionally, for the protein evaluations, the ESM models probe a higher compute budget than we were able to reach with OmniBioTE due to constraints on available compute. We hope to explore larger compute budgets in future work given the promising results at the 10²¹ FLOP range.
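The masked-language-modeling corruption scheme referenced above can be sketched as follows, following the standard BERT recipe [60]: 15% of positions are selected; of those, 80% are replaced with a mask token, 10% with a random token, and 10% are left unchanged. The vocabulary size, mask-token ID, and sequence below are hypothetical stand-ins, not our actual tokenizer.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style corruption. Of the selected positions, 80% become the mask
    token, 10% a random token, 10% are left unchanged. Returns (inputs,
    labels); labels are -100 at positions that do not contribute to the loss."""
    if rng is None:
        rng = np.random.default_rng()
    token_ids = np.asarray(token_ids)
    labels = np.full_like(token_ids, -100)
    selected = rng.random(token_ids.shape) < mask_prob
    labels[selected] = token_ids[selected]

    inputs = token_ids.copy()
    r = rng.random(token_ids.shape)
    inputs[selected & (r < 0.8)] = mask_id
    rand_pos = selected & (r >= 0.8) & (r < 0.9)
    inputs[rand_pos] = rng.integers(0, vocab_size, size=int(rand_pos.sum()))
    return inputs, labels

# Example with a hypothetical tokenized nucleotide sequence; the mask token
# is assumed to sit just past the end of the base vocabulary.
rng = np.random.default_rng(0)
seq = rng.integers(0, 4096, size=128)
inputs, labels = mask_tokens(seq, mask_id=4096, vocab_size=4096, rng=rng)
print((labels != -100).sum(), "positions contribute to the MLM loss")
```

The model is then trained to predict the original tokens at the selected positions only, which is the bidirectional objective used throughout this work.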

Conclusions

Many of biology’s most significant interactions occur between proteins and nucleic acids, and we present the first large-scale attempt to build multi-omic foundation models that specifically capture these critical molecular interactions and to quantify their scaling behavior. Our results indicate that multi-omic pretraining is a scalable and compute-efficient strategy for building unified biological foundation models with rich representations and strong performance on downstream multi-omic tasks. Beyond their biological significance, the interactions between nucleic acids and proteins are of great pharmaceutical and clinical importance; models that can assist with the development of nucleic acids that modify the function of naturally occurring proteins would greatly accelerate pharmaceutical development. Foundational biosequence models promise to dramatically improve our ability to both understand and predict biology, and we hope that OmniBioTE will be one of many efforts to build multi-omic models that capture the full richness of biomolecular interactions.

Acknowledgments

The authors would like to thank Michael Retchin for his insightful comments and broad literature knowledge on protein-nucleic acid interactions. The authors would like to thank Douglas Kondziolka for his feedback on the manuscript. The authors would also like to thank Vincent D’Anniballe for his helpful discussion surrounding biosequence datasets. Lastly, we would like to thank Michael Costantino and the NYU Langone High Performance Computing team for their assistance with maintaining state-of-the-art computing infrastructure necessary for this research.

Supporting information

S1 File. Supporting information describing our model architecture and molecular dynamics experiment.

https://doi.org/10.1371/journal.pone.0341501.s001

(PDF)

S1 Table. Training data statistics across all sequence types.

https://doi.org/10.1371/journal.pone.0341501.s002

(DOCX)

S2 Table. Mean F1 scores for predicted contact maps at distance thresholds of 4 Å, 6 Å, and 8 Å.

https://doi.org/10.1371/journal.pone.0341501.s003

(DOCX)

S3 Table. OmniBioTE performance across all 10 folds of the ProNAB mutation benchmark as measured in Pearson correlation coefficient (PCC) and mean absolute error (MAE).

https://doi.org/10.1371/journal.pone.0341501.s004

(DOCX)

S4 Table. GUE Results (Epigenetics): Histone Modification Benchmarks (Part 1). Values represent the Matthews correlation coefficient of the predictions.

https://doi.org/10.1371/journal.pone.0341501.s005

(DOCX)

S5 Table. GUE Results (Epigenetics): Histone Modification Benchmarks (Part 2). Values represent the Matthews correlation coefficient of the predictions.

https://doi.org/10.1371/journal.pone.0341501.s006

(DOCX)

S6 Table. GUE Results: Human Transcription Factors and COVID. Values represent the Matthews correlation coefficient of the predictions, with the exception of the COVID variant prediction task which uses F1-score.

https://doi.org/10.1371/journal.pone.0341501.s007

(DOCX)

S7 Table. GUE Results: Mouse Transcription Factors. Values represent the Matthews correlation coefficient of the predictions.

https://doi.org/10.1371/journal.pone.0341501.s008

(DOCX)

S8 Table. Promoter Detection performance across all promoters (All) and promoter subtypes (No TATA, TATA).

https://doi.org/10.1371/journal.pone.0341501.s009

(DOCX)

S9 Table. Core Promoter evaluation: performance across all promoters (All) and promoter subtypes (No TATA, TATA).

https://doi.org/10.1371/journal.pone.0341501.s010

(DOCX)

S10 Table. Secondary structure performance. In the 3-way columns, CASP12, CB513, and TS115 scores are reported; in the 8-way columns, the corresponding scores are reported. All values are measured in accuracy.

https://doi.org/10.1371/journal.pone.0341501.s011

(DOCX)

S11 Table. Remote homology (Fold, Superfamily, Family) classification performance measured in accuracy and regression performance (Fluorescence, Stability) measured in Spearman’s correlation coefficient.

https://doi.org/10.1371/journal.pone.0341501.s012

(DOCX)

S12 Table. Contact evaluation performance, reporting Contacts P@L for long- and medium-range contacts. All values are the computed precision of the predictions.

https://doi.org/10.1371/journal.pone.0341501.s013

(DOCX)

S13 Table. Performance on the structural prediction tasks in the ProteinGLUE dataset. Values represent the accuracy of the predictions.

https://doi.org/10.1371/journal.pone.0341501.s014

(DOCX)

S14 Table. Performance on the remaining tasks in the ProteinGLUE dataset.

https://doi.org/10.1371/journal.pone.0341501.s015

(DOCX)

References

  1. 1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 6000–10.
  2. 2. Harini K, Srivastava A, Kulandaisamy A, Gromiha MM. ProNAB: database for binding affinities of protein-nucleic acid complexes and their mutants. Nucleic Acids Res. 2022;50(D1):D1528–34. pmid:34606614
  3. 3. Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2024;52(D1):D174–82. pmid:37962376
  4. 4. Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, Liu H. DNABERT-2: efficient foundation model and benchmark for multi-species genome. 2024. https://arxiv.org/abs/2306.15006
  5. 5. Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, et al. Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst. 2019;32:9689–701. pmid:33390682
  6. 6. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–10. pmid:31942072
  7. 7. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
  8. 8. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. pmid:38718835
  9. 9. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–6. pmid:34282049
  10. 10. Baek M, McHugh R, Anishchenko I, Jiang H, Baker D, DiMaio F. Accurate prediction of protein-nucleic acid complexes using RoseTTAFoldNA. Nat Methods. 2024;21(1):117–21. pmid:37996753
  11. 11. Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods. 2024;21(8):1514–24. pmid:38744917
  12. 12. Wu R, Ding F, Wang R, Shen R, Zhang X, Luo S, et al. High-resolution de novo structure prediction from primary sequence. openRxiv. 2022.
  13. 13. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
  14. 14. Drori I, Thaker D, Srivatsa A, Jeong D, Wang Y, Nan L. Accurate protein structure prediction by embeddings and deep learning representations. arXiv preprint 2019. https://arxiv.org/abs/1911.05531
  15. 15. Elnaggar A, Essam H, Salah-Eldin W, Moustafa W, Elkerdawy M, Rochereau C. Ankh: optimized protein language model unlocks general-purpose modelling. arXiv preprint 2023. https://arxiv.org/abs/2301.06568
  16. 16. Geffen Y, Ofran Y, Unger R. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics. 2022;38(Suppl_2):ii95–8. pmid:36124789
  17. 17. Nambiar A, Heflin M, Liu S, Maslov S, Hopkins M, Ritz A. Transforming the language of life: transformer neural networks for protein prediction tasks. In: Proceedings of the 11th ACM international conference on bioinformatics, computational biology and health informatics; 2020. p. 1–8.
  18. 18. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(10):7112–27. pmid:34232869
  19. 19. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. pmid:33876751
  20. 20. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems. 2021;34:29287–303.
  21. 21. Rao RM, Liu J, Verkuil R, Meier J, Canny J, Abbeel P, et al. MSA transformer. In: International Conference on Machine Learning. PMLR; 2021. p. 8844–56.
  22. 22. Heinzinger M, Weissenow K, Sanchez JG, Henkel A, Mirdita M, Steinegger M, et al. Bilingual language model for protein sequence and structure. openRxiv 2023. http://dx.doi.org/10.1101/2023.07.23.550085
  23. 23. Chen B, Cheng X, Li P, Geng Y, Gong J, Li S, et al. xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein. openRxiv. 2023.
  24. 24. Su J, Han C, Zhou Y, Shan J, Zhou X, Yuan F. Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv. 2023:2023-10.
  25. 25. Notin P, Dias M, Frazer J, Marchena-Hurtado J, Gomez AN, Marks D, et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In: International Conference on Machine Learning. PMLR; 2022. p. 16990–7017.
  26. 26. Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–22. pmid:31636460
  27. 27. Bepler T, Berger B. Learning protein sequence embeddings using information from structure. International Conference on Learning Representations. 2019.
  28. 28. Alamdari S, Thakkar N, van den Berg R, Lu AX, Fusi N, Amini AP, et al. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv. 2023:2023-09.
  29. 29. Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023;41(8):1099–106. pmid:36702895
  30. 30. Ferruz N, Schmidt S, Höcker B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun. 2022;13(1):4348. pmid:35896542
  31. 31. Fishman V, Kuratov Y, Petrov M, Shmelev A, Shepelin D, Chekanov N. Gena-lm: a family of open-source foundational models for long DNA sequences. bioRxiv. 2023:2023-06.
  32. 32. Nguyen E, Poli M, Faizi M, Thomas A, Wornow M, Birch-Sykes C. Hyenadna: long-range genomic sequence modeling at single nucleotide resolution. Advances in neural information processing systems. 2024;36.
  33. 33. Dalla-Torre H, Gonzalez L, Mendoza Revilla J, Lopez Carranza N, Henryk Grywaczewski A, Oteri F. The nucleotide transformer: building and evaluating robust foundation models for human genomics. bioRxiv. 2023;2023-01.
  34. 34. Dudnyk K, Cai D, Shi C, Xu J, Zhou J. Sequence basis of transcription initiation in the human genome. Science. 2024;384(6694):eadj0116. pmid:38662817
  35. 35. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1196–203. pmid:34608324
  36. 36. Zvyagin M, Brace A, Hippe K, Deng Y, Zhang B, Bohorquez CO, et al. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. The International Journal of High Performance Computing Applications. 2023;37(6):683–705.
  37. 37. Hwang Y, Cornman AL, Kellogg EH, Ovchinnikov S, Girguis PR. Genomic language model predicts protein co-regulation and function. Nat Commun. 2024;15(1):2880. pmid:38570504
  38. 38. Brixi G, Durrant MG, Ku J, Poli M, Brockman G, Chang D, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv. 2025.
  39. 39. Tsukiyama S, Hasan MM, Deng H-W, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform. 2022;23(2):bbac053. pmid:35225328
  40. 40. De Waele G, Clauwaert J, Menschaert G, Waegeman W. CpG Transformer for imputation of single-cell methylomes. Bioinformatics. 2022;38(3):597–603. pmid:34718418
  41. 41. Jin J, Yu Y, Wang R, Zeng X, Pang C, Jiang Y, et al. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol. 2022;23(1):219. pmid:36253864
  42. 42. Zhou J, Chen Q, Braun PR, Perzel Mandell KA, Jaffe AE, Tan HY, et al. Deep learning predicts DNA methylation regulatory variants in the human brain and elucidates the genetics of psychiatric disorders. Proc Natl Acad Sci U S A. 2022;119(34):e2206069119. pmid:35969790
  43. 43. Lee D, Yang J, Kim S. Learning the histone codes with large genomic windows and three-dimensional chromatin interactions using transformer. Nat Commun. 2022;13(1):6678. pmid:36335101
  44. 44. Zhou Z, Yeung W, Gravel N, Salcedo M, Soleymani S, Li S, et al. Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions. Bioinformatics. 2023;39(2):btad046. pmid:36692152
  45. 45. Fu X, Mo S, Buendia A, Laurent A, Shao A, Alvarez-Torres Md M, et al. GET: a foundation model of transcription across human cell types. bioRxiv. 2023:2023-09.
  46. 46. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4(10):852–66.
  47. 47. Hao M, Gong J, Zeng X, Liu C, Guo Y, Cheng X, et al. Large-scale foundation model on single-cell transcriptomics. Nat Methods. 2024;21(8):1481–91. pmid:38844628
  48. 48. Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21(8):1470–80. pmid:38409223
  49. 49. Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618(7965):616–24. pmid:37258680
  50. 50. Li S, Moayedpour S, Li R, Bailey M, Riahi S, Kogler-Anele L. Codonbert: Large language models for mRNA design and optimization. bioRxiv. 2023:2023-09.
  51. 51. Celaj A, Gao AJ, Lau TT, Holgersen EM, Lo A, Lodaya V. An RNA foundation model enables discovery of disease mechanisms and candidate therapeutics. bioRxiv. 2023:2023-09.
  52. 52. Liu L, Li W, Wong KC, Yang F, Yao J. A pre-trained large generative model for translating single-cell transcriptome to proteome. bioRxiv. 2023:2023-07.
  53. 53. Shen H, Liu J, Hu J, Shen X, Zhang C, Wu D, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. iScience. 2023;26(5):106536. pmid:37187700
  54. 54. Gong J, Hao M, Cheng X, Zeng X, Liu C, Ma J. xTrimoGene: an efficient and scalable representation learner for single-cell RNA-seq data. Advances in Neural Information Processing Systems. 2024;36.
  55. 55. He Y, Fang P, Shan Y, Pan Y, Wei Y, Chen Y, et al. LucaOne: generalized biological foundation model with unified nucleic acid and protein language. bioRxiv. 2024:2024-05.
  56. 56. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-45. pmid:26553804
  57. 57. Pruitt K, Murphy T, Brown G, Murphy M. RefSeq frequently asked questions (FAQ). RefSeq Help National Center for Biotechnology Information (US). 2020.
  58. 58. Liang C, Qiao L, Ye P, Dong N, Sun J, Bai W. Toward understanding BERT-like pre-training for DNA foundation models. arXiv preprint 2023. https://arxiv.org/abs/2310.07644
  59. 59. Pandey U, Behara SM, Sharma S, Patil RS, Nambiar S, Koner D, et al. DeePNAP: a deep learning method to predict protein-nucleic acid binding affinity from their sequences. J Chem Inf Model. 2024;64(6):1806–15. pmid:38458968
  60. 60. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2019. https://arxiv.org/abs/1810.04805
  61. 61. Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2019;47(D1):D94–9. pmid:30365038
  62. 62. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8. pmid:17379688
  63. 63. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. 2016. https://arxiv.org/abs/1508.07909
  64. 64. Kudo T, Richardson J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. 2018. https://arxiv.org/abs/1808.06226
  65. 65. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. 2023. https://arxiv.org/abs/2307.09288
  66. 66. Su J, Ahmed M, Lu Y, Pan S, Bo W, Liu Y. Roformer: enhanced transformer with rotary position embedding. 2024.
  67. 67. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. 2019.
  68. 68. Yang G, Hu EJ, Babuschkin I, Sidor S, Liu X, Farhi D. Tensor programs v: tuning large neural networks via zero-shot hyperparameter transfer. 2022. https://arxiv.org/abs/2203.03466
  69. 69. Karpathy A. NanoGPT. 2022. https://github.com/karpathy/nanoGPT
  70. 70. Loshchilov I, Hutter F. Decoupled weight decay regularization. 2019. https://arxiv.org/abs/1711.05101
  71. 71. Smith LN, Topin N. Super-convergence: very fast training of neural networks using large learning rates. 2018. https://arxiv.org/abs/1708.07120
  72. 72. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. pmid:1438297
  73. 73. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. pmid:10592235
  74. 74. Ma J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics. 2015;31(21):3506–13. pmid:26275894
  75. 75. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2015. https://arxiv.org/abs/1512.03385
  76. 76. Capel H, Weiler R, Dijkstra M, Vleugels R, Bloem P, Feenstra KA. ProteinGLUE: a multi-task benchmark suite for self-supervised protein modeling. openRxiv. 2021.
  77. 77. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S. Learning transferable visual models from natural language supervision. In: International conference on machine learning. 2021. p. 8748–63.
  78. 78. Kramer C, Kalliokoski T, Gedeck P, Vulpetti A. The experimental uncertainty of heterogeneous public K(i) data. J Med Chem. 2012;55(11):5165–73. pmid:22643060
  79. 79. Gragoudas ES, Adamis AP, Cunningham ET Jr, Feinsod M, Guyer DR, VEGF Inhibition Study in Ocular Neovascularization Clinical Trial Group. Pegaptanib for neovascular age-related macular degeneration. N Engl J Med. 2004;351(27):2805–16. pmid:15625332
  80. 80. Carvalho J, Paiva A, Cabral Campello MP, Paulo A, Mergny J-L, Salgado GF, et al. Aptamer-based targeted delivery of a G-quadruplex ligand in cervical cancer cells. Sci Rep. 2019;9(1):7945. pmid:31138870
  81. 81. Marsala A, Lee E. Coil-assisted retrograde transvenous obliteration: a valid treatment for gastric variceal hemorrhage and hepatic encephalopathy. Dig Dis Interv. 2017;01(04):302–5.
  82. 82. Chan MY, Rusconi CP, Alexander JH, Tonkens RM, Harrington RA, Becker RC. A randomized, repeat-dose, pharmacodynamic and safety study of an antidote-controlled factor IXa inhibitor. J Thromb Haemost. 2008;6(5):789–96. pmid:18284597
  83. 83. Riccardi C, Meyer A, Vasseur J-J, Cavasso D, Russo Krauss I, Paduano L, et al. Design, synthesis and characterization of cyclic NU172 analogues: a biophysical and biological insight. Int J Mol Sci. 2020;21(11):3860. pmid:32485818
  84. 84. Jilma-Stohlawetz P, Knöbl P, Gilbert JC, Jilma B. The anti-von Willebrand factor aptamer ARC1779 increases von Willebrand factor levels and platelet counts in patients with type 2B von Willebrand disease. Thromb Haemost. 2012;108(2):284–90. pmid:22740102
  85. 85. Menne J, Eulberg D, Beyer D, Baumann M, Saudek F, Valkusz Z, et al. C-C motif-ligand 2 inhibition with emapticap pegol (NOX-E36) in type 2 diabetic patients with albuminuria. Nephrol Dial Transplant. 2017;32(2):307–15. pmid:28186566
  86. 86. Giordano FA, Layer JP, Leonardelli S, Friker LL, Turiello R, Corvino D, et al. L-RNA aptamer-based CXCL12 inhibition combined with radiotherapy in newly-diagnosed glioblastoma: dose escalation of the phase I/II GLORIA trial. Nat Commun. 2024;15(1):4210. pmid:38806504
  87. 87. Schwoebel F, van Eijk LT, Zboralski D, Sell S, Buchner K, Maasch C, et al. The effects of the anti-hepcidin Spiegelmer NOX-H94 on inflammation-induced anemia in cynomolgus monkeys. Blood. 2013;121(12):2311–5. pmid:23349391
  88. 88. Loh LK, Bartulovic M. Efficient coding hypothesis and an introduction to information theory. Homayoun Shahri. 2014. http://users.ece.cmu.edu/pgrover/teaching/files/InfoTheoryEfficientCodingHypothesis.pdf
  89. 89. Wu S, Dredze M. Are all languages created equal in multilingual BERT? arXiv preprint 2020.
  90. 90. Jaegle A, Gimeno F, Brock A, Vinyals O, Zisserman A, Carreira J. Perceiver: general perception with iterative attention. In: International conference on machine learning. 2021. p. 4651–64.
  91. 91. Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems. 2022;35:23716–36.
  92. 92. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S. Learning transferable visual models from natural language supervision. In: International conference on machine learning. 2021. p. 8748–63.
  93. 93. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. Advances in neural information processing systems. 2024;36.