Short tandem repeat variants are possibly associated with RNA secondary structure and gene expression

Nick Kinney; Dikshya Pathak; Emma Evans; Paola Arias

doi:10.1371/journal.pone.0326355

Abstract

Short tandem repeats (STRs) are abundant in the human genome with approximately 300,000 embedded in gene introns, exons, and untranslated regions. High penetrance STR variants cause human diseases such as myotonic dystrophy, Baratela-Scott syndrome, and various ataxias. The possibility that STRs contribute to polygenic disease is supported by recent high-powered datasets that link STRs to more subtle effects on gene expression. Indeed, STR variants can induce Z-DNA and H-DNA folding; alter nucleosome positioning; and change the spacing of DNA binding sites. On the other hand, little is known about how STR variants affect RNA secondary structure and accessibility. These factors could affect rates of splicing, nuclear export, and translation. We hypothesize that effects on RNA structure can be predicted using computational tools and associated with gene expression using DNA and RNA sequencing data. We test this hypothesis using data from the 1000 Genomes Project and ViennaRNA. We identify 17,255 transcribed STRs that affect RNA folding (fSTRs); 356 are possibly associated with gene expression. We characterize fSTRs by repeat motif, length, and gene level annotation. Transcribed fSTR variants tend to affect RNA multiloops and external loops. Effects on RNA accessibility depend on the repeat motif: a surprising result that is checked against simulation. These results shed light on how transcribed STRs affect RNA structure and pave the way for experimental validation.

Citation: Kinney N, Pathak D, Evans E, Arias P (2025) Short tandem repeat variants are possibly associated with RNA secondary structure and gene expression. PLoS One 20(6): e0326355. https://doi.org/10.1371/journal.pone.0326355

Editor: Karthikeyan Thiyagarajan, Borlaug Institute for South Asia-CIMMYT, INDIA

Received: February 22, 2025; Accepted: May 28, 2025; Published: June 18, 2025

Copyright: © 2025 Kinney et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: We used freely available Python packages to perform our analysis. In addition to those already mentioned, we use sklearn to perform affinity propagation clustering on RNA structural similarity scores. We use the force directed RNA (forna) web interface to produce secondary structure plots of select STRs (71): http://rna.tbi.univie.ac.at/forna. All other plots were prepared with plotnine and pillow for Python. Data and code used for manuscript preparation are freely available as supplementary material and online: https://github.com/nkinney06/.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Short tandem repeats (STRs) are hotspots for human genetic variation [1]. Their repetitive sequence motifs (1–6 base pairs) are prone to strand slip replication and unequal crossing over, which tend to increase or decrease the STR array length [1,2]. Indeed, STRs have been used for decades as markers in forensic and population analysis [3,4]. Approximately 300,000 STRs are embedded in human gene introns, exons, and untranslated regions (UTRs); consequently, variation in these regions is possibly associated with differential gene expression across human populations [5,6]. In fact, this hypothesis has recently been supported and reproduced by integrating data from DNA and RNA sequencing [7,8].

In 2015 and 2019, a pair of studies used variance partitioning to survey the human genome for STRs associated with gene expression [7,8]. The first study identified 2,060 expression STRs (eSTRs). The second study identified 28,375 eSTRs and recapitulated many of the 2,060 identified in 2015 [7]. The discovery of correlations between eSTR array length and gene expression provides a measure of validation for past and future studies of STRs in complex disease. In fact, several studies prior to 2015 reported links between various cancers and STR variation [9,10]. Since then, STR variation has been investigated in several additional cancer types [11,12] and autism spectrum disorder [13]. These breakthroughs paved the way for dedicated catalogues of STR variation [14–16]. In particular, WebSTR provides a catalogue of genome-wide STR variation in humans, and currently contains data for approximately 1.7 million unique regions [14].

The idea that STRs can affect gene expression is not surprising. Specific STR variants are causative in various ataxias, Huntington’s disease, and fragile X syndrome [17]. These high impact examples have been known for decades; however, discovery of more subtle effects on gene expression has had to wait for large datasets with more statistical power. These data have helped link STR variation to complex traits including blood and lipid biomarkers as well as oxidative stress [5,18], and to the aforementioned studies of cancer and autism. The results suggest the possibility that STR variations can be leveraged for diagnostic purposes [19]. This hypothesis is supported by several studies of human cancer; in particular, colorectal and breast cancer [20,21]. So far, most of the attempts to leverage STRs for diagnostic purposes have used a polygenic risk approach with modest results [11,12,21].

The mechanisms that dictate how STR variants affect gene expression are diverse with some known and some unknown. Regardless of their position in the genome, STR variants can induce Z-DNA and H-DNA folding [22]; alter nucleosome positioning [22,23]; and change the spacing of DNA binding sites [22,24]. When STR variants are positioned in coding regions they have the additional capacity to affect protein folding. Due to the possibility of frameshift, STRs embedded in coding regions are under unique selective pressure that favors insertion and deletion (indel) factors of three [25–27]. In addition, those coding for hydrophilic amino acids are over-represented [27]. Indeed, polyglutamine variants are among the most common of the repeat expansion disorders [28,29].

Relatively few studies have investigated how transcribed STR variants affect RNA structure [30–32]. This is important because a precedent has been set that links RNA structure to gene expression in humans. RNA sequence (primary structure) can affect translational speed and accuracy when the transcript’s 5’ end is enriched with rare, slowly-translated codons [33–35]. The folding of RNA into hairpins, loops, and other structural motifs (secondary structure) can affect how the RNA interacts with proteins, ribosomes, and other RNAs [36,37]. However, links between STR variants and possible effects on RNA secondary structure are understudied. We hypothesize that some transcribed STRs affect RNA secondary structure, which in turn is associated with gene expression. If supported, this hypothesis would contribute to what is known about how STRs affect differential gene expression across populations and disease states.

We use a data integration approach to test our hypothesis. Briefly, STR variants are identified from samples in the 1000 Genomes Project [38]. We focus on transcribed variants found in intron, exon, UTR, and coding regions. Next, we use the ViennaRNA package to predict the secondary structure of each variant [39]. To identify STRs that affect RNA folding (fSTRs) we cluster each collection of secondary structures using bpRNA-align [40,41]. Briefly, bpRNA-align uses a state-of-the-art global structural alignment algorithm to improve clustering performance over a broad range of structure types [41]. Finally, fSTRs are tested for association with gene expression using 462 human lymphoblastoid cell line samples created by the Geuvadis consortium [42]. We characterize fSTRs by motif length, gene level annotation, and effects on RNA folding. We discuss our results in the context of recent STR studies and suggest future lines of inquiry.

Results

Transcribed STRs are possibly associated with gene expression

The overall goals of this study are threefold: (a) identify STR variants that affect RNA folding (fSTRs); (b) establish an association between fSTRs and gene expression; and (c) characterize the effects of fSTR variants on RNA folding. To begin, we identify STR variants in 2,529 samples from the 1000 Genomes Project [38]. Variants for each transcribed sequence – including 50 bp of 3’ and 5’ sequence – were folded with ViennaRNA [39]. Secondary structures were compared with bpRNA-align and affinity propagation clustering [40,41]. Changes in RNA structure were indicated by clustering results in excess of one (Fig 1, left panel). We identify 17,255 fSTRs. A representative fSTR in an intron of SH2B3 has five variants with transcribed RNA sequences forming two clusters (Fig 1, left panel).

Fig 1. Short tandem repeat (STR) variants affect RNA secondary structure and are possibly associated with gene expression.

(left) The effect of STR variants on RNA folding (fSTRs) was inferred by comparing secondary structures with bpRNA-align and affinity propagation clustering. Five variants of a penta-repeat (TGGGG) in an SH2B3 intron fall into two clusters (A and B). (right) Gene expression (rpkm) values for a collection of samples are grouped by genotype after mapping each variant to its cluster assignment. Since each sample has two alleles there are three combinations of cluster assignments (independent axis). We perform a test of the null: no difference in gene expression between groups. The null is rejected (p < .01) suggesting an association between RNA folding and SH2B3 expression.

https://doi.org/10.1371/journal.pone.0326355.g001

Associations with gene expression were checked using RNAseq data for a subset of the samples. We use the 462 human lymphoblastoid cell line samples created by the Geuvadis consortium [42]. The analysis was performed in three steps. First, variants from those samples were mapped to their corresponding cluster assignments. Second, expression for genes harboring fSTRs was grouped by genotype; i.e., a pair of variants mapped to cluster assignments (Fig 1, right panel). Third, we perform a test of the null: no difference in gene expression between groups. The null is rejected for 356 fSTRs, suggesting an association with gene expression. Cluster assignments for an fSTR in SH2B3 show significant differences in gene expression (Fig 1, right panel).
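The first two steps can be illustrated with a minimal Python sketch. It assumes hypothetical inputs that are not taken from our repository: a mapping from allele length to cluster label, a pair of alleles per sample, and one expression value per sample. All names and values are illustrative only.

```python
from collections import defaultdict

def group_expression_by_cluster_genotype(allele_to_cluster, sample_alleles, sample_rpkm):
    """Group expression values by the unordered pair of cluster labels of each sample."""
    groups = defaultdict(list)
    for sample, (allele_1, allele_2) in sample_alleles.items():
        if sample not in sample_rpkm:
            continue  # sample has no expression measurement
        if allele_1 not in allele_to_cluster or allele_2 not in allele_to_cluster:
            continue  # allele was not part of the clustering
        # Sort the two labels so that A/B and B/A fall into the same group.
        labels = sorted((allele_to_cluster[allele_1], allele_to_cluster[allele_2]))
        groups["/".join(labels)].append(sample_rpkm[sample])
    return dict(groups)

# Hypothetical toy data: five alleles in two clusters, five samples, four with expression.
allele_to_cluster = {20: "A", 25: "A", 30: "B", 35: "B", 40: "B"}
sample_alleles = {"S1": (20, 25), "S2": (20, 30), "S3": (35, 40),
                  "S4": (30, 25), "S5": (20, 20)}
sample_rpkm = {"S1": 10.0, "S2": 12.5, "S3": 15.0, "S4": 13.0}

groups = group_expression_by_cluster_genotype(allele_to_cluster, sample_alleles, sample_rpkm)
```

With two clusters this yields at most three groups (A/A, A/B, B/B), which are the groups compared in the third step.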

fSTRs are overrepresented in coding regions

We reiterate the discovery criteria for a single fSTR: affinity propagation of its transcribed variants forms two or more clusters. The 66,876 transcribed STRs investigated in this study revealed 17,255 (25.8%) fSTRs. However, this may be an underestimate for two reasons. First, we only considered variants identified in the 1000 Genomes Project. Additional STR variants would likely be found with a larger sample size. Undoubtedly, some of the single cluster results would form multiple clusters with these additional variants. Second, STRs lacking variation in the 1000 Genomes Project samples were excluded from analysis: without variation there is no suitable test of the null. A larger set of samples would likely reveal variation in some of the excluded STRs and the discovery of additional fSTRs.

Characterization of fSTRs by gene level annotation reveals overrepresentation in coding regions (Fig 2a). This result is intriguing when paired with characterization of fSTRs by motif length. It comes as no surprise that effects on RNA structure increase with motif length; indeed, motif lengths greater than one are overrepresented (Fig 2b). However, coding regions are known to favor motif lengths of 3 and 6 to avoid frameshift. Apparently, coding regions are under far greater selective pressure to avoid frameshifts than fSTRs. If this were not the case, the unit one motifs – underrepresented among fSTRs – would outnumber unit three motifs in coding regions.

Fig 2. Characterization of fSTRs by gene level annotation, unit length, and repeat motif.

(a) fSTRs are overrepresented in coding regions. (b) effects on RNA structure tend to increase with motif length. (c) characterization of fSTRs by sequence motif. (+) over-representation among all fSTRs. (-) under-representation among all fSTRs.

https://doi.org/10.1371/journal.pone.0326355.g002

Characterization of fSTRs by unit is harder to interpret (Fig 2c). The well-known CAG motif (listed as its equivalent ACG motif) is conspicuously associated with fSTRs, but so too are many other motifs. Taken as a negative result, one interpretation is that any motif has the capacity to affect RNA folding.

fSTR motifs affect RNA accessibility

RNA accessibility may be important for protein binding, rates of splicing, nuclear export, and translation. We characterize how fSTRs affect accessibility of minimum free energy (MFE) RNA structures using ViennaRNA. Briefly, the core prediction algorithm uses dynamic programming to predict base paired and unpaired regions within single stranded RNA. To infer accessibility, we tally unpaired bases for fSTRs and stratify the results by allele length and repeat motif (Fig 3). Results of two types are obtained: (a) accessibility increases with allele length; (b) accessibility decreases with allele length. Examples of both types are shown in Fig 3a and 3b, respectively. Although strong examples were found for both types of association, accessibility varies substantially for fSTR alleles of fixed length regardless of motif. Thus, RNA accessibility probably depends on the fSTR allele length as well as the sequence context 5’ and 3’ to the actual repeat motif.
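The tally of unpaired bases can be illustrated with a short Python sketch that operates on dot-bracket strings, where a dot marks an unpaired base. The function names and the toy records are hypothetical; the allele lengths attached to each structure are labels chosen for illustration.

```python
from collections import defaultdict

def is_balanced(dot_bracket):
    """Return True if every '(' in the dot-bracket string has a matching ')'."""
    depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def count_unpaired(dot_bracket):
    """Tally unpaired bases ('.') in a well-formed dot-bracket string."""
    if not is_balanced(dot_bracket):
        raise ValueError("unbalanced dot-bracket string")
    return dot_bracket.count(".")

def mean_unpaired_by_length(records):
    """Average the unpaired tally over (allele_length, dot_bracket) records, per length."""
    tallies = defaultdict(list)
    for allele_length, dot_bracket in records:
        tallies[allele_length].append(count_unpaired(dot_bracket))
    return {length: sum(values) / len(values) for length, values in tallies.items()}

# Hypothetical structures for two allele lengths.
records = [(10, "((((...))))..."), (10, "(((.....)))..."), (12, "((((....))))....")]
summary = mean_unpaired_by_length(records)
```

Stratifying the same tally by repeat motif only requires a different grouping key.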

Fig 3. Effects of fSTR variants on RNA accessibility for MFE secondary structures.

Accessibility is inferred from the tally of unpaired bases using ViennaRNA. (a) Accessibility increases with allele length for poly-A repeats: r = 0.066, p = 0. (b) Accessibility decreases with allele length for poly-AT repeats: r = −0.217, p = 0. (c) Accessibility increases with allele length for non-reverse complementary repeats: r = 0.017, p = 1.3e-10. (d) Accessibility decreases with allele length for reverse complementary sequences: r = −0.214, p = 0.

https://doi.org/10.1371/journal.pone.0326355.g003

To further characterize RNA accessibility, we investigate possible associations with repeat length and unit. Associations of this type are hard to pin down with one exception. Sequences serving as their own reverse complement tend to decrease accessibility as allele length increases. For example, the reverse of poly-AT (poly-TA) is complementary to the original poly-AT motif (Fig 3b). We speculate that such sequences – which have the ability to base pair with themselves – cause a decrease in transcribed RNA accessibility. To test this, we aggregated all non-reverse complementary and reverse complementary sequence motifs. Indeed, the former sequence motifs show a positive correlation with allele length (Fig 3c; r = 0.017, p = 1.3e-10) while the latter have a negative correlation (Fig 3d; r = −0.214, p = 0).
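Whether a motif can pair with itself can be checked in a few lines of Python. This is a sketch under one stated assumption: because a tandem array has no fixed starting point, a motif is treated as reverse complementary when its reverse complement is a cyclic rotation of the motif. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]

def is_self_complementary_repeat(motif):
    """True if the reverse complement of the motif is a rotation of the motif.

    A string of equal length is a rotation of the motif exactly when it
    occurs inside the motif written twice.
    """
    motif = motif.upper()
    return reverse_complement(motif) in motif + motif

# Poly-AT pairs with itself; poly-A, CAG, and TGGGG do not.
flags = {m: is_self_complementary_repeat(m) for m in ("AT", "ATTA", "A", "CAG", "TGGGG")}
```

Dropping the rotation step and comparing the motif to its reverse complement directly gives the stricter definition.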

fSTRs tend to affect RNA multiloops and external loops

The effects of fSTRs on MFE RNA folding are characterized by comparing secondary structure motifs using bpRNA and bpRNA-align [40,41]. Briefly, the per base secondary structure assignments are aligned for each pair of variants belonging to an fSTR. Mismatching structural motifs are tallied over pairs of alleles. Tallies are visualized as a matrix with row sums normalized to 100% and columns indicating the frequency of mismatch with all other motifs. Over 15% of RNA multiloops (M) and external loops (X) are affected (Fig 4a), and they are frequently exchanged with one another. Changes to bulge motifs (B) are also common (red off-diagonal in Fig 4a). Interestingly, no changes are prohibited. Dangling end motifs (E) were rarely exchanged for multiloops (M), with the former being altered in less than 3% of the bases tallied (Fig 4a).
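The tally and row normalization can be sketched in Python as follows. The sketch assumes that the two per base annotation strings are already aligned to equal length with "-" marking gaps, that labels follow the single-letter codes of Fig 4, and that identical labels are counted as well so that the diagonal holds the unchanged fraction. Function names and the toy strings are hypothetical.

```python
from collections import Counter

LABELS = "LRIBHMXE"  # single-letter structural motif codes used in Fig 4

def tally_changes(aligned_a, aligned_b, counts=None):
    """Count per base label pairs (label in a, label in b), skipping gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned annotations must have equal length")
    counts = Counter() if counts is None else counts
    for label_a, label_b in zip(aligned_a, aligned_b):
        if label_a == "-" or label_b == "-":
            continue
        counts[(label_a, label_b)] += 1
    return counts

def row_normalize(counts, labels=LABELS):
    """Convert raw tallies to percentages so that each non-empty row sums to 100."""
    matrix = {}
    for row in labels:
        total = sum(counts[(row, col)] for col in labels)
        matrix[row] = {col: (100.0 * counts[(row, col)] / total if total else 0.0)
                       for col in labels}
    return matrix

# Hypothetical pair of alleles: half of the external loop bases become multiloop bases.
counts = tally_changes("XXXXLLHHHRRE", "XXMMLLHHHRRE")
matrix = row_normalize(counts)
```

Passing the same Counter to repeated calls of tally_changes accumulates tallies over all pairs of alleles before normalization.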

Fig 4. MFE secondary structure changes tallied for pairs of fSTR alleles and normalized by row: left-handed stem (L), right-handed stem (R), internal loops (I), bulges (B), hairpins (H), multiloops (M), external loops (X), and ends (E).

(a,b) Multiloops and external loops are frequently exchanged due to fSTR insertions. (e) The same exchange is seen for non-reverse complementary sequences. (c) The AT motif – which base pairs with itself – shifts towards right (R) and left-handed stem (L) motifs. (f) The same shift is seen for reverse complementary fSTRs. (d) CAG repeats conserve right-handed stems (R), left-handed stems (L), and ends (E) while departing from other structural motifs.

https://doi.org/10.1371/journal.pone.0326355.g004

For reverse complementary sequences (see previous section) we notice a many to one shift towards left (L) and right-handed (R) stem motifs: these columns are mostly red for reverse complementary motifs (Fig 4f). We see a shift away from multiloops (M) and external loops (X) suggesting a link between some fSTRs and gene expression. Indeed, multibranch loops (M) are hubs of interaction within RNA. In fact, this is precisely the difference seen between the clusters of variants for the fSTR embedded in SH2B3 (Fig 1a). However, that particular repeat is not reverse complementary. Of course, the suggested links between DNA motif, RNA folding, and gene expression should be interpreted as preliminary associations and not causation.

Simulations recapitulate effects of STR variants on RNA structure

The effects of reverse complementary sequences were verified using a simulation-based approach. This is important for two reasons. First, sequences 5’ and 3’ to repeat variants may influence RNA folding as seen in experiment. Second, singleton motifs are overrepresented in the experimental data. Accessibility was tested on 10,000 simulated STR alleles. In each case, 5’ and 3’ sequence context was randomized. Non-reverse complementary (Fig 5a) and reverse complementary (Fig 5c) motifs were sampled randomly. The results recapitulate the experimental data in Fig 3c and 3d, respectively.
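A simulated allele with randomized sequence context can be generated as in the following Python sketch. The flank length of 50 nt is an assumption borrowed from the 50 bp of flanking sequence used for the real variants, and the uniform base composition, function names, and seed are likewise illustrative. Folding the resulting sequences is not shown.

```python
import random

BASES = "ACGU"

def random_flank(length, rng):
    """Draw a random RNA sequence with uniform base composition."""
    return "".join(rng.choice(BASES) for _ in range(length))

def simulate_allele(motif, n_repeats, flank_length=50, rng=None):
    """Build 5' flank + repeat array + 3' flank with independently randomized flanks."""
    if rng is None:
        rng = random.Random()
    return (random_flank(flank_length, rng)
            + motif * n_repeats
            + random_flank(flank_length, rng))

# Ten alleles of a reverse complementary motif (AU) with 5 to 14 repeat units.
rng = random.Random(0)
alleles = [simulate_allele("AU", n, rng=rng) for n in range(5, 15)]
```

Reusing one pair of flanks across several repeat counts, rather than redrawing them, gives simulated indel variants with a fixed sequence context.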

Fig 5. The effects of fSTR variants on MFE RNA structures were verified using a simulation-based approach.

(a) accessibility increases with motif length for non-reverse complementary sequences. (b) changes in secondary structure for non-reverse complementary sequences. (c) accessibility decreases with motif length for reverse complementary sequences. (d) changes in secondary structure for reverse complementary sequences.

https://doi.org/10.1371/journal.pone.0326355.g005

Effects on secondary structure used a similar approach. Motifs were sampled randomly. Changes in RNA secondary structure were tallied for five simulated indel variants while keeping the 5’ and 3’ sequence context fixed. Simulations for reverse complementary sequences (Fig 5d) recapitulate experimental data (Fig 4f). However, the remaining sequences (Fig 5b) do not recapitulate experimental data (Fig 4e). The difference undoubtedly stems from the aforementioned over-representation of singleton motifs in experiment. Interestingly, the different motifs have little effect on secondary structure in simulation (Fig 5b and 5d). Apparently, reverse complementary sequences affect RNA accessibility (but not structure) while singleton motifs affect RNA secondary structure (but not accessibility).

Discussion

Our results support the hypothesis that some STR variants affect RNA secondary structure and gene expression. Support is provided by several lines of evidence. First, we fold and cluster variants for 66,876 transcribed STRs from the 1000 Genomes Project using ViennaRNA and bpRNA-align. We find that 17,255 affect RNA folding (fSTRs). Interestingly, fSTRs are enriched in coding regions and specific 3-mers which conspicuously include the CAG repeat motif (Fig 2c). Although the collection of 17,255 fSTRs was discovered using computational tools, we emphasize that only real variants identified in 1000 Genomes Project samples were used for the analysis. Next, we infer effects on gene expression using RNAseq. Briefly, we map fSTR clusters to RPKM values for each sample and perform a test of the null: no association between cluster assignment and RPKM. Association is supported for 356 fSTRs. These include 13 in coding regions: SAAL1, ZNF384, TSC22D1, MEF2A, C16orf71, TOX3, ERN1, NADK, PTPN18, GIGYF2, USF3, TRERF1, and AK9.

Not to be lost in our results is the approach itself. We demonstrate a novel way to study STR variation using state of the art tools. ViennaRNA is widely regarded as the best in class for predicting RNA secondary structure and bpRNA-align is a recent addition that shows improvement in clustering performance over a broad range of structure types [39,41]. This approach could be extended to study other classes of repetitive DNA such as palindromes and terminal inverted repeats. Indeed, similar approaches have been used – with an older set of tools – to study the effects of single nucleotide polymorphisms (SNPs) on the structure of transcribed UTRs and RNA in general [43–45]. Most of the novelty we introduce lies in mapping the bpRNA-align cluster assignments to variants possessed by each sample; a critical step that enables RPKM association testing.

Our approach is easily extended to the study of disease provided both DNA and RNA sequencing data are available. This is certainly the case for many samples in The Cancer Genome Atlas (TCGA) and database of genomes and phenomes (dbGaP). However, the idea that RNA folding alone is sufficient to explain high impact STR variants should be approached with skepticism. Those that are known have catastrophic effects on protein structure (such as Huntington’s) or chromosome structure (such as fragile X), but not RNA structure. In other cases, epigenetic modifications (such as CpG methylation) may overshadow the effects of array length polymorphisms by silencing genes prior to transcription altogether. It is more reasonable to conclude that RNA structure alterations have modest effects on rates of transcription, translation, and splicing.

Beyond splicing, RNA secondary structure influences post-transcriptional gene regulation, particularly when variants occur in untranslated regions (UTRs) or coding sequences [31,46]. Variants in the 5′ UTR may modulate translation initiation while those in coding may affect elongation rates. Variants in the 3′ UTR may impact transcript stability or localization by disrupting motifs for RNA-binding proteins. Future work integrating ribosome profiling, RNA stability assays, and RNA binding protein mapping will help clarify and validate the broader functional consequences of fSTRs.

In contrast, STR variation and its influence on RNA structure could play a larger role in prokaryotes where transcription and translation are spatially and temporally linked. In fact, two processes unique to prokaryotes provide a precedent. Attenuation is a well-established mechanism that leverages codon repeats to regulate transcription via mutually exclusive RNA secondary structures [47,48]. Possibly any STR variation that alters RNA secondary structure could influence the rate of transcription or lead to its termination altogether. While this is just a hypothesis, it may be experimentally tractable. A second process – bacterial phase variation – leverages STR mutation rates for semi-random dichotomous phenotype variation [49,50]. Although phase variation has more to do with DNA structure than RNA structure, it emphasizes the complex role of STR variation on phenotype.

To validate our computational predictions regarding RNA-protein interactions and translation efficiency, several experimental techniques could be employed. Cross-linking immunoprecipitation (CLIP) methods, such as HITS-CLIP or iCLIP, allow for transcriptome-wide mapping of protein binding sites on RNA at nucleotide resolution [51]. Applying CLIP to our system would test whether predicted RNA variants alter protein binding in vivo. Similarly, SHAPE-seq and DMS-seq could provide experimental insight into RNA secondary structure changes caused by fSTR variants [52,53]. For translation efficiency, ribosome profiling (Ribo-seq) offers a powerful means to assess ribosome occupancy along transcripts [54]. Comparing ribosome footprint density across transcript variants could determine if fSTR variants influence translation in vivo. When used in parallel with RNA-seq from the same samples, Ribo-seq also enables calculation of translational efficiency ratios, providing a direct test of our predictions. Together, these approaches offer complementary validation strategies that could substantiate the functional effects of fSTRs proposed in this study.

We suggest further lines of inquiry to investigate the effects of STR variation on RNA and DNA structure. The secondary structure of DNA may affect rates of transcription and protein interactions: both precursors to gene expression. Prediction of Z-DNA, H-DNA, and cruciform DNA are obvious starting points; but, newer tools offer a more sophisticated approach to DNA structure prediction. Deep DNAshape predicts up to a dozen intra-base and inter-base features which could shed light on how STR variation affects transcription factor binding and DNA-protein binding at large [55]. RhoFold uses a language model based deep-learning approach to predict the 3D structure of RNA which could extend our analysis of secondary structure to tertiary structure [56]. Likewise, tools for predicting ramp sequences could provide a starting point for linking STR variation to translation rate and fidelity [34,35].

Methods

Overall approach

Our overall hypothesis is that some STRs affect RNA folding (fSTRs) which in turn is associated with differential gene expression in human populations. A test of our hypothesis unfolds in two parts. First, we identify which (if any) of the 66,876 transcribed STRs in the human genome have the capacity to affect RNA folding (secondary structure). To do this, we use the ViennaRNA package to predict secondary structures and score their differences with bpRNA-align. We find 17,255 fSTRs which we characterize by repeat length, repeat motif, and functional annotation. Details of RNA folding and clustering are provided below. Next, we identify which (if any) of the fSTRs are possibly associated with gene expression (Fig 6).

Fig 6. Summary of overall approach.

(left) Variants for 66,876 transcribed STRs were identified in 2,529 samples from the 1000 Genomes Project. We used RepeatSeq, a standard STR variant caller. (middle) Variants for each STR were transcribed and folded using ViennaRNA. Secondary structures were assigned and clustered with bpRNA-align and affinity propagation clustering, respectively. (right) Effects on RNA folding were indicated by clustering results in excess of one. Cluster assignments were mapped to 462 RNA-seq samples: a subset of the original 2,529 samples. Associations with gene expression were established using a Tukey’s Honestly Significant Difference test.

https://doi.org/10.1371/journal.pone.0326355.g006

To check for association with gene expression we use a second set of 462 RNAseq samples. Alleles for each sample were mapped to their transcribed cluster assignments (see below). Differences in gene expression (measured as RPKM) across cluster assignments were assessed with a post-hoc Tukey’s Honestly Significant Difference (HSD) test. The test was conducted using the pairwise_tukeyhsd function in Python, with RPKM values as the dependent variable (endog) and group assignments based on allele clusters as the independent variable (groups). A significance threshold of α = 0.05 was applied to determine pairwise differences between groups. This approach allowed for the identification of statistically significant differences while controlling for multiple comparisons: 356 fSTRs were possibly associated with gene expression (Fig 6). Data and code used for manuscript preparation are freely available online: https://github.com/nkinney06/fSTRs.

RNA folding with ViennaRNA

The ViennaRNA package is a widely used software suite for predicting and analyzing RNA secondary structures [39]. Briefly, it employs thermodynamic models to predict the most probable secondary structure of an RNA sequence. The core prediction algorithm uses dynamic programming to find the minimum free energy (MFE) structure, which is considered the most stable structure according to the energy model. Details of the ViennaRNA algorithm can be found elsewhere.

Input to the package typically consists of a single RNA sequence or a set of aligned sequences. For single sequences, the RNAfold program can predict either the MFE structure or thermodynamic ensembles using the partition function approach.

In our case, STR variants are inferred from 1000 Genomes Project samples using Repeatseq [57]: http://github.com/adaptivegenome/repeatseq. Details of variant calling are provided below. Each variant is transcribed and saved in fasta format to serve as input to ViennaRNA. ViennaRNA provides dot bracket notation (.dbn) output for each variant. A list of dbn files serves as the starting point for bpRNA-align clustering.
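Preparing a variant for folding can be sketched in Python as follows. The sketch assumes that the variant is supplied on the sense strand as a left flank, a repeat motif, a repeat count, and a right flank. The function names, record name, and toy sequences are hypothetical and are not taken from our repository.

```python
def transcribe(dna):
    """Convert a sense-strand DNA sequence to RNA by replacing T with U."""
    return dna.upper().replace("T", "U")

def variant_to_fasta(name, left_flank, motif, n_repeats, right_flank):
    """Assemble flank + repeat array + flank and return one FASTA record as text."""
    sequence = transcribe(left_flank + motif * n_repeats + right_flank)
    return f">{name}\n{sequence}\n"

# Two hypothetical alleles of the same STR that differ by one repeat unit.
records = [variant_to_fasta(f"str1_n{n}", "AC", "TG", n, "GT") for n in (2, 3)]
fasta_text = "".join(records)
```

The resulting text can be written to a file and supplied to RNAfold, which reads sequences in FASTA format.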

Thermodynamic considerations

The use of MFE structures without taking into consideration thermodynamic ensembles for each variant may raise concerns about our methodology. In reality, each variant folds into an ensemble of structures approximately 1 k_BT around the MFE structures. It’s conceivable that the energy barrier between some MFE structures is less than 1 k_BT; consequently, the similar overlapping ensembles mitigate any biological effects. This possibility may increase the false positive rate for the 17,255 fSTRs, but not the 356 fSTRs possibly associated with gene expression. Indeed, strong associations with gene expression are inconsistent with weak energy barriers between ensembles.
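For a sense of scale, the thermal energy can be expressed in the molar units used for folding free energies. The Python sketch below assumes a temperature of 37 °C, the ViennaRNA default, and uses the gas constant in kcal/(mol·K). It is an illustration of the energy scale only and was not part of our analysis; function names are hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol * K)

def thermal_energy(temperature_celsius=37.0):
    """Return RT in kcal/mol, the molar equivalent of k_B T."""
    return GAS_CONSTANT * (temperature_celsius + 273.15)

def relative_boltzmann_weight(delta_g, temperature_celsius=37.0):
    """Weight of a structure delta_g kcal/mol above the MFE, relative to the MFE structure."""
    return math.exp(-delta_g / thermal_energy(temperature_celsius))

rt = thermal_energy()                   # roughly 0.62 kcal/mol at 37 °C
weight = relative_boltzmann_weight(rt)  # exp(-1), roughly 0.37
```

A structure lying one thermal energy unit above the MFE is therefore populated at about 37% of the MFE structure's weight, which is why structures within this window are treated as one overlapping ensemble.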

RNA clustering

We use bpRNA-align to compare structural differences between STR variants [40,41]. Details of bpRNA-align can be found elsewhere. Briefly, it is a recent contribution that uses a customized global (Needleman-Wunsch) dynamic programming approach. Per base mismatches are scored with a feature-specific substitution matrix and coupled with an inverted and context-specific affine gap penalty. The approach shows improvement in clustering performance over a broad range of structure types [41]. In our case, a list of dbn files (from ViennaRNA) serves as the starting point for bpRNA-align clustering. The output is a symmetric matrix of pairwise similarity scores for each variant. We use the matrix of similarity scores to cluster RNA secondary structures.
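The global alignment idea can be illustrated with a deliberately simplified Python sketch. It uses uniform match and mismatch scores and a linear gap penalty, not the feature-specific substitution matrix or the context-specific affine gap penalty of bpRNA-align, so its scores are not comparable to those used in our analysis. Function names, scoring values, and the toy structures are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for two annotation strings, computed row by row."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

def pairwise_similarity(structures):
    """Build the symmetric matrix of pairwise alignment scores."""
    n = len(structures)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = global_alignment_score(structures[i], structures[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix

# Hypothetical per base annotations for three variants of one STR.
structures = ["LLHHRR", "LLHRR", "LLHHRR"]
similarity = pairwise_similarity(structures)
```

A matrix of this form is the kind of precomputed similarity input that a clustering step can consume.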

Clustering was performed using affinity propagation [58]: the same approach used by the authors of bpRNA-align [41]. We use the AffinityPropagation function from sklearn with the precomputed bpRNA-align similarity matrix. Changes in RNA structure were indicated by clustering results in excess of one. We use a filter parameter to mitigate false discoveries. Briefly, entries in the bpRNA-align similarity matrix were compared for each cluster. Only clusters with differences in excess of 100 were considered for analysis.

RNA Sequencing (RNA-seq) Data Analysis

The RNA-seq data analysis unfolded in four steps [59]: (a) quality control and preprocessing, (b) alignment to the human reference genome, (c) read counting, and (d) differential expression analysis.

(a). Quality Control and Preprocessing. Quality assessment of the sequencing reads was performed using FastQC [60]. Commonly expected warnings, such as sequence duplication due to highly expressed transcripts and minor issues with tile quality, were disregarded. Similarly, K-mer content warnings arising from random priming were ignored, as our analysis focused on gene-level counts rather than alternative splicing or de novo gene structure inference [61].
(b). Alignment to the Human Reference Genome. Reads were aligned to the GRCh38 human reference genome using STAR (Spliced Transcripts Alignment to a Reference) [62]. This tool is optimized for handling reads with insertions and deletions. The alignment utilized GENCODE annotation release 33 (gencode.v33.annotation.gtf) to enhance accuracy.
(c). Read Counting. Gene-level read counts for each sample were generated using HTSeq [63]. Exon-level counts (--type=exon) were aggregated by gene ID (--idattr=gene_id) without strand specificity (--stranded=no). Counts were subsequently normalized to FPKM (fragments per kilobase of transcript per million mapped reads) using the countToFPKM package in R.
(d). Differential Expression Analysis. DESeq2 [64] was employed to identify differentially expressed genes between 89 African and 373 European samples. The analysis began with constructing a count matrix where rows represented genes and columns corresponded to individual samples. DESeq2 automatically estimated size factors, computed gene-level dispersion, and fitted a generalized linear model to identify significant differences.
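
The normalization in step (c) can be illustrated with the textbook FPKM definition. This is a minimal Python sketch, not the countToFPKM implementation, which runs in R and may apply additional corrections. The function name, gene identifiers, and example values are hypothetical, and the "million mapped reads" term is assumed here to be the total number of counted fragments in the sample.

```python
def fpkm(counts, lengths_bp):
    """Convert raw fragment counts for one sample to FPKM.

    counts     -- dict of gene_id -> fragments assigned to the gene
    lengths_bp -- dict of gene_id -> gene length in base pairs
    """
    # ASSUMPTION: library size is the sum of counted fragments in the sample.
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted fragments")
    # FPKM = count * 1e9 / (length in bp * total fragments)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Invented example: geneB is three times longer and has three times the counts.
example_counts = {"geneA": 500, "geneB": 1500}
example_lengths = {"geneA": 1000, "geneB": 3000}
example_fpkm = fpkm(example_counts, example_lengths)
```

In this example both genes receive the same FPKM, because the threefold difference in counts is explained entirely by the threefold difference in length.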

STR genotyping using RepeatSeq

Microsatellite genotypes were inferred from whole-genome sequencing data using RepeatSeq, a Bayesian framework specifically designed for genotyping tandem repeats from short-read sequencing datasets. RepeatSeq models PCR stutter noise, sequencing errors, and allele sampling to probabilistically call the most likely genotype at each locus. Input data consisted of aligned BAM files from the 1000 Genomes Project, which were processed according to the developers’ recommendations. Candidate repeat loci were specified in BED format, and reads overlapping these regions were extracted for analysis. For each locus and sample, RepeatSeq calculates genotype likelihoods by comparing observed read counts of repeat lengths to a stutter noise model fitted during analysis. The program reports maximum likelihood genotype calls as well as posterior probabilities, allowing for quality filtering in downstream analyses. Default parameters were used unless otherwise specified, with a minimum read coverage threshold applied to ensure reliability of calls. RepeatSeq has been used in previous studies and is freely available online: https://github.com/adaptivegenome/repeatseq. Additional details of STR genotyping are provided in our previous publications [5,6].
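
The idea of likelihood-based genotype calling can be illustrated with a deliberately simplified toy model in Python. It is not RepeatSeq's model: it assumes a uniform prior over genotypes, equal sampling of the two alleles, and a stutter model in which a read is off by exactly one repeat unit with a fixed probability. All names and parameter values are hypothetical.

```python
from itertools import combinations_with_replacement


def read_probability(observed, allele, stutter_rate):
    """P(observed repeat count | true allele) under a one-step stutter model."""
    if observed == allele:
        return 1.0 - stutter_rate
    if abs(observed - allele) == 1:
        return stutter_rate / 2.0
    return 0.0


def genotype_posteriors(read_lengths, stutter_rate=0.1):
    """Posterior over diploid genotypes given the repeat count seen in each read.

    ASSUMPTIONS: uniform prior over genotypes; each read comes from either
    allele with probability 0.5; candidate alleles span the observed range.
    """
    alleles = range(min(read_lengths), max(read_lengths) + 1)
    likelihoods = {}
    for a, b in combinations_with_replacement(alleles, 2):
        likelihood = 1.0
        for observed in read_lengths:
            from_a = read_probability(observed, a, stutter_rate)
            from_b = read_probability(observed, b, stutter_rate)
            likelihood *= 0.5 * from_a + 0.5 * from_b
        likelihoods[(a, b)] = likelihood
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("no genotype explains the reads under this model")
    return {genotype: value / total for genotype, value in likelihoods.items()}


# Invented reads: three support 10 repeats, three support 12, one shows 11.
reads = [10, 10, 10, 12, 12, 12, 11]
posteriors = genotype_posteriors(reads)
best_genotype = max(posteriors, key=posteriors.get)
```

With these invented reads the heterozygous genotype (10, 12) receives nearly all of the posterior mass, and the single read showing 11 repeats is absorbed as stutter. Posterior values of this kind are what allow low-confidence calls to be filtered downstream.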

Several recent variant callers, such as GangSTR [65], HipSTR [66], lobSTR [67], STRetch [68], TREDPARSE [69], and Dante [70], report similar or better accuracy than RepeatSeq when benchmarked on diverse datasets. Our use of RepeatSeq was justified in our previous publication. In particular, RepeatSeq was specifically designed and validated using data from the 1000 Genomes Project [57].

Samples

Samples used to identify fSTRs are described in previous publications. Briefly, these samples come from phase 3 of the 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/. In total, 2,529 samples were included for analysis: 667 African (AFR), 502 European (EUR), 352 American (AMR), 514 East Asian (EAS), and 494 South Asian (SAS). We use a second set of 462 RNA-seq samples for association testing of fSTR cluster assignments against RPKM values. These include 89 Africans and 373 Europeans. All samples are available through the European Bioinformatics Institute website: https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/samples/.

Statistical considerations

To evaluate pairwise differences between binned datapoints, we used Tukey’s Honestly Significant Difference (HSD) test, which is specifically designed for post-hoc comparisons following ANOVA. This method controls the family-wise error rate (FWER), reducing the likelihood of false positives that can arise from multiple testing. Tukey’s HSD achieves this by adjusting the significance threshold across all pairwise comparisons, ensuring that the overall probability of making one or more Type I errors remains at the specified alpha level (typically 0.05). As such, it provides a conservative and statistically robust approach to identify significant group differences while accounting for the multiple comparisons inherent in our analysis.

Multiple testing corrections can be applied both within and across families of tests. We chose to apply Tukey’s HSD test within each binned comparison group without an additional layer of correction across bins. This decision reflects our aim to identify localized effects of specific variants or sequence contexts, rather than to make broad claims about global significance across the entire dataset. Tukey’s HSD already controls the family-wise error rate for the multiple pairwise comparisons within each group, which are the relevant statistical units for our hypotheses. Furthermore, because each bin represents a biologically distinct context, we treat bins as independent analytical units rather than as components of a single multiple testing framework. As such, we interpret statistical significance conservatively and contextualize findings based on consistency across bins and biological plausibility, rather than relying solely on adjusted p-values for global inference.
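
The consequence of controlling the error rate within bins but not across them can be made concrete with a short calculation. The Python sketch below assumes the bins are statistically independent and that each is controlled at the same alpha; the function name and the bin counts are hypothetical and are not taken from the analysis.

```python
def family_wise_error_rate(alpha, n_families):
    """P(at least one false positive) across independent families at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_families


# Overall false positive probability for a few hypothetical numbers of bins,
# each controlled at alpha = 0.05 by Tukey's HSD.
overall_rates = {n: family_wise_error_rate(0.05, n) for n in (1, 5, 10)}
```

With a single bin the overall rate equals alpha; with ten independent bins it rises to roughly 0.40. This is why findings are interpreted in light of consistency across bins rather than as globally significant results.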

Supporting information

S1 File. Expanded characterization of STRs, fSTRs, and efSTRs.

Each is characterized by gene feature, sequence motif, and amino acid motif.

https://doi.org/10.1371/journal.pone.0326355.s001

(PDF)

References

1. Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. Sequencing and characterizing short tandem repeats in the human genome. Nature Reviews Genetics. 2024;1–16.
2. Gymrek M. A genomic view of short tandem repeats. Curr Opin Genet Dev. 2017;44:9–16. pmid:28213161
3. Wyner N, Barash M, McNevin D. Forensic Autosomal Short Tandem Repeats and Their Potential Association With Phenotype. Front Genet. 2020;11:884. pmid:32849844
4. Butler JM. New resources for the forensic genetics community available on the NIST STRBase website. Forensic Science International: Genetics Supplement Series. 2008;1:97–9.
5. Kinney N, Kang L, Bains H, Lawson E, Husain M, Husain K, et al. Ethnically biased microsatellites contribute to differential gene expression and glutathione metabolism in Africans and Europeans. PLoS One. 2021;16(3):e0249148. pmid:33765058
6. Kinney N, Kang L, Eckstrand L, Pulenthiran A, Samuel P, Anandakrishnan R, et al. Abundance of ethnically biased microsatellites in human gene regions. PLoS One. 2019;14(12):e0225216. pmid:31830051
7. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
8. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. pmid:26642241
9. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
10. Galindo CL, McCormick JF, Bubb VJ, Abid Alkadem DH, Li L-S, McIver LJ, et al. A long AAAG repeat allele in the 5’ UTR of the ERR-γ gene is correlated with breast cancer predisposition and drives promoter activity in MCF-7 breast cancer cells. Breast Cancer Res Treat. 2011;130(1):41–8. pmid:21153485
11. Velmurugan KR, Varghese RT, Fonville NC, Garner HR. High-depth, high-accuracy microsatellite genotyping enables precision lung cancer risk classification. Oncogene. 2017;36(46):6383–90. pmid:28759038
12. Rivero-Hinojosa S, Kinney N, Garner HR, Rood BR. Germline microsatellite genotypes differentiate children with medulloblastoma. Neuro Oncol. 2020;22(1):152–62. pmid:31562520
13. Mitra I, Huang B, Mousavi N, Ma N, Lamkin M, Yanicky R, et al. Genome-wide patterns of de novo tandem repeat mutations and their contribution to autism spectrum disorders. Cold Spring Harbor Laboratory; 2020. https://doi.org/10.1101/2020.03.04.974170
14. Lundström OS, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I, Anisimova M, et al. WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans. J Mol Biol. 2023;435(20):168260. pmid:37678708
15. Ruitberg CM, Reeder DJ, Butler JM. STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res. 2001;29(1):320–2. pmid:11125125
16. Uppili B, Faruq M. STRIDE-DB: a comprehensive database for exploration of instability and phenotypic relevance of short tandem repeats in the human genome. Database (Oxford). 2024;2024:baae020. pmid:38602506
17. Chintalaphani SR, Pineda SS, Deveson IW, Kumar KR. An update on the neurological short tandem repeat expansion disorders and the emergence of long-read sequencing diagnostics. Acta Neuropathol Commun. 2021;9(1):98. pmid:34034831
18. Margoliash J, Fuchs S, Li Y, Zhang X, Massarat A, Goren A, et al. Polymorphic short tandem repeats make widespread contributions to blood and serum traits. Cell Genom. 2023;3(12):100458. pmid:38116119
19. Yoon JG, Lee S, Cho J, Kim N, Kim S, Kim MJ. Diagnostic uplift through the implementation of short tandem repeat analysis using exome sequencing. European Journal of Human Genetics. 2024;1–4.
20. Nojadeh JN, Behrouz Sharif S, Sakhinia E. Microsatellite instability in colorectal cancer. EXCLI J. 2018;17:159–68. pmid:29743854
21. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
22. Bacolla A, Wells RD. Non-B DNA conformations as determinants of mutagenesis and human disease. Mol Carcinog. 2009;48(4):273–85. pmid:19306308
23. Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ. Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009;324(5931):1213–6. pmid:19478187
24. Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5(6):435–45. pmid:15153996
25. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nature Reviews Genetics. 2018;19:286–98.
26. Iennaco R, Formenti G, Trovesi C, Rossi RL, Zuccato C, Lischetti T, et al. The evolutionary history of the polyQ tract in huntingtin sheds light on its functional pro-neural activities. Cell Death Differ. 2022;29(2):293–305. pmid:34974533
27. Katti MV, Ranjekar PK, Gupta VS. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol. 2001;18(7):1161–7. pmid:11420357
28. Silva A, de Almeida AV, Macedo-Ribeiro S. Polyglutamine expansion diseases: More than simple repeats. J Struct Biol. 2018;201(2):139–54. pmid:28928079
29. Lieberman AP, Shakkottai VG, Albin RL. Polyglutamine Repeats in Neurodegenerative Diseases. Annu Rev Pathol. 2019;14:1–27. pmid:30089230
30. Wright SE, Todd PK. Native functions of short tandem repeats. Elife. 2023;12:e84043. pmid:36940239
31. Georgakopoulos-Soares I, Parada GE, Hemberg M. Secondary structures in RNA synthesis, splicing and translation. Comput Struct Biotechnol J. 2022;20:2871–84. pmid:35765654
32. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
33. Miller JB, Brandon JA, McKinnon LM, Sabra HW, Lucido CC, Murcia JDG. Ramp sequence may explain synonymous variant association with Alzheimer’s disease in the Paired Immunoglobulin-like Type 2 Receptor Alpha (PILRA). bioRxiv. 2025.
34. McKinnon LM, Miller JB, Whiting MF, Kauwe JSK, Ridge PG. A comprehensive analysis of the phylogenetic signal in ramp sequences in 211 vertebrates. Sci Rep. 2021;11(1):622. pmid:33436653
35. Miller JB, Meurs TE, Hodgman MW, Song B, Miller KN, Ebbert MTW, et al. Ramp atlas: facilitating tissue and cell-specific ramp sequence analyses through an intuitive web interface. NAR Genomics and Bioinformatics. 2022;4(2):lqac039. pmid:35664804
36. Tieng FYF, Abdullah-Zawawi M-R, Md Shahri NAA, Mohamed-Hussein Z-A, Lee L-H, Mutalib N-SA. A Hitchhiker’s guide to RNA-RNA structure and interaction prediction tools. Brief Bioinform. 2023;25(1):bbad421. pmid:38040490
37. Sanchez de Groot N, Armaos A, Graña-Montes R, Alriquet M, Calloni G, Vabulas RM, et al. RNA structure drives interaction with proteins. Nat Commun. 2019;10(1):3246. pmid:31324771
38. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020;48(D1):D941–7. pmid:31584097
39. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:1–14.
40. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 2018;46(11):5381–94. pmid:29746666
41. Lasher B, Hendrix DA. bpRNA-align: improved RNA secondary structure global alignment for comparing and clustering RNA structures. RNA. 2023;29(5):584–95. pmid:36759128
42. Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11. pmid:24037378
43. Ritz J, Martin JS, Laederach A. Evaluating our ability to predict the structural disruption of RNA by SNPs. BMC Genomics. 2012;13 Suppl 4(Suppl 4):S6. pmid:22759654
44. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat. 2013;34(4):546–56. pmid:23315997
45. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. The RNAsnp web server: predicting SNP effects on local RNA secondary structure. Nucleic Acids Res. 2013;41:W475-9. pmid:23630321
46. Kramer MC, Gregory BD. Does RNA secondary structure drive translation or vice versa? Nat Struct Mol Biol. 2018;25(8):641–3. pmid:30061597
47. Baumberg S. Prokaryotic gene expression. OUP Oxford; 1999.
48. Press MO, Hall AN, Morton EA, Queitsch C. Substitutions Are Boring: Some Arguments about Parallel Mutations and High Mutation Rates. Trends in Genetics. 2019;35:253–64.
49. van der Woude MW, Bäumler AJ. Phase and antigenic variation in bacteria. Clin Microbiol Rev. 2004;17(3):581–611. pmid:15258095
50. Henderson IR, Owen P, Nataro JP. Molecular switches--the ON and OFF of bacterial phase variation. Mol Microbiol. 1999;33(5):919–32. pmid:10476027
51. Ule J, Hwang H-W, Darnell RB. The Future of Cross-Linking and Immunoprecipitation (CLIP). Cold Spring Harb Perspect Biol. 2018;10(8):a032243. pmid:30068528
52. Watters KE, Abbott TR, Lucks JB. Simultaneous characterization of cellular RNA structure and function with in-cell SHAPE-Seq. Nucleic Acids Res. 2016;44(2):e12. pmid:26350218
53. Watters KE, Yu AM, Strobel EJ, Settle AH, Lucks JB. Characterizing RNA structures in vitro and in vivo with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Methods. 2016;103:34–48. pmid:27064082
54. Calviello L, Ohler U. Beyond Read-Counts: Ribo-seq Data Analysis to Understand the Functions of the Transcriptome. Trends Genet. 2017;33(10):728–44. pmid:28887026
55. Li J, Chiu T-P, Rohs R. Predicting DNA structure using a deep learning method. Nat Commun. 2024;15(1):1243. pmid:38336958
56. Shen T, Hu Z, Sun S, Liu D, Wong F, Wang J, et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat Methods. 2024;21(12):2287–98. pmid:39572716
57. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013;41(1):e32. pmid:23090981
58. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–6. pmid:17218491
59. Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, et al. Revisiting global gene expression analysis. Cell. 2012;151(3):476–82. pmid:23101621
60. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. https://cir.nii.ac.jp/crid/1370584340724053142.
61. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38(12):e131. pmid:20395217
62. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886
63. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9. pmid:25260700
64. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. pmid:25516281
65. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47(15):e90. pmid:31194863
66. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–2. pmid:28436466
67. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. pmid:22522390
68. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19(1):121. pmid:30129428
69. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet. 2017;101(5):700–15. pmid:29100084
70. Budiš J, Kucharík M, Ďuriš F, Gazdarica J, Zrubcová M, Ficek A, et al. Dante: genotyping of known complex and expanded short tandem repeats. Bioinformatics. 2019;35(8):1310–7. pmid:30203023

[ref1] 1. Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. Sequencing and characterizing short tandem repeats in the human genome. Nature Reviews Genetics. 2024;1–16.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Gymrek M. A genomic view of short tandem repeats. Curr Opin Genet Dev. 2017;44:9–16. pmid:28213161
View Article
PubMed/NCBI
Google Scholar

[5] View Article

[6] PubMed/NCBI

[7] Google Scholar

[ref3] 3. Wyner N, Barash M, McNevin D. Forensic Autosomal Short Tandem Repeats and Their Potential Association With Phenotype. Front Genet. 2020;11:884. pmid:32849844
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Butler JM. New resources for the forensic genetics community available on the NIST STRBase website. Forensic Science International: Genetics Supplement Series. 2008;1:97–9.
View Article
Google Scholar

[13] View Article

[14] Google Scholar

[ref5] 5. Kinney N, Kang L, Bains H, Lawson E, Husain M, Husain K, et al. Ethnically biased microsatellites contribute to differential gene expression and glutathione metabolism in Africans and Europeans. PLoS One. 2021;16(3):e0249148. pmid:33765058
View Article
PubMed/NCBI
Google Scholar

[16] View Article

[17] PubMed/NCBI

[18] Google Scholar

[ref6] 6. Kinney N, Kang L, Eckstrand L, Pulenthiran A, Samuel P, Anandakrishnan R, et al. Abundance of ethnically biased microsatellites in human gene regions. PLoS One. 2019;14(12):e0225216. pmid:31830051
View Article
PubMed/NCBI
Google Scholar

[20] View Article

[21] PubMed/NCBI

[22] Google Scholar

[ref7] 7. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
View Article
PubMed/NCBI
Google Scholar

[24] View Article

[25] PubMed/NCBI

[26] Google Scholar

[ref8] 8. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. pmid:26642241
View Article
PubMed/NCBI
Google Scholar

[28] View Article

[29] PubMed/NCBI

[30] Google Scholar

[ref9] 9. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
View Article
PubMed/NCBI
Google Scholar

[32] View Article

[33] PubMed/NCBI

[34] Google Scholar

[ref10] 10. Galindo CL, McCormick JF, Bubb VJ, Abid Alkadem DH, Li L-S, McIver LJ, et al. A long AAAG repeat allele in the 5’ UTR of the ERR-γ gene is correlated with breast cancer predisposition and drives promoter activity in MCF-7 breast cancer cells. Breast Cancer Res Treat. 2011;130(1):41–8. pmid:21153485
View Article
PubMed/NCBI
Google Scholar

[36] View Article

[37] PubMed/NCBI

[38] Google Scholar

[ref11] 11. Velmurugan KR, Varghese RT, Fonville NC, Garner HR. High-depth, high-accuracy microsatellite genotyping enables precision lung cancer risk classification. Oncogene. 2017;36(46):6383–90. pmid:28759038
View Article
PubMed/NCBI
Google Scholar

[40] View Article

[41] PubMed/NCBI

[42] Google Scholar

[ref12] 12. Rivero-Hinojosa S, Kinney N, Garner HR, Rood BR. Germline microsatellite genotypes differentiate children with medulloblastoma. Neuro Oncol. 2020;22(1):152–62. pmid:31562520
View Article
PubMed/NCBI
Google Scholar

[44] View Article

[45] PubMed/NCBI

[46] Google Scholar

[ref13] 13. Mitra I, Huang B, Mousavi N, Ma N, Lamkin M, Yanicky R, et al. Genome-wide patterns ofde novotandem repeat mutations and their contribution to autism spectrum disorders. Cold Spring Harbor Laboratory; 2020. https://doi.org/10.1101/2020.03.04.974170

[ref14] 14. Lundström OS, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I, Anisimova M, et al. WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans. J Mol Biol. 2023;435(20):168260. pmid:37678708
View Article
PubMed/NCBI
Google Scholar

[49] View Article

[50] PubMed/NCBI

[51] Google Scholar

[ref15] 15. Ruitberg CM, Reeder DJ, Butler JM. STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res. 2001;29(1):320–2. pmid:11125125
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref16] 16. Uppili B, Faruq M. STRIDE-DB: a comprehensive database for exploration of instability and phenotypic relevance of short tandem repeats in the human genome. Database (Oxford). 2024;2024:baae020. pmid:38602506
View Article
PubMed/NCBI
Google Scholar

[57] View Article

[58] PubMed/NCBI

[59] Google Scholar

[ref17] 17. Chintalaphani SR, Pineda SS, Deveson IW, Kumar KR. An update on the neurological short tandem repeat expansion disorders and the emergence of long-read sequencing diagnostics. Acta Neuropathol Commun. 2021;9(1):98. pmid:34034831
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref18] 18. Margoliash J, Fuchs S, Li Y, Zhang X, Massarat A, Goren A, et al. Polymorphic short tandem repeats make widespread contributions to blood and serum traits. Cell Genom. 2023;3(12):100458. pmid:38116119
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref19] 19. Yoon JG, Lee S, Cho J, Kim N, Kim S, Kim MJ. Diagnostic uplift through the implementation of short tandem repeat analysis using exome sequencing. European Journal of Human Genetics. 2024;1–4.
View Article
Google Scholar

[69] View Article

[70] Google Scholar

[ref20] 20. Nojadeh JN, Behrouz Sharif S, Sakhinia E. Microsatellite instability in colorectal cancer. EXCLI J. 2018;17:159–68. pmid:29743854
View Article
PubMed/NCBI
Google Scholar

[72] View Article

[73] PubMed/NCBI

[74] Google Scholar

[ref21] 21. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref22] 22. Bacolla A, Wells RD. Non-B DNA conformations as determinants of mutagenesis and human disease. Mol Carcinog. 2009;48(4):273–85. pmid:19306308
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref23] 23. Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ. Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009;324(5931):1213–6. pmid:19478187
View Article
PubMed/NCBI
Google Scholar

[84] View Article

[85] PubMed/NCBI

[86] Google Scholar

[ref24] 24. Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5(6):435–45. pmid:15153996
View Article
PubMed/NCBI
Google Scholar

[88] View Article

[89] PubMed/NCBI

[90] Google Scholar

[ref25] 25. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nature Reviews Genetics. 2018;19:286–98.
View Article
Google Scholar

[92] View Article

[93] Google Scholar

[ref26] 26. Iennaco R, Formenti G, Trovesi C, Rossi RL, Zuccato C, Lischetti T, et al. The evolutionary history of the polyQ tract in huntingtin sheds light on its functional pro-neural activities. Cell Death Differ. 2022;29(2):293–305. pmid:34974533
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref27] 27. Katti MV, Ranjekar PK, Gupta VS. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol. 2001;18(7):1161–7. pmid:11420357
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref28] 28. Silva A, de Almeida AV, Macedo-Ribeiro S. Polyglutamine expansion diseases: More than simple repeats. J Struct Biol. 2018;201(2):139–54. pmid:28928079
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref29] 29. Lieberman AP, Shakkottai VG, Albin RL. Polyglutamine Repeats in Neurodegenerative Diseases. Annu Rev Pathol. 2019;14:1–27. pmid:30089230
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref30] 30. Wright SE, Todd PK. Native functions of short tandem repeats. Elife. 2023;12:e84043. pmid:36940239
View Article
PubMed/NCBI
Google Scholar

[111] View Article

[112] PubMed/NCBI

[113] Google Scholar

[ref31] 31. Georgakopoulos-Soares I, Parada GE, Hemberg M. Secondary structures in RNA synthesis, splicing and translation. Comput Struct Biotechnol J. 2022;20:2871–84. pmid:35765654
View Article
PubMed/NCBI
Google Scholar

[115] View Article

[116] PubMed/NCBI

[117] Google Scholar

[ref32] 32. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
View Article
PubMed/NCBI
Google Scholar

[119] View Article

[120] PubMed/NCBI

[121] Google Scholar

[ref33] 33. Miller JB, Brandon JA, McKinnon LM, Sabra HW, Lucido CC, Murcia JDG. Ramp sequence may explain synonymous variant association with Alzheimer’s disease in the Paired Immunoglobulin-like Type 2 Receptor Alpha (PILRA). bioRxiv. 2025.
View Article
Google Scholar

[123] View Article

[124] Google Scholar

[ref34] 34. McKinnon LM, Miller JB, Whiting MF, Kauwe JSK, Ridge PG. A comprehensive analysis of the phylogenetic signal in ramp sequences in 211 vertebrates. Sci Rep. 2021;11(1):622. pmid:33436653
View Article
PubMed/NCBI
Google Scholar

[126] View Article

[127] PubMed/NCBI

[128] Google Scholar

[ref35] 35. Miller JB, Meurs TE, Hodgman MW, Song B, Miller KN, Ebbert MTW, et al. Ramp atlas: facilitating tissue and cell-specific ramp sequence analyses through an intuitive web interface. NAR Genomics and Bioinformatics. 2022;4(2):lqac039. pmid:35664804
View Article
PubMed/NCBI
Google Scholar

[130] View Article

[131] PubMed/NCBI

[132] Google Scholar

[ref36] 36. Tieng FYF, Abdullah-Zawawi M-R, Md Shahri NAA, Mohamed-Hussein Z-A, Lee L-H, Mutalib N-SA. A Hitchhiker’s guide to RNA-RNA structure and interaction prediction tools. Brief Bioinform. 2023;25(1):bbad421. pmid:38040490
View Article
PubMed/NCBI
Google Scholar

[134] View Article

[135] PubMed/NCBI

[136] Google Scholar

[ref37] 37. Sanchez de Groot N, Armaos A, Graña-Montes R, Alriquet M, Calloni G, Vabulas RM, et al. RNA structure drives interaction with proteins. Nat Commun. 2019;10(1):3246. pmid:31324771
View Article
PubMed/NCBI
Google Scholar

[138] View Article

[139] PubMed/NCBI

[140] Google Scholar

[ref38] 38. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020;48(D1):D941–7. pmid:31584097

[ref39] 39. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:1–14.

[ref40] 40. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 2018;46(11):5381–94. pmid:29746666

[ref41] 41. Lasher B, Hendrix DA. bpRNA-align: improved RNA secondary structure global alignment for comparing and clustering RNA structures. RNA. 2023;29(5):584–95. pmid:36759128

[ref42] 42. Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11. pmid:24037378

[ref43] 43. Ritz J, Martin JS, Laederach A. Evaluating our ability to predict the structural disruption of RNA by SNPs. BMC Genomics. 2012;13 Suppl 4(Suppl 4):S6. pmid:22759654

[ref44] 44. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat. 2013;34(4):546–56. pmid:23315997

[ref45] 45. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. The RNAsnp web server: predicting SNP effects on local RNA secondary structure. Nucleic Acids Res. 2013;41:W475–9. pmid:23630321

[ref46] 46. Kramer MC, Gregory BD. Does RNA secondary structure drive translation or vice versa? Nat Struct Mol Biol. 2018;25(8):641–3. pmid:30061597

[ref47] 47. Baumberg S. Prokaryotic gene expression. OUP Oxford; 1999.

[ref48] 48. Press MO, Hall AN, Morton EA, Queitsch C. Substitutions Are Boring: Some Arguments about Parallel Mutations and High Mutation Rates. Trends in Genetics. 2019;35:253–64.

[ref49] 49. van der Woude MW, Bäumler AJ. Phase and antigenic variation in bacteria. Clin Microbiol Rev. 2004;17(3):581–611. pmid:15258095

[ref50] 50. Henderson IR, Owen P, Nataro JP. Molecular switches--the ON and OFF of bacterial phase variation. Mol Microbiol. 1999;33(5):919–32. pmid:10476027

[ref51] 51. Ule J, Hwang H-W, Darnell RB. The Future of Cross-Linking and Immunoprecipitation (CLIP). Cold Spring Harb Perspect Biol. 2018;10(8):a032243. pmid:30068528

[ref52] 52. Watters KE, Abbott TR, Lucks JB. Simultaneous characterization of cellular RNA structure and function with in-cell SHAPE-Seq. Nucleic Acids Res. 2016;44(2):e12. pmid:26350218

[ref53] 53. Watters KE, Yu AM, Strobel EJ, Settle AH, Lucks JB. Characterizing RNA structures in vitro and in vivo with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Methods. 2016;103:34–48. pmid:27064082

[ref54] 54. Calviello L, Ohler U. Beyond Read-Counts: Ribo-seq Data Analysis to Understand the Functions of the Transcriptome. Trends Genet. 2017;33(10):728–44. pmid:28887026

[ref55] 55. Li J, Chiu T-P, Rohs R. Predicting DNA structure using a deep learning method. Nat Commun. 2024;15(1):1243. pmid:38336958

[ref56] 56. Shen T, Hu Z, Sun S, Liu D, Wong F, Wang J, et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat Methods. 2024;21(12):2287–98. pmid:39572716

[ref57] 57. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013;41(1):e32. pmid:23090981

[ref58] 58. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–6. pmid:17218491

[ref59] 59. Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, et al. Revisiting global gene expression analysis. Cell. 2012;151(3):476–82. pmid:23101621

[ref60] 60. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. https://cir.nii.ac.jp/crid/1370584340724053142.

[ref61] 61. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38(12):e131. pmid:20395217

[ref62] 62. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886

[ref63] 63. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9. pmid:25260700

[ref64] 64. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. pmid:25516281

[ref65] 65. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47(15):e90. pmid:31194863

[ref66] 66. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–2. pmid:28436466

[ref67] 67. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. pmid:22522390

[ref68] 68. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19(1):121. pmid:30129428

[ref69] 69. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet. 2017;101(5):700–15. pmid:29100084

[ref70] 70. Budiš J, Kucharík M, Ďuriš F, Gazdarica J, Zrubcová M, Ficek A, et al. Dante: genotyping of known complex and expanded short tandem repeats. Bioinformatics. 2019;35(8):1310–7. pmid:30203023
