Abstract
Short tandem repeats (STRs) are abundant in the human genome, with approximately 300,000 embedded in gene introns, exons, and untranslated regions. High penetrance STR variants cause human diseases such as myotonic dystrophy, Baratela-Scott syndrome, and various ataxias. The possibility that STRs contribute to polygenic disease is supported by recent high-powered datasets that link STRs to more subtle effects on gene expression. Indeed, STR variants can induce Z-DNA and H-DNA folding; alter nucleosome positioning; and change the spacing of DNA binding sites. On the other hand, little is known about how STR variants affect RNA secondary structure and accessibility. These factors could affect rates of splicing, nuclear export, and translation. We hypothesize that effects on RNA structure can be predicted using computational tools and associated with gene expression using DNA and RNA sequencing data. We test this hypothesis using data from the 1000 Genomes Project and ViennaRNA. We identify 17,255 transcribed STRs that affect RNA folding (fSTRs); 356 are possibly associated with gene expression. We characterize fSTRs by repeat motif, length, and gene level annotation. Transcribed fSTR variants tend to affect RNA multiloops and external loops. Effects on RNA accessibility depend on the repeat motif: a surprising result that is checked against simulation. These results shed light on how transcribed STRs affect RNA structure and pave the way for experimental validation.
Citation: Kinney N, Pathak D, Evans E, Arias P (2025) Short tandem repeat variants are possibly associated with RNA secondary structure and gene expression. PLoS One 20(6): e0326355. https://doi.org/10.1371/journal.pone.0326355
Editor: Karthikeyan Thiyagarajan, Borlaug Institute for South Asia-CIMMYT, INDIA
Received: February 22, 2025; Accepted: May 28, 2025; Published: June 18, 2025
Copyright: © 2025 Kinney et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: We used freely available Python packages to perform our analysis. In addition to those already mentioned, we use sklearn to perform affinity propagation clustering on RNA structural similarity scores. We use the force directed RNA (forna) web interface to produce secondary structure plots of select STRs [71]: http://rna.tbi.univie.ac.at/forna. All other plots were prepared with plotnine and pillow for Python. Data and code used for manuscript preparation are freely available as supplementary material and online: https://github.com/nkinney06/.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Short tandem repeats (STRs) are hotspots for human genetic variation [1]. Their repetitive sequence motifs (1–6 base pairs) are prone to strand-slippage replication and unequal crossing over, which tend to increase or decrease the STR array length [1,2]. Indeed, STRs have been used for decades as markers in forensic and population analysis [3,4]. Approximately 300,000 STRs are embedded in human gene introns, exons, and untranslated regions (UTRs); consequently, variation in these regions is possibly associated with differential gene expression across human populations [5,6]. In fact, this hypothesis has recently been supported and reproduced by integrating data from DNA and RNA sequencing [7,8].
In 2015 and 2019, a pair of studies used variance partitioning to survey the human genome for STRs associated with gene expression [7,8]. The first study identified 2,060 expression STRs (eSTRs). The second study identified 28,375 eSTRs and recapitulated many of the 2,060 identified in 2015 [7]. The discovery of correlations between eSTR array length and gene expression provides a measure of validation for past and future studies of STRs in complex disease. In fact, several studies prior to 2015 reported links between various cancers and STR variation [9,10]. Since then, STR variation has been investigated in several additional cancer types [11,12] and autism spectrum disorder [13]. These breakthroughs paved the way for dedicated catalogues of STR variation [14–16]. In particular, WebSTR provides a catalogue of genome-wide STR variation in humans, and currently contains data for approximately 1.7 million unique regions [14].
The idea that STRs can affect gene expression is not surprising. Specific STR variants are causative in various ataxias, Huntington’s disease, and fragile X syndrome [17]. These high impact examples have been known for decades; however, discovery of more subtle effects on gene expression has had to wait for large datasets with more statistical power. These data have helped link STR variation to complex traits including blood and lipid biomarkers and oxidative stress [5,18], in addition to the aforementioned studies of cancer and autism. The results suggest that STR variation could be leveraged for diagnostic purposes [19]. This hypothesis is supported by several studies of human cancer; in particular, colorectal and breast cancer [20,21]. So far, most of the attempts to leverage STRs for diagnostic purposes have used a polygenic risk approach with modest results [11,12,21].
The mechanisms by which STR variants affect gene expression are diverse; some are known and some are not. Regardless of their position in the genome, STR variants can induce Z-DNA and H-DNA folding [22]; alter nucleosome positioning [22,23]; and change the spacing of DNA binding sites [22,24]. When STR variants are positioned in coding regions they have the additional capacity to affect protein folding. Due to the possibility of frameshift, STRs embedded in coding regions are under unique selective pressure that favors insertions and deletions (indels) in multiples of three [25–27]. In addition, those coding for hydrophilic amino acids are over-represented [27]. Indeed, polyglutamine variants are among the most common of the repeat expansion disorders [28,29].
Relatively few studies have investigated how transcribed STR variants affect RNA structure [30–32]. This is important because there is precedent linking RNA structure to gene expression in humans. RNA sequence (primary structure) can affect translational speed and accuracy when the transcript’s 5’ end is enriched with rare, slowly-translated codons [33–35]. The folding of RNA into hairpins, loops, and other structural motifs (secondary structure) can affect how the RNA interacts with proteins, ribosomes, and other RNAs [36,37]. However, links between STR variants and possible effects on RNA secondary structure are understudied. We hypothesize that some transcribed STRs affect RNA secondary structure, which in turn is associated with gene expression. If supported, this hypothesis would contribute to what is known about how STRs affect differential gene expression across populations and disease states.
We use a data integration approach to test our hypothesis. Briefly, STR variants are identified from samples in the 1000 Genomes Project [38]. We focus on transcribed variants found in intron, exon, UTR, and coding regions. Next, we use the ViennaRNA package to predict the secondary structure of each variant [39]. To identify STRs that affect RNA folding (fSTRs) we cluster each collection of secondary structures using bpRNA-align [40,41], which uses a state-of-the-art global structural alignment algorithm to improve clustering performance over a broad range of structure types [41]. Finally, fSTRs are tested for association with gene expression using 462 human lymphoblastoid cell line samples created by the Geuvadis consortium [42]. We characterize fSTRs by motif length, gene level annotation, and effects on RNA folding. We discuss our results in the context of recent STR studies and suggest future lines of inquiry.
Results
Transcribed STRs are possibly associated with gene expression
The overall goals of this study are threefold: (a) identify STR variants that affect RNA folding (fSTRs); (b) establish an association between fSTRs and gene expression; and (c) characterize the effects of fSTR variants on RNA folding. To begin, we identify STR variants in 2,529 samples from the 1000 Genomes Project [38]. Variants for each transcribed sequence, including 50 bp of flanking 5’ and 3’ sequence, were folded with ViennaRNA [39]. Secondary structures were compared with bpRNA-align and affinity propagation clustering [40,41]. A change in RNA structure was indicated when the variants of an STR formed more than one cluster (Fig 1, left panel). We identify 17,255 fSTRs. A representative fSTR in an intron of SH2B3 has five variants with transcribed RNA sequences forming two clusters (Fig 1, left panel).
(left) The effect of STR variants on RNA folding (fSTRs) was inferred by comparing secondary structures with bpRNA-align and affinity propagation clustering. Five variants of a penta-repeat (TGGGG) in an SH2B3 intron fall into two clusters (A and B). (right) Gene expression (RPKM) values for a collection of samples are grouped by genotype after mapping each variant to its cluster assignment. Since each sample has two alleles there are three combinations of cluster assignments (independent axis). We perform a test of the null: no difference in gene expression between groups. The null is rejected (p < 0.01), suggesting an association between RNA folding and SH2B3 expression.
Associations with gene expression were checked using RNA-seq data for a subset of the samples. We use the 462 human lymphoblastoid cell line samples created by the Geuvadis consortium [42]. The analysis was performed in three steps. First, variants from those samples were mapped to their corresponding cluster assignments. Second, expression values for genes harboring fSTRs were grouped by genotype; i.e., a pair of variants mapped to cluster assignments (Fig 1, right panel). Third, we perform a test of the null: no difference in gene expression between groups. The null is rejected for 356 fSTRs, suggesting an association with gene expression. Cluster assignments for an fSTR in SH2B3 show significant differences in gene expression (Fig 1, right panel).
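The first two steps above (mapping alleles to clusters, then grouping expression by genotype) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name, the toy alleles (given as repeat counts), and the sample identifiers are all hypothetical.

```python
from collections import defaultdict

def group_expression_by_genotype(genotypes, allele_to_cluster, expression):
    """Group expression values by the unordered pair of cluster labels.

    genotypes:         {sample_id: (allele_1, allele_2)}
    allele_to_cluster: {allele: cluster_label}
    expression:        {sample_id: expression value, e.g. RPKM}
    """
    groups = defaultdict(list)
    for sample, (a1, a2) in genotypes.items():
        # Skip samples without expression data or with unclustered alleles.
        if sample not in expression:
            continue
        if a1 not in allele_to_cluster or a2 not in allele_to_cluster:
            continue
        # Sort labels so that A/B and B/A count as the same genotype group.
        labels = sorted((allele_to_cluster[a1], allele_to_cluster[a2]))
        groups["/".join(labels)].append(expression[sample])
    return dict(groups)

# Toy example: five alleles (repeat counts) falling into two clusters.
allele_to_cluster = {3: "A", 4: "A", 5: "B", 6: "B", 7: "B"}
genotypes = {"s1": (3, 4), "s2": (3, 5), "s3": (6, 7), "s4": (5, 4)}
expression = {"s1": 10.0, "s2": 12.5, "s3": 20.0, "s4": 13.0}

groups = group_expression_by_genotype(genotypes, allele_to_cluster, expression)
print(groups)
```

With two clusters there are at most three groups (A/A, A/B, B/B), which matches the three genotype combinations described for SH2B3.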
fSTRs are overrepresented in coding regions
We reiterate the discovery criteria for a single fSTR: affinity propagation of its transcribed variants forms two or more clusters. The 66,876 transcribed STRs investigated in this study revealed 17,255 (25.8%) fSTRs. However, this may be an underestimate for two reasons. First, we only considered variants identified in the 1000 Genomes Project. Additional STR variants would likely be found with a larger sample size. Undoubtedly, some of the single cluster results would form multiple clusters with these additional variants. Second, STRs lacking variation in the 1000 Genomes Project samples were excluded from analysis: without variation there is no suitable test of the null. A larger set of samples would likely reveal variation in some of the excluded STRs and the discovery of additional fSTRs.
Characterization of fSTRs by gene level annotation reveals overrepresentation in coding regions (Fig 2a). This result is intriguing when paired with characterization of fSTRs by motif length. It comes as no surprise that effects on RNA structure increase with motif length; indeed, motif lengths greater than one are overrepresented (Fig 2b). However, coding regions are known to favor motif lengths of 3 and 6 to avoid frameshift. Apparently, coding regions are under far greater selective pressure to avoid frameshifts than fSTRs. If this were not the case, the unit one motifs – underrepresented among fSTRs – would outnumber unit three motifs in coding regions.
(a) fSTRs are overrepresented in coding regions. (b) effects on RNA structure tend to increase with motif length. (c) characterization of fSTRs by sequence motif. (+) over-representation among all fSTRs. (-) under-representation among all fSTRs.
Characterization of fSTRs by unit is harder to interpret (Fig 2c). The well-known CAG motif (listed as its equivalent ACG motif) is conspicuously associated with fSTRs, but so too are many other motifs. Taken as a negative result, one interpretation is that any motif has the capacity to affect RNA folding.
fSTR motifs affect RNA accessibility
RNA accessibility may be important for protein binding, rates of splicing, nuclear export, and translation. We characterize how fSTRs affect accessibility of minimum free energy (MFE) RNA structures using ViennaRNA. Briefly, the core prediction algorithm uses dynamic programming to predict base paired and unpaired regions within single stranded RNA. To infer accessibility, we tally unpaired bases for fSTRs and stratify the results by allele length and repeat motif (Fig 3). Results of two types are obtained: (a) accessibility increases with allele length; (b) accessibility decreases with allele length. Examples of both types are shown in Fig 3a and 3b, respectively. Although strong examples were found for both types of association, accessibility varies substantially for fSTR alleles of fixed length regardless of motif. Thus, RNA accessibility probably depends on the fSTR allele length as well as the sequence context 5’ and 3’ to the actual repeat motif.
Accessibility is inferred from the tally of unpaired bases using ViennaRNA. (a) Accessibility increases with allele length for poly-A repeats: r = 0.066, p = 0. (b) Accessibility decreases with allele length for poly-AT repeats: r = −0.217, p = 0. (c) Accessibility increases with allele length for non-reverse complementary repeats: r = 0.017, p = 1.3e-10. (d) Accessibility decreases with allele length for reverse complementary sequences: r = −0.214, p = 0.
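As a rough illustration of the tally, the sketch below counts unpaired positions (`.`) in dot-bracket strings and averages them by allele length. It assumes the structures have already been predicted (for example with ViennaRNA); the function names and the toy records are hypothetical and not taken from the authors' code.

```python
from collections import defaultdict
from statistics import mean

def unpaired_count(structure):
    """Number of unpaired bases ('.') in a dot-bracket string."""
    return structure.count(".")

def mean_unpaired_by_length(records):
    """Average unpaired-base tally per allele length.

    records: iterable of (allele_length, dot_bracket_structure) pairs.
    """
    by_length = defaultdict(list)
    for allele_length, structure in records:
        by_length[allele_length].append(unpaired_count(structure))
    return {length: mean(tallies) for length, tallies in sorted(by_length.items())}

# Toy structures standing in for folded variants of different allele lengths.
records = [
    (10, "((((....))))"),
    (10, "..((....)).."),
    (12, "((((......))))"),
]
summary = mean_unpaired_by_length(records)
print(summary)
```

The two alleles of length 10 give different tallies even in this toy case, which mirrors the observation that accessibility varies for alleles of fixed length.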
To further characterize RNA accessibility, we investigate possible associations with repeat length and unit. Associations of this type are hard to pin down with one exception. Sequences serving as their own reverse complement tend to decrease accessibility as allele length increases. For example, the reverse of poly-AT (poly-TA) is complementary to the original poly-AT motif (Fig 3b). We speculate that such sequences – which have the ability to base pair with themselves – cause a decrease in transcribed RNA accessibility. To test this, we aggregated all non-reverse complementary and reverse complementary sequence motifs. Indeed, the former sequence motifs show a positive correlation with allele length (Fig 3c; r = 0.017, p = 1.3e-10) while the latter have a negative correlation (Fig 3d; r = −0.214, p = 0).
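One way to split motifs into the two groups is to compare each motif with its own reverse complement. The sketch below uses the standard DNA definition; whether the authors also treated rotations of a motif (for example AT and TA) as equivalent is not stated, so this is an assumption.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Reverse complement of a DNA motif (upper-cased)."""
    return motif.upper().translate(COMPLEMENT)[::-1]

def is_self_reverse_complementary(motif):
    """True when a motif equals its own reverse complement (e.g. AT, GC, ACGT)."""
    return motif.upper() == reverse_complement(motif)

for motif in ("A", "AT", "GC", "CAG", "TGGGG", "ACGT"):
    print(motif, is_self_reverse_complementary(motif))
```

Under this definition, motifs of odd length can never be their own reverse complement, because the central base would have to pair with itself.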
fSTRs tend to affect RNA multiloops and external loops
The effects of fSTRs on MFE RNA folding are characterized by comparing secondary structure motifs using bpRNA and bpRNA-align [40,41]. Briefly, the per base secondary structure assignments are aligned for each pair of variants belonging to an fSTR. Mismatching structural motifs are tallied over pairs of alleles. Tallies are visualized as a matrix with row sums normalized to 100% and columns indicating the frequency of mismatch with all other motifs. Over 15% of RNA multiloops (M) and external loops (X) are affected (Fig 4a), and they are frequently exchanged with one another. Changes to bulge motifs (B) are also common (red off-diagonal in Fig 4a). Interestingly, no changes are prohibited. Dangling end motifs (E) were rarely exchanged for multiloops (M), with the former being altered in less than 3% of the bases tallied (Fig 4a).
(a,b) Multiloops and external loops are frequently exchanged due to fSTR insertions. (e) The same exchange is seen for non-reverse complementary sequences. (c) The AT motif – which base pairs with itself – shifts towards right (R) and left-handed stem (L) motifs. (f) The same shift is seen for reverse complementary fSTRs. (d) CAG repeats conserve right-handed stems (R), left-handed stems (L), and ends (E) while departing from other structural motifs.
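The tally behind these matrices can be sketched as follows. The code assumes each pair of variants is already aligned into equal-length strings of per-base motif labels, with `-` marking gaps; counting every pair in both directions and skipping gapped positions are illustrative choices, not details taken from the paper.

```python
from collections import defaultdict

def motif_exchange_matrix(aligned_pairs, gap="-"):
    """Row-normalized (percent) matrix of structural-motif exchanges.

    aligned_pairs: iterable of (labels_1, labels_2), equal-length strings of
    per-base motif labels for two aligned variants of the same STR.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s1, s2 in aligned_pairs:
        if len(s1) != len(s2):
            raise ValueError("aligned label strings must have equal length")
        for a, b in zip(s1, s2):
            if a == gap or b == gap:
                continue  # gapped positions carry no motif-to-motif exchange
            # Count both directions so the raw tally is symmetric.
            counts[a][b] += 1
            counts[b][a] += 1
    matrix = {}
    for a, row in counts.items():
        total = sum(row.values())
        matrix[a] = {b: 100.0 * n / total for b, n in row.items()}
    return matrix

# Toy pair: one multiloop base (M) becomes an external-loop base (X).
matrix = motif_exchange_matrix([("LLMMRR", "LLXMRR")])
print(matrix)
```

Each row then reads as "of all bases labelled with this motif, what percentage aligned to each motif in the other variant", with the diagonal holding the unchanged fraction.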
For reverse complementary sequences (see previous section) we notice a many to one shift towards left (L) and right-handed (R) stem motifs: these columns are mostly red for reverse complementary motifs (Fig 4f). We see a shift away from multiloops (M) and external loops (X) suggesting a link between some fSTRs and gene expression. Indeed, multibranch loops (M) are hubs of interaction within RNA. In fact, this is precisely the difference seen between the clusters of variants for the fSTR embedded in SH2B3 (Fig 1a). However, that particular repeat is not reverse complementary. Of course, the suggested links between DNA motif, RNA folding, and gene expression should be interpreted as preliminary associations and not causation.
Simulations recapitulate effects of STR variants on RNA structure
The effects of reverse complementary sequences were verified using a simulation-based approach. This is important for two reasons. First, sequences 5’ and 3’ to repeat variants may influence RNA folding as seen in experiment. Second, singleton motifs are overrepresented in the experimental data. Accessibility was tested on 10,000 simulated STR alleles. In each case, 5’ and 3’ sequence context was randomized. Reverse complementary (Fig 5a) and non-reverse complementary (Fig 5c) motifs were sampled randomly. The results recapitulate the experimental data in Fig 3c and 3d, respectively.
(a) accessibility increases with motif length for non-reverse complementary sequences. (b) changes in secondary structure for non-reverse complementary sequences. (c) accessibility decreases with motif length for reverse complementary sequences. (d) changes in secondary structure for reverse complementary sequences.
Effects on secondary structure used a similar approach. Motifs were sampled randomly. Changes in RNA secondary structure were tallied for five simulated indel variants while keeping the 5’ and 3’ sequence context fixed. Simulations for reverse complementary sequences (Fig 5d) recapitulate experimental data (Fig 4f). However, the remaining sequences (Fig 5b) do not recapitulate experimental data (Fig 4e). The difference undoubtedly stems from the aforementioned over-representation of singleton motifs in experiment. Interestingly, the different motifs have little effect on secondary structure in simulation (Fig 5b and 5d). Apparently, reverse complementary sequences affect RNA accessibility (but not structure) while singleton motifs affect RNA secondary structure (but not accessibility).
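The two simulation designs can be sketched with the standard library alone. In this illustration the flank length of 50 nt, the copy-number ranges, and the function names are assumptions; folding the resulting sequences (for example with ViennaRNA) is left out.

```python
import random

RNA_ALPHABET = "ACGU"

def random_sequence(length, rng):
    """Uniformly random RNA sequence of the given length."""
    return "".join(rng.choice(RNA_ALPHABET) for _ in range(length))

def simulate_allele(motif, copies, flank_length, rng):
    """One simulated allele with freshly randomized 5' and 3' context."""
    five_prime = random_sequence(flank_length, rng)
    three_prime = random_sequence(flank_length, rng)
    return five_prime + motif * copies + three_prime

def simulate_indel_series(motif, copy_numbers, flank_length, rng):
    """Indel variants of one locus: same flanks, varying repeat copy number."""
    five_prime = random_sequence(flank_length, rng)
    three_prime = random_sequence(flank_length, rng)
    return [five_prime + motif * n + three_prime for n in copy_numbers]

# Accessibility design: every allele gets its own random sequence context.
rng = random.Random(0)
alleles = [simulate_allele("AU", rng.randint(5, 20), 50, rng) for _ in range(5)]

# Structure design: five indel variants sharing one fixed sequence context.
series = simulate_indel_series("CAG", [8, 9, 10, 11, 12], 50, random.Random(1))
print(len(alleles), [len(s) for s in series])
```

Randomizing the flanks averages out context effects on accessibility, whereas fixing them isolates the structural change caused by the indel itself.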
Discussion
Our results support the hypothesis that some STR variants affect RNA secondary structure and gene expression. Support is provided by several lines of evidence. First, we fold and cluster variants for 66,876 transcribed STRs from the 1000 Genomes Project using ViennaRNA and bpRNA-align. We find 17,255 affect RNA folding (fSTRs). Interestingly, fSTRs are enriched in coding regions and specific 3-mers which conspicuously include the CAG repeat motif (Fig 2c). Although the collection of 17,255 fSTRs is discovered using computational tools, we emphasize that only real variants identified in 1000 Genomes Project samples were used for the analysis. Next, we infer effects on gene expression using RNA-seq. Briefly, we map fSTR clusters to RPKM values for each sample and perform a test of the null: no association between cluster assignment and RPKM. Association is supported for 356 fSTRs. These include 13 in coding regions: SAAL1, ZNF384, TSC22D1, MEF2A, C16orf71, TOX3, ERN1, NADK, PTPN18, GIGYF2, USF3, TRERF1, and AK9.
Not to be lost in our results is the approach itself. We demonstrate a novel way to study STR variation using state of the art tools. ViennaRNA is widely regarded as the best in class for predicting RNA secondary structure and bpRNA-align is a recent addition that shows improvement in clustering performance over a broad range of structure types [39,41]. This approach could be extended to study other classes of repetitive DNA such as palindromes and terminal inverted repeats. Indeed, similar approaches have been used – with an older set of tools – to study the effects of single nucleotide polymorphisms (SNPs) on the structure of transcribed UTRs and RNA in general [43–45]. Most of the novelty we introduce lies in mapping the bpRNA-align cluster assignments to variants possessed by each sample; a critical step that enables RPKM association testing.
Our approach is easily extended to the study of disease provided both DNA and RNA sequencing data are available. This is certainly the case for many samples in The Cancer Genome Atlas (TCGA) and the database of Genotypes and Phenotypes (dbGaP). However, the idea that RNA folding alone is sufficient to explain high impact STR variants should be approached with skepticism. Those that are known have catastrophic effects on protein structure (such as Huntington’s) or chromosome structure (such as fragile X), but not RNA structure. In other cases, epigenetic modifications (such as CpG methylation) may overshadow the effects of array length polymorphisms by silencing genes prior to transcription altogether. It is more reasonable to conclude that RNA structure alterations have modest effects on rates of transcription, translation, and splicing.
Beyond splicing, RNA secondary structure influences post-transcriptional gene regulation, particularly when variants occur in untranslated regions (UTRs) or coding sequences [31,46]. Variants in the 5′ UTR may modulate translation initiation while those in coding may affect elongation rates. Variants in the 3′ UTR may impact transcript stability or localization by disrupting motifs for RNA-binding proteins. Future work integrating ribosome profiling, RNA stability assays, and RNA binding protein mapping will help clarify and validate the broader functional consequences of fSTRs.
In contrast, STR variation and its influence on RNA structure could play a larger role in prokaryotes, where transcription and translation are spatially and temporally linked. In fact, two processes unique to prokaryotes provide a precedent. Attenuation is a well-established mechanism that leverages codon repeats to regulate transcription via mutually exclusive RNA secondary structures [47,48]. Possibly, any STR variation that alters RNA secondary structure could influence the rate of transcription or lead to its termination altogether. While this is just a hypothesis, it may be experimentally tractable. A second process – bacterial phase variation – leverages STR mutation rates for semi-random dichotomous phenotype variation [49,50]. Although phase variation has more to do with DNA structure than RNA structure, it emphasizes the complex role of STR variation on phenotype.
To validate our computational predictions regarding RNA-protein interactions and translation efficiency, several experimental techniques could be employed. Cross-linking immunoprecipitation (CLIP) methods, such as HITS-CLIP or iCLIP, allow for transcriptome-wide mapping of protein binding sites on RNA at nucleotide resolution [51]. Applying CLIP to our system would test whether predicted RNA variants alter protein binding in vivo. Similarly, SHAPE-seq and DMS-seq could provide experimental insight into RNA secondary structure changes caused by fSTR variants [52,53]. For translation efficiency, ribosome profiling (Ribo-seq) offers a powerful means to assess ribosome occupancy along transcripts [54]. Comparing ribosome footprint density across transcript variants could determine if fSTR variants influence translation in vivo. When used in parallel with RNA-seq from the same samples, Ribo-seq also enables calculation of translational efficiency ratios, providing a direct test of our predictions. Together, these approaches offer complementary validation strategies that could substantiate the functional effects of fSTRs proposed in this study.
We suggest further lines of inquiry to investigate the effects of STR variation on RNA and DNA structure. The secondary structure of DNA may affect rates of transcription and protein interactions: both precursors to gene expression. Prediction of Z-DNA, H-DNA, and cruciform DNA is an obvious starting point, but newer tools offer a more sophisticated approach to DNA structure prediction. Deep DNAshape predicts up to a dozen intra-base and inter-base features which could shed light on how STR variation affects transcription factor binding and DNA-protein binding at large [55]. RhoFold uses a language model based deep-learning approach to predict the 3D structure of RNA which could extend our analysis of secondary structure to tertiary structure [56]. Likewise, tools for predicting ramp sequences could provide a starting point for linking STR variation to translation rate and fidelity [34,35].
Methods
Overall approach
Our overall hypothesis is that some STRs affect RNA folding (fSTRs) which in turn is associated with differential gene expression in human populations. A test of our hypothesis unfolds in two parts. First, we identify which (if any) of the 66,876 transcribed STRs in the human genome have the capacity to affect RNA folding (secondary structure). To do this, we use the ViennaRNA package to predict secondary structures and score their differences with bpRNA-align. We find 17,255 fSTRs which we characterize by repeat length, repeat motif, and functional annotation. Details of RNA folding and clustering are provided below. Next, we identify which (if any) of the fSTRs are possibly associated with gene expression (Fig 6).
(left) Variants for 66,876 transcribed STRs were identified in 2,529 samples from the 1000 Genomes Project. We used RepeatSeq: a standard STR variant caller. (middle) Variants for each STR were transcribed and folded using ViennaRNA. Secondary structures were assigned and clustered with bpRNA-align and affinity propagation clustering, respectively. (right) Effects on RNA folding were indicated by clustering results in excess of one. Cluster assignments were mapped to 462 RNA-seq samples: a subset of the original 2,529 samples. Associations with gene expression were established using Tukey’s Honestly Significant Difference test.
To check for association with gene expression, we use a second set of 462 RNA-seq samples. Alleles for each sample were mapped to their transcribed cluster assignments (see below). Differences in gene expression (measured as RPKM) across cluster assignments were assessed with a post-hoc Tukey’s Honestly Significant Difference (HSD) test. The test was conducted using the pairwise_tukeyhsd function in Python, with RPKM values as the dependent variable (endog) and group assignments based on allele clusters as the independent variable (groups). A significance threshold of α = 0.05 was applied to determine pairwise differences between groups. This approach identifies statistically significant differences while controlling for multiple pairwise comparisons between groups: 356 fSTRs were possibly associated with gene expression (Fig 6). Data and code used for manuscript preparation are freely available online: https://github.com/nkinney06/fSTRs.
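The test described above can be illustrated with `pairwise_tukeyhsd` from statsmodels, using the `endog` and `groups` arguments named in the text. The RPKM values and group labels below are invented for illustration; the authors' actual data handling may differ.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical RPKM values for one gene, grouped by cluster genotype.
rpkm = np.array([
    10.1, 9.8, 10.3, 10.0, 9.9,    # A/A
    10.2, 10.0, 9.7, 10.1, 10.4,   # A/B
    20.1, 19.8, 20.3, 20.0, 19.9,  # B/B
])
groups = np.array(["A/A"] * 5 + ["A/B"] * 5 + ["B/B"] * 5)

# endog = dependent variable, groups = genotype labels, alpha = 0.05.
result = pairwise_tukeyhsd(endog=rpkm, groups=groups, alpha=0.05)
print(result.summary())

# One boolean per pairwise comparison; True means the null is rejected.
rejected = result.reject.tolist()
print(rejected)
```

In this toy example the B/B group differs from both other groups, while A/A and A/B are indistinguishable. An fSTR would be flagged if any pairwise comparison is rejected, though that decision rule is an assumption here.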
RNA folding with ViennaRNA
The ViennaRNA package is a widely used software suite for predicting and analyzing RNA secondary structures [39]. Briefly, it employs thermodynamic models to predict the most probable secondary structure of an RNA sequence. The core prediction algorithm uses dynamic programming to find the minimum free energy (MFE) structure, which is considered the most stable structure according to the energy model. Details of the ViennaRNA algorithm can be found elsewhere.
Input to the package typically consists of a single RNA sequence or a set of aligned sequences. For single sequences, the RNAfold program can predict either the MFE structure or thermodynamic ensembles using the partition function approach.
In our case, STR variants are inferred from 1000 Genomes Project samples using RepeatSeq [57]: http://github.com/adaptivegenome/repeatseq. Details of variant calling are provided below. Each variant is transcribed and saved in FASTA format to serve as input to ViennaRNA. ViennaRNA provides dot-bracket notation (.dbn) output for each variant. A list of .dbn files serves as the starting point for bpRNA-align clustering.
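Preparing the folding input can be sketched as below. The sketch assumes each variant is given as sense-strand DNA (flanks plus a repeat array), so transcription reduces to replacing T with U; the record naming scheme and function names are hypothetical.

```python
def transcribe(dna):
    """Sense-strand DNA to RNA: upper-case and replace T with U."""
    return dna.upper().replace("T", "U")

def build_variant_records(locus_id, left_flank, motif, copy_numbers, right_flank):
    """One (name, RNA sequence) record per repeat copy number at a locus."""
    records = []
    for n in copy_numbers:
        dna = left_flank + motif * n + right_flank
        records.append((f"{locus_id}_n{n}", transcribe(dna)))
    return records

def to_fasta(records):
    """Serialize records as FASTA text, one sequence line per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Toy locus with short flanks; real flanks would be 50 bp on each side.
records = build_variant_records("locus1", "ACGT", "TGGGG", [2, 3], "TTAA")
fasta_text = to_fasta(records)
print(fasta_text)
```

The resulting text can be written to a file and passed to RNAfold, which reports a dot-bracket structure for each record.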
Thermodynamic considerations
The use of MFE structures without taking into consideration thermodynamic ensembles for each variant may raise concerns about our methodology. In reality, each variant folds into an ensemble of structures within approximately 1 k_BT of the MFE structure. It’s conceivable that the energy barrier between some MFE structures is less than 1 k_BT; consequently, the similar overlapping ensembles mitigate any biological effects. This possibility may increase the false positive rate for the 17,255 fSTRs; but not the 356 fSTRs possibly associated with gene expression. Indeed, strong associations with gene expression are inconsistent with weak energy barriers between ensembles.
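The ensemble argument can be made concrete with the standard Boltzmann weighting of structures. This relation is textbook statistical mechanics rather than something stated in the original; the symbols E(S), Z, and ΔE are introduced here for illustration.

```latex
P(S) = \frac{e^{-E(S)/k_B T}}{Z},
\qquad
Z = \sum_{S'} e^{-E(S')/k_B T},
\qquad
\frac{P(S)}{P(S_{\mathrm{MFE}})} = e^{-\Delta E / k_B T},
\quad
\Delta E = E(S) - E(S_{\mathrm{MFE}}) \ge 0 .
```

For ΔE = k_BT the ratio is e^{-1} ≈ 0.37, so a structure lying 1 k_BT above the MFE structure is still populated at roughly a third of the MFE frequency. This is why two variants whose MFE structures differ by less than about 1 k_BT may sample largely overlapping ensembles.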
RNA clustering
We use bpRNA-align to compare structural differences between STR variants [40,41]. Details of bpRNA-align can be found elsewhere. Briefly, it is a recent contribution that uses a customized global (Needleman-Wunsch) dynamic programming approach. Per base mismatches are scored with a feature-specific substitution matrix and coupled with an inverted and context-specific affine gap penalty. The approach shows improvement in clustering performance over a broad range of structure types [41]. In our case, a list of dbn files (from ViennaRNA) serves as the starting point for bpRNA-align clustering. The output is a symmetric matrix of pairwise similarity scores for each variant. We use the matrix of similarity scores to cluster RNA secondary structures.
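To show the general shape of a global (Needleman-Wunsch) comparison, the sketch below scores pairs of per-base structure label strings and assembles a symmetric similarity matrix. It is not bpRNA-align: the flat match/mismatch scores and the linear gap penalty are simplifying assumptions standing in for the feature-specific substitution matrix and context-specific affine gap penalty described above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    n, m = len(a), len(b)
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
        prev = cur
    return prev[m]

def similarity_matrix(structures):
    """Symmetric matrix of pairwise global alignment scores."""
    return [[global_alignment_score(s, t) for t in structures] for s in structures]

# Toy per-base labels (E = end, S = stem, H = hairpin loop).
structures = ["EESSHHHSSEE", "EESSHHHHSSEE", "EEEEEEEEEEE"]
scores = similarity_matrix(structures)
print(scores)
```

A matrix of this form is the kind of input a precomputed-similarity clustering method can consume.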
Clustering was performed using affinity propagation [58]: the same approach used by the authors of bpRNA-align [41]. We use the AffinityPropagation function from sklearn with the precomputed bpRNA-align similarity matrix. Changes in RNA structure were indicated by clustering results in excess of one. We use a filter parameter to mitigate false discoveries. Briefly, entries in the bpRNA-align similarity matrix were compared for each cluster. Only clusters with differences in excess of 100 were considered for analysis.
RNA sequencing (RNA-seq) data analysis
The RNA-seq data analysis unfolded in four steps [59]: (a) quality control and preprocessing, (b) alignment to the human reference genome, (c) read counting, and (d) differential expression analysis.
- (a). Quality Control and Preprocessing. Quality assessment of the sequencing reads was performed using FastQC [60]. Commonly expected warnings, such as sequence duplication due to highly expressed transcripts and minor issues with tile quality, were disregarded. Similarly, K-mer content warnings arising from random priming were ignored, as our analysis focused on gene-level counts rather than alternative splicing or de novo gene structure inference [61].
- (b). Alignment to the Human Reference Genome. Reads were aligned to the GRCh38 human reference genome using STAR (Spliced Transcripts Alignment to a Reference) [62]. This tool is optimized for handling reads with insertions and deletions. The alignment utilized GENCODE annotation release 33 (gencode.v33.annotation.gtf) to enhance accuracy.
- (c). Read Counting. Gene-level read counts for each sample were generated using HTSeq [63]. Exon-level counts (--type=exon) were aggregated by gene ID (--idattr=gene_id) without strand specificity (--stranded=no). Counts were subsequently normalized to FPKM (fragments per kilobase of transcript per million mapped reads) using the countToFPKM package in R.
- (d). Differential Expression Analysis. DESeq2 [64] was employed to identify differentially expressed genes between 89 African and 373 European samples. The analysis began with constructing a count matrix where rows represented genes and columns corresponded to individual samples. DESeq2 automatically estimated size factors, computed gene-level dispersion, and fitted a generalized linear model to identify significant differences.
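The normalization in step (c) follows the usual FPKM definition, sketched here in Python for concreteness. This is the textbook formula using the full feature length; the countToFPKM package may use an effective length instead, so the function below is an approximation and not a port of that package.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = count * 1e9 / (gene length in bp * total mapped fragments)
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)

# Hypothetical gene: 500 fragments, 2 kb long, in a library of 10 million.
value = fpkm(500, 2000, 10_000_000)
print(value)
```

Dividing by length in kilobases and by library size in millions makes values comparable across genes of different length and samples of different depth.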
STR Genotyping Using RepeatSeq
Microsatellite genotypes were inferred from whole-genome sequencing data using RepeatSeq, a Bayesian framework specifically designed for genotyping tandem repeats from short-read sequencing datasets. RepeatSeq models PCR stutter noise, sequencing errors, and allele sampling to probabilistically call the most likely genotype at each locus. Input data consisted of aligned BAM files from the 1000 Genomes Project, which were processed according to the developers’ recommendations. Candidate repeat loci were specified in BED format, and reads overlapping these regions were extracted for analysis. For each locus and sample, RepeatSeq calculates genotype likelihoods by comparing observed read counts of repeat lengths to a stutter noise model fitted during analysis. The program reports maximum likelihood genotype calls as well as posterior probabilities, allowing for quality filtering in downstream analyses. Default parameters were used unless otherwise specified, with a minimum read coverage threshold applied to ensure reliability of calls. RepeatSeq has been used in previous studies and is freely available online: https://github.com/adaptivegenome/repeatseq. Additional details of STR genotyping are provided in our previous publications [5,6].
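The general idea of likelihood-based genotype calling can be illustrated with a toy example. This sketch is not RepeatSeq's model: the error function, the `error_rate` value, the flat prior over genotypes, and the names `read_prob` and `genotype_posteriors` are all assumptions made for illustration only.

```python
from itertools import combinations_with_replacement
from math import prod


def read_prob(observed, allele, error_rate=0.05):
    """Toy P(observed repeat length | true allele length).

    Assumption: a read reports the true length with probability
    1 - error_rate; the remaining mass decays geometrically with the
    number of repeat units gained or lost.
    """
    distance = abs(observed - allele)
    if distance == 0:
        return 1.0 - error_rate
    return (error_rate / 2.0) * 0.5 ** distance


def genotype_posteriors(reads, candidate_alleles, error_rate=0.05):
    """Posterior probability of each diploid genotype under a flat prior.

    reads: observed repeat lengths, one per read
    candidate_alleles: iterable of allele lengths to consider
    """
    genotypes = list(combinations_with_replacement(sorted(candidate_alleles), 2))
    likelihoods = {
        (a1, a2): prod(
            0.5 * read_prob(r, a1, error_rate) + 0.5 * read_prob(r, a2, error_rate)
            for r in reads
        )
        for a1, a2 in genotypes
    }
    total = sum(likelihoods.values())
    return {g: lk / total for g, lk in likelihoods.items()}


# Toy locus: reads support alleles of 10 and 12 repeat units, with one noisy read.
het_reads = [10] * 6 + [12] * 5 + [11]
het_post = genotype_posteriors(het_reads, {10, 11, 12})
het_call = max(het_post, key=het_post.get)

# Toy locus: reads support a homozygous genotype, with one noisy read.
hom_reads = [10] * 9 + [9]
hom_post = genotype_posteriors(hom_reads, {9, 10})
hom_call = max(hom_post, key=hom_post.get)
```

The posterior of the best genotype gives a natural quality score: calls with low posterior probability can be filtered out in downstream analyses.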
When benchmarked on diverse datasets, several recent variant callers, such as GangSTR [65], HipSTR [66], lobSTR [67], STRetch [68], TREDPARSE [69], and Dante [70], report accuracy similar to or better than that of RepeatSeq. We justified our use of RepeatSeq in our previous publication. In particular, RepeatSeq was specifically designed and validated using data from the 1000 Genomes Project [57].
Samples
Samples used to identify fSTRs are described in previous publications. Briefly, these samples come from phase 3 of the 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/. In total, 2,529 samples were included for analysis: 667 African (AFR), 502 European (EUR), 352 American (AMR), 514 East Asian (EAS), and 494 South Asian (SAS). We used a second set of 462 RNA-seq samples for association testing of fSTR cluster assignments against RPKM values. These include 89 Africans and 373 Europeans. All samples are available through the European Bioinformatics Institute website: https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/samples/.
Statistical considerations
To evaluate pairwise differences between binned datapoints, we used Tukey’s Honestly Significant Difference (HSD) test, which is specifically designed for post-hoc comparisons following ANOVA. This method controls the family-wise error rate (FWER), reducing the likelihood of false positives that can arise from multiple testing. Tukey’s HSD achieves this by adjusting the significance threshold across all pairwise comparisons, ensuring that the overall probability of making one or more Type I errors remains at the specified alpha level (typically 0.05). As such, it provides a conservative and statistically robust approach to identify significant group differences while accounting for the multiple comparisons inherent in our analysis.
Multiple testing corrections can be applied both within and across families of tests. We applied Tukey’s HSD test within each binned comparison group without an additional layer of correction across bins. This decision reflects our aim to identify localized effects of specific variants or sequence contexts rather than to make broad claims about global significance across the entire dataset. Tukey’s HSD already controls the family-wise error rate for the multiple pairwise comparisons within each group, which are the relevant statistical units for our hypotheses. Furthermore, because each bin represents a biologically distinct context, we treat bins as independent analytical units rather than as components of a single multiple testing framework. Accordingly, we interpret statistical significance conservatively and contextualize findings based on consistency across bins and biological plausibility, rather than relying solely on adjusted p-values for global inference.
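The per-bin procedure can be sketched with SciPy's `tukey_hsd` function. The bin names, group names, and values below are invented for illustration, and the helper `tukey_within_bins` is hypothetical. The sketch runs one Tukey HSD test per bin and applies no further correction across bins, as described above.

```python
from itertools import combinations

from scipy.stats import tukey_hsd


def tukey_within_bins(bins):
    """Run Tukey's HSD separately inside each bin.

    bins: dict bin_name -> dict group_name -> list of observations
    Returns dict bin_name -> dict (group_i, group_j) -> adjusted p-value.
    No correction is applied across bins.
    """
    results = {}
    for bin_name, groups in bins.items():
        names = list(groups)
        res = tukey_hsd(*(groups[name] for name in names))
        results[bin_name] = {
            (names[i], names[j]): float(res.pvalue[i, j])
            for i, j in combinations(range(len(names)), 2)
        }
    return results


# Toy data: two hypothetical bins, each holding three hypothetical groups.
toy_bins = {
    "bin_1": {
        "g1": [1.0, 1.2, 0.9, 1.1, 1.0],
        "g2": [1.1, 1.0, 1.2, 0.9, 1.05],
        "g3": [3.0, 3.2, 2.9, 3.1, 3.0],
    },
    "bin_2": {
        "g1": [2.0, 2.1, 1.9, 2.0, 2.05],
        "g2": [2.05, 1.95, 2.1, 2.0, 1.9],
        "g3": [2.0, 1.9, 2.1, 2.05, 1.95],
    },
}
pvalues = tukey_within_bins(toy_bins)
```

Each returned p-value is already adjusted for the pairwise comparisons inside its own bin, so it can be compared directly to the chosen alpha level (for example 0.05) for that bin.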
Supporting information
S1 File. Expanded characterization of STRs, fSTRs, and efSTRs.
Each is characterized by gene feature, sequence motif, and amino acid motif.
https://doi.org/10.1371/journal.pone.0326355.s001
(PDF)
References
- 1. Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. Sequencing and characterizing short tandem repeats in the human genome. Nature Reviews Genetics. 2024;1–16.
- 2. Gymrek M. A genomic view of short tandem repeats. Curr Opin Genet Dev. 2017;44:9–16. pmid:28213161
- 3. Wyner N, Barash M, McNevin D. Forensic Autosomal Short Tandem Repeats and Their Potential Association With Phenotype. Front Genet. 2020;11:884. pmid:32849844
- 4. Butler JM. New resources for the forensic genetics community available on the NIST STRBase website. Forensic Science International: Genetics Supplement Series. 2008;1:97–9.
- 5. Kinney N, Kang L, Bains H, Lawson E, Husain M, Husain K, et al. Ethnically biased microsatellites contribute to differential gene expression and glutathione metabolism in Africans and Europeans. PLoS One. 2021;16(3):e0249148. pmid:33765058
- 6. Kinney N, Kang L, Eckstrand L, Pulenthiran A, Samuel P, Anandakrishnan R, et al. Abundance of ethnically biased microsatellites in human gene regions. PLoS One. 2019;14(12):e0225216. pmid:31830051
- 7. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
- 8. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. pmid:26642241
- 9. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
- 10. Galindo CL, McCormick JF, Bubb VJ, Abid Alkadem DH, Li L-S, McIver LJ, et al. A long AAAG repeat allele in the 5’ UTR of the ERR-γ gene is correlated with breast cancer predisposition and drives promoter activity in MCF-7 breast cancer cells. Breast Cancer Res Treat. 2011;130(1):41–8. pmid:21153485
- 11. Velmurugan KR, Varghese RT, Fonville NC, Garner HR. High-depth, high-accuracy microsatellite genotyping enables precision lung cancer risk classification. Oncogene. 2017;36(46):6383–90. pmid:28759038
- 12. Rivero-Hinojosa S, Kinney N, Garner HR, Rood BR. Germline microsatellite genotypes differentiate children with medulloblastoma. Neuro Oncol. 2020;22(1):152–62. pmid:31562520
- 13. Mitra I, Huang B, Mousavi N, Ma N, Lamkin M, Yanicky R, et al. Genome-wide patterns of de novo tandem repeat mutations and their contribution to autism spectrum disorders. Cold Spring Harbor Laboratory; 2020. https://doi.org/10.1101/2020.03.04.974170
- 14. Lundström OS, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I, Anisimova M, et al. WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans. J Mol Biol. 2023;435(20):168260. pmid:37678708
- 15. Ruitberg CM, Reeder DJ, Butler JM. STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res. 2001;29(1):320–2. pmid:11125125
- 16. Uppili B, Faruq M. STRIDE-DB: a comprehensive database for exploration of instability and phenotypic relevance of short tandem repeats in the human genome. Database (Oxford). 2024;2024:baae020. pmid:38602506
- 17. Chintalaphani SR, Pineda SS, Deveson IW, Kumar KR. An update on the neurological short tandem repeat expansion disorders and the emergence of long-read sequencing diagnostics. Acta Neuropathol Commun. 2021;9(1):98. pmid:34034831
- 18. Margoliash J, Fuchs S, Li Y, Zhang X, Massarat A, Goren A, et al. Polymorphic short tandem repeats make widespread contributions to blood and serum traits. Cell Genom. 2023;3(12):100458. pmid:38116119
- 19. Yoon JG, Lee S, Cho J, Kim N, Kim S, Kim MJ. Diagnostic uplift through the implementation of short tandem repeat analysis using exome sequencing. European Journal of Human Genetics. 2024;1–4.
- 20. Nojadeh JN, Behrouz Sharif S, Sakhinia E. Microsatellite instability in colorectal cancer. EXCLI J. 2018;17:159–68. pmid:29743854
- 21. McIver LJ, Fonville NC, Karunasena E, Garner HR. Microsatellite genotyping reveals a signature in breast cancer exomes. Breast Cancer Res Treat. 2014;145(3):791–8. pmid:24838940
- 22. Bacolla A, Wells RD. Non-B DNA conformations as determinants of mutagenesis and human disease. Mol Carcinog. 2009;48(4):273–85. pmid:19306308
- 23. Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ. Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009;324(5931):1213–6. pmid:19478187
- 24. Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5(6):435–45. pmid:15153996
- 25. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nature Reviews Genetics. 2018;19:286–98.
- 26. Iennaco R, Formenti G, Trovesi C, Rossi RL, Zuccato C, Lischetti T, et al. The evolutionary history of the polyQ tract in huntingtin sheds light on its functional pro-neural activities. Cell Death Differ. 2022;29(2):293–305. pmid:34974533
- 27. Katti MV, Ranjekar PK, Gupta VS. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol. 2001;18(7):1161–7. pmid:11420357
- 28. Silva A, de Almeida AV, Macedo-Ribeiro S. Polyglutamine expansion diseases: More than simple repeats. J Struct Biol. 2018;201(2):139–54. pmid:28928079
- 29. Lieberman AP, Shakkottai VG, Albin RL. Polyglutamine Repeats in Neurodegenerative Diseases. Annu Rev Pathol. 2019;14:1–27. pmid:30089230
- 30. Wright SE, Todd PK. Native functions of short tandem repeats. Elife. 2023;12:e84043. pmid:36940239
- 31. Georgakopoulos-Soares I, Parada GE, Hemberg M. Secondary structures in RNA synthesis, splicing and translation. Comput Struct Biotechnol J. 2022;20:2871–84. pmid:35765654
- 32. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51(11):1652–9. pmid:31676866
- 33. Miller JB, Brandon JA, McKinnon LM, Sabra HW, Lucido CC, Murcia JDG. Ramp sequence may explain synonymous variant association with Alzheimer’s disease in the Paired Immunoglobulin-like Type 2 Receptor Alpha (PILRA). bioRxiv. 2025.
- 34. McKinnon LM, Miller JB, Whiting MF, Kauwe JSK, Ridge PG. A comprehensive analysis of the phylogenetic signal in ramp sequences in 211 vertebrates. Sci Rep. 2021;11(1):622. pmid:33436653
- 35. Miller JB, Meurs TE, Hodgman MW, Song B, Miller KN, Ebbert MTW, et al. Ramp atlas: facilitating tissue and cell-specific ramp sequence analyses through an intuitive web interface. NAR Genomics and Bioinformatics. 2022;4(2):lqac039. pmid:35664804
- 36. Tieng FYF, Abdullah-Zawawi M-R, Md Shahri NAA, Mohamed-Hussein Z-A, Lee L-H, Mutalib N-SA. A Hitchhiker’s guide to RNA-RNA structure and interaction prediction tools. Brief Bioinform. 2023;25(1):bbad421. pmid:38040490
- 37. Sanchez de Groot N, Armaos A, Graña-Montes R, Alriquet M, Calloni G, Vabulas RM, et al. RNA structure drives interaction with proteins. Nat Commun. 2019;10(1):3246. pmid:31324771
- 38. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2020;48(D1):D941–7. pmid:31584097
- 39. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:1–14.
- 40. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 2018;46(11):5381–94. pmid:29746666
- 41. Lasher B, Hendrix DA. bpRNA-align: improved RNA secondary structure global alignment for comparing and clustering RNA structures. RNA. 2023;29(5):584–95. pmid:36759128
- 42. Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11. pmid:24037378
- 43. Ritz J, Martin JS, Laederach A. Evaluating our ability to predict the structural disruption of RNA by SNPs. BMC Genomics. 2012;13 Suppl 4(Suppl 4):S6. pmid:22759654
- 44. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat. 2013;34(4):546–56. pmid:23315997
- 45. Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. The RNAsnp web server: predicting SNP effects on local RNA secondary structure. Nucleic Acids Res. 2013;41:W475-9. pmid:23630321
- 46. Kramer MC, Gregory BD. Does RNA secondary structure drive translation or vice versa? Nat Struct Mol Biol. 2018;25(8):641–3. pmid:30061597
- 47. Baumberg S. Prokaryotic gene expression. OUP Oxford; 1999.
- 48. Press MO, Hall AN, Morton EA, Queitsch C. Substitutions Are Boring: Some Arguments about Parallel Mutations and High Mutation Rates. Trends in Genetics. 2019;35:253–64.
- 49. van der Woude MW, Bäumler AJ. Phase and antigenic variation in bacteria. Clin Microbiol Rev. 2004;17(3):581–611. pmid:15258095
- 50. Henderson IR, Owen P, Nataro JP. Molecular switches--the ON and OFF of bacterial phase variation. Mol Microbiol. 1999;33(5):919–32. pmid:10476027
- 51. Ule J, Hwang H-W, Darnell RB. The Future of Cross-Linking and Immunoprecipitation (CLIP). Cold Spring Harb Perspect Biol. 2018;10(8):a032243. pmid:30068528
- 52. Watters KE, Abbott TR, Lucks JB. Simultaneous characterization of cellular RNA structure and function with in-cell SHAPE-Seq. Nucleic Acids Res. 2016;44(2):e12. pmid:26350218
- 53. Watters KE, Yu AM, Strobel EJ, Settle AH, Lucks JB. Characterizing RNA structures in vitro and in vivo with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Methods. 2016;103:34–48. pmid:27064082
- 54. Calviello L, Ohler U. Beyond Read-Counts: Ribo-seq Data Analysis to Understand the Functions of the Transcriptome. Trends Genet. 2017;33(10):728–44. pmid:28887026
- 55. Li J, Chiu T-P, Rohs R. Predicting DNA structure using a deep learning method. Nat Commun. 2024;15(1):1243. pmid:38336958
- 56. Shen T, Hu Z, Sun S, Liu D, Wong F, Wang J, et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat Methods. 2024;21(12):2287–98. pmid:39572716
- 57. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013;41(1):e32. pmid:23090981
- 58. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–6. pmid:17218491
- 59. Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, et al. Revisiting global gene expression analysis. Cell. 2012;151(3):476–82. pmid:23101621
- 60. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. https://cir.nii.ac.jp/crid/1370584340724053142.
- 61. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38(12):e131. pmid:20395217
- 62. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886
- 63. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9. pmid:25260700
- 64. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. pmid:25516281
- 65. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47(15):e90. pmid:31194863
- 66. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–2. pmid:28436466
- 67. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. pmid:22522390
- 68. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19(1):121. pmid:30129428
- 69. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet. 2017;101(5):700–15. pmid:29100084
- 70. Budiš J, Kucharík M, Ďuriš F, Gazdarica J, Zrubcová M, Ficek A, et al. Dante: genotyping of known complex and expanded short tandem repeats. Bioinformatics. 2019;35(8):1310–7. pmid:30203023