Ethnically biased microsatellites contribute to differential gene expression and glutathione metabolism in Africans and Europeans

Approximately three percent of the human genome is occupied by microsatellites: a type of short tandem repeat (STR). Microsatellites have well established effects on (a) the genetic structure of diverse human populations and (b) expression of nearby genes. These lines of inquiry have uncovered 3,984 ethnically biased microsatellite loci (EBML) and 28,375 expression STRs (eSTRs), respectively. We hypothesize that a combination of EBML, eSTRs, and gene expression data (RNA-seq) can be used to show that microsatellites contribute to differential gene expression and phenotype in human populations. In fact, our previous study demonstrated a degree of mutual overlap between EBML and eSTRs but fell short of quantifying effects on gene expression. The present work aims to narrow the gap. First, we identify 313 overlapping EBML/eSTRs and recapitulate their mutual overlap. The 313 EBML/eSTRs are then characterized across ethnicity and tissue type. We use RNA-seq data to pursue validation of 49 regions that affect whole blood gene expression; 32 out of 54 affected genes are differentially expressed in Africans and Europeans. We quantify the relative contribution of these 32 genes to differential expression; fold change tends to be less than other differentially expressed genes. Repeat length correlates with expression for 15 of the 32 genes; two are conspicuously involved in glutathione metabolism. Finally, we repurpose a mathematical model of glutathione metabolism to investigate how a single polymorphic microsatellite affects phenotype. We conclude with a testable prediction that microsatellite polymorphisms affect GPX7 expression and oxidative stress in Africans and Europeans.

The novelty of the current work is also highlighted in the discussion. Briefly, we demonstrate how RNA-seq can be used to validate and quantify the effects of EBML/eSTRs on gene expression. In addition, we demonstrate that mathematical models can be repurposed to infer how EBML/eSTRs affect phenotype. The latter point has broader implications. In particular, there are over 28,000 known eSTRs and nearly 1,000 biological systems models for human available in databases such as JWS online and EBI BioModels. Thus, we anticipate phenotypic effect of many more eSTRs can be made using our approach on existing mathematical models.
We agree with the editor that it is important to distinguish how the two manuscripts differ. Although the differences can be found in the introduction and discussion, we have appended a short paragraph to the methods section making these differences explicitly clear.

I would like to invite the authors to reflect and discuss the impact of using these datasets on the results obtained:
The reviewer makes important points regarding the challenge of integrating the three main datasets we consider: (a) EBML identified from KGP samples, (b) eSTRs identified from GTEX samples, and (c) RNA-seq data from GEUVADIS consortium. The common thread of our response is that there is a "discovery bias" which favors identification and subsequent validation of EBML/eSTRs in Europeans. Critically, this does not invalidate any of our results or affect reproducibility. In what follows we elaborate on each dataset and our effort to integrate these datasets.
(i) EBML are identified from 5 main ethnic groups (data set 1KG -AFR, EUR, SOA, EAS, AME); First, we consider the "discovery bias" for EBML. Our previous publication identified 3,171 EBML for African populations and 450 EBML for European populations. However, these EBML were discovered using relatively small samples of sequenced individuals: 667 Africans and 502 Europeans. Given these samples sizes, we can say with confidence that the true number of EBML exceeds 450 for Europeans and 3,171 for Africans. Despite the greater number of African EBML, the percent of true EBML discovered is probably greater for Europeans. A passage from our previous paper justifies this speculation: the 1000 Genomes Project falls short in capturing genetic diversity in some populations. Principal component analysis has shown that genetic diversity in Southeast Asia is underrepresented in the 1000 Genomes Project [1]; and, whole genome sequencing of African Populations has revealed millions of unshared genetic variants [2].
Essentially, the 1000 Genomes Project captures more of the genetic diversity in Europeans than other ethnicities: particularly Africans and Asians. Estimating the true number of EBML for Europeans and Africans would probably require sophisticated power analysis beyond the scope of this work. In any case, the true number of EBML is not necessary for the current work. It is more important to recognize that the 450 EBML for Europeans and 3,171 EBML for Africans are all valid despite a larger number of undiscovered African EBML.
Our results now emphasize that there are likely a disproportionate number of undiscovered EBML for African and Asian populations (see the second section of our results). (ii) eSTR are identified from the GTEX database, which presents an ethnic bias, where the majority of the samples that they integrate are declared "white" (~ 85%); Superficially, the predominately white/European samples in the GTEX database seem to favor the discovery of European eSTRs; however, this incorrect as there is no notion of "European" eSTRs or "African" eSTRs. Any repeat shown to affect gene expressionan eSTR by definitionin Europeans likely also affects gene expression in Africans. Of course, there are exceptions; specifically, neutral repeats in Africans could hypothetically be eSTRs in Europeans and vice versa. These hypothetical repeats could be explained by standard concepts in evolution such as migration, genetic drift, and selection. Consequently, it is possible that some of the 28,000 eSTRs discovered from the GTEX database are neutral (false positives) in Africans; and some of the GTEX neutral repeats could be eSTRs (false negatives) in Africans. The number of such exceptions is difficult to estimate.
Probably, most of the 28,000 known eSTRs would be reproduced in an exclusively African cohort. Thus, despite the GTEX samples being predominantly white, we do not expect the set of 28,000 eSTRs to have a significant bias towards Europeans or Africans.

(iii) Data from RNAseq (GEUVADIS consortium) are only available for a 1KG subsample of Africans and
Europeans.
We use RNA-seq data (GEUVADIS consortium) to compare whole exome gene expression in cohorts of 89 Africans and 373 Europeans. First, we identified 14,124 (out of 60,667) differentially expressed genes. These numbers would certainly change with a more balanced cohort; however, the differentially expressed genes we report are not necessarily bias towards Europeans or Africans. Expression for all of them clearly differ in the cohorts we compare. The differentially expressed genes included 32 (out of 54) associated with a whole blood EBML/eSTRs. Statistically speaking, this supports our hypothesis that some EBML contribute to differential gene expression.
As the reviewer points out, it is important to consider how the aforementioned "discovery bias" for EBML may influence this result. To reiterate, there are undoubtedly undiscovered EBML for EUR populations and even more for AFR populations. We can have confidence in our resultthat EBML/eSTRs are associated with differential expressionby preforming the following hypothetical exercise.
Start with the original contingency table used to test association between EBML/eSTRs and differentially expressed genes (note the numbers sum to 60,667). A chi-squared test confirms EBML are associated with differential expression (p=1.08e-9; χ 2 test). This calculation is found in the caption to figure 2 of our manuscript.
Next, suppose we discover 14 new EBML/eSTRs for Europeans and 94 new EBML/eSTRs for Africans. In other words, whole blood EBML have tripled from 54 to 162. Let's further assume that only 25 of the new EBML are associated with a differentially expressed gene, i.e. they are drawn randomly from the left column of 14,092 differentially expressed genes and 46,521 similarly expressed genes above. We tally a new (hypothetical) contingency table shown below. A chi-squared test would still indicate that EBML are associated with differential expression (p=4.7e-4; χ 2 test).
What if we hypothetically discover 216 new whole blood EBML increasing the original 54 by a factor of 5? Again, we draw the 216 randomly from the left column of 14,092 differentially expressed genes and 46,521 similarly expressed genes. A chi-squared test would still indicate that EBML are associated with differential expression (p=.007; χ 2 test).
Overall, the association we report between EBML/eSTRs and differentially expressed genes seems robust to the possibility of numerous undiscovered EBML and eSTRs (even if these undiscovered EBML are random with respect to gene expression). It is unfortunate that suitable RNAseq data was only available for two ethnicities (AFR and EUR); but such data would probably reiterate once again an association between EBML/eSTRs and differentially expressed genes.

Why did the authors not previously filter the EBML for AFR and EUR, since confirmatory analyses can be performed only for these groups?
This is an astute detail! We thank the reviewer for careful reading of our manuscript. Indeed, we pursue confirmatory analysis for 49 EBML that were not explicitly filtered with respect to AFR and EUR populations. If we had filtered prior to confirmatory analysis we would have found 7 EBML for EUR and 37 EBML for AFR with three overlapping (EBML in both populations). In other words, filtering would have eliminated 8 of 49 EBML we considered. We provide links to the genetic tables for these 8 EBML: It may come as a surprise that genotypes for this short tandem repeat (STR) differ in Africans and Europeans. However, this difference alone is not sufficient to call this region an EBML in either population. For each EBML, at least one ethnicity must have genotype frequencies statistically different from the remaining four. Examining the complete tally of genotypes for this region reveals mutual similarities between AFR and SAS as well as AMR and EUR (table below). In fact, this region is only a rigorous EBML for EAS populations. But, given the difference between African genotypes and European genotypes, we feel it is appropriate to keep this region for confirmatory analyses.
We could make a similar case for keeping more of the eight regions in our confirmatory analysis; we opted to retain all eight for simplicity. Indeed, all eight regions show some differences in AFR and EUR populations; and it is very unlikely that these eight fringe cases affected our analysis. For example, our previous response demonstrates robustness of association between EBML/eSTRs and differentially expressed genes.

How can the ethnic bias of GTEX samples favor the identification of EBML/eSTR EUR?
The reviewer makes an important point that the GTEX samplesused to discover eSTRsare predominately European. For the most part we will reiterate our response from above. Essentially, repeats shown to affect gene expression -eSTRs by definitionin Europeans should also affect gene expression in Africans. Of course, there will be rare exceptions; but we expect the majority of the 28,000 known eSTRs to be reproducible in an exclusively African cohort.
The identification of EBML has a more obvious "discovery bias" stemming from the KGP samples. In particular, the KGP samples under-represent African genetic variation. Consequently, relatively fewer EBML have been established for Africans (3,171 total) than Europeans (450 total). But it is important to emphasize that most of our analysis focuses on a set of 49 EBML with clear genetic differences between the two populations (7 EBML for EUR, 37 EBML for AFR, and 8 polymorphic regions discussed above). Even if most of the undiscovered EBML are unique to AFR populationsas we suspectour analysis of these specific 49 does not change.
Perhaps the one piece of analysis that is affected by the KGP and GTEX datasets is the tally of 313 regions across ethnicity and tissue type. Taking into consideration the reviewer's question, figure 1D is misleading because it directly compares raw numbers of EBML by ethnicity. With more balanced samplesparticularly KGP samplesthe African, South Asian, and East Asian bars would represent more EBML regions and be even larger. In fact, under-representation of Africans and South Asians is well established for the KGP [1,2]. The figure is correct, but only for discovered EBML. We have added a short passage to this section of the manuscript that clarifies the issue.  A section with caveats would be welcome for the study.
We have added a paragraph to the discussion recognizing several caveats pointed out by the reviewer. We elaborate on some of these below.

(i) Discussion of challenges to obtain reliable allele calls for microsatellites from NGS low coverage data
Indeed, microsatellite allele calls remain challenging; particularly from NGS low coverage data. Mapping issues are common for large indels and realignment around indels is typically required before genotype calling. Our previous work on EBML (10.1371/journal.pone.0225216) used two strategies to increase sensitivity of allele calls; specifically, requirements were placed on (a) sequencing depth and (b) sequence logos.
(a) Requirements placed on sequencing depth. Allele calls were only made for regions with six or more mapped reads; consequently, most of the alleles in our call sets are supported by three or more reads. This strategy was implemented to mitigate the effects of PCR artifacts and read duplication.
(b) Requirements placed on sequence logos. We defined sequence logos for each microsatellite by concatenation of its 3' flanking sequence, repeat unit, and 5' flanking sequence. We used 5bp for the flanking regions. For example, microsatellites 'GCTGC(A) 34 CTTAG' and 'GCTGC(A) 15 CTTAG' received the same sequence logo: 'GCTGCACTTAG'. Only microsatellites with unique sequence logos were considered in our analysis. This strategy was implemented to mitigate occurrences of microsatellite polymorphisms mapping incorrectly to regions with similar (or identical) sequence logos.
We are open to including a further discussion of challenges; however, no allele calling was actually preformed in this work. We use the allele calls from our previous work: 10.1371/journal.pone.0225216. Thus, it could be confusing if we delve into the challenges of allele calling in this work. As a compromise, we mention that there are challenges and encourage the reader to pursue details in our previous work that made the calls [1].

(ii) Limitations for RNAseq comparison and quantification of gene expression (mapped reads x NGS low coverage)
We agree it is a good idea to remind the reader that RNA-seq has limitations. Measurement noise inevitably arises due to random sampling inherent in the assay [1]. This complicates differential expression analysis, particularly for between sample comparison of genes with low transcription rate. Measurement noise is exacerbated by heterogeneity of sequencing platforms and analysis protocols [1]. Thus, sequencing coverage, platform, sample sizes, and analysis could all increase the rate of false positives.
The RNA-seq analyses preformed in our work mitigatesbut does not eliminatethese issues. First, the data is relatively homogeneous. Publication of the GEUVADIS samples is self-described as "the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences" [2]. Second, we use the industry standard analysis pipeline -DESeq2to identify differentially expressed genes from count data [3]. Our code is available at https://github.com/nkinney06/dockerizedEBML. Third, regression analysis (figure 4 of our manuscript) is preformed on RPKM normalized counts. The differentially expressed genes we identify undoubtedly still contain false positives; however, association between EBML/eSTRs with those differentially expressed genes appears to be very robust (see our response above).