On the cross-population generalizability of gene expression prediction models

The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.

that reflect the ancestry of genotypes used to impute. The authors have used both real and simulated data to make this case, and overall, I feel that the paper is useful to the human genetics community. But the manuscript suffers from errors in data visualization that distract from the message and leave the reader confused as to which is the true representation of the data. Below I outline comments to improve and correct these errors.
Major Comments.
1. The number of gene models available in each weight set would be highly dependent on the criteria used to identify these genes as "well predicted". As an example, depending on the R2 threshold, the MESA_All set could have anywhere from 6896 gene models to 1336 gene models (as per the paper). The authors need to add details to the Methods on what criteria were used to arrive at the final list of gene models that are compared. Were these thresholds the same for all weight sets?
The reviewer makes the important observation that we are limited by existing prediction model sets in PredictDB that range in the number of genes covered. These numbers are given to researchers by PredictDB and are mostly dictated by training sample size, data quality, and other criteria. Furthermore, we are unsure where the reviewer obtained the numbers 6896 and 1336, as they do not appear in the manuscript.
We clearly show all thresholding criteria in Supp Table 1 and include the number of gene models from each prediction weight set that meet each threshold. The threshold criteria were the same for all weight sets. Importantly, we did not explicitly filter by R2 for Supp Table 1; rather, we applied a filter for positive correlation between predictions and measurements, as significant negative correlations are hard to interpret.
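As a toy numerical illustration of this point (ours, not taken from the manuscript): because R2 is the squared Pearson correlation, a prediction that is perfectly anti-correlated with the measurements still attains R2 = 1, which is why a filter on the sign of the correlation is more conservative than a filter on R2 alone.

```python
import numpy as np

# Toy illustration (not from the manuscript): R2, as the squared Pearson
# correlation, discards the sign of the prediction-measurement relationship.
measured = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = -measured  # a perfectly anti-correlated "prediction"

r = np.corrcoef(measured, predicted)[0, 1]
r2 = r ** 2
# r is -1.0 while r2 is 1.0: a filter on R2 alone would keep this gene,
# whereas a filter requiring positive correlation would drop it.
```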
We suspect that the reviewer is perhaps referring to the analysis behind Figure 3, where we focused on genes from GTEx v7. In that case, we state in the final paragraph of Results subsection "Concordance of measured gene expression and PrediXcan predictions is lower than expected" that we filtered out genes with test R2 < 0.2 in GTEx v7. Our choice of R2 is intended roughly to capture the better half of predictions from DGN; see Figure 3 from Gamazon et al. (2015), which shows that mean R2 in DGN is 0.137. We have added text to the 4th paragraph of Results subsection "Concordance of measured gene expression and PrediXcan predictions is lower than expected" justifying our choice of R2: "Our choice of R2 is informed by observed R2 between predictions and measurements in DGN (see Figure 3 of [6]) and focuses our analysis on genes predicted better than average."

2. By using all 11,545 genes in the analysis, it seems like it would bias the estimate downward, especially since the authors compare these values to the R2 from the training sets, which do not include all 11,545 genes in all models. While this point is somewhat answered by the GTEx v7 analysis on well predicted genes, this does not address whether this is true for the more diverse cohorts like MESA.
The reviewer's skepticism of using all genes is warranted, which is why we also partitioned results for subsets of genes; see Supp Figs 2-8 in the revised version of this manuscript. Our subsequent parsing of genes focuses on better predictions and ensures more apples-to-apples comparisons, assuaging any concerns about bias. By any reasonable measure, PrediXcan models do not seem to offer consistently reliable prediction across any logical subset of genes that is not manually cherrypicked. Cherrypicking genes is unrealistic and impractical because researchers using PrediXcan generally do not have expression data to check the quality of predictions.
Regarding the reviewer's comment about our analysis of GTEx v7 genes with cross-validated R2 > 0.2, we have repeated these analyses using MESA models. Four new figures (Supp Figs 9-12) appear in the text with cross-validated testing R2 > 0.2, one for each MESA prediction model set. The final sentence of the 4th paragraph of Results subsection "Concordance of measured gene expression and PrediXcan predictions is lower than expected" now reads: "We see a similar trend with MESA models, in which R2 in SAGE is consistently much lower (mean R2 0.026 - 0.030) than test R2 from each prediction weight set (test R2 0.373 - 0.392)."

3. The box and whiskers plots are not particularly informative to show the difference in R2 between the weight sets as highlighted in the results. While I understand the desire to graph the test R2 with the actual R2, I think it may be easier to see when graphed separately.
The requested plots appear as Supp Figs 2, 6, and 8, with appropriate references in the Results section.
We remark that the purpose of Figure 1 is to show the overall discrepancy between R2 in the PrediXcan testing sets and actual R2 between predictions and measurements in SAGE. Within SAGE, the R2 distributions do not vary appreciably across weight sets. This is a point which we feel was not sufficiently stressed in the text, so we have modified the 2nd paragraph of Results subsection "Concordance of measured gene expression and PrediXcan predictions is lower than expected": "The highest mean R2 of 0.0298 was observed in MESA_AFA, while the lowest was observed in MESA_ALL (R2 = 0.0250), suggesting little appreciable difference in prediction quality between prediction weight sets." Note that R2 obscures the direction of correlation, which is why we mostly dwell on (signed) correlations in the text; we use R2 to compare to previous work. We stated this in the text: "As we are primarily interested in describing the relationship between predicted outcome and real outcome, we prefer Spearman's ρ to describe correlations, while for determining prediction accuracy, we use the standard regression R2, corresponding to the squared Pearson correlation, to facilitate comparisons to prior work."

4. As a follow-up to comment 2, could the negative correlation described in the results (lines 189 - 195) be due to the inclusion of all 11K genes as opposed to those well predicted by each model? Would this still be the case if the authors restricted to just those genes listed in Supplementary Table 1?
The bias that the reviewer describes could possibly exist. But we show in Supp Figs 4-8 that the trends from Figures 1 and 2 still hold true for the 273 genes common to all weight sets and for the 39 of those with non-negative correlation. These are apples-to-apples comparisons over the same genes and are therefore not biased by the number of available gene models in each prediction weight set.
To ensure that readers are aware of these results, the caption of Supp Table 1 now points readers to Supp Figs 4-8: "The intersection of genes with both predictions and measurements in SAGE across all seven weight sets is 273 (see Supplementary Figures 4-6), of which 39 produce predictions positively correlated to data in all comparisons (see Supplementary Figures 7 and 8)."

5. The results for the 564 genes that were trained in all GEUVADIS data should be presented similarly to the larger gene set (lines 241-247). Specifically, the authors should highlight how the models trained on EUR fared in AFR and vice versa, as opposed to just giving ranges of the R2 across all comparisons.
In response to the reviewer's comment, we have added a substantial amount of text to the second paragraph of Results subsection "Cross-population prediction quality declines with increasing genetic distance": "A comparison of the 564 genes in common across all train-test scenarios (

6. It would be helpful to know the population sizes for each of the subpopulations presented in Tables 3 and 4, unless all of these are 89 as stated in line 252.
The caption for Table 3 says: "All populations were subsampled to N = 89 individuals." We have added text to the caption for Table 4 to clarify that the results in Table 4 are also from 89 individuals per population: "…all 25 train-test scenarios. As in Table 3, all populations were subsampled to n = 89 subjects."

7. I am a bit confused by the simulation in which the admixed population inherits causal eQTLs from the parental populations. The explanation in the results states that when k = 10 and only 50% of the alleles are shared, the AA would have 15 causal eQTLs (5 from CEU, 5 from YRI, and 5 shared). So wouldn't that be 15 causal eQTLs, not 10? Also, would it not make more sense to use all 5 shared and, for the remaining 5, use 80% of the YRI-specific and 20% of the CEU-specific? While the authors explain the effect of this in the results, it is hard to interpret how the simulation would reflect actual admixture.
We have updated the text in the Results subsection "Admixture influences cross-population gene expression prediction quality under known eQTL architecture" to clarify that the choice of k causal eQTLs refers to the ancestral populations. The first paragraph now reads: "We simulate eQTL architectures under an additive model of size k causal alleles (k = 1, 10, 20, and 40) in the ancestral populations (CEU and YRI)…" The third paragraph now reads: "…for k = 10 causal eQTLs in the ancestral populations (CEU and YRI)." Regarding our choice of eQTL architecture, we originally simulated eQTL architectures with 10 eQTLs per population where AA inherits eQTLs in proportion to admixture, precisely as the reviewer suggests. We considered whether this simulation choice was reasonable and opted for the simulation design presented here, under an assumption of varying ancestry-specific effects on each haplotype. We assert that allowing AA to inherit all eQTLs, and therefore potentially yielding a higher number of causal eQTLs than in CEU and YRI, is a more biologically plausible model of eQTL architecture, and is consistent with other investigations into genomic architecture in heterogeneous populations, particularly Martin et al. 2017. This simulation assumption is also consistent with the behavior of cis-eQTLs as acting in a haplotype-specific manner. We do note in the manuscript that, when ancestry-specific eQTLs exist, this regime inevitably results in a greater number of relevant eQTLs in the admixed population. All of these assumptions have tradeoffs: under the reviewer's model, this would result in something much higher than the intended 50% shared eQTL architecture, given the preponderance of African ancestry in simulated African Americans.

8. I am confused by Figure 6, as the text and the figure do not seem to match. The authors state, "For an architecture with no shared eQTLs, power between CEU to YRI is 0, while power is higher for CEU to AA (0.25) and YRI to AA (0.30)." But the bar chart shows CEU to AA and YRI to AA both to be close to identical at around 0.8.

Thank you for alerting us to this discrepancy. We have updated the text to reflect the numbers in Figure 6.

9. The matching of numbers reported in results to the figures continues in Figure 7, though admittedly closer. The authors wrote, "For example, when h2 = 0.20 and gene expression is predicted from AD to CEU, power at 0% YRI admixture is 0.56 (95% CI: 0.462 - 0.658) and declines linearly with increasing YRI admixture; at 100% YRI, statistical power for AD to CEU is 0.46 (95% CI: 0.362 - 0.558)." But the line depicted is clearly at 0.5 when the YRI admixture is at 100%. The authors need to thoroughly examine all figures to ensure they are depicting what is written in the text.
We thank the reviewer for carefully analyzing Figure 7, but in this case the numbers reflect the average for that specific simulation scenario, whereas Figure 7 shows the linear fits to allow the reader to compare and contrast each scenario. We kept this figure demonstrating the linear trends in the main text. However, based on the reviewer's observation, we have incorporated into the supplement three tables with raw estimates and confidence intervals. The main text now reflects the fact that

This number is chosen to match the h2 computed for whole blood RNA in Figure 3 of Gamazon et al. 2015 (https://www.ncbi.nlm.nih.gov/pubmed/26258848). The rationale behind choosing h2 = 0.15 was merely to provide a realistic heritability level consistent with other investigations of eQTL architecture in GTEx and other datasets. We note that our SAGE RNA-Seq data also come from whole blood, so the h2 used in our simulations is reasonably meaningful for the original observed issue with prediction quality in SAGE.
At this juncture, we also note that the figure in Gamazon et al. (2015) used DGN data, not GTEx as we stated in the text. We apologize for this oversight and have corrected the text accordingly. We also point the reader back to Gamazon et al. (2015).
The revised text reads: "…expression phenotype with cis-heritability h2 = 0.15 (recapitulating the average h2 in DGN whole blood RNA-Seq data [6])"

2. Why did the authors simulate the AA haplotypes as opposed to using the HapMap ASW data and simulating haplotypes from that data?
The simulations in this manuscript are meant to illustrate gene expression prediction under two-way admixture. The AA haplotypes are forward-simulated from CEU and YRI for two reasons. Firstly, we can explicitly track the origins of each haplotype with full confidence. Secondly, we can correctly control ancestral proportions of CEU and YRI contributions to our simulated AA population. Using ASW haplotypes in the simulation could introduce inferential uncertainty about the ancestral origin of each haplotype. In addition, HAPGEN2 is not designed to generate realistic haplotypes from an admixed reference panel such as ASW.
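To make the design concrete, here is a minimal sketch of this kind of forward simulation (illustrative code of ours; the actual simulations use HAPGEN2 output for CEU and YRI, and the pools, segment counts, and helper name below are assumptions). It shows why forward simulation lets us track each segment's ancestry by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy haplotype pools for two ancestral populations (0/1 alleles at L sites).
# Entirely illustrative: the manuscript's simulations draw ancestral
# haplotypes from HAPGEN2, not random bits.
L = 50
ceu = rng.integers(0, 2, size=(100, L))
yri = rng.integers(0, 2, size=(100, L))

def admixed_haplotype(p_yri, n_segments=5, rng=rng):
    """Forward-simulate one admixed haplotype as a mosaic of ancestral
    segments, recording the ancestry of every segment explicitly."""
    bounds = np.linspace(0, L, n_segments + 1).astype(int)
    hap = np.empty(L, dtype=int)
    ancestry = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        # Segment ancestry drawn according to the target admixture proportion.
        pop, pool = ("YRI", yri) if rng.random() < p_yri else ("CEU", ceu)
        donor = pool[rng.integers(len(pool))]
        hap[start:stop] = donor[start:stop]
        ancestry.append(pop)
    return hap, ancestry

hap, ancestry = admixed_haplotype(p_yri=0.8)
# Each segment's ancestral origin is known with certainty, unlike inference
# from real admixed samples such as HapMap ASW.
```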
3. Two genes were removed from the TWAS analysis because they had no SNPs in the simulated data. I think this means there were no SNPs within 2 Mb of these genes. This strikes the reviewer as very odd. What is the explanation for this?
The data are taken from the HapMap3 example datasets included with IMPUTE/HAPGEN2: https://mathgen.stats.ox.ac.uk/impute/impute_v1.html#Using_IMPUTE_with_the_HapMap_Data

Based on the reviewer's question, we re-confirmed in this data freeze of HapMap3 that there are no segregating variants around these genes to provide any signal. To clarify this point, we modified the text in the Methods section to reflect the reviewer's intuition: "We removed two genes, PPP6R2 and MOV10L1, that spanned no polymorphic markers within 2 megabases of their start and end positions in the HapMap3 dataset, resulting in 98 gene models used for analysis."

4. It would be useful to the reader to add the GitHub repositories that contain the PredictDB models that were used in the paper.
A hyperlink to PredictDB already appears under the section "Online Resources", along with links to our own code and results, as well as public data sources such as GTEx, DGN, and GEUVADIS. Note that PredictDB uses GitHub to organize source code, but the prediction models themselves are posted on predictdb.org. As we previously described, we have set up a Box account for releasing our own models.

5. Gene numbers do not agree between the results and the methods. As an example, the Methods state that 10,161 genes were found in at least one weight set, but in the results the number is 11,545.
Thank you for catching this oversight. The correct number is 11,545 genes total. The Methods section now reflects the correct number.
6. The authors point out that the least negative correlation was seen with MESA_AFHI and the most negative with MESA_AFA, but the numbers in the results are not shown in Supplementary Table 1.

Thank you for this as well. The numbers in the text have been updated and now match Supp Table 1.
The third paragraph of Results subsection "Concordance of measured gene expression and PrediXcan predictions is lower than expected" now reads: "The least negative mean correlation across prediction weight sets was observed in GTEx v6p (-0.0044), while the most negative mean correlation (-0.0204) was observed with MESA_AFA (MESA African Americans, Supplementary Table 1). … While there are some fluctuations in prediction accuracy, the fact that correlations vary from -0.0204 to -0.0044 indicates that no prediction weight set produces practically meaningfully better correlations to data than the others."

"Utahans" is misspelled on line 217
We have followed the convention from utah.gov, where the state government uses the demonym "Utahns."

8. Figure 5 is incomplete. The legend reads, "A dotted red line at h2 = 0.95 marks the power values shown in". Shouldn't this be at 0.205 (assuming this is related to Figure 6)? Also, there is no red line.
Reviewer #1 also remarked on this. We thank both reviewers for highlighting this error. The reviewer is correct that Figures 5 and 6 are related - Figure 6 presents a cross-section of the panels of Figure 5 at h 2 = 0.205. The sentence fragment was from a previous version of the figure and has been removed.

Reviewer #3: In this manuscript, the authors describe an impressive set of analyses on the portability of gene expression QTLs (eQTLs) across diverse ancestries. They then extend this work to look at Transcriptome-Wide Association Studies (TWAS), and evaluate the extent of portability of these models across ancestry groups.
In the case of complex traits, we have larger sample sizes and are often discouraged by the losses in portability there, but this paper clearly makes the important point that the situation is even more discouraging for gene expression. The paper is well written and all analyses seem thoughtfully laid out. In addition, there has been a momentous effort to make the data and software freely available, which should be commended. That being said, there are a few limitations to the current analyses, detailed below.
We thank the reviewer for their encouraging remarks about the manuscript. Our responses to the reviewer's concerns are given below.
1) The data come from myriad studies of whole blood, PBMCs, monocytes, and LCLs. There is not a direct comparison between sample types, despite the availability of GTEx models trained on both LCLs and whole blood. Adding such a comparison of trained LCL/WB models would help researchers apply the results. (The lack of direct comparison to monocytes is understandable given the very preliminary nature of the MESA whole blood RNA-seq data generated as part of TOPMed, and the lack of existing TWAS models. As such, I don't think the lack of monocyte comparison is a concern.)

Correctly matching tissues is indeed important when predicting gene expression, but we assure the reviewer that we have given the most direct comparisons possible. We take the MESA monocyte models as-is for the reasons that the reviewer mentions. Comparisons with GTEx and DGN use prediction models trained in the same tissue (whole blood) that we have in SAGE. Using GTEx LCL models in SAGE fails to match tissue types and is unlikely to provide additional insight into cross-population generalizability of the models. Realistic cross-tissue analyses require MultiXcan (see Barbeira et al. 2019, https://doi.org/10.1371/journal.pgen.1007889) and are beyond the scope of this manuscript.
The concern about LCLs presumably concerns GEUVADIS since GEUVADIS expression data were taken from LCLs. But we specifically state in the final paragraph of the Introduction:

"We train, test, and validate predictive models wholly within GEUVADIS with a nested cross-validation scheme."
We previously tested PrediXcan models in GEUVADIS in Mikhaylova and Thornton (2019) (reference #43 in our paper). That analysis of prediction quality in GEUVADIS included both whole blood and LCL prediction models. Incidentally, we observed similar cross-population issues to what we found here, regardless of tissue matching.

2) GTEx, particularly in v8, has a substantial number of African-American and Hispanic participants with LCLs and Whole Blood. If possible, these should be included as a separate group for evaluation, to test whether the claim regarding SAGE and MESA_AFA is generalizable to other datasets. There are numerous different factors which might be contributing to the lack of reproducibility (e.g. WGS vs genotyping arrays; age effects; RNA isolation and sequencing protocols; source material; etc) and including GTEx Whole Blood and LCL expression in HIS and AFA individuals as evaluation sets would enable direct evaluation of some of these factors.
We do agree with the reviewer that disentangling potential technical heterogeneity would help understand these models, but providing a comprehensive investigation of potential sources is beyond the scope of this manuscript. We used PrediXcan whole blood models to match the tissue (whole blood) of our SAGE data, as an example of a real-world scenario and potential application setting. We note that the full SAGE study has nearly 1,800 participants with whole-genome genotype data (of which the 39 with RNA-Seq data are a subset) and would be an ideal candidate for an application such as this. We used the MESA monocyte models for the reasons that the reviewer noted in the previous comment. We did not use LCL prediction models in SAGE since the whole blood models are appropriate and, consistent with the reviewer's comments, LCL models would likely behave quite differently.
Regarding the evaluation of other GTEx groups, we note that GTEx does not identify study subjects as Hispanic or Latinx, so presumably the reviewer is interested in the behavior of PrediXcan models tested in African Americans from GTEx v8. We strongly urge caution when making comparisons between models trained or tested in different versions of GTEx. GTEx v8 uses a separate imputed variant set. While it would be interesting to test in GTEx v8 African Americans, the comparison is not trivial. Furthermore, our analysis with GEUVADIS was intended to test the scenario of limited technical heterogeneity. Our simulations derived from the well-characterized GEUVADIS samples still yielded substantial evidence of impaired cross-population generalizability, as well as some guidance on when models do generalize (namely, when the eQTLs are the same across populations).
3) It would be nice to know whether the 564 genes (Table 1) are representative of the whole transcriptome. A simple violin plot of the expression distribution of these genes and other genes, within each population, would suffice. It might be helpful to see their Test R^2 estimates as well, but I don't think that's critical.
The requested violin plots for SAGE (using GTEx) and GEUVADIS (broken down by population) have been added to the supplement (new Supp Figs #1,3,5,13,14,15,16) and are now referenced in the appropriate paragraphs of the Results section.
In Supp Fig 1 we use log-transformed TPM values from GTEx to show differences in baseline gene expression between all GTEx v7 genes versus the 273- and 39-gene subsets discussed in the manuscript. Supp Figs 3 and 5 show distributional summaries of R2 in SAGE versus test R2 from PredictDB using a "transcriptome-wide" set of 11,545 genes (Supp Fig 3) and the 273-gene subset (Supp Fig 5).
For Supp Figs 13-16 we plotted distributions of testing R2 broken down by train-test scenario; each supplementary figure shows a different subset of genes over the same data. In general, the subsets have genes that are expressed slightly better than average. We note that our original boxplots show that the distributions of R2 cluster around 0 for any reasonably well predicted set of genes. The story is slightly different for the training R2 from each repository, since the 273 genes in common to all repositories are slightly better predicted on average during training. Nevertheless, these 273 genes are not well predicted on average in SAGE.

4) Regarding the TWAS simulation:
While the simulation itself only used 100 genes, I think that readers would also be interested in understanding the power were a whole-genome expression panel used. Is it possible to recompute significance thresholds and provide these "transcriptome-wide" significance estimates as well? It would also be useful to know what the expected variance explained by the genes is -- my understanding, for beta = 1, is that you are adding N(0, 0.1^2) noise still, so heritability should be very high? (It's a bit unclear whether the h^2 = 0.15 applies to the TWAS simulation as well, and lines 143-149 of simulate_twas_sge.R seem to indicate that h^2 isn't used.)

If the reviewer is interested in extending this script for power calculations in whole-genome investigations, the relevant variable in simulate_twas_sge.R for genome-wide power analysis is "ngenes" on line 170, which can be adjusted along with the input sample size. We note that the trends would be expected to be monotonic with those observed in our simulation regime. The simulations use chromosome 22 to allow the simulations to act on a single chromosome. Across numerous simulation instantiations, our simulations provide sufficient complexity without needing to extend the approach to multiple chromosomes.
Regarding phenotype heritability in our simulations of TWAS power, we stated in the Methods that we use a single noise distribution but vary the effect sizes to cover a range of heritability: "Effect sizes were fixed, and we tested various effect magnitudes β = 1 x 10-5, 5 x 10-5, 1 x 10-4, …, 1 x 10-1, 5 x 10-1, 1. The environmental noise ε was drawn from an N(0, 0.1^2) distribution. Consequently, phenotypes only varied with the expression measures from G." This yields a spectrum of phenotypic heritability, or proportion of phenotype variance explained (PVE), as shown on the x-axis of Figure 5 and in the panel labels of Figure 7. We have updated the Methods text to clarify this: "Effect sizes were fixed, and we tested various effect magnitudes β = 1 x 10-5, 5 x 10-5, 1 x 10-4, …, 1 x 10-1, 5 x 10-1, 1, yielding a spectrum of phenotype heritability explained by gene expression. …" In contrast, the genetic heritability of expression is set to 0.15, but that is not the same as the heritability of the TWAS phenotype of interest. The h2 lines in our code are placeholders from when we simulated phenotypes differently and are commented out because they are no longer used. In our current simulation scheme, h2 is calculated from the varied effects and noise.
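For concreteness, the phenotype-generation scheme quoted above can be sketched as follows. This is illustrative code of ours, not an excerpt from simulate_twas_sge.R; the dimensions and the simulate_phenotype helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Illustrative dimensions only (the manuscript simulates 100 genes on
# chromosome 22); "n", "m", and "G" here are stand-ins, not the authors' code.
n, m = 1000, 100
G = rng.standard_normal((n, m))  # stand-in for simulated expression levels

def simulate_phenotype(G, beta, noise_sd=0.1):
    """Simulate y = G @ beta + eps with eps ~ N(0, noise_sd^2) and return
    the phenotype and its realized PVE (share of Var(y) from expression)."""
    genetic = G @ beta
    y = genetic + rng.normal(0.0, noise_sd, size=G.shape[0])
    return y, genetic.var() / y.var()

# Fixed effect sizes: sweeping the magnitude of beta while holding the noise
# distribution constant traces out a spectrum of phenotype heritability.
pves = []
for b in (1e-5, 1e-3, 1e-1):
    _, pve = simulate_phenotype(G, np.full(m, b))
    pves.append(pve)
# Tiny effect magnitudes give PVE near 0; large magnitudes give PVE near 1.
```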
Points of clarification or minor analysis:

5) Given that the SAGE participants were ascertained on the basis of rs28450894, some validation that effect sizes are not out of line at this locus is important to understanding the results. How close are the 273 genes to this SNP, and does this SNP (or a close LD partner) have non-zero weight in any models? It does seem rather unlikely that this SNP (or bronchodilator status in general) is driving the lack of signal, though, so simply acknowledging this and noting these distances and weights is sufficient in my opinion.

We have added two supplementary tables (Supp Tables 3 and 4) and a new paragraph arguing that no obvious ascertainment bias exists in our analysis: "… Table 4)."
6) It appears that there is now a HIS weight set for MESA available on predictdb.org --I would suggest either including, or rewording "all four" (line 153) to "four of the".
We thank the reviewer for alerting us, as we were not aware of this development. We have edited line 153 (now line 154 in the revised manuscript) to say "four of the MESA monocyte weight sets".

7) Are you powered (or is it otherwise possible) to estimate the heritability of the 273 genes in each population? Evaluating whether there is a relationship between R^2 and h^2 would be valuable. Currently, the Test R^2 is the comparison group, which should still be included but is perhaps underpowered and should not be considered a true upper bound. In particular, it would be nice to know if h^2 is different in AFA than in EUR. At the very least, showing a scatterplot of Test R^2 in EUR vs AFA in MESA/GEUVADIS for the different gene sets would be helpful.
The relationship between genetic heritability and R2 between predictions and measurements was established in Figure 3 of Gamazon et al (2015) with DGN. However, DGN has almost 1000 individuals. None of the real datasets in this manuscript (SAGE, n=39; GEUVADIS AFR, n=89; GEUVADIS EUR, n=373) are sufficiently powered to provide reliable heritability estimates, or even test for a significant difference from 0, using GCTA and the method for correcting h2 confidence intervals from FIESTA (Schweiger et al. 2018).
Originally we did investigate h2 vs. R2 for GEUVADIS EUR, but unfortunately that dataset is also limited by sample size.
As we do not have individual-level access to MESA we cannot perform the suggested test-R2 analysis, but the performance of these models was investigated by Mogil et al. 2018 (reference #38 in our manuscript).

8) Adding to Table 1 the number of individuals in the train and test populations, as well as the number of genes with positive correlation, would help with direct interpretation of Table 1 and Table 2. In addition to the R^2 measures, perhaps a direct test of e.g. test statistic inflation would be more interpretable?
Some of the names used in Tables 1 and 2 state the sample size (EUR373 and EUR278). The remainder have sample sizes stated in the text (FIN = 95, AFR/YRI = 89). We have added the sample sizes for each sample in the captions of Tables 1 and 2. For Table 1, the number of genes used in each train-test scenario varies; some training sets produced gene models that others did not, driven in part by different training sample sizes. This mimics precisely the reality of existing PrediXcan models, since each repository has a different number of gene models and a different training sample size. Consequently, we did not mention the number of genes compared; to address this, we have added text in the caption to point interested readers to Supp Table 2. The comparisons in Table 2 were over the same genes. Thus, Table 2 provides an apples-to-apples comparison across genes (but not across sample sizes; this motivates the analysis behind Tables 3 and 4). We also modified the caption of Table 2 to state the number of genes represented.
We understand the reviewer's question about test statistics, and were the models more generalizable in the scenarios in Tables 1 and 2, this might be more illuminating. However, in light of the issues seen during our tests, we believe that displaying the correlations is a more appropriate description and that there would be limited test statistic inflation.

7) What does it mean for CEU and YRI to have 0.0 shared ancestry? I see the simulation used haplotypes from CEU, YRI, or a mix thereof, but these sets of haplotypes can overlap. Some measure of haplotype sharing would be appreciated (or just stating the simulated mixing proportion directly, without making a population-level claim).
For the purposes of our simulation, CEU and YRI are distinct populations. The reviewer is correct to note that the two populations can still share haplotypes, but this reflects the reality of deep shared ancestry that can exist even within recently admixed populations. Given much previous research on this topic, we note that CEU and YRI represent reasonable proxies for the ancestral diversity present in modern African Americans (in particular, see Baharian et al. 2016, reference #56 in our manuscript), so whatever haplotype sharing exists here could also be present in real populations.
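One simple haplotype-sharing summary of the kind the reviewer asks about could be a Jaccard overlap between the sets of distinct haplotypes observed in each population. This is a toy sketch only; the 0/1 haplotype strings below are invented and do not come from the manuscript's data:

```python
# Toy haplotypes over four biallelic sites, encoded as 0/1 strings.
# These values are invented for illustration only.
ceu_haps = {"0011", "0111", "0010", "1100"}
yri_haps = {"0011", "1010", "1110", "1100", "0100"}

def jaccard(a, b):
    """Fraction of distinct haplotypes shared between two populations."""
    return len(a & b) / len(a | b)

# Two of seven distinct haplotypes are shared in this toy example.
print(jaccard(ceu_haps, yri_haps))
```

In practice one would compute such a statistic over windows of phased genotypes, but the idea is the same: sharing can be nonzero between CEU and YRI even when the simulated admixture proportion is 0.0.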
We have replaced all instances of "shared ancestry" with the more precise phrasing "admixture proportion," as the reviewer suggests. The places where we did not change the text are the title of the Results subsection "Power to detect associations declines with decreasing shared ancestry" and the first sentence of that section. We did change the caption of Figure 6; unlike the subsection header, it explicitly refers to the populations CEU, YRI, and the simulated admixed AA.

8) Do you have a sense of why AA->AA is so much better than CEU->CEU or YRI->YRI in Supplemental Figure 9? Should I be interpreting this in the context of total haplotypic diversity in AA?
The figure that the reviewer references, Supp Fig 9, is now Supp Fig 23 in this revision. The results under 0% or 50% shared eQTLs from AA into AA are better than those from CEU or YRI into themselves as a consequence of the eQTL architecture (see Supp Fig 19 in the revised manuscript). As we phrased it, the term "shared eQTL architectures" refers to how many eQTLs were shared by the ancestral populations (CEU and YRI). Since AA inherits all eQTLs, AA typically has more eQTLs than either CEU or YRI. This drives differences in the t-statistics, and therefore influences p-values and power. We note that differences in t-statistics in Supp Fig 23 exist between populations even when all eQTLs are the same across populations; this is a function of the haplotype sharing between AA and its ancestral populations CEU and YRI, as well as the diminished haplotype sharing between CEU and YRI.

9) In Table 3, it looks like the FIN testing population, trained in AFR, does substantially better than the other EUR test populations. Any idea why that might be?
The simplest explanation is that FIN is a single population, while EUR373 and EUR278 are composed of multiple populations (EUR373 = CEU + GBR + TSI + FIN; EUR278 = CEU + GBR + TSI). Consequently, prediction into a heterogeneous test set such as EUR373 or EUR278 may be harder than prediction into the more homogeneous test set FIN.
A second potential explanation is that the population bottleneck in Finnish demographic history has reduced haplotypic diversity in FIN, while the population history of YRI has yielded greater genetic diversity. This means that YRI is capable of predicting reasonably well into FIN, whereas FIN lacks sufficient genetic diversity to predict as effectively into YRI. A similar story could be told about the relationship between YRI and each of the four EUR populations, though we note that the Finnish reflect more isolation than the other three EUR populations. This latter explanation is somewhat supported by Table 4, in which YRI predicts better into EUR populations than EUR populations do into YRI. However, a full investigation of this would need to be completed in future research.
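The train-in-one-population, test-in-another comparison behind Tables 3 and 4 can be sketched in simulation. This is a minimal illustration assuming scikit-learn and NumPy; the populations, allele frequencies, effect sizes, and sample sizes below are entirely invented and are not those used in the manuscript:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)
n, p = 400, 50  # arbitrary sample size and SNP count

# Simulate genotypes (0/1/2) for two hypothetical populations with
# related but non-identical allele frequencies.
freq_a = rng.uniform(0.05, 0.5, size=p)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=p), 0.01, 0.99)
G_a = rng.binomial(2, freq_a, size=(n, p)).astype(float)
G_b = rng.binomial(2, freq_b, size=(n, p)).astype(float)

# Fully shared eQTL architecture: the same 5 causal SNPs and effects.
beta = np.zeros(p)
beta[:5] = rng.normal(0, 0.5, size=5)
y_a = G_a @ beta + rng.normal(0, 1, size=n)
y_b = G_b @ beta + rng.normal(0, 1, size=n)

def cross_r2(G_train, y_train, G_test, y_test):
    """Train an elastic net in one population and report the squared
    correlation of predicted vs observed expression in the other."""
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(G_train, y_train)
    pred = model.predict(G_test)
    if np.std(pred) == 0:  # fully shrunk model predicts a constant
        return 0.0
    return np.corrcoef(pred, y_test)[0, 1] ** 2

print("A -> B:", cross_r2(G_a, y_a, G_b, y_b))
print("B -> A:", cross_r2(G_b, y_b, G_a, y_a))
```

The two directions need not give equal R^2 even with identical effect sizes, because the allele frequency spectra (and hence the variance each SNP contributes) differ between training and testing populations; this is the kind of asymmetry the reviewer observes between YRI and the EUR populations.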

10) For simulated gene expression that went into the TWAS, was there a minor allele frequency cutoff in choosing causal eQTLs? And was this matched across populations?
Yes, the minimum admissible MAF for eQTLs in our simulations was 5%, and the threshold was the same for all populations. Our source code reflects this: https://github.com/klkeys/sage-geuvadis-predixcan/blob/master/analyses/03_hapmap3simulation/src/02_test_prediction_models/simulate_crosspopulation_prediction.R

We have added the following text to the Methods subsection "Simulation of gene expression": "Causal eQTLs were chosen at random among SNPs with at least 5% minor allele frequency. The same 5% minor allele frequency floor was applied to each population."

11) "The comparison of predictive models cannot easily differentiate predictions of 0 (no gene expression) and NA (missing expression) [L570-572]." --Could you please clarify this statement?