Evolutionary Distance of Amino Acid Sequence Orthologs across Macaque Subspecies: Identifying Candidate Genes for SIV Resistance in Chinese Rhesus Macaques

We use the Reciprocal Smallest Distance (RSD) algorithm to identify amino acid sequence orthologs in the Chinese and Indian rhesus macaque draft sequences and estimate the evolutionary distance between such orthologs. We then use GOanna to map gene function annotations and human gene identifiers to the rhesus macaque amino acid sequences. We conclude methodologically by cross-tabulating a list of amino acid orthologs with large divergence scores with a list of genes known to be involved in SIV or HIV pathogenesis. We find that many of the amino acid sequences with large evolutionary divergence scores, as calculated by the RSD algorithm, have been shown to be related to HIV pathogenesis in previous laboratory studies. Four of the strongest candidate genes for SIVmac resistance in Chinese rhesus macaques identified in this study are CDK9, CXCL12, TRIM21, and TRIM32. Additionally, ANKRD30A, CTSZ, GORASP2, GTF2H1, IL13RA1, MUC16, NMDAR1, Notch1, NT5M, PDCD5, RAD50, and TM9SF2 were identified as possible candidates, among others. We failed to find many laboratory experiments contrasting the effects of Indian and Chinese orthologs at these sites on SIVmac pathogenesis, but future comparative studies might hold fertile ground for research into the biological mechanisms underlying innate resistance to SIVmac in Chinese rhesus macaques.


Introduction
Recent work has produced complete genome sequence data for both the Indian [1] and Chinese rhesus macaque [2]. The existence of these draft sequences plays a very crucial role in permitting comparative genomic study of the differentiation of Chinese and Indian rhesus macaque amino acid sequences.
Amino acid sequence divergence between Chinese and Indian rhesus macaques has the potential to influence phenotype, especially response to immune system challenge, as genes related to immune response are some of the most rapidly evolving across species [3][4][5][6]. Additionally, evolution of the regulatory regions of genes is also known to play a very significant role in immunity [7,8]. Thus, heterogeneity in phenotypic response to pathogens is a function not only of what protein is expressed by a gene, but also a function of the timing [9,10], tissue type/biological location [11], quantity [12], and interaction [12] of genes being expressed.
It has been shown that Chinese rhesus macaques generally have increased elite controller status and more frequent long-term non-progression to simian-AIDs relative to Indian rhesus macaques [13,14]. Understanding the genetic sources of heterogeneity in immune response to SIV mac between Chinese and Indian rhesus macaques may thus provide important insights into SIV mac and HIV/AIDS biology more generally.
In this study, we focus on identifying genes related to HIV or SIV mac whose amino acid sequences show high levels of divergence across Chinese and Indian rhesus macaques, as these genes may be strong candidates for the difference in immunity to SIV mac across these subspecies. We do not investigate divergence in non-coding regulatory regions, although we believe such divergence to play a significant role in heterogeneity in immune response across subspecies. Including data from non-coding regions in our analysis would create significant methodological and computational difficulties at this point in time, but we believe that our methods will be amenable to such investigations as computational methods increase in power and efficiency, and genomic data increases in quantity and quality.
Previous comparative genomic investigations for evidence of selection [2] and high throughput single-nucleotide polymorphism (SNP) sequencing and linkage disequilibrium analysis [13] in Chinese and Indian rhesus macaques have found patterns consistent with positive selection on genes such as TCTE1, ZNF337, and TRIM5 [2], and MAST2, ELAVL4, and HIVEP3 [13]. The functional linkages between the gene candidates presented in these papers and resistance to SIV mac in Chinese rhesus macaques still remain ambiguous, however, as further laboratory work is needed to investigate the relevance of Chinese versus Indian orthologs on SIV mac pathogenesis.
In this analysis, we attempt to identify additional candidate genes for SIV mac resistance and test if we can re-identify previously described candidate genes using a new methodology. Our methods are similar at first pass to those used by Yan et al [2], in that we attempt to identify orthologous amino acid sequences between Chinese and Indian rhesus macaques using complete draft sequence data and use synteny mapping to evaluate the performance of the ortholog classifications. Our methods differ in that we do not infer orthology on the basis of synteny, but instead use a Python implementation of the Reciprocal Smallest Distance (RSD) algorithm [15] to evaluate orthology and then use synteny mapping to check the performance of the RSD algorithm.
The RSD algorithm functions to detect putative orthologs using sequence alignment in a way similar to the reciprocal best hit algorithm (RBH) [16]. The RSD algorithm, however, works around a shortcoming of the RBH algorithm that occurs when a forward blast yields a paralog best hit, but a reciprocal blast recovers an ortholog; in such cases, the RBH alogorithm excludes both pairs, but the RSD algorithm can often recover the true ortholog [15]. The RSD algorithm accomplishes this by conducting a forward blast of an amino acid sequence, i, from a draft sequence, I, onto a secondary draft sequence, J, to obtain a list of hits, H. Instead of simply checking for reciprocal best hits, the RSD algorithm moves on to compare each hit, h, meeting threshold criteria against the query sequence, i. A maximum likelihood estimate, e, of the evolutionary distance (substitutions per site) separating h and i is calculated, given an empirical amino acid substitution rate matrix [15]. Note that these numbers might appear surprisingly large if the focal amino acid sequences are of different lengths. For instance, the inferred orthologous amino acid sequences ENSMMUP00000038625 and ENSP00000407071 (GO identifier RAD50) are of very different lengths, increasing the associated e value to 2.2, even though the percent identify matrix from Clustal 2.1 shows a small but plausible sequence identity of 39.47 percent. Of all considered sequences in H, only the sequence with the smallest evolutionary distance is retained; this sequence, j, is then used for a reciprocal blast against the first draft sequence, I [15]. Hits from this blast are treated analogously, and an orthologous pair is considered to be found if and only if sequences i and j are the sequences with reciprocal smallest evolutionary distances [15].
An attractive feature of the RSD algorithm is that it provides estimates of evolutionary distance [17] between each pair of othologs. Given that the Indian and Chinese rhesus macaque sub-populations are estimated to have separated only about 160,000 years ago [18], a large majority of the genome is expected to be relatively constant across sub-populations. Thus, large differences in amino acid sequence between orthologous pairs in Chinese and Indian rhesus macaques are notable. Given that bacteria or viruses with structural, functional, or morphological similarities to SIV mac stand to place selective pressure on genes which might confer resistance or even elite controller status to such infections, and Chinese and Indian rhesus macaques show different normative phenotypes in relation to SIV mac pathogenesis, orthologs with negligible (* e < 0.05) evolutionary distance (about 15,448 of the 17,064 amino acid sequences considered in this study) across Chinese and Indian rhesus macaques are unlikely to be responsible for differences in resistance to SIV mac , conditional on the assumption that coding region variation is the key driver of this phenotypic difference. Orthologs with non-negligible evolutionary distance across Chinese and Indian rhesus macaques are candidates for genes that might be diverging due to a disjunctive array of demographic and selection pressures, of which immune system challenge is only a single example.
To narrow down which of these amino acid sequences might contribute to SIV mac resistance in Chinese rhesus macaques, we used GOanna [19] to associate gene symbols (human homologs) and gene function annotations to each rhesus macaque amino acid sequence, and then cross-tabulated the portion of our candidate list with the highest divergence/evolutionary distance scores (492 genes in total with divergence scores e > 0.15) with a list of 140 genes known to be involved in HIV-1 pathogenesis from a thorough literature review [20], a list of 169 candidate genes for HIV-1 susceptibility derived from a large-scale genome-wide common-disease common-variant analysis of HIV-1 in Africa [21], and four genome-wide siRNA analyses [22][23][24][25].
Additionally, we conducted a literature review of the 25 orthologs with the largest divergence/evolutionary distance scores from our RSD analysis in order to investigate if any of these genes had been associated with HIV or SIV pathogenesis by laboratory studies not cited in the previous literature reviews or association studies [20].
We complete our analysis by: 1) modeling ortholog divergence scores as a function of chromosome and chromosomal location (on the Indian rhesus macaque reference frame) to investigate if there are any distinct chromosomal locations with high concentrations of divergent orthologs, and 2) investigating divergence scores as a function of gene ontology classification categories.
While many disjunctive ecological differences may have led to divergent amino acid sequence evolution between Chinese and Indian rhesus macaques, researchers are especially interested in understanding how divergent evolution in each sub-population has led one subpopulation (Chinese rhesus macaques) to become resistant to the simian analog of HIV [13,14]. While careful laboratory studies are required to uncover the role of specific genes (and their interactions) on SIV mac and HIV resistance, wide-scale comparative genomic scans may provide useful insights and substantially reduce the list of candidate genes to be investigated with more rigorous laboratory methods; uncovering the functional and biological significance of genetic variation is made easier by integrating the partial strengths of genomic, evolutionary, and biochemical data and methods [26].

Establishing Orthology between Chinese and Indian Rhesus Macaque Reference Sequences
Supplementary S1 Table includes the full list of Chinese and Indian amino acid sequence orthologs, paired with the location information of each ortholog on each draft sequence, the estimated divergence score from the RSD algorithm, and the associated gene symbols from GOanna. Fig 1 plots the synteny maps implied by the ortholog classification of the RSD algorithm, dropping a small number of amino acid sequences that the RSD algorithm believes to have undergone inter-chromosomal transposition. These maps show that the vast majority of orthologous pairs are contained in expansive blocks of synteny. We observe only a small number of apparent intra-chromosomal transpositions per chromosome, the majority of which are

Identifying the Orthologs with Highest Divergence between Chinese and Indian Rhesus Macaques
We cross-tabulated our candidate list of 492 genes with divergence values of e > 0.15 with several lists of genes (Supplementary S2 Table) suspected to be involved in HIV or SIV pathogenesis from thorough literature review [20], association studies [21], or genome-wide siRNA analyses [22][23][24][25].
Using this methodology we identified five genes (CDK9, CXCL12, LPL, TRIM32, and TRIM21) from the set of genes described by [20], five genes (CXCL12, CYP26B1, EFR3A, NME6, and PIGX) from the set of genes provided by [21], zero genes from the set of gene provided by [22], four genes (BCAR1, DSP, ICAM4, and SLC35B1) from the set of gene provided by [23], six genes (DAPK2, PARVA, CTSZ, GTF2H1, LYPLA1, and GORASP2) from the set of gene provided by [24], and three genes (ANKRD30A, LPL, and TM9SF2) from the set of genes provided by [25]. These twenty genes are listed in Table 1 with their divergence scores and identifiers on the Indian and Chinese rhesus macaque reference frames.
Additionally, we conducted our own literature review of the 20 genes with the largest divergence scores from our RSD analysis; Table 2 presents our results. We find that at least 7 of the  [20][21][22][23][24][25] and also have highly divergent orthologs across Chinese and Indian rhesus macaques. The label "GOname" indicates the gene name assigned by GOanna. The labels "ProteinID.IR" and "ProteinID.CR" indicate the name of the amino acid sequence in the Indian and Chinese rhesus macaque draft sequence files, respectively. The label "e" is the estimate of evolutionary distance produced by the RSD alogirthm via implentation of PAML [17]. Finally, the labels "Chr.IR" and "Chr.CR" indicate the chromosome on which the amino acid sequences occur in the Indian and Chinese rhesus macaque draft sequence files, respectively.

GOname
ProteinID 20 genes with the highest divergence between the Chinese and Indian orthologs have been linked to HIV or SIV pathogenesis in some way by previous studies.

Identifying Evolutionary Divergence between Chinese and Indian Rhesus Macaque Orthologs as a Function of Chromosomal Region
Figs 2-4 present estimates of evolutionary distance between the Chinese and Indian amino acid sequences as a function of chromosomal location in megabases (Mb). Amino acid sequence divergence levels appear to show weak signs of concentration into regions on a chromosome, although the extent to which this pattern holds may be confounded by the clustering of amino acid sequences themselves into specific regions on each chromosome. To better estimate chromosomal position effects, we use a multi-level, zero-inflated gamma regression model, with random effects on chromosomal position generated using a Gaussian process. Across chromosomes, we see very little evidence for strong chromosomal position effects, as there is very little difference in the estimated probability of a non-zero evolutionary distance or the mean value of evolutionary distance as a function of chromosomal position. These results are plotted in Figs 5-8.
We did observe significant heterogeneity in mean evolutionary distance across chromosomes, with some (e.g. 5, 6, and 17) showing reduced amino acid sequence divergence and others (e.g. 8, 13, 19, and 20) showing increased evolutionary divergence. These results are plotted in Figs 9 and 10. Table 2. Table 2 presents the 20 orthologs with the highest levels of divergence according to the RSD algorithm. Of these 20 genes, 7 have been linked to HIV-1 pathogenesis in some way by previous research. The label "GOname" indicates the gene name assigned by GOanna. The labels "ProteinID. IR" and "ProteinID.CR" indicate the name of the amino acid sequence in the Indian and Chinese rhesus macaque draft sequence files, respectively. The label "e" is the estimate of evolutionary distance produced by the RSD alogirthm via implentation of PAML [17]. Finally, the labels "Chr.IR" and "Chr.CR" indicate the chromosome on which the amino acid sequences occur in the Indian and Chinese rhesus macaque draft sequence files, respectively.

Identifying the Gene Function Classes with Highest Evolutionary Divergence between Chinese and Indian Rhesus Macaques
To investigate the functional roles of genes associated with increased ortholog divergence scores, we used GOanna to link gene annotations with ortholog pairs. Table 3 presents the 20 gene function classes (of the 176 containing at least 100 genes) with highest median divergence scores across the rhesus macaque genome. Supplementary S3 Table contains the complete annotation list paired with ortholog divergence scores. Genes involved in viral processes and immunity show elevated levels of evolutionary divergence. . The red points illustrate the locations of the orthologs (including the orthologs with e = 0) on the horizontal axis, and the black lines represent the divergence scores (e) of these orthologs on the vertical axis. The area of elevated evolutionary distance on the beggining portion of Chromosome 1 corresponds to an area with significant differences in linkage disequilibrium between Indian-derived and Chinese rhesus macaques [13].

RSD Performance
The RSD algorithm appears to function with high accuracy in identifying orthologs, as synteny maps show large blocks of gene order conservation, sometimes spanning the entire chromosome (e.g. on chromosomes 8, 12, 17, and 18). There is, however, evidence of broken synteny on some chromosomes due to single amino acid sequence transpositions (e.g. on chromosomes 1, 4, 7, 9 and 20), as well as block transpositions (e.g. on chromosomes 14, 15, 16, and 19). Additionally, RSD believed some amino acid sequences to have orthologs on different chromosomes. Some researchers [2] opt to exclude proposed orthologs if the pair break blocks of synteny. We have not excluded any potential orthologs from our results based on synteny breaking, but we have marked (in Supplementary S1 Table) which potential orthologs break synteny or occur on different chromosomes (387 of 17,064 proposed orthologs showed such behavior). It is possible that potential orthologs which break synteny are either 1) mis-classified as orthologs by RSD, or 2) poorly localized in the draft sequences. If (1) is the case, then our estimates of evolutionary distance are invalid; if (2) is the case, then our estimates of evolutionary distance are valid, but our localizations are invalid.
We, however, find it possible that many of the transpositions discovered by RSD are true transpositions. RSD uses global information and minimum distance criteria to map orthologs, so misclassifications are unlikely conditional on the global information being complete, amino acid sequences being large, and divergence times being small. Transpositions (both intra-and inter-chromosomal) and small inversions are known to occur during the evolutionary process, and are frequently seen in inter-species mappings [27]. It is extremely unlikely to see transposition of blocks in which synteny is conserved if the sequences in the transposed blocks are not true orthologs. In such cases, however, the appearance of transpositions or inversions may be due to errors in localization of orthologs on one or both of the reference sequences. Transpositions may be indicative of important inter-gene interactions due to chromosomal position effects [28], epigenetic effects [29], or simply lowered odds of recombination separating alleles that have non-additive or synergistic effects (see Maynard-Smith's argument for the evolutionary origin of chromosomes [30,31]). By presenting the data and results without exclusion, we leave it to the reader to interpret the findings, while acknowledging where possible confounding methodological factors may inhibit strong inference.

Candidate Genes
We provide a brief and by no means exhaustive review of laboratory studies that address the functional relationships between the candidate genes for SIV mac resistance uncovered in this study and SIV mac or HIV pathogenesis.
Tripartite Motif Proteins 21 and 32 (TRIM21 and TRIM32). TRIM21 and TRIM32 are among the strongest candidate genes for SIV mac resistance in Chinese rhesus macaques identified in this study, as they are some of the most divergent orthologs in the entire rhesus macaque genome and are known to play an important role in HIV-1 pathogenesis. These genes are members of a family of tripartite motif (TRIM) proteins that are known to be involved in a variety of cellular functions, including differentiation, apoptosis, and immunity [32]; recently, a number of TRIM proteins have been found to display anti-retroviral activities, and have been associated with innate immunity [32][33][34]. In our analysis, however, the amino acid sequences corresponding to TRIM32, were localized on different chromosomes, and thus have elevated odds of being an ortholog misclassification or being poorly localized in one of the draft sequences.
The exact role of TRIM32 in HIV-1 pathogensis is still unclear, as some studies have found that it functions to repress viral gene expression [33], affect trafficking of viral protein [34,35], or attenuate transcription [36,37], while others have found that it affected HIV release from cells, but not viral gene expression [38].
TRIM32 has been shown to modulate interferon production and cellular antiviral responses by targeting MITA/STING [39]. In addition, TRIM32 plays a key role in TNF-α induced apoptosis [40].
A human study found significant difference in TRIM32 expression between natural elite controllers and anti-retroviral theory (ART) suppressed individuals, with TRIM32 expression being elevated in the elite controllers [41]. The comparison between elite controllers and an ART suppressed 'control' group is justified by the authors because HIV-1 replication drives expression of many restriction factors (such as TRIM32) and a comparison of gene expression in two 'aviremic' yet infected groups (human elite controllers and ART suppressed individuals) minimizes the confounding differences in HIV-1 antigen levels that would be introduced by uninfected controls, or untreated individuals [41]. This study indicates that heightened    expression levels of TRIM32 might decrease viral replication, perhaps through interactions with HIV-1 Tat [34,37].
A chimeric protein composed of rhesus macaque-derived TRIM5α and human-derived TRIM21 has been shown to potently inhibit HIV-1 infection [42,43]. Further, TRIM21CypA, a TRIM21-cyclophilin-A fusion protein, provides highly potent protection against HIV-1 without disturbing normal TRIM activity [44]. Some authors have even suggested that TRIM21 itself may be directly involved in anti-HIV activity [45].
More generally, a comparison of human and mouse TRIM proteins found that 21% of all mouse, but only 11% of all human TRIM proteins interfere with HIV-1 release [38]. We hope to see such comparative methods expanded to include comparison of Chinese and Indian rhesus macaque derived TRIM proteins, as TRIM21 and TRIM32 proteins derived from Chinese rhesus macaques may prove to have especially potent anti-HIV properties relative to those of Indian rhesus macaques.
It has been shown that CXCL12 is expressed at mucosal surfaces and can strongly reduce the transmission and propagation of X4-HIV at such sites [49], which helps to explain the differential pathogenesis of R5-and X4-SHIV in rhesus macaque models [50]. A naturally occurring splice variant of CXCL12 has been shown to be a potent HIV-1 entry inhibitor [51]. Table 3. Table 3 presents the 20 gene ontology (GO) categories (of the 176 GO categories containing at least 100 genes) with the highest median levels of divergence according to the RSD algorithm. The label "Zeros" refers to the number of genes in the GO category with a divergence score of zero. The label "N" refers to the number of genes in the GO category. GO categories related to viral processes and immunity are well represented among the GO categories that are diverging at higher rates relative to most other categories. We note that 3 of the 20 (15%) most divergent GO categories are related to viral processes or immunity, while only 2 of the other 156 (1%) of GO categories are related to viral processes or immunity. Table 3  It was thought that a SNP (rs1801157) on CXCL12 might modulate plasma levels, but this effect was not found [46,52]. Further, this SNP on CXCL12 has been thought to augment risk for HIV-1; an effect has been found in some studies, although the effect size is normally small [46]. A recent meta analysis, however, failed to find any significant association between this SNP and HIV-1 susceptibility [53]. Analysis of genetic variation in CXCL12 across Chinese and Indian rhesus macaques, may prove useful in determining if specific variants of CXCL12 confer increased resistance to SIV mac , which might in turn help clarify the role that CXCL12 variants might play in HIV-1 pathogenesis in humans.
Cyclin-dependent kinase 9 (CDK9). CDK9 is a cyclin-dependent kinase that has been linked to HIV-1 pathogenesis in numerous studies-the CDK9/cyclin T1 enzyme, especially, has been shown to play an important role in HIV-1 transcription [54][55][56]. Due to the central nature of CDK9/cyclin T1 in HIV-1 transcription, several studies have investigated CDK9 as a potential drug target [57,58].
Little is known about the role of CDK9 orthologs in SIV mac pathogenesis in vivo in rhesus macaques, but mutant variants of CDK9 lacking kinase activity have been shown to augment the basal transcriptional activity of the HIV-1 long terminal repeat [55].
DNA repair protein (RAD50). HIV-1 replication depends on integration of virally produced DNA into the host-cell's genome via integrase-dependent linkage [65]. It is argued that this linkage leaves an intermediate state that requires post-integration repair; HIV-1 is thought to exploit host double-strand DNA break (DSB) repair pathways in order to complete such repairs [65]. RAD50, along with Nijmegen Breakage Syndrome-1 protein (NBS1) and Meiotic Recombination 11 Homologue (MRE11), is thought to play a role in this process [65]. Little is known about the relationship between RAD50 variants and HIV-1 pathogenesis.
N-methyl-D-aspartate 1 (NMDAR1). NMDAR1 activation is thought to play a role in HIV-1 pathogenesis by modulating HIV-1 gp120-induced blood-brain barrier leakage, in that NMDAR1 antagonists have been shown to protect the blood-brain barrier from gp120-induced damage [66]. NMDAR1 has been shown to be significantly down regulated in HIV-1 infected astrocytes as opposed to control astrocytes [67].
Little is known about the relationship between NMDAR1 variants and HIV-1 pathogenesis; however, a dissertation on differential expression found that the NR1 subunit was significantly downregulated in SIV-infected Indian rhesus macaques, whereas the SIV-infected Chinese rhesus macaques showed no such signs of NR1 downregulation [68].
Mucin 16 (MUC16). MUC16 is a large membrane-associated mucin [69,70]. Mucins are present at the surface of epithelial cells in the male and female genital tracts and are thought to play an important role in immune defense by trapping and eliminating pathogens before they can cause infection [70]. Some mucins have been found to be potent in neutralizing HIV-1 (e.g. MUC5B, MUC6, and MUC7) [71,72], but the role of MUC16 is yet to be investigated.
Other Genes of Interest. ANKRD30A is a recently-discovered host factor for HIV-1 infection [25]; however, little is known about the role of ANKRD30A in HIV biology [73].
IL13RA1 was shown to vary in expression between HIV-1 infected and non-infected controls [74].
Expression of NT5M, a 5'-nucleotidase gene, was shown to vary (downregulation in macrophages) in response to Tenofovir, a pre-exposure prophylaxis used to prevent HIV-1 infection. Tenofovir functions to cause chain termination when incorporated into viral cDNA, and can thus inhibit viral replication [75]. Although Tenofovir has a direct effect on HIV-1 pathogenesis, immunomodulatory effects of Tenofovir, such as increased expression and secretion of cytokines and modulated expression of 5'-nucleotidases like NT5M and NT5E, may also be responsible for its anti-viral activity [75].
The Tat protein of HIV-1 has been shown to interact with Notch1 in vitro [76]. Further studies have demonstrated linkages between HIV-1 and HIV-associated nephropathy (and progressive renal failure), which is thought to arise through activation of the Notch signaling pathway [77].
PDCD5 is a signal gene for apoptosis and was found to be upregulated in HIV-infected individuals relative to controls [78].
CTSZ was one of 221 differentially expressed genes up-regulated in the peripheral blood mononuclear cells of HCV/HIV co-infected individuals versus non-infected controls [79].
GORASP2 was found to regulate HIV-1 infection, as it was one of 37 genes that decreased MAGI cell β-galactosidase reporter activity at least 2-fold for at least three out of four siRNAs in a study of HIV-responsive phosphoproteins [80].
GTF2H1 was found to be strongly upregulated after monocyte-derived dendritic cells were experimentally infected with HIV-1 [81].
Several SNPs in the region of TM9SF2 showed signs of association with human AIDS progression in vivo [82].
LPL, lipoprotein lipase, was associated with HIV in a genome-wide siRNA screen [25]; however, the role that LPL plays in HIV pathogenesis, if any, is not well understood. This being said, it has been shown that some HIV protease inhibitors bind to a site on HIV-1 protease that shares approximately 60% homology to a region within LPL [83]. For this reason, anti-retroviral therapy has been implicated in peripheral fat wasting (lipodystrophy), central adiposity, hyperlipidaemia, and insulin resistance in human subjects [83].
Knockdown of ICAM4 and BCAR1 by siRNA enhances the early stages of HIV-1 replication in HIV-infected CD4 cells [22].

Shortcomings of this Study
Notably, 3 of the 20 orthologs in Table 1, and 4 of the 20 orthologs in Table 2, occurred in different chromosomes. As the evolutionary distance between proposed orthologs increases, it becomes more likely that they may actually be misclassified, especially if the pair breaks synteny. We cannot be sure if these transpositions are real, or if they simply result from classification errors by the RSD algorithm.
Furthermore, this study contains a significant shortcoming in that it uses a form of comparative genomic analysis that is more aptly applied to inter-species comparisons than to intraspecies, inter-subspecies comparisons. This is because the use of single genome sequences to estimate the evolutionary divergence of orthologs is more defensible when amino acid sequences are more variable between groups than within groups. This assumption is not necessarily met in our study, because there is a strong potential for many of the amino acid sequences included in this analysis to be as variable within groups as between groups.
There are sure to be numerous amino acid sequences in our data for which our results reduce simply to a comparison between individual animals, rather than a true comparison between the normative genomic characteristics of sub-subspecies. While this weakness is inherent in our study design, it does not necessarily invalidate our results. Numerous genetic comparisons of Chinese and Indians rhesus macaques have shown that they differ genetically in normative ways at many loci. For instance, of 661 genic SNPs distributed almost uniformly across the genome, 457 were found strictly in one sub-population [84]. Furthermore, discriminant analysis of principal components conducted using 2,808 evenly distributed, (mostly) noncoding SNPs found that the genomic profiles of Chinese rhesus macaques clustered tightly together, with almost no overlap with the distribution of the genomic profiles of Indian rhesus macaques [85]. Further research, however, is needed to investigate amino acid sequence variation in candidate loci across rhesus macaque subspecies using multiple animals from each subspecies.

Conclusions
We used estimates of the evolutionary distance of orthologous amino acid sequences in Chinese and Indian rhesus macaques to create a candidate list of orthlogs that might be diverging due to selection pressures or demographic processes. To the extent that normative phenotypic differences in immune response to SIV mac are due to difference in amino acid sequences across rhesus macaque sub-populations, SIV mac resistance in Chinese, but not Indian rhesus macaques should be associated with divergent orthologs. We cross-tabulated a list of divergent orthologs with lists of genes known to be involved in HIV or SIV pathogenesis. We identified 20 candidate genes that occurred on both gene lists and found that several other highly divergent orthologs had been previously linked to HIV or SIV as well.
There appear to have been few or no prior studies of differences in coding variants or expression of these genes between rhesus macaque sub-populations, nor studies investigating the effect of differing rhesus macaque variants of these genes on SIV or HIV pathogenesis. Such studies in the future may help to shed light on the molecular mechanisms underlying differences in normative SIV mac resistance between these macaque subspecies.
Comparative genomic work can narrow the set of candidate genes to be investigated with more rigorous laboratory methods and case control studies; careful evaluation of genetic differences in these candidate loci across Indian and Chinese rhesus macaque sub-populations, and evaluation of the effect of these differences on SIV mac pathogenesis, is needed in order to identify the structural and functional differences in genes underpinning SIV mac resistance in Chinese, but not Indian rhesus macaques.

Data Sources
Amino acid sequence data for the Chinese rhesus macaque was accessed from the Comprehensive Library for Modern Biotechnology, BGI <http://climb.genomics.cn/10.5524/100002>, and amino acid sequence data for the Indian rhesus macaque was accessed from the Monkey Database, BGI <http://macaque.genomics.org.cn/page/species/index.jsp> in February, 2014.

Software, Programming Environment, and Analytical Methods
A Python implementation of the RSD 1.1.6 algorithm [15] was used in data analysis. Local installation of the script was conducted following instructions on GitHub <https://github.com/ todddeluca/reciprocal_smallest_distance>. RSD 1.1.6 depends on Python 2.7, NCBI BLAST 2.2.26, PAML 4.4, and Kalign 2.04, so these programs were installed locally as well (February, 2014).
The GOanna [19] program was used to associate amino acid sequence data with human gene IDs and gene ontology annotations.
All other data analysis and visualization was conducted in the the R programming environment [86]; the seqinr package was used to read the FASTA sequence files.
Statistical modeling was conducted using the Stan version 2.5 C++ library for Hamiltonian Markov Chain Monte Carlo simulation, using the RStan package in R [87].

Statistical Methods
We use a multi-level zero-inflated gamma regression model to estimate differences in the evolutionary distance of orthologs as a function of chromosome and chromosomal position. Random effects on chromosome control for heterogeneity in evolutionary distance across chromosomes, and chromosome-specific vectors of random effects generated using a Gaussian process control for heterogeneity in evolutionary rate as a function of location on each chromosome. More formally, on each chromosome, c, we model the evolutionary distance of the n th amino acid sequence, Y [c,n] , using the probability function: Y ½c;n $ ( GammaðA ½c;n ; B ½c Þ if Y ½c;n > 0; and Bernoulliðy ½c;n Þ if Y ½c;n ¼ 0 We note that the mean of a gamma distribution is defined as m ¼ A B . As such, A [c,n] is defined from a model of the mean of a gamma distribution as: where the correlation matrices Γ and P are formed as functions of the unit-normalized distance, D, between chromosomal zones z 1 and z 2 squared: P ½z 1 ;z 2 ¼ ( k ½c expðÀx ½c D 2 ½z 1 ;z 2 Þ if z 1 6 ¼ z 2 ; and where the parameters ρ and κ 2 (0, 1) represent the maximum correlation, and the parameters z and ξ 2 R + modulate the decay of correlation with distance between chromosomal zones. A more finely resolved Gaussian process would use a random effect on each observation, but given that the computational requirements of a Gaussian process model scale at about O(n 3 ), it is unfeasible to model each data point as arising from a Gaussian process. Instead, we divide each chromosome into Z = 40 ordered and equally dense zones, and let the Gaussian process generate a random effect that is shared among all amino acid sequences in that zone. We denote the zone of the n th amino acid sequence on a chromosome using the function notation Z (n), to return the zone, z, of the n th amino acid sequence.
To complete the Bayesian model, we use multi-level partial pooling to share information across chromosomes: a ½c $ Normalðm a ; s a Þ ð8Þ We specify weakly regularizing priors on μ α and μ β : and vague priors on σ α and σ β : s a $ Cauchyð0; 3ÞT½0; 1 ð12Þ where the truncation operator, T[0, 1], indicates truncated support on the interval (0, 1). Each element of the vector of scale parameters, B, for the gamma regression model is given a weakly informative prior: Each element of γ and ν, the variance parameters, get vague priors: g ½c $ Cauchyð0; 3ÞT½0; 1 ð15Þ n ½c $ Cauchyð0; 3ÞT½0; 1 ð16Þ Because the elements of ρ and κ are constrained to the unit interval, they get weakly informative Beta priors: Finally, the elements of the decay parameter vectors, z and ξ, get weakly informative priors: z ½c $ Normalð2; 3ÞT½0; 1 ð19Þ x ½c $ Normalð2; 3ÞT½0; 1 ð20Þ Supporting Information S1 Table. This file (Excel format) contains the full set of orthologs and evolutionary distances from our analysis. We have marked the genes that break synteny.
S2 Table. This file (Excel format) contains the full set of previously described HIV-related genes utilized to construct results presented in Table 1. Table. This file (Excel format) contains the full set of gene function classifications for which GOanna could amend annotations and for which the RSD algorithm found orthologs. (XLSX) S4 Table. This file (Excel format) contains the summary of the posterior parameter estimates from MCMC simulation using the Stan 2.5 C++ library.
(XLSX) S1 Code. This file (R script) contains the code used to construct our statistical model.