A novel variant of torque teno virus 7 identified in patients with Kawasaki disease

Kawasaki disease (KD), first identified in 1967, is a pediatric vasculitis of unknown etiology that has an increasing incidence in Japan and many other countries. KD can cause coronary artery aneurysms. Its epidemiological characteristics, such as seasonality and clinical picture of acute systemic inflammation with prodromal intestinal/respiratory symptoms, suggest an infectious etiology for KD. Interestingly, multiple host genotypes have been identified as predisposing factors for KD. To explore experimental methodology for identifying etiological agent(s) for KD and to optimize epidemiological study design (particularly the sample size) for future studies, we conducted a pilot study. For a 1-year period, we prospectively enrolled 11 patients with KD. To each KD patient, we assigned two control individuals (one with diarrhea and the other with respiratory infections), matched for age, sex, and season of diagnosis. During the acute phase of disease, we collected peripheral blood, nasopharyngeal aspirate, and feces. We also determined genotypes, to identify those that confer susceptibility to KD. There was no statistically significant difference in the frequency of the risk genotypes between KD patients and control subjects. We also used unbiased metagenomic sequencing to analyze these samples. Metagenomic sequencing and PCR detected torque teno virus 7 (TTV7) in two patients with KD (18%), but not in control subjects (P = 0.111). Sanger sequencing revealed that the TTV7 found in the two KD patients contained almost identical variants in nucleotide and identical changes in resulting amino acid, relative to the reference sequence. Additionally, we estimated the sample size that would be required to demonstrate a statistical correlation between TTV7 and KD. Future larger scale studies with carefully optimized metagenomic sequencing experiments and adequate sample size are warranted to further examine the association between KD and potential pathogens, including TTV7.


Introduction
In 1967, Kawasaki disease (KD) was first reported as a pediatric illness with fever and systemic mucocutaneous inflammation [1]. KD is a vasculitic disease that causes a high rate of dilatation of coronary arteries [2] and may cause subsequent chronic cardiac complications [3]. Although the incidence of KD has increased in Japan and many other countries, its etiology remains unknown.
The incidence of KD presents with seasonality [4,5]. KD is also often accompanied by prodromal intestinal and/or respiratory symptoms [6], and is preceded by contact with sick children [7,8]. These characteristics suggest that infections might cause KD. In particular, the refractory response of KD to antibiotic therapy supports a viral etiology model [9], while the involvement of bacterial infections in KD pathogenesis is also hypothesized [10]. Although a correlation between viral infections and KD was reported [11,12], these correlations were not reproduced in other studies [13][14][15]. Therefore, it remains to be elucidated whether the KDassociated infections were causal or incidental.
Importantly, KD incidence is much higher in populations with East Asian ancestry, in particular Japanese, compared to other ancestries [16]. This observation suggests a genetic propensity for KD among East Asian populations. Genetic surveys have identified variations at multiple loci as predisposing factors, including the genes for inositol 1,4,5-trisphosphate 3-kinase C (ITPKC) [17] and B-lymphoid tyrosine kinase (BLK) [18,19]. ITPKC and BLK are related to immune regulation, suggesting that the pathophysiology of KD is closely linked to immune dysregulation. However, the high-risk alleles of single nucleotide polymorphisms (SNPs) are not necessarily more frequent in East Asian populations than in Caucasian populations; thus, the etiology of KD is still disputed [20]. Overall, it is widely assumed that one or more infections trigger an immunological reaction that leads to KD in genetically susceptible individuals [21].
In the present study, we aimed to explore experimental and epidemiological methodologies to identify an infectious cause for KD. In this report, we focused upon viral infections, whereas our analysis of bacterial infections in our samples will be reported elsewhere. Here, we propose a detailed protocol for metagenomic sequencing that would efficiently compare the microbial profiles between patients with KD and control individuals. We also demonstrate a possibility to study how interaction between the viral infection profile (i.e., the virome) and host genetic predisposition (particularly, ITPKC and BLK) might be involved in the development of KD. Our explorations resulted the identification of a new variant of torque teno virus 7 (TTV7) that was present only in two patients with KD.
In addition to effective experimental procedures, an epidemiological study design is essential for future large-scale studies to explore potential causative agent(s) of KD. The sample size is especially critical-by adopting an excessively large sample size, an investigation might detect a microbe of even negligible clinical importance as a potential culprit, whereas an inadequately small sample size might fail to detect a microbe that has a meaningful association with KD. Therefore, we estimated the sample size required to effectively demonstrate a statistically significant association between viral infections and KD.

Participants
For a one-year study period, we enrolled 11 patients with KD and 22 control subjects (11 with diarrhea and 11 with respiratory infections) who were matched for sex, age and season of diagnosis (Table 1). As a result, we defined three groups of patients: 1) KD patients, 2) diarrhea controls, and 3) respiratory infection controls. The latter two groups constituted the control individuals. In terms of the number of KD-susceptible alleles of ITPKC or BLK per person, there were no statistically significant differences between the KD patients and the two control groups (Table 2).

Metagenomic sequencing of pooled samples
From each participant, we prepared several sample types, including serum, whole blood(WB) DNA, WB cDNA, nasopharyngeal aspirates (NPA), and feces. Half of each serum and NPA  GG  GG  GG  GG  AA  TT  TT  CC   2  GG  GG  GG  AG  GG  GG  CT  TT  CT   3  GG  GG  GG  GG  AA  GG  TT  CC  TT   4  GC  GG  GG  AA  AG  GG  CC  CT  TT   5  GG  GG  GC  GG  AG  GG  TT  CT  TT   6  GG  GG  GG  GG  AG  GG  TT  CT  TT   7  GG  GG  GG  GG  AG  AG  TT  CT  CT   8  GG  GG  GG  AG  GG  AG  CT  TT  CT   9  CC  GG  GC  AG  GG  GG  CT  TT  TT   10  GG  GC  GC  AG  AG  GG  CT  CT  TT   11  GG  GG  GG  AG  AG  GG  CT  CT  sample were treated with nuclease, to minimize any interference by the host DNA [22]. The samples were pooled according to sample type, within each of the three groups (i.e. KD, diarrhea, and respiratory infections). Pooled samples were used in metagenomic sequencing. After quality trimming, we obtained a total of 1,073,981,148 paired-end reads (accession number: DRA007000). We analyzed the dataset using the Livermore Metagenomic Analysis Toolkit (LMAT) [23][24][25]. After computational subtraction of human genomic sequences, we mapped 6,270,435 reads to microbial sequences. Specifically, we mapped 5,301,257 reads (85%) to bacteria (154 families, 460 genera, 1,644 species); 594,900 reads (9.5%) to viruses (16 families, 31 genera, 69 species); and 374,278 reads (6.0%) to eukaryotes. Table 3 shows the number of reads, classified as either viruses or bacteria, in each of the pooled samples. Nuclease treatment resulted in a much greater number of reads. Viral reads constituted the majority of the microbial reads in serum with and without nuclease processing, WB DNA, and WB cDNA, pooled from the KD patients. Fig 1 shows the composition of reads mapped to viral species (i.e. virome) in each sample (S1 Table). Among these viral species, torque teno virus 7 (TTV7) constituted the vast majority of reads in any of the sample types pooled from KD patients, but not in samples pooled from the control subjects.

Variants of TTV7
The reads from metagenomic sequencing were mapped to the reference sequence of TTV7 (Fig 2). For all sample types, the reads obtained from KD patients were mapped to both the coding region [i.e. open reading frame (ORF)] and the non-coding region. However, in the control subjects, the reads were mapped only to the non-coding regions in most of the sample types. This finding suggests that TTV(s) other than TTV7 exist in the control subjects, because non-coding regions are shared by multiple TTV strains, whereas only the ORFs are specific to individual TTV strains (S1 Fig). To identify anelloviruses other than TTV7, we mapped the reads to all human anelloviruses: TTV1 to 29, torque teno mini virus (TTMV) 1 to 9, and torque teno midi virus (TTMDV) 1 and 2. We found that the WB DNA of all three groups (i.e. KD, diarrhea, and respiratory infection) contained multiple TTVs (S2 Fig).

PCR and Sanger sequencing of TTV7
Using PCR, we detected TTV7 sequence in WB DNA, NPA, and feces from two KD patients (seq. 1 and seq. 5 in Table 1), but not in samples from the control individuals (Table 4). We used Sanger method to sequence the TTV7 genome from the two KD patients. While the reference sequence was 3,736 base pairs in length, we sequenced 3,699 base pairs, excluding the GC-rich region. As a result, we found that these two DNA sequences were very similar ( Fig 3A  and 3C, S2 Table). The sequences of these TTV7s are available at DNA Data Bank of Japan Fecal samples were not available for subjects with respiratory infections. NA, not available. Seq corresponds to the sequence number in Table 1.  (Table 1) and revealed variants in nucleotides (a), which resulted in amino acid changes (b). Similarly, TTV7 in KD patient seq. 5 harbored variants in nucleotides (c), which changed amino acids (d). The amino acid changes are identical between the two TTV7s. Spread sheet data is available in S2 Table. Splicing is not considered in the analysis of amino acid changes. (accession numbers: LC385662, LC385663). The translated amino acid sequences resulting from these two TTV7 sequences were identical ( Fig 3B and 3D, and S2 Table). In addition, it is possible that the changes in the nucleotide sequence affected the ORFs in these TTV7s (S3 Fig). Considering the complex processes of replication in TTVs (especially, alternative splicing) [26], the impact of these mutations may not be limited to these amino acid changes.

Metagenomic sequencing of individual samples
Because the reads mapped to TTV7 from WB DNA samples were the most homogeneous and the greatest in number (Fig 2), we employed WB DNA for the metagenomic sequencing of individual subjects (n = 33). Each individual sample yielded 13,387,302 reads on average (range: 11,593,598-14,805,260 reads), of which 93-94% were human sequences. The LMAT analysis on these datasets reported more than 10 viral reads for two viruses, which were found in only three individuals: TTV7 in two KD patients (seq. 1 and 5 in Table 1) and Human Herpes Virus 6B (HHV6B) in one KD patient (seq. 8).
We mapped the metagenomic sequencing reads to these viruses and to all human anelloviruses. As shown in Fig 4, we detected TTV7 in the two KD patients (seq. 1 and 5), but not in the control subjects ( Table 4). The depth of reads mapped to human genome was 0.42 per base in average, while the depth of reads mapped to TTV7 genome in the two KD patients was 3.0, demonstrating the abundance of TTV7. On the other hand, the depth of reads mapped to Similarly, the reads from KD patient seq. 5 were mapped to TTV7 (c) and generated variants (d). Although the depth was shallow, the reads covered the entire genome seamlessly (a, c). The variants (b, d) resembled the variants determined by Sanger sequencing (Fig 3), with limited accuracy due to the shallow depth. HHV6B genome (>160,000 bases) in the KD patient seq. 8 was only 0.078, indicating that HHV6B may not be present in this sample. Therefore, TTV7 was the only virus that was unambiguously present. This decreased sensitivity, when compared to the metagenomic sequencing of the pooled samples, was possibly due to the whole transcriptome amplification that we used only for the pooled samples but not for the individual samples (please see the Methods section). In summary, this result demonstrates that TTV7 had a much higher viral load in the peripheral blood than other viruses.

Re-evaluation of pooled samples
In metagenomic sequencing of the pooled samples, some reads from the control individuals apparently mapped to TTV7. For example, a considerable number of reads from the pooled WB DNA of the control diarrheal patients appeared to map to TTV7 (Fig 1). To re-evaluate whether TTV7 existed in the control diarrheal patients, we mapped the reads in the pooled WB DNA of the control diarrheal patients to the TTV7 reference genome. Of the 10,041 reads that mapped to TTV7, only 40 reads (0.4%) were mapped to the ORFs of TTV7 (Fig 5A).
Because only the ORFs of TTV7 are strain specific (S1 Fig), this finding implied that almost all of the reads that appeared to map to TTV7 in fact originated from TTVs other than TTV7. In addition, the 40 reads that mapped to the ORFs of TTV7 did not contain any of the variant nucleotides that we identified in this report (Fig 5B). In contrast, among the 193,415 reads from the pooled WB DNA of the patients with KD that mapped to TTV7, 156,566 reads (81%) mapped to the ORFs of TTV7 (Fig 5C) and contained the variant nucleotides ( Fig 5D). We repeated this bioinformatic analysis for other types of samples and confirmed that the variant TTV7 was not present in the control individuals. Therefore, the apparent identification of TTV7 in the control individuals was most likely erroneous because closely related strains of TTVs were simultaneously present in those individuals [27].

Statistical analysis
Two of 11 KD patients were positive for TTV7 sequence, whereas all 22 controls were negative (P = 0.104, by Fisher's exact test). Because the samples were matched, we employed exact logistic regression analysis to estimate the dependence of KD on the presence of TTV7 and the number of risk alleles of ITPKC and BLK in each individual. Due to the small sample size, we did not conduct a multivariate analysis. As shown in Table 5, we found that dependence of KD on the presence of TTV7 was not significant, with an odds ratio (OR) of 4.8 (P = 0.111). The dependences of KD on ITPKC and BLK were also not significant, with weaker ORs. Table 6 shows the sample size that would be necessary to detect a statistically significant association between TTV7 and KD. We also conducted a sensitivity analysis based upon multiple assumptions regarding the prevalence of TTV7 in the control population.

Discussion
We applied metagenomic sequencing to samples pooled from patients with KD, as well as from control subjects. We found that most of the sequencing reads in blood samples pooled from patients with KD were of viral origin, and the vast majority of these viral reads were TTV7. We also detected a variant of TTV7 in two patients with KD, whereas all of the matched control subjects were negative for this variant TTV7. Notably, the nucleotide sequences of TTV7 in these two KD patients were almost identical. Our small pilot study found only a marginal, non-significant correlation between KD and TTV7, which was comparable to the correlations between the host genotypes and KD. TTV was first identified by Nishizawa et al. in 1997 [28]. Subsequently, TTV and newly identified TTV-related viruses formed a novel viral family, Anelloviridae [29]. Currently, the Anelloviridae family contains three human viral genera (TTV, TTMV, and TTMDV) and an expanding list of non-human genera. The pathogenicity of anelloviruses remains unknown, except for minor respiratory symptoms and fever in children [30,31]. Among the TTVs, TTV7 was reported in 2000 [32].
In our study, the anellovirus TTV7 was the most abundant virus that was present in feces, upper airway, and blood of two KD patients. This pantrophic characteristic of TTV7 is consistent with the intestinal and respiratory symptoms that are prodromal to KD. Previous epidemiological studies predicted that at least one of the etiological agent(s) for KD would be capable of persistent infection [33,34]. Notably, persistent infection is a characteristic reported to be shared by the anelloviruses [35].
Furthermore, it was mathematically predicted that the infectious etiological agent of KD would be relatively common [34], based upon the young mean age of KD patients in Japan [36]. Consistent with this line of thought, TTVs are highly prevalent among healthy adult individuals [37]. However, TTV7 was reported to be rare [38]. In our study, TTV7 was detected in only two individuals among the 33 participants. The relative rarity of TTV7 is not consistent with epidemiological predictions for a common etiological agent of KD. Therefore, we must be cautious before hypothesizing that TTV7 is the sole etiological agent of KD. Instead, we hypothesize that TTV7 may be one of the multiple agents contributing to KD etiology.
TTVs have often appeared in the etiological studies of KD. For example, TTV was the only virus detected by PCR in a lymph node biopsy from a KD patient [39]. In contrast, an early anellovirus research study did not find a statistical correlation between TTV and KD [14]. However, we found that the PCR primer sequences used in this early study [14] did not capture many strains of TTV, including TTV7. This lack of sensitivity was because these early PCR primers did not account for the diverse variation in TTVs. Although recent degenerative PCR primers are more inclusive of this diversity [37], the differentiation between the TTV genotypes as well as the quantification of their viral loads remain a challenge for the field. In contrast, the metagenomic sequencing methodology, which was used for the pooled samples in this study, provides a relative quantification of viral loads in the form of number of reads and allows for differentiation between strains of TTV. However, the subsequent metagenomic sequencing method that we employed for the individual samples was not very sensitive. This limitation highlights the importance of careful optimization of sequencing methods that is specific for the sample types and the targeted viruses. We also estimate that a larger sample size will be necessary to yield a statistically significant result in a future etiological study of KD. Even though our sample size of KD patients was small, we could enroll the matched control subjects only with the assistance of an in-house software. This predicts the challenge in enrolling matched control subjects in a future study.
Hypothetically, the discovery of infectious etiological agent(s) should be confirmed by a temporal association between the infection and the disease (i.e., an infection with the etiological agent should precede development of the disease). That criterion was not met, however, in almost all previous etiological discoveries. Rather, statistical correlation between the presence of an infectious agent and the development of disease was employed to suggest etiology, at least in the early stage [40]. To prove a temporal association between a viral agent and KD, it would be necessary to collect samples sequentially from healthy infants until either they develop KD or they grow too old to develop KD (e.g., at 20 years of age). Considering the probability that a Japanese child will experience KD at some time in his or her life (1% of all children), such a study design would require a huge number of participants (e.g., 10,000 healthy infants phlebotomized regularly for 20 years to identify 100 patients with KD). The metagenomic sequencing of such a large number of samples would be cost-prohibitive. An alternative approach is an animal model. It might be possible to study the etiology of KD by administering a candidate microbe (e.g., TTV7) into mice and observing the resulting pathology in the coronary arteries [41][42][43]. Most microbes, including TTVs, are adapted to their natural host species, hence often induce exaggerated immunological responses in species other than their natural host. Therefore, studies using animal models may have limitations.
Our pilot study provides insights to inform future investigations of KD etiology. The sample size predicted from our study, the experimental procedures and data analysis methods will benefit future large-scale epidemiological studies. These larger scale studies may confirm TTV7 and discover additional microbial agents that could contribute to the etiology of KD. Studies with animal models might also contribute to the elucidation of the pathological process caused by these candidate microbes.

Ethical statement
This study was approved by the ethics committee of Japan Community Health Care Organization (JCHO) Osaka Hospital. Written informed consent was obtained from the guardians of the participants.

Participants
Between October 2015 and September 2016, we enrolled 11 patients with KD. The diagnosis was based upon the criteria of Japanese Circulation Society [44]. We found that it was impractical to use completely healthy individuals as controls because healthy children rarely visit a hospital. Therefore, we enrolled febrile individuals as controls. In enrolling matched control individuals, we adopted a 1:n (n>1) design, because that approach increases the statistical power as compared with a 1:1 design. However, the controls assigned to each patient with KD should be homogeneous across all the patients with KD: otherwise, nonhomogeneous controls might induce biases and thus generate unpredictable results. To achieve homogeneity among the control individuals, for each patient with KD, we enrolled two febrile control subjects: one with diarrhea and the other with respiratory infection. The control subjects were matched to a patient with KD by gender, age, and season of diagnosis. The difference in the age between a KD patient and the matched control subject was maximally 0.8 year, if the KD patient was younger than 4.5 years. For older patients with KD, the control subjects were also selected from children older than 4.5 years. The difference in the time-point at presentation between a KD patient and the matched control subject was no greater than 2 months. We developed a software to assist in selecting the candidates of control subjects on the daily basis.
We defined a patient of respiratory infection as a presentation of sore throat, cough, rales, and/or wheezes. We excluded diarrhea of non-infectious origin (e.g. anaphylaxis, druginduced) from the control subjects with diarrhea. All the control subjects were either febrile (>38.0˚C) at presentation, or within 24 hours after defervescence. Subjects with conjunctivitis, rash, and/or lymph-node enlargement were excluded from the control group. We collected samples from KD patients before initiating treatment with intravenous immunoglobulin.

Sample collection
Peripheral blood was phlebotomized in maximally sterile conditions and the serum was immediately aliquoted. One sterile swab collected nasal aspirate, while another swab collected pharyngeal aspirate. These two swabs were stored in a single Universal Transfer Medium tube (COPAN, Brescia, Italy) to produce NPA. Feces were also collected into another Universal Transfer Medium tube. We obtained feces from only two groups (six KD patients and six matched diarrheal patients). Aliquoted serum, blood clot which remained after separation of serum, and NPA were frozen within 1 hour after sample collection, whereas feces were frozen within 6 hours. The samples were preserved at -80˚C until being used.

Preparation of DNA and cDNA from WB
To identify the host genotype and to detect the microbial genomes in the peripheral blood cells, we extracted DNA and RNA from the blood clot. Subsequently, the RNA was converted to cDNA. We used a High Pure PCR Template Preparation Kit (Roche, Penzberg, Germany) for DNA extraction and a NucleoSpin RNA Blood kit (MACHEREY-NAGEL GmbH & Co. KG, Gueren, Germany) for RNA extraction. We immediately converted the extracted RNAs to cDNA with a Transcriptor First Strand cDNA Synthesis Kit (Roche, Penzberg, Germany). To synthesize cDNA, the RNA was denatured in a 13 μl cocktail containing 10 μl total RNA, 2 μl Random hexamer primers, and 1μl Anchored oligo (dT)18 primer at 65ºC for 10 minutes. Then, we added 0.5 μl Protector RNase Inhibitor, 0.5 μl Transcriptor Reverse Transcriptase, 2 μl dNTP Mix and 4 μl Transcriptor RT Reaction Buffer to the reaction mix. Finally, the reverse transcription reaction was performed in this 20 μl cocktail using the following conditions: pre-incubation at 25˚C for 10 minutes, incubation at 50˚C for 60 minutes, inactivation of Transcriptor Reverse Transcriptase at 85˚C for 5 minutes, and storage at 4˚C.

Preparation of nucleic acids from serum, NPA, and feces
An equal amount (450 μl) of the individual serum and NPA samples were placed in a pre-wet 0.45 μm Spin-X Tube Filters (Corning, Corning, NY), and centrifuged for 1 minute at 15,000 × g. The flow-through was pooled in each of the three groups (KD, diarrheal control, and respiratory infection control). The supernatants were ultra-centrifuged at 4ºC for 2 hours at 25,000 × g to pellet the virus particles. We then removed the supernatant and resuspended the pellet in 400 μl of 1× AM2238 Turbo DNase buffer (Ambion, Foster city, CA). Previously, it was reported that nuclease is effective in eliminating host nucleotides [22]. Therefore, we processed half of the resuspended pellet without nuclease, whereas the other half was processed with nuclease. We added four nucleases to the sample: 1 μl Benzonase 70664-3 (EMD, Darmstadt, Germany), 1 μl Baseline Zero DB0711K (Epicentre, Madison, WI), 1 μl AM2238 Turbo DNase (Ambion), and 1μl A7973 RNase A (Promega, Madison, WI), then incubated at 37˚C for 90 minutes. Nucleases were inactivated by addition of 2 μl of Baseline Zero DNase Stop Solution (Epicentre) and incubation at 65˚C for 10 minutes.
We used the QIAamp cador Pathogen Kit (QIAGEN, Hilden, Germany) for each set of serum and NPA samples (with and without nuclease treatment) following the manufacturers' protocol, to allow for maximal retrieval of viral DNA/RNA while minimizing the presence of bacterial/host DNA/RNA. The DNA/RNA was resuspended at the final step by adding 50 μl of Buffer AVE (QIAGEN,) to each QIAamp Mini column (QIAGEN). After centrifugation at full speed (20,000 × g), we collected the elute (50 μl).
The MoBio Powerviral Environmental RNA/DNA Kit (MoBio, Carlsbad, CA) was utilized for the individual fecal samples. Briefly, 0.25 grams of fecal material was added to bead lysis tubes and the nucleic acids were extracted following the manufacturers' protocols. Purified nucleic acids were eluted in 50 μl of RNase-free water.
We performed the reverse transcription reaction with the following reagents: SuperScript III Reverse Transcriptase ( The whole transcriptome amplification optimally amplifies viral nucleic acids [22]. Therefore, we applied the whole transcriptome amplification to the nucleic acids pooled from serum, NPA, feces, WB-DNA, and WB-cDNA. WTA ligation buffer (6 μl), WTA ligation reagent (2 μl), WTA ligation enzyme 1 (1 μl), and WTA ligation enzyme 2 (1 μl), which are all provided in the REPLI-g Cell WGA and WTA Kit (QIAGEN), and the pooled nucleic acids (10 μl) were mixed and incubated at 22˚C for 2 hours. Then, REPLI-g Midi reaction buffer (29 μl) and RFPLI-g Midi DNA polymerase (1 μl), also provided in the above kit, were added and incubated at: 30˚C for 8 hours, 95˚C for 5 minutes, then held on ice. The PCR product was cleaned using the QIAquick PCR Clean-up kit (QIAGEN) following the standard manufacturer's protocols.

Metagenomic sequencing of the pooled samples
We prepared DNA libraries for sequencing using the TruSeq PCR-free DNA Library Preparation Kit (Illumina, San Diego, CA). Quality and fragment size were assessed on the Tapestation (Agilent, Santa Clara, CA). Libraries were normalized to 2 nM, pooled, denatured, and diluted to 1.8 pM, according to the manufacturer's standard recommendations (Illumina). Sequencing was performed on the NextSeq 500 with the NextSeq Series High Output Kit version 2 (Illumina), using 150 base pair, paired-end reads.

Metagenomic sequencing of the individual samples
The WB DNAs from the 33 individual participants, without the whole transcriptome amplification, were outsourced to Hokkaido System Corporation, where TruSeq Nano DNA Library Prep kit (Illumina) was used for the metagenomic sequencing. We omitted the whole transcriptome amplification because TruSeq Nano DNA Library Prep kit includes a PCR step, and therefore allows for a lower input concentration. Quality and fragment size were assessed on BioAnalyzer 2100 (Agilent), and the concentrations were determined by quantitative PCR. The libraries were normalized to 4nM, pooled, denatured, and diluted to 14 pM, following the manufacturer's recommendation (Illumina). Sequencing was performed on the HighSeq 2500 with HiSeq Rapid PE Cluster Kit v2 and HiSeq Rapid SBS Kit v2 (Illumina), using 100 base pair, paired-end reads.

Bioinformatic analysis
The metagenomic sequencing data was analyzed using the LMAT [23][24][25]. LMAT is a metagenomic analysis pipeline to search for taxonomic identifiers associated with k-mers found in the reference genome database. The database contains 4,863 distinct bacterial species; 4,189 distinct viral species; 2,038 distinct eukaryotes; and 279 distinct archaea species. Each read was assigned a match score, which is the fraction of k-mers in the read that were matched to the corresponding reference genome divided by a randomly generated read with similar GC content and a comparable length. We discarded any read with a match score below the designated threshold that is expected by the null model. A match score threshold of 1 represents the percent identity match of double what would be observed by a random match, resulting in stringent species identification. A threshold of 0.5 is more permissive. Using 1 or 0.5 for the threshold did not qualitatively affect our results and conclusion, and we presented the results obtained by the threshold of 1.
CLC Genomics Workbench 10.1.1 (QIAGEN) was used to visualize the depth and extent of the reads mapped to the reference genome. This software was also used for variant analysis. We employed the accession numbers defined in ref. [29] as the reference sequences for TTVs.

Statistical analysis
Stata SE 13.1 (StataCorp, College Station, TX) was used for statistical analyses. Conditional regression analysis could not be used to assess the effect size (i.e., odds ratio), because the point estimate of the odds ratio reached an infinite positive value. Instead, we used exact logistic regression analysis. In estimating sample size, we assumed a standard statistical requirement for an epidemiological study (i.e. α level of 5% and β level of 90%).