A Method Enabling High-Throughput Sequencing of Human Cytomegalovirus Complete Genomes from Clinical Isolates

Human cytomegalovirus (HCMV) is a ubiquitous virus that can cause serious sequelae in immunocompromised patients and in the developing fetus. The coding capacity of the 235 kbp genome is still incompletely understood, and there is a pressing need to characterize genomic contents in clinical isolates. In this study, a procedure for the high-throughput generation of full genome consensus sequences from clinical HCMV isolates is presented. This method relies on low number passaging of clinical isolates on human fibroblasts, followed by digestion of cellular DNA and purification of viral DNA. After multiple displacement amplification, highly pure viral DNA is generated. These extracts are suitable for high-throughput next-generation sequencing and assembly of consensus sequences. Throughout a series of validation experiments, we showed that the workflow reproducibly generated consensus sequences representative for the virus population present in the original clinical material. Additionally, the performance of 454 GS FLX and/or Illumina Genome Analyzer datasets in consensus sequence deduction was evaluated. Based on assembly performance data, the Illumina Genome Analyzer was the platform of choice in the presented workflow. Analysis of the consensus sequences derived in this study confirmed the presence of gene-disrupting mutations in clinical HCMV isolates independent from in vitro passaging. These mutations were identified in genes RL5A, UL1, UL9, UL111A and UL150. In conclusion, the presented workflow provides opportunities for high-throughput characterization of complete HCMV genomes that could deliver new insights into HCMV coding capacity and genetic determinants of viral tropism and pathogenicity.


Introduction
Human cytomegalovirus (HCMV), the prototype member of the herpesvirus subfamily Betaherpesvirinae, is a ubiquitous virus with seroprevalences ranging from 45 to 100% in the adult population [1]. Primary infection or reactivation usually remains asymptomatic; however, the virus can cause serious illness in newborns and immunosuppressed individuals such as transplant recipients and AIDS patients [2]. HCMV has the largest genome of all human herpesviruses, with a size of approximately 235 kbp. The genome consists of two unique fragments, the unique long (UL) and unique short (US) regions, which are both flanked by a pair of inverted repeats, termed terminal/internal repeat long (TRL/IRL) and internal/terminal repeat short (IRS/TRS). Four genomic isomers are present in equimolar concentrations through inversion of UL and US relative to each other [3].
The first complete genome sequence of HCMV, derived from the highly passaged laboratory strain AD169, was published in 1990 with 208 open reading frames (ORFs) predicted as proteinencoding [4]. Through comparison of different laboratory strains and isolates passaged more moderately on cultured human fibroblasts, it has been well established that AD169 contains major genome rearrangements. These affect a region at the 39 end of the UL region, commonly referred to as the UL/b' region, resulting in the loss of a 15 kbp fragment which encodes 19 additional ORFs [5,6]. The HCMV genetic map was further refined by genome comparisons with chimpanzee cytomegalovirus and full genome sequencing of a handful additional clinical isolates [7][8][9][10]. The current HCMV genetic map as annotated on the HCMV reference sequence Merlin (NC_006273 [10]) contains 170 genes, some of which are only defined theoretically. In fact, recent publications defining the HCMV transcriptome have drawn a very sophisticated picture including alternative splicing and antisense transcription, which could redefine our understanding of the HCMV coding capacity [11][12][13]. The functionality of these products still awaits further confirmation. The determination of the complete genome sequence of additional, clinically representative isolates could assist in a better definition of the HCMV genetic map through comparative genomic approaches.
During the last years, next-generation sequencing (NGS) has immensely impacted the genomics field [14]. Although several complete HCMV genomes have been determined using the traditional cloning and Sanger sequencing approaches, it is still highly laborious and not suitable for high-throughput applications [4,9,10,15]. NGS technology obviates the need for cloning procedures by the generation of enormous amounts of short sequence reads starting from minimal input material. The benefits of NGS for HCMV genomics were first demonstrated through the elucidation of variants present in laboratory preparations of the AD169 and Towne strains [16]. In an attempt to evaluate the effectiveness of NGS with clinical HCMV isolates, Cunningham et al. compared a more traditional PCR-based amplification and Sanger sequencing approach with a NGS approach using the Illumina Genome Analyzer (IGA; Illumina, Inc., San Diego, USA) [17]. In addition, the 454 GS FLX (Roche Applied Science, Penzberg, Germany) platform was successfully used to determine the first complete genome sequence of an Asian HCMV isolate [18]. Cunningham et al. showed that sequencing of complete HCMV genomes directly from clinical material is achievable, but given the small fraction of viral DNA, not practically amenable to high-throughput. In order to achieve a high-throughput application with NGS technology, a protocol to amplify and isolate highly pure viral DNA is desirable.
Currently, 33 complete HCMV sequences are available in the NCBI GenBank (v196.0), including 17 derived from unpassaged or moderately passaged material (up to 10 cell culture passages). Additional sequences of clinical isolates are necessary to better apprehend the genetic diversity and coding capacity of HCMV strains. Since sequencing complete genomes of clinically representative HCMV isolates in high-throughput awaits new amplification protocols, we have developed a dedicated amplification, sequencing and analysis workflow for HCMV genome characterization. The workflow maximizes sequencing capacity through the generation of highly pure HCMV DNA (.90% viral DNA). The efficiency of using 454 GS FLX and/or IGA for HCMV full genome sequencing was compared. Using a series of validation experiments, we show that consensus sequences derived by the workflow are representative for the strain present in the original clinical isolate. The presented workflow enables high-throughput analysis of HCMV full genome sequences and could serve as an important tool in elucidating the genetic diversity of this complex herpesvirus.

Patient Samples, Viruses and Cell Culture
Seven PCR-confirmed HCMV-positive urine samples were included in the study (primers listed in Table S1). Sample BE/9/ 2010 was taken from a child with a primary infection presenting with fever. Samples BE/10/2010 i1 and BE/10/2010 i2 were collected on the same day from a congenitally infected infant that was asymptomatic at birth. Sample BE/11/2010 was obtained from a child with a primary infection with liver dysfunction. Sample BE/21/2010 was taken from a pulmonary transplant recipient who had received a transplant and seroconverted in 2007. Finally, samples BE/27/2010 i1 and BE/27/2010 i2 were collected from a patient receiving a renal transplant in 2008 and seroconverting in 2009.
Typically, 1 mL of urine was centrifuged for 10 min at 3006g and the supernatant was subsequently filtered through a 0.45 mm filter (Minisart NY25, Sartorius AG, Göttingen, Germany). A confluent monolayer of human embryonic skin-muscle fibroblast cells (E 1 SM [19]) in a 25 cm 2 flask containing 10 mL of DMEM (Life Technologies, Carlsbad, USA) supplemented with 10% fetal bovine serum (FBS, Life Technologies) was inoculated with 0.5 mL of the filtrate and incubated at 37uC in a humidified 5% CO 2 environment. Infected cells were passaged every two weeks by diluting cells 1:2 into a 75 cm 2 flask after trypsination (0.05% Trypsin-EDTA, Life Technologies).
Strain Merlin was obtained from ATCC (ATCC-VR-1590, Lot Nr. 58730771, passage 4). A confluent monolayer of E 1 SM cells in a 75 cm 2 flask containing 10 mL of DMEM was inoculated with 0.5 mL of the virus stock and the cells were incubated at 37uC and 5% CO 2 . After 1 h, the medium was removed and the cells were washed with 1X PBS (Life Technologies) before adding DMEM with 10% FBS.

Viral DNA Purification and Multiple Displacement Amplification
Since clinical isolates do not produce large amounts of cell-free virus, a procedure was needed to purify intracellular, viral DNA from large backgrounds of cellular DNA. We therefore adapted a protocol described by Sinzger et al. [20]. Briefly, cells from three 75 cm 2 flasks were trypsinized and pooled. After lysis in a Tris buffer containing sucrose and Triton X-100, cellular DNA was digested using micrococcal nuclease (Thermo Fisher Scientific, Waltham, USA). Subsequently, DNA was extracted using the QIAamp DNA Blood Mini Kit. Extracted DNA was amplified by multiple displacement amplification using the REPLI-g Mini Kit (Qiagen, Hilden, Germany). For each sample, three independent REPLI-g reactions were pooled. A mixture of 150 mL of REPLI-g products, 300 mL of pure ethanol and 15 mL of 3 M sodium acetate was incubated at 280uC for 2 h. The samples were centrifuged for 30 min at 20,0006g (4uC), the supernatant was removed and the pellets were washed with 70% ethanol. Afterwards, the samples were centrifuged again for 30 min at 20,0006g (4uC) and the supernatant was removed. The pellets were air dried and resuspended in 50 mL of QIAamp Elution Buffer (Qiagen).
For the purification of an unpassaged isolate, 200 mL of sample BE/21/2010 was centrifuged for 10 min at 3006g and the supernatant was centrifuged for 2 h at 100,0006g (4uC) in a type 35 rotor (Beckman Coulter Inc., Brea, USA). The pellet was resuspended in 200 mL of 1X PBS and DNA was extracted using the QIAamp DNA Blood Mini Kit (Qiagen). Extracted DNA was amplified through whole genome amplification as described above.
Quantitation of Viral and Cellular DNA Viral and cellular DNA contents were evaluated using a quantitative PCR assay (qPCR). HCMV DNA was quantitated through amplification of a fragment of the conserved major capsid protein-encoding gene UL86. For human DNA, a region of the bglobin household gene was amplified. Primers and probes were obtained from Eurogentec (Liège, Belgium); the sequences are listed in Table S1. The qPCR was carried out using TaqMan Universal PCR Master Mix (Life Technologies) on an Applied Biosystems 7500 Fast Real-Time PCR system (Life Technologies), following the manufacturer's protocols. Both standards and samples were quantitated in duplicate, viral and cellular DNA was quantitated in separate wells.
For absolute quantitation, standard series were produced by serial dilution of HCMV UL86 and human b-globin standards. The standards were prepared through PCR amplification of the qPCR targets and products were gel purified using the QIAquick Gel Extraction Kit (Qiagen). After spectrophotometrical quantitation with a BioPhotometer (Eppendorf, Hamburg, Germany), DNA concentrations were converted to copy number/mL using the formula described by Fronhoffs et al. [21]. Viral and cellular DNA copy numbers were converted to absolute weight (mg of DNA) for mutual comparison.

Next-generation Sequencing
For 454 GS FLX sequencing, total DNA was fragmented to an average length of 400 bp using a Covaris E210 system (Covaris). DNA fragments were end-repaired, 39-adenylated, ligated to adapters (GS FLX Titanium Rapid Library MID Adaptors kit, Roche) and size-selected (.350 bp) using the SPRIworks Fragment Library System II (Beckman Coulter Genomics). The quality of the library was evaluated using a high-sensitivity DNA chip on a model 2100 Bioanalyzer (Agilent Technologies). Libraries were quantitated with the TBS-380 Mini-Fluorometer (Promega) and subsequently pooled at equimolar concentrations. Prior to sequencing, clonal amplification was performed during an emulsion based PCR (GS FLX Titanium emPCR Kit, Roche). Sequencing was performed using the GS FLX Titanium Sequencing Kit (Roche). Following sequencing, processing of the raw sequence data was performed with the Roche Sequencing System Software package.
For Illumina Genome Analyzer (IGA) sequencing, total DNA was fragmented to an average length of 200 bp using a Covaris E210 system (Covaris). The ends of the fragmented DNA were repaired, adenylated and Illumina compatible adaptors (Index PE Adaptor Oligo Mix, Illumina) were ligated using the SPRIworks Fragment Library System I (Beckman Coulter Genomics). Fragments were indexed using the Multiplexing Sample Preparation Oligonucleotide Kit (Illumina) and the library was enriched during 12 PCR cycles. Enriched fragments were visualized on a Bioanalyzer (Agilent Technologies) for quality control and quantitation. Finally, samples were pooled at equimolar ratios and put on the Illumina cluster station for cluster generation using the TruSeq PE Cluster Kit v2 (Illumina). One hundred and nine cycles of multiplexed paired-end sequencing were performed using the TruSeq SBS Kit v5 (Illumina). Following sequencing on the GAIIx, processing of the raw sequence data was performed with the Illumina analysis software (Casava 1.7.0).

Assembly of Consensus Sequences and Finishing with Sanger Sequencing
IGA and 454 GS FLX datasets were first subjected to a quality control step using QUASR v6.08 (http://sourceforge.net/projects/ quasr/). Low-quality bases were trimmed from the 39 end of reads until the median quality of the reads was higher than 20. Reads smaller than 20 bp were removed. A de novo assembly was constructed for 454 GS FLX reads using MIRA v3.4.1.1 [22,23] with assembly settings on 'accurate' and for IGA reads with Velvet v1.2.07 [24]. The hash value, expected coverage and coverage cutoff parameters needed for Velvet assemblies were first estimated using VelvetOptimiser (Perl script, http://bioinformatics.net.au/ software.velvetoptimiser.shtml) and then manually adjusted to produce longer contigs (Table S2). The resulting contigs were assembled using Phrap v1.090518 [25] and Phrap contigs longer than 1,000 bp were included in a NCBI nucleotide BLAST search to find a suitable HCMV reference sequence. Next, all Phrap contigs were aligned to the selected reference sequence using NUCmer included in the MUMmer 3 package [26]. For this alignment, reference sequences were trimmed of its terminal repeats, except for the 50 bp region adjacent to UL and US regions, as described [17]. This alignment was used to build a hybrid reference combining Phrap contigs and pieces of the original reference sequence in regions where no contigs had mapped. Finally, a reference assembly was constructed using the 454 GS FLX and IGA reads with MIRA and a consensus sequence of the strain was generated. This assembly was outputted in ACE format and visualized using Tablet v1.12.12.05 [27,28]. The complete consensus sequence was manually inspected and any misassembly corrected.
Gaps remaining in the consensus sequences after assembly of NGS data were resolved through PCR amplification and Sanger sequencing. PCR reactions were carried out with HotStarTaq DNA Polymerase (Qiagen) using standard manufacturer's protocols. Primer sequences and annealing temperatures are presented in Table S3. Products were cleaned up before sequencing with ExoSAP-IT (Affymetrix, Santa Clara, USA). Sequencing reactions were performed in both directions using the BigDye Terminator v3.1 Cycle Sequencing Kit (Life Technologies) and sequencing products were analyzed in an ABI PRISM 3100 Genetic Analyzer. Chromatograms were interpreted and contigs were joined with the complete genome consensus using Lasergene SeqMan v7.0.0 (DNASTAR, Madison, USA).
To assess the content of the original sample preparations, a de novo assembly was built with the CLC Genomics Workbench using only 454 GS FLX and IGA reads that were not mapped onto the final HCMV reference sequence. The resulting contigs were analyzed with the blastn command of the BLAST+ application [31] using the complete nucleotide (nt) database. Output of contig database searches was interpreted with MEGAN4 [32].

Evaluation of Sequencing Technologies and Assembly Software
To evaluate the performance of assemblies using only 454 GS FLX, IGA or both read sets, de novo assemblies were constructed and the resulting contigs were aligned with NUCmer against the final genome sequence of the appropriate strain to evaluate genome coverage. The freeware software suites (MIRA, Velvet and Phrap) that were used to build the initial consensus sequences were compared to a commercial alternative (CLC Genomics Workbench). Statistical analyses were performed with SPSS Statistics v21 (IBM, Armonk, USA). Overall differences of n50 contig length and number of gaps left in the assembly were tested using the Friedman Test and individual groups were compared using the Wilcoxon Signed Ranks Test with Bonferroni correction.

Development of a Sample Preparation Protocol Generating Highly Pure HCMV DNA
For the development of a method to characterize HCMV genomes in high-throughput, a procedure was needed to amplify the viral material in clinical samples. To maximize sequencing capacity, extract purity had to be optimized. PCR-based amplification approaches that use a set of conserved primers covering the complete HCMV genome have been applied [17,33].
However, the labor-intensity of these methods compromises a high-throughput perspective. Therefore, we have amplified viral material through passaging isolates on E 1 SM cells, a human fibroblast cell line ( Figure 1). The number of passages on fibroblasts in this amplification was limited to avoid potential genetic adaptation of HCMV to growth on fibroblasts [34]. This implied that virus production would be low and predominantly cell-associated. Amplification by passaging HCMV clinical isolates on fibroblasts had already been used as a preparative step for NGS analysis, but usually DNA was isolated through a whole cell extraction [17,18]. The extracted viral DNA is usually heavily contaminated with cellular DNA, which impacts on the efficiency of the sequencing process. We chose to implement a technique to specifically purify cell-associated viral DNA [20]. This method is based on lysis of the cellular membranes and nuclease-based cleavage of cellular DNA, followed by extraction of viral DNA from nucleocapsids. Isolates were harvested at passage 2 or 4, when the first foci of cytopathogenic effect became visible. To assess viral yield and purity, we developed a quantitative PCR to evaluate the amounts of viral and cellular DNA present in the isolates. After virus isolation and DNA extraction, viral DNA yield and especially purity were considered unsatisfactory for NGS, since most samples (11/14) contained less than 500 ng of HCMV DNA and the majority of the DNA detected was of cellular origin ( Figure 2, pre-MDA). To further amplify viral DNA, samples were subjected to multiple displacement amplification (MDA). MDA makes use of the high processivity, strand displacement and proofreading capacity of the W29 DNA polymerase to amplify DNA using random primers. This method can amplify nanograms of DNA to micrograms and generates long contiguous strands with very low mutation rates (10 26 ). Amplification biases have been reported, but tend to be more problematic when starting from very low amounts of input material such as in single-cell sequencing [35]. MDA affected both yield and purity of viral DNA, with amounts of viral DNA mostly above 5 mg (11/14) and purities largely higher than 90% (11/14) (Figure 2, post-MDA). Only for the unpassaged isolate of strain BE/21/2010, viral DNA yield remained low with 600 ng of HCMV DNA, but with an estimated purity of 85%. The relative increase in viral DNA contents after MDA could possibly be attributed to the differential quality of viral and cellular DNA after nuclease treatment. While viral DNA was protected from nuclease activity by the viral capsid, cellular DNA was presumably heavily degraded. Although cellular DNA is still detected by the qPCR assay, which only amplifies a 167 bp fragment, we hypothesize that it is amplified much less efficiently by MDA than the intact viral DNA [36]. Because of the observed increase in viral DNA yield and purity, genomic contents of extracts could be characterized more efficiently, supporting highthroughput applications.

Sequencing and Assembly of HCMV Genomes Using 454 GS FLX and/or IGA Data
Purified HCMV DNA was analyzed using both 454 GS FLX and IGA to compare the performance of both systems in generating consensus sequences of complete genomes. Although several complete genome sequences of HCMV strains are available on NCBI GenBank, substantial regions of the genome are highly variable, which makes a mapping assembly unsuitable for analysis of distinct HCMV strains. Mapping assemblies left large areas of the genome uncovered. A large fraction of the unmapped reads, however, were found to be genuine HCMV reads with BLAST (data not shown). Therefore, a de novo assembly approach was chosen, followed by scaffolding of de novo contigs on HCMV reference sequences (Figure 1). A similar assembly approach using different software suites was already successfully implemented for HCMV [17]. Briefly, de novo contigs were mapped on HCMV reference genomes and a hybrid reference sequence was constructed combining contigs and pieces of the original reference sequence in regions with no contig coverage. Subsequently, the consensus sequence was derived from a mapping assembly of all sequencing reads against this hybrid reference.
The final consensus sequences were used to construct a reference assembly using 454 GS FLX and IGA datasets. The percentage of reads mapped to the HCMV consensus sequence was generally in accordance with the sample purity predicted by the qPCR assay (Table 1). Since qPCR assays only quantified cellular DNA as a possible contaminant, this measure could overestimate sample purity, but there was only a small difference between qPCR and read mapping purity estimates for most samples (9/14,5%, 11/14,10%). Only strains BE/21/2010 UP and BE/27/2010-1 showed a large discrepancy (.20%) between the purity estimates, with the actual amount of reads that mapped to the HCMV consensus much lower than expected by qPCR. This discrepancy could be explained by the fact that qPCR assays only detect one segment of viral and cellular DNA, while the sequencing data reflect total DNA levels. To identify additional contaminating DNA present in the isolates, de novo assemblies were performed using 454 GS FLX and IGA reads that did not map to the HCMV consensus sequence. These contigs were analyzed using BLAST (Table S4). For strain BE/27/2010-1, only the presence of human DNA could account for the discrepancy between qPCR and read mapping results. The unmapped reads of BE/21/2010 UP largely consisted of human DNA and some bacterial and papillomaviral sequences. With only 12% of NGS reads being HCMV-specific for BE/21/2010 UP, we essentially encountered the same limitations as Cunningham and colleagues [17] for sequencing of unamplified clinical material and confirm that this is currently not amenable to high-throughput applications, even after MDA. This result indicates that an amplification and/or enrichment procedure for viral DNA is crucial to efficiently utilize NGS high-throughput capacities, which is provided through our cell culture extraction and MDA workflow. For three other samples, a small number of HCMV sequences were detected that did not map to the consensus sequence during the reference assembly (Table S4). Nevertheless, these contigs, mostly smaller than 1,000 bp, could be aligned to the consensus using the NUCmer algorithm with similarities close or equal to 100% (data not shown).
Since both immunocompetent and immunocompromised patients can be co-infected by and shed multiple HCMV strains, the derived consensus sequences do not necessarily represent a single, contiguous genome but a collection of the most abundant variants at each position in the genome [33,37,38]. However, inspection of our assemblies always showed the predominance of a single variant throughout the entire genome, without any clear evidence of multiple infections, suggesting that these particular consensus sequences do represent contiguous strain sequences (data not shown).
To compare the utility of 454 GS FLX and IGA datasets in the characterization of HCMV genomes, de novo assemblies were constructed using only 454 GS FLX or IGA data and a combination of both. A commercial package, the CLC Genomics Workbench, was compared to MIRA and Velvet, which are freely available. MIRA was used for assembly of 454 GS FLX data, while IGA data were assembled with Velvet. Velvet uses a de Bruijn graph strategy which is better suited for large datasets than the overlaplayout-consensus strategy that MIRA utilizes [39]. Both datasets were combined through a Phrap assembly of combined contigs. The performance of de novo assemblies was compared by mapping the resulting contigs on the appropriate consensus sequence that was derived earlier. A complete overview of results for each dataset is presented in Table S2. Here, we present the range of n50 contig lengths (Figure 3a) and number of gaps left when contigs were mapped to the consensus sequence (Figure 3b). The n50 contig length states that 50% of the entire assembly is comprised in contigs equal to or larger than this length. These data clearly illustrate that IGA datasets produce assemblies that are comparable to those using both datasets combined. Both n50 contig length and number of gaps do not significantly differ between both cases (Wilcoxon Signed Ranks Test; n50 contig length: p = 0.123; gaps: p = 0.055). The n50 contig length is drastically lowered by using only 454 GS FLX data, which consequently increases the number of gaps left after the initial assembly (Wilcoxon Signed Ranks Test; n50 contig length: p,0.001; gaps: p,0.001). Our results show that IGA datasets outperform 454 GS FLX datasets. IGA sequencing has a higher throughput and lower cost per base and therefore achieves much higher coverages than 454 GS FLX sequencing in this study ( Table 1). The benefits of this higher coverage clearly outweigh the longer length of 454 GS FLX reads for de novo HCMV genome assembly. In fact, the combined use of 454 GS FLX and IGA datasets does not significantly alter de novo contig length. Taking into account the higher error rates of 454 GS FLX sequencing in homopolymeric stretches [40], IGA would be the preferred platform of both for high-throughput sequencing of HCMV isolates.
Based on n50 contig lengths, commercially and freely available software packages delivered no significantly different assemblies (Wilcoxon Signed Ranks Test; p = 0.933). There was, however, a small but significant difference in the number of gaps left in the assembly, with the freeware assemblies containing less gaps (Wilcoxon Signed Ranks Test; p = 0.031). When IGA data were involved, the assemblies produced by the CLC Genomics Workbench showed a smaller range in n50 contig lengths than Velvet assemblies. Assembly of IGA data using the CLC Genomics Workbench is in fact more user-friendly than Velvet assemblies that have to be optimized manually by adjusting several parameters. This optimization step makes these assemblies less reproducible.   Recently, novel freeware de novo assembly algorithms have been released that show improved performance and could be better alternatives to the commercial assembly options than Velvet [41][42][43][44].

Consensus Sequences are Representative for the HCMV Population Present in the Original Clinical Isolate
Four different approaches were combined to validate the consensus sequences that were generated using our preparation, sequencing and assembly pipeline. (1) Reference strain Merlin was resequenced and (2) consensus sequences of independent isolates of the same patient (BE/10/2010 and BE/27/2010) were compared.
(3) Strain BE/21/2010 was sequenced both directly from clinical material and after cell culture passage to evaluate how the consensus sequence was altered during cell culture adaptation. (4) Finally, strains BE/9/2010 and BE/11/2010 were sampled at different culture passages (2-11 passages) to characterize potential changes in the consensus sequence during further adaptation to cell culture.
(1) To validate our workflow, the HCMV reference strain Merlin was grown for one additional passage and harvested using the aforementioned protocol. The consensus sequence was generated using a de novo approach and the original reference sequence was only used to guide assembly of de novo contigs. The generated consensus sequence was aligned to the original reference [10]. Only two SNPs were detected between both sequences. The first SNP was situated in gene UL32, encoding the major tegument protein pp150, resulting in a silent CTC to CTG substitution at amino acid position 1,038. When the read alignment of the assembly was inspected, this mutation was observed in 65% of reads, with the other 35% still displaying the wild-type G. Another SNP, a G to C substitution, was initially noted in the IRL at nucleotide position 195,063. However, when variants that were segregated between IRL and TRL copies were added up, it was noted that only 24% of reads contained this substitution. Interestingly, these two substitutions were also noted when Merlin was cloned into a BAC and resequenced by Stanton and colleagues [45]. They reported the substitution in UL32 to be a single nucleotide polymorphism in the original Merlin population. The fact that these SNPs were also found using our workflow confirms that these were present in the original viral population.
(2) To assess the reproducibility of our consensus-generating pipeline, we independently passaged twice two samples taken from the same patient on the same day (BE/10/2010 i1 and BE/10/ 2010 i2) and subsequently purified, sequenced and assembled the genomes. After analysis, nearly identical consensus sequences were obtained with only a minor length difference in three homopolymer regions (Table S5). Likewise, strains BE/27/2010 i1 and BE/27/2010 i2 were derived from sequential isolates of the same patient, derived with an interval of 49 days. Both samples were independently passaged four times in E 1 SM cells and processed in our workflow after which consensus sequences were compared. Sequences only differed in the length of one homopolymer region (Table S5). All apprehensive homopolymer regions were situated in non-coding regions. These findings show that the generated consensus sequences are reproducible and furthermore indicate that the consensus sequence of strain BE/27/2010 remained stable during 49 days of intrahost viral replication.
(3) Strain BE/21/2010 was isolated and sequenced directly from HCMV-positive urine and simultaneously passaged four times in E 1 SM cells to characterize potential changes in the consensus sequence during initial adaptation to cell culture. A substitution was detected in gene UL30 (A13G) in 45% of reads derived from the passaged isolate. Differences between both consensus sequences were only situated in the length of four homopolymer regions and one trinucleotide repeat (Table 2). These regions also display variable lengths in different HCMV strains. Furthermore, a closer inspection of the assembly in these regions revealed some repeat length heterogeneity in NGS reads. This could both reflect technological constraints in the prediction of homopolymer lengths and intrapatient variability in repeat lengths. Given the fact that these repeats are mostly situated in non-coding sequences and these regions are inherently of variable length in different isolates, it seems perfectly conceivable that intrapatient heterogeneity exists. In fact, the length difference in the trinucleotide repeat cannot be explained by homopolymer error and thus probably reflects intrapatient heterogeneity.
(4) To characterize potential changes during further adaptation of HCMV to fibroblast replication, strains BE/9/2010 and BE/ 11/2010 were sampled and sequenced at different culture passages. Strain BE/9/2010 was sequenced after passage 2, 5, 7 and 11. Consensus sequences were derived independently. Consensus sequences for passages 2, 5 and 7 were identical, whereas passage 11 contained one substitution in gene UL44 (A128V), which encodes the DNA polymerase processivity subunit [46]. Analysis of the read alignments indicate that this mutation had arisen somewhere between passage 2 and 5 and was gradually becoming the dominant type at this position. At passage 2, all reads displayed the wild-type while at passage 5 the mutation was present in 3% of the reads. At passage 7 and 11, this fraction had risen to 33% and 77% respectively. This variability of UL44 has been shown before by Dargan et al., albeit at different positions and at much higher passage numbers [34]. Subsequently, strain BE/11/2010 was sequenced after passage 2, 5 and 9. All derived consensus sequences were identical. To summarize, both strain's sequences analyzed with the presented workflow were perfectly matching (up to passage 11), indicating that potential mutations during this period would not have a considerable impact on the overall consensus sequence.
It has been shown that gene RL13 and one of the genes of the UL128 locus (UL128, UL130 and UL131A; together UL128L) consistently mutate during passaging because of their inhibitory effect on HCMV replication in fibroblasts [34,45]. Interestingly, none of the five strains that were sequenced in our study showed obvious gene-disrupting mutations in UL128L or RL13. This would indicate that the strains had not yet undergone these hallmark mutations that accompany the initial adaptation to growth in human fibroblasts and could therefore be considered genetically unaltered by cell culture. It cannot be excluded however, that some of these strains do contain mutations in UL128L or RL13. Mutations could be present at different positions in different members of the viral population, which would result in a wild-type consensus sequence, as was the case for RL13 in Merlin [45].
Taken together, these validation experiments indicate that the presented workflow had only a minimal impact on consensus sequences of the clinical isolates under study. Most of the differences detected between independent replicates could most likely be attributed to heterogeneity of repeat lengths in the original clinical isolates. The stability of sequences throughout these procedures shows that they are characteristic for the original strains present in clinical isolates.

Genome Sequences Confirm Presence of Genedisrupting Mutations in Clinical HCMV Isolates
Adaptation of HCMV strains to cell culture is accompanied by changes in the HCMV genome, including gene-disrupting mutations [6,34]. More recently, evidence indicated that HCMV mutants could be present in unpassaged clinical isolates as well [17,34]. Strain JP was sequenced without in vitro amplification and was mutated in genes RL5A and UL111A [17]. We analyzed strain BE/21/2010 directly from clinical material and identified disruptive mutations in RL5A, UL9 and UL150. Furthermore, we examined ORFs currently annotated on the HCMV reference strain Merlin for the presence of gene-disrupting mutations in the other four strains under study and found that genes RL5A, UL1, UL9 and UL111A could contain disruptive mutations (Table 3). Mutations in RL5A, UL1, UL9 and UL111A have been identified in earlier publications [17,18,34]. The transgenic strain CINCY+ Towne (NCBI GenBank acc. no. GU980198) has a frameshiftinducing deletion in UL150, but this strain was passaged several times in human fibroblasts. To our knowledge this is the first report about a gene-disrupting mutation in UL150 present in an uncultured viral isolate. To rule out the possibility that mutations in strains BE/10/2010, BE/11/2010 and BE/27/2010 were acquired during passaging, viral genes of interest were PCR amplified and sequenced from the original clinical material (Table  S6). All verified gene sequences from the clinical material corresponded to the sequences generated with NGS from the passaged material. Furthermore, identical mutations were present in distinct strains ( Table 3). The fact that these mutations are conserved between independent and even geographically unrelated isolates provides a further indication of their widespread occurrence in clinical HCMV isolates.
HCMV gene family RL11 stands out in particular with several members (RL5A, RL6, UL1 and UL9) being suggested here and/ or elsewhere to be mutated in vivo [17,34,47]. Most of these genes are hypervariable and their gene products are poorly characterized. UL1 encodes an envelope glycoprotein that was suggested to be a cell-type specific tropism factor [48], but for RL5A, RL6 and UL9 no functionality data are available. The same holds for gene UL150. Most interestingly, gene UL111A is mutated in strain BE/ 27/2010, the previously sequenced strains JP and PH and four isolates from renal transplant recipients [17,49]. Strains BE/27/ 2010 and JP have deletions of 220 bp and 38 bp, which interfere with splicing of the second and first exon respectively. Strain PH has a substitution in the splice-acceptor site for the second exon. In the renal transplant recipients, three isolates (NCBI GenBank acc. no. EF488364-6) share a 5 bp deletion in the first exon, while a fourth isolate (EF535834) has a nonsense mutation in the first exon. UL111A encodes a viral interleukin-10 homolog, which has been shown to be involved in immune regulation, both during lytic and latent replication [50]. The observed existence of UL111A mutants in natural settings may have clinical significance, although more research is warranted to characterize the occurrence of mutations in different patient groups, both immunocompetent and immunocompromised. Interestingly, UL111A mutants have only been described in transplant recipients (BE/27/2010, PH and renal transplant isolates) or AIDS patients (JP), suggesting that the presence of these UL111A mutants could be associated with a defective immune system.
Our data indicate that the HCMV coding capacity is not fixed but can vary between different isolates. Additional full genome sequences from diverse patient groups and geographical areas are needed to characterize in further detail what ORFs can be mutated in clinical isolates, at what frequencies and in what patient groups.

Conclusion
The introduction of a new generation of sequencing technologies with high-throughput capacities has immensely impacted the field of genomics. Previous publications have provided a snapshot of the possible applications in the field of HCMV genomics and transcriptomics [12,13,[16][17][18]33,38]. We believe that the amplification, sequencing and analysis workflow that we present in this study can help to maximize the efficiency of sequencing HCMV strains in high-throughput. Given the large genetic background of HCMV, it could be interesting to routinely elucidate the complete sequence of strains that are used in mutational studies. This should no longer be considered as extremely laborious or costly. Additionally, the analysis of clinical HCMV isolates could assist in the refinement of the HCMV genetic map. It will provide a better knowledge of viral mutants and in which patient populations they are circulating. Finally, it could prove to be of value in the ongoing quest for genetic determinants of viral pathogenicity that has eluded scientists for more than a decade [37,51,52].

Supporting Information
Table S1 Primers and probes for HCMV UL86 and human bglobin qPCR. (DOCX)