Fig 1.
Coverage of capture sequencing reads to the MCPyV genome.
Shown are log-scale coverage plots of all viral reads aligned to the MCPyV genome (JN707599). The viral genome structure is illustrated in the lower panel. Red lines mark positions of LT truncations. Coverage plots shown in green represent samples with point mutations that introduce a stop codon and thus a prematurely truncated LT protein, whereas coverage plots in blue show samples in which deletions or inversions cause frameshifts and subsequent premature stop codons. Dashed black lines indicate breakpoints of the MCPyV genome into the host genome. In UKE-MCC-4a, four breakpoints into Chr20 were detected. In UM-MCC-52, dashed lines mark breakpoints into Chr5 (blue) and Chr4 (black).
Table 1.
Integration sites of MCPyV in MCC samples.
Fig 2.
MCPyV integration sites detected by capture sequencing.
(A): MCPyV integration sites in the human chromosomes. Depicted in blue is the distance between breakpoints on the host genome. Characteristics of the host genome at the breakpoints are indicated in brackets. (B) and (C): Schematic representation of the two characteristic groups of coverage profiles obtained by mapping virus-host fusion reads to the human genome. A schematic of virus-host fusion reads is depicted above the coverage patterns; red arrows indicate the direction of the MCPyV sequence in the fusion reads. (B) represents the first group, characterised by short distances (4–18bp) between breakpoints on the host genome and inward-facing orientation of the fused viral sequences (upper panel). The middle panel shows coverage tracks from the cell lines MKL-1 and BroLi as examples. The bottom panel depicts a schematic model of the linear integration pattern deduced from the coverage profiles presented above. (C) represents the second group, in which host sequences in fusion reads map at large distances (17kbp to 300kbp) on the host genome and viral sequences show an outward-facing orientation. WaGa and MKL-2 coverage tracks are shown as examples with a schematic model of the integration pattern deduced from the coverage profiles above. Large host regions preceding the left virus-host junction are duplicated after the right virus-host breakpoint, leading to a “Z” shape of the integration. Coverage tracks from all additional samples are provided in S4 Fig.
Fig 3.
Viral copy number calculation in WaGa and MKL-1 cells.
(A): Circos plot of copy number variations in WaGa and MKL-1 cells as calculated by FREEC using low coverage WGS data (ChIP-Seq input). The colour code indicates chromosome aberrations in fold haploid (black = 2n; green = 1n; red ≥ 3n; white = 0n). Female HDF cells are shown as control with n = 2. The positions of MCPyV integrations are shown in the innermost circle (black: MKL-1; red: WaGa). (B): Normalized relative genomic DNA copy numbers immediately upstream (integration -60kbp) and downstream (+60kbp) of the respective MCPyV integration sites are shown in comparison to three indicated genomic control sites of the same length (Chr3, 4 and 5). Additionally, the 60kbp host duplication of WaGa cells is shown. Normalized data are presented as box and whisker plots of 5kbp shifting windows (shift size = 2.5kbp) across the respective region of interest with median (horizontal line) and average (indicated by “+”). (C): Concatemeric copy numbers within each integration site in WaGa and MKL-1 were calculated from ChIP-Seq input data as described in the materials and methods section. Normalized data are shown as box and whisker plots of 1kbp shifting windows (shift size = 0.5kbp) across the MCPyV reference genome (JN707599).
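The shifting-window scheme used in (B) can be sketched as follows. This is a minimal illustration only: the function names, the toy depth values and the choice to normalise by the median window mean of a control region are assumptions made for the example, not the procedure given in the materials and methods section.

```python
from statistics import mean, median

def window_starts(region_length, window, step):
    """Start offsets of every full window that fits inside the region."""
    return list(range(0, region_length - window + 1, step))

def windowed_means(depth, window, step):
    """Mean depth per shifting window; `depth` is a per-base coverage list."""
    return [mean(depth[s:s + window])
            for s in window_starts(len(depth), window, step)]

def normalise(region_means, control_means):
    """Assumed normalisation: divide by the median window mean of a control."""
    reference = median(control_means)
    return [m / reference for m in region_means]

# Toy per-base depths (hypothetical values, not measured data):
# a flat control region and a region of interest at 1.5-fold depth.
control = [20] * 60_000
region = [30] * 60_000

control_windows = windowed_means(control, 5_000, 2_500)  # 5kbp windows, 2.5kbp shift
region_windows = windowed_means(region, 5_000, 2_500)
ratios = normalise(region_windows, control_windows)
```

With a 60kbp region, 5kbp windows and a 2.5kbp shift, each box in such a plot would summarise 23 overlapping windows.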
Fig 4.
Nanochannel and nanopore sequencing determine viral integration patterns and copy numbers.
(A): Optical signature (“barcode”) of a DNA fragment from MKL-1 cells. Shown is the time-dependent intensity of the photoluminescence (PL intensity) of a single DNA fragment (1, blue), with an additional ATTO647N fluorescence peak (2, red). The fragment has a length of ~90kbp, calculated after calibration with λ-DNA (48kbp) as a standard. The peak of ATTO647N fluorescence has a length of ~17kbp, corresponding to three integrated MCPyV copies (two complete copies, 5.4kbp each, and one partial copy of 4.1kbp in length). (B): Reads from nanopore sequencing for MKL-1 (upper panel) and WaGa (lower panel) mapped to the integration site of each cell line with an overview of the genomic locus in the reference genome (bottom), the integration locus as observed in the cell line (middle) and a close-up on the integrated viral genome (top). For MKL-1, one read (104kbp in size) and three shorter reads cover the integration site. The long read confirms the linear integration of three concatemeric MCPyV copies (two full and one partial). For WaGa, one 62kbp read covers the integration site. The read confirms the integration of two concatemeric MCPyV copies (one full and one partial) and the Z-pattern integration with duplication of the host sequence at the integration site. L and R indicate the left and right virus-host junction while (L) and (R) mark the position of the left and right junction sites in the host reference genome according to Table 1.
Fig 5.
Complex integration pattern of UM-MCC-52.
(A)+(B): MCPyV-host fusion reads from capture sequencing of sample UM-MCC-52 were mapped to the human genome. Shown is the coverage at the breakpoints in the host genome on Chr4 (A) and Chr5 (B). Red arrows indicate the direction of the viral sequences in the virus-host fusion reads. (RC) = Reverse complement orientation of the MCPyV genome compared to the other junctions. Deduced integration patterns are shown below with a Z-pattern containing amplification of 17kbp host DNA in Chr4. Based on the read directions, the integration into Chr5 must contain further inversions in addition to a Z-pattern. As there is no indication of an inversion in the MCPyV genome, parts of the host DNA at the right junction (R) must be inverted. (C)+(D): Reads from nanopore sequencing of UM-MCC-52 are mapped to both integration sites (Chr4, (C) and Chr5, (D)). In Chr4, 0.52 MCPyV copies with three specific SNPs (bp 1,708; 1,792; 1,816; not present at the Chr5 integration) are integrated as a Z-pattern with duplication of 17kbp host DNA. In Chr5, MCPyV is integrated as a concatemer of at least 3.9 copies. MinION reads confirm a Z-pattern integration with an insertion of 5.7kbp inverted duplicated host sequence at the right side that originates from 38kbp upstream of the 135kbp host sequence that is duplicated afterwards. Dashed coloured arrows indicate the complex structure of the integration locus. Duplicated host transcripts are shown in grey. L and R indicate the left and right virus-host junction while (L) and (R) mark the position of the left and right junction sites in the host reference genome according to Table 1.
Fig 6.
Complex integration pattern of UKE-MCC-4a.
(A): MCPyV-host fusion reads from capture sequencing of sample UKE-MCC-4a were mapped to the human genome. Shown is the coverage at the four breakpoints in the host genome (R I, L II, R II and L I). Red arrows indicate the direction of the viral sequences in the virus-host fusion reads. 81 reads at junction R II are mapped by BLAST only (not by aligner). MCPyV reads that are fused to the host sequences in reverse complementary (RC) orientation (compared to the other breakpoints) are identified at L II and R II. (B): MinION reads >40kbp aligning to the integration site with an overview of the genomic locus in the reference genome (bottom), the integration locus as observed in UKE-MCC-4a (middle) and a close-up on the integrated viral genome at both integration sites (site I and site II) as confirmed by MinION reads (top). Site I shows a Z-pattern integration (amplification of 120kbp host DNA between R I and L I) of 1.5 concatemeric copies of MCPyV harboring a deletion of 996bp only in the first of the two consecutive MCPyV copies. Site II shows a linear integration of 0.75 copies of MCPyV (without the deletion) with a loss of 34kbp host DNA between L II and R II. The patterned read confirms the insertion of site II in the duplicated host DNA between R I and L I as well as a second insertion of site I (I’) with duplicated host DNA after the first Z-loop. The dark blue MinION reads confirm the order I–II–I’ since they continue from site I and site I’ into the host genome over the host positions of L II and R II of integration site II. The amplification unit is I–II (approximately 10–20 repeated units, see C and calculation in D). Dashed colored arrows highlight the structure of the complex integration product. Duplicated host features are shown in grey. L and R indicate the left and right sites of the virus-host junctions I and II while (L) and (R) mark the position of the left and right junction sites I and II in the host reference genome according to Table 1.
(C): Coverage of MinION reads (with a size >3kbp) indicates amplification of the entire integration region. (D): Copy number calculation from MinION reads >3kbp in the integration region relative to multiple random regions on the indicated host chromosomes. Assuming a chromosome number of n = 2 (most likely 3 for chr20), there may be either 10 large locus amplification units on both chromosomes of chr20 or 20 copies on only one chromosome of chr20.
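The arithmetic behind the two scenarios in (D) can be illustrated with a short sketch. The function names and the depth values are hypothetical and chosen only so that the coverage ratio equals 10; taking the median of the control regions as the baseline is likewise an assumption for the example.

```python
from statistics import median

def total_locus_copies(region_depth, control_depths, ploidy=2):
    """Total copies per cell: coverage ratio to control regions times ploidy."""
    ratio = region_depth / median(control_depths)
    return ratio * ploidy

def copies_if_evenly_split(total_copies, ploidy=2):
    """Copies per chromosome if every homologue carries the same number."""
    return total_copies / ploidy

# Hypothetical read depths (not measured data): a tenfold coverage excess.
control_depths = [4.0, 5.0, 6.0]
region_depth = 50.0

total = total_locus_copies(region_depth, control_depths, ploidy=2)
per_chromosome = copies_if_evenly_split(total, ploidy=2)
```

Under these assumed numbers the total is 20 units, which corresponds either to 10 units on each of two chromosomes or to roughly 20 units on a single chromosome. Setting `ploidy=3` rescales the total accordingly.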
Fig 7.
Microhomologies between virus and host sequences.
(A): Virus-host junctions of the LoKe cell line. Sequences at the virus-host junction (in grey) were derived from capture sequencing and aligned to reference sequences for the human genome (hg38) and MCPyV (JN707599). Depicted are 40bp upstream and downstream from the virus-host junction (indicated by a black line, extended for 3bp at the right junction due to an insertion). Human sequences are shown in blue and viral sequences in black letters. Microhomologies are illustrated in red. Microhomology scores were calculated between the virus and host sequences for the virus side (viral sequence of the junction) and the host side (host sequence of the junction). All additional samples can be found in S1 Fig. (B): P-values from statistical analysis of scores from the virus and host side of samples showing Z-pattern or linear integration compared to scores obtained for 200 random viral and host sequences. Identical bases at the virus-host junction were assigned to the viral side; results with identical bases assigned to the host side can be found in S6B Fig. The virus side of Z-pattern integration shows significantly higher homology scores (p < 0.05, dashed line). The host side and the linear integration pattern are not significantly different from the random sequences.
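The scoring scheme itself is not spelled out in this legend. As a purely hypothetical stand-in, the sketch below counts how many consecutive bases are identical between the viral and host reference sequences on each side of a junction; the function name, the position-by-position alignment of the two strings and the toy sequences are all assumptions and should not be read as the published score.

```python
def junction_microhomology(virus_ref, host_ref, junction):
    """Count identical consecutive bases on each side of a junction.

    `virus_ref` and `host_ref` are equal-length strings aligned position by
    position; `junction` is the index of the first base after the breakpoint.
    Returns (matches extending left, matches extending right).
    """
    if len(virus_ref) != len(host_ref):
        raise ValueError("reference windows must have equal length")

    left = 0
    i = junction - 1
    while i >= 0 and virus_ref[i] == host_ref[i]:
        left += 1
        i -= 1

    right = 0
    j = junction
    while j < len(virus_ref) and virus_ref[j] == host_ref[j]:
        right += 1
        j += 1

    return left, right

# Toy sequences (hypothetical): identical bases at positions 2-5 only.
left_run, right_run = junction_microhomology("ACGTTGCA", "TTGTTGAA", 4)
```

Here both references share two identical bases immediately left and two immediately right of the junction, so the function returns `(2, 2)`.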
Fig 8.
Histone modification pattern in MKL-1 and WaGa cells.
(A): Coverage of the activating histone mark H3K4-me3 and the repressive histone mark H3K27-me3 on integrated MCPyV obtained by ChIP-Seq of WaGa and MKL-1 cells (upper two panels). Dashed lines represent breakpoints into the host genome, red lines the truncating event in LT. In the lowest panel, previously published H3K4-me3 ChIP-Seq data from a replication assay (RA) performed in PFSK-1 cells [37] are included for comparison. Note: The viral reference genome JN707599 is presented starting with nucleotide 2,470 for better visualization of ChIP-Seq patterns located at the viral promoters (see annotation of X-axis). (B): ChIP-Seq data for H3K4-me3 and H3K27-me3 from WaGa (upper panel) and MKL-1 cells (lower panel). The left and the right panel represent the two host genomic regions (1Mbp) of the WaGa (left) and MKL-1 (right) integration sites. The corresponding junctions (L and R, marked by arrows) are indicated. The asterisk marks an additional H3K4-me3 signal which is not present in MKL-1. The signal is located within the 66kbp host duplication and flanks junction R. It originates from the H3K4-me3 signal of the early region of the integrated MCPyV genome that harbors the right breakpoint (R, see A) and extends into the host chromatin. Host duplication in WaGa is visible by the markedly enhanced ChIP input signal.
Fig 9.
Epigenetic properties of MCC cell lines and MCPyV integration sites.
(A): Correlation and clustering of H3K4-me3 profiles from WaGa and MKL-1 in comparison to 48 selected tumor cell lines and primary cells obtained from the ENCODE database. Correlation and clustering were performed using DeepTools and are based on MACS2 identified H3K4-me3 peak regions in the WaGa cell line. (B): Cellular chromatin environment at integration sites of MCC cell lines (350kbp window). Heat maps represent ENCODE ChIP-Seq signals of different cell types and cell lines (n is given beneath each modification) and include MKL-1 (M) and WaGa (W) data as indicated for H3K4-me3 and H3K27-me3 (please note increased track height of MKL-1 and WaGa for better visualization). Start and end of the bars in the integration track indicate positions of the left and right junctions of the respective integration site. Endogenous positive control regions were included for each histone modification using the same magnification (GAPDH: H3K4-me3 and H3K27-ac; ZNF268: H3K9-me3; HOXC13: H3K27-me3).
Fig 10.
(A): DNA replication of MCPyV is bidirectional (theta amplification) with replication forks starting at the ori (blue) and moving in opposite directions. Stalling replication forks (yellow star) can result in aberrant defective viral genomes. Top: Stalling replication forks induce mutations (black bolt) in the early region of the viral genome. The remaining fork induces unidirectional rolling circle amplification (RCA) resulting in large linear concatemers of mutated viral genomes. Bottom: Collision of a moving fork with a stalled fork leads to a dsDNA break at the moving fork. Recombination at the converging forks results in viral genomes with large inversions that truncate the early region. Both scenarios (RCA and break with recombination) yield linear defective (concatemeric) viral genomes. (B): (I) A linear viral genome is recognized as a dsDNA break and undergoes resection of the 5’ ends by the host machinery. The same mechanism resects the 5’ end of a dsDNA break in the host DNA. (II) Homologies between viral and host sequences are used by microhomology-mediated end joining (MMEJ) to ligate the viral genome to a dsDNA break in the host genome. (III) The 3’ single-stranded end of the viral genome invades a homologous host region and (IV) starts DNA synthesis in a D-loop structure (microhomology-mediated break-induced replication, MMBIR). (V) DNA synthesis reaches the original dsDNA break with the viral genome and (VI) connects with the other side of the dsDNA break by an unknown mechanism. (VII) The complementary strand is synthesized in a conservative mode using the newly synthesized strand as a template, resulting in (VIII) an amplification of several kbp of host sequence surrounding the MCPyV integration site and a Z-pattern integration. (C): Without resection of 5’ ends, a defective linear viral genome is integrated into a dsDNA break of host DNA by nonhomologous end-joining (NHEJ). The integration mechanism is independent of homologies between viral and host sequences and results in a linear integration pattern.