In vivo insertion pool sequencing identifies virulence factors in a complex fungal–host interaction

Large-scale insertional mutagenesis screens can be powerful genome-wide tools if they are streamlined with efficient downstream analysis, which is a serious bottleneck in complex biological systems. A major impediment to the success of next-generation sequencing (NGS)-based screens for virulence factors is that the genetic material of pathogens is often underrepresented within the eukaryotic host, making detection extremely challenging. We therefore established insertion Pool-Sequencing (iPool-Seq) on maize infected with the biotrophic fungus U. maydis. iPool-Seq features tagmentation, unique molecular barcodes, and affinity purification of pathogen insertion mutant DNA from in vivo-infected tissues. In a proof of concept using iPool-Seq, we identified 28 virulence factors, including 23 that were previously uncharacterized, from an initial pool of 195 candidate effector mutants. Because of its sensitivity and quantitative nature, iPool-Seq can be applied to any insertional mutagenesis library and is especially suitable for genetically complex setups like pooled infections of eukaryotic hosts.

construction of the 195 (single-gene) insertional mutants of U. maydis and the library preparation protocol used, we expected the double-stranded fragments subjected to sequencing to have the following layout (both strands shown): The part denoted ". . . " is a genomic U. maydis sequence, more specifically a sequence from the 3' or 5' flank of one the 195 studied genes. Our custom script trim.tag.py matched the sequenced read pairs against this expected pattern, allowing up to 4 mismatches (not counting Ns) within the fixed part of each mate.
Our script then stored the UMIs as part of the read names, and stripped all technical sequences (i.e. everything except the ". . . " part) from the reads. If the two mates of a pair overlapped (i.e. for fragments shorter than 2 · 75 = 150 bp), a technical sequence from one mate possibly appeared reverse-completed on the other mate as well. We detected this by checking whether a gap-less ends-free alignment of the two reads had an identity ≥ 90%, and then used the alignment to locate and remove the corresponding part of the complementary mate as well.
Proper read pairs (read pairs where one mate maps in the forward direction, the other in the reverse direction, and the mates point "towards" one another) were assigned to a particular gene if either mate's first sequenced base mapped to within ±10 bp of one of the genes flanks, and the rest of that read continued "away" from the gene.
Improper read pairs (non-proper read pairs where nevertheless both mates were mapped) were ignored.
Singleton reads (i.e. reads whose mate could not be mapped) were assigned to a particular gene if their first sequenced base mapped to within a 1000 bp window on either side of the gene and they continued "towards" the gene.
Read pairs assigned to no or multiple genes were ignored.

UMI &
Correcting UMIs for sequencing errors (umicounts.tag.py). To count U. maydis insertional mutant genomes (i.e. cells), we counted the number of (sufficiently distinct, to protect against sequencing errors) combinations of UMI and mapping position within the reads mapping to a particular flank (3' or 5') of a particular gene. For the sake of brevity, UMI in the following denotes a combination of a particular 12 bp molecular barcode (so far called UMI) and the two mate's mapping positions.
To merge similar UMIs (which likely stem from the same cell), we used a variation of the algorithm of Smith et al. [5]. We started with the raw list of unique UMIs. We then marked an UMI p as mergeable into UMI q if the molecular barcodes disagreed at most at a single position, the mapping positions by no more than ±3 bases, and p was found in fewer reads than q. The UMIs not marked as mergeable were then assumed to be error-free. The read counts of UMIs that were mergeable (directly or indirectly) with a single error-free UMI were added to the error-free UMI's read count. UMIs marked (directly or indirectly) as mergeable with multiple error-free candidates were discarded as being ambiguous.
This produced, for both flanks of every gene, a separate list of assumedly errorfree UMIs and per-UMI read counts.
Correcting for artifacts and lost UMIs to estimate abundance (counts2results.R). We then further processed the per-flank UMIs using the algorithm of Pflug & von Haeseler [6], i.e. we removed all UMIs with a read count below a manually set read-count threshold (T = 1, except T = 5 for Experiment B R1 & R2 Output, and T = 9 for Experiment B R3 Output), and then estimated (for both flanks of every gene separately) the percentage of UMIs lost during sequencing and data filtering.
This yielded, separately for both flanks of every gene, a number n obs of observed UMIs (after all filtering steps) and a loss estimate . Given these two, a (flankspecific) estimate of true mutant abundance is n obs /(1 − ).  3 . We then further assumed that for neutral mutants the expected true input and output abundances are proportional (with the same factor λ for all neutral mutants in a particular pair of input and output libraries), but that the output abundances have additional dispersion d due to random fluctuations of mutant growth, i.e. that

S A
To find the null distribution (i.e. assuming mutant m is neutral) for the output UMI count N out m given observed input count n in m , we computed the posterior A in m | N in m (using degenerate prior Gamma(0, 0)), added dispersion d to get A out m | N in m , and combined with N out m A out m . The resulting negative binomial distribution depends on two mutant-independent parameters, proportionality factor λ and dispersion d, To control the false discovery rate (FDR), we applied the Benjamini-Hochberg (BH) procedure [7] (separately) to the collection of low and high p-values computed for a particular pair of input and output libraries, and set the FDR target to 10%.
To quantify the effect size, we also computed the log 2 fold change (lfc m ) between each mutant m's observed output UMI count and the expected value for neutral mutants, (5) lfc m = log 2 .
Selecting the neutral reference set. We started with a candidate list of 13 insertional mutants described as neutral in the literature (UMAG_01297, UMAG_01300, UMAG_01302, UMAG_02192, UMAG_02193, UMAG_03046, UMAG_03201, UMAG_03202, UMAG_03615, UMAG_06222, UMAG_10403, UMAG_10553, UMAG_12313), estimated λ and d for all 6 input-output pairs, and computed these mutants' log 2 fold changes. Suspecting that not all of these mutants are truly neutral, we looked for outliers (defined as for boxplots in R, values more than 1.5 IQR larger/smaller than the 75%/25% quantile) amongst these log 2 fold changes and discarded them. We repeated this procedure for the remaining 8 candidates (UMAG_01302, UMAG_02192, UMAG_02193, UMAG_03046, UMAG_03202, UMAG_03615, UMAG_10403, UMAG_10553), and found 3 additional outliers. The remaining 5 candidate mutants (UMAG_01302, UMAG_02193, UMAG_03202, UMAG_10403, UMAG_10553) were then used as the final neutral reference set, and all p-values, q-values and log 2 fold changes were re-computed based on this set.

Sensitivity of a genome-wide screen.
To estimate the sensitivity of a genomewide screen, we simulated experiments containing m = 20, 000 distinct mutants using the statistical model from equation (1), but assuming a negative binomial distribution for N out m to account for the additional dispersion d of the output abundances (see also equation 2). We assumed the input abundances to be identical for all mutants (i.e A in 1 = . . . = A in 20.000 = A in ), the output abundances of k mutants to show a virulence phenotype and hence to be reduced 2 −ρ -fold (i.e. A out 1 = . . . = A out k = A in · 2 ρ ), and the other m − k mutants to be neutral (A out k+1 = . . . = A out 20,000 = A in ). Based on ≈ 14% of mutants in our screen showing a reproducible phenotype, and supplemental table 5 of Lanver et al. [8] showing ≈ 22% of genes to be upregulated during infection, we set k = 20, 000·0.14·0.22 = 600 (i.e. ≈ 3% of mutants have a virulence phenotype). We set the additional dispersion d to the highest value observed in our 6 experiments (0.0126), and simulated 100 experiments for each input abundance A in = 1, 2, . . . , 100, once with log 2 fold change of ρ = −1.53 (corresponding to the "Reduced" group in figure 4a) and once with ρ = −2.75 (corresponding to the "Lost virulence" group). For each simulated experiment we computed q-values as described above (see Computing p-values, q-values and effect sizes), determined the percentage of significant mutants within the ones with a virulence phenotype, and averaged these percentages over the 100 experiments to compute the efficiencies shown in figure S3.