HaSAPPy: A tool for candidate identification in pooled forward genetic screens of haploid mammalian cells

Haploid cells are increasingly used for screening complex pathways in animal genomes. Hemizygous mutations introduced through viral insertional mutagenesis can be directly selected for phenotypic changes. Here we present HaSAPPy, a tool for analysing sequencing datasets of screens that use insertional mutations in large pools of haploid cells. Candidate genes are predicted by identifying enrichment of insertional mutations after selection through simultaneous evaluation of several parameters. We developed HaSAPPy for the analysis of genetic screens for silencing factors of X chromosome inactivation in haploid mouse embryonic stem cells. To benchmark its performance, we further analyse several datasets of genetic screens in human haploid cells for which candidates have been validated. Our results support the effective candidate prediction strategy of HaSAPPy. HaSAPPy is implemented in Python, licensed under the MIT license, and is available from https://github.com/gdiminin/HaSAPPy.


Introduction
Next generation sequencing (NGS) has facilitated the exploration of animal genomes in a number of areas, including recent approaches to genetic screening. In mammals, screening of important biomedical pathways can be performed orders of magnitude faster in cell cultures than in the organism, and at a fraction of the cost. Cell based screens have been performed using RNA interference [1], sequence specific nucleases [2], and mutagenesis of haploid cells. The latter strategy was originally implemented in a haploid human leukemia cell line [3] and recently extended to haploid embryonic stem cells (ESCs). A number of successful screens illustrate the power of pooled mammalian haploid cell screening. Clinically relevant pathways [4,5] including pathogen [3,6-8] and toxin [9-12] mechanisms have been studied in human haploid cells. Haploid ESCs from mouse [13,14] and from human embryos [15] have been used to characterize mechanisms of drug action [13,14,16] and developmental questions [17,18]. In haploid cells, mutations can be introduced in a hemizygous state by chemical mutagenesis [19], viral [13,18,20], and transposon vectors [16,17], ensuring that potential phenotypic changes become expressed. Several screens have used gene trap vectors, which are characterized by high mutagenicity and allow identification of mutations by cloning the genomic regions flanking the insertion sites. Typically, cell pools containing tens of millions of viral insertions are subjected to selection for predefined phenotypic characteristics. Comparison of viral insertions before and after selection is used to detect candidate genes whose mutations become enriched. Use of NGS techniques to simultaneously analyse millions of insertion sites in large cell pools without isolating clonal cell lines has facilitated comprehensive screening in mammalian cells [3].
Previously, computational methods for candidate identification in microbial genetic insertional screens [21] and mammalian haploid cell screens [22] have been developed. Large mammalian genomes pose challenges, as a lower mutation density can be achieved and regional differences in chromatin packaging can bias the distribution of insertions. Here, we present HaSAPPy (Haploid Screen Analysis Package in Python) for computational candidate identification. HaSAPPy analyses NGS datasets to reconstruct viral insertions in control and selected cell pools and estimates, for each gene, the number of independent insertions, the enrichment of disruptive mutations, and the ratio of disruptive over neutral mutations. From the fold enrichment of these three parameters during selection, candidate genes are identified as outliers from the distribution of all genes. We demonstrate that this strategy is effective by benchmarking candidate prediction on published datasets from screens aimed at identifying genes required for Xist-mediated gene repression [18] and for viral entry [8,20] in mouse and human haploid cells.

Design and implementation
HaSAPPy uses NGS datasets to predict candidates from insertional mutagenesis screens in haploid cells based on selection for specific phenotypes, including genetically encoded reporters, cell survival, and physical isolation of cells (Fig 1A). Gene-trap insertion sites are identified by reads starting with the first base of the genomic sequence flanking the vector insertion [23] (Fig 1B). Reads are preprocessed to eliminate regions with consistently low base quality while retaining a maximum of sequence information. Subsequently, adaptor removal is performed, and reads that become shorter than a threshold (default 26 nt) are discarded. HaSAPPy is preconfigured with three read mappers, selected using a test suite [26]: the Burrows-Wheeler transform based Bowtie2 [24] and nvBowtie (https://github.com/NVlabs/nvbio/tree/master/nvBowtie), and the hash table index based NextGenMap [25]. HaSAPPy also accepts pre-aligned datasets in Sequence Alignment/Map (SAM) format and applies a threshold for alignment quality (MAPQ).
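The preprocessing steps above can be sketched as follows. This is a minimal illustration with hypothetical function names, not HaSAPPy's actual implementation; the adaptor sequence is an example (Illumina TruSeq-style), and only the default 26 nt length threshold is taken from the text.

```python
# Sketch of read preprocessing: trim consistently low-quality 3' ends,
# strip a known adaptor, and discard reads below the length threshold.
# Function names and the adaptor sequence are illustrative assumptions.

MIN_LENGTH = 26  # default length threshold described in the text

def trim_low_quality(seq, quals, min_q=20):
    """Trim the 3' end while base quality stays below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def remove_adaptor(seq, adaptor):
    """Clip the read at the first occurrence of the adaptor sequence."""
    idx = seq.find(adaptor)
    return seq if idx == -1 else seq[:idx]

def preprocess(reads, adaptor="AGATCGGAAGAGC"):
    """Return the reads that survive trimming, adaptor removal, and the
    length filter. Each input read is a (sequence, qualities) pair."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_low_quality(seq, quals)
        seq = remove_adaptor(seq, adaptor)
        if len(seq) >= MIN_LENGTH:
            kept.append(seq)
    return kept
```

In practice these steps run before read mapping, so that only high-quality genomic sequence flanking the vector insertion is aligned.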

Reconstruction of virus insertions
Unequal amplification during NGS library preparation makes read numbers ineffective for estimating selection. We reconstruct independent insertions (I.I.) from the start positions of read alignments in a strand specific manner (Fig 1C). To prevent sequence errors at read ends from leading to multiple counting of insertions, reads within a genomic window on the same strand are attributed to the same virus insertion (Fig 1C) and collapsed onto a single I.I. at the position with the highest initial read count. A single I.I. per genomic position guarantees that all insertions are independent. Genome annotation from RefSeq Genes or ENSEMBL transcripts, which is available through the UCSC genome browser Tables interface (http://genome.ucsc.edu/cgi-bin/hgTables?command=start), is used for defining genomic intervals of genes [27] that cover all transcript variants listed for a unique gene name. To further evaluate the mutagenic effect of insertions, two sub-regions, 'Exon_specific' and 'Introns', are considered. Counts are obtained by iterating over the I.I. of each dataset for all genes. Insertion counts for introns are obtained in an orientation specific manner to assess the trapping function of the splice acceptor of the gene trap (Fig 2A). Disrupting insertions (D.I.) are calculated as the sum of exonic I.I. and intronic I.I. with the splice acceptor aligned with transcription. A bias is calculated as the ratio of intronic insertions in sense over anti-sense orientation for each gene (Fig 2B). I.I., D.I., and Bias are the basic parameters for estimating selection of genes in a screen.
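The collapsing procedure can be illustrated with a short sketch (hypothetical data structures, not HaSAPPy's code): read start positions on the same chromosome and strand that fall within the genomic window are merged into one I.I., placed at the position with the highest initial read count, and filtered by a cumulative read count threshold (Fig 1C).

```python
from collections import Counter

def reconstruct_insertions(read_starts, window=5, min_reads=2):
    """read_starts: iterable of (chrom, strand, position), one entry per read.
    Returns {(chrom, strand, position): cumulative read count} of I.I.
    Window and threshold defaults follow the values discussed in the text."""
    counts = Counter(read_starts)
    insertions = {}

    def flush(chrom, strand, cluster):
        # representative position = highest initial read count in the cluster
        best = max(cluster, key=lambda p: counts[(chrom, strand, p)])
        total = sum(counts[(chrom, strand, p)] for p in cluster)
        if total >= min_reads:  # read count threshold (grey in Fig 1C)
            insertions[(chrom, strand, best)] = total

    for chrom, strand in {(c, s) for c, s, _ in counts}:
        positions = sorted(p for c, s, p in counts if (c, s) == (chrom, strand))
        cluster = [positions[0]]
        for p in positions[1:]:
            if p - cluster[-1] > window:  # outside the window: start a new I.I.
                flush(chrom, strand, cluster)
                cluster = [p]
            else:
                cluster.append(p)
        flush(chrom, strand, cluster)
    return insertions
```

Because collapsing is strand specific, two insertions at the same position but in opposite orientations are kept separate, which preserves the orientation information needed for the Bias parameter.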

Candidate identification
HaSAPPy allows the analysis of multiple experiments against a single control, whereby each dataset can contain multiple replicates. For predicting candidates we implement a method that evaluates multiple parameters in parallel. Each gene is represented by a vector composed of the fold enrichment of I.I., D.I., and Bias in selected relative to control datasets. The majority of genes, for which mutations are not selected, clusters in this vector space. Selection is detected as the distance of a gene from this cluster through divergence of one or more parameters, using the Local Outlier Factor (LOF) [28]. Candidates are then ranked by evidence for selection by sorting score factors in decreasing order.

[Fig 1 legend (fragment): ... (1). LAM PCR products are purified and a single stranded adaptor is ligated at the 3' end (2). NGS libraries are amplified by exponential PCR using primers at the end of the viral LTR and in the adaptor (3), and subsequently sequenced (4). P5 and P7 represent Illumina NGS adaptors. (C) For reconstruction of virus insertion events, the mapping of the first base of each read is assumed to represent the genomic position of the insertion event (I.). Read alignments are collapsed within a genomic window in a strand specific manner into the position of an independent insertion (I.I.), which is chosen as the position with the highest initial read count. The cumulative read count is reported and only insertions that satisfy a read count threshold (grey) are considered in the analysis.]
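As a rough illustration of this outlier strategy (simulated data and default scikit-learn settings, not HaSAPPy's exact configuration), genes with strong fold enrichment in all three parameters separate from the main cluster and receive high LOF scores:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def rank_by_lof(vectors, n_neighbors=20):
    """vectors: (n_genes, 3) array of fold enrichments (I.I., D.I., Bias).
    Returns gene indices sorted by decreasing evidence for selection."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(vectors)
    scores = -lof.negative_outlier_factor_  # larger = stronger outlier
    return np.argsort(scores)[::-1]

# simulated background: unselected genes cluster around a fold change of 1
rng = np.random.default_rng(0)
background = rng.normal(loc=1.0, scale=0.2, size=(500, 3))
# two hypothetical selected genes, strongly enriched in all parameters
selected = np.array([[8.0, 12.0, 6.0], [5.0, 9.0, 4.0]])
ranking = rank_by_lof(np.vstack([background, selected]))
```

In this simulation the two planted genes (indices 500 and 501) head the ranking, mirroring how candidates emerge as the most diverging genes in the 3-dimensional vector space.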

Implementation
HaSAPPy is written in Python and depends on HTSeq [27] for handling sequence files and genomic coordinates, and on pandas, matplotlib, numpy, scipy, sklearn, and xlsxwriter for data analysis and output. Parameters of the analysis, including table layout and graphics, are specified in a text file that is used to instantiate HaSAPPy. Sorting, filtering, and visualization of insertions over candidate gene regions in SVG format are supported (S1 Fig).

Results
We developed HaSAPPy by analysing a screen for silencing factors in X inactivation [18], which used an inducible Xist expression system in haploid mouse ESCs to identify mutations that mediate cell survival by escaping inactivation of the X chromosome. Seven control (SRX1060416) and seven selected (SRX1060407) datasets with a total of 300 million reads were analysed on workstations with Ubuntu Linux version 14.04, at least 32 gigabytes of memory, and graphics processing units (GPUs). Runtimes between 90 minutes and 3 hours were dominated by read mapping; subsequent analyses could also be performed on a MacBook (Intel Core i7, 2.3 GHz) in 30 minutes. We evaluated different read mappers and alignment parameters by benchmarking on experimental and generated read datasets using a previously published test suite [26]. As a result, HaSAPPy is preconfigured to run with Bowtie2 [24], nvBowtie, and NextGenMap [25]. The latter two aligners utilize accelerators to speed up read alignment by taking advantage of recent hardware developments. Although some differences in the number of insertions assigned to candidate genes were observed (Fig 3A, S2 Fig, Table A in S1 Text), our results suggest that GPUs can be effective in speeding up read mapping without a loss in sensitivity, consistent with earlier results [25,26].

Reconstruction of independent virus insertion events
Independent insertions are reconstructed from read alignments for estimating the selection of mutations in potential candidate genes over background. Typically, each insertion results in multiple reads, whereby read counts reflect amplification and library preparation as well as the abundance of insertions in the cell populations [22]. We find that individual insertions can be represented by more than 10^5 reads. Their mapping to isolated genomic regions without annotated genes hints at strong amplification bias. To ensure that only insertions that have been selected independently are scored, we consider for each dataset at most one insertion per genomic position by combining all reads starting within a genomic window on the same strand into a single independent insertion (I.I.). The genomic position with the highest initial read count is reported along with the cumulative read count (Fig 1C). We evaluated the effects of the window size on candidate ranking (Fig 3B, Fig 4A and 4B, and Table B in S1 Text). Although increasing window sizes did not affect the ranking of the highest scoring genes, variation was observed for lower ranked candidates. We determined that a window size larger than or equal to 5 nucleotides resulted in greater reproducibility. The average distance between independent viral insertions can be expected to be larger than 5 bases, suggesting that the loss of information is negligible. Trimming or sequence errors cause alignment shifts of a few nucleotides, which are corrected by our approach.
Read alignment errors can affect the analysis of genetic screens, where a single read can be sufficient for the reconstruction of a viral insertion event. To mitigate this problem, we introduce a threshold for the number of reads required to consider an insertion in the analysis. Selecting stringent thresholds reduces the number of I.I., which is particularly problematic for unselected libraries with large numbers of insertions and lower read coverage (Table C in S1 Text). In addition, candidate genes that are supported only by insertions with low read coverage become undetectable (Fig 3C, Fig 4C and 4D), reflecting a reduction in sensitivity. A threshold of 2 reads per insertion can reduce noise from alignment errors without materially reducing I.I. numbers. Additionally, different read number thresholds can be used to adjust for differing sampling rates in control and selected datasets.

Comparison of candidate ranking strategies
To obtain a candidate list, a scoring function or selection strategy needs to be implemented. Previous methods have applied Fisher's exact test (FT) for detecting enrichment of D.I. in selected compared to control samples [20]. VISITs implements a statistical candidate selection strategy by combining FT for D.I. enrichment with a binomial test comparing D.I. to I.I. in selected samples [22]. Candidates are subsequently ordered by increasing probability of no difference between selected and control datasets. FT based methods have not considered quantification of additional biological parameters as evidence of selection. Rank based methods can also be applied for sorting candidates, whereby genes are assigned a rank according to their number of D.I., and the difference of the logarithmic rank positions of a gene in the selected and control datasets is used for candidate ranking by sorting for increasing values. Although these strategies are generally effective, they do not take full advantage of gene structure information. To improve on this situation, we developed a new algorithm based on two considerations. Firstly, multiple parameters are evaluated for each gene in parallel for detecting evidence of selection. Secondly, evaluation of parameters is performed relative to all other genes. For each gene, we construct a vector using the fold enrichment of I.I., D.I., and Bias values between selected and control samples. The distance from the mass centre of all genes in the 3-dimensional vector space is used to obtain a score with the Local Outlier Factor (LOF) algorithm [28]. Candidates are identified and scored as the most diverging genes, similar to outliers (Fig 5A and 5B, S3A Fig and Table 1).
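For concreteness, the FT and Rank strategies described above can be sketched as follows; the interfaces are illustrative assumptions, not code taken from HaSAPPy or VISITs.

```python
import math
from scipy.stats import fisher_exact

def ft_pvalue(gene_di_sel, total_di_sel, gene_di_ctrl, total_di_ctrl):
    """Fisher's exact test for enrichment of a gene's D.I. in the selected
    sample: the gene's D.I. versus all other D.I., selected versus control."""
    table = [[gene_di_sel, total_di_sel - gene_di_sel],
             [gene_di_ctrl, total_di_ctrl - gene_di_ctrl]]
    _, p = fisher_exact(table, alternative="greater")
    return p  # small p = strong evidence for enrichment

def rank_score(gene, di_sel, di_ctrl):
    """Difference of logarithmic rank positions of a gene (ranked by D.I.
    count) between selected and control datasets; candidates are sorted by
    increasing values, so more negative = stronger selection."""
    order_sel = sorted(di_sel, key=di_sel.get, reverse=True)
    order_ctrl = sorted(di_ctrl, key=di_ctrl.get, reverse=True)
    return (math.log(order_sel.index(gene) + 1)
            - math.log(order_ctrl.index(gene) + 1))
```

Both sketches score each gene in isolation, which illustrates the limitation noted above: neither combines several biological parameters per gene the way the LOF vector approach does.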
We compared the performance of different candidate ranking strategies by applying the methods available in HaSAPPy (FT, Rank, LOF) and VISITs [22] to the analysis of the X inactivation datasets aligned with Bowtie2 (Fig 5C and Table D in S1 Text). Ten genes were shared among the first twenty predicted candidates by all methods, although differences were observed in the ranking order (S4 Fig). The LOF and Rank methods showed substantial overlap (Fig 5D and 5E) despite their different algorithms and parameters. Candidates predicted by VISITs and FT showed limited overlap, which is likely explained by the implementation of additional data transformations in VISITs. We next evaluated candidates by examining the orientation of insertions in introns (Bias [29]) as evidence for selection. We observed an overall lower Bias for genes that were uniquely predicted by FT and VISITs than by LOF (Table E in S1 Text). To further evaluate the robustness of candidate prediction, we segregated the 7 replicates of the X inactivation dataset and treated them as independent experiments. Candidate lists obtained by using the median number of insertions in control and selected samples substantially overlapped with lists obtained from the entire dataset, as did their relative order (Fig 5F and Table F in S1 Text). Lower ranked candidates with few insertions were lost in the lists of the median Rank and LOF analysis, reflecting the subsampling of the data, whereas candidates specifically predicted by FT showed substantial divergence. In conclusion, replicates can increase the robustness of analyses, but the overall number of I.I. correlates with the sensitivity for detecting potential candidates. Xist and Spen were experimentally validated [18] and ranked first and third in the lists of the Rank and LOF methods (Table D in S1 Text). FT and VISITs produce large selections of candidates in which Xist and Spen are present but effectively occupy lower positions.
Tcf7l1 is predicted as a potential candidate by all methods and appears near the top of the lists output by VISITs and FT when candidates are sorted by increasing probability of no difference. For this reason we were interested in evaluating Tcf7l1 experimentally and engineered a mutation in the first exon using CRISPR/Cas9 nucleases. Loss of Tcf7l1 protein was confirmed by Western analysis (S8 Fig). In HATX3 ES cells, a doxycycline inducible Xist allele makes it possible to study the effect of a Tcf7l1 mutation on Xist function [18]. Induction of Xist caused a similar cell loss in the parental and Tcf7l1 deficient HATX3 ESCs (Fig 5G), which was not comparable to the increased survival caused by a mutation in Spen [18]. Therefore, Tcf7l1 is likely not involved in Xist mediated gene repression. Selection of Tcf7l1 might be explained by the generally enhanced self-renewal and reduced differentiation of Tcf7l1 deficient mouse ESCs [30]. Hence, what appears to be detected is a positive selection for Tcf7l1 mutations in mouse ESCs that is unrelated to a loss of Xist function. The elimination of such candidates requires additional considerations.

Visualization of insertions for candidate validation
We implemented into HaSAPPy a graphical view of insertions in candidate gene loci that facilitates a comparison between selected and control samples. Colour coding of sense and anti-sense insertions visualizes selection for disruptive over neutral mutations within introns. We observe an increase of insertion numbers in selected samples for genes that were identified by all ranking strategies (FT, VISITs, Rank and LOF), and the enrichment for mutagenic insertions can be confirmed (S4 Fig). Additionally, genes that are uniquely predicted by the LOF algorithm have similar properties (S5 Fig). In contrast, the candidate list produced by the FT method (S6 Fig) contains genes characterized by a high number of insertions but little enrichment for disruptive over neutral mutations.

Benchmarking candidate prediction
For comprehensive benchmarking of HaSAPPy we reanalysed several published screens in human haploid cells for resistance to virus entry [8,20]. These studies differ from the screen for silencing factors in X inactivation in two important ways. Firstly, the screens were performed in human haploid cells and therefore require the human genome and annotation for the analysis. Secondly, the datasets were generated on an Illumina HiSeq instrument and consist of much larger read numbers with a read length of 30 nucleotides, shorter than the read length in our Xist screen [18]. HaSAPPy was run using our previously determined parameters.
The top 4 candidates predicted by the LOF algorithm have been experimentally validated as factors for Lassa virus entry [20], suggesting that HaSAPPy detects candidates accurately and with similar success as the manually curated strategy of the original study (Fig 6A and 6B, Table K in S1 Text). In addition, we find evidence for selection of NBPF20 through a high enrichment of D.I. mutations. NBPF20 was not detected in the original study. Taken together, the performance of HaSAPPy with preset parameters in a total of 5 different screens [8,18,20] suggests a wide range of applications over different genomes and sequencing technologies. The robustness against changes in data analysis parameters (Fig 6C-6E and Tables H-J in S1 Text) facilitates the adaptation and use of HaSAPPy as a standardized analysis method for haploid cell genetic screens.

Discussion
Analyses of genetic screens require sensitivity to avoid missing individual candidates. Pathways can be represented by single candidates when redundancy or lethality reduces the opportunities for discovery. Conversely, false predictions can lead to costly experimental validation, which limits the number of genes that are followed up. Therefore, there is a need for effective candidate ranking. NGS datasets from genetic screens require strategies for identifying evidence of selection and reducing technical noise that differ from other sequencing analysis problems. Our study introduces three methodological procedures for candidate prediction in haploid screens. Firstly, we aim to accurately reconstruct viral insertion events from read alignments while eliminating effects of amplification bias and sequencing and alignment errors. In HaSAPPy, reads within a genomic window are used to reconstruct a viral insertion event that is guaranteed to have been selected independently. This procedure is unlikely to remove truly independent insertions, but will remove effects of alignment shifts caused by sequencing and amplification errors. The window size and the read number threshold for defining an insertion can be adjusted to further increase the robustness of candidate selection.
Secondly, HaSAPPy scores multiple parameters for each gene simultaneously to detect evidence of selection. Considering the mutagenic effect of intronic insertions ensures that candidates with a wide range of possible gene structures can be detected. Whereas an increase in the total number of insertions can be expected for selected genes, statistics can be less sensitive for small genes. Multiple viral integrations per cell can lead to co-selection of a driver mutation with passenger mutations that are not causal to the phenotype. Genes in genomic regions with frequent virus insertions can maintain high numbers of insertions through co-selection. Additional evidence for selection can come from examining disruptive mutations, and the ratio between disruptive and neutral intronic mutations can be useful for detecting evidence of selection when large introns are present. Scoring these three parameters simultaneously in a mathematically consistent framework is achieved by using a multidimensional outlier method. We verified that this approach is effective in predicting candidates from 5 different screens. HaSAPPy is able to rank candidates in order of decreasing evidence for selection and performs equally well as or better than previously used methods. Co-selection of passenger mutations may affect the predictions of FT based algorithms, leading to a larger set of candidates that are not sorted by biological evidence of selection. In addition, HaSAPPy compensates for experiment specific effects that act on all genes by using an outlier detection strategy, without a need for normalizing datasets. Complex library preparation, amplification, and insertion biases in haploid screening can lead to a lack of homogeneity in insertion coverage between samples. Normalization strategies are therefore difficult to implement and risk distorting the experimental information through unanticipated effects of data transformation.
Finally, for enabling researchers to add biological expertise and literature to candidate evaluation, HaSAPPy makes all parameters of the analysis accessible in a customizable table format. A graphical overview of insertions in candidate genes supports the user with selecting candidates for further studies.

Availability and future directions
HaSAPPy is implemented in Python and released as open-source software under the MIT license. It supports many typical haploid screening projects and can be adapted to different experimental designs and candidate ranking strategies. Presently, HaSAPPy does not utilize transcript datasets of haploid cells [31]. In future versions, candidates for which little evidence of transcription is observed could be eliminated and gene models could be refined. Development of graphical user interfaces for specifying run parameters could further facilitate interactive use of HaSAPPy on workstations. It will also be attractive to explore methods for improving read alignment of screening datasets, which often contain short reads and sequence errors. A preliminary analysis indicates that nearly half of all viral insertions occur within genes.