A Massively Parallel Pipeline to Clone DNA Variants and Examine Molecular Phenotypes of Human Disease Mutations

Understanding the functional relevance of DNA variants is essential for all exome and genome sequencing projects. However, current mutagenesis cloning protocols require Sanger sequencing, and thus are prohibitively costly and labor-intensive. We describe a massively-parallel site-directed mutagenesis approach, “Clone-seq”, leveraging next-generation sequencing to rapidly and cost-effectively generate a large number of mutant alleles. Using Clone-seq, we further develop a comparative interactome-scanning pipeline integrating high-throughput GFP, yeast two-hybrid (Y2H), and mass spectrometry assays to systematically evaluate the functional impact of mutations on protein stability and interactions. We use this pipeline to show that disease mutations on protein-protein interaction interfaces are significantly more likely than those away from interfaces to disrupt corresponding interactions. We also find that mutation pairs with similar molecular phenotypes in terms of both protein stability and interactions are significantly more likely to cause the same disease than those with different molecular phenotypes, validating the in vivo biological relevance of our high-throughput GFP and Y2H assays, and indicating that both assays can be used to determine candidate disease mutations in the future. The general scheme of our experimental pipeline can be readily expanded to other types of interactome-mapping methods to comprehensively evaluate the functional relevance of all DNA variants, including those in non-coding regions.


Introduction
Owing to rapid advances in next-generation sequencing technologies, tens of thousands of disease-associated mutations [1] and millions of single nucleotide polymorphisms (SNPs) [2,3] have been identified in the human population. With the large number of ongoing whole-exome and whole-genome sequencing projects [2,3], hundreds of thousands of new SNPs are now being discovered every month. Hence, there is an urgent need to develop high-throughput methods to sift through this deluge of sequence data and rapidly determine the functional relevance of each variant. Here, we focus on coding variants, firstly because traitand disease-associated SNPs are significantly over-represented in nonsynonymous sites [4], and secondly because the vast majority of disease-associated mutations identified to date reside within coding regions [1]. We evaluate the functional impact of coding variants by examining their effects on corresponding proteinprotein interactions, because most proteins carry out their functions by interacting with other proteins [5].
Recent studies have begun to use large-scale protein interaction networks to understand human diseases and their associated mutations [5,6]. By integrating structural details with high-quality protein networks, we created a 3D interactome network where the interface for each interaction has been structurally resolved [7]. Using this 3D network, we demonstrated that in-frame disease mutations (missense mutations and in-frame insertions/deletions) are significantly enriched at the interaction interfaces of the corresponding proteins [7]. Our results indicate that alteration of specific interactions is very important for the pathogenesis of many disease genes, highlighting the importance of 3D structural models of protein interactions in understanding the functional relevance of coding variants. However, many important questions still remain unanswered -for example, what fraction of protein-protein interactions is altered by disease mutations to cause the corresponding disorders? Furthermore, do structural details of the interacting proteins, especially the position of the mutation relative to the interaction interface, affect the ability of a given disease mutation to alter a specific interaction?
To address these questions, we decided to focus on proteins with known disease mutations that participate in interactions with available co-crystal structures in the Protein Data Bank (PDB) [8].
To detect the alteration of the interactions by disease mutations, it is necessary to first detect the interactions of the wild-type proteins using an assay of choice. This turns out to be a major bottleneck because all high-throughput interaction-detection assays have very limited sensitivity [9,10]. Our assay of choice is Y2H because there are over 16,000 human protein interactions detected by our version of Y2H that can serve as the reference interactome for comparison [11,12,13,14], the largest for any assay performed to date ( Figure S1). In total, there are 217 interactions detected by our version of Y2H with available co-crystal structures; 51 of these also have known missense disease mutations on corresponding proteins in the Human Gene Mutation Database (HGMD) [1] and the corresponding interactions for the wild-type proteins can be detected in our experiments with strong Y2H-positive phenotypes ( Figure S2; Materials and Methods). Here, we focused on missense mutations because they are intrinsically more likely to generate interaction-specific disruptions [6]. We established a high-throughput comparative interactome-scanning pipeline to clone disease mutations and examine their molecular phenotypes (Fig. 1). The methodologies established here can be readily applied to any non-synonymous variant in the coding region, including nonsense mutations.

Results
Clone-seq: A massively parallel site-directed mutagenesis pipeline using next-generation sequencing The first step of our pipeline is a massively parallel approach, termed Clone-seq, designed to leverage the power of nextgeneration sequencing to generate a large number of mutant alleles using site-directed mutagenesis in a rapid and cost-effective manner. Current protocols for site-directed mutagenesis require picking individual colonies and sequencing each colony using Sanger sequencing to identify the correct clone [15]. This standard approach is both labor-intensive and expensive; therefore, it does not scale up to genome-wide surveys. In Clone-seq, we put one colony of each mutagenesis attempt into one pool ( Fig. 1a; in other words, each pool contains one and only one colony for each desired mutation) and combine multiple pools through multiplexing for one Illumina sequencing run [16]. Colonies for generating different mutations of the same gene can be put into the same pool, which can be easily distinguished computationally when processing the sequencing results. This is true even for mutations occurring at the same site (Fig. 2a, Text S1).
For the 51 selected interactions, we chose 27 disease-associated mutations of residues at the interface (''interface residue''), 100 mutations in the rest of the interface domain (''interface domain'') and 77 mutations away from the interface (''away from the interface''; Fig. 3a,b). These interfaces were determined using solvent accessible surface area calculations as previously described [17,18] on 7,340 co-crystal structures (Materials and Methods). To set up our Clone-seq pipeline, we first started with 39 mutations from these 204 and picked 4 colonies for each mutation. As a reference, we also pooled together all the wild-type alleles in our human ORFeome library to be sequenced together with the 4 pools of the mutagenesis colonies. In total, there were 40.1 million Illumina HiSeq 16100 bp reads for our Clone-seq samples (Text S1) for an average of .2,5006 coverage on all desired mutation sites. Therefore, our Clone-seq pipeline has the capacity to generate .3,000 mutations in one full lane of a HiSeq run with 16100 bp reads, drastically improving the throughput and decreasing overall sequencing costs by at least 10-fold (Text S1). Fig. 2a presents a schematic of the criteria we use to determine which clones contain the desired mutation and can be used for subsequent steps. For example, in pool 1, all reads (ignoring sequencing errors) confirm that genes I and II each contain the desired mutation -T116A and G298T, respectively. For gene III, we want to generate two separate clones with two separate mutations -III A41T and III C194T . Since half the reads contain T41 (instead of A41) and the other half contain T194 (instead of C194), and we normalize DNA concentrations across all samples, we can infer that both mutant clones were generated successfully. In contrast, for gene IV, we see that while half the reads contain A511 (instead of G511), all the reads are wild-type at C74. Thus, we infer that while the IV G511A clone is successfully generated, the IV C74T clone is not. For gene V, although both mutant clones are successfully generated, half the reads contain an additional mutation, C436G. Since it is impossible to know which of the two clones for V contains this unwanted mutation, neither clone is usable. Similarly, we can determine mutant clones I T116A , III A41T , III C194T , IV C74T , IV G511A , V T53G , and V G272A as usable clones in pool n. Based on these criteria, we developed the S score calculation and used it to determine successful mutagenesis attempts (Materials and Methods). Out of 156 colonies for 39 mutations, 125 of them contain the desired mutations (S.0.8), an overall 80% PCR-mutagenesis success rate. In fact, we were able to pick correct clones for all 39 mutant alleles using only the first two pools in Clone-seq. All 78 clones from the first two pools, from which the correct ones were selected for use in subsequent steps, were also Sanger sequenced for verification. 55 Clone-seq positive results with S.0.8 were all confirmed and there is a clear separation in the S scores between the successful and failed mutagenesis attempts (Fig. 2b).
One major advantage of our Clone-seq pipeline is that it allows us to carefully examine whether other unwanted

Author Summary
With rapid advances in sequencing technologies, tens of millions of DNA variants have now been discovered in the human population. However, there are currently no experimental methods available for examining the impact of DNA variants in a high-throughput fashion. As a result, we have no functional data on the vast majority of these variants, which is a major roadblock to generating novel biological insights and developing new disease prevention therapeutic strategies. To address this issue, we have successfully developed the first massively-parallel sitedirected mutagenesis approach, Clone-seq, to leverage the power of next-generation sequencing to generate a large number of mutant alleles in a fast and cost-effective manner. In conjunction with Clone-seq, we established a high-throughput comparative interactome-scanning pipeline to experimentally elucidate the effect of variants on protein stability and interactions. Additionally, Clone-seq can be used to generate clones for all DNA variants, including those in non-coding regions.
mutations have been inadvertently introduced during PCRmutagenesis in comparison with the corresponding wild-type alleles, since we obtain reads spanning the entire gene. We found that there are on average 4-5 unwanted mutations introduced in each pool of 39 colonies. This corresponds to a 0.013% PCR error rate (Materials and Methods), in agreement with previous studies [19]. The detection of unwanted mutations, especially those distant from the mutation of interest, is achieved in traditional site-directed mutagenesis pipelines by Sanger sequencing through the gene of interest. This is costly and labor-intensive, especially because multiple sequencing runs are needed for one long gene. However, since Clone-seq yields reads spanning the entire gene, we were able to determine which of the generated clones definitely do not have unwanted mutations in the full length of their sequences as illustrated in Fig. 2a (Materials and Methods), and we pick only these clones for subsequent assays.
To further test our Clone-seq pipeline, we applied it to generate clones for 113 SNPs on 66 genes from the recently published Exome Sequencing Project dataset [3]. Using the same approach as described above, we sequenced 4 colonies each for the 113 alleles of interest using one third of a 16100 bp MiSeq run. We obtained 4.7 million reads for these 113 alleles. With a threshold of S.0.8, we were able to determine that 370 out of the 452 colonies (82%) contain the desired mutation, in perfect agreement with the PCR-mutagenesis success rate obtained earlier. We were able to choose colonies that contain only the desired mutation for all 113 alleles. Because the whole MiSeq run produced 17.7 million reads and we only used 4.7 million for generating the 113 mutant clones, Figure 1. Schematic of our comparative interactome-scanning pipeline. Our pipeline begins with Clone-seq (a), a massively-parallel low-cost site-directed mutagenesis pipeline leveraging next-generation sequencing. This is followed by a high-throughput GFP assay (b) to determine protein stability, and a high-throughput Y2H assay (c), along with SILAC-based mass spectrometry (d) to determine the impact of DNA coding variants on protein interactions. doi:10.1371/journal.pgen.1004819.g001 Figure 2. Identifying usable clones from Clone-seq. (a) Schematic illustrating criteria used to determine which of the clones generated by our Clone-seq pipeline are usable for further assays -green ticks indicate usable clones, while red crosses indicate clones that cannot be used. (b) Variation of S across different mutagenesis attempts that either contain or do not contain the desired mutation as confirmed by Sanger sequencing. doi:10.1371/journal.pgen.1004819.g002  1G3N) depicting a V31L mutation (on Cdkn2c) away from the Cdkn2c-Cdk6 interaction interface. (c) GFP assays that determine the stability of wild-type Rrm2b and the R41P and L317V mutations on Rrm2b that are at an interface residue and away from the interface for the Rrm2b-Rrm2b interaction; GFP assays that determine the stability of wild-type Hprt1 and the C206Y mutation on Hprt1 that is away from the interaction interface of Hprt-Hprt1. Empty vector was used as a negative control. doi:10.1371/journal.pgen.1004819.g003 the capacity of our Clone-seq pipeline using one full lane of a 16100 bp HiSeq run is estimated to be .3,000, exactly the same as our previous assessment (Text S1).
Finally, we generated the remaining 165 disease mutations (of the 204) and 717 other coding variants from the Exome Sequencing Project and the Catalog of Somatic Mutations in Cancer [20] using a full 16100 bp HiSeq run, including 40 mutations on a single gene -MLH1. Using 111.2 million reads for these 882 alleles, we found that 2,958 of the 3,528 colonies (84%) contain the desired mutation, again in excellent agreement with our previously obtained PCR-mutagenesis success rate. There was at least one colony with only the desired mutation for all 882 alleles, including all 40 MLH1 mutations ( Table S1). Therefore, our Clone-seq pipeline can generate a large number of mutations (.40) even for a single gene. In fact, to generate even more mutations for one gene, we can implement a two-round barcoding approach: generate groups of 40 mutations and barcode them differently for one HiSeq run ( Figure S3). Ten such groups will enable us to generate ,400 mutations for a single gene (Text S1). Since the average coverage of these 882 alleles is .3006, the capacity of our Clone-seq pipeline using one full lane of a 16100 bp HiSeq run is estimated to be .3,000, again in agreement with our previous two estimates (Text S1).
Overall, our pipeline has been significantly optimized to make it very efficient. We established a web tool (http://www.yulab.org/ Supp/MutPrimer) to design mutagenesis primers both individually and in batch. MutPrimer can design ,1,000 primers for ,500 mutations in one batch in less than one second. All of the 2,068 primers for the 1,034 mutations in this study were generated by MutPrimer. All mutagenesis PCRs are performed in batch using automatic 96-well procedures. Since single colony picking after bacterial transformation of mutagenesis PCR product is a ratelimiting step, we rigorously optimized this step and found that adding 10 mL mutagenesis PCR products to 100 mL competent cells and plating 50 mL transformed cells give the best transformation yield and well-separated single colonies. Furthermore, rather than individually streaking transformed cells onto agar plates one sample at a time, we were able to significantly increase throughput by spreading colonies using glass beads onto four sector agar plates which are partitioned into four non-contacting quadrants (Materials and Methods). In this manner, a 96-well plate of transformed bacteria can be plated out onto 24 four-sector agar plates in ,15 minutes. Traditional site-directed mutagenesis pipelines require miniprepping each of the selected colonies and sequencing them separately by Sanger sequencing. To drastically improve the throughput of our Clone-seq pipeline, we pooled together the bacteria stock of a single colony for each mutagenesis attempt to perform one single maxiprep, which makes the library construction step much more efficient and amenable to highthroughput (Text S1). Furthermore, existing variant calling pipelines [21] cannot be applied to our Clone-seq results because the expected allelic ratios built into these pipelines are a function of the ploidy of the organism. However, in our Clone-seq pipeline there is no concept of ploidy. We pool together many mutations for one gene in the same pool (e.g., 40 mutations for MLH1) and different genes often have different numbers of mutations. Our S score calculation and unwanted mutation detection pipeline was designed according to our pooling strategy (Materials and Methods).
In total, we have used the novel Clone-seq pipeline successfully to generate 1,034 (39+113+882) mutant clones without any additional unwanted mutations, confirming the scalability, accuracy, and throughput of our Clone-seq pipeline.
A high-throughput GFP assay to determine the impact of mutations on protein stability For the 204 mutations on proteins with co-crystal structures, we first examined whether the mutant proteins can be stably expressed in human cells. To do this, we tagged every wild-type and mutant protein with GFP at the C-terminus using highthroughput Gateway cloning (Fig. 1b). The GFP constructs were transfected into HEK293T cells and fluorescence intensities were measured by a plate reader ( Fig. 3c; Materials and Methods). All fluorescence intensity readings were also confirmed manually under a microscope. Compared with the corresponding wild-type proteins, the expression levels of 3 of the 27 ''interface residue'' mutants, 8 Table). To validate these findings, we also performed Western blotting for 8 random mutants that are stably expressed and 8 random mutants with significantly diminished expression levels (Fig. 4a). Western blotting results confirm our GFP intensity readings.
A high-throughput Y2H assay to determine the impact of mutations on protein interactions Next, we investigated whether these mutations could affect protein-protein interactions using Y2H ( Fig. 1c; Materials and Methods). We found that 21 of the 27 (78%) ''interface residue'' mutations, 57 of the 100 (57%) ''interface domain'' mutations, and only 22 of the 77 (29%) ''away from the interface'' mutations disrupt the corresponding interactions, thereby demonstrating a clear difference ( Fig. 4b; P = 3610 26 between ''interface residue'' and ''interface domain'' and P = 8610 210 between ''interface domain'' and ''away from the interface'') in terms of ability to interfere with protein-protein interactions between mutations at different structural loci within the same protein. Furthermore, comparing with the GFP results, we found that all destabilizing mutations were shown to disrupt the corresponding interactions in our Y2H experiments. By considering only the mutations that do not affect protein expression based on the GFP experiments, we found the same difference: 13 out of 18 (72%) ''interface residue'' stable mutations, 42 out of 83 (51%) ''interface domain'' stable mutations, and only 9 out of 52 (17%) ''away from the interface'' stable mutations disrupt the corresponding interactions ( Fig. 4b; P = 2610 25 between ''interface residue'' and ''interface domain'' and P = 9610 213 between ''interface domain'' and ''away from the interface''; Table S2). Since these interfaces are obtained from actual co-crystal structures, our results suggest that accurate structural information can help determine the functional impact of mutations on protein-protein interactions. Wild-type proteins corresponding to 113 of the 153 stably expressed mutant proteins also interact with other proteins as determined by our Y2H experiments (114 interactions in total, termed ''other interactions''); however, for these interactions, there are currently no cocrystal structures available in the PDB. Using these other interactions, we calculated the likelihood of a given mutation disrupting a specific interaction without any structural information to be 32% (Fig. 4b).

Relationships between measured molecular phenotypes and corresponding disease phenotypes
We then analyzed whether the molecular phenotypes measured by our high-throughput GFP and Y2H assays are correlated with corresponding disease phenotypes. We first examined how mutation pairs on the same gene affect protein stability and its relationship to their corresponding diseases. We find that pairs of mutations that are either both stable or both unstable cause the same disease in 68% and 70% of cases, respectively. However, pairs comprising one stable and one unstable mutation cause the same disease in only 30% of cases (P = 6610 29 and 8610 210 , respectively, Fig. 5a). For example, we find that the mutations R727C and L844F on the spindle checkpoint kinase Bub1b both cause the protein to become unstable and lose all its interactors. These mutations are both associated with the same disease, mosaic variegated aneuploidy, an autosomal recessive disorder that causes predominantly trisomies and monosomies of different chromosomes [22,23]. Since our GFP assay shows that these two mutations cause loss of protein product, our results are consistent with Matusuura et al.'s finding that a more than 50% decrease in Bub1b activity leads to abnormal mitotic spindle checkpoint function and mosaic variegated aneuploidy [24].
We then examined whether mutation pairs on the same gene disrupt the same set or different sets of interactions (i.e., their interaction disruption profiles) and investigated whether their disruption profiles correlates with disease phenotypes. We found that mutation pairs with the exact same disruption profile are significantly more likely to cause the same disease than those with different profiles (70% and 61% respectively, P = 3610 25 , Fig. 5b). For example, we found that two mutations on Smad4, R361C and Y353S, disrupt its interactions with Smad3 and Smad9 while leaving the interactions with Lmo4 and Rassf5 unaltered (Fig. 5c). These two mutations both cause juvenile polyposis coli [25,26], a disease is known to be caused by disruption of the core Smad/Bmp signaling pathways [27]. Our Y2H results clearly demonstrate that the R361C and Y353S mutations disrupt the Smad4-Smad3 and Smad4-Smad9 interactions ( Fig. 5c) leading to disruption of core Smad signaling pathways. However, the mutation N13S on Smad4 does not disrupt any of these interactions (Fig. 5c) and is associated with a different disease, pulmonary arterial hypertension. Our results agree with Nasim et al.'s finding that the N13S mutation does not alter downstream Smad signaling [28]. Our findings provide support for the hypothesis that the N13S mutation either impacts pathways outside the core Smad signaling network or are pathogenic only when combined with other environmental and genetic factors [29].
Overall, these results show that mutation pairs with similar molecular phenotypes in terms of both protein stability and interactions are significantly more likely to cause the same disease than those with different molecular phenotypes. This confirms that the molecular phenotypes measured by our high-throughput GFP and Y2H assays are biologically relevant in vivo. Furthermore, by comparing the molecular phenotypes, in particular the protein interaction disruption profiles, of mutations/variants to those of known disease mutations, potential candidate mutations for a variety of diseases can be identified.
A high-throughput mass spectrometry assay to determine the impact of mutations on protein interactions While we use only those interactions that are supported by cocrystal structures to estimate the fraction of interactions that are disrupted by mutations at different structural loci, the described procedures can also be applied to interactions with predicted interfaces and structural models [30,31,32,33]. This is of particular importance because over 90% of known interactions do not currently have corresponding co-crystal structures [33,34]. For example, Mlh1 is known to interact with Pms2, both of which are well-studied DNA mismatch repair genes frequently mutated in hereditary nonpolyposis colorectal cancer [35]. Although the structural basis of the Mlh1-Pms2 interaction still remains unknown, both our previous 3D reconstruction of the human interactome network [7,32] and the newly-established Inter-actome3D [33] database suggest that the HATPase_c domain is part of the interface for Mlh1's interaction with Pms2. Previous work has shown that a point mutation (I107R) on the HATPase_c domain of Mlh1 is associated with colorectal cancer and disrupts the Mlh1-Pms2 interaction [7,35,36]. First, using Y2H, we were able to confirm the disruption ( Figure S4). Next, we developed a high-throughput-amenable mass spectrometry pipeline using Stable Isotope Labeling by Amino acids in Cell culture (SILAC) [37,38], which was designed to reveal both lost/ weakened and gained/enhanced interactions of the target proteins (Fig. 1d) [39]. We added an HA-tag to the N-terminus of both wild-type and mutant Mlh1, as well as to GFP as a control, and performed four SILAC experiments: wild-type Mlh1 (heavy) vs. GFP control (light), mutant Mlh1 (heavy) vs. GFP control (light), wild-type (heavy) vs. mutant (light) Mlh1, and mutant (heavy) vs. wild-type (light) Mlh1 ( Fig. 6a; Materials and Methods). Interactors of wild-type/mutant Mlh1 are defined as those that bind wild-type/mutant Mlh1 more than 26 stronger than GFP control (Materials and Methods). For a lost/weakened interaction, we required that the interaction be more than 26 stronger with wild-type Mlh1 than with mutant Mlh1 as confirmed both in wild-type (heavy) vs. mutant (light) and in mutant (heavy) vs. wild-type (light) experiments; we further required that the interaction be detected in the wild-type vs. control experiment (Fig. 6a; Materials and Methods). For a gained/enhanced interaction, we required that the interaction be more than 26 stronger with mutant Mlh1 than with wild-type Mlh1 as confirmed both in wild-type (heavy) vs. mutant (light) and in mutant (heavy) vs. wild-type (light) experiments; we further required that the interaction be detected in the mutant vs. control experiment ( Fig. 6a; Materials and Methods). We were able to detect Pms2 as the only specifically weakened interactor caused by the mutation (Figs. 6b,c; E = 21.77; P = 3610 24 ), in agreement with our Y2H results and previous studies [7,36]. Additionally, we were able to detect Hspa8 as the only specifically enhanced interactor of the mutant protein (Figs. 6b,c; E = 2.71; P = 7610 28 ). Two other known interactors of Mlh1, Pms1 (Figs. 6b,c; E = 20.32; P = 0.21) [40] and Brip1 (Fig. 6b,c; E = 0.18; P = 0.32) [41], were also detected, although their interactions with Mlh1 are not affected by this particular mutation (Materials and Methods).
Hspa8 was not previously known to interact with Mlh1 and the impact of the Mlh1 I107R mutation on its interactions with Pms1 and Brip1 has not been reported in the literature. To verify our SILAC results, we performed in vivo co-immunoprecipitation using HA-tagged wild-type and mutant Mlh1 and tagged Hspa8 and Brip1 with V5 (Materials and Methods). Our co-immunoprecipitation results confirm that Hspa8 only weakly interacts with wild-type Mlh1, but the interaction is dramatically enhanced by a single amino acid substitution (I107R) (Fig. 6d, lanes 3 and 4), whereas the interaction between Mlh1 and Brip1 is not affected by this mutation (Fig. 6d, lanes 6 and 7; Materials and Methods). Hspa8 is a constitutively expressed member of the heat shock protein 70 family [42]. It functions as a chaperone to facilitate protein folding [42] and also functions as an ATPase in the disassembly of clathrin-coated vesicles during membrane trafficking [43]. A recent study reported that Hspa8 is specifically recruited to Figure 5. Relationships between molecular phenotypes and disease phenotypes. (a) Fraction of mutation pairs on the same gene that cause the same disease: for the same and different effects on protein stability. (b) Fraction of mutation pairs on the same gene that cause the same disease: for the same and different interaction disruption profiles. Error bars indicate +SE. (c) Crystal structure (PDB id: 1U7F) depicting the Y353S and R361C mutations (on Smad4) at interface residues for the Smad4-Smad3 interaction. (d) Y2H analysis of the effects of Smad Y353S, R361, and N13S mutations on its interactions with Smad3, Lmo4, Rassf5, and Smad9. Western blotting with anti-GFP antibody confirming the protein expression levels of wild-type Smad4 and its 3 mutant alleles -Y353S, R361C and N13S. c-tubulin was used as a loading control. doi:10.1371/journal.pgen.1004819.g005 reovirus viral factories, independent of its chaperone function [44]. Our Western blotting results demonstrate that the expression level of Mlh1 is not affected by the I107R mutation ( Figure S5). Therefore, our SILAC results suggest that Hspa8 may play an important role in colorectal cancer and that its function could be independent of its role as a chaperone.

Discussion
We have successfully developed the first massively parallel sitedirected mutagenesis pipeline, Clone-seq, using next-generation sequencing. Our Clone-seq pipeline is entirely different from previously described random mutagenesis approaches [45,46,47,48]. Clone-seq is used to generate a large number of specific mutant clones with desired mutations; each individual mutant clone has a separate stock and different clones can therefore be used separately for completely different downstream assays. In random mutagenesis, a pool of sequences containing different mutations for one gene is generated using error-prone PCR or error-prone DNA synthesis. Therefore, it is not possible to separate one mutant sequence from another and the whole pool can only be used for the same assay(s) together. Furthermore, it is not possible to control which or how many mutations are generated on each DNA sequence. In fact, to improve coverage, most random mutagenesis pipelines generate on average two or more mutations on each DNA sequence [45], which makes it impossible to distinguish the functional impact of each individual mutation on the same sequence. Site-directed mutagenesis and random mutagenesis are designed for different goals: if one wants to generate all possible mutations for a certain protein without the need to separate different clones, it would be more favorable to use random mutagenesis; whereas if one needs to have separate clones for each mutation, site-directed mutagenesis is required. As a result, the two approaches are complementary and not comparable.
While there are highly efficient methods for random mutagenesis [45,46,47,48], current protocols for site-directed mutagenesis are low-throughput and become prohibitively expensive if a large number of clones needs to be generated. Clone-seq directly addresses the necessity for a high-throughput site-directed mutagenesis pipeline. It is a robust, cost-effective and efficient method that can be used to generate a total of ,3,000 distinct mutant clones in one full lane of a 16100 bp HiSeq run. Cloneseq is suitable both for generating mutations across many genes as well as a large number of mutations on a few genes. The former situation is applicable when one wants to generate many mutations/variants from large-scale studies (e.g., whole-genome or whole-exome sequencing) since they typically identify mutations/variants on a large number of genes [49,50]. The latter situation usually arises in a study focused on a single pathway with a few genes of interest (e.g., an alanine-scanning mutagenesis to determine functional sites on a gene of interest [51]).
Integrating with Clone-seq, we also established a comprehensive comparative interactome-scanning pipeline, including highthroughput GFP, Y2H, and mass spectrometry assays, to systematically evaluate the impact of human disease mutations on protein stability and interactions. We examine each mutation individually, rather than looking at their combinatorial effects because these inherited germline disease mutations are extremely rare. Therefore, the probability of having even two of these in the same individual becomes infinitesimally small. Our results reveal that the overall likelihood of a given disease mutation disrupting a specific interaction is 32%. Accurate structural information of these interactions obtained from co-crystal structures greatly improves our understanding of the impact of disease mutations: 13 out of 18 (72%) ''interface residue'' stable mutations, 42 out of 83 (51%) ''interface domain'' stable mutations, and only 9 out of 52 (17%) ''away from the interface'' stable mutations disrupt the corresponding interactions, unveiling a clear dependence of the molecular phenotypes of disease mutations on their structural loci. These estimates are not affected by the false negative rate of our Y2H assay as we only use those interactions for which we can detect the wild-type interaction with strong Y2H phenotypes. Thus, any observed disruption is due to the mutation of interest and not an assay false negative. Furthermore, our Y2H pipeline has been shown to be of high quality and has an experimentally measured false positive rate of ,5% or lower in different organisms [9,12,52,53]. In addition, the interactions used to understand the relationship between molecular phenotypes and structural loci of disease mutations are all supported by co-crystal structures, therefore these interactions are not assay false positives. We also find that the molecular phenotypes detected by our GFP and Y2H assays correlate with known disease phenotypes, confirming the in vivo biological significance of our measurements.
Moreover, as shown by the Mlh1 example (Fig. 6), our comparative interactome-scanning pipeline can also be used with predicted structural models [30,31,32,33]. The consequent experimental results will clearly be affected by the quality of these predictions, which is not part of our pipeline. In fact, our experimental interactome-scanning pipeline can be applied to evaluate or improve these predicted models by testing mutations at different loci of a protein of interest and examining how these mutations disrupt different interactions of this protein.
Our comparative interactome-scanning pipeline described and validated here can be applied to experimentally determine in a high-throughput fashion the impact on protein stability and protein-protein interactions for thousands of DNA coding variants and disease mutations, which can directly lead to hypotheses of concrete molecular mechanisms for follow-up studies. Furthermore, the elucidation of molecular phenotypes of disease mutations is also vital for selecting actionable drug targets and ultimately for making therapeutic decisions. Finally, the general scheme of our pipeline can be readily expanded to other interactome-mapping methods, particularly other protein-protein [10], protein-DNA [54,55], protein-RNA [56], and proteinmetabolite interaction assays [57], to comprehensively evaluate the functional relevance of all DNA variants, including those in non-coding regions. Figure 6. Identifying interactions of Mlh1 that are affected by the I107R mutation using SILAC-based mass spectrometry. (a) Schematic illustrating criteria used to identify interactions that are lost/weakened, unchanged, and gained/enhanced due to the I107R mutation on Mlh1. Blue denotes samples cultured in light media and black denotes samples cultured in heavy media. (b) Scatter plot illustrating fold change (FC; log scale) in the amount of protein pulled down by wild-type Mlh1 and mutant Mlh1 (I107R). Values are computed based on the wild-type (heavy) vs. mutant (light) (X-axis) and mutant (heavy) vs. wild-type (light) (Y-axis) experiments. Green denotes enhancement of interaction, red denotes weakening of interaction, and gold denotes no change. Mlh1 is shown in grey. (c) Fold changes and read counts (r) for interactors of Mlh1 that can be reliably identified as weakened, unchanged, and enhanced due to the I107R mutation. (d) Anti-HA immunoprecipitation followed by Western blotting with anti-V5 antibody confirming that the Mlh1-Brip1 interaction remains unchanged and that the Mlh1-Hspa8 interaction is dramatically enhanced due to the I107R mutation. doi:10.1371/journal.pgen.1004819.g006

Selecting interactions with mutations on and away from the interface
To calculate atomic-resolution interaction interfaces, we systematically examined a comprehensive list of 7,340 PDB cocrystal structures. To define the interface, we used a water molecule of diameter 1.4 Å as a probe and calculated the relative solvent accessible surface areas of the interacting pair as well as the individual proteins involved in the interaction. Residues whose relative accessibilities change by more than 1 Å 2 are considered as potential interface residues, because amino acids at the interface reside on the surfaces of the corresponding proteins, but will tend to become buried in the co-crystal structure as the two proteins bind to each other [58]. So, for these residues, there should be a significant decrease in accessible surface area when we compare the bound and unbound states of the protein chains.
To identify interface domains, we required at least one of the following criteria to hold: 1. 3did [59] or iPfam [60] have identified the domain pair as interacting and each of the interface domains contains at least one interface residue based on our calculations. 2. The domain pair contains 5 or more interface residues for each protein according to our calculations.
We then identified the subset of these interactions that contain at least one disease mutation and are amenable to our version of Y2H [11,12,13,14]. Subsequently, we performed a pairwise retest of all these interactions and selected the ones that yield strong Y2H phenotypes, because subsequent steps involve detecting a significant decrease in these phenotypes. For cases where no primer satisfies all three criteria simultaneously, we relaxed criterion 2 to GC content $30%.

Primer design for site-directed mutagenesis
We established a supplementary web tool (http://www.yulab. org/Supp/MutPrimer) to design mutagenesis primers individually or in bulk.

Construction of mutant alleles using high-throughput site-directed mutagenesis PCR
All wild-type clones were obtained from the human ORFeome v8.1 collection [61]. To generate mutant alleles, sequence-verified single-colony wild-type clones and their corresponding mutagenic primers were aliquoted into individual wells of 96-well PCR plates. Mutagenesis PCR was then performed as specified by the New England Biolabs (NEB) PCR protocol for Phusion polymerase (M0530L), noting that PCR was limited to 18 cycles. The samples were then digested by DpnI (NEB R0176L) according to the manufacturer's manual. After digestion, samples were transformed into competent E. coli. Since single colony picking after bacterial transformation of mutagenesis PCR product is a rate-limiting step, we rigorously optimized this step. First, we tried different volumes of competent cells for transformation and found that single colony yields peak when ,100 mL of competent cells are used. It is also necessary to use ,10 mL of mutagenesis PCR product: any lower volume of PCR product results in significantly reduced colony yields, while higher volumes of PCR product do not increase yield. Finally, colony picking was done using four-sector agar plates (VWR 25384-308) that are partitioned into four non-contacting quadrants with glass beads poured onto each plate quadrant. Each bead-filled quadrant was inoculated with ,50 mL of transformed bacteria. This was then spread by lightly shaking the four-sector agar plate. Our optimized transformation protocol results in a large number of well-separated single colonies that can be easily picked the next day. Upon recovery, single colonies from each quadrant were then picked and arrayed into 96-deepwell plates filled with 300 mL of antibiotic media. Four colonies per allele were picked for next-generation sequencing.
DNA library preparation for Illumina sequencing DNA library preparation was performed using NEBNext DNA Library Prep Master Mix Set for Illumina (NEB E6040S) according to the manufacturer's manual. Briefly, 5 mg of pooled plasmid DNA (,100 mL, all samples were normalized to the same concentration) was sonicated to ,200 bp fragments. The fragmented DNA was first mixed with NEBNext End Repair Enzyme for 30 mins at 20uC. Blunt-ended DNA was then incubated with Klenow Fragment for 30 mins at 37uC for dA-Tailing. Subsequently, NEBNext Adaptor was added to dA-Tailed DNA. Adaptor-ligated DNA (,300 bp) was size-selected on a 2% agarose gel. Size-selected DNA was then mixed with one of the NEBNext Multiplex Oligos (NEB E7335S) and Universal PCR primers for PCR enrichment. At each step, DNA was purified using a QIAquick PCR purification kit (Qiagen 28104). Multiplexed DNA samples were combined and analyzed in one lane of a 16100 bp run by Illumina HiSeq 2500.

Identifying successful instances of site-directed mutagenesis based on next-generation sequencing
The mutant colonies were barcoded and pooled as shown in Fig. 1a. The multiplexed colonies were then run on an Illumina sequencer (2 HiSeq runs and 1 MiSeq run) to give 16100 bp reads. These reads were then de-multiplexed and mapped to the genes of interest using the BWA ''aln'' algorithm [62]. For each allele, we identified all reads that mapped to the position of the mutation of interest (R all ) and those that actually contained the desired mutation (R mut ). We then calculated a normalized score (S) that quantifies the fraction of reads containing the desired mutation: where k is the number of different mutations for the same gene.
For 39 mutations, we Sanger sequenced two mutant colonies per mutagenesis attempt to quantify the correlation between S and observation of the desired mutation. We found that all clones with S.0.44 are confirmed to be correct via Sanger sequencing with a clear separation between those that are correct and those that are not (Fig. 2b). However, to further ensure that the clones we picked were correct, we require S.0.8 for a colony to be scored as containing the desired mutation.

Identifying unwanted mutations
One major advantage of our Clone-seq pipeline over traditional site-directed mutagenesis protocols using Sanger sequencing [15] is that we can now carefully examine whether there are other unwanted mutations inadvertently introduced during the PCR process, in comparison with the corresponding wild-type alleles. It is essential to use clones with no unwanted mutations for downstream experiments, as the presence of these will make it impossible to determine whether the observed disruption is due to the desired or other undesirable mutation(s).
We use samtools ''mpileup'' [63] to obtain read counts for different alleles at each nucleotide for all the clones. We calculate the background sequencing error rate by calculating the average fraction of non-reference alleles across all nucleotides where we did not attempt to introduce a mutation. Any site that has a significantly higher fraction of non-reference alleles (using a P value cutoff of 0.2 from a cumulative binomial test) is considered to have an unwanted mutation. A lenient P value cutoff (0.2 as opposed to the more traditionally used 0.05 or 0.01) implies more stringent filtering in this case because we want to eliminate type II errors i.e., we want to identify all unwanted mutations at the cost of discarding a few clones that actually do not have any unwanted mutations.
We identified an average of 4-5 unwanted point mutations per pool. The overall per-base point mutation rate of Phusion polymerase was calculated to be ,10 24 . NEB's advertised error rate for Phusion polymerase varies from 4.4-9.5610 27 per PCR cycle. Since we perform 18 PCR cycles, the expected overall error rate is ,10 25 . Our calculated mutation is within an order of magnitude of this advertised error rate. It is slightly higher than the advertised rate as we use stringent filtering criteria as described above.

GFP assay
All wild-type and mutant clones were moved into the pcDNA-DEST47 vector with a C-terminal GFP tag using automated Gateway LR reactions in a 96-well format. After bacterial transformation, minipreps were prepared on a Tecan Freedom Evo 200, and DNA concentrations were determined by OD 260/ 280 with a Tecan Infinite M1000 plate reader in 96-well format. A 100 ng aliquot of each expression clone plasmid was used for transfection into HEK293T cells in 96-well plates using Lipofectamine 2000 (Invitrogen 11668019) according to the manufacturer's instructions. At approximately 48 hrs post-transfection, cells were processed with Tecan M1000. Fluorescence intensities were measured at 395 nm for excitation and 507 nm for emission, according to Invitrogen's manual. As negative controls, the fluorescence intensities corresponding to cells transfected with the empty vector were measured. The normalized fluorescence intensity was calculated as: where I corresponds to the measured intensity and I background corresponds to the average intensity of the empty vector controls for each plate. All I norm values greater than K are considered to correspond to stable protein expression. K corresponds to the range (maximum -minimum) of background fluorescence intensities of the empty vector controls for each plate. For this study, all fluorescence intensity readings were also confirmed manually under a microscope. All transfection and GFP experiments were repeated 3 times.

Y2H assay
Y2H was performed as previously described [7]. All wild-type/ mutant clones were transferred by Gateway LR reactions into our Y2H pDEST-AD and pDEST-DB vectors. All DB-X and AD-Y plasmids were transformed individually into the Y2H strains MATa Y8930 and MATa Y8800, respectively. Each of the DB-X MATa transformants (wild-type and mutants) were then mated against corresponding AD-Y MATa transformants (wild-type and mutants) individually using automated 96-well procedures, including inoculation of AD-Y and DB-X yeast cultures, mating on YEPD media (incubated overnight at 30uC), and replica-plating onto selective Synthetic Complete media lacking leucine, tryptophan, and histidine, and supplemented with 1 mM of 3-amino-1,2,4-triazole (SC-Leu-Trp-His+3AT), SC-Leu-His+3AT plates containing 1 mg/l cycloheximide (SC-Leu-His+3AT+CHX), SC-Leu-Trp-Adenine (Ade) plates, and SC-Leu-Ade+CHX plates to test for CHX-sensitive expression of the LYS2::GAL1-HIS3 and GAL2-ADE2 reporter genes. The plates containing cycloheximide select for cells that do not have the AD plasmid due to plasmid shuffling. Growth on these control plates thus identifies spontaneous autoactivators [64]. The plates were incubated overnight at 30uC and ''replica-cleaned'' the following day. Plates were then incubated for another three days, after which positive colonies were scored as those that grow on SC-Leu-Trp-His+3AT and/or on SC-Leu-Trp-Ade, but not on SC-Leu-His+3AT+CHX or on SC-Leu-Ade+ CHX. Disruption of an interaction by a mutation was defined as at least 50% reduction of growth consistently across both reporter genes, when compared to Y2H phenotypes of the corresponding wild-type allele as benchmarked by 2-fold serial dilution experiments. All Y2H experiments were repeated 3 times.

Analysis of interacting proteins by SILAC and LC-MS/MS
HEK293T cells were grown in SILAC media comprising SILAC DMEM (Thermo Scientific) and 10% dialyzed FBS (JR Scientific) supplemented with either 0.1 mg/ml L-lysine and Larginine (light media) or 0.1 mg/ml L-lysine 13C6, 15N2 and Larginine 13C6, 15N4 (heavy media). Heavy-or light-media cultured HEK293T cells were transfected using Lipofectamine 2000 (Invitrogen) in three 10 cm plates. 48 hrs after transfection, cells were washed three times in cold PBS and then resuspended in 5 ml RIPA buffer [1% NP-40, 50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 16 EDTA-free Complete Protease Inhibitor tablet (Roche)]. Cells were lysed for 30 mins on ice before centrifuging at 13,000 rpm for 10 mins. Cell lysates were incubated with 60 mL EZview Red Anti-HA Affinity Gel (Sigma-Aldrich) for 3 hrs. After 3 washes with RIPA buffer, bound proteins were eluted with 3 resin volumes elution buffer (100 mM Tris-HCl pH 8.0, 1% SDS). Eluted proteins from light and heavy media were mixed together, reduced with 5 mM DTT, alkylated with 15 mM of iodoacetamide, and then precipitated with 3 volumes PPT solution (50% acetone, 49.9% ethanol, 0.1% acetic acid). Proteins from pull-down experiments were solubilized with 50 mL Urea/Tris solution (8 M Urea, 50 mM Tris-HCl pH 8.0) and 150 mL NaCl/Tris (50 mM Tris-HCl pH 8.0, 150 mM NaCl) followed by the addition of 1 mg Trypsin Gold (Promega). Protein digestion was performed overnight at 37uC after which trifluoroacetic acid and formic acid were added to a final concentration of 0.2%. Peptides were de-salted with Sep-Pak C18 columns (Waters Corporation), dried in a speed-vac, and reconstituted in 85 mL of a solution containing 80% acetonitrile and 1% formic acid. Samples were fractionated by Hydrophilic Interaction LIquid Chromatography (HILIC) using a TSK gel Amide-80 column (Tosoh Bioscience). HILIC fractions were dried in a speed-vac, reconstituted in 0.1% trifluoroacetic acid, and analyzed by LC-MS/MS using a 125 mM ID capillary column packed in-house with 3 mm C18 particles (Michrom Bioresources) and a Q-Exactive mass spectrometer (Thermo Fisher Scientific) coupled with a Nano LC-Ultra system (Eksigent). Xcalibur 2.2 software (Thermo Fischer Scientific) was used for the data acquisition and Q-Exactive was operated in the data-dependent mode. Survey scans were acquired in the Orbitrap mass analyzer over the range of 380 to 2000 m/z with a mass resolution of 70.000 (at m/z 200). Up to the top 10 most abundant ions with a charge state higher than 1 and less than 5 were selected within an isolation window of 2.0 m/z. Selected ions were fragmented by Higher-energy Collisional Dissociation (HCD) and the tandem mass spectra were acquired in the Orbitrap mass analyzer with a mass resolution of 17.500 (at m/z 200). The fragmentation spectra were searched by using the SEQUEST software on a SORCERER system (Sage-N Research) and a human database downloaded from the International Protein Index (version 3.80). In all database searches, trypsin was designated as the protease, allowing for one non-tryptic end and two missed-cleavages. The following parameters were used in the database search: a mass accuracy of 15 ppm for the precursor ions, differential modification of 8.0142 Daltons for lysine and 10.00827 Daltons for arginine. Results were filtered based on probability score to achieve a 1% false positive rate. The Xpress software, part of the Trans-Proteomic Pipeline (Seattle Proteome Center), was used to process the raw data and quantify the light/heavy peptide isotope ratios. Results were also manually inspected.

Identifying loss and gain of interactors for Mlh1
We performed four SILAC experiments using both wild-type and mutant Mlh1, as well as GFP as a control: wild-type (heavy) vs. control (light) [ We use the following variables and define four ratios for all subsequent calculations. In the WT_Control experiment, the relative abundance of protein p pulled down by wild-type Mlh1 to protein p pulled down by GFP (WT p ) is quantified by the inverse of the geometric mean of r wc reads with Xpress values X i . In the Mutant_Control experiment, the relative abundance of protein p pulled down by mutant Mlh1 (I107R) to protein p pulled down by GFP (Mut p ) is quantified by the inverse of the geometric mean of r mc reads with Xpress values Y i . In the WT_Mutant experiment, the relative abundance of protein p pulled down with mutant Mlh1 (I107R) to protein p pulled down by wild-type Mlh1 is quantified by the geometric mean of r wm reads with Xpress values P i . The amount of mutant Mlh1 (I107R) to wild-type Mlh1 is quantified by the geometric mean of t wm reads with Xpress values C j . In the Mutant_WT experiment, the relative abundance of protein p pulled down with mutant Mlh1 (I107R) to protein p pulled down by wild-type Mlh1 is quantified by the inverse of the geometric mean of r mw reads with Xpress values Q j . The amount of mutant Mlh1 (I107R) to wild-type Mlh1 is quantified by the inverse of the geometric mean of t mw reads with Xpress values D i . where both FC wm and FC mw denote the fold change in protein abundance as the normalized ratio of the amount of protein pulled down with mutant Mlh1 to that with wild-type Mlh1.
To identify interactors that are lost/weakened due to the I107R mutation, we required the following criteria to hold simultaneously: 1. The protein has to be identified as an interactor of wild-type Mlh1: WT p .2, r wc $5. 2. The protein has to be identified as a lost interactor based on both Mutant_WT: FC mw ,0.5, r mw $5, and WT_Mutant: FC wm ,0.5, r wm $5.
The first criterion ensures that the protein identified is a true interactor of wild-type Mlh1. The second criterion ensures that the loss of interaction is significant and reliably observed across both WT_Mutant and Mutant_WT experiments.
Similarly, to identify interactors that are gained/enhanced due to the I107R mutation, we required the following criteria to hold simultaneously: 1. The protein has to be identified as an interactor of mutant Mlh1 (I107R): Mut p .2, r mc $5. 2. The protein has to be identified as a gained interactor based on both Mutant_WT: FC mw .2, r mw $5, and WT_Mutant: FC wm .2, r wm $5.
The first criterion ensures that the protein identified is a true interactor of the I107R mutant of Mlh1. The second criterion ensures that the gain of interaction is significant and reliably observed across both WT_Mutant and Mutant_WT experiments.
We also identify interactors of Mlh1 that are unaffected by the I107R mutation using the following criteria: 1. The protein has to be identified as an interactor of both wildtype Mlh1: WT p .2, r wc $5, and mutant Mlh1 (I107R): Mut p . 2, r mc $5.
Integrating both WT_Mutant and Mutant_WT experiments, we calculated a weighted average of the individual fold changes: E~r mw |log 2 (FC mw )zr wm |log 2 (FC wm ) r mw zr wm P values are calculated using a two-sided Kolmogorov-Smirnov test (with bootstrapping).
Cell culture, co-immunoprecipitation, and Western blotting HEK293T cells were maintained in complete DMEM medium supplemented with 10% FBS. Cells were transfected with Lipofectamine 2000 (Invitrogen) at a 6:1 (mL/mg) ratio with DNA in 6-well plates and were harvested 24 hrs after transfection. Cells were gently washed three times in PBS and then resuspended using 200 mL 1% NP-40 lysis buffer [1% Nonidet P-40, 50 mM Tris-HCl pH 7.5, 150 mM NaCl, 16 EDTA-free Complete Protease Inhibitor tablet (Roche)] and kept on ice for 20 mins. Extracts were cleared by centrifugation for 10 mins at 13,000 rpm at 4uC. 15 mL EZview Red Anti-HA Affinity Gel (Sigma-Aldrich) and 100 mL protein lysate were used for each co-immunoprecipitation reaction. The samples were rotated gently at 4uC for 2 hrs. HA beads were then washed three times with protein lysis buffer, treated with 66protein sample buffer, and subjected to SDS-PAGE. Proteins were then transferred from the gel onto PVDF (Amersham) membranes. Anti-HA (Sigma H9658), anti-V5 (Invitrogen 46-0705), anti-b-tubulin (Promega G7121), and anti-GFP (Santa Cruz sc-9996) antibodies were used at 1:3,000 dilutions for immunoblotting analysis. Table S1 Summary of the 4 colonies for each of the 40 mutations generated on MLH1. ''1'' indicates that for the mutation of interest, mutagenesis PCR succeeded for that colony and ''0'' indicates that for the mutation of interest, the mutagenesis PCR failed for that colony. None of the colonies contained any unwanted mutations on MLH1.