Assessment of antibody library diversity through next generation sequencing and technical error compensation

Antibody libraries are important resources to derive antibodies to be used for a wide range of applications, from structural and functional studies to intracellular protein interference studies to developing new diagnostics and therapeutics. Whatever the goal, the key parameter for an antibody library is its complexity (also known as diversity), i.e. the number of distinct elements in the collection, which directly reflects the probability of finding in the library an antibody against a given antigen, of sufficiently high affinity. Quantitative evaluation of antibody library complexity and quality has been for a long time inadequately addressed, due to the high similarity and length of the sequences of the library. Complexity was usually inferred by the transformation efficiency and tested either by fingerprinting and/or sequencing of a few hundred random library elements. Inferring complexity from such a small sampling is, however, very rudimental and gives limited information about the real diversity, because complexity does not scale linearly with sample size. Next-generation sequencing (NGS) has opened new ways to tackle the antibody library complexity quality assessment. However, much remains to be done to fully exploit the potential of NGS for the quantitative analysis of antibody repertoires and to overcome current limitations. To obtain a more reliable antibody library complexity estimate here we show a new, PCR-free, NGS approach to sequence antibody libraries on Illumina platform, coupled to a new bioinformatic analysis and software (Diversity Estimator of Antibody Library, DEAL) that allows to reliably estimate the complexity, taking in consideration the sequencing error.


Introduction
Antibody repertoires have been used in conjunction with display or selection technologies [1][2][3][4][5][6][7] and many libraries and antibody formats were created to satisfy the high demand for the different applications of recombinant antibodies [6,[8][9][10][11][12][13][14][15][16][17]. PLOS  allows to get more accurate complexity boundaries increasing the sequence pool for the inferential estimate of the library complexity. We believe that the PCR-free sequencing and DEAL analysis on the whole antibody coding sequence could establish new standards for a more reliable and quantitative estimate of antibody library complexity and define quality criteria for newly created antibody libraries.

Construction of human SPLINT libraries
We constructed two SPLINT (Single Pot Library of Intracellular Antibodies) scFv antibody libraries from human RNA isolated from peripheral blood lymphocytes extracted from four anonymous voluntary donors which signed written informed consent (see Ethic statement below), following a protocol modified from Marks and Bradbury [16] (see Supporting Information, S1 Fig). IgM cDNA (μ heavy and κ and λ light chains) was used as a template to amplify VH and VL regions. Primers are designed to anneal to the external framework regions of the V genes. In the first library (hscFv1), each VH and VL subclass was first individually amplified, using every possible combination of the 5' and 3' primers available for VH and VL chains. The amplified products were then combined at equimolar ratios, so that each VH and VL subclass was equally represented in the library. In the hscFv2 library, instead, all VH and VL subclasses were amplified together in a single reaction (one for the heavy and one for the light V region) using a mix of the 5' and 3' primers available. A third single domain "nanobody" library (hVH) was created using only the VH domains, amplified in a single reaction. At the end of each library construction, the assembled VH-VL DNA products for hscFv1 and hscFv2, or the VH products for hVH, were ligated in the pLinker220 vector [36] for yeast expression in the SPLINT format, using restriction sites BssHII/NheI (NEB). Ligation of each library (~1μg) was transformed by electroporation into Max Efficiency E.coli DH5α cells (Invitrogen). Transformation efficiency for each library was assessed by colony count. See Supporting Information for details. Ethic statement. The blood donors provided their written informed consent, and the samples were obtained from the Division of Transfusion Medicine and Transplant Biology, Pisana University Hospital, Pisa, Italy. The samples were received in an anonymous and completely de-identified format, and were not collected by ourselves. No request for approval of this study was presented to our Institutional Ethics Review Board.
Library DNA fingerprinting 100 individual bacterial colonies for each library transformation were picked and plasmid DNA extracted. Each scFv or VH present in the plasmid DNA was amplified by PCR and then digested with BstNI (NEB) at 37˚C for two hours. DNA fragments were resolved on a 8% acrylamide gel and stained with ethidium bromide. Pattern analysis was performed using Gel Quest software.

Real time PCR
Real time PCR was performed on the VH domain of the cDNA of each RNA sample used for library construction. All forward primers classes (BssHII-HuVHXaBACK, see Supporting Information, primers for VH) were tested against the most and the least abundant reverse primer class (HuJH4-5FOR and HuJH1-2FOR respectively). For each reaction, 4ng of cDNA sample were amplified following iTaq Universal SYBR Green Supermix (Bio-Rad) protocol.
Real time PCR was performed on a Rotor-Gene Q Platform and the result analyzed following Nordgård et al. 2006 [37].

Sequencing sample preparation
To attach sequencing adapters to the scFv sequences, a ligation-based approach was designed. DNA adapters were synthesized harbouring overhangs complementary to the cleavage product of the restriction enzymes used for excising the scFv fragment from the plasmid, namely BssHII and NheI.
The scFvs were excised from the library plasmid (~8μg of the library were digested 3h 37˚C with 4U of NheI (NEB), 3h 50˚C with 4U of BssHII (NEB)) and ligated to the adapters (forward adapter: scFv: reverse adapter in 10:1:10 ratio,~200-250 ng library 400U T4 ligase (NEB), O/N 16˚C). The ligation was run on agarose gel and the band corresponding to the single insert with 5' and 3' adapters was resolved and purified with MinElute Gel Extraction Kit (Qiagen). An example of ligation efficiency is shown in Supporting Information, S2 Fig).

NGS library processing
Libraries were quantified by Qubit dsDNA HS Assay Kit (ThermoFisher Scientific), diluted to 4nM, denatured with 0.1N NaOH (5' at RT), neutralized and diluted again in buffer HT-1 (Illumina) to a final concentration of 12.5 pM. Equimolar denatured Phi-X Control V3 DNA (Illumina) was spiked-in (20% of the volume) as an internal quality control and to increase the sample diversity according to Illumina guidelines. Sequencing was performed on MiSeq system with Reagent Kit v3-600 cycles (Illumina), with a cycle number of 350 and 250 for forward and reverse reads respectively.
Forward and reverse-complemented reverse reads were then combined into 540bp-long pseudo-reads with a custom python script, and pseudo-reads from the 4 different indexes were pooled together. The hVH nanobody llibrary reads were merged using PEAR [39], a pair-end read merger available at http://sco.h-its.org/exelixis/web/software/pear/.

Software development
DEAL was developed as a standalone C++ program, without the need of external libraries. To manage the amount of data, a x64 compiler must be used during compilation. Speed optimization (MSVC: /O2 or /Ox, GCC: -O2 or -O3) should also be applied during compilation, to overcome tail-end recursion and to avoid long computing times. The program is tested to compile under Microsoft Visual Studio environment and work in Microsoft Windows 64-bit operating systems. DEAL was written in a modular structure to allow an easy future parallelization, even if not yet implemented. DEAL is available at http://laboratoriobiologia.sns.it/ DEAL.
Small complementary python scripts were created for format conversion, reads joining and primer recognitions.
R scripts were used for graphs generation and statistical analysis. Pattern recognition for fingerprint analysis is implemented as a small Excel VBA script.

Statistical analysis
The theoretical complexity of the hscFv1, hscFv2 and hVH nanobody libraries were estimated by the truncated Negative Binomial distribution (NBp,s, where p and s are the probability and size parameters) to fit number of sequences Nseq as a function of cluster cardinality x (see S3 Fig). Assuming Nseq(x)~C Ã NBp,s(x), it is possible to calculate the coefficient C, that represent the sought complexity to be estimated from the data (see details in Supporting Information).

Library sequencing
Three sample libraries, 2 scFv libraries (hscFv1 and hscFv2) and a single domain library (hVH), were created from cDNA derived from human lymphocytes RNAs and amplified in bacteria (see Materials and Methods). The first two libraries reflected two different methods used to amplify V regions for scFv libraries construction, and were sequenced to find, if present, the advantage of one method over the other. The hVH instead was sequenced to calculate the complexity of a single domain library and to demonstrate the advantage of single domain sequencing (where the reads overlap). Assuming that each transformed bacterium takes one copy of plasmid DNA, we can define the first hard cap of the library complexity as the number of total transformants obtained determined through CFU count. For the three libraries, hscFv1, hscFv2 and for the hVH nanobody library, this complexity upper bound is 15.8, 14.0 and 6.0 million elements respectively (Table 1). Standard fingerprint analysis on 100 random library clones for each library was performed and no duplicate was found (data not shown).

PCR-free sample preparation
In order to avoid any PCR-amplification step to attach sequencing adapters to the library to be sequenced which could insert mutations in the sequences, a ligation-based approach was  Since sequences outside the Complementary Determining Regions (CDR), including the initial and distal framework regions, are rather constant, this poses serious technical limitations due to the sequencing by synthesis technology (SBS), which results in bad cluster recognition and low cluster count. To circumvent the problem, 4 barcoding index sequences (iS1 and iS2) were inserted just after each Illumina sequencing primer site. Using these barcodes allows to balance the read base content of the first 6 sequencing cycles. Moreover, inserting a different number of shifter bases before scFv fragment allowed to spread the eventual error in different positions of a sequence in the case of a error prone cycle, thus enabling correction during analysis.

Sequencing
Libraries underwent SBS sequencing on Illumina MiSeq, executing 350bp forward (R1) and 250bp reverse (R2) reads after ligation of the adapters to the scFv fragments, excised by restriction enzyme digestion from the library vector (Fig 1; see Materials and Methods). 10-15 million of raw reads were produced for each experiment. The presence of the index/shifter sequence (iS) in the adapters successfully balanced the base composition of the first sequencing cycles, which are critical for cluster detection and run parameter estimation (S4A- S4C Fig). Median Phred score (Q-score) is an empirical measure of the confidence of base identification and remained greater than or equal to 30 (99.9% base accuracy) beyond the 300 th and 200 th cycle for R1 and R2 respectively (S4A and S4B Fig). Phred score is an estimate of the probability of error for any given base. The error rate distribution for hscFv1 library is shown in Fig 2A. On the other hand, the presence of a Phi-X phage DNA spike-in, as an internal control, allowed to assess the impact of sequencing errors on the library sequence analysis, by comparing Phi-X reads to the reference sequence, getting a real measure of per tile and per cycle error rates ( Fig  2B and S4D Fig). Notably both the scale and the shape of these two distributions differ, leading  Phi-X derived and Phred score derived error rate distribution. A) Phred score error rate distribution for the hscFv1 library of the merged reads. Error rate increases with sequencing cycles. B) Control Phi-X derived error rate distribution for the hscFv1 library of the merged reads. Error rate is more prominent in the early sequencing cycles (spikes), with a small increase at the end of each read. The error distribution does not match the Phred score distribution and the shape differs as well. C) Scatter plot of the correlation of Q-score and log 2 (% Mismatches) in Phi-x control spike-in library. Each point represents the mean value from a single flow cell tile at a given sequencing read number, encoded by to the conclusion that neither can be used alone for error rate estimation. While the median error rate was generally low (0.34±0.12%), both the presence of error rate "spikes" (Fig 2B and  S4D Fig) and the lack of homogeneous correlation between Phred score and the measured error rate (Fig 2C) revealed that particular attention must be spent, to distinguish the real base variants from background technical noise. The Phred-score-derived error rate and the Phi-X control error rate were similar in all the three libraries analyzed (data not shown). This data analysis revealed that the sequencing run was successfully performed. However, intrinsic technical limitations of the NGS, namely the presence of two unrelated error estimates requires an additional analysis. To this purpose a new software (DEAL) was developed.
The Diversity Estimator of Antibody Library (DEAL) program DEAL is a software tool to minimize the possible confusion between the real base sequence variants (biological diversity) from the background technical noise (technical misreading), taking into account both Phred derived error rate and Phi-X derived error rate.
DEAL is based on sequences identity collapse designed to ignore the error prone bases. The number of the collapsed sequences is an estimation of the complexity of the analyzed library.
DEAL is divided in two main steps: in the first step, the sequences are clustered by identity, using a "seed" of a 10-20 bp stretch in the CDR3s; in the second step, each element of the cluster is analyzed by binary comparison (Fig 3).
colour (red to blue: R1 cycle 1 to 350; R2 cycle 1 to 250; colour flex point is set at cycle 38). The Q score in the first 40 reads fails to be predictive of mismatch rate. Similar results were obtained for hscFv2 and hVH libraries. https://doi.org/10.1371/journal.pone.0177574.g002

Fig 3. Diagram of DEAL workflow.
A) Diagram of the seed creation process. In the figure, the black arrows represent the combined reads of the scFv library after the trimming. The seed is created combining the two seeding regions. The seeding regions are placed in the CDR3s to maximize the number of different seeds: the higher the number, the faster the program will run. B) Binary tree of the seeds. The program uses a binary tree approach to group identical seeds. During the comparison, if one sequence does not match any other sequences seen so far, a new branch of the tree is created in the mismatching position. C) The input of the binary comparison step. While the seeding step takes only into account the diversity of the seeding regions, the binary comparison analyzes the whole length of the combined reads. D) Flagging process. If some positions of the sequence are unreliable due to being associated to a low Phred quality score (as shown in the figure) or to a poor quality cycle (from Phi-X errors, not shown in the figure), the program flag them for correction. E) The three different scenarios that can occur during binary comparison among the sequences in the same seeding group. Mismatching The first "seeding" step is necessary to reduce the calculation to local independent subproblems that can be solved by binary comparison. The seeding regions could be placed in any part of the sequence but since the CDR3s are the most variable regions, seeding in these regions would make the program run faster. The more the seed is variable, the bigger the number of seed groups become, lowering the mean number of sequences per group and thus further reducing the computational time required for the next step.
Moreover, this analysis is based on a binary tree approach which has the great advantage of being time and memory saving. The program starts by creating a small sequence stretch (seed) from the seeding region provided (Fig 3A) building a tree of identical sequences. While comparing the seeds, in the presence of a mismatch, a new branch is created (Fig 3B). The first step ends when all sequences are analyzed and all the identical seeds form individual clusters. The number of these clusters is the seed complexity.
In the second step, the sequences within each seed cluster are compared along the whole sequence length (Fig 3C). The process is a binary comparison, so each sequence is compared to every previous analyzed sequence in its seed group. In this step, the possibility of sequencing errors is taken into account. The program flags uncertain base read positions as unreliable, by checking both for a low Phred quality score associated to the base considered ( Fig 3D) and for a high error rate in the sequencing cycles, that is retrieved from the error rate of control phage DNA (Phi-X). The two flagging descriptors (Phred quality and cycle quality) relate to two different checkpoints. If one or both the quality checks are not passed, the base is considered unreliable. Fine tuning can be done on the thresholds of these checks which can be defined as command line arguments. At this point, three scenarios ( Fig 3E) can occur: i) if the sequences do not match, two subgroups are created; ii) if two sequences match and only in one an unreliable-flagged base is present, the base in that position will be assigned as the other sequence's reliable base, and the two sequences are then merged and resolved as the same one; iii) if in two matched sequences at a specific position both bases are unreliable-flagged the resulting base in the merged sequence will be an unreliable-flagged base. In this case it is possible that two different bases have to be merged. To keep track of all the possibilities in unreliable positions of merged reads we applied the IUPAC ambiguity code for nucleobases to the stored sequences.
The result of the computation is the creation of many small groups of matching sequences, whose number represents the sample complexity.
Information about the grouping steps of an analysis, as well as the final group distribution and all the sequences associated to the groups is also available using DEAL. It is possible to customize some parameters such as the seed positions and the optional computation of complexity of the deduced aminoacid sequence. In conclusion, DEAL software is able to ignore error prone bases during sequence identity collapse, leading to a reliable estimate of the library complexity.

Upper and lower limit estimate of library complexity and outlier determination
The complexity of the three libraries was analyzed by NGS followed by DEAL. Data are summarized in Table 1. The sequencing cluster count (12.27, 10.98 and 15.52 million for human scFv libraries 1 and 2 and VH nanobody library respectively) was of the same order of magnitude as the upper cap complexity of the libraries, determined by transformants count (15.8, 14.0 and 6.0 million respectively). To be noticed, the VH nanobody library sequencing cluster count exceeded almost 3 times the upper cap limit, indicating an almost complete coverage of the library. Since it is crucial to have good quality sequencing data, the reads underwent a very strict quality trimming process. Only the reads which had a median Phred score of at least 32 (base call accuracy > 99.937%) survived the filter. The sequence count after trimming was 6.02 and 4.09 million for hscFv1 and hscFv2 respectively. A higher trimming survival count (14.9 million) was obtained for the hVH nanobody library, due to both the shorter length and the overlap of the two reads, which improves the median quality of the sequences.
DEAL was then applied to the quality trimmed data, using for the 2 human scFv libraries the default parameters: i) unreliable flag set when either error threshold for the Phi-X in the position was over 1% or quality in position is less than Phred 32; ii) seed position was located in the CDR3 of both VH and VL fragment (position 280-300 for read 1 and 470-490 for read 2). For the VH nanobody library, DEAL was set to allow a variable length in the input, the seed was placed in the 280-300 region only. Moreover the VH nanobody library protein complexity was also calculated.
The resulting distribution of clusters by cardinality is shown in Fig 4. A crowding in the first dozen groups can be seen, meaning that clusters with few elements are a clear majority. At the other end of the cardinality plot a clear single high cardinality element can be observed. This outlier element was present in all the analyzed libraries and was shown to originate from the backbone of the library plasmid vector, due to an incomplete digestion of the parental vector in the library construction process. In general the number of outliers is indicative of problems occurred during library construction, due for example to an overrepresentation of few specific clones. Indeed a high number of outliers would be indicative of an unbalance in the library lowering the chances of successful selections. Quantitative estimate of antibody library diversity by PCR-free NGS Another kind of outlier is represented by the single element cluster (group with cardinality 1). This group contains, in addition to the real biological singletons (sequences with no other sequence in the same cluster), all the sequences where errors were not associated with a quality drop (and thus unable to be corrected by DEAL), that are unlikely to cluster with any other sequence in the sequencing run. For these reasons, the single element group is not reliable; therefore it should not be considered in complexity modeling to avoid an overestimation of the complexity. The number of DEAL clusters with cardinality two or greater represent the insurmountable lower limit of the library complexity ( Table 1). The three libraries show respectively 0.6, 0.2 and 1.4 million unique clusters that satisfy the requisite, defining their lower limit complexity. However, if the coverage is not sufficient (i.e. for hscFv1 and hscFv2), raw data can not directly provide a measure of the total complexity. Thus a theoretical complexity calculation is also needed. To this purpose, an estimate of the theoretical complexity for each library was obtained using the truncated Negative Binomial distribution (Table 1

Library analysis: Primer class distribution and VH-VL independent assortment validation
From the sequencing data, the chain independent assortment is another useful parameter to measure the skewness of a library. The NGS sequences were parsed with a custom python script for the primers used in the construction of the libraries. The distribution of the libraries by primer class is shown in Fig 5. A large portion of these are unclassifiable, probably due to the error spikes present in the primer regions, that impair the primer recognition, as discussed above (Fig 2B). The independent assortment of the VH-VL chains of the scFv libraries is shown in Fig 5A and 5B. No difference is observed between the frequency of forward and reverse primer pairs, compared to the theoretical combinatorial model, showing a good combinatorial assortment. Moreover, the frequencies of primers pairs assortment of the VH nanobody library matches the theoretical VDJ combinatorial model showing that very little amplification bias occurred during library construction (Fig 5C).
The primers distribution, for all the three libraries, showed a very low percentage of reads for the class corresponding to the forward primer VH5 (HuVH5aBACK). To verify whether this reflected a problem related to NGS or to the real biological distribution, Real-Time PCR on the libraries and their corresponding cDNA was performed. Results showed that the same class distribution and abundance was present in the original cDNA, showing that the distribution bias of the VH5 primer class was not due to a sequencing issue (S5 Fig). Usually, two different methods are used to amplify V regions for scFv libraries construction. In one, each VH and VL subclass is first amplified individually and then the products are combined at equimolar ratios [40,41]. Alternatively, all the VH and VL subclasses are amplified together in a single reaction (one for each type of V region) [16].
Therefore we compared the deduced complexity of two human scFv libraries, constructed by the two methods. To verify this issue, we constructed two human scFv libraries, using both protocols. The primer distribution obtained from NGS clearly reflects the two different approaches in library construction. hscFv1 shows a more poised distribution of all the V subclasses compared to hscFv2, in which, instead, there is a clear imbalance in classes representation (Fig 5). However, the Real Time data indicate that the protocol with the single common amplification reaction generates a library that faithfully reproduces the real distribution present in the natural repertoire. Thus, the first method guarantees an equal representation of all the different V subclasses, although it might not reflect the natural distribution. Observed distribution is the primer pair proportion found after sequencing. Expected distribution is the multiplication of the two primers proportion (expected distribution given the independence between chains for the scFv libraries or given a balanced VDJ recombination for hVH). UC = unclassified. This category includes Determination of the protein diversity of a VH library through in silico translation of NGS reads The quality of an antibody domain library is ultimately determined not only by its nucleic acid sequence complexity but also by the proportion of the sequences in the library that code for full length antibody domain protein. As the quality of a library resides in the protein conformational diversity, its most relevant estimator is protein complexity, which represents the true and relevant complexity of a library. While for scFv libraries the NGS reads do not yet allow to unambiguously deduce the full protein coding sequence by in silico translation, due to sequence length limitations, the shorter VH length allows the full coverage of the entire VH sequences, since R1 and R2 reads overlap in the central DNA stretch. Considering that the central region is read twice, if a sequencing error occurs at a given position of the forward read, it can be corrected comparing the base at the same position of the reverse read and vice versa. Thus, assigning that position to the base with the best quality between the discording pair, this correction lowers the global error rate and guarantees the best quality read in the critical CDRs regions and allows the correct frame to be determined, for amino acid sequence deduction. This is still not possible for scFv libraries, due to the undefined number of nucleotides present in the unsequenced gap between the VH and VL.
The protein diversity of the VH nanobody library was determined by in silico translation of DEAL clusters, where all synonym sequences were aggregated and both the frameshift and nonsense mutation were filtered out. The length distribution of the sequenced VH nanobodies shows that the correct frame is maintained in 86.57% of the analyzed sequences (Table 2, Fig 6). Since, in Illumina sequencing, the proportion of indel mutations is negligible, the frameshifts are mostly due to non-functional VDJ recombination. The majority of the sequences are around 370 nucleotides in length and the in frame sequences are the most abundant. The minority of out of frame (+1 or +2 base pair) sequences shows an identical bell shaped distribution. Sequences with stop codons are also very rare: 10.3% considering the univocal stop codons in error-free sequences, 16% considering possible stop codons in error-flagged sequences. Indeed, filtered data shows that 80% of the hVH library sequences code for functional nanobodies ( Table 2). The total protein diversity derived by DEAL clusters count is 4.73 million, which is reduced to 3.2 million by filtering the nonsense and frameshift incomplete peptides. Therefore, ignoring the protein group of cardinality 1 (due to presence of possible technical errors), the lower bound of the library complexity is 1.43 million proteins, of which 1.2 million are functional.

Discussion
A quantitative evaluation of the sequence complexity of antibody libraries is critical to verify their quality and eligibility for screening purposes. Next generation sequencing is dramatically changing how antibody quality and complexity are estimated. We present here a PCR-free NGS sequencing strategy associated with the DEAL software, to compensate for technical errors, which allows to determine the minimal functional complexity of an antibody library. The PCR-free library preparation avoids PCR-induced errors, sacrificing the convenience to selectively amplify the regions of interest in such manner as to minimize sequencing error on those key regions albeit losing all the information on the other areas of the antibody. The main advantage of the proposed method over current approaches is the definition of a "unique" biologically relevant sequence, filtering out as best as possible technical errors from all the sequences that do not match any primer. The name of the primers is a shorter version of the original name listed in Supporting Information (Primer used for library construction). https://doi.org/10.1371/journal.pone.0177574.g005 Quantitative estimate of antibody library diversity by PCR-free NGS the sequencing data. In fact, we found that Phred quality score generated by the sequencer alone is not sufficient to identify an error. Indeed, when comparing Phred quality to the error rate data extracted from the control phage DNA, it is clear that both parameters have to be taken into consideration.
In fact, Phred score is a pinpoint (both single sequence and single base) quality assessment, while the control phage DNA error rate can only be a per tile sequence quality assessment. Thus, it is not possible to substitute the Phred score with the error rate and both are required for the analysis. In addition, to define the origin of an error we need to discriminate between pre-sequencing and sequencing errors. While sequencing errors need to be identified and removed (or limited), dealing with pre-sequencing errors is more complicated. In particular, some presequencing errors are introduced by the nucleic acid handling step during library construction (cDNA, V regions amplification). These errors are acceptable, sometimes deliberate [42], since they improve the diversity of the library. Instead, mutations introduced during the sequencing sample preparation (mainly PCR-derived) result in a misinterpretation of the sequences present in the library and thus have to be avoided. Therefore a PCR-free approach in the sequencing sample preparation is certainly to be preferred, since it reduces the number of errors (not associated to a quality drop) that DEAL cannot resolve. It should be noted that this method could also be used to calculate the diversity of the natural repertoire from biological samples, such as immune or cancer cells, however libraries from such samples have to be amplified by PCR in order to obtain the quantity needed for the ligation.
The deduced amino acid sequence of the VH single domain library showed an unexpected length distribution, with more than 10.3% of non-functional out of frame sequences. Because in MiSeq indels are very rare, the length is most likely a natural feature rather than a sequencing artifact. As of today, the molecular mechanism whereby lymphocytes discard nonsense or out of frame variable domains and assemble only functional chains remains somewhat obscure. Interestingly, we observed in the hVH library a 7:1 ratio between functional and non-functional chains, which reflect the IgM RNA composition of PBLs. Indeed, this observation is in accordance with previous reports in mouse models [43,44]. In addition, the observed ratio appears to be remarkably high, even hypothesizing DNA silencing mechanism or RNA nonsense mediated decay taking place after transcription. In fact, if all the non-functional sequences were translated, there would be a considerable amount of useless transcripts and protein products. Moreover since this ratio refers only to a single chain, and an assembled antibody must have four correct chains to be functional, the ratio of functional assembled antibody would be even lower. Thus, it is clear that some kind of mechanism prevents the translation of these non-functional RNAs in vivo. Regarding the implications for antibody library screening, assuming the worstcase scenario where the light chains have the same functional proportion of the heavy chains, we could tentatively conclude that more than 80% of single chain library elements and more than 64% of scFv library elements encode a correct protein product.
Concluding, besides deep sequencing, the only other available method to determine the quality of an antibody library is a functional assay: the actual selection against an antigen. This is a valid practical strategy, although it is time consuming and can only give information on whether a library is adequate for screening purposes. The best general prediction of the capability of a library to undergo successful screenings is the estimate of its complexity in the most accurate possible way. Indeed, the complexity of a library can be directly linked to the probability to find a given binder of adequate affinity [19]. The method presented provides an advance towards this goal, by eliminating PCR steps and by compensating the technical errors. This method was validated for SPLINT intrabody libraries, but it is readily extendable to any antibody library of standard size. Moreover, DEAL software is independent of the sequencing platform and can be even more reliable with the greater number and more precise sequences that will be hopefully available in the future with the advancement of NGS technology.