Palindromic Nucleotide Analysis in Human T Cell Receptor Rearrangements

Diversity of T cell receptor (TCR) genes is primarily generated by nucleotide insertions upon rearrangement from their germ line-encoded V, D and J segments. Nucleotide insertions at V-D and D-J junctions are random, but some small subsets of these insertions are exceptional, in that one to three base pairs inversely repeat the sequence of the germline DNA. These short complementary palindromic sequences are called P nucleotides. We apply the ImmunoSeq deep-sequencing assay to the third complementarity determining region (CDR3) of the β chain of T cell receptors, and use the resulting data to study P nucleotides in the repertoire of naïve and memory CD8+ and CD4+ T cells. We estimate P nucleotide distributions in a cross section of healthy adults and different T cell subtypes. We show that P nucleotide frequency in all T cell subtypes ranges from 1% to 2%, and that the distribution is highly biased with respect to the coding end of the gene segment. Classification of observed palindromic sequences into P nucleotides using a maximum conditional probability model shows that single base P nucleotides are very rare in VDJ recombination; P nucleotides are primarily two bases long. To explore the role of P nucleotides in thymic selection, we compare P nucleotides in productive and non-productive sequences of CD8+ naïve T cells. The naïve CD8+ T cell clones with P nucleotides are more highly expanded.


Introduction
Surveillance of intracellular proteins are primarily mediated by T cells which recognize processed antigens presented by major histocompatibility complex (MHC) molecules on the cell surface. CD8 + T cells recognize endogenous proteins processed and transported through the endosomal pathway and presented by class I MHC molecules, while CD4 + T cells recognize exogenous proteins presented by class II MHC molecules. The specificity and affinity of T cell recognition is largely contained in a and b chains of the T cell receptor (TCR). TCR sequence diversity resides primarily in the complementarity determining region 3 (CDR3) loops of a and b chains, which bind to the peptide antigen, conveying specificity. The nucleotide sequence that encodes the CDR3 loops are generated by V(D)J recombination: variable (V b ), diversity (D b ) and joining (J b ) genes in the genome are rearranged to form a b chain, while V a and J a genes rearrange to form an a chain.
The large TCR repertoire arises as a consequence of combinatorial and junctional diversity. Existence of multiple V b , D b and J b genes in the genome and the number of ways V b , D b and J b genes can recombine give rise to combinatorial diversity. Junctional diversity is generated by deletion of nucleotides adjacent to the recombination signal sequences (RSSs) of the V b , D b and J b gene segments and non-templated insertion of random nucleotides (N nucleotides). N nucleotides which increase the diversity by exponential order are inserted by the DNA polymerase terminal deoxynucleotidyl transferase (TdT) at the V b 2D b and D b 2J b junctions in a template independent manner for the TCRb chain. In this manuscript we will denote the nucleotide insertion at D b 2J b and V b 2D b junctions as N1 and N2 regions, respectively.
To express a complete TCR in a developing T cell, an initial DNA break is performed by the recombination activating gene RAG1-RAG2 complex, which binds to the RSS and introduces a DNA double-strand break at the precise border of the coding end and the RSS. The ends bearing the coding sequence are sealed covalently and form transient hairpin loops, which are then cut open and digested to varying extent by an exonuclease. The position of the cut(s) at various points along the hairpins by the Artemis enzyme phosphorylated by a DNA-dependent protein kinase (DNA-PK) plays a key role in the formation of sequence variability in the final joint of the TCR loop. Usually single strand cuts on both strands of the hairpin result in deletion of a few nucleotides at the coding ends. However, for a subset of rearrangements, an asymmetric cut on only one strand of the hairpin transfers nucleotides from one strand to the other and results in a 39 or 59 protruding single strand which is complementary with respect to the end of the coding sequence. These short complementary palindromic sequences are called P nucleotides [1,2,3,4], the primary focus of this manuscript. Here we will denote the observed and true complementary sequence as palindrome and P nucleotide respectively. P nucleotides were first discovered in 1993 by [5] Thompson et al. However, little information is available on the in vivo role of P nucleotides in the TCR. P nucleotides are usually short (ƒ3 nts), but their location in the CDR3 regions of TCRs suggests that they can significantly influence the cellular adaptive immune response. Since the TCRb CDR3 is of fixed length and has a primary role in antigen recognition, any small perturbation in the nucleotide sequence encoding the TCR could affect the T cell drastically. Thus P nucleotides in the TCR could affect both positive and negative selection in the thymus and can help determine antigen recognition in the periphery. If we do not consider P nucleotide in TCR receptor analysis, we would overestimate the number of nucleotides inserted by the TdT and this in turns overestimate the TCR sequence diversity. Furthermore, understanding P nucleotides is a prerequisite for studying the pattern of nucleotide distributions contributed by the TdT enzyme.
Our knowledge of P nucleotides is incomplete due to limited amount of TCR sequence data available. Applying the Immuno-Seq deep-sequencing assay to TCRs, we sequence millions of TCR sequences in parallel from a single sample, creating a large data set of TCRb CDR3 sequences, which we utilize to study P nucleotides in a unified and coherent manner [6,7]. We are able to estimate P nucleotide distributions in individuals and then compare across a cohort of healthy people and different T cell subtypes.
In this study, we analyze P nucleotides in the repertoire of TCRb CDR3 sequences from the genome of naïve and memory CD8 + and CD4 + T cells that were obtained from the peripheral blood of seven healthy people. We build a linear model to characterize P nucleotide probabilities, and apply this model to each observed palindromic sequence to assign a probability that the sequence is a P nucleotide or a random insertion by TdT. This probability model can be directly used as part of an algorithm to decompose CDR3 junctions into their constituent parts. We elucidate the properties of P nucleotides in different subsets of T cells and across different donors using this model. We show that the distribution of P nucleotides are highly biased with respect to gene segment usages and its coding end and the majority of observed palindromes are not actually P nucleotides. We also show that P nucleotide distribution in productive sequences is different from non-productive sequences, and longer P nucleotides at the specific coding end is biased towards the high average copy number in CD8 + naïve T cell. Additionally, we show that junctions with P nucleotide sequences have fewer non-template insertions and induce a reading frame bias in the b chain of the TCR.

Results
We applied a high-throughput sequencing approach [6] to estimate the P nucleotide distribution in the naïve and memory compartments of CD8 + and CD4 + T cells obtained from the peripheral blood of healthy adults. Genomic DNA was purified from both the naïve (CD45RO 2 , CD45RA hi , CD62L hi ) and memory (CD45RO + , CD45RA int/neg ) CD4 + and CD8 + T cells (Table S1, Table S2). More than 5 million TCRb CDR3 sequence reads were generated from ,1 million template genomes in each of the naïve and memory samples. A mean of 400,000 CD8 + naïve, 60,000 CD8 + memory, 340,000 CD4 + naïve, and 170,000 CD4 + memory unique CDR3 nucleotide sequences were observed ( Table 1).

Method Overview
We briefly review the linear probability model for characterizing P nucleotide distribution, referring to the Materials and Methods for a full and technical description. The probabilistic model is designed to address the problem that palindromic nucleotides can occur from two different sources, true P nucleotides and random insertions from TdT that happen to be palindromic by chance. The linear probability model uses the palindromic and TdT (N nucleotides) data of T cell receptors of ethnically diverse individuals and aims to estimate the P nucleotide probabilities of different length. Unlike the method described in the earlier works [3,4] in which P nucleotides were scored as a Pvalue using a binomial distribution, the method described here is non-parametric, and does not rely on data belonging to any particular distribution. This approach being non-parametric, based on fewer assumptions, is more robust and has a wider applicability. In linear probability models a linear system is described at each coding end of the given gene segment. It takes as inputs the palindrome length frequencies in the cell population and the TdT insertion probabilities at each position, and output the probability of P nucleotides of varying lengths. The empirical frequencies of palindrome lengths are calculated directly from the data set. The estimation of TdT insertion probability is described below using an assumption that TdT adds nucleotides independently.
At each of the four non-recessed coding ends of the CDR3b chain of T cell receptors either there is no P nucleotide or P nucleotide due to hairpin structure. If P nucleotides arise, they are usually followed by random N nucleotide insertions by the TdT enzyme. Given the rearranged CDR3b chain of T cell receptor, there is no well-defined boundary between the P and N nucleotides to decompose them into disjoint sequences without error. Observed palindrome sequences arise purely by true P nucleotides due to hairpin or purely by N nucleotides due to the TdT enzyme, or by a joint combination of both. We establish a linear system to estimate the P nucleotide probabilities of different length with the assumption that each equation of the system quantifies the observed palindrome length frequency as a mixture combination of P nucleotide and N nucleotide probability distributions.
Modeling palindromic sequences up to m bases gives rise to an m by m linear system, where the unknowns are the probabilities of P nucleotide from 1 to m base long. The observed frequency f n that the first n nucleotides (n#m) are self-complementary or palindromic is equal to the sum of the probabilities that these nucleotides are independently added by the TdT or P nucleotides are followed by N nucleotides added by the TdT enzyme or that an observed palindromic sequence has at least an n base P nucleotides. Thus in each equation of the linear system the palindrome length frequency f n can be expressed mathematically as a sum of three terms. These three terms represent the three different biological processes that give rise to palindromes in T cell receptors. The first term represents all n nucleotides are independently added by the TdT enzyme, and thus this term encodes the well-known TdT bias for inserting particular nucleotides. The second term represents the first i base is a P nucleotide, where i#n21, and the remaining n2i bases are N insertions; and the third term represents that there is an at least n base P nucleotides. In general, the vector of palindrome length frequency [f 1 , f 2 ,… f M ] T can be written as a sum of the TdT probability vector, TdT bias, and a matrix A times the vector of unknown P nucleotide probability [p 1, p 2, … p M ] T . The matrix A encodes the information content of the biological steps that gives rise to P nucleotides. The P nucleotide probability vector is solved by inverting the matrix, A, and multiplying it by the difference of palindrome length frequency vector [f 1 , f 2 ,… f M ] T and the TdT probability vector.
In this manuscript we restrict P nucleotide analysis of palindromic sequences up to M = 3 bases because of the following two reasons. First, many of the palindromic sequences in the data set fall in the range of 1 to 3 nucleotides long that gives rise to average palindrome length of 1.6260.02. The Table 1 shows the actual counts of the palindrome length in the data set of each T cell compartment, and the histogram of the palindrome length frequencies is given in Figure 1A. Second, though we see longer palindromes greater than equal to 4 nucleotides long in our data set, their frequencies fall down drastically, more than a power of 4. Moreover, these longer palindromes only account for less than 2-3% of the total palindromes.
In order to solve for P nucleotide probability the linear system requires the knowledge of TdT probabilities. To estimate the TdT insertion probabilities based on the assumption that N nucleotides are added independently, we take the middle regions of N2 and N1 segments and calculate the mono nucleotide frequencies. These middle regions must be added by TdT insertions. The regions towards the either end of N1 or N2 segments have potential to be biased due to P nucleotide additions and other unknown factors, thus these regions do not account for a true contribution of TdT insertions. For instance, to estimate the TdT probabilities towards the 59 end of the D b gene segment, N2 segments of seven nucleotides long with no deletion at the 59 end of the D b gene segment are used and nucleotide frequencies in their middle regions (3 rd , 4 th and 5 th bases from the coding end) are calculated. We observed that the nucleotide frequencies at the 3 rd , 4 th and 5 th bases from the 59 of D b segment do no vary significantly and are consistent among all donors (data not shown). We took the nucleotide distribution at the 3 rd base as an estimate of the TdT probabilities. Similarly the TdT probabilities at the 39 of V b and D b and the 59 J b of coding ends are estimated.
Mean estimated TdT probabilities for T cells are given in Table S3. The TdT has a strong bias for certain nucleotides like G/C. In addition we also found that nucleotides inserted by the TdT are biased with respect to the coding end. For example, nucleotides towards the 39V b coding end is T rich, while nucleotides towards the 59 and the 39 of the D b coding ends are G and C rich respectively. We substitute the coding end specific TdT probabilities and empirical frequencies of palindromes into the linear system and solve for the P nucleotide probabilities of different length.

Distribution of P Nucleotide
We applied the above linear model to each coding end of the V b , D b and J b gene segments of TCRs and solved for P nucleotide probabilities. We used these probabilities to calculate the expected number of one, two and three bases P nucleotides in each data set and then normalized by dividing by the total number of unique sequence in the data set. Normalizing this way allows comparison of the results between donors and across T cells subtypes. Several features of P nucleotide distributions were established and characterized.
P nucleotides were observed in both naïve and memory compartments of the CD8 + and CD4 + T cells ( Figure 1B). Total P nucleotide percentages in the CD8 + and CD4 + naive compartments were 1.8% and 1.9% respectively while the corresponding percentages in the CD8 + and CD4 + memory compartments were 0.9% and 1.7% respectively. Some coding ends have been found to display a high proportion of P nucleotides while the others have few or none ( Figure 1B). P nucleotides are mainly seen at the 59 of D b gene segments, they range from 1% of data set in the naïve T cells to 0.5-0.9% in the The mean number of in-frame, read-through TCRb CDR3 nucleotide sequence reads obtained from the CD8 + CD45RO 2 CD45RA hi CD62L + (naïve), CD8 + CD45RO + CD45RA low (memory), CD4 + CD45RO 2 CD45RA hi CD62L + (naïve) and CD4 + CD45RO + CD45RA low (memory) T-cell samples. The last seven columns show the average counts of palindromes of length one to seven observed in the data set at four different non-recessed coding ends. Palindromic sequence longer than 7 nucleotides long is not observed in the data set. doi:10.1371/journal.pone.0052250.t001 memory T cells. On the other hand, we observed very few P nucleotides at the 39 of D b despite the fact that there were comparable numbers of sequences with no deletion at this coding end (Table S1 and Table S2). We have learned from our previous work that the D b 1 and D b 2 gene segments are utilized almost uniformly [6] in TCRs from both the naïve and memory T cells. In contrast the P nucleotide distribution was strongly biased with more P nucleotide seen at the D b 1 than at the D b 2 gene segment. We found that 0.5% of each data set contained P nucleotides at the J b coding ends, with the CD4 + and CD8 + naïve T cells showing a similar distribution while in the memory compartment the CD4 + is two fold higher than the CD8 + . We showed in our previous work that the J b gene usages observed in the 4 different cell types were relatively constant within a given donor [7]. We compared the P nucleotide frequency at J b coding ends with the J b gene usages in CD8 + naïve T cells. We found that P nucleotides are mainly seen in the members of the J b 2 family and correlate with their usages, while in the J b 1 family P nucleotides are mainly seen at J b 1-1 and J b 1-5 and do not necessary correlate with their usages ( Figure 1B). The coding end 39V b contributed few (0.3%) P nucleotides as compared to the 59D b and 59J b , and their distributions were similar in both compartments of the CD8 + and CD4 + T cells. Most of the P nucleotide contribution at 39V b came from the gene segments of the V b 4 family and the V b 20 gene segment in the naïve and memory populations of the CD8 + and CD4 + T cells. P nucleotide additions have a different average length depending upon the particular coding end involved. The majority of P nucleotides consist of one nucleotide at 39V b , two nucleotides at the 59D b , and both one and two nucleotides at the 39J b ( Figure 1B). Our model also identified some three base P nucleotides at the 59D b , though their expected frequencies were 150 and 400 fold lower in the naïve and memory compartment, respectively than the two base P nucleotides. Most observed three base or longer palindromes are two base P nucleotides that appear to look longer because of palindromic N nucleotide additions added by the TdT enzyme.

Assigning P Nucleotide Class
Several algorithms classify all n base palindromes as n base P nucleotide additions, which overestimate P nucleotide distribution in T cell repertoire [8]. For each coding end and the gene segment we calculated the conditional probability of P nucleotide length given an observation of a palindrome in the naïve and memory compartment of the CD8 + and CD4 + T cells. Technical description of an algorithm for assigning P nucleotide using the conditional probability model is described in Materials and Methods. As it was described and reasoned in the above section, Method Overview, calculation is done for one to three bases long palindromes and the classification is performed according to the maximum conditional probability. We found that conditional probabilities behave very similarly in all populations of T cells. The average result of all the donors of all cell types is shown in the table (Table 2) and the average result for each cell type is not shown but available on request.
Several observations can be made from this table. Based on our maximum conditional probability model for classification, we did not predict any single base P nucleotides at any coding end of the V b , D b , and J b gene segments ( Table 2). In the case of one base palindromes the average conditional probabilities of one base P nucleotides are 0.22, 0, 0.04, and 0.13 at 39V b , 59D b , 39D b , and 59J b respectively, which are far below the threshold level of 0.5. Therefore we interpret that one base palindromes arise primarily by the TdT insertion and not by the hairpin resolution. Given an observation of a single base palindrome, we predict with high probability that this base is inserted by the TdT, not a P nucleotide. Furthermore, in two base palindromes the average conditional probabilities of one base P nucleotides at 39 V b , 59D, 39D, and 59J b are 0.15, 0, 0.01, and 0.06 respectively, which are again far below the threshold level of 0.33. The conditional probabilities in three base palindromes behave similar to the two base palindromes case. Most of the two or three base observed palindromes were not classified as one base P nucleotides.
In cases of two and three base observed palindromes, our model predicts either a two base P nucleotide or no P nucleotide. This implies that the third base of the observed palindrome is mostly contributed by the TdT enzyme instead of being part of the hairpin resolution. Most of the two or three base observed palindromes were not classified as one base P nucleotides.
According to the maximum conditional probability model most of the V b gene segments are not predicted to show any P nucleotide ( Table 2). The classified two base P nucleotides are mainly restricted to few specific V b gene segments (V b 10 family, V b 19, V b 20-1, V b 23-1, and V b 27), with two base P nucleotides are dominated by V b 20-1 and V b 27 gene segments. We also noticed that there were a few V b gene segments ( where no palindromes were observed at all. At 59D b end, both the D b gene segments were labeled as two-base P nucleotides with high probabilities and the number of sequence with D b 1 gene segments classified as two-base P nucleotide dominates the D b 2 by three folds. At 39D b end, conditional probability was too low for both of the D b gene segments to be classified as P nucleotide. In J b segments, eight of the thirteen were classified as two base P nucleotides while the remainders of the J b segments were classified as no P nucleotide with high conditional probabilities.

P Nucleotide Contribution to TCRb Diversity
Our calculations show that P nucleotides are primarily two bases long. We also observe that if there is a two-base P nucleotides in the TCR sequence, it is mainly seen at one of the four coding ends. Very few sequences have two base P nucleotides at more than one coding end. In each population of T cells, we calculated the mean fraction of sequences which have two base P nucleotides at both ends of N2 or N1 segments ( Table 3). The mean fraction of two base P nucleotides at both ends of N2 and N1 segments are 0:712|10 {3 and 0:143|10 {3 respectively for all cell types. It is not surprise that the second term is smaller than the first one, since the 39D b gene segments rarely show any P nucleotide. We saw that many of the two base P nucleotides are restricted to the 59 of D b and J b ( Figure 1B). We also calculated the mean fraction of sequences which have two base P nucleotides jointly at the 59 of D b and J b . Their mean fractions are 1.4610 23 in all cell types while the CD8 + memory has 8.3610 24 . These frequencies are hundred times smaller than the total P nucleotide frequencies and therefore we conclude that the formation of two bases P nucleotides at coding ends is statistically a mutually exclusive event and thus the probability that the TCR sequence will have a two-base P nucleotides at more than one coding end will be very small.
We have previously estimated the total number of unique TCRb CDR3 sequences in the entire T cell repertoire of a healthy adult to be around 3|10 6 and 1:5|10 5 in the naïve and memory compartment respectively [7]. We showed here that total P nucleotide frequency in naïve T cell subsets was approximately 0.02 while the CD8 + and CD4 + memory compartments has approximately 0.01 and 0.018 respectively ( Figure 1B), which lead us to estimate using a simple linear extrapolation that approximately 60,000 sequences in the CD8 + and CD4 + naïve repertoires, 15,000 and 60,000 sequences respectively in CD8 + and CD4 + memory repertoires would contain P nucleotides of one to three bases long. The frequency of sequences that has two base P nucleotides at one of the four coding ends is found to be around 0.0125 in each cell type while the CD8 + memory has 0.008, suggested that around 37,000 sequences would be expected to have two base P nucleotides at any one of the four coding ends in every T cell repertoire except the CD8 + memory.

P Nucleotide Role in Thymic Selection
A subset of T cells has both alleles rearranged. Since we sequence genomic DNA, both the productive and non-productive alleles are observed. For the non-productive alleles, the sequences are either out of frame or have a stop codon. These nonproductive alleles account for ,15% of TCRb sequences. These alleles are non-functional, and therefore are not subjected to direct selection in the thymus. By comparing properties of these nonproductive sequences to the productive sequences in the CD8 + naïve repertoire, we are able to deduce which factors contribute to positive and negative selection in the thymus.
We calculate and compare the expected number of P nucleotides in productive and non-productive TCRb sequences at three coding ends, 39 of V b , 59 of D b and J b . We found that nonproductive sequences have larger expected number of P nucleotides at 59 of D b and J b than productive sequences. On the other hand at the 39 of V b , P nucleotides are more predominant in functional TCR sequences than non-functional sequences ( Figure 2). We showed in our previous work that V b 2D b 2J b usage in the functional TCRb CDR3 sequences was highly non uniform but qualitatively similar to non-functional sequences [6]. In spite of the fact that the V b 2D b 2J b utilization is attributed equally in both sets of functional and non-functional TCRb sequences, we saw statistically significant differences in P nucleotides observed at coding ends in functional and nonfunctional sequences.

P Nucleotides Correlate with High Average Copy
We determined that the majority of P nucleotides consist of two nucleotides long in length and they are mainly constrained to the 59 of D b and J b gene segments ( Figure 1B). We asked if P nucleotides at these coding ends are associated with large clone size. We take all sequences with no deletion at the specific coding end of the given gene segment and group them as P nucleotide and no P nucleotide depending upon whether a sequence contains P nucleotide greater than one base or not. In order to avoid spurious long P nucleotides at the coding end of the given gene segment, we set the P nucleotide probability of two and three base at the threshold value of 0.05. At each coding end of the gene segment average copy number for the two groups is calculated if the P nucleotide probability at its coding end is found to be above the threshold value. Average copy numbers are normalized by total number of reads in each donor. In each data set we summed the normalized average copy number over D b gene subset and call it the D b sum. Similarly it was done for J b gene subset and J b sum is obtained. We found that all the D b genes contribute to the D b sum while seven to eight J b genes segments contribute to the J b sum. We observe that sequences bearing receptors with longer than one base P nucleotides at both the D b and J b coding ends are associated with high average copy number (clone size) as compared to sequences with one or no P nucleotide. Using the Student t-test for paired samples, we found that at the 59 of D b , the two groups are statistically different at the significance level of 0.01 in all T cell compartments except the CD8 + memory. On the other hand at the J b end, the two groups are statistically different, using Student t-test for paired samples, in each of the naïve T cell subsets at the significance level of 0.01. Therefore, the distribution is more skewed in the naïve population of T cells in the two groups ( Figure 3); with longer P nucleotides being biased towards high average copy number. In contrast the distributions in the memory subsets were not as skewed as the naïve ( Figure S1).

Characteristic of the Length of Junctional Insertion with P Nucleotide
We showed that a self-complementary nucleotide sequence of two bases was most likely to be a two base P nucleotide. Since the CDR3 length of the TCRb is constrained, presence or absence of P nucleotides can affect the number of nucleotides added by TdT. We asked if sequences with P nucleotides correlates with junctional insertion. We assess the distribution of the length of the junctional insertion at N2 and N1 segments with and without the two base P nucleotides at either end of the N2 and N1 segments.
To compare the number of junctional insertions at N2 and N1 segments of the TCRb CDR3 with P nucleotides (consisting of more than one base) and no P nucleotide sequence, we partition the data set from each donor into two groups. We call a sequence a P nucleotide sequence if we observe a palidrome greater than one base at either end of the N2 segment otherwise as no P nucleotide sequence. As mentioned above we observe very few P nucleotides at both ends of N2. For each P nucleotide and no P nucleotide sequence we calculate the length of nucleotides inserted by the TdT enzyme. Note that to calculate the length of nucleotides inserted by TdT in P nucleotide defined sequences, we need to factor out the P nucleotide length from the total insertion, since these P nucleotide bases are not inserted by the TdT enzyme. The calculation was repeated for the N1 segment.
The results show that CDR3 sequences with P nucleotides consisting of more than one nucleotide are likely to have fewer non-template insertions as compared to sequences with no P nucleotide in the naïve and memory compartment of the CD8 + and CD4 + T cells. Results are shown for the CD8 + T cells, (Figure 4) and the CD4 + T cells show a similar trend ( Figure  S2). It is interesting to note that the two distributions intersect each others when the number of insertions becomes equal to three and this is consistent in each T cell compartment. Thus the probability of up to three insertions is higher in the P nucleotide sequences as compared to the no P nucleotide sequences. On average P nucleotide sequences have three nucleotide insertions in both the naïve and memory compartments while no P nucleotide sequences were found to have four and five nucleotide insertions in naïve and memory compartments respectively. This makes the P nucleotide sequences one to two bases closer to the germ line as compared to the no P nucleotide sequences. And thus the addition of P nucleotides consisting of two bases can reduce the diversity of TCRb repertoire up to 64 fold.   Comparing the effects of P nucleotides on the reading frame, we found that the addition of P nucleotides at the 59 end of the D b gene segments increased the frequency of the reading frame 1 in all populations of T cells. When a P nucleotide was inserted, 50-60% of T cells used reading frame 1 to transcribe their TCRb chains. Sequences with the addition of two base P nucleotides in this reading frame, the first position of the codon is occupied by any nucleotide and the two bases of the P nucleotide contribute to the non-synonymous and the synonymous position of the codon. Thus the consequences of two base P nucleotide in this reading frame leads to four canonical amino acid residues, alanine, proline, serine, or threonine in the CDR3b sequences depending upon if the first position of the codon is occupied by G, C, T or A respectively. The respective frequency of the amino acid residues in all T cell subtype is given in Table 4. Proline and threonine were coded more than uniformly while alanine was less than uniform in this reading frame.
The third reading frame with P nucleotides was favored slightly less as compared to no P nucleotide. But note that in this reading frame, sequences with no P nucleotide could add many possible amino acids at this position, while the same reading frame with P nucleotide would only add a proline reside since the two bases of the P nucleotide only affect the non-synonymous positions of the codon. This means that 30-40% of sequences with P nucleotides  Table 4. Amino acid usages in reading frame 1. in this reading will only have proline followed by glycine and thus suggested that these residues might play some important biological role in TCR recognition. In the P nucleotide sequences the combined affect of reading frames 1 and 3 at the amino acid level introduce a strong bias for proline which is the only cyclic amino acid and allows less degrees of freedom. We found that irrespective of the P nucleotides, the frequency of the D b 1 segment in the reading frame 2 was higher than the D b 2 segment, 25% and 10% in D b 1 and D b 2 segments respectively in all T cell populations. Thus, when T cells rearrange their DNA, they are less likely to use D b 2 segment to encode TCRb chains in the 2 nd reading frame and the addition of P nucleotides further decrease this usage ( Figure 5). Furthermore D b 2 gene segments also contain a stop codon ''TAG'' and we observed that whenever the D b 2 segment was used in the 2 nd reading frame, eleven nucleotides were invariably deleted from its 39 to avoid this stop codon. This also explained the lower number of P nucleotide additions at the 59 end of D b 2 as compared to D b 1. Thus, the P nucleotide addition in this reading frame only leads to incorporation of arginine followed by aspartic acid in the CDR3 TCRb.

Discussion
High-throughput DNA sequencing has enabled us to characterizes P nucleotide distributions in millions of distinct TCRb CDR3 sequences found in multiple individuals. We develop a novel algorithm to calculate P nucleotide at full-length coding ends in TCR. This algorithm allows us to calculate the probability of an observed palindromic nucleotide being truly P nucleotide as opposed to be inserted by the TdT as a palindromic N-nucleotide. We find that average length of P nucleotide is 2 and the range is 1 to 3 ( Figure 1B). These findings are consistent with biochemistry of the Artemis:DNA-PKcs in V H segments of human immunoglobulin heavy chain locus, which has a distribution of hairpin opening position that varies from 1 to 3 nucleotide 39 of the hairpin tip, with 2 nucleotide being the dominant hairpin opening site [9,10]. Our results are also in agreement with the works of two different authors [3,4], where they investigated the P nucleotide using extrachromosomal plasmid assay and also performed an indepth analysis of two loci, the TCRb locus and the IgH locus. They found that 95% of P nucleotides (palindromes in our terminology) were one to two residues in length and 5% were 3 bp long. These agreements allow us to conclude that the length distribution of P nucleotides could also be generalized to other loci.
The P nucleotide distribution profile is biased in two ways. First, it favors the P nucleotide insertions at certain coding ends, 59 of D b and J b ( Figure 1B). Second, at the coding end it is further biased with respect to the gene sequence used, some gene sequences display high proportion of P nucleotides while others show few or none ( Table 2). The 39V b coding end P nucleotides are dominated by V b 20-1 and V b 27 gene segments. The coding end of these two gene segments are A/T rich and therefore TdT bias for G/C couldn't account for these differences. However, the observed frequencies of V b -J b utilization suggest that these segments are predominantly used in all subsets of T cell [6], and thus either positive or negative selection could biased the samples. Parallel with this, Meier and Lewis's work also reported that the palindrome frequency data associated with 39V b coding end were dominated by a single V b gene segment. The 59J b coding end P nucleotides are mainly observed in the members of the J b 2 family ( Figure 1B), and in the case of 2 and 3 base palindrome data they are classified as 2 base P nucleotides with high conditional probabilities ( Table 2).
The total palindrome percentage at the coding end ranges from 2 to 6% ( Figure 1A), while the total P nucleotide percentage, sum over all the coding ends, ranges from 1-2% in the CD8 + and CD4 + naïve T cells, with less or so in the memory T cells ( Figure 1B). We compared our results with Meier and Lewis's work that used the substrate assay to analyze palindrome formation at coding joints [3]. They reported that percentage of palindromes at the coding joint ranges from 3% to 16%, see their Table 2, which is higher than our result. This difference is expected and could be accounted by considering the fact that their substrates were analyzed without selective bias, positive and negative selection, of an intact adaptive immune system and thus their estimated palindrome frequencies could be overestimated. It is estimated that 70-80% of the rearranged T cells die during the selection processes of the VDJ recombination [11]. If we consider this fact, Meier and Lewis's estimates could be scaled down and become agreeable to our result. They also comprehensively revaluated a few hundred sequences from previously published studies; see their Table 3, [3]. They showed that the palindrome frequencies in TCRb locus at 39V b , 59D b , 39D b , and 59J b are 0.1, 0.15, 0.04, and 0.11 respectively, which are approximately 2-3 folds higher than our result ( Figure 1B). Although TCRb genes with P nucleotides constitute a small fraction of the total repertoire, they are observed consistently in our cohort and have evidence of strong selection.
According to the data shown ( Figure 1A), 1-2% of the total unique sequences have single base palindromes at each coding end. We computed the conditional probability of P nucleotide using an algorithm outlined in Materials and Methods. As detailed in (Table 2) classification of P nucleotides based on maximum conditional probability, in the case of one base palindrome, did not predict any single base P nucleotides at any coding end of the V b , D b and J b gene segments. This result supports the view that the TdT is mainly responsible for single base palindromes. An alternative possibility is that one base palindromes were actually a 59 and 39 palindromic overhangs of $2 nucleotides long in length, and they could go through exonuclease trimming by Artemis protein to yield palindromes of single base. It has been shown that the purified Artemis protein possesses a single-strand-specific 59 to 39 exonuclease activity [10]. To more fully characterize the single base palindromes, we computed the nucleotide frequency, composition, of the single base palindromes as well as the average mono nucleotide frequency of the TdT insertions ( Table 5).
Though TdT has a coding end bias (Table S3), on average it has a strong well-known bias for inserting G/C nucleotides [4]. The result showed that the nucleotide frequency of single base palindromes is overrepresented at the coding ends terminating in G/C as compared to A/T, and this could be explained by the TdT's bias for inserting palindromic G/C nucleotides.
In the CD8 + naïve T cells three donors are related, donors 1 and 3 are siblings and shared the same HLA, they have different age, and are the daughters of donor 2, while the other four donors are completely unrelated in HLA and belong to different age groups. At each coding end we computed the total fraction of palindromes of 1-7 base long in each donor. The fraction of palindromic inserts varied, but no systematic difference emerged (Table S4), though this data set is small to make any general conclusion. We observed that there is no variation in palindromic insertion at the 59J b with regard to the HLA. We saw some HLArelated differences at the 39V b , the two related siblings shows similar fraction of palindromes, donors of African origin show higher fraction of palindrome insertions. One might expect a variation in palindromes due to HLA as the V b gene segments interact with the HLA as well as with the peptide for antigen recognition.
The significant differences between P nucleotide usage in productive and non-productive naïve TCRb sequences suggests strong thymic selection ( Figure 2). The error bars in Figure 2 are standard deviation. The standard errors are the STD/sqrt (7), which implies that P nucleotide frequency differences in the productive and non-productive sequences are statistically significant (P,.01). P nucleotides fix particular amino acids at specific positions in the CDR3 sequence, so thymic selection is expected ( Figure 5). However, the 30-50% observed is very large. Most other sequence properties, such as CDR3 length and GC content, are indistinguishable between productive and non-productive TCRs, and show little sign of influencing thymic selection. These differences are observed for P nucleotides for V b , J b , and 59D b .
On the other hand, there are very few P nucleotides at the 39D b end in either naïve or memory compartments of CD8 + and CD4 + T cells ( Figure 1B). As this observation holds for both productive and non-productive, thymic selection is not a viable explanation. Another possibility is the strength of the RSS signal [11]. The 39D b 23 RSS mediates joining to J b 12 RSS more efficiently than V b 23 RSS does to 59D b 12 RSS and thus we hypothesized that efficient joining could give rise to fewer P nucleotides at the coding end.
This manuscript is based on the analysis of palindromes at the end of non-recessed coding ends. For the completion, we also briefly report the presence of another kind of palindromes at nucleolytically processed coding termini, referred by Gauss and Lieber as Pr nucleotides (referred to here as recessed palindromes) [4]. Although, the biological process that gives rise to these Pr nucleotides is not clear, the authors have hypothesized that their generation could be attributed to either fortuitous addition by the TdT enzyme or by a hairpin intermediate. Histogram of the recessed palindrome length frequencies, based on 1 to 10 nucleotide deletions at coding termini, is shown in (Figure 6). In contrast to the conventional palindromes which are on average 2 nt in length ( Figure 1A), the frequency of a single nucleotide recessed palindromes is over-represented. We find that 98-99% of the recessed palindromes are 1 to 3 bases long, and the average length is 1.2, which are consistent with the results of Gauss and Lieber [4].
With the algorithm presented to assign a probability that an observed palindrome is truly a P nucleotide, we can begin to study the functional role of P nucleotides [12,13]. Functional immuno- logical assays such as tetramer sorting enables the identification of the subset of T cells which bind a particular HLA:peptide antigen complex. With an assay, we are able to readily sequence the set of these TCR sequences. Combining such functional assays with the characterization of P nucleotides presented here, we can elucidate the function role TCRs with P nucleotides in different disease contexts.

Notations and Terminology
For our purposes, and in keeping with previous works, we introduced some notations and terminologies. We denoted the nucleotide insertion at D b 2J b and V b 2D b junctions as N1 and N2 regions respectively. We denoted the observed and true complementary sequence as palindrome and P nucleotide respectively. The probability of i-base P nucleotide is represented as p i . ''Coding'' end referred to the sequence directly adjacent to the recombination signal sequences (RSS). TCRb CDR3 region is defined as a sequence that begins at the nucleotide base that encodes the 2 nd conserved cysteine at the 39 of the V b gene and ends at the nucleotide base that encodes the conserved phenylalanine at the 59 of the J b gene.
Isolation and Purification of Naïve and Memory CD8 + and CD4 + T Cells Isolation and purification of naïve and memory compartments of the CD8 + and CD4 + T cell samples were carried out as described in [6,7].

Sequencing of CDR3 Regions from Rearranged TCRb Genes
Polymerase chain reaction (PCR) amplification and sequencing of rearranged TCRb CDR3 regions, in reverse complement from 39J b , were done as described in [6], and see also Figure 1 in the reference [7].

Preprocessing of Genome Analyzer (GA) Sequence Data
Raw GA sequence data, 60-nucleotide long, were preprocessed to remove PCR and sequencing errors. Clustering of data into unique TCRb CDR3 sequences were done using Hamming distance, as described in [6,7]. The sequence data is again reverse complemented to bring them back into 39V b to 59J b orientation.

Identification of TCRb CDR3 Sequences and VDJ Decomposition
Identification of the TCRb CDR3 region and the nomenclature were done according to the definition established by the International ImMunoGeneTics collaboration [8]. Accurate decomposition of TCRb CDR3 nucleotide sequence into constitutive gene segments represents a major challenge for P nucleotide analysis. Each nucleotide of the solexa sequence is assigned to the most likely gene segment. An updated IMGT algorithm was used to delimit the 59 end of J b , the 39 end of the V b , and both the ends of the D b by aligning the solexa sequence with the reference genes. Since somatic hypermutation do not occur in T cell, any difference between the reference gene and solexa sequence is not allowed in Figure 6. Palindrome length frequencies at recessed coding ends. Histogram of the average palindrome length frequencies at the four recessed coding ends in the data set of naïve and memory compartments of the CD8 + and CD4 + T cells. Each entry is normalized by its total number of unique sequences. The mean height is shown in the figure. Different colors represent one to six bases palindromes. In contrast to the palindromes at non-recessed coding ends, events of one base palindrome are more common at nucleolytically processed coding termini. doi:10.1371/journal.pone.0052250.g006 the CDR3 region and therefore the first nucleotide difference determine the end of the gene segment.
Each of the 13 different J b gene segment has a unique nucleotide motif between the position 12 and 18 from the heptamer signal sequence. The last several bases of solexa sequence are searched for J b gene specific J motif, this in turn uniquely determined which J b segment was used. If none of the J motif was found in the solexa sequence, the sequence was not considered for P nucleotide analysis. If the J motif is found, solexa sequence is compared base by base, 39 to 59 orientation, to the reference J b gene segment. Truncated or nontruncated 59J b coding end is respectively determined by the next nucleotide difference or by the end of reference J b gene segment.
Each of the 54 V b gene segment has a 7-base nucleotide motif which encodes the first four amino acid residues ''CASS'' of the CDR3. Solexa sequence is searched for V b gene specific motif. If V b gene specific motif was not found for any of the V b gene, the solexa sequence was dropped from the analysis. If the motif is found, solexa sequence is divided into two subsequences: 1 st subsequence is the sequence downstream of the motif, containing the motif, and the 2 nd subsequence is upstream of the motif. Since the V b part of the CDR3 region is too short, V b assignment is done in two steps, unlike the original IMGT algorithm. In the first step truncated or nontruncated 39V b coding end, thus the number of undeleted V b nucleotides in the solexa sequence, is determined by comparing the 1 st subsequence, nucleotide by nucleotide in 59 to 39 orientation, to the CDR3 part of the reference V b gene segment. In the second step upstream similarity score was calculated for each reference V b gene, without gap, using a sliding window between the 2 nd subsequence and upstream sequence of the CDR3 part of the reference V b gene. The number of undeleted nucleotides calculated in step 1 was added to the upstream similarity score; adding this creates more weight to the CDR3 part of the nucleotides of the V b gene segment. If the V b segment in the solexa sequence is more likely to be close to the full coding end more likely it is to be uniquely defined. The V b gene corresponding to the maximum score is assigned to the solexa sequence. If the maximum score is not unique, V b gene cannot be assigned uniquely to the solexa sequence, the solexa sequence is dropped from the analysis.
After parsing V and J, the nucleotide sequence between 39V b and 59J b termini is compared with the reference D b gene segments. For each D b gene similarity score was calculated, without gap, using a sliding window between the nucleotide sequence and the reference D b gene. The D b gene corresponding to the maximum score is assigned to the sequence and is used to delimit both the ends of the D b segment using the procedure described above for V and J. If the D b gene segment in the solexa sequence is not uniquely assigned or the number of undeleted nucleotides in the D b part of the solexa sequence is less than 5, the solexa sequence is dropped and not considered for P nucleotide analysis.
At each indentified and delimited full-length coding end, 39V b , 59D b , 39D b , 59J b , the algorithm search for short palindromic sequences. The algorithm starts the search from the last (first) nucleotide of the gene segment at the 39(59) coding end and compared it with the first (last) nucleotide downstream to the 39(59) coding end. If the comparison is found to be complementary, the iteration moves to the second last (second) nucleotide of the gene segment and compared it with the second (second last) nucleotide downstream to 39(59), and so on. The iteration stops when the next nucleotide comparison is not found to be complementary. Number of iterations determines the length of the palindrome. After palindromic nucleotides are identified they are excluded from contributing to the upstream or downstream gene sequences that remained to be queried. N1 and N2 regions are the part of the sequence that left after the identification and delimitation of the V b , D b , J b and palindromic nucleotides.
for the coding end to have a P nucleotide. 3) Since p 1 wt 2 p 1 , comparing the first and second equations gives b 1 wb 2 . In order to have a one base P nucleotide this condition must hold. 4) Again since p 3 wt 3 p 3 , comparing the second and the third equations gives b 3 wt 3 b 2 . This inequality implies that in order to have a three base P nucleotide, the information content of P nucleotide in three base pairs must be greater than the combined information of two base pairs and the information that the 3 rd base pair is inserted by the TdT. 5) As a special case, if no nucleotide is added by the TdT, P nucleotide probabilities trivially become p 1~b1 {b 2 , p 2~b2 {b 3 , p 3~b3 , i.e. all the palindromic sequences are P nucleotides.
In order to solve for p, the above system requires the estimation of the entries of the matrix which depend only on the TdT probabilities and the vector b which are dependent upon the TdT probabilities as well as on empirical frequencies of observed palindromes. We calculated the empirical frequencies of the palindromes based on the data. Estimation of the TdT probabilities is described in the section Method Overview. Mean estimated TdT probabilities for the CD8 + and CD4 + T cells are given in Table S3. We substituted the coding end specific TdT probabilities and the empirical frequencies of palindromes into the above system and solve for P nucleotide probabilities of different length.

Algorithm for Assigning P Nucleotide Class
Many palindromic sequences appear at V b 2D b or D b 2J b junctions that are derived in whole or in part by nucleotide insertions from the TdT. Classifying all of the n-base palindromes into n-base P nucleotides overestimate the P nucleotide's distribution profile and underestimate the TdT distribution in the data set. Here we take an approach of classifying a given palindrome into a P nucleotide of certain length using a maximum conditional probability model. For a n base palindrome at a given coding end we calculated the conditional probability of P nucleotide of 1 to n length, and the conditional probability of no P nucleotide and then chose the P nucleotide class that has the largest conditional probability.
Let X and Y be random variables over the length of a P nucleotide and an observed palindrome respectively. Using a Baye's rule, given a n base palindrome, the conditional probability of a m-base P nucleotide is written as. where p i is the i-base P nucleotide probability, t i is the probability that the i th base with respect to the coding end is a TdT insertion and m[ 1,n ½ . The conditional probability of no P nucleotide is calculated using the expression P 0~P n m~1 Pr (X~mDY~n). P nucleotide probabilities p i at each coding end are estimated using the linear system outlined in Materials and Methods, and the TdT probabilities t i are estimated as described in Method Overview and theirs estimates are given in Table S3. In the above equation the denominator captures the number of possible ways a n base palindrome arises while the numerator expresses the probability of the event that a m base P nucleotide is followed by n{m bases of TdT insertions. For example, if we observed a two base palindrome at any given coding end, the conditional probability of 0, 1 and 2 base P nucleotide become t 1 t 2 t 1 t 2 zt 2 p 1 zp 2 , p 1 t 2 t 1 t 2 zt 2 p 1 zp 2 and p 2 t 1 t 2 zt 2 p 1 zp 2 respectively. Thus, we classified the two-base palindrome as a 0-base, no P nucleotide, or 1-base, or 2-base P nucleotide corresponding to the term which maximizes the conditional probability.

Statistical and Computational Analysis
In each donor and at the each coding end, the expected number of P nucleotide was calculated using the expression P 3 i~1 p i n i , where n i represents the number of observed i-base palindrome and p i is the probability of i-base P nucleotide, estimated using the linear system described in Materials and Methods. The pvalues were determined by the Student t-test for paired samples. All bar heights are shown as group means and the vertical error bars represent one standard deviation. All computations are performed in Java and Matlab.

Supporting Information
Figure S1 Average copy number with and without P nucleotide. Correlation between average copy number and P nucleotide at 59 D b and 59 J b gene segment in memory compartments of CD8 + and CD4 + T cells. Heights represent the mean of the sum of the normalized average copy number over D b gene subsets and similarly for J b gene segment subsets. Copy numbers were normalized by respective number of total reads in each donor. The error bar indicates one standard deviation. (TIF) Figure S2 Number of insertions distribution in P nucleotide sequence. Insertion distribution of P nucleotide and no P nucleotide CDR3 sequences observed in the naïve and memory CD4 + compartments of every possible pair of individuals as a function of the number of nucleotide insertions at the N2 and N1 segments.
(TIF) Figure S3 Reading frame biased induced by P nucleotide. Heights represent the mean percentage of sequence with a reading frame in P nucleotide and none P nucleotide. Results are shown for D b 1 and D b 2 gene segments in CD4 + naïve and memory T cells. Error bars indicate the standard deviation. (TIF) Table S1 Summary of CD8 + naïve and memory data set. Number of in-frame, TCRb CDR3 nucleotide sequence, total reads and unique number of sequence obtained from the CD8 + CD45RO 2 CD45RA hi CD62L + (naïve) and CD8 + CD45RO + CD45RA low (memory) T-cell samples from each donor are given in the second and third columns respectively. The corresponding number of sequence with untrimmed coding ends at 39V b , 59D b , 39D b , and 59J b are shown in the last four columns. Each row shows the naïve (above) and memory (below) data.

(DOC)
Table S2 Summary of CD4 + naïve and memory data set. Number of in-frame, TCRb CDR3 nucleotide sequence, total reads and unique number of sequence obtained from the CD4 + CD45RO 2 CD45RA hi CD62L + (naïve) and CD4 + CD45RO + CD45RA low (memory) T-cell samples from each of the six donors are given in second and third columns respectively. The corresponding number of sequence with untrimmed coding ends at 39V b , 59D b , 39D b , and 59J b are shown in the last four columns. Each row shows the naïve (above) and memory (below) data. (DOC)

Table S3
TdT probabilities estimate for all cell types at each of the four coding ends. Probability estimate of N nucleotide insertion is shown based on independent assumption. In order to avoid any potential bias middle regions of N2 and N1 segments are considered for the calculation of average nucleotide frequency, described in details in Methods Overview. (DOC)

Table S4
Demographic characteristic of palindromic sequences. In each donor the fraction of palindromic inserts of up to six bases long is shown against the coding ends in CD8 + naïve T cells. The 59J b does not show any variation while the 39V b shows some fluctuations with regard to the HLA. The two related siblings, donor 1 and donor 3, show the similar pattern of palindromes at 39V b , 59D b and 59J b coding ends. (DOC)