Development of a Bioinformatics Framework for the Detection of Gene Conversion and the Analysis of Combinatorial Diversity in Immunoglobulin Heavy Chains in Four Cattle Breeds

We have developed a new bioinformatics framework for the analysis of rearranged bovine heavy chain immunoglobulin (Ig) variable regions by combining and refining widely used alignment algorithms. This bioinformatics framework allowed us to investigate alignments of heavy chain framework regions (FRHs) and the separate alignments of FRHs and heavy chain complementarity determining regions (CDRHs) to determine their germline origin in the four cattle breeds Aubrac, German Black Pied, German Simmental, and Holstein Friesian. Now it is also possible to specifically analyze Ig heavy chains possessing exceptionally long CDR3Hs. In order to gain more insight into breed specific differences in Ig combinatorial diversity, somatic hypermutations and putative gene conversions of IgG, we compared the dominantly transcribed variable (IGHV), diversity (IGHD), and joining (IGHJ) segments and their recombination in the four cattle breeds. The analysis revealed the use of 15 different IGHV segments, 21 IGHD segments, and two IGHJ segments with significant different transcription levels within the breeds. Furthermore, there are preferred rearrangements within the three groups of CDR3H lengths. In the sequences of group 2 (CDR3H lengths (L) of 11–47 amino acid residues (aa)) a higher number of recombination was observed than in sequences of group 1 (L≤10 aa) and 3 (L≥48 aa). The combinatorial diversity of germline IGHV, IGHD, and IGHJ-segments revealed 162 rearrangements that were significantly different. The few preferably rearranged gene segments within group 3 CDR3H regions may indicate specialized antibodies because this length is unique in cattle. The most important finding of this study, which was enabled by using the bioinformatics framework, is the discovery of strong evidence for gene conversion as a rare event using pseudogenes fulfilling all definitions for this particular diversification mechanism.


Introduction
The basic genetic mechanism in developing immunoglobulin diversity is similar in all jawed vertebrates. Immunoglobulins (Ig) are Y-shaped hetero-tetramers consisting of two identical heavy chains (IGH) and two identical light chains, either κ or λ in mammals (IGK, IGL) [1]. Both chains are functionally divided into variable and constant domains that are combined during B-cell development. The variable domain is rearranged by separate heavy and light chain variable (IGHV, IGKV, IGLV), diversity (IGHD), and joining (IGHJ, IGKJ, IGLJ) germline gene segments [2]. In addition, the imprecise junction of the germline gene segments and somatic hypermutations contribute to antibody diversity [3][4][5].
Gene conversions are difficult to detect especially within a large number of sequences e.g. like those obtained from high throughput sequencing. Gene conversion in immunoglobulins is characterized by clusters of nucleotide changes [14], sometimes only triplets [11], originating from upstream genes of the rearranged segment [4,13]. High degree of flanking homology of the conversion region ensures the genetic exchange [13], whereby 3 to 5 nucleotides seem to be the minimal overlapping requirement [11].
Detection of gene conversion in bovine IGHV is complicated due to the incomplete IGH locus annotation. The main bovine IGH locus was assigned to the Bos taurus autosome (BTA) 21 but exons coding for variable, diversity, and joining segments were also found on BTA7, BTA8, and BTA20 [16][17][18]. In silico locus and mapping analyses identified 36 IGHVs of which nine are functional and belong phylogenetically to the bovine IGHV family 1 (boVH1). The second bovine IGHV family consists solely of non-functional IGHVs that have not been identified in expression analyses yet. Eleven IGHV segment pairs shared 100% sequence identity, whereas two of these pairs contain a functional segment and either an ORF or a putative functional segment, respectively [17]. The high proportion of pseudogene segments leads to the assumption of their use in gene conversion events. Two IGHJ loci possessing six IGHJ segments were detected on BTA11 by BAC clone and locus-specific PCR analysis and were found to rearrange at low frequency while those located on BTA21 rearrange at high frequency. Only two out of these six IGHJ were classified as functional whereas one is involved predominantly in the recombination process [19,20]. Fifteen IGHD genes were detected and revealed a sub-cluster organization. IGHD are classified into four families and the IGHD exons revealed huge size differences [21,22]. The organization of the actual bovine germline repertoire and its possible allelic variants is incomplete and needs to be investigated in more detail [17]. Since, even the organization of the extensively studied human immunoglobulin germline repertoire is questioned and requires ongoing analyses [23].
In all rearranged bovine immunoglobulin isotypes, exceptionally long complementarity determining region 3 of the heavy chain (CDR3H) possessing up to 67 aa were described [17]. Together with IGHD2 and IGHJ1, the germline IGHV10/34 segment was found to be the only variable segment rearranged in these exceptionally long CDR3Hs [17,24], in a non-isotype dependent manner [17,25]. Furthermore, those specialized CDR3H possessing several cysteine residues enabling the formation of intra-CDR3H disulfide bonds. Together with the C-terminal part of IGHV10/34, which forms an ascending β-strand, the CDR3H is consequently exposed like a knob like structure on top of the β-strand stalk whereby the descending β-strand is formed by the C-terminal IGHD portion. There are no similar structures described yet [26].
An additional bovine specific mechanism for antibody diversification is the insertion of conserved short nucleotide sequences into the IGHV-IGHD junction, which was found in intermediate and exceptionally long CDR3Hs [24].
Currently available programs like IMGT/Junction Analysis [27], IMGT/V-QUEST [28,29] and IMTG/HIGHV-QUEST [30], VBASE2 [31], JoinSolver [32], iHMMun-align [33], and IgBLAST [34] allow the annotation of only the entire IGHV sequence to germline IGHV segments. Differentiated analysis of single parts is not directly possible. Most of the databases are focusing on mouse and human immunoglobulin genes (VBASE2, human, [31]). For cattle and other livestock or companion animals, separate databases have to be created (IgBLAST, [34]). Furthermore, the IMGT numbering system does not provide numbering for CDR3H larger than 31 aa. Placing of bovine intermediate as well as of exceptional long CDR3H in this numbering system is therefore not possible and consequently does not allow correct analysis of the rearrangement in those immunoglobulins. In addition, only IgBLAST allows the adjustment of parameters for IGHD identification.
Detailed genetic analysis of the pre-and post-immunization humoral immune response is important to describe the developing diversity and the effectiveness of vaccines and to detect possible individual and breed related differences including non-responders. As a conclusion those analyses help to develop fast recombinant antibodies for passive vaccination, therapy or diagnostic by genetically pre-selection of newly developed or abundant sequences.
As a first attempt to gain more insight into bovine breed specific differences, we compared the dominantly transcribed and the combinatorial diversity of germline IGHV, IGHD, and IGHJ segments as well as somatic hypermutations and putative gene conversions of IgG in the four cattle breeds Aubrac, German Simmental, German Black Pied and Holstein Friesian, by using a newly developed Bioinformatics framework. This new bioinformatics framework combines and extends several analysis tools and takes into account the unique specificities of bovine immunoglobulin sequences of exceptionally long CDR3Hs. In addition, it allows for the adaptation of alignment parameters for the single segments and enables the selective analysis of the different functional regions of the variable domain (namely framework regions and CDRs) to determine putative gene conversions. This new tool should facilitate a fast and detailed analysis of data sets generated by high throughput sequencing.

Results
For sequence analysis, we developed a new bioinformatics framework using MUSCLE [35,36] for the initial fast and accurate multiple nucleotide sequence alignment. Eventually, the sequence distances were calculated with ClustalW [37]. For nucleotide alignments of IGHV and IGHJ, default values of MUSCLE were used. To improve the biological significance of the assignment of germline and sample IGHDs, we tested three different procedures by changing parameters for gap opening/extension and including a scoring matrix for matches, transversions, and transitions. To determine the germline origin using the new bioinformatics framework, only the FRHs were aligned to avoid interference with the highly diversified CDRH [18]. To analyze possible gene conversion events, FR1-3Hs and CDR1-3Hs were extracted and aligned separately to the corresponding regions of the IGHV reference sequences to find the most similar germline segment.
For testing our bioinformatics tool, we established a sample sequence set for a detailed analysis of the transcribed bovine immunoglobulin repertoire. Blood samples were taken from 10 animals per cattle breed: Aubrac (A), German Simmental (GS), German Black Pied (GBP), and Holstein Friesian (HF).
In total, 160 IgG heavy chain sequences per breed (n = 640 sequences) were investigated as described above. The variable regions were identified and extracted at the 5' end (N-terminal end) using the nucleotide motif GCCTCCACC coding for AlaSerThr marking the start of the first constant region of all bovine IgGs. Due to premature Stop-codons or incompletely amplified variable regions, 131 sequences were excluded from further analyses. Consequently, 509 sequences remained: 137 in A, 116 in GS, 111 in GBP, and 145 in HF. Sequences analyzed are published under accession numbers KT761498-KT762006.
We were not able to unambiguously assign ten of 116 analyzed GS sequences to one germline IGHV; this was also true for 14 out of 111 GBP sequences, 12 out of 145 HF sequences, and 19 out of 137 analyzed sequences of A. Samples showing equal divergence to at least two germline IGHV included combinations of IGHV3/33, IGHV6, IGHV10/34, IGHV17(ORF)/31(F), IGHV1S28, and IGHV1S32-40. The most frequent multiple assignment affected IGHV6 and IGHV1S34, followed by IGHV6 and IGHV1S35 in all A, GS, and HF. IGHV6, IGHV1S34, and IGHV17(ORF)/31(F) could not be differentiated in two sequences of GS, one sequence of GBP and two sequences of HF. Ambiguously assigned germline IGHVs and multiple assignments are listed in Table 1.
Including the multiple assignments as described above, a total number of 595 possible transcribed germline IGHV were observed. Overall, the most frequent variable gene segment was IGHV1S39; this was identified in 17.65% of all sequences. This IGHV was used in 51 A sequences (n = 167, 30.54%), in 19 sequences of GBP (n = 140, 13.57%), in four sequences of GS (n = 130, 3.08%), and 31 sequences of HF (n = 158, 19.62%) (Fig 1, Table 2). IGHV3/33 was represented by 14.29% of all sequences. The number of transcribed IGHV3/33 varied from 15 in GS (11.54%) to 32 in HF (20.25%) (Fig 1, Table 2). In similar proportions of 13.45% and 12.27% germline IGHV36/29(F) and IGHV1S40 were used. IGHV1S28, IGHV1S34, and IGHV6 were transcribed in proportions of 8.07%, 8.57%, and 9.24%. The other transcribed IGHVs were identified in minor proportions of 0.17% to 5.04%, respectively. These rarely used IGHVs were identified once or twice in GS and HF but up to ten times in A (Fig 1). Very high significant differences were calculated for IGHV usage within the breeds and between the breeds (P<0.0001).
Separate analyses of FRH 1-3 and CDRH 1-2 to detect gene conversion events. In livestock such as chicken, rabbit, and cattle the use of pseudogene segments is known to contribute to immunoglobulin diversity [4,10,11,13,14,38,39]. Gene conversion in immunoglobulins is characterized by clusters of nucleotide changes [14], sometimes only triplets [11], originating from upstream genes of the rearranged segment [4,13]. A high degree of flanking homology of the conversion region ensures the genetic exchange [13], whereby three to five nucleotides have been shown to be the minimal overlapping requirement [11].
Consequently, FR1-3H and CDR1-2H were analyzed separately to identify mutations within the FR1-3H and CDR1-2H that would indicate possible gene conversion events. The current genomic annotation of germline immunoglobulin segments in cattle makes a correct prediction of 5' donor segments difficult. Nonetheless, larger contigs have been identified [17] and enabled us to show gene conversion events in bovine immunoglobulin heavy chains variable regions.
For instance, the calculated putative originating germline genes for nucleotide changes identified solely in the CDR2H region of KT761864 were IGHV4C/32C, IGHV9C/35C, and IGHV18C/30C (divergence 0.190). In the calculation covering the whole V-region and using only FR1-3H, IGHV6 was identified as the originating gene for KT761864 (divergence 0.075). All genes but IGHV18C/30C are located on BTA21, whereas IGHV4C/32C is located upstream of IGHV6 and is therefore most likely used for the gene conversion. There are two triplets in KT761864 and IGHV4C/32C that are different from IGHV6 due to a transversion mutation in the first changed triplet (from AAT in IGHV6 to TAT in KT761864 and IGHV4C/ 32C) and a transition mutation in the second changed triplet (from GAT in IGHV6 to AAT in KT761864 and IGHV4C/32C; Table 3).
Due to high sequence similarities between functional and pseudogenes, we initially assumed the preferable transcription of functional germline gene segments. If none of the functional germline segments was assigned as origin, we concentrated on pseudogenes that were calculated as unique source of the sample sequences considering the particular regions.
CDR3H length distribution. In all four cattle breeds the program identified very short CDR3Hs (less or equal 10 aa, group 1), CDR3Hs of intermediate length  aa, group 2) as well as exceptionally long CDR3Hs (at least 48 aa, group 3) (Fig 3, Table 5). Very high significant differences were calculated for the number of sequences within the 3 groups of lengths (P<0.0001) when they were compared between all breeds but also within the breeds (GS, GBP, A, HF: P<0.0001). Fourty-four (7.37%) sequences possessed a CDR3H length with ten or less amino acid (aa) residues. The highest amount of sequences within this group were identified in breed A (12.57%), followed by HF (7.55%), GBP (5.00%) and GS (3.05%). In the breeds A, GS, and HF one and four sequences possessed only four amino acid residues within the CDR3H. Five amino acid residues were the shortest CDR3H in sequences of GBP. CDR3H of group 2 were identified in 532 of all sequences analyzed (89.11%). 93.57% of GBP sequences were found to use 11 up to 47 aa. In HF, 89.31% of the sequences were attributed to this group as well as 83.83% of A and 90.84% of GS sequences. The most frequent CDR3H length was 22 aa, which was found in 10.78% of A sequences, in 8.57% of GBP, in 12.21% of GS, and in 10.69% HF sequences. Simultaneously, this length was identified preferably in CDR3H of A and HF. Nevertheless, in GS CDR3H with a length of 21 aa was identified CDR3Hs of group 3 were identified in 21 sequences of all four breeds (9.22%). The breed GS showed the highest number of these sequences (6.11%) followed by A (3.59%). German Black Pied and HF sequences possessed smaller proportions of the exceptionally long CDR3H with 3.14% and 1.43%, respectively. Whereas in GS and HF sequences with 65 aa were the longest CDR3Hs (1.53%, 0.6%), 63 and 54 aa were counted in the longest CDR3Hs of A (0.63%) and GBP (0.71%), respectively.
Assigning IGHD to their germline origin using 3 different procedures. We tested three different procedures to assign germline and sample IGHDs in order to improve the biological significance. At first, we applied the default values of MUSCLE, in procedure 2 we changed the penalties for gap opening (= -4) and gap extention (= -0.3), and in procedure 3 we additionally incorporated a modified scoring matrix (match +2) to evaluate transversion and transition mutations.
In all three procedures, the assignment of germline and transcribed IGHD revealed clear results for the sequences analyzed. Nevertheless, using procedure 3 we obtained results   matching short and long sample IGHD sequences best to germline short CDR3H and exceptionally long CDR3H, respectively, whereas procedure 1 and 2 assigned a major number of group 3 CDR3H to germline IGHD of moderate length. Twenty-one different germline IGHD were transcribed, whereby IGHD8 located on BTA21, IGHD3 located on BTA7, IGHD5 as well as the very short IGHDQ52 located on BTA8; these were preferred in all breeds (Fig 4, Table 6). While IGHDQ52, IGHDS10 and 14 [22] were solely transcribed in sense orientation, germline IGHD1 to IGHD8 gene segments were transcribed in antisense direction in 38 sequences distributed over all four breeds investigated. IGHD4 (antisense (as), [21]) was identified the most often, followed by IGHD1 (as, [40]), IGHD3 (as, [40]), IGHD2 (as, [40]), IGHD5 (as, [21]), IGHD6 (as, [21]), IGHD8 (as, [21]) and IGHD7 (as, [21]). Very high significant differences were calculated for the IGHD usage within the cattle breeds (P<0.0001) but not between the breeds (P = 0.06). In comparison, 20 different germline IGHD were transcribed using default conditions of MUSCLE for alignments (procedure 1), whereby IGHD8 being located on BTA21 was the preferred IGHD in all breeds. The very short IGHDQ52 located on BTA8 was transcribed in high frequencies in the breeds A, GBP, and HF (S1 Fig, S1 Table). While IGHDQ52, IGHDS10 and 14, and IGHD6 were solely transcribed in sense orientation, germline IGHD1 to IGHD8 gene segments were transcribed in antisense direction in 23 sequences that were distributed over all four breeds. IGHD4 (as, [21]) was identified the most often. Very high significant differences were calculated for the IGHD usage within the cattle breeds (P<0.0001) but not between the breeds (P = 0.1630). After changing penalties for gap opening and gap extension (procedure 2), 17 different germline IGHD were transcribed, whereby IGHD8 located on BTA21, was the most observed IGHD in all breeds (S2 Fig, S2 Table). Germline IGHD1, IGHD4, IGHD6, and IGHD8 gene segments were transcribed in as direction in 24 sequences that were distributed over all four breeds. Again, IGHD4 (as, [21]) was identified the most often. Very high significant differences were calculated for the IGHD usage within the cattle breeds (P<0.0001) but not between the breeds (P = 0.6654). Assigning the FR4H and IGHJ to their germline origin. Located on BTA21, IGHJ1 and IGHJ6 [22] were identified as origin to the transcribed gene segments, which defines the FR4H within the samples analyzed. IGHJ1 was transcribed preferably in the sequences investigated in all breeds (98.83%). IGHJ6 [22] (1.17%) was detected in only one sequence of each A, GS, and GBP (0.73%, 0.85%, 0.90%), as well as in three sequences of HF (2.05%) animals. One sequence of GS and HF, respectively, showed IGHJ1 and IGHJ6 as a possible originating germline segment ( Fig 5, Table 7). Statistical analysis revealed very high significant differences (P<0.0001) for the usage of IGHJ1 and IGHJ6 [22] within all cattle breeds. No significant differences were calculated between the four breeds.
Recombination of IGHV, IGHD, and IGHJ in different cattle breeds using procedure 3. Recombined IGHV, IGHD, and IGHJ were identified for each sequence and all possible frequencies were analyzed statistically within and between the breeds examined. In total, 597 recombinations were analyzed including double assigned germline origins for IGHV and IGHJ.
Twenty-one out of the 147 IGHV-IGHD-IGHJ combinations were identified in all four cattle breeds investigated. In animals of A, 81 different rearrangements were found, whereas in GBP 74 different recombinations were observed. German Simmental had 74 combinations and HF had 80 variations. Within the cattle breed A (167 rearranged sample sequences), the combination IGHV1S39-IGHD3 (s. BTA7)-IGHJ1 (AY158087) was the most frequently used in 5.99%. Chi square calculations revealed significant differences in usage frequencies of the recombinations in the breed A (P = 0.0108). In animals of HF (159 rearranged sample sequences), the dominant rearrangement was IGHV1S40-IGHD5 (s, BTA8)-IGHJ1 (AY158087) (4.40%). The Chi square test did not show significant usage differences (P = 0.0546) for the different rearranged gene segments in this breed. High significant differences were seen for the recombined IGHV-IGHD-IGHJ in the breeds GS (P = 0.0002), while in GBP no significant difference was found (P = 0.9585). In GS animals rearranged IGHV36/29(F)-IGHD8 (s, BTA21)-IGHJ1 (AY158087) (9.16%) was observed in at least six sequences (131 rearranged samples).

Recombination of IGHV, IGHD, and IGHJ in different cattle breeds using procedure 2.
When applying procedure 2 with changed values for gap opening and gap extension, we identified 119 different combinations of IGHV, IGHD, and IGHJ (S4 Fig, S5 Table) were identified. As seen for the other procedures, most combinations occurred in less than ten sequences. Calculations revealed very high significant differences in usage frequencies of the rearranged gene segments between the breeds investigated (P<0.0001). Fifteen different rearrangements were observed in a quantity of ten to 42 sequences and were observed across all four breeds. The major rearrangement was IGHV1S39-IGHDQ52 (s, BTA8)-IGHJ1 (AY158087) (7.04%). Beside the dominant combinations, 43 were identified solely in one sequence, whereby six were identified in A, 13 in GS, 12 in GBP, and 12 in HF. Again, seven variations rearranging IGHJ6 were identified.

Variability based on amino acid substitutions
We counted the amino acid substitutions at each position to calculate variability as described by Wu and Kabat [41]. The results are shown in the variability plots for each breed separately in Fig 7A-7D. The amino acid positions were numbered in accordance to the IMGT numbering systems [42]. Therefore, FRHs and CDRHs are defined by the following amino acid positions: FR1H: 1. Within the FR1-4H little variability is discernable whereas in CDR1-3H an increase in variability is observed as expected. A maximum of variability in FRHs is calculated for position 96 in FR3H. Variability at this position is 23.7 in A, in HF 19.0, whereas in GS and GBP variability is 12.7 and 17.5, respectively. In CDRHs, variability increases from CDR1H to CDR3H in all breeds. In CDR1H breed A showed the highest variability. In CDR2H variability varies between 58 in GS and 93 in GBP. GBP also showed highest variability in CDR3H. The lowest variability within CDR3H was found in GS. Both HF and A showed moderate variability of CDR2H and CDR3H in comparison to GBP and GS. Overall, variability was much higher at the transition areas between FRH and CDR than in the middle of the CDRHs.

Discussion
This study makes a substantial contribution to the analysis and understanding of the development of the transcribed bovine immunoglobulin repertoire. We examined possible gene conversions within the variable region of bovine heavy chains. We investigated the dominantly transcribed IGHV, IGHD, and IGHJ gene segments and their combinatorial diversity using a newly developed bioinformatics framework, which considers the unique specificity of exceptionally long CDR3 group of bovine immunoglobulin heavy chains. During the development of the program, we applied different conditions (procedure 1 to 3) to improve the alignments of the single gene segments. The progress of assigning germline IGHV, IGHD, and IGHJ to sample sequences is also shown in this study. Unlike previous studies, we complemented our investigation with the analysis of breed specific differences in the four different cattle breeds Aubrac, German Simmental, German Black Pied, and Holstein Friesian.
Due to the limited germline sequence divergence recent studies on bovine immunoglobulin genetics focused on antibody diversification strategies and the junctional diversity of the antibody repertoire. Specific diversification strategies were identified such as the generation of exceptionally long CDR3H [17,26,43], the insertion of conserved short nucleotide sequences (CSNS) at the IGHV-IGHD junction [24], the use of pseudogene fragments in lambda light chains as well as gene conversions [14], and somatic hypermutations independent of exposure to external antigens during B-cell development [44]. In addition, more germline gene segments were determined over the last few years [17,18,22]. Since the current bovine genome assembly is still incomplete, the full germline repertoire remains under active investigation.
Previous analyses of the bovine immunoglobulin repertoire aimed at identifying rearranged germline gene segments applied various software tools for sequence alignments. As these tools are of limited use for detailed analysis of rearranged immunoglobulin genes due to the difficult and error prone manual assembly of different genes, specialized software tools have been developed. The most familiar ones are IMGT/Junction Analysis [27], IMGT/V-QUEST [28,29], IMTG/HIGHV-QUEST [30], VBASE2 [31], JoinSolver [32], iHMMun-align [33], and IgBLAST [34]. Only IgBLAST enables the analysis of both nucleotide and protein sequences for FR/CDR and allows the user to either apply the numbering system of Kabat or the IMGT system [42,45]. Matching germline IGHV, IGHD, and IGHJ genes as well as details at rearrangement junctions may be analyzed. Searches against germline gene databases and other databases are possible [34]. Tools other than IgBLAST do not provide simultaneous database searches or the analysis of protein sequences. All of these immunoglobulin sequence analysis tools support organisms such as human, mouse, rat, rabbit and rhesus monkey, but representation of livestock is missing or incomplete. We did not apply these analysis tools, as they do not consider the bovine specific occurrence of exceptionally long CDR3H. Therefore, we developed a new bioinformatics framework to address this specific case.
In contrast to the analysis tools mentioned above, our program not only searches our updated bovine specific immunoglobulin germline gene database but also is able to load other pre-designed databases. Matches are identified on the basis of nucleotides as it is the case for all other tools. For the delineation of FR and CDR, we apply the IMGT nomenclature that is currently recommended and most widely used. In addition, we focused on the adjustment of search parameters for IGHV and IGHJ and especially for the identification of IGHD.
The analysis of Ig heavy chain variable regions in four cattle breeds revealed the usage of 15 different IGHV segments, 21 IGHD segments, as well as two IGHJ segments. IGHV1S39 was used most frequently followed by IGHV3/33. Rarely used IGHV segments were IGHV1S26, IGHV1S32, IGHV1S33, IGHV1S37, and IGHV1S38. In bovine fetal bone marrow, ileum, and spleen high frequencies of IGHV3/33 (= IGHV1S3/S20) and IGHV1S39 was observed as well as low frequencies of IGHV1S38 and IGHV1S26 [22]. The transcription of IGHV1S32 and S37 has not yet been described. Among the 20, 17, and 21 transcribed IGHD (regarding procedures 1-3), IGHDS8, IGHDS5, IGHDS10, and IGHDQ52 (= IGHDS9) were preferred in all breeds. IGHDS1 to IGHDS8 were also found to be transcribed in antisense direction in the third calculation procedure but in low numbers. Using the first procedure, IGHDS6 was not identified in antisense direction, and applying the second procedure IGHDS1, IGHDS4, IGHDS6, and IGHDS8 were shown to be transcribed in antisense orientation. Previous studies also elucidated the transcription of 14 IGHD, where the occurrence of IGHDS5 was the most frequent one and was present in 42% of the sequences analyzed in bovine fetus [22]. The assignment of the FR4H to germline IGHJ revealed the transcription of IGHJ1, and IGHJ6 [22] with IGHJ1 clearly preferred. In the cattle breed A, procedure 3 confirmed IGHV1S39-IGHD5-IGHJ1 as the most common recombination of gene segments which is identical to the most frequent finding in bovine fetus [22]. This recombination belongs to immunoglobulins possessing a CDR3H region of intermediate length. Statistical analyses showed significant different transcription levels of IGHV, IGHD, and IGHJ segments within the breeds.
The usage of pseudogene segments has already been described for animals such as chicken [4,38]. In bovine lambda light chains, fragments of pseudogenes were also shown to contribute to immunoglobulin diversity in a gene conversion process [14]. In the current analysis, possible gene conversion events were identified by the assignments of parental germline IGHV to separate FR1-3H and CDR1-3H. In addition to the IGHV identified for the complete variable region based exclusively on FR1-3H, several pseudogenes were assigned as possible originating germline IGHV in the separation analysis. We can exclude allelic variations due to their assignment to different boVH families. For instance, the pseudogenes IGHV4C/32C, IGHV9C/35C, and IGHV18C/30C belong to the boVH2 family [17], but seem to contribute to gene conversion events by nucleotide substitutions. In particular, IGHV4C/32C, which was mentioned in the example above meets the criteria for gene conversion such as the location upstream of the rearranged segment IGHV6 (boVH1) [4,13] and clusters of nucleotide changes [14]. Further, the flanking homology of the conversion region supports the genetic exchange [13] and the separation from IGHV6 by more than 18 kb on the genome allows looping during rearrangement [17]. In comparison, in chicken the nearest pseudogene is separated by 7 kb [4]. It should be noted that it is difficult to consider the order of gene segments to evaluate the plausibility of other gene conversions due to the incomplete annotation of the bovine genome [17,18]. Finally, our data indicate an exchange between the two bovine VH families which obviously is rare and which might be an influence of breed or method of analysis when compared to previous results [18].
The length distribution of CDR3H consists of short CDR3H (group 1), intermediate length CDR3H (group 2), and exceptionally long CDR3H (group 3) in all four cattle breeds. In the breed GS the highest percentage of group 3 CDR3H was calculated. The longest CDR3H with 65 aa were found in GS and HF sequences. The longest ever detected CDR3H in cattle was 67 aa long using IMGT numbering [17]. In contrast, four amino acids made up the shortest CDR3H in A animals, GS animals, and HF animals. The maximum length of group 2 CDR3H was 22 aa.
The combinatorial diversity of germline IGHV, IGHD, and IGHJ-segments is represented by 162 different rearrangements that were expressed with significant differences (procedure 3). In comparison, 147 and 119 different recombinations of IGHV-IGHD-IGHJ were identified using calculation procedures 1 and 2, respectively. In the four breeds, different rearrangements were found. In detail, 91 different combinations occurred in A (procedure 1: 81, and 2: 58), 74 in animals of GBP (procedure 1: 74, and 2: 65), 72 in GS animals (procedure 1: 74, and 2: 57), and 85 in the breed HF (procedure 1: 80, and 2: 62). Most of these combinations were observed in less than ten sequences but seven occurred in up to 21 sequences in all four cattle breeds examined (procedure 1: up to 21, and 2: up to 42).
In sequences belonging to group 1 CDR3H, combinations of IGHV3/33 and IGHJ1 (AY158087) together with IGHDQ52 (s, BTA8) dominated over all breeds using procedure 3. As IGHDQ52 is the shortest IGHD segment possessing only four amino acids, these results explain best the origin of short CDR3H. Group 3 CDR3H mostly exhibited IGHV10/ 34-IGHD2 (s, [40])-IGHJ1 (procedure 3), or IGHV10/34-IGHD8 (s, BTA21)-IGHJ1 (procedure 1 and 2) in all breeds. Only results from procedure 3 identified biological meaningful combinations of germline IGHV, IGHD, and IGHJ as it gave the best explanation for the origin of group 3 CDR3H. IGHD2 is the longest IGHD segment identified so far. Further, IGHV10/ 34, which is identical to IGHV1S1 and IGHV1S15, was found to contribute solely to those exceptional lengths [17,18,22]. Crystallization of two bovine IgG with exceptionally long CDR3H revealed that the "ThrThrValHisGln" terminal motif of IGHV10/34, that initiates an ascending β strand in the folded antibody enables the formation of the "stalk and knob" structure in addition to inserted conserved short nucleotide sequences (CSNS) [22,26]. Furthermore, in sequences of group 2 a higher number of recombinations were observed than in sequences of group 1 and 3. As group 3 CDR3H regions are unique in cattle, the few preferably rearranged gene segments within this group may indicate specialized antibodies.
Variability plots indicated quite similar features within the variable region in all breeds. Nevertheless, amino acid residues at each position vary between the samples and the breeds and within the regions. In FR1-4H little variability was calculated, whereas the variability increased from CDR1H to CDR3H, which was described already as concentrated areas of diversity in equine heavy and light chain CDRs [46,47]. The breed A possesses the highest amino acid variations in CDR1H and HF in CDR2H while GS exhibited lowest variability in these two CDRHs. Further, transition areas between FRHs and CDRHs had a higher variability than the middle of CDRHs. Position 96 in FR3H shows the highest variability within the FRHs. This residue is located on the outer surface of the variable region of the immunoglobulin molecule [26] within the area where the constant region is connected to the variable region. The high variability at this position may indicate an influence on the position of variable and constant region and their sterical orientation, which may affect light chain pairing as heavy chains possessing group 3 CDR3H are connected to a special type of lambda light chains [26,48].
Further analysis revealed that no amino acids were assigned to the IMGT amino acid positions 10, 31-34, 60-62, and 73 in cattle. This means, that one amino acid position within FR1H, four positions in CDR1H, three positions in CDR2H, as well as one position in FR3H were not filled. Consequently, in cattle 8 out of 12 amino acid positions within CDR1H are covered. In CDR2H, 10 positions are available and 7 are covered. Compared to FR-IMGT and CDR-IMGT lengths of functional and ORF IGHV-genes of human IGHV, mouse IGHV, rat IGHV, arabian camel IGHV, sheep IGHV, and pig IGHV the missing amino acids within FR1H and FR3H are conserved in all animals mentioned [49,50]. Averaged eight to ten amino acids were positioned in CDR1H whereas in CDR2H six to ten amino acids were placed by the IMGT numbering system in human, mouse, rat, camel, sheep, and pig IGHV. Therefore, the positions of missing amino acids are congruent with other species.
In the breed A, the highest number of recombinations and variability were observed when compared to the other breeds investigated. GS possessed the lowest number of recombinations and showed less variability except in the middle of the CDR3H region. This finding indicates the contribution of insertions and deletions to diversity in case of few rearrangements [25]. It should be noted that A and GS were kept under the same management in a mixed herd. GBP and HF were kept at different farms. The breeds kept in different areas were consequently exposed to different antigens. Thus, the individual number of rearrangements per breed and differences in variability additionally indicate a specialized immune response as animals on one farm are challenged with the same environment.
The application of the newly developed bioinformatics framework led to important new results. Our analyses demonstrated that the bovine heavy chain diversity is not restricted to the use of a limited number of germline genes although there are preferred rearrangements within the three groups of CDR3H lengths. We also found strong evidence for gene conversion using pseudogenes. Despite current advances in the understanding of bovine immunoglobulin diversification, future investigations of the germline repertoire are necessary.

Detailed analyses of immunoglobulin sequences using a newly developed bioinformatics framework
For sequence analysis, we developed a new bioinformatics framework using MUSCLE [35,36] for the initial fast but accurate multiple nucleotide sequence alignment and following ClustalW [37] for calculating the sequence distances after deduction of the amino acid residues. Both programs are available as stand-alone algorithms and were implemented into our program. The immunogenetics nomenclature (IMGT) was used to assign framework regions 1-4 (FR1-4H) and the complementarity determining regions 1-3 [42].
Therefore, germline nucleotide sequences were imported in FASTA-format (IGHV, IGHD, IGHJ [17,18,22]). Using the functional IGHV, the nucleotide sequences were translated into amino acids to number the codons of the functional germline IGHV gene segments according to the IMGT system from FR1H to FR3H. This required first the identification of the conserved and preassigned positions of Cys23, Trp41, Leu89, and Cys104 defined by Lefranc et al. [42]. Following, the nucleotide sequences of germline pseudo IGHV gene segments were aligned separately using MUSCLE [35,36] to obtain the putative open reading frame. The previously defined positions of the codons using the functional genes were transferred onto the pseudogenes. Insertions as well as deletions of nucleotides within the pseudo gene segment sequence were discarded. The last 33 nucleotides of germline IGHJs were then used to define the FR4H. The region between FR3H and FR4H is defined as CDR3H. This region was later used to align the sample CDR3H to germline IGHD segments. Defining FRHs and CDRHs allowed, beside the analyses of the complete transcribed genes, the alignment of IGHV, IGHD and IGHJ using different parameters to improve the biological significance as well as the analyses of the single functionally divergent regions to determine putative gene conversion events in those regions. All functional and pseudo gene germline segments are referred as reference sequences.
Framework regions and CDRs of transcribed sequences (designated as sample sequences) were aligned pairwise to the isolated reference sequences after isolation from the first IgG constant region. Following, minimal divergence was used to identify the most similar sample sequence and reference sequence pairing.
For nucleotide alignments of IGHV and IGHJ, default values of MUSCLE were used. We tested three different procedures to assign germline and sample IGHDs to improve biological significance. In procedure 1, we applied default values of MUSCLE [35,36], in procedure 2 we changed the penalties for gap opening to -4 and for gap extention to -0.3 [22], and in procedure 3 we additionally incorporated a new scoring matrix with match = 2, transversion = -1, and transition = 1 to evaluate transversion and transition mutations, whereby the IUB (international union of biochemistry) code for single and wobble bases was used.
The three procedures were applied to a set of sample sequences of IgG-derived variable regions from four different cattle breeds. Each nucleotide sequence of our sample sequences was aligned separately to the reference sequences to determine the most similar reference sequence as germline origin. Following, the codons were translated into amino acids. Sample sequences possessing premature Stop codons or not covering the full length of the variable region due to incomplete sequencing were eliminated and were not incorporated in further analyses. The remaining sample sequences were annotated in accordance to the IMGT nomenclature. For exceptional long CDR3H no positions are defined in the IMGT system, therefore positions had to be added as required and designated as 111.1-111.x and 112.y-112.1 in accordance to the IMGT numbering system [42]. To determine the germline origin, only the FRHs were aligned to avoid interference with the highly diversified CDRH [18]. To analyze possible gene conversion events, FR1-3Hs and CDR1-3Hs were extracted and aligned separately to the corresponding regions of the IGHV reference sequences to find the most similar one (Fig 8). The results were presented as an html table.
To display accumulation of amino acid substitutions in distinct segments of the variable region, variability was calculated as described by Wu and Kabat [41]. Thus, the frequency of the most common amino acid at a distinct position was calculated first. The number of the most abundant amino acid at a given position was divided by the number of all amino acids observed at this position. This means, only samples possessing an amino acid residue at this position in accordance to the IMGT nomenclature were considered. Subsequently, the number of different amino acids at the given position was divided by the frequency of the most common amino acid residue to determine variability. The variability results were written into a txtfile, which allows further analyses in statistical software.
For the statistical analyses of the distribution of IGHV, IGHD, and IGHJ segments including putative gene conversions within one breed and among breeds have been compared by applying non-parametric tests. Such test procedures, i.e. the Chi 2 -test for analyzing IGHV, IGHD and IGHJ independently and Fisher's exact test for the gene segment × breed contingency table, are implemented in the software package SAS, Version 9.2. Blood samples were collected from ten randomly chosen animals per breed during routine blood sampling for mandatory examinations in disease control. Peripheral blood mononuclear cells (PBMCs) were isolated using Ficoll gradients (GE Healthcare Europe GmbH, Germany) according to the manufacturer's protocol. Cells were stained with trypan blue and viable cells were counted. Total RNA was isolated from 1x10 7 cells using the RNeasy 1 Mini Kit (Qiagen, Germany). The first-strand cDNA was synthesized using pd(N) 6 primers from 3 μg of total RNA in a total volume of 20 μl (SuperScript™III First-Strand Synthesis SuperMix, Life Technologies GmbH, Germany).

Generation of the sample sequence set
Ethical statement. B-lymphocytes were isolated from already existing blood samples obtained from the tail vein by the herd veterinarian. Blood sampling of cattle is mandatory under German federal and state laws.
Amplification of immunoglobulin heavy-chain isotype restricted variable regions. To amplify the variable regions restricted to γ1-3 isotype heavy chains, a PCR was performed with primers binding within the leader region and the 3'UTR (bIgG_leader: ATG AAC CCA CTG TGG ACC; bIgG_3'UTR: CAG GAG GAA TGC ACA CAG). The primers were based on database sequence information and assigned to accession number X62916. The primer boIgG_leader anneals to position 22-39, and the primer boIgG_3'UTR to position 1518-1535. To monitor the integrity and purity of the cDNA, 527 bp of the bovine GAPDH (Glycerinaldehyde 3-phosphate dehydrogenase) were amplified as a positive control. A no-template control served as a negative control for the PCR. The total reaction volume of 50 μl included 0.67 μl of cDNA, 200 μM dNTPs (Bioline, Germany), 5 μl of 10x PCR buffer (75 mM Tris-HCl, pH 9.0; 2 mM MgCl 2 ; 50 mM KCl; 20 mM (NH 4 ) 2 SO 4 ), 0.4 μM of each primer, and 2 units of DNA polymerase (Biotools, Spain). PCR was performed under cycling conditions of 95°C for 5 min, followed by 35 cycles of 95°C for 1 min, 59.4°C for 1 min, 72°C for 2 min, and terminated with elongation at 72°C for 10 min. Length and purity of the PCR products were evaluated by means of electrophoresis on 1% agarose gels.
Cloning and sequencing of the PCR products. The PCR products were purified and concentrated using the MiniElute Gel Extraction Kit (Qiagen, Germany) in accordance to the manufacturer's protocol except QX1 buffer replaced QG buffer. Samples were eluted with 13 μl The bioinformatics framework analyzed the germline origin of bovine immunoglobulin heavy chain variable segments (IGHV), diversity segments (IGHD), and joining segments (IGHJ). One approach used the framework regions (FRH) 1 to 3 to determine the closest germline IGHV. The second approach analyzed single functional regions FRH1 to 3 and complementarity determining regions (CDRH) 1 and 2 independently to reveal putative gene conversion events. (C) The last step included the calculation of the variability and the statistical analyses.
doi:10.1371/journal.pone.0164567.g008 EB buffer (10 mM Tris-HCl, pH 8.5) and were stored at 4°C. Purified products were cloned into the pCR 1 2.1-TOPO 1 3.9 kb TA vector (Invitrogen TM , Karlsruhe, Germany) and transformed into chemically competent One Shot TOP10 E. coli cells (Invitrogen TM , Karlsruhe, Germany). Transformants were plated on LB agar containing 0.3 mM ampicillin, 40 μl 2.44 μM Xgal (5-bromo-4-chloro-3-indolyl-beta-D-galactopyranoside), and 40 μl 1 M IPTG (Isopropyl β-D-1-thiogalactopyranoside) for blue-white selection. After incubation at 37°C, overnight cultures of randomly selected white transformants were grown in a 5 ml LB-ampicillin broth. Plasmids were isolated using the MiniPrep Kit (Qiagen, Germany). In order to assess the insert size, plasmid DNA was cleaved with EcoRI (New England Biolabs, Germany) or a colony PCR was performed. Therefore a 25 μl mixture containing 2 μl cell culture, 0.4 μM of vector specific primers M13 (-20) Forward and M13 Reverse (Invitrogen, Germany), respectively, and one PCR-bead (GE Healthcare Europe GmbH, Germany) were used in a hot start PCR at 95°C for 5 min, denaturation at 95°C for 1 min, annealing at 60°C for 1 min, and extension at 72°C for 2 min up to a total of 30 cycles. A final extension at 72°C for 10 min was included after the final cycle before PCR mixtures were cooled down to 4°C. The size of the resulting fragments and of the PCR products was confirmed by agarose gel electrophoresis.