Automated analysis of immunosequencing datasets reveals novel immunoglobulin D genes across diverse species

Immunoglobulin genes are formed through V(D)J recombination, which joins the variable (V), diversity (D), and joining (J) germline genes. Since variations in germline genes have been linked to various diseases, personalized immunogenomics focuses on finding alleles of germline genes across various patients. Although reconstruction of V and J genes is a well-studied problem, the more challenging task of reconstructing D genes remained open until the IgScout algorithm was developed in 2019. In this work, we address limitations of IgScout by developing a probabilistic MINING-D algorithm for D gene reconstruction, apply it to hundreds of immunosequencing datasets from multiple species, and validate the newly inferred D genes by analyzing diverse whole genome sequencing datasets and haplotyping heterozygous V genes.

It is easy to see that is maximized by one of the modified strings. This observation leads to a brute-force algorithm for solving the String Reconstruction Problem (with complexity O(|s|*N 2 )) that simply computes for each of the N modified strings. Below, we describe a O(|s|*N) algorithm for solving this problem that is linear in the input size.
We assume for simplicity that all modified strings are different. This is not a strict assumption as one can always add special symbols to distinguish all strings. We denote = log( -./ − 1) and search for a string that maximizes ( 4 ) 5 46/ . We denote a t-symbol prefix (t-prefix) of a string c as c t and the set of all t-prefixes of strings from C as C t . Given a string s and an integer t, we say that a string c is t-similar to s if t-prefixes of s and c coincide. The number of strings in C that are t-similar to s is denoted as sim t (C,s). Given a string s, score ; ; = score ;</ ;</ + ; , × log ;./ − 1 ; − 1 . (1) We use this recurrence to efficiently compute score(C|s) for each string s from C using dynamic programming. We construct a trie of all strings in C [1]. Each vertex in the trie is a t-prefix s t of a string from C, and we recursively compute score (C t |s t ) in each vertex of the trie using the above recurrence (assuming that the score of the root is × log( − 1)). The optimal string is the string corresponding to the leaf node with the maximum score ( Figure A). All scores can be computed by a single Depth First Search, assuming that all values sim t (C,s) are computed during the construction of the trie. The set of modified strings is shown on the left, and their trie is shown on the right. The string associated with each vertex is the one that is formed by traversing from the root node to the vertex. The number of leaves under each vertex is shown on the left. The scores for all vertices in the path from the root node to the leaf node with the maximum score are shown in blue. The leaf CATTAT is the optimal seed string.

Supplemental Note: Greedy Algorithm
The pseudocode of the greedy algorithm is as follows: greedy_string (

Supplemental Note: MINING-D Parameters
The most important parameter of the MINING-D algorithm is m, the number of seed k-mers. The default value of m should be different across species, since different numbers of D genes take part in the recombination process in each species. To decide on the default m for each species, we applied MINING-D to all datasets with different values of m. The results are shown in Table A. Based on the results in the table, we chose the following as the default values: human (m = 600), mouse (m= 300), rat (m = 300), rhesus macaque (m = 600), Bactrian camel (m = 300), and rabbit (m = 100).
The p-value threshold was chosen to be 10 -36 . This value achieves 80% power from the test with a sample size of 2000 when the effect size (deviation from uniform distribution) is medium, according to the definition of the medium effect for chi-squared test. Having a strict (very low) threshold on the p-value may lead to some missing nucleotides on the sides of the genes, but since we are also doing genomic validation, the whole gene can be recovered from the genomic reads. On the other hand, high p-value threshold will not only lead to extra nucleotides on the sides, it will also cause more extensions to be made from a single k-mer, leading to more false positives. As another test, we also tried to extend the known human IMGT genes in Healthy Human CDR3 datasets using this threshold. 95% of the time, no extension was made to any gene.

Supplemental Note: Defining Relative Positions
Looking at the relative positions of the extensions of k-mers has some advantages over looking at the relative positions of the k-mers. Since a relatively short k-mer can be a part of two of the three types of V, D, and J genes, the mean relative position among all the CDR3s of which such a k-mer is a substring can be misleading. Moreover, even if the k-mer is a substring of only one gene, the relative position of the extension gives a better estimate of the position of the CDR3 part of which the k-mer is a substring as illustrated in Figure C and Figure D.

Supplemental Note: Removing Unidirectional Extensions
Not all the unique extensions in the central cluster correspond to different D genes. Some of them are multiple reconstructions of the same D gene and are very similar to each other in the sense that they differ from each other by only a few nucleotides only at the edges. Most of them can be eliminated by making the observation that a highly abundant k-mer that the algorithm starts with might not always be, as a whole, a substring of a D gene. For example, the k-mer shown in Figure E can be among the highly abundant k-mers chosen to extend if the D gene shown in (a) is represented highly in the CDR3 sequences. When extended, it only extends to the right as shown in (c), retaining the random insertions in the k-mer. We can eliminate such unidirectional extensions because we expect some of the central k-mers of the D gene to also be among the highly abundant 10-mers. Such k-mers will be extended in both directions (bidirectional extensions), and by eliminating the unidirectional extensions, we reduce the number of reconstructions per D gene.
Formally, let the number of nucleotides added to the left and right of the k-mer be N L and N R , respectively. We put the following constraint on N L and N R : where is a parameter of the algorithm. We used = 0.5. The possible values of N L and N R with = 0.5 are shown in Table B

Supplemental Note: Immunosequencing Datasets
Summaries of all the human and non-human immunosequencing datasets analyzed in this study are shown in Table C and Table D Figure E. A highly abundant 10-mer (b) that is formed by random insertions (two nucleotides in the beginning) and 8 nucleotides from a highly abundant D gene (a). Since this 10-mer was not a substring of the D gene, its extension (c) is also not a substring of the D gene.

Supplemental Note: Benchmarking MINING-D against IgScout
We compared the results of IgScout and MINING-D on all datasets from the project PRJEB18926. The results are shown in Figure F. A gene is said to be present in a dataset if at least one variation of the gene is found in the dataset and missing otherwise. In most datasets, both IgScout and MINING-D miss three D genes with very low usage (IGHD1-14, IGHD1-20, IGHD6-25) and a very short IGHD7-27 gene (11 nt). These D genes are also reported as missing in multiple studies on analyzing the usage of D genes [2][3][4][5]. While IgScout also misses three more short D genes with low usage i.e., IGHD1-1, IGHD4-4, IGHD1-7, MINING-D infers these genes for some individuals. We also compared the MINING-D and IgScout results on non-human datasets. Tables E-G compare the results of IgScout and MINING-D on ten Mouse datasets (4 Balb/c mice, 4 C57BL/6 mice, and 2 pets), all Rat datasets, and all Camel datasets. Figure G presents the distributions of missing and extra nucleotide bases in the inferred genes (as compared to the IMGT genes for all mouse datasets) for both MINING-D and IgScout.

Table E. Comparison of IMGT genes inferred by IgScout and MINING-D in various Mouse datasets.
The gene IGHD2-10*01 (the only gene inferred by IgScout but missed by MINING-D) and the gene IGHD2-1*01 (inferred by MINING-D but missed by IgScout) only differ at the first position. Table E. Only genes that were inferred by both IgScout and MINING-D were included in the comparison.

Supplemental Note: Novel Variations
All the variations found using MINING-D for humans, camels, rhesus macaques, mice, rats, and rabbits are shown in Table H. The polymorphisms in the genes validated using genomic data are highlighted in red.

Usage of D genes in the Intestinal Repertoire dataset.
We analyzed the usage of D genes in datasets corresponding to memory and plasma cells, IgA and IgM isotypes from ileum and colon tissues from 4 individuals, and naive cells from ileum from 3 individuals ( Figure J). 43.5% of CDR3s on average were traceable in each dataset. For IgM naive cells (ileum mucosa), the number of traceable CDR3s was 71.42% on average, whereas for memory and plasma cells from the same tissue, it was 43.25% and 43.12%, respectively. The D gene IGHD3-3*01 was used significantly less in plasma and memory cells from both tissues compared to naive cells from ileum and PBMCs from healthy individuals (Figure 4). Similarly, the gene IGHD6-6*01 seems to be under-used in plasma and memory cells from the ileum tissue compared to naive cells. Subtle differences can also be found among the usage between different isotypes from the same individual's tissue, e.g., genes IGHD2-21*02, IGHD2-8*01, IGHD3-16*02, IGHD5-5*01/IGHD5-18*01, and IGHD7-27*01 are presented more in the IgM isotype than the IgA isotype in the colon tissue from individual 0.   Usage of D genes in mice datasets. Figure M shows usage of various known and novel genes/variations in different datasets corresponding to different strain, cell type, and tissue from mice. Usage of D genes in the Rhesus macaque datasets. 52.6% of CDR3s on average were traceable in each dataset. The usage of the IMGT genes and the validated novel genes and variants of known genes is shown in Figure N. Usage of D genes in the Camel datasets. 31.7% of CDR3s on average were traceable in each dataset. Although the small sample size (n = 3) limits generalizability, the low number of traceable CDR3s could be due to high level of hypermutation within the CDR3 region as compared to other species.
Since there is no IMGT database for camels, we used the alpaca IMGT database as a reference to analyze the usage. The usage of these genes and the validated novel variants of these genes is shown in Figure O. It can be seen that the D gene usage profiles are very different for the VH and the VHH isotypes within individuals, especially for individuals 2 and 3. Usage of D genes in the Rat datasets. 54.3% of CDR3s on average were traceable in each dataset. The usage of the IMGT genes and the validated novel variants is shown in Figure P. Genes belonging to the IGHD2 and IGHD3 families were underutilized as compared to other gene families, and the novel variants were among the genes that were utilized in most of the datasets. There is no clear distinction between the usage profiles between HuD and DNP immunized rats. This could be due to one or more of the following reasons: (a) the CDR3s here are from unsorted cells from spleen and not antigen specific cells; (b) the usage profiles of individuals might not be identical before immunization, hence masking the pattern if there was any.

Supplemental Note: Highly Used D Genes in Non-human Datasets
To find the genes with the highest usage among the datasets of a species, we picked the top 3 genes from each dataset. A gene is said to be highly used in all datasets from a species if it is one of the top 3 genes in at least 3 datasets. We found 3 highly used D genes for camels, 3 for macaques, and 4 for rats (Table R and Figure Q).

Supplemental Note: Benchmarking MINING-D on simulated CDR3s
We simulated 250,000 mutation-free CDR3s using the human D genes listed in the IMGT database (except for IGHD1-14*01, IGHD4-23*01, IGHD5-24*01) and IgSimulator tool [6]. We then generated four mutated versions of each of these CDR3s using mutation rates equal to 0.01, 0.05, 0.1, and 0.2. In total, we had one unmutated and four mutated datasets resulting in the average number of SHMs per CDR3 equal to 0, 0.7, 3.7, 7.4, and 14.8, respectively. For datasets with mutability < 0.1, MINING-D inferred all genes except for IGHD7-27*01 and one of the allelic variants of the gene IGHD2-2. There were no missing or additional nucleotide bases in the inferred D genes as compared to the D genes used for simulating the CDR3s. The missed gene IGHD7-27*01 is the shortest human D gene (11 nucleotides) that cannot be inferred using the default value of k (k=10) for MINING-D. MINING-D inferred only one of the allelic variants IGHD2-2*01 and IGHD2-2*03 since they differ only at the first base as shown below.
IGHD1-20*01 GGTATAACTGGAACGAC IGHD1-7*01 GGTATAACTGGAACTAC For the dataset with the mutation rate 0.2, multiple partial sequences were inferred per gene for some of the genes. However, there were no falsely inferred genes. For example, the following two sequences were inferred for the gene IGHD2-15*01

Supplemental Note: Benchmarking MINING-D on TCR datasets
We downloaded ten TRB cell datasets corresponding to 7 individuals from the immuneACCESS database (Table T). Each dataset consists of short sequences (~100 nt) fully covering CDR3s and partially covering V and J genes. Since our preprocessing step is not designed for such short sequences, we skipped CDR3 search step and used original sequences as an input for MINING-D. Information about the D genes in the TRBD locus and the datasets they were inferred from is provided in Table U. MINING-D inferred 2 genes in most of the datasets, however in 3 datasets, 3 genes were inferred. The falsely inferred genes (shown in  Table V. Information about falsely inferred D genes from TCR datasets using MINING-D. All three sequences are substrings of some V genes.

Supplemental Note: Non-genomic insertions in naive and cord blood Rep-Seq datasets
We compared V-D and D-J insertions in datasets extracted from cord blood and naive B cells ( Figure  R). The median length of V-D insertions is 5 and 4 nt in naive and cord blood datasets, respectively. The difference between the V-D insertion lengths in naive and cord blood datasets is statistically significant, p-value is 4.14e-25 (according to the Kruskal-Wallis test). The median length of D-J insertions is 3 and 0 nt in naive and cord blood datasets, respectively. The difference between the D-J insertion lengths in naive and cord blood datasets is statistically significant, p-value is 5.61e-251. Thus, we confirm that B cells from cord blood have shortened (or even missing) insertions compared to B cells from naive datasets.

Figure R. Lengths of V-D and D-J insertions in naive (blue) and cord blood (orange) datasets.
For the comparison, we used 3 naive datasets from the PRJNA355402 project (accession numbers are SRR5063084, SRR5063092, and SRR5063097) and 4 cord blood datasets from the PRJNA393446 project (SRR5811753, SRR5811754, SRR5811755, and SRR5811756). For better visualization, we discarded several long insertions from naive datasets (with lengths exceeding 20 nt).