Comparison of sequence- and structure-based antibody clustering approaches on simulated repertoire sequencing data

doi:10.1371/journal.pcbi.1013057

Table 1.

BCR repertoire clustering approaches.

Fig 1.

Antibody pair dataset characteristics.

A dataset of antibody pairs with similar function, i.e., highly overlapping antigen binding regions, was created and annotated. The final set contains 213 antibody pairs formed from 54 unique antibodies. A: The included antibodies bind to one of five protein antigens of well-researched species. Bars indicate the number of antibodies associated with each species; the dots indicate the sizes of the antibody pair clusters into which the antibodies group. The majority of antibodies bind to SARS-CoV-2-derived antigens. B: The CDRH3 amino acid sequence length of the included antibodies ranges from 9 to 23; the most common length is 11 amino acids. The diversity of CDRH3 sequence length is in agreement with previous observations. C: A scatter plot shows the CDRH3 amino acid sequence identity and epitope overlap of each antibody pair. Color and marker style indicate the antigen species. Only antibody pairings with an epitope overlap of at least 0.75 are included. A kernel density estimate plot indicates the distribution of CDRH3 sequence identity across the dataset.
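The pair selection described in panel C can be illustrated with a minimal sketch. The caption does not state how epitope overlap is computed, so the Jaccard-style definition below (shared epitope residues divided by the union of both epitopes) is an assumption, as are the function names and the toy data. Only the 0.75 threshold comes from the caption.

```python
from itertools import combinations


def epitope_overlap(epitope_a, epitope_b):
    """Jaccard-style overlap of two sets of antigen residue positions.

    ASSUMPTION: the source does not define the overlap measure; this is
    one plausible choice used purely for illustration.
    """
    a, b = set(epitope_a), set(epitope_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(epitopes, threshold=0.75):
    """Return (name_a, name_b, overlap) for antibody pairs that bind the
    same antigen and whose epitope overlap reaches the threshold."""
    pairs = []
    items = sorted(epitopes.items())  # sort by antibody name for stable output
    for (name_a, (ag_a, ep_a)), (name_b, (ag_b, ep_b)) in combinations(items, 2):
        if ag_a != ag_b:
            continue  # epitopes on different antigens are not comparable
        overlap = epitope_overlap(ep_a, ep_b)
        if overlap >= threshold:
            pairs.append((name_a, name_b, overlap))
    return pairs


# Hypothetical toy data: antibody -> (antigen, epitope residue positions)
toy_epitopes = {
    "ab1": ("antigenX", {10, 11, 12, 13}),
    "ab2": ("antigenX", {10, 11, 12, 13, 14}),
    "ab3": ("antigenX", {40, 41, 42}),
    "ab4": ("antigenY", {10, 11, 12, 13}),
}
selected = similar_pairs(toy_epitopes)
print(selected)
```

In this toy example only ab1 and ab2 pass the filter: they share four of five epitope residues on the same antigen, while ab4 has the same residue positions but on a different antigen.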

Fig 2.

Performance comparison of different clustering strategies.

The three approaches, clonotyping, SAAB+ and SPACE2, were applied to the repertoire. Their performance on the annotated set of 213 functionally similar antibody pairs was evaluated. A: Euler diagram showing the overlap of correctly clustered antibody pairs between methods. B: A scatter plot shows the CDRH3 sequence identity and epitope overlap of each antibody pair. Marker style indicates the antigen species. Color indicates which methods correctly grouped each antibody pair (see Euler diagram for color code). The majority of antibody pairs were not grouped together by any method (gray, 184 antibody pairs). C: All three strategies cluster antibody pairs with a significantly higher CDRH3 sequence identity compared to the full antibody pair set. D: The epitope overlap of clustered antibody pairs is similar between the methods, although SPACE2 identified antibody pairs with a slightly higher epitope overlap compared to the full set. Statistical significance was tested using the Wilcoxon rank-sum test.
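The comparison in panels C and D relies on the Wilcoxon rank-sum test. The sketch below shows how such a test works, using the normal approximation with average ranks for ties and no tie correction of the variance. It is a standard-library illustration rather than the authors' code, and the identity values are made-up toy numbers. In practice a library routine such as `scipy.stats.ranksums` would normally be used.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p), where z > 0 means x tends to have larger values than y.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expectation under the null
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical CDRH3 identities: pairs clustered by one method vs. the full set
clustered_identity = [0.82, 0.90, 0.75, 0.88, 0.95]
full_set_identity = [0.30, 0.45, 0.50, 0.35, 0.60, 0.40, 0.55, 0.80]
z_stat, p_value = rank_sum_test(clustered_identity, full_set_identity)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

A positive z with a small p-value would correspond to the situation described in panel C, where clustered pairs have higher CDRH3 identity than the full set.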

Table 2.

Results of simulated repertoire clustering.

Fig 3.

Limitations of clustering approaches.

Both IGX-Cluster and SPACE2 partition the antibodies before clustering based on CDRH3 sequence identity or CDR RMSD, respectively. Swarm and box plots show the distribution of antibody pairs across the partition strategies. Colored dots indicate the correctly identified antibody pairs for each method. A: Clonotyping partitions antibodies based on matching V and J genes. Antibodies with identical V and J gene usage have a higher sequence identity than antibodies that share only the V gene or neither gene. Partitioning based on only the V gene can improve the coverage slightly but is limited by the low sequence identity between these antibodies. B: SPACE2 partitions antibodies by requiring the same length in all six CDRs. The majority of antibody pairs (193, 90.61%) do not meet this requirement. Of the antibody pairs with the same CDR lengths, 70% (14 out of 20) are correctly grouped together. C: The natural logarithm of cluster sizes across the full repertoire dataset indicates how stringent the different partitioning strategies applied by either IGX-Cluster (dark and light green) or SPACE2 (blue) are. The criterion of identical CDR lengths is the most stringent, while requiring only the same V gene is the least stringent and leads to the largest cluster sizes.
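The three partitioning criteria compared in this figure (same V and J gene, same V gene only, same length in all six CDRs) can be sketched as grouping keys. This is an illustrative sketch and not the IGX-Cluster or SPACE2 implementation: the record layout, function names, and toy sequences are hypothetical. The final lines recompute the percentages quoted in panel B from the counts given in the caption.

```python
from collections import defaultdict


def partition(antibodies, key):
    """Group antibody names by the value of key(record)."""
    groups = defaultdict(list)
    for name, record in antibodies.items():
        groups[key(record)].append(name)
    return dict(groups)


def vj_key(record):
    """Clonotyping-style partition: identical V and J gene."""
    return (record["v_gene"], record["j_gene"])


def v_key(record):
    """Relaxed partition: identical V gene only."""
    return record["v_gene"]


def cdr_length_key(record):
    """Structure-style partition: identical length in all six CDRs."""
    return tuple(len(cdr) for cdr in record["cdrs"])


def same_partition(antibodies, pair, key):
    """True if both antibodies of a pair fall into the same partition."""
    a, b = pair
    return key(antibodies[a]) == key(antibodies[b])


# Hypothetical toy records; cdrs are ordered H1, H2, H3, L1, L2, L3
toy_antibodies = {
    "ab1": {"v_gene": "IGHV3-53", "j_gene": "IGHJ4",
            "cdrs": ["GFTVSSNY", "IYSGGST", "ARDLDYYGMDV", "QGISSY", "AAS", "QQLNSYPLT"]},
    "ab2": {"v_gene": "IGHV3-53", "j_gene": "IGHJ6",
            "cdrs": ["GFIVSSNY", "IYSGGTT", "ARDLGYYGMDV", "QGISSW", "AAS", "QQLNSYPIT"]},
    "ab3": {"v_gene": "IGHV1-69", "j_gene": "IGHJ4",
            "cdrs": ["GGTFSSYA", "IIPIFGTA", "ARGGYSSGWYFDY", "QSVSSY", "DAS", "QQRSNWPLT"]},
}

for label, key in [("V+J", vj_key), ("V only", v_key), ("CDR lengths", cdr_length_key)]:
    print(label, same_partition(toy_antibodies, ("ab1", "ab2"), key))

# Percentages quoted in panel B, recomputed from the counts in the caption
pct_different_lengths = round(100 * 193 / 213, 2)   # pairs failing the length criterion
pct_grouped_same_length = round(100 * 14 / 20, 2)   # same-length pairs grouped correctly
```

In the toy data, ab1 and ab2 share a V gene and all six CDR lengths but differ in J gene, so they are separated by the V+J criterion while remaining together under the other two.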

Fig 4.

Random clustering rates and sensitivity of clustering strategies across varying settings.

A: To calculate the random clustering rates, the cluster size distributions for the full repertoire created by each approach were gathered. The repertoire antibodies were then randomly assigned to clusters of the same sizes to infer how likely random assignment to the same cluster is for the antibody pairs. Low rates indicate a low probability of grouping functionally similar antibody pairs by chance. Relaxing the clustering requirements in the IGX-Cluster and SPACE2 setups increases the average cluster sizes and thus the random clustering rate, but it remains below the rate of SAAB+. B: Comparing different clustering settings shows that sensitivity can be increased while the precision remains at 100%, i.e., the number of falsely grouped antibody pairs does not increase. The sensitivity increase of SPACE2 is limited by the requirement of identical CDR lengths.
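The random clustering rate in panel A can be approximated with a short permutation sketch: keep the observed cluster sizes, shuffle the antibodies into clusters of those sizes, and count how often the annotated pairs end up together. The number of iterations, the seed, the function names, and the toy inputs are assumptions, since the caption does not give these details. For a single pair under uniform random assignment, the same quantity also has a closed form, sum of s(s-1) over clusters divided by N(N-1), which is included as a cross-check.

```python
import random


def random_clustering_rate(cluster_sizes, antibodies, pairs, n_iter=1000, seed=0):
    """Estimate how often annotated pairs share a cluster when antibodies are
    randomly assigned to clusters with the given (fixed) sizes."""
    names = list(antibodies)
    if sum(cluster_sizes) != len(names):
        raise ValueError("cluster sizes must sum to the number of antibodies")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(names)
        label = {}
        start = 0
        # Cut the shuffled list into consecutive chunks of the given sizes
        for cluster_id, size in enumerate(cluster_sizes):
            for name in names[start:start + size]:
                label[name] = cluster_id
            start += size
        hits += sum(label[a] == label[b] for a, b in pairs)
    return hits / (n_iter * len(pairs))


def expected_rate(cluster_sizes):
    """Closed-form probability that two distinct antibodies share a cluster
    under uniform random assignment to clusters of fixed sizes."""
    n = sum(cluster_sizes)
    return sum(s * (s - 1) for s in cluster_sizes) / (n * (n - 1))


# Hypothetical toy repertoire: 10 antibodies in clusters of size 5, 3 and 2
toy_sizes = [5, 3, 2]
toy_names = [f"ab{i}" for i in range(10)]
toy_pairs = [("ab0", "ab1"), ("ab2", "ab3")]

simulated = random_clustering_rate(toy_sizes, toy_names, toy_pairs, n_iter=5000)
analytic = expected_rate(toy_sizes)
print(f"simulated = {simulated:.3f}, analytic = {analytic:.3f}")
```

The closed form makes the trend described in the caption explicit: larger clusters contribute quadratically, so relaxed settings that produce bigger clusters raise the chance of co-clustering a pair at random.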
