Fig 1.
Schematic of the analysis pipeline.
The BCR or TCR repertoires of a cohort of healthy donors are sequenced from blood samples. Raw reads are processed by removing duplicates and extracting the CDR3. For a given cohort, the distribution of sharing numbers m is obtained by counting the number of sequences that are found in exactly m individuals. In parallel, sequences are used to train a probabilistic generative model, in a two-step process: first learning the recombination model Pgen from nonproductive sequences; second learning a selection model Q to describe productive sequences. The model is validated by comparing its prediction for the distribution of sharing numbers with the data. The repertoires of individuals currently or recently infected with SARS-CoV-2 are collected and processed as described for healthy donors. The distribution of sharing numbers is compared with the model prediction obtained previously, this time showing departures from the model, due to the enriched sharing of public COVID-associated clonotypes. These clonotypes are identified as those that are significantly more frequent in the cohort than predicted by the theory. The sequence features and organization of COVID-associated clonotypes are then analyzed and validated against databases of receptors with known SARS-CoV-2 specificity.
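The sharing-number count described above can be illustrated with a minimal sketch. It assumes each repertoire has already been reduced to a list of CDR3 strings; the function name `sharing_distribution` and the toy sequences are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def sharing_distribution(repertoires):
    """Map each sharing number m to the number of distinct sequences
    found in exactly m of the given repertoires."""
    per_sequence = Counter()
    for sequences in repertoires:
        # Deduplicate within a donor so a sequence counts once per individual.
        per_sequence.update(set(sequences))
    # Histogram over sharing numbers.
    return dict(Counter(per_sequence.values()))


# Toy cohort of three donors (made-up CDR3 strings, for illustration only).
toy_cohort = [
    ["CARDYW", "CARGGW", "CARDYW"],
    ["CARDYW", "CASSLW"],
    ["CARDYW", "CARGGW"],
]
distribution = sharing_distribution(toy_cohort)
for m in sorted(distribution):
    print(m, distribution[m])
```

Deduplicating within each donor first mirrors the duplicate-removal step in the caption, so the count reflects presence in an individual rather than read abundance.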
Fig 2.
Sharing of IgH and TCRβ CDR3 repertoires of healthy individuals.
(A) Distribution of the sharing number (the number of individuals in which a sequence is seen) of CDR3 amino-acid sequences of the heavy chains of IgM repertoires from 10 individuals. The prediction from the raw recombination model (Pgen, green line) underestimates sharing. Adding an ad hoc correction factor assuming a fraction q of sequences passing selection (q = 0.759 ± 0.001) gives a good fit (see S1 Fig). The prediction from the generation and selection model (Ppost, red line) reproduces the curve perfectly, with no need for a correction factor. (B) Distribution of sharing number for the IgG repertoires of the same donors. The analysis is done on the naive ancestors of reconstructed clonal lineages. The Pgen model is again inaccurate, requiring a correction factor q = 0.636 ± 0.005, while the Ppost model works well. (C) Comparison of the sharing number distribution between two equal-size cohorts of IgM and IgG repertoires. For an identical sequencing depth, IgG repertoires present a higher level of sharing, suggesting stronger convergent selection than in IgM repertoires. (D) Distribution of the sharing number of CDR3 amino-acid sequences of TCRβ from 666 patients. Model predictions are shown for Pgen and Ppost, with or without a correction factor q. The correction factor is q = 0.037 for Pgen, and q = 0.472 ± 0.002 for Ppost, indicating a better accuracy of the latter. In addition, the corrected Pgen model (black line) overestimates the number of sequences shared by all individuals relative to the data and to the corrected Ppost prediction (orange line). (E) Value of the correction factor q, interpreted as the inverse strength of convergent selection (q = 1: no selection; q ≪ 1: strong selection), for different subgroups of the TCRβ cohort. We observe stronger selection in CMV-positive individuals than in CMV-negative ones, reflecting their common antigenic exposure. Convergent selection also substantially increases with age.
An additional control to account for the possible convergent selection bias due to CMV+ content in each age group showed negligible influence. Significance was obtained with Student’s t test.
Fig 3.
Identification and analysis of COVID-associated antibody heavy chains from significantly shared sequences.
(A) Sharing number distribution of IgG heavy chain CDR3 from SARS-CoV-2 positive individuals. This distribution is compared with the sharing expectation in healthy individuals, obtained using the Ppost model. The discrepancy suggests the enrichment of these repertoires in SARS-CoV-2 specific antibodies. (B) Scatter plot showing the sequence probability derived from our theoretical model vs. the empirical probability of all shared sequences. The empirical probability is obtained by maximum-likelihood estimation: it is the parameter value of the probabilistic model under which the observed data are most probable. Red sequences are substantially more frequent in the data than predicted (posterior probability < 10⁻⁴), and are predicted to be associated with SARS-CoV-2. (C) IGHV gene usage in healthy (green) and COVID-19 (orange) repertoires, and among COVID-associated sequences (blue). (D) IGHJ gene usage for the same groups (same color code). (E) Convergent selection factor q learned from pairs of COVID-19 individuals in different severity groups. Individuals with more severe symptoms seem to have a higher level of selection, although the difference is not statistically significant (Student’s t test, p = 0.12).
Fig 4.
Network analysis of IgG heavy chains associated with SARS-CoV-2.
(A) Each node represents a CDR3aa heavy chain clonotype. Edges connect clonotypes with two or fewer amino acid mismatches in their CDR3 region, and with the same V and J segments. Unconnected nodes are not shown. Colored nodes represent sequences that are at most one amino acid mismatch away from previously reported SARS-CoV-2 neutralizing antibodies, with the color indicating the region of the virus recognized by the antibody. The size of the node is proportional to its sharing number, i.e. the number of people in our cohort where the sequence is found. (B) Network of light-chain CDR3 amino acid sequences found in previous reports to be paired with a colored heavy chain of (A). The level of conservation is even more remarkable than for heavy chains.
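The edge rule in (A) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: mismatches are counted as Hamming distance between CDR3s of equal length, pairs of different length are never connected, and the helper names and toy clonotypes are hypothetical.

```python
from itertools import combinations


def mismatches(a, b):
    """Number of differing positions; None when lengths differ
    (assumption: such pairs are not compared)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def build_edges(clonotypes, max_mismatches=2):
    """clonotypes: list of (cdr3_aa, v_gene, j_gene) tuples.
    Returns index pairs that satisfy the edge rule."""
    edges = []
    for (a, (s1, v1, j1)), (b, (s2, v2, j2)) in combinations(enumerate(clonotypes), 2):
        if v1 != v2 or j1 != j2:
            continue  # edges require identical V and J segments
        d = mismatches(s1, s2)
        if d is not None and d <= max_mismatches:
            edges.append((a, b))
    return edges


def connected_nodes(edges):
    """Nodes with at least one edge (isolated nodes are not drawn)."""
    return {node for edge in edges for node in edge}


# Toy clonotypes, for illustration only.
toy_clonotypes = [
    ("CARDYYGMDVW", "IGHV3-53", "IGHJ6"),
    ("CARDLYGMDVW", "IGHV3-53", "IGHJ6"),  # one mismatch from the first
    ("CARDYYGMDVW", "IGHV1-69", "IGHJ6"),  # same CDR3, different V segment
    ("CAKGGSYFDYW", "IGHV3-53", "IGHJ6"),  # too many mismatches
]
toy_edges = build_edges(toy_clonotypes)
print(toy_edges, connected_nodes(toy_edges))
```

The all-pairs loop is quadratic in the number of clonotypes, which is acceptable for a short list of candidate sequences; larger sets would call for grouping by V, J and CDR3 length first.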
Fig 5.
Identification and analysis of COVID-associated TCRβ.
(A) Sharing number distribution of CDR3 amino-acid TCRβ clonotypes from 1414 SARS-CoV-2 positive individuals. The model prediction from Ppost (trained on healthy donors, red curve) is good and almost indistinguishable from the one corrected for COVID-19 selection (orange curve), suggesting low convergent selection. (B) Model prediction versus empirical frequency for all shared CDR3 amino-acid clonotypes. Clonotypes marked in red are significantly more frequent in COVID-19 donors than expected. (C) TRBJ gene usage in generic healthy and COVID-19 repertoires, and in COVID-associated sequences. (D) Network analysis of TCRβ amino acid CDR3s of COVID-associated clonotypes. Edges connect clonotypes with two or fewer mismatches. Vertices are colored according to the location of their recognized antigen according to the MIRA assay (see main text). (E) TRBV gene usage for the same groups. Significance is obtained using Student’s t test.
Fig 6.
Repertoire-based SARS-CoV-2 diagnostics.
(A) Distribution of log-likelihood ratios calculated from the IgG repertoires of 30 individuals (20 SARS-CoV-2 positive, and 10 negative). Positive values of the score imply likely SARS-CoV-2 positivity. The test perfectly separates positive and negative individuals. (B) Distribution of probabilities of SARS-CoV-2 positivity obtained from logistic regression on the presence or absence of clonotypes in TCRβ repertoires (1,000 individuals from the testing set; model trained on 1,000 individuals from the training set). Values over 1/2 indicate likely SARS-CoV-2 positivity. The distributions of healthy and COVID-19 individuals have little overlap (93% specificity). (C) ROC curve obtained by tuning the positivity threshold in (B). The AUC is the area under the curve (AUC = 1: perfect discrimination; 1/2: no discrimination). The ROC curve of the likelihood test is shown for comparison.
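To make the AUC and the 1/2 threshold concrete, here is a minimal sketch. It uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive individual scores higher than a randomly chosen negative one. The function names and the scores are made up for illustration and are not values from the study.

```python
def roc_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half; equals the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_specificity(pos_scores, neg_scores, threshold=0.5):
    """Call an individual positive when the score exceeds the threshold."""
    true_pos = sum(s > threshold for s in pos_scores)
    true_neg = sum(s <= threshold for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


# Hypothetical predicted probabilities of positivity.
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.1, 0.3, 0.6]
print(roc_auc(toy_pos, toy_neg))
print(sensitivity_specificity(toy_pos, toy_neg))
```

Sweeping `threshold` from 0 to 1 and plotting sensitivity against 1 minus specificity traces the ROC curve of panel (C).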