Skip to main content
Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Biophysicochemical motifs in T cell receptor sequences as a potential biomarker for high-grade serous ovarian carcinoma

  • Jared Ostmeyer,

    Roles Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Population and Data Sciences, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Elena Lucas,

    Roles Data curation, Validation, Writing – review & editing

    Affiliation Department of Pathology, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Scott Christley,

    Roles Methodology, Supervision, Writing – review & editing

    Affiliation Department of Population and Data Sciences, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Jayanthi Lea,

    Roles Writing – review & editing

    Affiliation Department of Obstetrics and Gynecology, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Nancy Monson,

    Roles Writing – review & editing

    Affiliation Department of Neurology and Neurotherapeutics, Department of Immunology, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Jasmin Tiro,

    Roles Writing – review & editing

    Affiliation Department of Population and Data Sciences, UT Southwestern Medical Center, Dallas, TX, United States of America

  • Lindsay G. Cowell

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Population and Data Sciences, Department of Immunology, UT Southwestern Medical Center, Dallas, TX, United States of America


We previously showed, in a pilot study with publicly available data, that T cell receptor (TCR) repertoires from tumor infiltrating lymphocytes (TILs) could be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions. We hypothesized that such motifs might allow development of a novel approach to cancer detection. The motifs were cancer specific and achieved high classification accuracy: we found distinct motifs for breast versus colorectal cancer-associated repertoires, and the colorectal cancer motif achieved 93% accuracy, while the breast cancer motif achieved 94% accuracy. In the current study, we sought to determine whether such motifs exist for ovarian cancer, a cancer type for which detection methods are urgently needed. We made two significant advances over the prior work. First, the prior study used patient-matched TILs and healthy repertoires, collecting healthy tissue adjacent to the tumors. The current study collected TILs from patients with high-grade serous ovarian carcinoma (HGSOC) and healthy ovary repertoires from cancer-free women undergoing hysterectomy/salpingo-oophorectomy for benign disease. Thus, the classification task is distinguishing women with cancer from women without cancer. Second, in the prior study, classification accuracy was measured by patient-hold-out cross-validation on the training data. In the current study, classification accuracy was additionally assessed on an independent cohort not used during model development to establish the generalizability of the motif to unseen data. Classification accuracy was 95% by patient-hold-out cross-validation on the training set and 80% when the model was applied to the blinded test set. The results on the blinded test set demonstrate a biophysicochemical TCR motif found overwhelmingly in women with HGSOC but rarely in women with healthy ovaries, strengthening the proposal that cancer detection approaches might benefit from incorporation of TCR motif-based biomarkers. Furthermore, these results call for studies on large cohorts to establish higher classification accuracies, as well as for studies in other cancer types.


Despite the tremendous genomic heterogeneity between cancers, there is evidence that cancer patients mount T cell responses against antigens they have in common, including tumor antigens. Shared tumor antigens can be generally classified into three categories: (1) self-antigens with dysregulated expression or increased copy numbers, such as MelanA, HER2, SOX2, and NY-ESO-1 [15], (2) altered self-antigens, such as recurrent oncogenic mutations, including BRAFV600E and CDKR24C [6] and TGF-βRII frameshift mutations [7], and (3) non-self-antigens–viral epitopes expressed by virus-induced cancers, such as those derived from Human Papilloma Virus [8, 9], Hepatitis B Virus [10], and Epstein Barr Virus [11]. Ovarian cancer is considered rich in the first category of shared tumor antigens, with relatively large percentages of ovarian cancers expressing MAGE-A1, MAGE-A3, NY-ESO-1, and others [12, 13]. In the case of the alpha folate receptor, 97% of ovarian cancers were found to express it, with the vast majority having moderate or strong expression levels, while only 63% of healthy ovaries were found to express it, and in all cases the expression was weak [14].

Evidence for T cell responses against shared tumor antigens comes from studies demonstrating the presence of T cells with binding capacity for, and reactivity to, the shared antigens [1, 3, 4, 1520]. Indeed, responses against shared tumor antigens may outnumber those against mutated neoantigens, including for highly mutated cancers such as melanoma [21, 22]. In addition to effector T cells responding to tumor antigens, a significant portion of the tumor-infiltrating lymphocyte (TIL) population is expected to be regulatory T cells that are reactive to tissue-restricted self-antigens associated with the organ of cancer origin, as these T cells are highly enriched in cancer lesions [21]. Thus, on balance, we expect much of a TIL population to be composed of T cells with specificity for antigens shared across cancer patients and not present, or present at significantly reduced levels, in cancer-free individuals.

We hypothesized that the above-described T cell responses could serve as the basis for cancer early detection biomarkers and sought to develop a method for detecting them that didn’t require knowledge of the target antigens and didn’t rely on the assumption that T cells responding to a common target would express T cell receptors with the same amino acid sequence. Utilizing publicly available TCR deep sequencing data, we applied multiple instance learning (MIL) after converting the TCR amino acid sequences to a biophysicochemical representation using Atchley Factors [2325]. We found that TCR repertoires from breast or colorectal cancer TILs could be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions [25]. The motifs were different between the two cancer types, and both achieved high classification accuracy. The colorectal cancer motif achieved 93% accuracy, while the breast cancer motif achieved 94% accuracy.

In the current study, we sought to establish the plausibility of using TCR motifs for ovarian cancer detection and applied our method to locally collected patient samples. We made two significant advances over the prior work. First, the prior study used patient-matched TILs and healthy repertoires, collecting healthy tissue adjacent to the tumors. Thus, the classification task was to distinguish two repertoires that had both been collected from an organ effected by cancer, one repertoire from within the cancerous lesion and one repertoire from a lesion-free region. The current study collected TILs from patients with high-grade serous ovarian carcinoma (HGSOC) and collected healthy ovary repertoires from cancer-free women undergoing hysterectomy with salpingo-oophorectomy for benign disease. Thus, the current classification task is distinguishing repertoires from women with HGSOC versus repertoires from women with healthy ovaries. The second advance comes from the opportunity to assess the motif on a blinded test data set. In the prior study, only a training data set was available, and classification accuracy was measured by patient-hold-out cross-validation. In the current study, both a training and test data set were available. Thus, in addition to assessing classification accuracy of the motif by patient-hold-out cross-validation, the ability of the motif to generalize to a new, independent cohort of data not used for motif discovery was assessed.

The current study revealed a TCR biophysicochemical motif present overwhelmingly in HGSOC TILs repertoires but rarely in healthy ovary repertoires. The motif is specific to HGSOC, i.e., it is different from the motifs previously identified for colorectal and breast cancer. The classification accuracy assessed by cross-validation on the training data was 95% (19/20). Applying the same model selection and cross-validation procedure to data with permutated labels resulted in an average classification accuracy of 55%, and the accuracies of all 20 permutations were < 95%. Application of the best model to the unseen test set resulted in a classification accuracy of 80% (16/20), indicating that the motif has some capacity to generalize. These results strengthen the proposal that cancer detection approaches might benefit from incorporation of TCR motif-based biomarkers and call for studies assessing the approach on large training and testing data sets and on additional cancer types.

Materials and methods


We obtained 40 archived tissue blocks from the Pathology Laboratories of Parkland Health and Hospital System and two university hospitals associated with UT Southwestern Medical Center (St. Paul Hospital and Clements University Hospital): 20 HGSOC specimens and 20 normal ovary specimens. The study was approved by the UT Southwestern Medical Center IRB, study number STU-2018-0239, with a waiver of consent because anonymized, archived, FFPE tissue blocks were used. We divided the blocks into two cohorts, one training cohort (Cohort I) and one test cohort (Cohort II), each with 10 HGSOC specimens and 10 normal ovary specimens (Table 1). In both cohorts, all donors were between 50 and 59 years of age. In the training cohort, eight of the HGSOC samples were stage IIIC, and two were stage IVB. The test cohort was more heterogeneous with respect to the HGSOC stage. Six were stage IIIC; one was stage IVB. The remaining three samples were stages IIA, IIB, and IIIA1(i). Of the 20 control samples, in addition to normal ovarian tissue, 12 had fallopian tube tissue within the block from which our samples were cut. In seven of the remaining eight cases, ovarian sections demonstrated serous (tubal-type) epithelial inclusions. Thus, these controls are representative of the tissue from which ovarian cancer is believed to arise. Tissue curls were sent to Adaptive Biotechnologies for sequencing of the TCR β (TCRB) locus at survey depth. Sequencing was based on genomic DNA, and, for the tissue blocks with fallopian tube tissue or epithelial inclusions, the DNA from the various tissue types was combined for sequencing. We did not quantify the number of TILs present in the tissue by immunohistochemistry prior to sequencing, but the number of unique TCR in Table 1 estimates the number of T cell clones present the sequencing sample [26, 27]. The study design is shown in Fig 1A.

Fig 1. Study overview.

(a) Ovarian samples are collected from patients with and without HGSOC cancer. High-throughput immune receptor sequencing reveals the TCRβ CDR3 sequences found in each tissue sample. (b) The CDR3 sequences are cut in motifs. In this example, a motif is assembled from three amino acid residues. Only a single residue from the CDR3 may be skipped, allowing for a single gap. Otherwise, the amino acid residues are contiguous neighbors. (c) Each amino acid residue is converted into a set of five chemical features using Atchley factors for a total of fifteen features describing the motif. The relative abundance of each motif is included as an additional sixteenth feature. (d) Each feature is multiplied by a weight (β1 through β16) that determines its relative importance, and a bias value (β0) is added to calculate a logit. The logit can be converted into a probability value for that motif. (e) The weights and bias value are picked such that there is at least one motif with a probability value close to 1 in each HGSOC sample and all motifs in each healthy ovary sample have a probability close to 0.

Table 1. Patient characteristics.

Age, stage, patient diagnosis, and the number of unique TCRB sequences for each sample in the training and validation cohorts.

The data are freely available from the VDJServer Community Data Portal (CDP) ( under the project accession 3276777473314001386-242ac116-0001-012 [28]. The sequences are available in FASTA format in the “Browse Project Data” section. Annotated alignments are available in the tab-separated-values format recommended by the Adaptive Immune Receptor Repertoire Community in the “View Analyses and Results” section as output from IgBlast [29, 30].

Representing TCRs

As previously described [25], we analyzed X-ray crystallographic structures of human TCRs bound to peptide–MHC complex obtained from the Protein Data Bank in order to determine how to represent TCRB sequence in a way that would capture the antigen binding capabilities of the corresponding TCRB chain. We focused on complementarity determining region 3 (CDR3), because it is the somatically generated portion of the gene and the primary determinant of the chain’s antigen-binding specificity. We also focused on residues that directly contact peptide in a peptide–MHC complex. The crystal structure analysis revealed that TCRB CDR3 residues in contact with peptide tend to lie near each other, forming a local neighborhood of contact residues. The size and relative location of this neighborhood varied, but it rarely included any of the first or last three CDR3 residues, its average length was four, and in ~25% of cases, a non-contact residue was interspersed between the contact residues. Thus, to capture CDR3 contact residues, we excluded the first and last three CDR3 residues and partitioned the remaining sequence into every possible contiguous strip of three amino acid residues, referred to as a motif (Fig 1B). We also allowed one residue in the CDR3 sequence to be skipped when assembling a motif (Fig 1B). Such skipped residues are referred to as a gap. Our expectation is that, for each TCRB CDR3, at least one of its motifs contains residues that contact the peptide component of the receptor's cognate antigen. Alternative models were considered but exhibited reduced performance (Table 2).

Table 2. Different model configurations evaluated on Cohort 1.

Each row represents a different model, and the columns describe the configuration of each model. The first row (bold font) corresponds to the model configuration with the best performance for the breast and colorectal cancer datasets [25]. The second row (bold underlined font) corresponds to the best performing model configuration presented here. The first column indicates the number of amino acid residues in the motif. The second column indicates the number of CDR3 amino acid residues that could be skipped when assembling a motif. For example, if the value is 2, then 2 CDR3 amino acid residues could be skipped. The third column indicates if binary indicators indicating whether the corresponding CDR3 residue was ignored were used. For example, if a CDR3 residue was ignored but would have been in the third position of a motif if it had been included, then the 3rd indicator would have a value of 1. The fourth column indicates if an amino acid was skipped in the CDR3 for the given position in the motif. The fifth column indicates if the expected frequency of the motif in blood was included as a feature. The expected frequency was estimated using publicly available data from 786 presumed healthy individuals [31]. The sixth column indicates if the log of the motif relative abundance was used for the relative abundance term. Column 7 indicates if each feature is squared and used as an additional feature, resulting in 2nd order terms in the model. Column 8 indicates if batch normalization was used. Column 9 (fourth from last) is the log-loss averaged across the one-holdout cross-validations. Column 10 (third from last) is the accuracy computed over the one-holdout cross-validations. Column 11 (second from last) is the number of gradient steps used to fit the model as determined by early-stopping. Column 12 is the number of fits to the training data, of which the best fit to the training data is applied to the holdout sample.

When different TCRs bind the same peptide, the TCRB CDR3 contact residues may be different amino acids across the different TCRs. Thus, to identify motifs with different amino acid sequences but similar antigen-binding capabilities, we represented each motif using numerical values for the biophysicochemical properties of its component amino acids. We used Atchley factors as the biophysicochemical descriptors [24]. Atchley factors were derived from a set of over 50 amino acid properties by identifying clusters of properties that co-vary. The five Atchley factor values for each amino acid residue correspond loosely to its polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. For input into our model, each amino acid residue was represented by a vector of its five Atchley factor values (Fig 1C). With three amino acid residues in a motif, there are a total of 15 Atchley factor values that we represent using the symbols f1 through f15.

T-cells undergo clonal expansion in response to antigen stimulation, creating copies of the T-cell and its receptor. Thus, the quantity of a motif can indicate whether receptors containing it have encountered their cognate antigen. We therefore included an estimator of motif quantity as a feature in the model. When calculating the relative abundance of a motif in each sample, we identified every TCRB sequence containing the motif in its CDR3 and summed over the sequences’ template counts, CCDR3. This provided the motif count, Cmotif. We then divided by the total count of all motifs in the sample, T, to get the motif’s relative abundance, fq. As with most statistical classifiers, it is important to normalize the input into the model, i.e., the model features. For each feature, we calculated a weighted mean and variance of the feature values over all motifs, where the weights were the relative abundances of each motif, fq. Thus, motifs that appear more frequently exerted a greater influence on feature mean and variance than motifs that appeared only once or a few times. We then subtracted the mean from each feature value and divided by the square root of the variance to obtain a normalized value for model input.

Logistic model

The 15 Atchley factor values for the three motif residues along with the motif’s relative abundance were combined into a single vector [f1, f2, …, f15, fq] representing the features of a motif. Every motif was scored on the basis of these features using a logistic function that calculates the probability that a motif was derived from a HGSOC-associated repertoire. We used the logistic function because of its widespread use and simplicity, and because it models the outcome of a two-category process. The first step was to compute a biased, weighted sum of the features, referred to as a logit. The logit for the ith motif is represented as li. (1) The bias term β0 along with the weights β1 through β16 are the parameters of the model and were fit using gradient descent optimization techniques as described below. Every motif was scored using the same values for the weights and bias term.

Once the logit was computed, the value was passed through the sigmoid function to obtain a probability value between 0 and 1 for the ith motif.


Aggregating motif probabilities (multiple instance learning)

To predict whether a repertoire was derived from HGSOC or healthy ovarian tissue, the probabilities assigned to each motif must be aggregated into a single value that predicts the repertoire-level label. This problem of predicting a label for a set of objects from the scores for the individual objects in the set can be formally described as MIL [23]. According to the standard assumption of MIL, at least one motif from HGSOC-derived repertoires must have a high probability, while none of the motifs from healthy tissue-derived repertoires should have a high probability. This assumption can be implemented by simply taking as the repertoire score the maximum score over all motif scores in the repertoire. Thus, the probability that repertoire j is tumor-derived given the individual motif probabilities was computed as: (3) Using 0.5 as the threshold score, a repertoire is predicted to be tumor-derived when at least one motif is scored with a probability ≥ 0.5. The predictions from Eq (3) were used to fit the model’s parameters using Cohort I.

Parameter fitting

Values for the bias term β0 and weights β1 through β16 were selected to maximize the probability that each prediction from Eq (3) is correct, assigning tumor-derived repertoires a probability close to 1 and healthy ovary-derived repertoires a probability close to 0, using the following objective function: (4) where Eq (3) introduces a nonlinear operation into the model, resulting in a model that cannot be fitted using standard optimization techniques for logistic regression. Thus, gradient optimization techniques were used, such that each parameter is iteratively adjusted along the gradient in a direction that maximizes the log-likelihood, which in turn maximizes the likelihood that each prediction is correct (Fig 2).

Fig 2. Workflow for model selection and parameter fitting.

(a) Left Panel: The diagram shows how cohort I is used to train and evaluate each model. Model performance is evaluated by an exhaustive 1-holdout cross-validation using only cohort I. (b) Right Panel: The diagram shows how the best performing model is evaluated with unseen test data. The best performing model is refitted to all the samples in cohort I, and then used to score the test samples from cohort II. The same random initial β-coefficients are reused from (a) when refitting the best performing model in (b).

Because gradient optimization techniques are sensitive to the initial values of the bias and weight terms, the initial values must be carefully selected. As is typical, the bias term β0 was initialized to 0. The weight values β1 through β16 were initialized using two different distributions, depending on the feature. Weight values β1 through β15 are for Atchley factors, and initial values were sampled from . The weight value for β16 is for the relative abundance of each motif, and an initial value was sampled from . This initialization scheme ensures that the contribution from the 15 Atchley factors has the same expected magnitude as the contribution from the motifs’ relative abundances.

Next, the Adam optimizer, a gradient descent-based optimizer, was run for 2,500 iterations with a step size of 0.01 [32]. Default values for the other Adam optimizer settings were used (b1 = 0.9, b2 0.999, ε =10−8). A limitation of gradient descent-based methods is there is no guarantee of finding the globally optimal solution. Although the logistic model is a linear model, the motif probabilities were aggregated together in a non-linear fashion, and multiple local minima could exist. To address this, 217 = 131,072 runs of Adam optimization, each starting from different initial values as described in the previous paragraph, were used, and the best fit solution over all runs was used to classify new samples. By identifying the best fit to the training data across a huge number of runs, we were attempting to find the globally optimal solution.


Overfitting is always a concern with any statistical classifier. We previously found that L1/L2 regularization and dropout worsened the performance of our approach, perhaps because of the highly non-linear characteristics of our model [33]. Thus, we did not apply them in this study, but we did apply early stopping, as described below, which we previously found to significantly improve the model’s performance.

Model selection and testing

We used Cohort I for model selection and validation by patient-hold-out cross-validation (Fig 2, left panel). The same initial values for the weights β1 through β16 were reused with each cross-validation, ensuring the only variation between runs was due to the patient sample being held out, and not because each cross-validation used differential initial values for the weights β1 through β16. We evaluated multiple models (Table 2) and selected as the best model the one with the lowest average negative log-likelihood on patient hold-out cross-validation. Briefly, we considered motifs of either three or four amino acid residues. We also considered gaps, where we allowed either one or two amino acid residues from the CDR3 to be skipped when assembling each possible motif. Additional features and modifications to the model were considered, as indicated in Table 2. For each model, the optimal number of gradient optimization steps was determined by examining the average log-likelihood of each model at each training step in the patient-holdout cross-validation.

To account for model selection bias, the phenomenon whereby we identified a model that performs well on Cohort I without having discovered a generalizable signal, we evaluated the selected model on the test cohort, Cohort II (Fig 2, right panel). The weights and bias term for the best model identified via cross-validation on Cohort I were refit using all 20 Cohort I samples, and then Cohort II samples were scored.

The code is available here:


Cohort I

The best performing model by patient-holdout cross-validation on Cohort I used a motif of three amino acid residues and allowed for a single gap (Table 2). Under that model, the average number of motifs per tumor sample was 7,683.3, and the average number of motifs per healthy sample was 6,154.2. The largest number of motifs in any sample was 13,277. The best average log-likelihood was observed at the last (2,500th) gradient optimization step. The model correctly classified 95% (19/20) of held-out samples with an average log-likelihood of 0.332 bits (Fig 3A). The model correctly classified all healthy ovarian samples, giving a specificity of 100%, although one was quite close to the threshold score of 0.5. The model correctly classified all but one tumor sample, giving a sensitivity of 90%. To estimate the probability of correctly classifying 19 of 20 samples by chance, we performed a permutation analysis with 20 permutation runs. For each permutation, the sample labels were permuted and then patient-holdout cross-validation was performed. Early stopping was applied. The classification accuracies of all 20 permutations were < 95%, allowing us to assign p < 0.05 to the observed accuracy (Table 3). The average log-likelihood over all permutations was 0.993 bits, and the average accuracy was 55%.

Fig 3. Results.

(a) Classification results obtained by leave-out cross-validation for each patient in Cohort I. (b) Illustration of the classifier weights averaged across all 20 cross-validation runs (error bars for the standard deviation are omitted because the range was too small to plot relative to the size of each arrow). For each of the five Atchley factors, the weights are shown for the three residue positions. The weight for the log-frequency of the receptor is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. (c) All motifs with a score above 0.5 (middle column) are shown for the 20 patient samples. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the right most column indicates the number of times the motif is observed in the sample. (d) Classification results obtained on Cohort II test samples. (e) The ROC curve shows true and false positive rates for different thresholds of a positive diagnosis based on the model applied to Cohort II. The area under the curve is 0.79. (f) All motifs with a score above 0.5 (middle column) shown for the 20 patient samples in Cohort II. Each motif is shown in the context of its respective CDR3. The leftmost column indicates the patient and the right most column indicates the number of times the motif is observed in the sample.

Table 3. Permutation results.

Each row corresponds to a single permutation of the Cohort I data set, indicated in column 1. The second column shows the loss averaged over all patient-hold-out cross-validations. The third column shows the classification accuracy over all patient-hold-out cross-validations. The fourth column shows the fitting step, out of 2500, at which the lowest average loss was observed.

To discern the features that increase the probability of a HGSOC categorization, we examined the model weights across all 20 cross-validation runs (Fig 3B). The weights reveal how each Atchley factor contributes to the score and the relative importance of each position in the motif. Motifs with a positively charged, hydrophilic residue that tends to participate in alpha-helices in position 1, followed by a small residue that tends to participate in bends and coils in position 2, followed by a large, positively charged residue in position 3 will be scored by the model with a high probability of deriving from a HGSOC-associated repertoire. The weight for the relative abundance of the motif is positive indicating that more abundant motifs would have a higher probability than less abundant motifs.

We aligned the high scoring motifs from each holdout sample and present them within the context of the CDR3 sequences from which they originated (Fig 3C). The motifs varied in terms of their component residues, but a restricted set of amino acids was observed at each position. Amino acids Glutamic acid, Lysine, and Arginine were common in position 1, Tryptophan and Tyrosine were common in position 2, and Histidine and Tryptophan were common in position 3. We also determined the number of times each CDR3 appeared in each sample and noted that most of them appear only once. None of the CDR3 sequences are shared across patients.

Cohort II

Given the potential for overfitting and model selection bias, we assessed the model’s performance on samples not used for model selection or parameter fitting, i.e., on Cohort II. After selecting the best performing model using cross-validation on Cohort I, as described above, we then refit the parameters of the selected model using all 20 Cohort I samples using 2,500 gradient optimization steps, which was determined to be the optimal number of steps in the cross-validation (Table 2). The resulting weights β1 through β16 appear indistinguishable from those in Fig 3C.

The newly fitted model was then applied to Cohort II and correctly classified 80% (16/20) of the samples with an average log-likelihood fit of 0.821 bits. The model correctly classified all but one healthy ovarian sample (specificity 90%) and misclassified three tumor samples (sensitivity 70%) (Fig 3D). The area under the Receiver Operating Characteristic (ROC) curve was 0.79 (Fig 3E).

We aligned the high scoring motifs from the Cohort II samples and present them within the context of the CDR3 sequences from which they originated (Fig 3F). As with the Cohort I motifs, the amino acid residues present at each position vary, but the variability is restricted to a subset. As with the Cohort I samples, amino acids Glutamic Acid and Lysine are common in position 1, Tryptophan and Tyrosine are common in position 2, and Histidine, and Tryptophan are common in position 3. In contrast, Arginine was common in position 1 of Cohort I motifs but is found in position 1 of only one Cohort II motif, and Aspartic Acid is common in position 3 of Cohort II motifs but was not observed in position 3 of Cohort I motifs. As with Cohort I, we found that the majority of CDR3s containing high-scoring motifs were present only one time in their sample.


We previously hypothesized that T cell responses against antigens shared among cancer patients might enable development of a new approach to cancer detection [25]. Shared tumor antigens are not favored for antigen-targeted immunotherapy where the goal is to elicit such a high degree of tumor-cell killing that the tumor is eradicated. In that case, antigens with expression patterns highly-restricted to the tumor and that are targeted by high-affinity TCRs are needed. For cancer detection, however, it is only necessary that the corresponding TCRs be present in patients with the cancer and not in those without or that they be present with an elevated abundance in those with cancer relative to those without.

To determine whether such T cell responses might enable cancer detection, we first sought to develop a method for identifying the corresponding TCRs that didn’t require knowledge of the target antigens and didn’t rely on the assumption that T cells responding to a common target would express TCRs with the same amino acid sequence. To accomplish this, we developed the method described here, converting amino acid sequences into numerical vectors whose components correspond to amino acid biophysicochemical values, such as charge, and applying multiple instance learning. In all cases in which the method has been applied, it has identified a motif that can distinguish the tissue or patient groups of interest with solid performance [25, 33]. We hypothesize that TCRs bearing these motifs have overlapping antigen binding profiles and are concentrated in cancer tissue due to the presence of a common antigen there. This is a hypothesis that will have to be tested experimentally, but the strong classification performance of the motifs warrants further study, despite uncertainty regarding any shared antigen specificity.

In our first application of this method to TCRs, we considered motifs of four residues and did not allow gaps [25]. Additionally, we took the natural logarithm of the motif relative abundance term. Taking that same model and fitting the weight values on Cohort I, we obtained a classification accuracy of 90% with a likelihood error of 0.666 (Table 2). To determine whether we could improve the performance, we explored additional models not considered in our prior work (Table 2). The best performing model used a three-residue motif allowing for one gap and achieved a classification accuracy of 95% with a likelihood error of 0.332 (Table 2). Thus, while the approach has produced good results across multiple cancer types, each one has required optimization of the motif representation to obtain the best performance. Additional innovation to the modeling approach is required to produce a method that works across multiple cancer types without this customization.

Whenever multiple models are evaluated on the same data and the best performing model is selected, model selection bias can occur. To determine the extent of model selection bias in our Cohort I result, we evaluated the selected model’s performance on Cohort II, which is wholly unseen (i.e., not used for parameter fitting or model selection). The classification accuracy on Cohort II is 80% with a likelihood error of 0.821. Reduced performance on test data is expected, and these results indicate that the model has identified a signal that is expected to generalize to new samples with 80% accuracy.

We have applied the method to three cancer types and in each case identified a distinct biophysicochemical motif. While for breast cancer, all receptors bearing the motif were of high abundance, and in some cases were the top most abundant clone, for colorectal cancer, all but a few of the motif-bearing clones were of low abundance [25]. In the case of ovarian cancer, we again observed that motif-bearing clones are of low abundance, and in fact, in all but a few cases, the corresponding CDR3 sequences were observed in the sample only a single time. While this is perhaps surprising, we note that frozen tissue was used for the colorectal samples in our prior study, while the ovarian samples in this study were all formalin-fixed paraffin-embedded samples that had been collected between 2009 and 2016. The samples are therefore likely subject to significant DNA damage and to have significantly reduced sequence coverage of target regions [34]. It seems unlikely that the motif identified by our approach is purely an artifact given that it correctly classified 80% of the Cohort II samples. Taking the data at face value, it appears the motifs that mark repertoires as being HGSOC-associated are found in low frequency clones.

While our previous results demonstrate that TCR repertoires from TILs can be distinguished from adjacent healthy tissue repertoires by the presence of TCRs bearing specific, biophysicochemical motifs in their antigen binding regions, our current results go further by demonstrating that TILs repertoires from women with HGSOC can be distinguished from ovarian tissue-associated repertoires from women with healthy ovaries. Thus, in this case, we are distinguishing women with cancer from women without cancer, which is the classification task that is directly relevant to cancer detection. Despite this significant advance over the prior work, however, there are still several limitations that must be addressed. First, the HGSOC samples used in this study were primarily from women with stage III or IV disease. It is critical to determine whether this or another signature can be detected at early stages of disease, particularly before the appearance of invasive disease. Second, to have any potential utility for cancer detection, the signature must be detectable in tissue collected by minimally invasive means. That typically means blood. While the overlap between TILs T cell repertoires and the peripheral T cell repertoire has been shown to be relatively low, it is much higher, with as much as ~50–60% overlap, when the CD8+PD-1+ subset of peripheral T cells is sorted [3540]. Furthermore, the specific antigens recognized by this subset were similar to that of the TILs population [40]. Thus, it is reasonable to expect that a TCR signature found in the tissue can be detected in this or another peripheral T cell subset.

An additional potential utility of our approach is in the diagnosis of women who present with an ovarian mass. Thus, it will be essential to assess the signature on benign ovarian tumors, as well as on ovarian cancers of other types, to determine whether the signature presented here is present in those cases or whether these have their own unique signature.

Taken together, our current and prior results indicate that TCR-based biomarkers have potential utility for cancer detection. They justify further studies on larger patient cohorts designed to improve the generalizability of the signature with a particular focus on blood samples from patients with early stage disease. Additionally, they justify application of this method in other cancer types, such as pancreatic cancer, where, like ovarian, the need for early detection methods are particularly critical.


  1. 1. Romero P., Dunbar P.R., Valmori D., Pittet M., Ogg G.S., Rimoldi D., et al. Ex vivo staining of metastatic lymph nodes by class I major histocompatibility complex tetramers reveals high numbers of antigen-experienced tumor-specific cytolytic T lymphocytes. J Exp Med, 1998. 188(9): p. 1641–50. pmid:9802976
  2. 2. Disis M.L., Wallace D.R., Gooley T.A., Dang Y., Slota M., Lu H., et al. Concurrent trastuzumab and HER2/neu-specific vaccination in patients with metastatic breast cancer. J Clin Oncol, 2009. 27(28): p. 4685–92. pmid:19720923
  3. 3. Dhodapkar K.M., Gettinger S.N., Das R., Zebroski H. and Dhodapkar M.V., SOX2-specific adaptive immunity and response to immunotherapy in non-small cell lung cancer. Oncoimmunology, 2013. 2(7): p. e25205. pmid:24073380
  4. 4. Dhodapkar M.V., Sexton R., Das R., Dhodapkar K.M., Zhang L., Sundaram R., et al. Prospective analysis of antigen-specific immunity, stem-cell antigens, and immune checkpoints in monoclonal gammopathy. Blood, 2015. 126(22): p. 2475–8. pmid:26468228
  5. 5. Dhodapkar M.V., Sznol M., Zhao B., Wang D., Carvajal R.D., Keohan M.L., et al. Induction of antigen-specific immunity with a vaccine targeting NY-ESO-1 to the dendritic cell receptor DEC-205. Sci Transl Med, 2014. 6(232): p. 232ra51.
  6. 6. Kvistborg P., van Buuren M.M. and Schumacher T.N., Human cancer regression antigens. Curr Opin Immunol, 2013. 25(2): p. 284–90. pmid:23566921
  7. 7. Shima K., Morikawa T., Yamauchi M., Kuchiba A., Imamura Y., Liao X., et al. TGFBR2 and BAX mononucleotide tract mutations, microsatellite instability, and prognosis in 1072 colorectal cancers. PLoS One, 2011. 6(9): p. e25062. pmid:21949851
  8. 8. Kenter G.G., Welters M.J., Valentijn A.R., Lowik M.J., Berends-van der Meer D.M., Vloon A.P., et al. Vaccination against HPV-16 oncoproteins for vulvar intraepithelial neoplasia. N Engl J Med, 2009. 361(19): p. 1838–47. pmid:19890126
  9. 9. Stevanovic S., Draper L.M., Langhan M.M., Campbell T.E., Kwong M.L., Wunderlich J.R., et al. Complete regression of metastatic cervical cancer after treatment with human papillomavirus-targeted tumor-infiltrating T cells. J Clin Oncol, 2015. 33(14): p. 1543–50. pmid:25823737
  10. 10. Koh S., Tan A.T., Li L. and Bertoletti A., Targeted Therapy of Hepatitis B Virus-Related Hepatocellular Carcinoma: Present and Future. Diseases, 2016. 4(1).
  11. 11. Louis C.U., Straathof K., Bollard C.M., Ennamuri S., Gerken C., Lopez T.T., et al. Adoptive transfer of EBV-specific T cells results in sustained clinical responses in patients with locoregional nasopharyngeal carcinoma. J Immunother, 2010. 33(9): p. 983–90. pmid:20948438
  12. 12. Krishnadas D.K., Bai F. and Lucas K.G., Cancer testis antigen and immunotherapy. Immunotargets Ther, 2013. 2: p. 11–9. pmid:27471684
  13. 13. Want M.Y., Lugade A.A., Battaglia S. and Odunsi K., Nature of tumour rejection antigens in ovarian cancer. Immunology, 2018. 155(2): p. 202–10. pmid:29772069
  14. 14. Markert S., Lassmann S., Gabriel B., Klar M., Werner M., Gitsch G., et al. Alpha-folate receptor expression in epithelial ovarian carcinoma and non-neoplastic ovarian tissue. Anticancer Res, 2008. 28(6A): p. 3567–72. pmid:19189636
  15. 15. Lee P.P., Yee C., Savage P.A., Fong L., Brockstedt D., Weber J.S., et al. Characterization of circulating T cells specific for tumor-associated antigens in melanoma patients. Nat Med, 1999. 5(6): p. 677–85. pmid:10371507
  16. 16. Romero P., Valmori D., Pittet M.J., Zippelius A., Rimoldi D., Levy F., et al. Antigenicity and immunogenicity of Melan-A/MART-1 derived peptides as targets for tumor reactive CTL in human melanoma. Immunol Rev, 2002. 188: p. 81–96. pmid:12445283
  17. 17. Munson D.J., Egelston C.A., Chiotti K.E., Parra Z.E., Bruno T.C., Moore B.L., et al. Identification of shared TCR sequences from T cells in human breast cancer using emulsion RT-PCR. Proc Natl Acad Sci U S A, 2016. 113(29): p. 8272–7. pmid:27307436
  18. 18. Massa C., Robins H., Desmarais C., Riemann D., Fahldieck C., Fornara P., et al. Identification of patient-specific and tumor-shared T cell receptor sequences in renal cell carcinoma patients. Oncotarget, 2017. 8(13): p. 21212–28. pmid:28177902
  19. 19. Bai X., Zhang Q., Wu S., Zhang X., Wang M., He F., et al. Characteristics of Tumor Infiltrating Lymphocyte and Circulating Lymphocyte Repertoires in Pancreatic Cancer by the Sequencing of T Cell Receptors. Sci Rep, 2015. 5: p. 13664. pmid:26329277
  20. 20. Nakanishi K., Kukita Y., Segawa H., Inoue N., Ohue M. and Kato K., Characterization of the T-cell receptor beta chain repertoire in tumor-infiltrating lymphocytes. Cancer Med, 2016. 5(9): p. 2513–21. pmid:27465739
  21. 21. Savage P.A., Leventhal D.S. and Malchow S., Shaping the repertoire of tumor-infiltrating effector and regulatory T cells. Immunol Rev, 2014. 259(1): p. 245–58. pmid:24712470
  22. 22. Gee M.H., Han A., Lofgren S.M., Beausang J.F., Mendoza J.L., Birnbaum M.E., et al. Antigen Identification for Orphan T Cell Receptors Expressed on Tumor-Infiltrating Lymphocytes. Cell, 2018. 172(3): p. 549–63 e16.
  23. 23. Carbonneau M.-A., Cheplygina V., Granger E. and Gagnon G., Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 2018. 77: p. 329–53.
  24. 24. Atchley W.R., Zhao J., Fernandes A.D. and Druke T., Solving the protein sequence metric problem. Proc Natl Acad Sci U S A, 2005. 102(18): p. 6395–400. pmid:15851683
  25. 25. Ostmeyer J., Christley S., Toby I.T. and Cowell L.G., Biophysicochemical Motifs in T-cell Receptor Sequences Distinguish Repertoires from Tumor-Infiltrating Lymphocyte and Adjacent Healthy Tissue. Cancer Res, 2019. 79(7): p. 1671–80. pmid:30622114
  26. 26. Carlson C.S., Emerson R.O., Sherwood A.M., Desmarais C., Chung M.W., Parsons J.M., et al. Using synthetic templates to design an unbiased multiplex PCR assay. Nat Commun, 2013. 4: p. 2680. pmid:24157944
  27. 27. Robins H.S., Ericson N.G., Guenthoer J., O'Briant K.C., Tewari M., Drescher C.W., et al. Digital genomic quantification of tumor-infiltrating lymphocytes. Sci Transl Med, 2013. 5(214): p. 214ra169.
  28. 28. Christley S., Scarborough W., Salinas E., Rounds W.H., Toby I.T., Fonner J.M., et al. VDJServer: A Cloud-Based Analysis Portal and Data Commons for Immune Repertoire Sequences and Rearrangements. Front Immunol, 2018. 9: p. 976. pmid:29867956
  29. 29. Vander Heiden J.A., Marquez S., Marthandan N., Bukhari S.A.C., Busse C.E., Corrie B., et al. AIRR Community Standardized Representations for Annotated Immune Repertoires. Front Immunol, 2018. 9: p. 2206. pmid:30323809
  30. 30. Ye J., Ma N., Madden T.L. and Ostell J.M., IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res, 2013. 41(Web Server issue): p. W34–40. pmid:23671333
  31. 31. Emerson R.O., DeWitt W.S., Vignali M., Gravley J., Hu J.K., Osborne E.J., et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet, 2017. 49(5): p. 659–65. pmid:28369038
  32. 32. Kingma, D. and J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  33. 33. Ostmeyer J., Christley S., Rounds W.H., Toby I., Greenberg B.M., Monson N.L., et al. Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis. BMC Bioinformatics, 2017. 18(1): p. 401. pmid:28882107
  34. 34. Carrick D.M., Mehaffey M.G., Sachs M.C., Altekruse S., Camalier C., Chuaqui R., et al. Robustness of Next Generation Sequencing on Older Formalin-Fixed Paraffin-Embedded Tissue. PLoS One, 2015. 10(7): p. e0127353. pmid:26222067
  35. 35. Chen Z., Zhang C., Pan Y., Xu R., Xu C., Chen Z., et al. T cell receptor β-chain repertoire analysis reveals intratumour heterogeneity of tumour-infiltrating lymphocytes in oesophageal squamous cell carcinoma. The Journal of pathology, 2016. 239(4): p. 450–8. pmid:27171315
  36. 36. Page D.B., Yuan J., Redmond D., Wen Y.H., Durack J.C., Emerson R., et al. Deep Sequencing of T-cell Receptor DNA as a Biomarker of Clonally Expanded TILs in Breast Cancer after Immunotherapy. Cancer immunology research, 2016. 4(10): p. 835–44. pmid:27587469
  37. 37. Emerson R.O., Sherwood A.M., Rieder M.J., Guenthoer J., Williamson D.W., Carlson C.S., et al. High-throughput sequencing of T-cell receptors reveals a homogeneous repertoire of tumour-infiltrating lymphocytes in ovarian cancer. The Journal of pathology, 2013. 231(4): p. 433–40. pmid:24027095
  38. 38. Jang M., Yew P.Y., Hasegawa K., Ikeda Y., Fujiwara K., Fleming G.F., et al. Characterization of T cell repertoire of blood, tumor, and ascites in ovarian cancer patients using next generation sequencing. Oncoimmunology, 2015. 4(11): p. e1030561. pmid:26451311
  39. 39. Sims J.S., Grinshpun B., Feng Y., Ung T.H., Neira J.A., Samanamud J.L., et al. Diversity and divergence of the glioma-infiltrating T-cell receptor repertoire. Proceedings of the National Academy of Sciences of the United States of America, 2016: p. 201601012.
  40. 40. Gros A., Parkhurst M.R., Tran E., Pasetto A., Robbins P.F., Ilyas S., et al. Prospective identification of neoantigen-specific lymphocytes in the peripheral blood of melanoma patients. Nat Med, 2016. 22(4): p. 433–8. pmid:26901407