RefPlantNLR is a comprehensive collection of experimentally validated plant disease resistance proteins from the NLR family

Reference datasets are critical in computational biology. They help define canonical biological features and are essential for benchmarking studies. Here, we describe a comprehensive reference dataset of experimentally validated plant nucleotide-binding leucine-rich repeat (NLR) immune receptors. RefPlantNLR consists of 481 NLRs from 31 genera belonging to 11 orders of flowering plants. This reference dataset has several applications. We used RefPlantNLR to determine the canonical features of functionally validated plant NLRs and to benchmark 5 NLR annotation tools. This revealed that although NLR annotation tools tend to retrieve the majority of NLRs, they frequently produce domain architectures that are inconsistent with the RefPlantNLR annotation. Guided by this analysis, we developed a new pipeline, NLRtracker, which extracts and annotates NLRs from protein or transcript files based on the core features found in the RefPlantNLR dataset. The RefPlantNLR dataset should also prove useful for guiding comparative analyses of NLRs across the wide spectrum of plant diversity and identifying understudied taxa. We hope that the RefPlantNLR resource will contribute to moving the field beyond a uniform view of NLR structure and function.


Introduction
Reference datasets are critical in computational biology [1,2]. They help define canonical biological features and are essential to benchmarking studies. Reference datasets are particularly important for defining the sequence and domain features of gene and protein families. Despite this, curated collections of experimentally validated sequences are still lacking for several widely studied gene and protein families. One example is the nucleotide-binding leucine-rich repeat (NLR) family of plant proteins. NLRs constitute the predominant class of disease resistance (R) genes in plants [3][4][5]. They function as intracellular receptors that detect pathogens and activate an immune response that generally leads to disease resistance. NLRs are thought The class of pathogen to which NLRs in the RefPlantNLR dataset confer a response. Some NLRs may be involved in the response against multiple classes of pathogens, while others have a helper role or are found to be involved in allelic variation in autoimmune/hybrid necrosis responses, and (E) the per genus reduced redundancy set at a 90% sequence similarity threshold are plotted as a stacked bar graph. The number of experimentally validated NLRs belonging to the monophyletic TIR-NLR, CC-NLR, CC R -NLR, or CC G10 -NLR subclade members is indicated. Underlying data and R code to reproduce the figures in S5 Data. CCAU : AbbreviationlistsinFigs1 À 6havebeenupdated:Pleaseverifythatallentr , coiled-coil; HD, helical domain of apoptotic proteaseactivating factors; LRR, leucine-rich repeat; NB, P-loop containing NTPase domain; NLR, nucleotide-binding leucine-rich repeat; TIR, Toll/interleukin-1 receptor; WD, winged helix domain.
https://doi.org/10.1371/journal.pbio.3001124.g001 either annotated genomic features and transcriptomic data or alternatively can be run directly on the unannotated genomic sequence. NLR-Parser, RGAugury, RRGPredictor, and DRAGO2 identify transcript and protein sequences that have features of NLRs and are best described as NLR extractors [25][26][27][28]. RGAugury, RRGPredictor, and DRAGO2 also extract other classes of immune-related genes in addition to NLRs. These various tools use predefined motifs to classify sequences as NLRs, but they differ in the methods and pipelines. NLR-Annotator-an extension of NLR-Parser-and NLGenomeSweeper can also use unannotated genome sequences as input to predict the genomic locations of NLRs [29,30]. This output then requires manual annotation to extract the final gene models, and some of the annotated loci may represent partial or pseudogenized genes.
The goal of this study is to provide a curated reference dataset of experimentally validated plant NLRs. This version of RefPlantNLR (v.20210712_481) consists of 481 NLRs from 31 genera belonging to 11 orders of flowering plants. We used RefPlantNLR to determine the canonical features of functionally validated plant NLRs and benchmark NLR extraction tools. We found that these NLR extraction tools can extract the majority of NLRs in the RefPlantNLR dataset; however, the domain architecture analysis produced by these tools is often inconsistent with that of RefPlantNLR. In order to simplify NLR extraction, functional annotation, and phylogenetic analysis, we developed NLRtracker: a pipeline that uses InterProScan [31] and predefined NLR motifs [32] to extract NLRs and provide domain architecture analyses based on the canonical features found in the RefPlantNLR dataset. Additionally, NLRtracker outputs the extracted NB-ARC domain facilitating downstream phylogenetic analysis. RefPlantNLR should also prove useful in guiding comparative and phylogenetic analyses of plant NLRs and identifying understudied taxa for future studies.

Construction of the RefPlantNLR dataset
To construct the current version of RefPlantNLR (v.202110712_481, S1-S3 Dataset), we manually crawled through the literature, extracting plant NLRs that have been experimentally validated to at least some degree. We defined experimental validation broadly as genes reported to be involved in any of the following: (1) disease resistance; (2) disease susceptibility, including effector-triggered immune pathology or trailing necrosis to viruses; (3) hybrid necrosis; (4) autoimmunity; (5) NLR helper function or involvement in downstream immune responses; (6) negative regulation of immunity; and (7) well-described allelic series of NLRs with different pathogen recognition spectra even if not reported to confer disease resistance. We defined NLRs as sequences containing the NB-ARC domain (Pfam signature PF00931) or a P-loop containing nucleoside triphosphate hydrolases (NTPase) domain (SUPERFAMILY signature SSF52540) combined with plant-specific NLR motifs [32] (see Material and methods for the used motifs) (Fig 1A). This resulted in 479 sequences. We also included RXL [33], which has an N-terminal Rx-type CC domain and C-terminal LRR domain, as well as AtNRG1.3 [34], which has a C-terminal LRR domain, both of which contain the RNBS-D motif of the NB-ARC domain but otherwise do not get annotated with a P-loop containing NTPase domain. Altogether, these 481 sequences form the current version of RefPlantNLR (S1 Table).
In addition to the 481 NLRs present in this version of RefPlantNLR, we separately collected several characterized animal, bacterial, and archaeal NB-ARC proteins (S2 Table, S4 Dataset), which can be used as outgroups for comparative analyses. Furthermore, several characterized plant immune components have features often found in NLRs-such as the RPW8-type CC or the TIR domain-but lack the NB-ARC domain or NB-ARC-associated motifs that we used to define NLRs (see above). Since these proteins may have common origins with plant NLRs or may be useful for comparative analysis of these domains, we have collected them separately as well (S3 Table, S5-S7 Dataset).

Description of the RefPlantNLR dataset
The 481 RefPlantNLR entries belong to 31 genera of flowering plants (Fig 1B and 1C) and are described in S1 Table. The description includes amino acid, coding sequence (CDS) and locus identifiers, as well as the organism from which the NLR was cloned, the article describing the identification of the NLR, the pathogen type and pathogen to which the NLR provides resistance (when applicable), the matching pathogen effector, additional host components required for pathogen recognition (guardees or decoys) or required for NLR function, and the articles describing the identification of the pathogen and host components. From this dataset, we extracted 472 unique NLRs and 488 NB-ARC domains of which 406 were unique (S8 and S9 Dataset). NLRs with identical amino acid sequences were recovered because they have different resistance spectra when genetically linked to different sensor NLR allele (e.g., alleles of Pik), are different in noncoding regions leading to altered regulation (e.g., RPP7 alleles), or have been independently discovered in different plant genotypes (e.g., RRS1-R and SLH1).
The distribution of the RefPlantNLR entries across plant species mirrors the most heavily studied taxa, i.e., Arabidopsis, Solanaceae (Solanum, Capsicum, and Nicotiana), and cereals (Oryza, Triticum, and Hordeum) (Fig 1B). These 7 genera comprise 77% (370 out of 481) of the RefPlantNLR sequences. When accounting for redundancy by collapsing similar sequences (>90% overall amino acid identity per genus), these 7 genera would still account for 73% (220 out of 303) sequences (Fig 1C). It should be noted that there could be different evolutionary rates between NLRs, and, hence, some subfamilies may still be overrepresented in the reduced redundancy set.
In total, 31 plant genera representing 11 taxonomic orders are listed in RefPlantNLR. Interestingly, these species represent a small fraction of plant diversity with only 11 of 59 major seed plant (spermatophyte) orders described by Smith and Brown represented, and not a single entry from nonflowering plants (S4 Table) [35]. Arabidopsis remains the only species with experimentally validated NLRs from the 4 major clades (CC-NLR, CC G10 -NLR, CC R -NLR, and TIR-NLR) (Fig 1). For Arabidopsis, tomato, and rice, we compared the distribution of NLRs across the 4 major clades in the RefPlantNLR dataset and the published genome and found no major differences (S1A Fig).
We also mapped the frequency of the pathogens that are targeted by RefPlantNLR entries. Most validated NLRs in the RefPlantNLR dataset are involved in responses against fungi followed by oomycetes (Fig 1D and 1E for the reduced redundancy set). Responses to certain pathogen taxa is not constrained to particular subclasses of NLRs as all of TIR-NLRs, CC G10 -NLRs, and CC-NLRs are involved in resistance to the main pathogen classes (fungi, oomycete, bacteria, and viruses). The notable exception is the CC R -NLR subclade, which has only been validated for its helper function (Fig 1D and 1E). Additionally, CC G10 -NLR subclade members have not been assigned a helper activity, and CC R -NLR subclade members have not been implicated in autoimmunity or hybrid necrosis (Fig 1D and 1E), even though several RPW8-only proteins are involved in hybrid necrosis [36,37].
The average length of RefPlantNLR sequences varies depending on the subclass (Fig 2A  and 2C for the reduced redundancy set). CC-NLRs varied from 665 to 1,845 amino acids (mean = 1,079, N = 347), whereas TIR-NLR varied from 380 to 2,048 amino acids (mean = 1,159, N = 105). NB-ARC domains were more constrained (mean = 345, N = 406, stdev = 33) (Fig 2B). Nonetheless, 23 atypically short NB-ARCs (155 to 274 amino acids) and 1 long NB-ARC (422 amino acids) were observed at more than 2 standard deviations of the mean illustrating the overall flexibility of plant NLRs even for this canonical domain (Fig 2B  and 2D for the reduced redundancy set).
We noted that some of the unusually small NLRs lacked an SSFR domain, while some of the small NB-ARC domains appeared to be partial duplications of this domain. In order to look at domain architecture of NLRs more widely and to determine whether these unusual features are common, we functionally annotated the RefPlantNLR dataset using InterProScan [31] and predefined NLR motifs [32], as well as using LRRpredictor [38] (S10 Dataset) and an HMMAU : PleasedefineHMMatitsfirstmentioninthesentenceInordertolookatdomainarchitectureofNLRs for the recently discovered C-terminal jelly roll/Ig-like domain (C-JID) of TIR-NLRs [39] (S11 and S12 Dataset for the combined GFF annotation). This functional annotation can be visualized using the refplantnlR R package. We used this functional annotation to map the domain architecture of RefPlantNLR proteins (Fig 3A and 3B for the reduced redundancy set).
Even though CC-NLR and TIR-NLR domain combinations were the most frequent (61% and 19%, respectively), we observed additional domain combinations. In the RefPlantNLR dataset, a subset of NLRs lack the N-terminal domain but still group with the major NLR clades based on the NB-ARC phylogeny. Some TIR-NLRs lack an SSFR domain. Noncanonical  integrated domains are found in all NLR subfamilies and occur at the N-terminus, in between the N-terminal domain and the NB-ARC domain, at the C-terminus, or both ends. Of these noncanonical domains, the N-terminal late-blight resistance protein R1 domain (also known as the Solanaceae domain; Pfam signature PF12061) only occurs in association with the NB-ARC domain and has an ancient origin likely in the most recent common ancestor of the Asterids and Amaranthaceae [40]. Other noncanonical domains are also more widespread, including the monocot-specific integration of a zinc-finger BED domain in between the CC and NB-ARC domain [41,42]. Finally, some NLRs have significantly truncated NB-ARC domains as is the case for Pb1, AtNRG1.3, and RXL (Fig 3C). For Arabidopsis and rice, the number of characterized NLRs containing integrated domains appears to be slightly enriched as compared to all NLRs in the reference genome, whereas there is no NLRs with integrated domains identified in the tomato reference genome (S1B Fig). Finally, in Arabidopsis, there remains a number of NLR domain architectures, which have no counterpart in the RefPlantNLR set (S1C Fig).
We explored the phylogenetic diversity of RefPlantNLR proteins using the extracted NB-ARC domains with non-plant NB-ARC domains as an outgroup (Fig 4, S13-S15 Dataset). As with previously reported NLR phylogenetic analyses, RefPlantNLR sequences generally grouped in well-defined clades, notably CC-NLR, CC G10 -NLR, CC R -NLR, and TIR-NLR. Within this phylogeny, some of the branches, notably of Wed and Pi54, are long and may represent highly diverged NB-ARC domains. Since Pb1 [43], RXL [33], and AtNRG1.3 [34] do not match the Pfam NB-ARC domain, they were not included in this phylogenetic analysis.

Benchmarking NLR annotation tools using RefPlantNLR
We took advantage of the RefPlantNLR dataset to benchmark NLR annotation tools by determining their sensitivity in retrieving NLRs and accuracy in annotating NLR domain architecture. This is particularly justified because the majority of NLR prediction tools have only been evaluated using the reference Arabidopsis NLRome, which is not representative of NLR diversity across flowering plants (Fig 1). We selected 5 NLR annotation tools for benchmarking ( Table 1). These tools differ in the methods used for NLR extraction and functional annotation. NLGenomeSweeper, RGAugury, and RRGPredictor all use InterProScan [31] to functionally annotate sequences and extract NLRs based on co-occurrences of certain domains; however, they differ in which signatures are considered for the functional annotation. By contract, DRAGO2 relies on custom HMM models to functionally annotate sequences, whereas NLR-Annotator uses MEME with custom NLR motifs [32] for NLR extraction.
Since NLR-Annotator and NLGenomeSweeper only take nucleotide sequence input, whereas RGAugury only works on protein sequences, we decided to proceed with the benchmarking using only the RefPlantNLR entries with CDS information (457/481). In this way, we ensured that we could compare the tools on the same number of sequences. Out of the NLRextraction tools, DRAGO2 has the highest sensitivity, retrieving all of the RefPlantNLR entries when run on amino acid sequences (Fig 5A, Table 1). NLR-Annotator has the second highest sensitivity, retrieving 448/457 (98.0%) of the sequences (Fig 5A, Table 1). It has previously been noted that NLR-Annotator does not perform well on retrieving the CC R -NLR subclade members [25]. Indeed, NLR-Annotator missed 7/10 (70%) of CC R -NLRs in the RefPlantNLR There is currently no InterProScan signature or motif for the CC G10 Nterminal domain. Underlying data and R code to reproduce the figures in S5 Appendix. CC, coiled-coil; LRR, leucine-rich repeat; NB-ARC, nucleotidebinding adaptor shared by APAF-1, certain R gene products, and CED-4; NLR, nucleotide-binding leucine-rich repeat; TIR, Toll/interleukin-1 receptor.

PLOS BIOLOGY
Next, we compared sensitivity and domain annotation accuracy of the NLR annotation tools according to the 4 main NLR subclades. Since these tools only functionally annotate the canonical NLR domains, we did not consider integrated domains and the late blight R1 domain. While DRAGO2 is the most sensitive in retrieving NLRs, it correctly annotated the domain architecture of less than half (44.9%) of the RefPlantNLR sequences (Fig 5B, Table 1). DRAGO2, RGAugury, and RRGPredictor often failed to functionally annotate the CC domain (Fig 5B). Since these tools use Coils [45] to predict CC domains, we conclude that this program is not very sensitive to predict plant NLR CC domains. Additionally, Coils does not distinguish between the different types of CC domains such as the RPW8-type CC or Rx-type CC. Although NLR-Annotator does not automatically output a domain architecture analysis as the other tools, upon conversion of the motif analysis to domain architecture, we found that NLR-Annotator has the highest domain annotation accuracy of all tools, correctly annotating 403/457 (88.2%) of the NLRs (Fig 5B, Table 1). The other tools did not perform much better than DRAGO2, correctly annotating between 31.5% to 61.9% of RefPlantNLR entries (Fig 5B,  Table 1). When looking at the different NLR subclades, it becomes clear that most tools correctly identify and annotate TIR-NLRs, while domain prediction accuracy is lower for the other NLR subclades (Fig 5B). The exception to this is NLR-Annotator, which accurately annotates the domains of 288/326 (88.3%) CC-NLRs. This is possibly because NLR-Annotator was validated with the wheat genome, which contains a large proportion of CC-NLRs and some of the used motifs are specific to monocot CC-NLRs [32], whereas the other tools were validated with Arabidopsis, which has a higher abundance of TIR-NLRs as compared to other

PLOS BIOLOGY
species. Finally, comparing these NLR annotation tools on the reduced redundancy RefPlantNLR set revealed a similar pattern (S3 Fig, S1 Appendix for the full analysis). Based on the benchmarking using RefPlantNLR, we find DRAGO2 to be the most sensitive tool for NLR extraction, while NLR-Annotator is the most sensitive tool for use on genomic input. None of the tools performs well on the domain architecture analysis except for NLR-Annotator; however, to extract such a domain architecture output from NLR-Annotator does require a substantial effort on the user side.

NLRtracker: An NLR extraction and annotation pipeline based on the core features of RefPlantNLR
To address the limitations of the current NLR annotation tools highlighted above, we generated a novel pipeline we called NLRtracker. NLRtracker uses InterProScan [31] and the predefined NLR motifs [32] to annotate all sequences in a given proteome or transcriptome and then extracts and annotates NLRs based on the core NLR sequence features (late blight R1, TIR, RPW8, CC, NB-ARC, LRR, and integrated domains) found in the RefPlantNLR dataset (Fig 6A, S2 Appendix, Table 1). The functional annotation can then be visualized using the refplantnlR R package or other software of choice. Additionally, NLRtracker extracts the NB-ARC domain for comparative phylogenetic analysis. Since NLRtracker is based on the features found in the RefPlantNLR dataset, it exactly reproduces the RefPlantNLR domain architecture and extracts all RefPlantNLR entries. To compare NLRtracker to other NLR annotation tools, we used the Arabidopsis, tomato, and rice RefSeq genomes. In this way, we could compare whether NLRtracker also performs well on datasets other than RefPlantNLR.
In addition, we could also assess the accuracy of each NLR annotation tool, which is not possible with the RefPlantNLR dataset.
Using all tools, we extracted a total of 1,615 NLRs from the reference Arabidopsis (N = 441), tomato (N = 250), and rice (N = 924) genomes (Fig 6B). The total number of NLRs belonging to each subclade in each species is reflected in the RefPlantNLR dataset (Fig 6B). In addition to the 4 main subclades of NLRs, we also retrieved a highly conserved TIR-NB-ARC (TN) class of proteins, which phylogenetically clusters separately from all other plant NLRs and whose gene structure is clearly distinct from other TIR-NLRs [46], as well as certain NB-ARC containing proteins, which did not clearly belong to any of these clades (S4 Fig). This included an Arabidopsis NB-ARC protein with an integrated ZBTB8B domain in between the NB-ARC and the LRR, as well as a rice protein containing an NB-ARC domain with a C-terminal ARMtype SSFR (Fig 6B, S3 Appendix for the full analysis).
Using this dataset, we evaluated the sensitivity and specificity of each NLR annotation tool. Sensitivity was defined as the total percentage NLRs retrieved out of the total NLR dataset, while specificity was defined as the total number of sequences annotated as NLRs being genuine NLRs. False positives could include TIR-or RPW8-only proteins annotated as genuine NLRs, or unrelated sequences annotated as NLRs. On this dataset, NLRtracker had the highest sensitivity (retrieving 1,611/1,615 (99.8%) NLRs with 100% accuracy (Table 2, Fig 6B, S5 Fig). The 4 missed sequences were retrieved by DRAGO2 and included 1 NLR with an N-terminal CC-domain and a C-terminal LRR but which did not get annotated with an NB-ARC domain using InterProScan or the predefined NLR motifs, and 3 truncated proteins, 2 of which contain an LRR domain while one does not get annotated with any domain using InterProScan.
Of the preexisting tools DRAGO2 was the most sensitive, retrieving 1,526/1,615 (94.5%) NLRs; however, it also was the least accurate method extracting 91 false positives (Table 1, S5  Fig). These false positives were predominantly proteins containing a P-loop containing NTPase domain unrelated to the NB-ARC domain, e.g., ABC transporter ATP-binding cassette domain, AAA ATPase domain, Adenylylsulphate kinase domain and others. Similarly, RGAugury extracted 13 such false positives. By contrast, the 8 false positives extracted by RRGPredictor are RPW8-containing proteins lacking an NB-ARC domain. In conclusion, the NLRtracker tool we developed here is more sensitive and more accurate than previously available tools for extracting NLRs from a given plant proteome/transcriptome. Additionally, NLRtracker facilitates domain architecture analysis and phylogenetic analysis. Combining the extracted NB-ARC domain generated by NLRtracker with the RefPlantNLR extracted NB-ARC dataset (S9 Dataset) should greatly facilitate comparative phylogenetics and reveal

PLOS BIOLOGY
RefPlantNLR: A comprehensive collection of experimentally validated plant NLRs the phylogenetic relationships of a newly annotated NLR. Nevertheless, the quality of the output remains dependent on the quality of the input sequences, and none of these tools can determine whether an extracted sequence represents a genuine NLR, as in having a genuine NB-ARC domain or consisting of a full-length protein. For example, 94/1,615 extracted proteins do not get annotated with an NB-ARC domain, of which 44 do not get annotated with a P-loop containing NTPase domain but do contain NB-ARC-specific motifs in combination with domains found in NLRs. Some of these may represent genuine NLRs such as Pb1 or RXL, which have undergone regressive evolution, whereas other may be partial or pseudogenes. Finally, all NLR extraction tools require independent annotation of gene models. Unfortunately, it remains difficult to correctly predict NLR gene models in an automated way, and such annotation often requires manual curation. The functional annotation of NLR gene models generated by NLRtracker can be used to assess whether a given NLR gene model is likely to be correct, or whether it lacks key features, indicating that it is either degenerated or pseudogenized, or alternatively incorrectly annotated.

Additional applications of the RefPlantNLR dataset
We showed that RefPlantNLR is useful for benchmarking and improving NLR annotation tools. Additional uses of the dataset include providing reference points for newly discovered NLRs with NLRtracker feeding into the large-scale phylogenetic analyses that are necessary for classifying NLRomes. Phylogenetic analyses would help assign NLRs to subclades and provide a basis for generating hypotheses about the function and mode of action of novel NLRs, which phylogenetically cluster with experimentally validated NLRs. This type of phylogenetic information can be combined with other features such as genetic clustering and has, for instance, proven valuable in previous work on rice and solanaceous NLRs [23,48] and for defining the CC G10 -NLR class [13].
Furthermore, known mutants and sequence variants can be mapped onto a phylogenetic framework, such as the RefPlantNLR tree (Fig 4). For example, the CC-NLR ZAR1 and TIR-NLR ROQ1 are bound to ATP in their activated form [49,50], whereas the TIR-NLR RPP1 is bound to ADP in its activated form [39]. RefPlantNLR has already proven useful in interpreting a feature of the recently elucidated structure of the RPP1 resistosome [39]. The authors used RefPlantNLR to determine that although most CC-NLRs contain a TT/SR motif in which the arginine interacts with ATP, a subset of TIR-NLRs contain a charged or polar substitution creating a TTE/Q motif interacting with ADP in the activated form [39]. Interestingly, a phylogenetically distinct subgroup of CC-NLRs known as the MIC1 group [41] is an exception to this rule by having a TTE/Q motif in their ADP binding pocket and thus may also  [35] do not have a single experimentally validated NLR. Certain taxa have subfamily-specific contractions and expansions and hence may contain unexplored genetic and biochemical diversity of NLR function. Looking forward, combining the output of NLRtracker with RefPlantNLR may highlight understudied subgroups of NLRs even within well-studied organisms. For example, while all currently studied CC R -NLRs act as helper NLRs for TIR-NLRs, it has been reported that the CC R -NLR subfamily has experienced clade-specific expansions in gymnosperms and rosids, pointing to potential biochemical specialization of this subfamily in these taxa [51]. In addition, although NLRs have been reported in non-seed plants and some of these appear to have distinct N-terminal domains [11], their experimental validation is still lacking.
The RefPlantNLR dataset has inherent limitations due to its focus on experimentally validated NLRs. First, it is biased toward a few well-studied model species and crops as illustrated in Fig 1. Additionally, RefPlantNLR entries are somewhat redundant with particular NLR allelic series, such as the monocot MLA and spinach alpha-WOLF, being overrepresented in the dataset (Figs 1 and 4). These issues, notably redundancy, will need to be considered for certain applications where it may be preferable to use the reduced redundancy dataset (S16 Dataset).
Finally, to facilitate the use of NLRtracker, we have run NLRtracker on the current NCBI RefSeq plant genomes (S5 Table, S17 Dataset). The NCBI RefSeq genomes have been annotated with the same genome annotation pipeline, which facilitates comparisons between species. Since NLRtracker also annotates integrated domains, we looked at the distribution of integrated domains in NLRs across the plant kingdom (S6 Fig). Some species such as tomato completely lack NLRs with integrated domains, whereas other flowering plant species have up to 17.5% of NLRs with integrated domains. Since NLRtracker is based on RefPlantNLR, which only contains entries from flowering plant species, the functional annotation may not be as accurate on nonflowering plant species. Indeed, Physcomitrium patens (a moss) and Selaginella moellendorffii (a lycophyte) appear to have a large proportion of NLRs with integrated domains (S6 Fig). When looking at the types of integrated domains, these are predominantly protein kinase domains for P. patens and ARM repeat-type SSFRs for S. moellendorffii (S7 Fig), which likely reflect ancient lineage-specific expansions [11]. Since integrated domains are thought to be effector targets or mimics thereof which genetically integrated into NLRs, the complete set of integrated domains provides a starting point for identifying putative effector targets (S18 Dataset).

Conclusions
We hope that the RefPlantNLR resource will contribute to moving the field beyond a uniform view of NLR structure and function. It is now evident that NLRs are more structurally and functionally diverse than anticipated. Whereas a number of plant NLRs have retained the presumably ancestral 3 domain architecture of the TIR/CC R /CC G10 /CC fused to the NB-ARC and LRR domains, many NLRs have diversified into specialized proteins with degenerated features and extraneous noncanonical integrated domains [15,21,24]. Therefore, it is time to question holistic concepts such as effector-triggered immunity (ETI) and appreciate the wide structural and functional diversity of NLR-mediated immunity. More specifically, a robust phylogenetic framework of plant NLRs should be fully integrated into the mechanistic study of these exceptionally diverse proteins.

Sequence retrieval
RefPlantNLR was assembled by manually crawling the literature for experimentally validated NLRs according to the criteria described in the results section. NLRs are defined as having an NB-ARC and at least 1 additional domain. Where possible, the amino acid and nucleotide sequences were taken from GenBank. For some NLRs, only the mRNA has been deposited and no genomic locus information was present. When GenBank records were not available, the sequences were extracted from the matching whole-genome sequences projects or from articles and patents describing the identification of these NLRs.

Sequence deduplication
The NLR amino acid sequences were clustered using CD-HIT at 90% sequence identity (v4.8.1 [60]; Usage: cd-hit -i RefPlantNLR.fasta -o RefPlantNLR _90 -c 0.90 -n 5 -M 16000 -d 0). A custom R script (S4 Appendix) was used to assign representative sequences per cluster per genus, i.e., if a single cluster contained sequences from multiple genera, we assigned a representative sequence per genus. The reduced redundancy sequences are provided in S16 Dataset.

Phylogenetics
The NB-ARC domain of all NLRs were extracted and deduplicated. For sequences containing multiple NB-ARC domains, the extracted NB-ARC domain was numbered according to occurrence in the protein. Sequences were aligned using Clustal Omega [61], and all positions with less than 95% site coverage were removed using QKphylogeny [62] (S14 Dataset). RAxML (v8.2.12) [63] was used (usage: raxmlHPC-PTHREADS-AVX -T 6 -s RefPlantNLR. phy -n RefPlantNLR -m PROTGAMMAAUTO -f a -# 1000 -x 8153044963028367 -p 644124967711489) to infer the evolutionary history using the Maximum Likelihood method based on the JTT model [44]. Bootstrap values from 1,000 rapid bootstrap replicates as implemented in RAxML are shown [64] (S15 Dataset). The RefPlantNLR phylogeny was rooted on the non-plant outgroup and edited using the iTOL suite (6.3) [65].

Figures describing RefPlantNLR
The figures describing the RefPlantNLR dataset were generated using a custom R script (S5 Appendix).

Description of NLRtracker
NLRtracker (S2 Appendix) runs InterProScan (v5.51-85.0) [31] and FIMO from the memesuite (v5.1.1) [59] using predefined NLR-motifs [32]. An R script that depends on the Tidyverse [66] extracts sequences containing NLR-associated domains and classifies them into different subgroups: We did not apply additional cutoffs to the InterProScan output. For the MEME output, we filtered for hits with a score �60.0 and a q-value �0.01. Additionally, for NLR extraction using the linker and MHD motif, we applied a more stringent cutoff requiring a score �85.0. NLRtracker outputs the domain architecture analysis, as well as the domain boundaries. Additionally, the NB-ARC is extracted facilitating phylogenetic analysis. The current version of NLRtracker can be accessed through GitHub (https://github.com/slt666666/NLRtracker).

Description of refplantnlR R package
The NLRtracker output can be directly used with the refplantnlR R package to visualize the domain architecture, or, alternatively, the RefPlantNLR or NCBI RefSeq NLRtracker output can be loaded. The drawing of the domain architecture analysis is based on drawProteins [68]. The current version of refplantnlR can be accessed through GitHub (https://github.com/ JKourelis/refplantnlR).
NLRs were grouped in different subclades based on phylogenetic clustering with the RefPlantNLR CC R -NLR, TIR-NLR, CC G10 -NLR, and CC-NLR subgroups, while those that did not clearly fall into any of these groups but contained a TIR-domain and P-loop containing NTPase domain were classified as TN subclade members. The remainder was grouped together and classified as other. Proteins that were extracted but did not belong to the NLR subfamily were manually inspected and classified as false positives. Additionally, TIR-or RPW8-only (TX and RPW8, respectively) proteins extracted as NLRs were marked as false positives. A custom R script was used to generate the analysis (S3 Appendix).