The metabotyping of an East African cassava diversity panel: A core collection for developing biotic stress tolerance in cassava

Cassava will have a vital role to play, if food security is to be achieved in Sub-Saharan Africa, especially Central and East Africa. The whitefly Bemisia tabaci poses a major threat to cassava production by small holder farmers in part due to their role as a vector of cassava mosaic begomoviruses (CMBs) and cassava brown streak ipomoviruses (CBSIs). In the present study untargeted metabolomics has been used as a tool to assess natural variation, similarities and attempts to identify trait differentiators among an East African cassava diversity panel that displayed tolerance/resistance to the effects of Bemisia tabaci infestation. The metabolome captured, was represented by 1529 unique chemical features per accession. Principal component analysis (PCA) identified a 23% variation across the panel, with geographical origin/adaption the most influential classification factors. Separation based on resistance and susceptible traits to Bemisia tabaci could also be observed within the data and was corroborated by genotyping data. Thus the metabolomics pipeline represented an effective metabotyping approach. Agglomerative Hierarchical Clustering Analysis (HCA) of both the metabolomics and genotyping data was performed and revealed a high level of similarity between accessions. Specific differentiating features/metabolites were identified, including those potentially conferring vigour to whitefly tolerance on a constitutive manner. The implications of using these cassava varieties as parental breeding material and the future potential of incorporating more exotic donor material is discussed.


Introduction
Plant breeding relies on the selection of accessions carrying desirable agronomic and consumer traits. This is not always a straightforward task, especially when the resolution or sensitivity of the phenotyping methodology is low or heavily influenced by environmental factors. The robustness of phenotyping criteria for assessing disease resistance in cassava has been already addressed, and has highlighted the effects of the environment on the visible symptoms of disease [1] and on the development rate of its vector, the whiteflies [2]. In the absence of reliable or clearly distinguishable phenotypes, metabolite analysis has been introduced as a tool for aiding selection and assessing effects of genetic differences, both among candidate donors (parents), siblings or even families derived from genetic crosses [3][4][5][6]. Food security is one of the major challenges facing society in the 21 st century and over 800 million of people rely on cassava production to subsist. Viral diseases like the cassava mosaic and cassava brown streak diseases (CMD and CBSD) constitute one of the major threats for cassava production in Sub-Saharan Africa (SSA) [7]. The presence of these viruses was first detected in Uganda and Tanzania back in the 1930s. CMD has been reported throughout SSA while CBSD is currently restricted to Eastern and Southern Africa but its predicted spread to West Africa is considered highly damaging. Outbreaks of CMD were controlled in the past through integrated pest management (IPM) measures including the release of advanced varieties bred for disease resistance [1,8] but those were susceptible to CBSD. This together with the "superabundant" populations of African cassava whitefly Bemisia tabaci in the 2000s [9] caused new CBSD outbreaks [10]. African cassava whitefly, Bemisia tabaci (Gennadius) sensu lato is the vector of both viral agents causing CMD and CBSD but it also affects crop production directly, through the feeding of adults and nymphs, which in turn promotes the colonisation of sooty mould fungus on leaves. Since the raise and spread of African cassava whitefly superabundant populations, the development of improved cassava varieties possessing virus and whitefly resistance has constituted one of the core milestones of national breeding programs in Sub-Saharan African countries and international consortiums [8,11,12]. Efforts in finding natural sources of African whitefly (Bemisia tabaci) resistance have been focused on collections of local landraces and advanced lines originally developed for virus resistance [13][14][15][16][17] with the intention of pyramiding both disease and vector resistance into farmer-preferred varieties adapted to local agroecosystems.
In order to increase agricultural production and combat food (and nutritional) security, new crop varieties are urgently required [18]. In the case of cassava, varieties displaying tolerance to biotic and abiotic stresses, while maintaining consumer quality will be vital. To develop such lines, new alleles or combinations from "exotic" diverse germplasm need to be introgressed into existing donor or domesticated germplasm. In the present article we have used metabolomics as a phenotyping tool (metabotyping), in conjunction with genotyping by sequencing to characterise elite and pre-breeding materials found among cassava varieties presently used in East Africa. These materials are commonly referred to as the "5CP" material due to their assembly in the 5CP project [14]. Despite their importance as core resources for the development of new cassava varieties, very little detail is known on their genetic pedigree or chemical composition. The present study has provided valuable insights into these criteria that underpin their use in further plant breeding activities.

Plant material
The clean and virus-free plant material used in this study originated from the 5CP project [14]. Plants were grown in a mixture of soil and compost (John Innes No.2; Fargo Ltd., Arundel, UK) in an insect-free quarantine glasshouse at 25 ± 5˚C, 50-60% relative humidity and L14:D10 (light:dark) hours for three months. The 5CP´s cassava collection of elite varieties (Table 1) were grown under identical greenhouse conditions in order to minimise environmental effects and assess natural chemical variation under controlled conditions. Between 3 and 6 biological replicates per variety were collected when possible and metabolite composition analysed.

Sample collection and preparation
The 4th full expanded leaf of 3 months old plants was collected and used for analysis. Leaves collected were immediately frozen in dry ice, lyophilised for 2 days and ground to a fine powder with a TissueRuptor (Qiagen). Ten milligrams of freeze-dried powdered leaves were used for extraction of metabolites and a quality control (QC) sample was prepared by pooling 10 mg of each variety and biological replicate.

Metabolites' extraction
Addition of methanol:water:chloroform (1:1:2) (vol/vol/vol) to 10 mg of freeze-dried leaf material enabled separation of semi-polar metabolites in the epiphase and non-polar metabolites in the organic phase. Solvent extraction was carried out for 1 hr at room temperature with continuous shaking and the phase separation obtained after centrifugation of this mixture. Semi-polar extracts were filtered (0.45 μm nylon membranes), centrifuged (14 000 rpm, 5 min) and an aliquot of 95 μl spiked in with 5 μl of internal standard (genistein 0.2 mg/ml), were then subjected to LC-MS analysis. Details described in [19].

LC-MS metabolite profiling
Analysis was conducted with a UHPLC UltiMate 3000 (Dionex Softron) linked to a MAXIS UHR-Q-TOF mass spectrometer (Bruker Daltonics) and an electrospray ionisation (ESI) source operating in negative mode under the following conditions: dry gas at 8 L/min, capillary 3500 V, end plate at -500 V, vaporizing temperature was 195˚C and nebulizer was 1.3 Bar. Mass spectra were recorded in full scan mode from 50 to 1200 m/z range. Chromatographic separation of metabolites was carried out on an YMC-UltraHTPro C18 2 μm column (100 x 2 mm i.d.) using 10% acetonitrile in water (A) and acetonitrile (B) as mobile phases, both containing 0.1% formic acid. These solvents were used in linear gradient mode from 100% (A) to 65% (A) in 17 min and up to 0% (A) over 12 min. A 5 min washing and re-equilibration steps respectively were added to the gradient program. The flow rate was 0.2 ml/min and the injection volume used was 5 μl.

Metabolomics data processing and analysis
Chemical features were extracted using MetaMS script in R [20] as described in [19]. Schemes of analytical, data processing and data analysis workflows and strategies adopted for interrogating data are illustrated in S1 and S2 Figs. Breeding line MM2006/130 nominated as Naro-Cass 1 and Naro-Cass 2 in the same reference [8].
[NARO = National Agricultural Research Organisation in Uganda].

� �
Tested for whitefly phenotype in ref.9 but results not reported for this line other than it doesn't fall into the 10 top resistant nor into the 10 top susceptible. WF-M stands for whitefly-moderate phenotype. + B. tabaci abundance as in ref. [21]. https://doi.org/10.1371/journal.pone.0242245.t001

Data analysis strategy and statistical analysis
An untargeted extraction and analytical methodology was used to assess the natural variation present within the diversity panel of the 5CP cassava collection. Three multivariate tests were used to explore the untargeted data matrix: (i) unsupervised Principal Component Analysis (PCA) to check for quality of the data and to obtain a general overview of the trends within the data, (ii) Agglomerative Hierarchical Clustering Analysis (HCA) to ascertain similarities and distance between varieties, and (iii) supervised classification modelling using Orthogonal Partial Least Square-Discriminant Analysis (OPLS-DA) to evaluate robustness and accuracy of metabolite fingerprinting for explaining whitefly phenotype (resistance vs. susceptibility) and at the same time enabling identification of potential biochemical markers linked to the phenotype.
An overview of the metabolome analysis was assessed by principal component analysis (PCA) and the classification models of phenotypes evaluated using OPLS-DA. Varieties Mkumba (TZ3), Albert (TZ6) and Nase 18 (UG4) as whitefly resistant (WF-R), and Sagonja (MAL5), Kalawe (MAL6) and Eyope (MOZ2) as whitefly susceptible (WF-S) were used as a subset for model prediction and classification (OPLS-DA). Selection of lines was based on the phenotyping data (number of Bemisia tabaci adults) published in [21]. Both multivariate analysis PCA and OPLS-DA were performed using pareto scaling and data matrix input was preprocessed by normalising against internal standard and after batch correction (QC normalisation).
Significant features identified through statistical tests were annotated using authors' established libraries [5,19]. Hypothesis generated from multivariate analysis were further validated by univariate statistical tests.

Genotyping
DNA was extracted from 31 cassava leaf samples; using the CTAB-based protocol described by [22] with the following minor adjustment: DNA was obtained from powdered leaf tissues using Qiagen Tissue Lyser (Venlo, Netherlands). A standard amount of DNA (60 ng) per samples were processed using a protocol for 96 single nucleotide polymorphism (SNP) genotyping in cassava with the EP1TM system and 96.96 SNP type assays of Fluidigm1 version S.01. This technique allowed simultaneously to collect both end-point and real-time data from a unique chip cell with 97% confidence. This protocol has been used previously in several studies to analyse cassava diversity and varietal identification [23][24][25].
Alleles for each SNP were scored as present, absent, or missing (failed to amplify); the resultant binary matrix was integrated in a.vcf file. The evaluation of true genetic duplicates was done using the NGSEP platform [26]; where, samples with a percentage of homozygous and heterozygous differences equal or lower than 3% were nominated. Finally, the genetic distance among genotypes was calculated based on the matrices of allele frequencies, and the cluster analysis was made using the agglomerative hierarchical clustering (AHC) algorithms with the Ward method. The resulting dendrogram was un-rooted and depicted using pltree function, implemented in the Cluster package v.2.1 for R program v.3.3.4 (R Development Core Team, 2017).

The experimental approach used to metabotype the 5CP panel
Untargeted LC-MS analysis in negative mode, covering an m/z range from 50 to 1 200 generated a data-matrix of over 14 000 chemical features, after the application of MetaMS peak-picking algorithm. The raw data matrix was then filtered by excluding low abundant features not presenting an isotopic pattern. Typically, these low abundant features are close to the limit of detection and present high variability and poor reproducibility across replicates (RSD>30%). In order to reduce redundancy, chemical features corresponding to the isotopic species ( 13 C, 2 H, etc.), i.e., [M+1, 2, 3, 4] of parent ions [M] were also excluded. The final data matrix contained 1530 unique chemical features (S1 Fig), one being the internal standard. At this stage chemical features grouped under the same peak-cluster (PC) group were treated independently, and clustering of adducts, in-source fragments or multiple charged species was considered later for identification and characterisation purposes.
The final data matrix normalised against a single internal standard (SIS) and batch corrected contained 149 observations and 1529 variables (S1 File) and was used as input data for multivariate analysis and univariate statistics (S2 Fig). Within subjects biological variability was also calculated as coefficients of variation (%CV) and plotted as violin plots (S3 Fig). Generally, the biological variation of the varieties oscillated between 25 to 30%.

The assessment of diversity at the genetic and biochemical level within the 5CP collection
To assess the overall biochemical variation of the Eastern African cassava elite varieties, unsupervised PCA was performed on the metabolite composition of leaf extracts. Score plot of components 1 and 2 explained 23% variation of the collection which was comprised of 26 varieties (Fig 1). Three main clusters can be depicted from the PCA score plot attending geographical sub-region classification. Tanzanian varieties constituted the Central area of East Africa on the top-left sector whilst varieties collected from Northern part of the East African region, Uganda and Kenya, co-locate in the bottom-left sector of the ellipsoid. Varieties original from Mozambique and Malawi tend to cluster together and spread towards the bottom-right quadrant of the ellipsoid, especially those lines from Mozambique. Some outliers of this geographical clustering pattern are Kiroba (TZ7) and KBH2002/066 (TZ1) or Nase 1 (UG7), Kibandameno (KE6) and Oekhumelela (MOZ1). Clustering pattern associated with whitefly (resistant and susceptible) phenotypes was difficult to draw due to the insufficient data available for the complete collection. Nevertheless, Sagonja (MAL5) and Kalawe (MAL6) and Eyope (MOZ2) and KBH2002/066 (TZ1) regarded as the most susceptible by [21], cluster together in two subgroups respectively. Similarly, Tanzanian lines Mkumba (TZ3), Mkuranga 1 (KBH 2006/26, TZ5) and Albert (TZ6) described as less preferred by B. tabaci [21] form a tight cluster in the PCA score plot but separated from Nase 18 (UG4), which was also defined as whitefly resistant.
Genetic similarity was evident when analysis of SNPs was carried out to assess the natural genetic variation of the diversity panel and compared to similarities found at chemical composition level. The Ward AHC dendrogram was constructed using the genetic similarities matrix obtained by running 96 highly informative SNP locus among the complete set of the 5CP collection, which includes 31 East African cassava genotypes (Fig 2A). These African cassava varieties clustered in four different groups; group 1 (G1) comprised principally by samples from Malawi (MAL) and Mozambique (MOZ), group 2 (G2) by samples from Tanzania (TZ), while the samples from Kenya are represented in group 3 (G3) and Uganda in group 4 (G4). The analysis of true genetic duplicates showed that among 31 African cassava genotypes, two genotypes in G2 from Tanzania, Pwani and Mkumba (TZ2 and TZ3), and three genotypes in G4 from Uganda, Nase18, 72-TME14 and Nase 14 (UG4, UG5 and UG6) showed the same genetic makeup, respectively; with zero homozygous and heterozygous differences as it is also depicted by their chemical composition. Similarly, LM1/2008/363 (KE1) and Tajirika (KE3) in group-3 clustered together based on low DNA and chemical differences (Fig 2). On the contrary, cultivars with the largest genetic and chemical differences are Tajirika (KE3) in group-3 and Oekhumelela (MOZ1) in group-1 as shown in Fig 2. It is important to highlight that G4 is not a real group; but formed due to the high number of genetic duplicated samples from Uganda. Additionally, based on both nucleotide and metabolite composition, the genotypes clustered in the same group when both analyses were compared, were Eyope (MOZ2), Nziva (MOZ3), The number of significant chemical features differentiating each sub-group of metabolome derived dendrogram (Fig 2B) were calculated by multiple t-test comparison and indicated in the figure. The differences between the two main clusters, one comprised by G3 and G4 and the other including G1 and G2, is based on 185 significant features out of 1529 suggesting a % homology at the metabolite level of 87.9%. Similarly, the differences between groups occurring at the next branching point, G1 and G2, had a 93.8% homology, i.e. 96 significant features over a total of 1529. The percentage of similarity increases towards the base of the tree where the homology of chemical profiling between closest varieties is over 99% ( Table 2).

Biochemical features as differentiating phenotypic markers
In order to study whether the natural variation present in the collection can be utilised to define metabotypes representative of WF-R/WF-S phenotype and also to identify potential chemical markers linked to phenotype, a multivariate classification analysis was applied using the LC-MS untargeted data matrix as input.

PLOS ONE
Supervised discriminant analysis like partial least squares (PLS) are often used for these purposes. In the present study an OPLS-DA model was chosen to incorporate both the orthogonal variation which is inherent of the metabolomics mass spectrometry data and the chemical variation of the different classes of phenotypes.
Based on the whitefly abundance data available [21] OPLS-DA was applied to assess the ability of the metabolome to explain and predict the phenotype reported. The varieties presenting the most extreme phenotypes were used to build the model and extract potential biomarkers. Varieties Mkumba (TZ3), Albert (TZ6) and Nase 18 (UG4) were selected as WF-R class and Sagonja (MAL5) and Kalawe (MAL6) as WF-S. The variety Eyope (MOZ2) was initially selected and included in the WF-S but was later excluded due to the high detected variation across biological replicates (S4 Fig). Coincidentally, Sagonja (MAL5) and Kalawe (MAL6) happened to present the highest percentage of symptomatic plants when assessed for CBSD phenotype, and Mkumba (TZ3), Nase 18 (UG4) and Orera (MOZ5) displayed the lowest number of infected plants [21] thus designated as CBSD susceptible and resistant classes, respectively. The resulting model created from WF-R and WF-S classes presented strong goodness fit and prediction power (R2cum = 0.941 and Q2 = 0.762, respectively). A variability of 16.3% between classes and 11.8% within classes is explained by the model (Fig 3A) where sub-clusters can be observed within both the WF-R and WF-S class. Loadings plot in the form of S-plot (Fig 3B) was used to visualize and select the variables statistically relevant for discriminating both phenotype classes, i.e., putative biomarkers. A list of 69 potential biochemical markers was extracted based on the combined values of correlation and contribution (p(corr) [1]>|0.6|) and p [1]>|0.05|) of chemical features and ANOVA analysis was applied to additionally validate the statistical differences between the means of the different varieties (S2 File). Ten out the 69 putative markers were found to be not significant according to pair-wise comparison tests. Multiple t-test comparison to Sagonja (MAL5) as susceptible reference reveals that varieties Mkumba (TZ3) and Albert (TZ6) presented the highest number of significant differences (5 and 13 out of the 69, respectively), followed by Kalawe (MAL6) and Nase 18 (UG4) which shows 2 and 1 significant differences, respectively. This is also consistent with ANOVA's Dunnett test using Sagonja (MAL5) as control where 40 markers out of the 69 were significantly different when compared to Mkumba (TZ3), similar number of significant features were obtained from Sagonja (MAL5)-Kalawe (MAL6) and Sagonja (MAL5)-Albert (TZ6) comparisons, 27 and 26 respectively, and the lowest number of significant differences (8 out of 69) were found between the WF-R variety Nase 18 (UG4) and the WF-S reference Sagonja (MAL5) (S2 File).
Representation of normalised relative concentrations of putative biomarkers across varieties in a heatmap shows that the trend in data is organised in three clusters as confirmed by Kmeans clustering (Fig 4). Two main clusters 2 and 3 showing reciprocal accumulation of metabolites between the WF-S varieties Sagonja (MAL5) and Kalawe (MAL6) and the WF-R varieties Mkumba (TZ3), Albert (TZ6) and Nase 18 (UG4). In cluster number 1, the WF-R line Nase 18 (UG4) presented a number of metabolites with similar levels as WF-S variety Sagonja (MAL5). Some of the chemical features identified as putative chemical markers were characterized by tandem mass spectrometry and accurate mass as malate esters of p-coumaric acid and caffeic acid and neochlorogenic acid (S2 File). These compounds have been previously identified to be involved in the resistance mechanism of South American cassava ECU72 challenged with the South American whitefly Aleurotrachelus socialis Bondar [19].

Discussion
The natural variation at metabolite and genetic level of an East African cassava elite diversity panel is presented. Analysis of the chemical fingerprinting datasets indicated low natural variation (23%) compared to the variation found in other cassavas' collections including both African and South American cultivars [33] where >50% and >30% was found in leaves collected in the fields and in vitro plantlets respectively. The high level of homology within the East African collection is consistent at both chemical and genetic level, which has been previously reported by [34]. In effect the data shows the robustness of using metabolomics as a classification tool. Capturing the representative metabolome, enables comparisons at the level of chemical composition which is often more representative of traits and more comprehensive than a limited number of DNA based markers.
Historical breeding efforts of African cassava, documented since in the 1930s [1,8,11,13] suggests a high level of in-breeding probably resulting in scarce diversity in the current national stock collections and fields. Indeed, the most promising candidates within the 5CP collection, which represent a subset of national collections, as potential sources of both whitefly and disease resistance seem to be concentrated in the Ugandan and Tanzanian lines ( Table 1) which are the ones presenting the highest level of duplication and homology. Although the pedigree of these candidate lines is unknown, the origins of some of the Ugandan collection can be found in the literature (S3 File), whilst Tanzanian cultivars may originate from similar ancestors, i.e., the "Amani hybrids" previously deployed to farmers and most likely bred/clonally propagated naturally over the years. Chemical distribution of East African varieties either in the PCA or dendrogram followed a geographical origin pattern and classification associated with either whitefly or virus resistance/susceptible phenotypes couldn't be drawn. Instead a high level of similarity was found between phenotypically and even geographically distant varieties suggesting that probably the effect of adaptation to local agro-ecosystems on the plant's metabolome dominate and mask both the whitefly and virus phenotype observed in the fields.
The variability introduced by environmental adaptation factors has likely to have an impact in defining the preference of whitefly adults for certain varieties or the symptoms developed under virus infection and therefore hindering/misleading the identification of true-to-type phenotype classes. This hypothesis/concern also arises when the list of potential biomarkers generated from the classification model (ultimately based on observed phenotype) is investigated further. Pair-wise comparison of the WF-R variety Nase 18 (UG4) and the susceptible Sagonja (MAL5) revealed higher similarity between them than comparison between Sagonja (MAL5) and Kalawe (MAL6), despite both Malawian lines being phenotyped as the most susceptible. Varieties UG4 (Nase 18) and MAL5 (Sagonja) presented a significant number of phenotypic biochemical markers displaying a similar pattern of accumulation in both lines.
A number of chemical features differentiating WF-R and WF-S classes were characterised and identified as neochlorogenic acid, caffeic acid and p-coumaric acid malate esters among other unknown features. Malate esters of p-coumaric and caffeic acid were the most influential in the S-plot and present at lower levels in the resistant varieties UG4, TZ3 and TZ6 whilst neochlorogenic acid accumulated in the Tanzanian varieties resistant to whitefly and viral diseases CMD and CBSD (S5 Fig). The identified compounds have been previously found to be also associated with the South American whitefly resistance in cassava and a mechanistic mode of action proposed [19]. The changes in the metabolite composition mapped into plant's metabolome suggested a link between phenylpropanoids and monolignols biosynthesis and cell wall reinforcement as a resistance mechanism. The initial cassava breeding programmes witnessed the incorporation of South American germplasm, namely the Brazilian cultivars, Aipin Valenca and Macaxeira aipin (S3 File). Therefore it is potentially feasible that "ancestorial" regions of "exotic" DNA have been selected for, within these 5CP lines, that have led to the presence of the altered phenolic components (or regulators controlling their levels) that appear to contribute to vigour and resilient traits. Given the heterozygosity of the cassava, segregation of the 5CP lines from selfed accessions could lead to enhancement of the phenotype by increasing the dosage of the gene(s) of interest and/or regions influencing complex traits. However, it is important not to lose sight that the number of unknown chemical features that remain unresolved is still significant and further characterisation of these unknowns as well as implementation of further analytical metabolomics approaches are required. Annotation and characterisation of unknown molecular or chemical features is the bottleneck of untargeted metabolomics studies which limits the discovery of relevant biological markers to a reduced number of overrepresented chemical classes. Phenylpropanoids constitute one of these examples of a recurrent marker of multiple agronomic trait or biological activities with emphasis on chlorogenic acid and flavonoids [35][36][37][38][39][40]. The large-scale nature of untargeted metabolomics studies requires high throughput methods that likely favour the extraction and detection of, for example, phenylpropanoids and underestimate the presence of minor chemical families are likely to be a contributing factor to these outcomes. Although bioinformatic resources have implemented the process of interrogating metabolite and mass spectrum databases thus increasing the number of putative identifications, complementation with targeted approaches based on front-end fractionation would also aid in providing a more comprehensive coverage of plant's metabolome diversity.

Conclusions
This study shows that metabotyping approaches are effective methodologies for the assessment of natural variation in cassava. The corroborating genotyping emphasises the need for multi-level omic analysis to ensure robustness. Collectively, these data provide important information on parental materials that can be used in future breeding programmes directed towards cassava varieties with improved stress tolerances. The relative low level of natural variation identified in this study suggests that initial wide genetic crosses with diverse germplasm would be beneficial, prior to focussed trait specific introgression or pyramiding of poorly characterised materials. In addition, if diversity panels containing genetically unrelated germplasm can be created mGWAS is an approach that could be implemented.