Proteomic fingerprint identification of Neotropical hard tick species (Acari: Ixodidae) using a self-curated mass spectra reference library

Matrix-assisted laser desorption/ionization (MALDI) time-of-flight mass spectrometry is an analytical method that detects macromolecules that can be used as biomarkers for taxonomic identification in arthropods. The conventional MALDI approach uses fresh laboratory-reared arthropod specimens to build a reference mass spectra library with the high-quality standards required to achieve reliable identification. However, this may not be possible for some arthropod groups that are difficult to rear under laboratory conditions, or for which only alcohol-preserved samples are available. Here, we generated MALDI mass spectra of highly abundant proteins from the legs of 18 Neotropical species of adult field-collected hard ticks, several of which had not been analyzed by mass spectrometry before. We then used their mass spectra as fingerprints to identify each tick species by applying machine learning and pattern recognition algorithms that combined unsupervised and supervised clustering approaches. Both principal component analysis (PCA) and linear discriminant analysis (LDA) classification algorithms were able to identify spectra from different tick species, with LDA achieving the best performance when applied to field-collected specimens that did have an existing entry in a reference library of arthropod protein spectra. These findings contribute to the growing literature that supports mass spectrometry as a rapid and effective method for taxonomic identification of disease vectors, which is the first step to predict and manage arthropod-borne pathogens.

Author Summary
Hard ticks (Ixodidae) are external parasites that feed on the blood of almost every species of terrestrial vertebrate on earth, including humans. Due to their complete dependency on blood, both sexes, and even immature stages, are capable of transmitting disease agents to their hosts, causing distress and sometimes death. Despite the public health significance of ixodid ticks, accurate species identification remains problematic. Vector species identification is core to developing effective vector control schemes. Herein, we provide the first report of MALDI identification of several species of field-collected Neotropical tick specimens preserved in ethanol for up to four years. Our methodology shows that identification does not depend on a commercial reference library of lab-reared samples; with the help of machine learning, it can instead rely on a self-curated reference library. In addition, our approach offers greater accuracy and lower cost per sample than conventional and modern identification approaches such as morphology and molecular barcoding.

[...] mass fingerprinting for the identification of field-collected specimens that do not exist in a reference spectra library (or for those from which reference spectra cannot be generated under [...]

Here, we used MALDI as a scheme to identify Neotropical specimens of adult hard ticks derived from ethanol-preserved field collections. Specifically, we used machine learning and pattern recognition algorithms to classify protein spectra from the legs of field-collected specimens in order to identify a group of unknown samples with a self-curated reference library. MALDI is a promising tool for cataloging and quickly identifying large arthropod groups such as ticks [11]. Our results should contribute to the growing body of literature addressing the feasibility, reliability and universality of the methodology for environments and species that have not been evaluated before. Properly identifying disease vectors such as Ixodidae in highly diverse Neotropical countries, such as Panama, is a critical first step to predict and manage tick-borne zoonotic pathogens such as Rickettsia and arboviruses (i.e., arthropod-borne viruses).

Sample preparation
Ticks stored in ethanol for up to 5 years, and previously identified based on morphological characters, were taken from long-term storage in a -20 °C freezer (S1 Table) [...] either the left or the right anterior leg from each tick using a scalpel. The leg was then put in a tube with 300 µL ultrapure water followed by the addition of 900 µL of 100% ethanol. These tubes were vortexed for 15 s and centrifuged at 13,000 RPM for 2 min. After centrifugation, the supernatant was poured off from the sample tube, which was left to dry for 15 min. Subsequently, the legs were resuspended in 60 µL 70% formic acid and 60 µL 100% acetonitrile and homogenized in the microtube using a manual pestle. The samples were placed in a Branson 1510 ultra-sonicator (Bransonic, Danbury, CT, USA) for 60 min in ice water, and then vortexed for 15 s and centrifuged again at 13,000 RPM for 2 min. One microliter of supernatant was pipetted onto a polished steel MALDI plate and covered with 1 µL of HCCA matrix. After the plate had dried, it was inserted into the MALDI mass spectrometer to record the protein spectra from tick legs.

MALDI mass spectrometry parameters
We used an UltrafleXtreme III spectrometer (Bruker Daltonics, Bremen, Germany) to generate the protein mass spectra of each specimen. The equipment has a MALDI source, a time-of-flight (TOF) mass analyzer, and a 2 kHz Smartbeam™-II neodymium-doped yttrium aluminum garnet (Nd:YAG) solid-state laser (λ=355 nm) that we used in positive polarization mode. All spectra were automatically acquired in the range of 2,000 to 20,000 m/z in linear mode for the detection of the most abundant protein ions. Each spectrum represented the accumulation of 5,100 shots with 300 shots taken at a time, and the acquisition was done in random-walk mode with a laser power in the range of 50% to 100% (global laser attenuation at 30%). The software FlexAnalysis™ (Bruker) was used to analyze the mass spectra initially and to evaluate the number of ion peaks and their intensity. Visual comparisons of the mass spectra from different tick species gave initial indications of dominant ion peaks that would suggest possible classification into discrete groups. Mass spectra that did not include at least one ion peak with an intensity of 1000 a.u. or more were considered low quality and filtered out. All samples were placed and measured on three individual target wells with three technical replicates of the mass spectra collected per well.

Machine Learning
[...] that are commonly used for dimensionality reduction and classification. Dimensionality reduction can help decrease computational costs for classification, as well as avoid overfitting by minimizing the error in parameter estimation.

PCA is an "unsupervised" algorithm that generates vectors that correspond to the direction of maximal variance in the sample space. On the other hand, LDA is a "supervised" algorithm that considers class information to provide a basis that best discriminates the classes (i.e., tick species) [37]. For both PCA and LDA analyses, we identified a test sample by calculating the Euclidean distance between the vector describing the test sample and the average vector describing each class. The class with the minimum distance to the test sample was assigned as the identified species for that test sample. The LDA was applied to the data set expressed in terms of the coefficients (i.e., principal components) obtained by the PCA. Thus, PCA reduced the dimensionality of the data, and the LDA provided the supervised classification.

The performance of the clustering algorithms was tested using Monte Carlo simulations over 1000 iterations per species to optimize training and cross-validation prediction success rates (Table 2). For each iteration, the data elements of each species were split randomly into approximately (but not less than) 20% for testing and the remainder for training. For this analysis, we used the first 150 principal components from the PCA stage, which explained 99.9% of the total variance and which, after being projected for the LDA algorithm, also generated a 150-component data set. The number of components was chosen after a performance analysis, again using a Monte Carlo approach, as the one that provided the best identification rates. Global and class positive identification rates were calculated to establish the classification capacity of the algorithm (Table 2). The positive identification rate corresponds to the percent ratio between the positive identifications performed by the algorithm and the real positive cases in the data.

For visualization purposes in the plots, species that were morphologically identified within the Rhipicephalus and Ixodes genera were separately compared against Dermacentor and Haemaphysalis, for which there was only one species each. All species that were morphologically identified within the Amblyomma genus were separately compared among themselves or against the Ixodes genus.
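The classification scheme described above (PCA for dimensionality reduction, LDA on the PCA scores, assignment to the class with the nearest mean vector, and repeated random splits with at least 20% of each class held out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the spectra are synthetic toy data, PCA and LDA are refit on the training portion of every split, and the LDA is the textbook Fisher formulation, which keeps at most (number of classes - 1) discriminant axes and may therefore differ from the 150-component LDA output reported in the text.

```python
import numpy as np

def make_synthetic_spectra(seed=0, n_classes=3, per_class=12, n_bins=50):
    """Toy stand-in for binned spectra: one noisy cloud per hypothetical species."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 5.0, size=(n_classes, n_bins))
    y = np.repeat(np.arange(n_classes), per_class)
    X = centers[y] + rng.normal(0.0, 1.0, size=(len(y), n_bins))
    return X, y

def stratified_split(y, rng, test_fraction=0.2):
    """Boolean train/test masks with at least test_fraction of each class in test."""
    test_mask = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_test = max(1, int(np.ceil(test_fraction * len(idx) - 1e-9)))
        test_mask[idx[:n_test]] = True
    return ~test_mask, test_mask

def pca_fit(X, n_components):
    """Training mean and leading principal axes (rows) from an SVD of centred data."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def lda_fit(Z, y, n_components):
    """Fisher LDA projection (columns) computed on PCA scores Z with labels y."""
    d = Z.shape[1]
    overall = Z.mean(axis=0)
    sw = np.zeros((d, d))  # within-class scatter
    sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        zc = Z[y == c]
        mc = zc.mean(axis=0)
        sw += (zc - mc).T @ (zc - mc)
        diff = (mc - overall)[:, None]
        sb += len(zc) * (diff @ diff.T)
    # Small ridge (an assumption of this sketch) keeps sw positive definite.
    sw += (1e-6 * np.trace(sw) / d + 1e-12) * np.eye(d)
    w, u = np.linalg.eigh(sw)
    sw_inv_half = u @ np.diag(1.0 / np.sqrt(w)) @ u.T
    evals, v = np.linalg.eigh(sw_inv_half @ sb @ sw_inv_half)
    order = np.argsort(evals)[::-1][:n_components]
    return sw_inv_half @ v[:, order]

def nearest_centroid(train, y_train, test):
    """Assign each test vector to the class whose mean vector is closest (Euclidean)."""
    classes = np.unique(y_train)
    centroids = np.stack([train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def monte_carlo_rate(X, y, n_pcs, n_iter=1000, seed=0):
    """Mean percentage of correctly identified test spectra over random splits."""
    rng = np.random.default_rng(seed)
    n_axes = len(np.unique(y)) - 1
    rates = []
    for _ in range(n_iter):
        train, test = stratified_split(y, rng)
        mean, axes = pca_fit(X[train], n_pcs)
        z_train = (X[train] - mean) @ axes.T
        z_test = (X[test] - mean) @ axes.T
        w = lda_fit(z_train, y[train], n_axes)
        pred = nearest_centroid(z_train @ w, y[train], z_test @ w)
        rates.append(np.mean(pred == y[test]))
    return 100.0 * float(np.mean(rates))
```

On the toy data, `monte_carlo_rate(*make_synthetic_spectra(), n_pcs=10, n_iter=20)` exercises the whole pipeline; with real spectra, the number of principal components and iterations would be set to the values given in the text.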

Optical micrographs from 18 species of Neotropical hard ticks showed very clear differences among species in terms of adult morphological features (Fig 1, S1 Fig), consistent with the unique mass spectra generated from each sample and taxon (Fig 2). The global automatic acquisition rate was 77% for all species (Table 1), confirming that, overall, the mass spectra of field-collected and ethanol-preserved specimens allowed automatic acquisition of spectra. Automatic acquisition results in faster and more objective data collection than acquiring spectra manually. The percentage of automatic spectra acquisition with the MALDI ranged from 50% for A. mixtum (cajennense), I. boliviensis and R. sanguineus to 100% for several of the species, including A. calcaratum, A. geayi, A. sabanerae, I. affinis, and R. microplus (Table 1). Neither the time stored in ethanol nor the location of sample origin seemed to explain the variable percentages of automatic spectra collection (S1 Table). Spectra from freshly collected specimens stored dry at -20 °C, used to [...]

In addition, the specimens within each species showed consistently similar protein profiles, regardless of their taxonomic genera, sex, collection date and/or sampling location (Fig 2, S1 Table). Mean protein spectra for tick species differed visually among taxa and the differences appeared to be related to their degree of phylogenetic relatedness (Fig 2). For example, species within the genera Ixodes, Rhipicephalus, and Amblyomma were more similar among themselves in terms of the number of ion peaks and their mass over charge (m/z) positions in the mass spectra than species from different genera. Nonetheless, some closely related species within the Amblyomma genus, such as A. mixtum (cajennense), A. varium, and A. tapirellum, also showed fairly distinct protein spectra (Fig 2), which motivated the application of clustering algorithms for their classification.

The images for the full assemblage of 18 species can be found in S1

Distinct mass spectra profiles between morphologically identified ixodid species could be classified by an unsupervised PCA algorithm to identify specimens. The quantitative performance of the PCA algorithm was assessed per species (Table 2), and visually confirmed with the graphic clustering presented in 3D plots (Fig 3). The PCA global positive identification rate was 91.2%, with 14 out of 18 species having a positive identification rate higher than 90%. The PCA graphs showed that most species separated into well-defined clusters, and the distance among clusters seemed to be related to the degree of phylogenetic relatedness, as evidenced by the clear separation of the Dermacentor and Rhipicephalus specimens from those of Haemaphysalis and Ixodes (Fig 3A, B), or among the specimens of Amblyomma (Fig 3C). When comparing species within the genus Amblyomma against those from Ixodes, the spectra from specimens of each species again clustered together with limited overlap between groups, and those from different genera were clearly separated (Fig 3D).

In addition, the LDA clustering analysis showed a global positive identification rate of 94.2% (Fig 4; Table 2), with 14 out of 18 species having a positive identification rate higher than 97.8%. Positive identification rates ranged from 100% (best score possible) for A. mixtum (cajennense), A. nodosum, A. oblongoguttatum, A. ovale, A. varium, A. naponense and R. sanguineus to 45.6% for D. nitens. The 3D representation plots of the LDA clustering showed that the separation between species was more pronounced than with PCA when comparing species from different genera, confirming the improved quantitative performance of the LDA algorithm (Table 2).
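The per-species and global positive identification rates reported above follow the definition given in the methods: the percent ratio between the positive identifications made by the algorithm and the real positive cases in the data. A minimal sketch of that calculation is shown below. The function name and the species labels are hypothetical, and the sketch assumes that the global rate is pooled over all test spectra rather than averaged over species, which the text does not specify.

```python
from collections import Counter

def positive_identification_rates(true_labels, predicted_labels):
    """Per-class and pooled global positive identification rates, in percent."""
    totals = Counter(true_labels)  # real positive cases per class
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {c: 100.0 * hits[c] / totals[c] for c in totals}
    global_rate = 100.0 * sum(hits.values()) / len(true_labels)
    return per_class, global_rate

# Hypothetical labels for illustration only (not data from the study).
true = ["sp_A", "sp_A", "sp_A", "sp_A", "sp_B", "sp_B"]
pred = ["sp_A", "sp_A", "sp_A", "sp_B", "sp_B", "sp_B"]
per_class, global_rate = positive_identification_rates(true, pred)
```

Under the pooled assumption, classes with many spectra weigh more heavily in the global rate than classes with few, so a species with a low per-class rate and few specimens has a limited effect on the global figure.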

Our results show that MALDI mass spectra of highly abundant proteins in arthropod legs served as fingerprints to identify samples of 18 species of Neotropical hard ticks using machine learning and pattern recognition algorithms to create a self-curated reference library. We compared smoothed and baseline-corrected spectra generated from unknown field-collected tick samples against the mean spectra from a subset of the same field samples that had already been identified through traditional means. To systematize this process, we used PCA and LDA algorithms to classify mass spectra without prior establishment of a high-quality reference library, which typically requires laboratory-reared specimens that may not be possible to obtain for all species. Global positive identification rates of up to 94.2% were achieved with this methodology, offering a rapid, reliable and objective approach to identify hard tick species, which will likely improve as more specimens are evaluated and included in our database.

These outcomes agree with our previous work [26], in which we used a similar approach to classify field-collected samples of 11 morphologically identified species of Anopheles mosquitoes. In that study, Neotropical Anopheles samples were stored dry in silica gel at -20 °C, which seemed to avoid sample degradation and maintain spectral quality. This contrasts with the present study, where most of our specimens were stored in ethanol at -20 °C for several years. Thus, our findings confirm that our novel analytical approach using MALDI and PCA/LDA clustering algorithms is robust for species classification regardless of the arthropod assemblage, sample storage conditions, and the lack of a high-quality reference library. Our results herein also show that both classification algorithms, PCA and LDA, were capable of clustering and recognizing spectra from up to 18 different tick species, including roughly 50% of the ixodid taxa (both ecologically dominant and rare taxa) reported for Panama [26,41]. LDA outcomes were more discriminant and robust than PCA overall, but PCA also classified species from different genera with over 91% accuracy and consistency. LDA was able to cluster each of the 18 species of ticks with validation and cross-validation scores above 94%, both between and within genera. As expected, the clustering algorithm was more accurate for phylogenetically distant species (i.e., Ixodes, Rhipicephalus and Haemaphysalis genera), with a success rate higher than 97% in most of these cases, than for closely related species (i.e., Amblyomma genus).

Although the number of samples analyzed for some ixodid species was relatively low, [...]

The long-term goal of our analytical approach with MALDI is to offer an open-source, web-based platform where users can upload the protein mass spectra of their known and unknown specimens to increase the number of species covered and to improve the power of our clustering algorithms. This crowd-sourced approach could be more cost-effective, given that it is not necessary to generate a reference library of well-curated samples. Instead, field samples can be taxonomically assigned as they arrive at the laboratory using a correctly matched protein fingerprint, while unidentified samples can be identified with traditional methods and added as new entries into the growing self-curated reference database.

In conclusion, the present study used MALDI mass spectrometry as a tool to rapidly identify Neotropical specimens of adult hard ticks that had been preserved in ethanol for several years. Our algorithms were capable of identifying specimens from the 18 tick species evaluated, based on their protein spectra "fingerprint", with up to 94% cross-validation capability. This is the first report of the protein mass spectra from the leg for most of these Neotropical tick species. Large arthropod groups such as ticks are difficult to identify with currently available strategies from commercial vendors, forcing the user to lower the "quality"