A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria

Although Plasmodium vivax parasites are the predominant cause of malaria outside of sub-Saharan Africa, they not always prioritised by elimination programmes. P. vivax is resilient and poses challenges through its ability to re-emerge from dormancy in the human liver. With observed growing drug-resistance and the increasing reports of life-threatening infections, new tools to inform elimination efforts are needed. In order to halt transmission, we need to better understand the dynamics of transmission, the movement of parasites, and the reservoirs of infection in order to design targeted interventions. The use of molecular genetics and epidemiology for tracking and studying malaria parasite populations has been applied successfully in P. falciparum species and here we sought to develop a molecular genetic tool for P. vivax. By assembling the largest set of P. vivax whole genome sequences (n = 433) spanning 17 countries, and applying a machine learning approach, we created a 71 SNP barcode with high predictive ability to identify geographic origin (91.4%). Further, due to the inclusion of markers for within population variability, the barcode may also distinguish local transmission networks. By using P. vivax data from a low-transmission setting in Malaysia, we demonstrate the potential ability to infer outbreak events. By characterising the barcoding SNP genotypes in P. vivax DNA sourced from UK travellers (n = 132) to ten malaria endemic countries predominantly not used in the barcode construction, we correctly predicted the geographic region of infection origin. Overall, the 71 SNP barcode outperforms previously published genotyping methods and when rolled-out within new portable platforms, is likely to be an invaluable tool for informing targeted interventions towards elimination of this resilient human malaria.


Introduction
Plasmodium vivax is the predominant cause of malaria outside of sub-Saharan Africa [1,2] and there are increasing reports of drug-resistance and severe complications that pose a threat to children and pregnant women [3][4][5][6][7]. Elimination efforts have led to reductions in the prevalence of the deadlier P. falciparum malaria, but areas of co-endemicity have seen a corresponding rise in the proportion of P. vivax infections, which appear more resilient to control strategies [8]. In order to halt transmission of vivax, malaria programmes should focus on identifying the main reservoirs of infection and target control measures towards these. P. vivax has been observed in regions where malaria transmission had once been interrupted [9], and in such a context, continuing surveillance and the use of genetic tools to identify transmission networks within malaria outbreaks is essential. The utility of genetic tools has been demonstrated in previous studies in low-transmission settings, such as in Malaysian Borneo [10] or places where there are imported cases, such as in Greece [11]. Further, understanding the transmission dynamics of the parasite populations through assessment of genetic diversity has the potential to play a key role in guiding the elimination efforts, including by revealing transmission events and identifying potential foci of infection. In recent years, genomic studies have dissected the molecular dynamics of P. vivax populations in regions with stable transmission [12][13][14][15][16]. Other studies have used microsatellites to study trends in the population dynamics [17][18][19][20]. The availability of whole genome sequencing data can inform the design of "genetic barcodes" which require low numbers of SNP polymorphisms [21] but can be used to infer transmission networks and the geographic source of infections. Molecular barcodes, when combined with affordable delivery systems, can facilitate P. vivax epidemiological and surveillance investigations.
Microsatellite markers have been used to reveal a spectrum of population structures in P. falciparum [22], as well as infer transmission dynamics and complexity of infections [23]. Compared to microsatellites, SNP markers are more suitable for comparisons of both strongly and weakly diverged populations, and in revealing ancestral patterns of genetic structuring [24]. By leveraging off whole genome sequencing for population characterisation [14,15,25,26], a number of SNP-based barcodes have been derived for P. falciparum. A barcode based on 24 SNPs in the nuclear genome ("24-SNP barcode") has been used to identify and track isolates from an endemic population in Senegal [27]. This barcode harboured capacity to identify clones from non-clones [21] using data from a limited set of long-term adapted laboratory lines [27], and has been employed on field isolates to infer local temporal changes in genetic diversity [28]. This approach has demonstrated the potential effectiveness of such tools in combination with epidemiological methods to elucidate transmission intensity in malaria endemic regions [28]. Nevertheless, the use of isolates with a very limited geographical spread to generate the barcode can underestimate the genomic variability present in other Plasmodium populations and lead to low precision when estimating relatedness [21], thereby impeding the transportability of the 24-SNP barcode across global malaria regions. It has been shown that the predictive power of the 24-SNP barcode for geographical determination is poor, especially when compared to one formed of 23 SNPs from the mitochondria and apicoplast organellar genomes, which predicted the continental origin of samples with 92% accuracy [29]. Another barcode formed of 105 highly frequent nuclear genome SNPs was developed to infer transmission intensity using a geographically broader panel of isolates, thereby potentially providing greater utility across malaria-endemic countries [30]. Simulations on genomewide data from P. falciparum has recommended the use of at least 200 barcoding SNPs for identity by decent (IBD) analysis in haploid eukaryotes [21]. Overall, the studies in P. falciparum have demonstrated that SNP barcodes can potentially provide insights into the intensity of transmission, identify the geographical origin of the field isolates, and inform the dynamics of the diversity in a parasite population. These include outbreak identification, an event which has been shown to be more likely in low-transmission settings [31].
SNP barcodes for P. vivax have been proposed, including one based on 42-SNP nuclear polymorphisms and another on mitochondrial genome markers, but both developed to ascertain the source of infection [32,33]. One limitation of these barcodes is that they were based on relatively small datasets. As more data becomes available and geographical coverage increases, genotyping tools with greater predictive power and wider global reach can be developed [14][15][16]34]. Technological advancements in genomics can be leveraged, including the high throughout sequencing of candidate genomic regions and portable genotyping [35]. Ultimately, the identification and integration of informative loci for P. vivax and other plasmodium parasites for inferring transmission and infection source has the potential to revolutionise global malaria surveillance. Here, using whole genome sequencing data for 433 P. vivax isolates across 17 countries, we applied machine learning and SNP tagging approaches from human genome-wide association studies (GWAS) to create a 71 SNP barcode with high predictive ability for geographic origin (91.4% accuracy) and the capability to infer transmission. We demonstrate that the barcode outperforms alternative approaches, including microsatellite genotyping, for global geographical profiling and inferring outbreak events within a low-transmission setting in Malaysia. Further, we validate the barcode by analysing the 71 SNPs in P. vivax DNA sourced from 132 recent UK travellers to East Africa, Asia and South America. Our work demonstrates that the 71 SNP barcode has the potential to be an invaluable tool to help elimination efforts of this resilient neglected Plasmodium species.
A principal component analysis (PCA) revealed that the isolates clustered broadly by regional groups: Southeast Asia (Thailand, Myanmar, Cambodia, Vietnam and Laos), South Asian/East Asian/Southeast Asia (China), Americas (Peru, Colombia, Mexico and Brazil) and Oceanian/Southeast Asia (Papua New Guinea, Indonesia and Malaysia) and others located around the Arabian Sea (Fig 1). The genomic distance between intra-country isolates was on average smaller (mean: 26,505 SNPs, range: 0-37,801 SNPs) compared to between inter-border isolates (mean: 33,160 SNPs, range: 6,034-39,676 SNPs), in line with previous studies [14,15]. The notable exception to the pattern were isolates sourced from neighbouring Thailand and Myanmar, supporting evidence they belong to a similar parasite population [14], which matches with the Thai samples being collected in the western region [15]. Despite this exceptional situation, our analysis suggests the potential of identifying markers that could Isolates are coloured according to country of origin. Clustering by region can be observed, with Southeast Asian isolates appearing to group at the bottom right of the plot, Oceania at the top right, and South American isolates on the centre left. A relative degree of clustering by country can be observed, especially for isolates from Oceania and to a lesser extent Southeast Asia. The percentage of variation explained for each PC is shown in the axis labels. Additional region-specific plots are shown for clarity.
https://doi.org/10.1371/journal.pgen.1008576.g001 determine geographical origin to a country level in most settings. Intuitively, the highly skewed distribution of the MAF towards rare variants suggests that more common SNPs are those driving the population structure observed. To assess this hypothesis, we split the dataset into three equally sized divisions based on SNP MAF (tertiles: < 0.002, 0.002-0.007, > 0.007; S1 Fig, right), and for each we constructed neighbour-joining trees and correlated the pair-wise genome distance with the estimate based on all 720k SNPs (Fig 2). These comparisons revealed the strong explanatory power harboured by those SNPs with the highest MAF (Pearson's r 2 correlation of 0.94 with the full 720k SNP set) compared to the subsets with lower MAF (Pearson's r 2 correlations of 0.25 and 0.27). A MAF cut-off (> 0.3) was determined (S1 Fig, right), leading to a subset of 16,110 SNPs used as the starting point for barcode building for country classification.

Selection of highly informative barcoding SNPs using a tagging and machine learning approach
In order to further reduce number of markers used for barcoding construction, we applied the software TAGster [37] to identify tagging SNPs that summarise blocks of high linkage disequilibrium (LD), as estimated by the r 2 metric across windows of size 500 kbp. The genetic ; right] r 2 = 0.94); (Bottom) A Bland-Altman analysis comparing the differences in genetic distance between using whole genome SNPs ("gold standard") and each of the SNP subsets. This reveals that subsets of SNPs with low MAF tend to overestimate the distance (panels left and centre, with mean differences -7.99 and -1.81 respectively; SD = standard deviation). Whilst, in the high MAF subset (right) the genetic distance was underestimated (mean of differences: 0.113).
https://doi.org/10.1371/journal.pgen.1008576.g002 variability can be captured by tagging SNPs due to the strong LD structure found in P. vivax populations [34], and we identified 1,173 SNPs, which summarised variation in 4,896 neighbouring SNPs, almost 40% of the total dataset. The 1,173 highly informative SNPs were then used as the pool for a further selection using a random forest modelling approach. Prior to implementation, missing alleles (4.2%) were imputed with high accuracy (< 1% error, see Materials and methods), and a neighbour joining tree was constructed using the resulting data, thereby confirming that no bias was introduced to the clustering patterns (S2 Fig, top). The correlation between genome distances based on the 1,173 SNPs and whole genome sequencing (720k SNPs) was high (Pearson's r 2 = 0.96) (S2 Fig, middle). Similarly, a Bland-Altman analysis revealed the genetic distances between isolates obtained using the subset of 1,173 tagging SNPs were similar to those based on whole genome SNPs (S2 Fig, bottom), especially those with the highest MAF (see Fig 2, bottom right). We sought to show that the 1,173 SNPs could not only predict differences between geographical regions, but also estimate genetic distances within intra-border isolates and identify subclades in low transmission settings [31]. In particular, when analysing those isolates within a genomic distance equivalent of <20,000 SNPs (0.74), we found a high correlation (Pearson's r 2 = 0.98) between distances based on the 1,173 tagging and 720k genome-wide SNPs. This analysis revealed that the 1,173 high frequency tagging SNPs can not only detect strong inter-border differentiation but potentially identify highly related isolates within the same country based on genomic distance.
Application of the random forest classification approach to the classification of country involved constructing 500 trees, partitioning the dataset randomly into 80% training (n = 346) and 20% for validation (n = 87), and using 34 variables at each nodal split. These default settings have been used previously in P. falciparum genome-wide analysis with success [38]. Classification error rates became stable when 100 trees were averaged ( S3 Fig, B). The final model inside the training set performed with an overall out-of-bag error rate of 17.1%, where the main classification errors were found across Southeast Asian populations (Thailand, Vietnam, Cambodia, and Myanmar). These populations were identified as being highly related using the SNP-based genomic distance (Fig 2, right), supporting previous observations [14]. The random forest model was then used to identify the 60 SNPs with the highest predictive importance across the trees. The 60 SNP cut-off was established using a point of inflection analysis of cumulative predictive importance, where the addition of further SNPs does not significantly improve predictive power ( S3 Fig, B). A further 11 SNPs were chosen to summarise high between-country genetic differentiation based on the fixation index (F ST > 0.7), leading to a final barcoding set of 71 SNPs (Table 1). Within this SNP set, there are differences in allele frequency across the two main regions (Southeast Asia and South America), but no markers were fixed across the populations. The 71 markers are in low linkage disequilibrium (S4 Fig) (LD r 2 : mean 0.15, inter-quartile range 0.02-0.24), but some blocks of correlation are observed, which can be explained by an uneven geographic distribution of the isolates.
In order to assess the potential of the barcode for geographic classification, a PCA was generated using only the 71 SNPs (Fig 3, top), and similar clustering patterns were observed to those using the genome-wide (720k) SNP set (Fig 1). When comparing the genomic distances obtained using the 71 barcoding to 720k genome-wide SNPs in intra-border pair-wise comparisons, we observed a high Pearson's r 2 correlation (0.898), providing evidence for the potential of this barcode to not only identify geographical origin but also provide insights into the relatedness of intra-border isolates (Fig 3, middle). Similarly, a Bland-Altman analysis revealed little overall bias in the estimation of genetic distance between isolates using whole genome and the 71 barcoding SNPs (Fig 3, bottom; average difference of 0.09). Using the 71 barcoding SNPs, there is a trend towards an underestimation of the genetic distance for closely related isolates, and an overestimation for distantly related ones. The potential use of the Table 1. Selection of 71 barcoding SNPs for Plasmodium vivax. The barcode has the ability to predict geographic origin and perform transmission inference. Using SNPs without complete fixation, clustering is observed across populations, and there is an increased number of haplotypes that can be traced and therefore assist the potential identification of transmission events. Countries/regions are included if they have at least two samples.
Chr barcode to infer intra-border relatedness and provide insights into transmission dynamics, is supported by the observation that 98.2% (425/433) of the haplotypes obtained were unique in the dataset. Furthermore, the 71 SNPs were used to predict the geographical source of the 20% of the isolates (n = 87) not used to develop the random forest model. The model using only the 71 SNPs yielded an overall accuracy of 91.4% (8.6% out-of-bag error) in predicting the country source of these 87 isolates and a further 16 Brazilian newly sequenced isolates. The 71 SNPs outperformed a published 42-SNP barcode [33], which under the same random forest model conditions (80%/20% training/prediction and 500 trees) obtained a 77.5% accuracy. The low accuracy of the 42-SNP barcode can also be observed in the ambiguous clustering found in the PCA (S5 Fig, top) and neighbour-joining tree (S5 Fig, bottom). Furthermore, the correlation between genetic distances based on the SNP barcode and genome-wide (720k) SNPs was lower (Pearson's r 2 = 0.59).

In-silico testing of the barcode in a near-elimination setting in Malaysia
A field-ready SNP barcode with the potential for being deployed in low-transmission settings has to be proven efficacious in settings where the identification of foci of infection and imported cases are of key relevance. We used a dataset comprising 60 P. vivax isolates, sourced from Sabah Malaysia, where the population dynamics have been extensively studied using microsatellite markers and whole genome sequencing [31]. In silico characterization using the 71 SNP barcode in a PCA analysis identified two main populations, denoted previously as K1 and K2 (Fig 4A). K2 comprised of 26 almost genetically identical isolates in a known transmission outbreak, previously supported using a set of nine microsatellite markers [31]. The estimated genetic distances between isolates based on the 71 SNPs were strongly correlated with those based on the genome-wide SNPs (Pearson's r 2 = 0.88). The outbreak isolates shared the same haplotype across the barcoding polymorphisms in most cases, with only 3 isolates presenting a one SNP difference. There was one notable exception (ERR1475456) which presented a larger genetic distance (Fig 4A). This isolate shared the same microsatellite haplotype as the outbreak K2, but the genomic distance from the samples in the rest of the outbreak samples was greater, and therefore likely to be of independent origin (Fig 4B). A neighbour-joining tree constructed using the 71 SNPs revealed clustering by administrative division in Sabah Malaysia, indicating its potential for tracking of cases from different health authorities (Fig 4C).

Prospective testing of the barcode using traveller genotype data
The genotypes for the 71 barcoding SNPs were characterised in 132 P. vivax DNA sourced from returning travellers to the UK from ten endemic countries (n = 132; Afghanistan (n = 26), Bangladesh (n = 1), Eritrea (n = 11), Ethiopia (n = 6), Guyana (n = 3), India (n = 38), Pakistan (n = 35), the Philippines (n = 1), Sudan (n = 7) and Uganda (n = 4)). The genotypes were used to construct a combined PCA plot for the 132 prospective and 433 barcode-development isolates, which revealed clear clustering by geographic region (Fig 5). The isolates colocalised in the PCA with the previously determined wider-regional populations, including East Africa, South Asia, Southern Southeast Asia and South America. A PCA based only on prospectively collected data (n = 132) (S6 Fig) revealed a clear regional pattern, but no strong clustering at a country level. The country source was predicted using the 71 SNP barcoding genotypes for each prospective isolate. An 80.1% country-level prediction was obtained for those isolates with countries represented in the training data (Ethiopia and India). For validation isolates that were sourced from countries in Central and South Asia (Afghanistan and Pakistan) and East Africa (Sudan, Uganda and Eritrea) that were not included in the SNP barcode, the predictions reflected the nearest countries included in the development process.

Discussion
Plasmodium vivax accounts for a significant proportion of the global malaria burden, with the greatest incidence outside of sub-Saharan Africa [2]. The resilience of the parasite is evidenced by its re-appearance in regions where malaria transmission had previously been halted [9]. Microsatellite genotyping has been used to study P. vivax genomic and population dynamics, although it underestimates the total variability in natural populations [39]. Whole genome sequencing is the gold standard approach but is currently logistic-and cost-inefficient for large genomic epidemiological studies, especially in under-resourced endemic areas. We have a developed in-silico a 71 SNP barcode, informed by whole genome sequencing data from 433 isolates across 17 different countries worldwide. To date, barcodes in Plasmodium species have each been designed for specific roles. Some barcodes have focused on the determination of the geographical origin of the samples using nuclear or organellar genomic markers [29,32], others on measuring transmission intensity using the frequency of unique haplotypes [28,33] or the complexity of the infections [30]. Further, P. vivax barcodes have been based on small datasets featuring fewer geographic populations [34]. As more datasets become available genotyping tools such as our barcode will have global applicability. In fact, application of a 42-SNP barcode [33] to our dataset led to a lower accuracy (77.5%) when predicting origin of the isolates at a country level. The 42-SNP barcode also performed sub-optimally when estimating SNPbased relatedness of isolates, and therefore may not be suited to the inference of local transmission networks. Our barcode was constructed by recognising that common SNPs are more robust markers and harbour greater explanatory power for both geographical and transmission inference, as demonstrated previously in human genetic studies [40]. By triaging SNPs using established LD tagging methods, it was possible to apply random forest methodology to select 60 markers, augmented by 11 inter-continental SNPs, to accurately predict country of source (91.4%). Whilst there are alternatives to the random forest algorithm, this approach has become an established data analysis tool in bioinformatics, with reported high performance in settings where the number of variables is much larger than the number of observations [38]. The methodology has an ability to explore complex interactions between correlated SNPs, and return useful measures of their predictive importance [41,42]. The set of "important" SNP predictors of country determined by the random forest approach was robust to initial model parameterisation, and the final model and set of SNPs were validated on a 20% subset of the original samples augmented by prospectively collected ones from Brazil. The 71 SNP barcode and intermediate SNP sets informing its construction were used to reconstruct principal component analysis plots. These plots were consistent with the overall population structure based on 720k SNP markers, and therefore confirm there was no major loss of geographical specificity.
Our 71 SNP barcode is the first that shows such strong levels of accuracy in geographic prediction at a country level in P. vivax for a diverse dataset, making it a valuable tool for the detection of imported cases of malaria. As whole genome data becomes available, especially from sites with currently poor coverage such as central America, Africa and South Asia, the machine learning approach can be used to update the SNPs in the barcode. The barcode was also designed to be able to provide information about potential transmission trends and therefore be useful in field settings, where large genomic differences are less likely, with the exception of imported cases. The proportion of unique haplotypes identified across the dataset was high (98.2%; 425/433 haplotypes observed), which allows greater scope for informing on intraborder haplotype diversity, including low diversity such as in an outbreak setting. A simulation study using P. falciparum genome sequencing and P. vivax microsatellite data has estimated that 200 biallelic markers should be used for IBD analysis [21]. In our study, using whole genome sequencing data on P. vivax we demonstrate the potential of a 71 SNP barcode to estimate genetic distance (SNP differences) as a measure of relatedness and to predict geographic origin to the country level. Further work could be aimed at assessing and expanding this set of markers and explore its potential use for IBD analysis specifically focusing on intra-border transmission. This potential utility for transmission characterisation was demonstrated by the use of intra-border highly related isolates, where we confirmed that the SNP distances based on the 71-SNP barcode and genome-wide 720k markers were highly correlated, and represent an improvement on comparative reported values for microsatellites (Pearson's r 2 = 0.70) [39]. Further, the use of the 71 SNP barcode was validated in silico using data from 60 P. vivax sourced from a low endemic and near-elimination setting in Malaysia, where it was possible to partition the population into highly structured subclades and confirm the presence of an outbreak cluster, previously identified using both microsatellite genotyping and sequencing. Also, by using the 71-SNP barcode we identified a misclassified "outbreak" isolate with an identical microsatellite haplotype, and evidence of regional clustering by the different administrative divisions, further supporting the utility of the tool and its superior performance to microsatellite genotyping. One of the challenges when studying P. vivax infections is the difficulty to differentiate recrudescent infections from reinfections or relapses [43][44][45][46]. Although, we have not explored the ability of our barcode to differentiate between such cases, based on the results from the Malaysian setting we anticipate it would at least be able to identify cases of relapse caused by meiotic siblings [44]. This is because of the high degree of similarity expected in such cases already picked up in clonal expansions, although specific work would be needed to address this challenge.
The barcode was also tested on 132 newly assessed isolates sourced from travelers to 10 different countries across the main P. vivax endemic regions, revealing that the 71 SNPs have a strong regional discriminatory power, and achieved an 80.1% overall accuracy in classifying the isolates at a country level for the countries represented in the barcode development. The PCA performed using the 71 SNPs showed that isolates originating from geographically distant regions (e.g. South Asia) clustered, and demonstrated the ability of the barcode to identify regional signatures even for previously unstudied locations. Nevertheless, the next step in order to improve our barcode would be the incorporation of the whole genome sequencing data from these isolates and future collections. Further, it is possible to genotype the barcoding SNPs using standard TaqMan genotyping assays or PCR followed by high resolution melting analysis, as previously described for other malaria SNP barcodes [27,33]. It is also possible to apply amplicon sequencing using a portable sequencer, such as the MinION nanopore platform linked to a laptop computer. This approach was applied recently to genotype P. falciparum parasites [47].
In summary, we have presented a new in-silico molecular barcode for P. vivax that can provide information on both geographical origin and identify highly related isolates within country borders to help infer transmission events and identify foci of infection. The 71 SNP barcode out-competes previous genotyping methods and is a powerful and potentially affordable solution that could enable the execution of large genomic epidemiological studies, with high throughput assessment of large numbers of parasites. By leveraging off growing and large datasets of whole genome sequencing data, and the power of machine learning algorithms, it will be possible to update the barcode, augment it with drug resistance markers, and implement it rapidly in field-based settings using portable technologies. Ultimately, insights into genetic diversity will assist the much-needed understanding of the dynamics of P. vivax populations and inform disease control decision making.

Genomic data generation
Illumina sequenced data from previously published studies [14][15][16] was downloaded from ENA repository to form a total dataset of 834 isolates. Some of these data are from the Malaria-GEN P. vivax Genome Variation project. Each isolate data was mapped against the PvP01_v1 reference (obtained from http://genedb.org) using bwa-mem [47]. SNPs (n = 1,522,046) were called from the resulting alignments using the samtools software suite [48], as previously described [34]. Isolates and SNPs were excluded if they presented with > 20% missing or heterozygous genotype calls, and additional SNPs were removed if located within hypervariable gene regions (e.g. vir genes). The final dataset consisted of 720,340 high-quality SNPs and 433 isolates.

Population structure and tag SNP selection
The 720,340 high-quality biallelic SNPs and its subsets were used to infer distance matrices by calculating Manhattan distances ("identity-by-state") between samples. These genetic distances where divided by the sum of the allele frequencies of the SNPs included in their calculation, in order to make the units comparable. These matrices were used to generate principal component analysis (PCA) plots and neighbour-joining trees (R ape package [49]). The Pearson's r 2 metric, calculated using the R base function cor, was used to estimate the correlation between distance matrices. A Bland-Altman analysis comparing genetic distances based on different sets of SNPs was performed using the R BlandAltmanLeh library. The software TAGster [37] was used to identify SNPs which summarise blocks of high linkage disequilibrium, as estimated using the genetic r 2 metric. We specified an r 2 threshold of at least 0.7 for inclusion in a block and a window size of 500 kbp, leading to 16,110 SNPs with minor allele frequency > 0.3 being included for analysis. The resulting 1,173 tag SNPs identified were then further characterized for downstream analysis. The Fixation index (F ST ) was calculated using in house R scripts and a threshold of F ST > 0.7 was used to determine population-level informative barcoding SNPs [25,34]. A flow diagram summarising the selection process of the SNPs is presented in S7 Fig.

Barcode SNP selection using a random forest approach
The selected 1,173 highly informative SNPs obtained using a combination of minor allele frequency and SNP tagging approaches were extracted from the original dataset. Allele imputation was performed on the missing data points (4.2%) in the dataset using the R mis-sForest package [50]. This yielded an estimated out-of-bag error in the imputation of 16.2% which left a total of 0.7% potentially erroneous calls. After imputation, a random selection of 80% of the dataset was assigned as a training set and the remaining 20% was assigned as a test dataset. Source country was tested as the predicted variable. Subsequently, five hundred trees were calculated using the RandomForest [51] package in R in order to determine the SNPs in the dataset with highest importance for classification of samples into countries. We then selected the 60 SNPs with highest importance and used the R LDcorSV package [52] to calculate the correlation between the markers. PCA plots and neighbour-joining trees were constructed for this subset of SNPs. The final subset of 71 SNPs was then used to retrain the random forest model on the training set, and this model was applied for the prediction of the country of origin in the validation dataset.
Validation of the barcode using UK traveller isolate genotype data DNA was extracted from blood samples sourced from 132 returning travellers to the UK from endemic areas between 2017 and 2018. The 132 samples were sourced from travellers to Afghanistan (n = 26), Bangladesh (n = 1), Eritrea (n = 11), Ethiopia (n = 6), Guyana (n = 3), India (n = 38), Pakistan (n = 35), the Philippines (n = 1), Sudan (n = 7) and Uganda (n = 4). The blood samples are stored in the Public Health England (PHE) Malaria Reference Laboratory (MRL) at the LSHTM. The DNA underwent Illumina MiSeq 150bp paired-end sequencing at the LSHTM. The resulting data was aligned to the P. vivax PvP01_v1 reference genome using bwa-mem (see [34,53] for the bioinformatics pipeline), thereby allowing the calling of the genotypes at the 71 positions (S2 Table). The validation isolates were not used in the construction of the random forest model. The PHE and LSHTM ethics boards provided approval for the sequencing of the P. vivax DNA.
Supporting information S1 Table. The