VivaxGEN: An open access platform for comparative analysis of short tandem repeat genotyping data in Plasmodium vivax populations

Background The control and elimination of Plasmodium vivax will require a better understanding of its transmission dynamics, through the application of genotyping and population genetics analyses. This paper describes VivaxGEN (http://vivaxgen.menzies.edu.au), a web-based platform that has been developed to support P. vivax short tandem repeat data sharing and comparative analyses. Results The VivaxGEN platform provides a repository for raw data generated by capillary electrophoresis (FSA files), with fragment analysis and standardized allele calling tools. The query system of the platform enables users to filter, select and differentiate samples and alleles based on their specified criteria. Key population genetic analyses are supported including measures of population differentiation (FST), expected heterozygosity (HE), linkage disequilibrium (IAS), neighbor-joining analysis and Principal Coordinate Analysis. Datasets can also be formatted and exported for application in commonly used population genetic software including GENEPOP, Arlequin and STRUCTURE. To date, data from 10 countries, including 5 publicly available data sets, have been shared with VivaxGEN. Conclusions VivaxGEN is well placed to facilitate regional overviews of P. vivax transmission dynamics in different endemic settings and can be adapted for similar genetic studies of P. falciparum and other organisms.


Introduction
In the Asia-Pacific region, Plasmodium vivax is responsible for between 20 and 280 million malaria cases per year, inflicting a significant burden of morbidity and mortality. Over the last decade, the prevalence of P. falciparum has declined in many endemic countries as a result of intensified malaria control interventions, but outside Africa this has been associated with a rise in the proportion of P. vivax cases, reflecting the limited efficacy of interventions against this species [1]. This trend emphasizes the need for innovative new strategies to reduce P. vivax transmission. A critical weakness of conventional malaria surveillance is the lack of information on the genetic dynamics of the parasite population, an important reflection of underlying transmission potential. Previous studies have demonstrated the utility of genotyping parasite population samples at highly polymorphic short tandem repeat (STR) markers such as microsatellites to inform on P. vivax diversity, population structure and underlying transmission patterns [2–19]. These simple molecular approaches complement the more traditional measures of transmission intensity as well as providing a surrogate marker for transmission intensity, informing on outbreak dynamics, reservoirs of infection, and the spread of infection within and across borders [20,21].
However, individual projects have limited potential to address regional questions. The challenges of imported and border malaria associated with highly mobile human populations emphasize the need for a framework to support integrated, multinational comparative analyses. Effective comparison between studies and sites has been confounded by heterogeneity of methodologies such as the number and location of markers used, size standards, allele calling/binning, and specifications for calling minor alleles reflecting minor clones in polyclonal infections [22]. To address some of these challenges, the Vivax Working Group (VxWG) of the Asia Pacific Malaria Elimination Network (APMEN) has worked with research partners in 15 Asia Pacific countries to develop a consensus panel of STR markers (MS1, MS5, MS8, MS10, MS12, MS16, MS20, pv3.27 and msp1F3) and genotyping methods [23]. The web-based VivaxGEN platform was developed to facilitate standardized allele calling, data analysis and sharing across P. vivax studies using consensus STR marker sets such as the APMEN panel. The VivaxGEN platform provides a repository for FSA files (the primary data files containing the raw fragment analysis data generated during capillary electrophoresis runs). To date, no such repository exists for P. vivax STR data. The capacity to derive allelic data directly from the FSA files ensures high accuracy and standardization in allele calling between different sample batches produced at different time points and/or on different machines, possibly from different studies. This feature also supports flexibility in defining allele-calling thresholds, enabling user-defined settings that may be applied to one or more sample batches.
The VivaxGEN platform also provides tools for standard population genetic analyses that can be applied to multiple sample batches to evaluate local and regional trends in the prevalence of polyclonal infections, population diversity, structure and differentiation both spatially and temporally. Data export tools are available to allow users to conduct more bespoke analyses not provided within the platform framework.

Ethics statement
All genotyping data described in the manuscript has been published [4,9,12,14,34]. As described in the original publications, all samples were collected with written informed consent from the patient, parent or legal guardian (individuals

System architecture and implementation
The VivaxGEN platform was developed as a multi-tier web application system, utilizing PostgreSQL as its backend Relational Database Management System (RDBMS) and leveraging several common external tools for genotype data analysis. PostgreSQL was chosen as the RDBMS as it provided ACID operations and complex SQL query optimization in an open-source package. The backend is programmed in Python, while the web interface uses JavaScript and the jQuery library for interactivity. YAML was chosen as the format for platform configuration and data exchange/interoperability. Sample and assay data can be uploaded either by batch processing, using tab- or comma-delimited files in conjunction with a zip file containing raw FSA files, or interactively, using the sample and assay editing interface. Detailed instructions on data upload, and an accompanying tutorial dataset, can be found in Tutorial 1 (Uploading your metadata and FSA files) provided on the VivaxGEN website and in S1 File.

Integrated fragment analysis tools
VivaxGEN provides a framework to store and process raw FSA files with standardized allele calling tools. This framework reduces the heterogeneity that may be introduced from different fragment analysis methods. A Python-based library called FATOOLS, which can also be used as a stand-alone command line utility, was developed to provide the raw FSA processing capabilities in VivaxGEN. This library utilizes the numpy (www.numpy.org) and scipy (https://www.scipy.org) scientific libraries for its numerical processing. The library provides methods for base normalization of traces, peak scanning and classification, standard size determination, peak calling and allele annotation, as well as FSA assay quality controls. A detailed guide on the FSA fragment analysis process in VivaxGEN can be found in the Guide on Fragment Analysis manual provided on the website and in S2 File. Briefly, base normalization is undertaken using a top-hat morphological transform algorithm implemented in scipy. A simple peak finding algorithm and a CWT-based peak scanning algorithm implemented in scipy are also included in the library [24]. A combination of greedy algorithm and dynamic programming is employed for standard size alignment and size determination. Results of each step of the FSA and fragment analysis processing are stored in the system for aiding manual inspection and assay verification. The source code for FATOOLS is available for stand-alone usage and further development (http://github.com/trmznt/fatools). To aid the manual inspection of traces, a trace viewer is included in the web interface, as shown in Fig 1. Detailed instructions on the manual data editing tools can be found in Tutorial 2 (Inspecting FSA files and data cleaning) provided on the VivaxGEN website and in S1 File. The trace viewer is coded in JavaScript and enables users to identify and edit incorrectly annotated alleles.
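The base normalization and peak scanning steps can be illustrated with a minimal sketch on a synthetic trace. This is not the FATOOLS code: the function names, the window size and the height threshold are assumptions chosen for illustration, and only the scipy calls (`scipy.ndimage.white_tophat`, `scipy.signal.find_peaks`) are real library functions. A CWT-based scan could be substituted using `scipy.signal.find_peaks_cwt`.

```python
import numpy as np
from scipy.ndimage import white_tophat
from scipy.signal import find_peaks


def normalize_baseline(trace, window=101):
    """Remove a slowly varying baseline with a white top-hat transform.

    The window (in scan points) is a hypothetical value; it must be wider
    than the widest real peak so that peaks survive the transform.
    """
    return white_tophat(np.asarray(trace, dtype=float), size=window)


def scan_peaks(trace, min_height=50.0, min_distance=5):
    """Return (scan index, height) pairs for simple local maxima."""
    idx, props = find_peaks(trace, height=min_height, distance=min_distance)
    return list(zip(idx.tolist(), props["peak_heights"].tolist()))


# Synthetic electropherogram: two Gaussian peaks on a drifting baseline.
x = np.arange(1000)
baseline = 200 + 0.1 * x
signal = (1000 * np.exp(-0.5 * ((x - 300) / 3) ** 2)
          + 400 * np.exp(-0.5 * ((x - 700) / 3) ** 2))
raw = baseline + signal

flat = normalize_baseline(raw)
peaks = scan_peaks(flat)
for position, height in peaks:
    print(f"peak at scan {position}, height {height:.0f} RFU")
```

In a real pipeline the scan indices would then be converted to fragment sizes in base pairs by aligning the size-standard channel, which this sketch does not attempt.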

Tools for allele and sample filtering
The form-based web interface also provides a number of allele and sample filtering options. Details on the allele and sample filtering tools can be found in Tutorial 3 (Data analysis) provided on the VivaxGEN website and in S1 File. Alleles can be filtered according to marker name (Marker), marker failure rate in the given sample set (Marker quality threshold), absolute minimum relative fluorescence unit (RFU) (Allele absolute threshold) and relative RFU of minor peaks compared to the highest intensity peak (Allele relative threshold). Suspected stutter peaks can also be filtered according to a user-defined stutter range in base pairs (Stutter range) and ratio (Stutter ratio) based on the RFU relative to the highest intensity peak in the given range. Samples can also be filtered according to genotyping success rate across the given marker set (Sample quality threshold), to exclude polyclonal infections or multi-locus genotypes that are presented more than once in the given sample set (Sample filtering), or by passive versus active case detection (Detection differentiation).
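The way these allele-level thresholds interact can be sketched as follows. This is an illustrative reading of the filters described above, not the platform's implementation: the `Peak` class, the function name, the default values and the order in which the filters are applied are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    size_bp: float  # fragment size in base pairs
    rfu: float      # peak height in relative fluorescence units


def filter_alleles(peaks, abs_threshold=100.0, rel_threshold=0.33,
                   stutter_range=3.5, stutter_ratio=0.5):
    """Apply absolute, relative and stutter filters to one marker's peaks.

    All default values are hypothetical, chosen only for the example.
    """
    # Allele absolute threshold: drop peaks below a minimum RFU.
    kept = [p for p in peaks if p.rfu >= abs_threshold]
    if not kept:
        return []
    # Allele relative threshold: compare each peak with the tallest peak.
    top = max(p.rfu for p in kept)
    kept = [p for p in kept if p.rfu >= rel_threshold * top]
    # Stutter filter: drop a peak if a taller neighbour lies within
    # stutter_range bp and the peak is below stutter_ratio of that neighbour.
    result = []
    for p in kept:
        is_stutter = any(
            q is not p
            and abs(q.size_bp - p.size_bp) <= stutter_range
            and p.rfu < stutter_ratio * q.rfu
            for q in kept
        )
        if not is_stutter:
            result.append(p)
    return sorted(result, key=lambda p: p.size_bp)


example = [Peak(250.1, 5000), Peak(247.0, 1800), Peak(262.3, 2500),
           Peak(270.0, 80), Peak(280.2, 900)]
called = filter_alleles(example)
print([p.size_bp for p in called])
```

With these example values, the 270.0 bp peak fails the absolute threshold, the 280.2 bp peak fails the relative threshold, and the 247.0 bp peak is treated as stutter of the 250.1 bp peak, leaving two called alleles.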

Sample query system
Sample querying and grouping can be performed using a query syntax modeled on the NCBI Entrez system with some modification. Detailed instructions on how to perform data analysis using custom queries are provided in Tutorial 4 (Data analysis with custom query) on the VivaxGEN website and in S1 File. Boolean operations can be applied to classify sample groups based on spatial (by country level or by 1st, 2nd, 3rd or 4th administrative division level) or temporal (by year or quartile of sample collection) definitions. The query from the form-based web interface is converted into a YAML-based query internally, which can then be run in the database. An interface that accepts YAML-based queries is also provided, enabling the user to apply bespoke sample grouping operations not supported by the form-based web interface. Instructions on how to perform data analysis in VivaxGEN using YAML queries are provided in Tutorial 5 (Data analysis with YAML query) on the website and in S1 File.

Population genetic tools
A suite of population genetic measures and associated statistical tests that are commonly used in STR-based P. vivax studies to gauge underlying patterns of transmission intensity, stability and boundaries, including rates of polyclonality, population diversity, genetic relatedness, population structure and out-crossing/inbreeding rates, can be applied to the genotyping data from one or more sample batches. Population genetic measures currently supported within and (iv) Principal Coordinate Analysis (PCoA) with plots illustrating the population structure and genetic relatedness between infections based on a genetic distance matrix. External software employed by the platform includes (i) LIAN for measuring linkage disequilibrium (LD) using the index of association (IAS) [25] as a gauge of out-crossing/inbreeding rates, (ii) Arlequin for measures of genetic differentiation between populations using the fixation index (FST) [26], (iii) the APE (Analysis of Phylogenetics and Evolution) package in R for building neighbor-joining trees for assessment of genetic relatedness between infections [27], (iv) the FactoMineR package in R for generating Multiple Correspondence Analysis (MCA) plots to assess population structure and genetic relatedness based on the nominal categorical data [28], and (v) the DEMEtics package in R for calculating the genetic differentiation index D [29,30]. A standardized measure of genetic differentiation, F'ST, adjusted for marker diversity to support greater comparability between studies using different marker sets, is calculated internally in VivaxGEN using the output from Arlequin and following the method described by Hedrick [31]. Further details on the population genetic tools can be found in the Guide on Data Analysis manual provided on the VivaxGEN website and in S2 File.
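As an illustration of one population diversity measure, expected heterozygosity at a single marker is commonly computed in STR-based studies as HE = [n/(n-1)][1 - Σ p_i²], where n is the number of isolates and p_i is the frequency of the i-th allele. The sketch below assumes this formula and one called allele per isolate; the function name and input layout are hypothetical and do not describe the platform's internal code.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity for one marker.

    `alleles` is a list with one called allele (bin label) per isolate.
    Uses the sample-size-corrected form n/(n-1) * (1 - sum of p_i squared).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two isolates are required")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Four isolates typed at one marker: frequencies 0.5, 0.25 and 0.25.
he = expected_heterozygosity([254, 254, 256, 258])
print(round(he, 4))
```

A value of 0 indicates that every isolate carries the same allele, and a value of 1 indicates that every isolate carries a different allele.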

File format conversion module
The VivaxGEN platform has tools for exporting genotype data in several formats supported by other commonly used population genetics software packages including LIAN [25], Arlequin [26], Genepop [32] and STRUCTURE [33]. Tab-delimited formats suitable for R's data frame or Python's pandas data frame are also provided.

Data access policy
VivaxGEN users may choose to keep their data private, to share them with specified researchers only, or to make them open access. The repository currently holds data obtained from published studies on P. vivax samples from China [12], Ethiopia [4], Indonesia [14], South Korea [9] and Bhutan [34]. Private accounts have been generated for users with data sets on P. vivax samples from Iran, Malaysia, Myanmar, and Vanuatu.

Availability
The platform can be accessed at http://vivaxgen.menzies.edu.au. The source code for the platform, licensed under GNU GPL version 3, can be obtained from https://github.com/trmznt/plasmogen.

Results and discussion
The VivaxGEN platform was developed as a framework to support standardized allele calling and greater ease of data sharing for comparative analyses between different STR-based studies in P. vivax. Relative to Single Nucleotide Polymorphisms (SNPs), where a maximum of 4 alleles arising from the 4 different nucleotides are possible at a given position, STRs may exhibit dozens of alleles, measured as different repeat lengths. Although STRs offer high discriminatory potential between independent infections, comparison of STR alleles (fragment size variants) between different sample batches produced at different time points and/or in different laboratories is considerably more challenging than comparison of the discrete allele forms generated from the analysis of SNPs. Despite the application of a size standard, replicates of the same sample may exhibit slight variation (usually less than 1 bp difference) in fragment size. In order to address this variation, alleles can be assigned to bins encompassing a range of fragment sizes usually reflecting the size of the repeat unit. However, whilst one researcher might assign fragment sizes of 254.4 bp and 255.7 bp to two different allele bins such as "254" and "256" respectively, another researcher might assign both alleles to bin "255", and yet another might assign these fragment sizes to allele bin "256", creating artificial differentiation between datasets. As illustrated in Fig 2, the VivaxGEN platform provides a common interface for fragment size allele calling using the raw FSA files and applying a standardized binning system, which facilitates comparability between different datasets. By virtue of this feature, using the VivaxGEN platform, it was possible to identify a distinct, population-specific allele profile at the MS20 locus in South Korea versus Bhutan, Ethiopia and Indonesia (Fig 3). The distinct MS20 allele profile observed in South Korea is postulated to reflect a single major reservoir of P. vivax infections, most likely from North Korea [9]. Future data entries to VivaxGEN on MS20 genotypes from other vivax-endemic regions are likely to provide further important insights on this phenomenon and other transmission patterns.
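The binning problem described above can be made concrete with a short sketch. It assumes evenly spaced bins defined by an anchor size and a repeat unit; the anchor of 256 bp and the 4 bp repeat unit are hypothetical values chosen for the example, and the platform's actual bin definitions are set per marker and may differ.

```python
def assign_bin(size_bp, anchor=256, repeat_unit=4):
    """Assign a fragment size to the nearest bin on a shared grid.

    Bins sit at anchor + k * repeat_unit for integer k. Both parameters
    are hypothetical and would be fixed per marker in practice.
    """
    k = round((size_bp - anchor) / repeat_unit)
    return anchor + k * repeat_unit


sizes = [254.4, 255.7]

# Ad hoc calling by rounding each size independently splits the two
# fragments into different alleles.
naive = [round(s) for s in sizes]

# Calling against one shared bin grid places both in the same allele bin.
shared = [assign_bin(s) for s in sizes]

print(naive, shared)
```

Because every dataset is called against the same grid, two laboratories measuring the same allele at slightly different sizes obtain the same bin label.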
Fig 2. [...] and Ethiopia (B), highlighting the potential for the same alleles to be assigned to different bins in Indonesia ("255") versus Ethiopia ("256"). The standardized binning within the VivaxGEN platform ensured that the ~255 bp alleles in Indonesia and the ~256 bp alleles in Ethiopia were assigned to the same allele bin defined as "256". https://doi.org/10.1371/journal.pntd.0005465.g002
One of the greatest challenges in genotyping Plasmodium samples (and other microorganisms) is the identification and characterization of polyclonal infections [22]. This is owing to artefacts such as background noise, stutter peaks, and overlapping peaks (also known as pull-up peaks or bleed) in multiplex reactions where amplicons are labelled with different fluoresceins. Some of these artefacts may not be automatically detected and excluded from the peak binning during the fragment scanning process. To address this challenge, the VivaxGEN platform provides utilities enabling visual inspection of individual electropherogram traces and editing of allele annotations. The platform also enables user-defined relative minimum RFU thresholds for calling minor alleles: an approach that is commonly applied in STR-based Plasmodium studies to reduce the prevalence of artefact peaks, and enhance comparability in the sensitivity to detect minor peaks in samples of differing quality such as DNA derived from dried blood spots versus blood tubes [35]. Different studies may however apply different thresholds. A benefit of the integrated database and analytical framework in VivaxGEN is that population genetic measures such as the average MOI or proportion of polyclonal infections can be compared between different sample batches at the same user-defined threshold, and indeed multiple different thresholds can be explored.
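The effect of the relative threshold on these summary measures can be sketched with a toy example. The sketch assumes that the multiplicity of infection (MOI) of a sample is the largest number of alleles called at any single marker, which is a common convention but an assumption here; the data layout, function names and RFU values are invented for illustration.

```python
def alleles_called(peaks, rel_threshold):
    """Alleles whose RFU is at least rel_threshold of the tallest peak."""
    if not peaks:
        return []
    top = max(rfu for _, rfu in peaks)
    return [allele for allele, rfu in peaks if rfu >= rel_threshold * top]


def sample_moi(genotype, rel_threshold):
    """MOI taken as the maximum allele count over markers (assumption)."""
    return max((len(alleles_called(p, rel_threshold))
                for p in genotype.values()), default=0)


def summarize(samples, rel_threshold):
    """Return (mean MOI, proportion polyclonal) over genotyped samples."""
    mois = [sample_moi(g, rel_threshold) for g in samples.values()]
    mois = [m for m in mois if m > 0]
    return sum(mois) / len(mois), sum(m > 1 for m in mois) / len(mois)


# Invented data: sample -> marker -> list of (allele bin, RFU).
samples = {
    "S1": {"MS1": [(230, 4000)], "MS20": [(256, 3000), (260, 600)]},
    "S2": {"MS1": [(233, 5000), (236, 2500)], "MS20": [(256, 2000)]},
    "S3": {"MS1": [(230, 3500)], "MS20": [(252, 2800)]},
}

strict = summarize(samples, rel_threshold=0.33)
lenient = summarize(samples, rel_threshold=0.15)
print(strict, lenient)
```

In this toy data the minor MS20 peak in S1 is only called at the lower threshold, so the proportion of polyclonal infections rises from one third to two thirds, which is why batches should be compared at a common threshold.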
Capitalizing on the feature to incorporate samples from multiple studies (batches) within an analytical procedure, we used the platform to compare multi-locus genotypes (MLGs) between different published datasets stored in the database. As illustrated in Fig 4A, Multiple Correspondence Analysis (MCA) demonstrated clear distinction of the MLGs at the 9 APMEN standard markers between Ethiopia, Indonesia and South Korea, whilst the Bhutanese isolates displayed a broad range of MLGs with overlap in both Ethiopia and Indonesia. It is widely acknowledged that different STR markers have different strengths in their ability to detect polyclonal infections and/or to define population structure [36]. Amongst the APMEN panel, 5 markers (MS1, MS5, MS10, MS12 and MS20) have been defined as "stable", with optimal utility for analysis of population differentiation [36]. Therefore, the effect of repeating the analysis using the 5 stable markers was assessed (Fig 4B). The pattern observed was similar to that obtained with the full marker panel, adding assurance that the clustering patterns had not been affected by the high diversity markers.
The integrated data repository, allele calling and data analysis tools in VivaxGEN promote exploratory and semi-interactive analysis in a common web interface. Compared to other popular software packages for processing microsatellite data, VivaxGEN is unique in providing both the capability to process and store raw electropherogram data (FSA files) and to perform statistical and population genetic analysis commonly applied in studies of Plasmodium (Table 1). A data export utility enables population genetic analysis outputs for a given parameter set to be downloaded from VivaxGEN to facilitate data reporting. These features greatly simplify data processing and exploration, and should enable malaria researchers who are new to the field of population genetics to conduct robust data analysis with greater autonomy. The integrated data repository should also foster collaborations between different research institutions and allow analyses on regional trends as well as population differences between countries. The outcomes will inform national malaria control and elimination programs on malaria transmission dynamics, may help distinguish local from imported parasite populations and facilitate malaria surveillance.

Conclusions
The VivaxGEN platform is well placed to facilitate regional overviews of P. vivax population genetic patterns in different endemic settings, informing on the underlying transmission dynamics of this highly adaptive parasite. The system is amenable to being adapted for STR-based analyses in P. falciparum and other microorganisms or other forms of genetic data such as SNP-based genotypes. The open access source code is provided to facilitate developments for such applications.

Acknowledgments
We would like to thank the patients who contributed samples and the health workers who facilitated sample collections.