DivStat: A User-Friendly Tool for Single Nucleotide Polymorphism Analysis of Genomic Diversity

Recent developments have led to an enormous increase in publicly available large genomic data sets, including complete genomes. The 1000 Genomes Project was a major contributor, releasing the results of sequencing a large number of individual genomes and enabling a myriad of large-scale studies on human genetic variation. However, the tools currently available are insufficient for some analyses of data sets spanning more than a few hundred base pairs, particularly when working with haplotype sequences of single nucleotide polymorphisms (SNPs). Here, we present DivStat, a new tool for large data sets that computes a variety of summary statistics of population genetic data and speeds up data analysis.

A user-friendly interface was developed to facilitate use by the research community.
The graphical interface allows the upload of a VCF file or a text file with the genetic data in FASTA format. In addition, a command line version was developed, which accepts a folder containing more than one VCF or text file.

Graphical User Interface version - GUI:
When the user opens the GUI version of DivStat, the program window appears on the screen. First (1st point), the user should upload a file containing the polymorphism data, which can be a VCF file or a text file containing the SNPs and their corresponding position numbers in the complete genome.

Uploading a text file
If the user uploads a text file containing the SNPs and their corresponding position numbers in the complete genome, it should follow the layout of the example files (Figures 2 and 3). Note that the position numbers should be written on the first line of the document, and the following lines should contain the SNP sequences.
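The layout described above can be illustrated with a short reader. This is only a sketch under stated assumptions, not DivStat's own parser: it assumes that the position numbers on the first line are separated by whitespace, that every following line holds one haplotype with exactly one character per position, and that lines starting with ">" are FASTA-style headers to be skipped. The function name read_snp_text is hypothetical.

```python
def read_snp_text(lines):
    """Read a positions line followed by SNP sequences (assumed layout)."""
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        raise ValueError("empty input")
    # First non-empty line: position numbers (assumed whitespace-separated).
    positions = [int(token) for token in rows[0].split()]
    haplotypes = []
    for row in rows[1:]:
        if row.startswith(">"):  # assumed FASTA-style header line, skipped
            continue
        if len(row) != len(positions):
            raise ValueError("sequence length does not match number of positions")
        haplotypes.append(row)
    return positions, haplotypes


# Made-up example content; "-" stands for a missing allele.
example = """\
5221930 5222100 5223050 5224999
ACGT
ACGA
TCG-
"""
positions, haplotypes = read_snp_text(example.splitlines())
```

Checking that every sequence has as many characters as there are positions catches misaligned input before any statistic is computed.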
On the 2nd and 3rd points, the user should define a set of parameters, namely the start and end positions of the haplotype sequences, the window size (defined by number of base pairs or by number of segregating sites), and the window increment. For example, given a window size of n base pairs and a start position p, the program computes the chosen statistics within the window [p..p+n-1], using only the SNP positions that fall within this interval. If the window increment is v, the next computation is performed after sliding the window by v base pairs, i.e., in the window [p+v..p+v+n-1].
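The window arithmetic above can be sketched as follows. The function names are hypothetical, and the sketch makes two assumptions that the text does not settle: only windows that fit entirely between the start and end positions are produced, and the SNP positions are sorted in ascending order.

```python
from bisect import bisect_left, bisect_right


def sliding_windows(start, end, size, increment):
    """Yield inclusive windows [p..p+size-1], moving p by `increment`."""
    p = start
    while p + size - 1 <= end:  # assumption: partial windows are dropped
        yield p, p + size - 1
        p += increment


def snp_indices(positions, first, last):
    """Indices of the SNP positions that fall within [first..last]."""
    lo = bisect_left(positions, first)
    hi = bisect_right(positions, last)
    return range(lo, hi)


# Made-up SNP positions, window size n = 20 and increment v = 10.
positions = [100, 105, 112, 130, 149]
windows = list(sliding_windows(100, 149, 20, 10))
per_window = [list(snp_indices(positions, a, b)) for a, b in windows]
```

Because the positions are sorted, the binary search returns the SNPs of each window without scanning the whole position list.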
On the 4th point, the user should indicate whether or not the data set contains missing data. If it does, the user should indicate the symbol used to represent it.
Finally (5th point), the user should select or deselect the statistics to be computed. Note that the haplotype frequency computation does not take the defined window parameters into consideration.
[Figure caption, partially recovered: ... data, which is identified by the symbol "-", and the required statistics are not computed per population. In both cases, the window is defined by number of base pairs.]
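To make the window statistics more concrete, the sketch below computes three of the statistics that the tool lists (S, π and Tajima's D) for the haplotypes of a single window. It uses the standard textbook definitions: S is the number of segregating sites, π is the mean number of pairwise differences, and D follows Tajima's 1989 formula. Whether DivStat implements exactly these expressions, and how it treats missing data, is not stated in the text, so this is an illustration that assumes no missing data. The function name window_statistics is hypothetical.

```python
from collections import Counter
from math import sqrt


def window_statistics(haplotypes):
    """Return S, pi and Tajima's D for equal-length haplotype strings."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sequences are required")
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    differing_pairs = 0.0
    for column in zip(*haplotypes):  # iterate site by site
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs carrying different alleles at this site.
            differing_pairs += (n * n - sum(c * c for c in counts.values())) / 2
    pi = differing_pairs / n_pairs

    tajima_d = None  # undefined without segregating sites or with n < 4
    if seg_sites > 0 and n >= 4:
        a1 = sum(1 / i for i in range(1, n))
        a2 = sum(1 / (i * i) for i in range(1, n))
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
        e1 = c1 / a1
        e2 = c2 / (a1 * a1 + a2)
        variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
        if variance > 0:
            tajima_d = (pi - seg_sites / a1) / sqrt(variance)
    return {"S": seg_sites, "pi": pi, "tajima_d": tajima_d}


# Made-up window with four sequences and three sites.
stats = window_statistics(["AAT", "AAT", "ACT", "GCT"])
```

Here π is reported per window rather than per site; dividing by the window length would give a per-site value.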
The haplotype frequencies are saved in independent files, while all window statistics are saved in the same file. Furthermore, the output comprises one file per population and a global file (containing the statistics computed for all sequences considered as a single group). Examples of the output can be seen in the accompanying figures; the example files shown correspond to statistics computed globally. The statistics calculated for specific populations produce similar output files.
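The split between per-population results and a global result can be sketched for haplotype frequencies as follows. This is a minimal illustration, not DivStat's implementation: the function names, the "GLOBAL" key and the population labels are made up for the example, and the output file layout is not reproduced.

```python
from collections import Counter, defaultdict


def haplotype_frequencies(haplotypes):
    """Relative frequency of each distinct haplotype."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {hap: count / total for hap, count in counts.items()}


def frequencies_by_group(haplotypes, populations):
    """Frequencies per population plus one entry for all sequences together."""
    groups = defaultdict(list)
    for hap, pop in zip(haplotypes, populations):
        groups[pop].append(hap)
    result = {pop: haplotype_frequencies(haps) for pop, haps in groups.items()}
    result["GLOBAL"] = haplotype_frequencies(haplotypes)  # hypothetical key
    return result


# Made-up haplotypes and three-character population labels.
haps = ["ACGT", "ACGT", "TCGT", "ACGA"]
pops = ["AAA", "AAA", "BBB", "BBB"]
freqs = frequencies_by_group(haps, pops)
```

Each haplotype is counted over its full length, which matches the statement that the frequency computation ignores the window parameters.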

Uploading a VCF file
If the user uploads a VCF file, a new tab appears on the screen. This tab is triggered when the user chooses to upload a VCF file in the "Statistics" tab in order to compute the statistics mentioned in the previous section. Nevertheless, it can be used whenever the user needs to convert a VCF file into a text file.
On the 2nd point of this tab, the user should specify the ploidy of the data, so that the VCF file is read and converted into the text file correctly.
On the 3rd point, the user should indicate whether or not the population information should be considered. If so, the user should upload a text file with the population information. More precisely, the file should contain the identifier of each individual sample followed by the corresponding population, indicated by a string of three characters. Otherwise, all sequences are considered as belonging to the same population.
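A population file of this kind could be read as sketched below. The exact layout is not recoverable from the text, so the sketch assumes one sample per line, with the sample identifier and the three-character population code separated by whitespace. The function name and the sample identifiers are made up.

```python
def read_population_file(lines):
    """Map each sample identifier to its three-character population code."""
    sample_to_population = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected a sample and a population")
        sample, population = fields
        if len(population) != 3:
            raise ValueError(f"line {number}: population code must have 3 characters")
        sample_to_population[sample] = population
    return sample_to_population


# Made-up sample identifiers and population codes.
example = ["sample_01 AAA", "sample_02 AAA", "", "sample_03 BBB"]
population_of = read_population_file(example)
```

Rejecting codes that are not three characters long catches swapped columns early.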
On the last point, the user should indicate the total number of polymorphic positions stored in the input file.
The output of this operation is a text file with the data in FASTA format, similar to those shown in Figures 2 and 3 of the previous section (without and with missing data, respectively).
After file conversion, the user can proceed to the computation of the statistics.
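The role of the ploidy in the conversion can be illustrated with a small pure-Python sketch. It is not DivStat's converter. It assumes tab-separated VCF lines whose FORMAT column starts with GT, single-nucleotide alleles only, and genotypes whose allele order can be read as haplotypes (unphased "/" calls are simply treated like phased "|" calls). The function name vcf_to_haplotypes and the example records are made up.

```python
def vcf_to_haplotypes(vcf_lines, ploidy=2, missing_symbol="-"):
    """Return positions, sample names and one haplotype string per chromosome copy."""
    positions, samples, haplotypes = [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):  # meta-information lines
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):  # header line: sample names from column 10 on
            samples = fields[9:]
            haplotypes = [[] for _ in range(len(samples) * ploidy)]
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF, then ALT alleles
        positions.append(int(fields[1]))
        for s, call in enumerate(fields[9:]):
            genotype = call.split(":")[0].replace("/", "|").split("|")
            if len(genotype) != ploidy:
                raise ValueError(f"genotype {call!r} does not match ploidy {ploidy}")
            for h, index in enumerate(genotype):
                base = missing_symbol if index == "." else alleles[int(index)]
                haplotypes[s * ploidy + h].append(base)
    return positions, samples, ["".join(h) for h in haplotypes]


# Made-up diploid example with two samples and two SNPs.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"]
records = [
    ["11", "5221930", ".", "A", "G", ".", "PASS", ".", "GT", "0|1", "1|1"],
    ["11", "5222100", ".", "C", "T", ".", "PASS", ".", "GT", "0|0", ".|1"],
]
vcf_lines = ["##fileformat=VCFv4.2", "\t".join(header)] + ["\t".join(r) for r in records]
positions, samples, haplotypes = vcf_to_haplotypes(vcf_lines, ploidy=2)
```

Each sample contributes as many haplotype strings as its ploidy, which is why the ploidy must be known before the file can be converted.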

Command Line version - cmd:
If the user needs to analyze more than one file, the cmd version is the most suitable option. In this case, the user should create a folder containing all text or VCF files to be analyzed, for instance "folder", and then open the file "call.py" to set the parameters used to analyze all files. The parameters to be defined are:
- window_def: states whether the window is defined by "number of base pairs" or "number of segregating sites";
- statistics: dictionary with the required statistics, which are marked as "YES" (those marked as "" are not computed);
- input_way: path to the folder with the text or VCF files to be analyzed by the software;
- general_name: general name of the text or VCF files in the input folder. All files in the folder should have the same prefix, followed by consecutive numbers, for example, "example_1.txt", "example_2.txt", "example_3.txt", etc. In this case, the general name is "example_";
- output_way: path to the folder where the output files should be saved;
- Pop: "YES" to compute the statistics by population and "NO" otherwise;
- general_name_pop_file: general name of the text files in the input folder with the population information. All files in the folder should have the same prefix, followed by consecutive numbers, for example, "Population_1.txt", "Population_2.txt", "Population_3.txt", etc. In this case, the general name is "Population_". Note that this parameter is only required when the parameter type_file is set to ".vcf" and Pop is set to "YES";
- MD: "YES" if the files have missing data and "NO" otherwise;
- symbol: symbol used in the files for missing data (only needed when MD="YES");
- num_runs: number of files in the input folder that should be analyzed.
The software runs once for each file.
The Python file "call.py" is similar to the one shown in Figure 10.
Figure 10. Python file named "call.py" in which the user should define the parameters to perform the analysis. Here, the user has VCF files with 1707 polymorphic positions and diploid genomes. The user defined the start and end positions of the input haplotype sequences of all files as 5221930 and 5225700, respectively; the window size (defined by number of base pairs) and the window increment are 2000 and 150, respectively; the window statistics required are S, Haplotype Number, Haplotype diversity, π and Tajima's D; the files to be analyzed are in the folder named "folder" and each file has a name with the prefix "example_"; the output should be saved in the same folder "folder"; the statistics should be computed per population, with the population information stored in files whose names have the prefix "Population_"; and the files have missing data indicated by "-". Note that the number of files in the input folder "folder" is 10; thus, DivStat runs 10 times with the defined parameters, once for each file in the folder.
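The file naming convention can be illustrated with a short sketch that lists the files expected for the settings described in the Figure 10 caption. It does not reproduce the actual contents of "call.py": the dictionary only gathers the parameter names that appear in the text, the statistics dictionary and the window settings are left out because their exact names are not given, and the helper numbered_files is hypothetical. The numbering is assumed to start at 1, as in the "example_1.txt" example.

```python
# Parameter values taken from the Figure 10 description; the dictionary
# itself is only an illustration, not the structure used by call.py.
params = {
    "type_file": ".vcf",
    "window_def": "number of base pairs",
    "input_way": "folder",
    "general_name": "example_",
    "output_way": "folder",
    "Pop": "YES",
    "general_name_pop_file": "Population_",
    "MD": "YES",
    "symbol": "-",
    "num_runs": 10,
}


def numbered_files(folder, prefix, extension, count):
    """Paths <folder>/<prefix><i><extension> for i = 1..count (assumed numbering)."""
    return [f"{folder}/{prefix}{i}{extension}" for i in range(1, count + 1)]


data_files = numbered_files(
    params["input_way"], params["general_name"], params["type_file"], params["num_runs"]
)
# Population files are text files and are only needed for VCF input with Pop="YES".
population_files = []
if params["type_file"] == ".vcf" and params["Pop"] == "YES":
    population_files = numbered_files(
        params["input_way"], params["general_name_pop_file"], ".txt", params["num_runs"]
    )
```

Listing the expected paths before starting a run makes it easy to spot a missing or misnumbered file.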