fcGENE: A Versatile Tool for Processing and Transforming SNP Datasets

Background: Modern analysis of high-dimensional SNP data requires a number of biometrical and statistical methods such as pre-processing, analysis of population structure, association analysis and genotype imputation. Software used for these purposes often relies on specific and mutually incompatible input and output data formats. Therefore, extensive data management including multiple format conversions is necessary during analyses.
Methods: In order to support fast and efficient management and bio-statistical quality control of high-dimensional SNP data, we developed the publicly available software fcGENE in the object-oriented programming language C++. This software simplifies and automates the use of different existing analysis packages, especially during the workflow of genotype imputation and corresponding analyses.
Results: fcGENE transforms SNP data and imputation results into the different formats required by a large variety of analysis packages such as PLINK, SNPTEST, HAPLOVIEW, EIGENSOFT and GenABEL, and by tools used for genotype imputation such as MaCH, IMPUTE, BEAGLE and others. Data management tasks such as merging, splitting and extracting SNP and pedigree information can be performed. fcGENE also supports a number of bio-statistical quality control processes and quality-based filtering processes at the SNP-wise and sample-wise level. The tool also generates templates of the commands required to run specific software packages, especially those required for genotype imputation. We demonstrate the functionality of fcGENE by example workflows of SNP data analyses and provide a comprehensive manual of commands, options and applications.
Conclusions: We have developed the user-friendly open-source software fcGENE, which comprehensively supports SNP data management, quality control and analysis workflows. Download statistics and corresponding feedback indicate that the software is highly recognised and extensively applied by the scientific community.


Introduction
Modern developments in micro-array techniques enable large-scale genome-wide association (GWA) studies comprising thousands or millions of SNPs in thousands of individuals. Statistical methods for analysing GWA data have been further developed in the last decade to address tasks such as principal component analysis (PCA), genotype imputation, haplotype-based analyses and different types of association models. A variety of software packages and environments have been developed to allow corresponding computations even for high-dimensional data. However, these software packages usually require their own specific input and output data formats. As such, GWA analysis poses not only computational and statistical challenges but also the burden of quickly managing different data formats and their transformations. Some of the data transformations required in GWA analyses are described in the following.
PLINK [1], which is now the most popular and computationally efficient software for a variety of GWA analyses, requires a ''ped''-file containing a genotype matrix (one row per individual and one column per SNP) and a ''map''-file that contains information on SNPs (one row per SNP). The software EIGENSOFT [2] is designed to perform PCA, for example to analyse ethnic structures in genetic data. Even though the file format used by EIGENSOFT is very similar to PLINK-formatted ''ped''- and ''map''-files, some minor adaptations of the data format are required. One example is that EIGENSOFT uses ''-99'' to code missing phenotype information while PLINK uses ''-9'' as the default missing code. Haploview, which is also an open-source program, performs linkage disequilibrium (LD) analysis and estimates haplotype population frequencies [3]. This software basically requires two files: (a) a pedigree genotype data file, and (b) a SNP annotation file with columns of marker names and base pair positions. PLINK-formatted pedigree data (''ped'') are accepted by Haploview; however, it still requires an extra file with marker names and position information. Thus, the GWA software mentioned above can only be applied after a data conversion process, especially when genotype data are given in formats other than PLINK.
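To make the ''ped''/''map'' layout concrete, the following Python sketch parses one line of each file. It is illustrative only and is not part of fcGENE; the function names are hypothetical. It assumes the standard whitespace-delimited PLINK text layout, in which a ''ped'' row holds six leading pedigree columns followed by one genotype per SNP written as two alleles, and a ''map'' row holds chromosome, SNP identifier, genetic distance and base pair position.

```python
from typing import Dict, List, Tuple


def parse_map_line(line: str) -> Dict[str, object]:
    """Parse one row of a PLINK map file (one row per SNP)."""
    chrom, snp_id, genetic_distance, bp_position = line.split()
    return {
        "chrom": chrom,
        "snp": snp_id,
        "cm": float(genetic_distance),
        "bp": int(bp_position),
    }


def parse_ped_line(line: str, n_snps: int) -> Dict[str, object]:
    """Parse one row of a PLINK ped file (one row per individual).

    n_snps is the number of rows in the accompanying map file; it is used
    to check that the genotype part of the row has the expected width.
    """
    fields = line.split()
    if len(fields) != 6 + 2 * n_snps:
        raise ValueError("ped row does not match the number of SNPs in the map file")
    fid, iid, pid, mid, sex, phenotype = fields[:6]
    alleles = fields[6:]
    # Each genotype is stored as two consecutive allele columns.
    genotypes: List[Tuple[str, str]] = [
        (alleles[2 * i], alleles[2 * i + 1]) for i in range(n_snps)
    ]
    return {
        "fid": fid, "iid": iid, "pid": pid, "mid": mid,
        "sex": sex, "phenotype": phenotype, "genotypes": genotypes,
    }


# Hypothetical example rows, invented for illustration.
snp = parse_map_line("1 rs123 0 752566")
person = parse_ped_line("FAM1 IND1 0 0 1 -9 A A G T", n_snps=2)
```

A converter for another tool would start from such parsed records and rewrite them in the target layout, for example by replacing the missing phenotype code.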
Genotype imputation has now become a standard process, especially in the case of meta-analyses combining data genotyped on different micro-array platforms [4]. Here, tasks related to data management are of particular concern. These tasks comprise merging multiple, differently formatted genotype data sets with different SNP content and individuals, converting the merged data into the format required by the selected imputation tool, and finally converting the imputation results back into different formats for downstream analysis. A variety of tools such as MaCH [5], IMPUTE [6][7][8], BEAGLE [9], BIMBAM [10,11], PHASE [12] and fastPHASE [13] are available for genotype imputation. All these packages have different input and output formats. Moreover, imputation results are generally available in several output formats, such as most likely genotypes or the genotype probability distribution for each SNP and individual. Some imputation software, such as MaCH, requires one matrix row per individual, while other software (such as IMPUTE) requires one row per SNP. In addition, the software may use one, two or three cells (e.g. ''A/B'', ''A B'', ''0 1 0'') of a genotype matrix to describe a particular genotype. To allow imputation within an acceptable time frame, new strategies suggest chunking chromosomes into small segments with a certain overlap, which can be imputed by serial computations. Applying such strategies requires additional data management, namely division of the main data set into small overlapping chunks and merging of the overlapping imputation results.
Different strategies have been proposed to test for association between imputed genotypes and different traits of interest. The most popular strategies are based on the following three types of imputed genotype data [14]: (a) best-guess genotypes (maximum a posteriori genotype), (b) expected minor allele dosage, and (c) posterior probabilities of the three possible genotypes. PLINK can deal with all three genotype representations but has little support for analyses addressing expected minor allele dosages and individual genotype probabilities. SNPTEST [6][7][8] offers a Bayesian test for the analysis of single-SNP association in GWA studies. The main input file of SNPTEST is similar to the output of IMPUTE, but an additional burden here is to create the required sample covariate file. GenABEL [15] is a package of the statistical software environment R [16]. It is another popular program for easy and fast analysis of genetic data and uses a two-bit format to store genotype data efficiently. Moreover, this package features R functions to integrate IMPUTE-imputed and MaCH-imputed results but has no interface to convert BEAGLE-imputed and BIMBAM-imputed results.
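The relation between the three representations (a) to (c) can be sketched in a few lines of Python. The sketch is illustrative only and is not fcGENE code; the function names and the optional probability threshold are assumptions made for this example. It takes the posterior probabilities of the genotypes AA, AB and BB and derives the best-guess genotype and the expected dosage of allele B.

```python
from typing import Optional, Sequence


def expected_dosage(probs: Sequence[float]) -> float:
    """Expected number of B alleles given (P(AA), P(AB), P(BB))."""
    _, p_ab, p_bb = probs
    return p_ab + 2.0 * p_bb


def best_guess(probs: Sequence[float], min_prob: float = 0.0) -> Optional[int]:
    """Most likely genotype, coded as the number of B alleles (0, 1 or 2).

    Returns None (missing) if the largest posterior probability is below
    min_prob; this threshold is a hypothetical parameter of the sketch.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= min_prob else None


# Hypothetical posterior probabilities for one SNP in one individual.
posterior = (0.05, 0.15, 0.80)
dose = expected_dosage(posterior)   # expected count of allele B
call = best_guess(posterior)        # maximum a posteriori genotype
```

The best guess discards the uncertainty that the dosage and the full probability triplet retain, which is why the choice of representation matters for downstream association tests.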
In summary, state-of-the-art GWA analyses have to deal with differently formatted sets and subsets of high-dimensional genotype data. So far there is little support for generating or converting these different sets and subsets of genotype data. The usual way of dealing with this challenge is to apply self-written Linux shell, Perl or R scripts. However, such private solutions are cumbersome and prone to errors. Additionally, programming skills are necessary to write efficient format-converting scripts. Therefore, there is an obvious need for a user-friendly genotype format conversion and data management tool that can handle two-way format conversions of genotype data in one framework. In this paper, we present and demonstrate the newly developed open-source tool fcGENE (format converting tool for genotype SNP data). This tool includes a variety of functions for the different genotype format conversions sketched above and other useful options to support the process of SNP data management.

Basic Concepts
We developed fcGENE in the object-oriented programming language C++, which allows high-dimensional data sets to be handled quickly. Our aim was to construct fcGENE as a complementary tool to PLINK by developing options for transforming SNP data into the formats required by different tools for GWA analysis. Therefore, the interface and commands of fcGENE are inspired by PLINK commands. For example, just as in PLINK, the PLINK-formatted files ''example.ped'' and ''example.map'' can be read in fcGENE with the Linux-based command ''./fcgene --ped example.ped --map example.map'' or ''./fcgene --file example''. The sequential order of the command options is unimportant throughout. Each option contains a command identifier, and command identifiers begin with ''--''. fcGENE runs under Unix, Linux and Microsoft Windows operating systems; under Windows it can be run either within MINGW [17] or from the statistical software R [16]. For example, PLINK-formatted binary files can be read from within R as follows: system(''./fcgene --bim example.bim --fam example.fam --bed example.bed'').

Software validation
We have extensively checked and validated the correctness of the commands used in fcGENE. All functions related to format conversions were validated by performing multiple closed loops of format conversions, i.e. we converted genotype data from format ''A'' to format ''B'', from format ''B'' to format ''C'', and finally from format ''C'' back to format ''A''. After such a closed loop, we used the Linux/Unix commands ''diff'' and ''sdiff'' to check whether the first file containing the original data in format ''A'' is identical to the result of the multiple conversions. Calculations of quality parameters (call rates, minor allele frequency (MAF), p-values of tests for deviation from Hardy-Weinberg equilibrium) were checked by comparing our results with those of PLINK. Many functions of fcGENE were also checked by repeating the same tasks via self-written Linux shell scripts or R scripts.
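The closed-loop idea can be illustrated with a small Python sketch. It is not the validation code used for fcGENE; it merely mimics the principle with the three genotype notations mentioned in the Introduction (two cells ''A B'', one cell ''A/B'', three cells ''0 1 0''). All function names are hypothetical, and the sketch assumes that heterozygous genotypes are written with the first allele of the SNP first and that ''0'' codes a missing allele.

```python
from typing import List


def two_to_one(geno: str) -> str:
    """Two-cell notation 'A G' -> one-cell notation 'A/G'."""
    first, second = geno.split()
    return f"{first}/{second}"


def one_to_three(geno: str, a1: str, a2: str) -> str:
    """One-cell notation -> indicator triplet for (a1a1, a1a2, a2a2)."""
    alleles = geno.split("/")
    if "0" in alleles:
        return "0 0 0"  # missing genotype: no indicator is set
    count_a2 = alleles.count(a2)
    return " ".join("1" if i == count_a2 else "0" for i in range(3))


def three_to_two(geno: str, a1: str, a2: str) -> str:
    """Indicator triplet -> two-cell notation."""
    cells = geno.split()
    if "1" not in cells:
        return "0 0"
    count_a2 = cells.index("1")
    return " ".join([a1] * (2 - count_a2) + [a2] * count_a2)


def closed_loop(genotypes: List[str], a1: str, a2: str) -> List[str]:
    """Convert format A -> B -> C -> A for every genotype."""
    return [
        three_to_two(one_to_three(two_to_one(g), a1, a2), a1, a2)
        for g in genotypes
    ]


original = ["A A", "A G", "G G", "0 0"]
round_trip = closed_loop(original, a1="A", a2="G")
loop_is_lossless = round_trip == original
```

Comparing `round_trip` with `original` plays the role that ''diff'' plays for whole files: any difference points to a conversion step that loses or alters information.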

Limitations
The functionality of fcGENE is based on the file formats required by current versions of GWA analysis tools. If future versions of any of these tools change their file formats, an update of fcGENE may be necessary to handle the new data formats properly. The current version of fcGENE is not designed for statistical computations and data analyses, except for a few summary statistics and quality control measures necessary for data management.

Results
Functional overview
fcGENE converts genotype data into different formats by loading inputs or outputs of a large variety of software packages such as PLINK, MaCH, IMPUTE, BEAGLE, BIMBAM, PHASE, fastPHASE and SNPTEST. It can also read and write compressed files (ending with ''.gz''). If the original data are given in PLINK format, we can use fcGENE to convert the data into the format of any imputation tool or of other tools useful for genetic data analysis, such as SNPTEST, EIGENSOFT, HAPLOVIEW, GenABEL or VCF-tools. After completion of genotype imputation, fcGENE can be used to transform the imputation outputs back into the formats required for further analysis. A flowchart of possible data conversions is shown in Figure 1. In Figure 1, arrows pointing towards the fcGENE box indicate that the data files of the programs from which the arrows start can be read and uploaded by fcGENE. Similarly, arrows pointing away from fcGENE indicate programs whose files can be generated. fcGENE can convert not only the inputs and outputs of imputation software, but also the imputation reference panels. This type of format conversion is required, for example, if we want to compare study genotypes and imputation reference panels.
Aside from the format-converting tasks, fcGENE offers a number of additional options related to data management, such as filtering SNPs and individuals according to pre-defined cut-offs of quality measures, generating templates of commands to run genotype imputation software, preparing phenotype files for SNPTEST and GenABEL, and creating scripts for EIGENSOFT and GenABEL.

Syntax and command options
fcGENE uses unique command identifiers to recognize command options and the names of files to be loaded. For example, the command identifiers ''--ped'' and ''--dat'' recognize genotype data given in MERLIN (MaCH) format. Commands of fcGENE are inspired by PLINK's command-line syntax, which makes them intuitive for PLINK users. The names of command options also hint at their functions. For example, if genotype SNP data are given in PLINK format, we can use the following commands to prepare inputs for EIGENSOFT and HAPLOVIEW, respectively.
./fcgene --ped plink.ped --map plink.map --oformat eigensoft --out plink_eigensoft
./fcgene --ped plink.ped --map plink.map --oformat haploview --out plink_haploview
Here, the command ''--oformat'' translates the data into the specified formats, i.e. those of EIGENSOFT and HAPLOVIEW, respectively. Details are explained in the next section. Names of the output files generated by fcGENE can be specified with the option ''--out''. More examples of format conversions can be found in the supplementary document. Command options required to upload data into fcGENE are listed in Table S1.

Main functions
The two most important functions of fcGENE are format conversion of raw genotype data and transformation of imputed data into the formats required by different GWA tools. fcGENE can convert genotype data sets of different formats into the format of any of the specified imputation tools. Similarly, the outputs of different imputation software can be converted back either into PLINK format or into the formats of other software. Table 1 summarizes the corresponding important command options implemented in fcGENE.
Imputation outputs are either most likely genotypes of SNPs for each individual or the genotype probability distribution for each SNP and individual. In addition, the outputs are provided either in rows per individual (e.g. MaCH- and MINIMAC-outputs) or in one, two or three columns per individual (e.g. IMPUTE- and BEAGLE-outputs). fcGENE can convert all kinds of imputation outputs. After loading an original or imputed data set into fcGENE, it can be converted into the format of any of the specified programs using the ''--oformat'' command. Further details about the command ''--oformat'' can be found in Table S2. Table S3 describes optional commands. These are used to update phenotype and SNP information, to calculate quality measures of SNPs and individuals, to apply corresponding filters and to split or merge genotype data. For example, MaCH-outputs do not contain SNP annotations such as base pair positions. However, HAPLOVIEW requires base pair positions to determine the distance between two SNPs; these can be loaded into fcGENE using the command ''--snpinfo''. Similarly, IMPUTE-outputs do not contain any kind of pedigree information. Therefore, it is necessary to supply the original phenotype information with the ''--pedinfo'' option before the IMPUTE-outputs are converted into other data formats such as those required for PLINK. Files required to update SNP and pedigree information are automatically generated by fcGENE when data transformation is initiated. Alternatively, one can use the ''--write-snpinfo'' and ''--write-pedinfo'' commands to generate SNP and pedigree information, respectively. Detailed information on the different command options can be found in the supplementary file and in fcGENE's documentation, which is distributed together with the source code through its open-source home page [18].
To analyse genotype-phenotype associations using allele dosages, one can convert the imputation output, for example, into PLINK's dosage file format using the command option ''--oformat plink-dosage''. Moreover, imputed data can also be converted into the formats of other tools such as HAPLOVIEW, EIGENSOFT and SNPTEST. Another output option for imputed data is to create a file containing dosages of minor alleles calculated on the basis of genotype probability distributions. This is achieved by the command ''--oformat recodeA-dose''. The most recent version of BEAGLE (BEAGLE4) requires VCF format [19]. The current version of fcGENE (1.0.7) can export VCF-formatted files using the option ''--oformat vcf'' (see Table S2). A read option for VCF-formatted data will be added in a future version of the software.

Auxiliary functions
For the convenient application of fcGENE, we implemented a number of additional options and features, namely options related to the execution of multiple commands at a time, data management such as merging, splitting and exclusion of SNPs and individuals, quality control of both raw and imputed genotypes, generation of templates for software commands, and updating of phenotype and SNP information. Some of these functions are described below.

Execution of multiple tasks
fcGENE can execute multiple tasks, i.e. it can process two or more tasks with one command. Each new task, except the first, starts with the identifier ''--new-start'' and ends with ''--new-end''. This feature can be used, for example, to merge two or more sets of genotype data. As another example, a single command can read two different PLINK-formatted files and convert the first one into MaCH format and the second one into IMPUTE format.

Strand alignment
Strand alignment between the genotype data set and the reference data set is crucial for GWA analysis and imputation. Generally, reference panels such as HapMap are given with respect to the '+' strand, but study data might be genotyped with respect to the negative strand. If two samples are genotyped on different strands at a SNP, this can easily be recognized except for C/G or A/T SNPs. PLINK has an option to detect opposite strand alignments between cases and controls (''--flip-scan''). fcGENE supports the comparison of strand information between genotyped SNP data and reference panels using PLINK's ''--flip-scan'' feature in the following way: (1) use fcGENE to merge the study genotypes and the corresponding reference panel, (2) use fcGENE to convert the merged data into PLINK format and assign a dummy case (genotyped data) and control (reference) status using the option ''--force'', (3) use PLINK to detect strand mismatches of ambiguous markers by applying the command option ''--flip-scan'' or ''--flip-scan-verbose'' [1] to the merged data, (4) create a list of SNPs whose strand needs to be flipped, (5) use PLINK to flip the strand, and (6) finally use fcGENE to convert the corrected genotypes into the format required by the desired imputation tool.
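Why C/G and A/T SNPs are the difficult case can be seen from a short Python sketch based only on allele labels. It is illustrative and is neither fcGENE nor PLINK code; the function names and status labels are hypothetical. For all other SNPs, a strand flip changes the set of observed alleles, so comparing allele labels is sufficient. For C/G and A/T SNPs the complement yields the same pair, which is why these markers need the LD-based ''--flip-scan'' approach described above.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles: Tuple[str, str]) -> Tuple[str, str]:
    """Return the alleles as read from the opposite strand."""
    return (COMPLEMENT[alleles[0]], COMPLEMENT[alleles[1]])


def is_ambiguous(alleles: Tuple[str, str]) -> bool:
    """A/T and C/G SNPs show the same allele pair on both strands."""
    return set(alleles) == set(flip(alleles))


def strand_status(study: Tuple[str, str], reference: Tuple[str, str]) -> str:
    """Classify a SNP by comparing study alleles with reference alleles."""
    if is_ambiguous(study):
        return "ambiguous"   # cannot be resolved from allele labels alone
    if set(study) == set(reference):
        return "same"
    if set(flip(study)) == set(reference):
        return "flip"        # study genotypes are on the opposite strand
    return "mismatch"        # alleles disagree even after flipping


status = strand_status(("T", "C"), ("A", "G"))
```

In this example the study alleles T/C become A/G after flipping, so the SNP would be placed on the list of markers to flip in step (4).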

Pre-imputation quality control
It is common to implement a series of quality control (QC) steps at the SNP-wise and sample-wise level before and after genotype imputation so that different confounding factors, which might affect imputation quality, can be ruled out. Typical SNP-wise QC measures comprise the p-value of the HWE test, MAF and genotyping call rate [20]. Similarly, the sample-wise genotyping call rate is often applied as a filter criterion. In the command line of fcGENE, one can specify thresholds for these SNP-wise and sample-wise quality measures so that SNPs or individuals violating these criteria are automatically discarded. When filtering SNPs and samples, fcGENE by default first calculates all quality measures on the complete uploaded data, without excluding any SNPs or samples, and then filters SNPs and samples according to the specified thresholds. SNP-wise and individual-wise thresholds can be assigned by the command options ''--filter-snp'' and ''--filter-indiv'', respectively.
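The order of operations described above (compute all measures on the full data, then filter) can be sketched as follows. The Python code is a simplified illustration and not the fcGENE implementation: the function names and example data are hypothetical, genotypes are assumed to be coded as allele counts 0, 1 or 2 with None for missing calls, and the HWE test is omitted for brevity.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1 or 2 copies of the counted allele; None = missing


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls (list must be non-empty)."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rarer allele among the called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def filter_snps(data: Dict[str, List[Genotype]],
                min_call_rate: float, min_maf: float) -> List[str]:
    """Return the SNPs that pass both thresholds."""
    # Step 1: compute every quality measure on the complete data set.
    stats = {snp: (call_rate(g), minor_allele_frequency(g))
             for snp, g in data.items()}
    # Step 2: apply the thresholds to the precomputed measures.
    return [snp for snp, (cr, maf) in stats.items()
            if cr >= min_call_rate and maf >= min_maf]


# Hypothetical data: three SNPs genotyped in four individuals.
example = {
    "rs1": [0, 1, 2, 1],
    "rs2": [0, 0, 0, None],
    "rs3": [0, 0, 0, 0],
}
kept = filter_snps(example, min_call_rate=0.95, min_maf=0.10)
```

Because the measures are computed before anything is removed, the result does not depend on the order in which the individual filters are applied.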
Generating templates of software commands
fcGENE not only creates the files required by the specified software, but also generates the scripts required to run the software. For example, if we use IMPUTE, we have to provide lower and upper limits of the base pair positions to be imputed. Moreover, if we analyze a whole chromosome with IMPUTE, we may have to split it into smaller chunks to parallelize computations. Since it is cumbersome to specify the upper and lower limits of the chunks manually, fcGENE generates a list of upper and lower limits of base pair positions for each chunk, a Perl script that can execute all commands of the chunking process, and a list of commands used to impute the chunks.
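How such a list of chunk limits might be derived is sketched below in Python. This is an illustration under stated assumptions, not the algorithm used by fcGENE: the function name, the chunk size and the overlap are hypothetical, and the sketch simply lets consecutive chunks share a fixed number of base pairs.

```python
from typing import List, Tuple


def chunk_limits(start: int, end: int,
                 chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Split the interval [start, end] into overlapping chunks.

    Each tuple holds the lower and upper base pair limit of one chunk;
    consecutive chunks share `overlap` base pairs.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    limits = []
    lower = start
    while True:
        upper = min(lower + chunk_size - 1, end)
        limits.append((lower, upper))
        if upper == end:
            break
        # The next chunk starts `overlap` base pairs before this one ends.
        lower = upper - overlap + 1
    return limits


# Hypothetical 10 Mb region, 5 Mb chunks, 0.5 Mb overlap.
chunks = chunk_limits(1, 10_000_000, chunk_size=5_000_000, overlap=500_000)
```

Each pair of limits could then be substituted into one imputation command, and the imputed chunks merged afterwards by keeping each overlapping SNP only once.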
To start the imputation process, we may require a set of imputation reference panels. Imputation commands only work properly if one provides filenames of the reference panel with correct file identifiers. To support this process, fcGENE generates appropriate templates which can be edited with respect to filenames and folders of the reference panel.
While converting files into EIGENSOFT format, fcGENE also generates the parameter files and command templates necessary for running SMARTPCA and SMARTEIGENSTRAT. More precisely, the extra files generated by fcGENE are: a parameter file for SMARTPCA, a parameter file for SMARTEIGENSTRAT, an R script for generating PCA plots and for modifying outputs of SMARTPCA to be compatible with the format of SMARTEIGENSTRAT, and finally, a Linux script (as an alternative to using R) to run SMARTPCA, SMARTEIGENSTRAT, EIGENPLOT and TWSTATS. More information on the parameter files and the different packages of EIGENSOFT can be found on its official website [21].
Post-imputation quality control
fcGENE can filter poorly imputed SNPs on the basis of their imputation quality and allele frequency. Imputation quality metrics differ between imputation tools. For example, MaCH uses the quality parameter ''Rsq'' to assess imputation quality, while IMPUTE calculates so-called ''quality'' and ''info'' scores (see the corresponding publications for the definitions of these QC parameters). The choice of the corresponding filter options is rather intuitive. For example, filtering SNPs with values of the MaCH imputation quality score (i.e. Rsq score) lower than 0.3 is achieved by ''--rsq 0.3''. Similarly, filtering SNPs with allele frequency <1% is achieved by the option ''--maf-thresh 0.01''. An overview of possible options regarding post-imputation quality control can be found in Table S3. Detailed information on the filtering process is given in the manual of fcGENE [18].

Updates of SNP and sample information
To create a list of SNPs and individual ids from the uploaded genotype data, fcGENE provides the command options ''--write-snplist'' and ''--write-pedlist'', respectively. Not all software needs the complete SNP and sample information contained in genotype data. Therefore, it may be necessary to drop or add SNP and sample information before converting genotype data between formats. Moreover, it may be necessary to define the coding and non-coding alleles before reading and converting raw genotype data and probability distributions of imputed genotypes. fcGENE provides the command options ''--snpinfo'' and ''--pedinfo'' to add or update, respectively, SNP information such as rsid, allele information and base pair position, and pedigree information such as pedigree ids, phenotype status and sex.
Identifying individuals in pedigree data format requires family, parental and individual ids. However, some software like SHAPEIT or EIGENSOFT accept only one ID for each individual. In such a case, fcGENE can create hybrid ids by combining family-ids, individual-ids, paternal ids or maternal ids using the command option ''--iid''. The process for creating such hybrid sample ids is explained in Table S4 and in fcGENE's documentation file in more detail.
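The idea of a hybrid id can be illustrated with a few lines of Python. The sketch does not reproduce the syntax of the ''--iid'' option, which is documented in Table S4; the function names, the field names and the separator are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def hybrid_id(record: Dict[str, str],
              fields: Sequence[str] = ("fid", "iid"),
              sep: str = "_") -> str:
    """Combine the selected pedigree fields into a single sample id."""
    return sep.join(record[field] for field in fields)


def ids_are_unique(ids: List[str]) -> bool:
    """Check that no two samples received the same hybrid id."""
    return len(set(ids)) == len(ids)


# Hypothetical pedigree records: the individual id alone is not unique.
pedigree = [
    {"fid": "FAM1", "iid": "1", "pid": "0", "mid": "0"},
    {"fid": "FAM2", "iid": "1", "pid": "0", "mid": "0"},
]
individual_only = [hybrid_id(r, fields=("iid",)) for r in pedigree]
combined = [hybrid_id(r) for r in pedigree]
```

Here the individual ids alone collide across families, whereas the combination of family id and individual id identifies each sample unambiguously, which is what single-id software requires.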
When converting data for SNPTEST, covariate information may be required. The option ''--covar'' reads a PLINK-formatted covariate file and updates the necessary information prior to data transformation.

Adding group labels in preparation for PCA analysis
To plot the result of PCA, software SMARTPCA expects an assignment of each sample to its population/ethnic group. To add this group labelling, fcGENE accepts an extra file (containing sample ids in the first column and group/population-labelling of the samples in the second column) using command option ''--group-label''.
Coding genotypes as count of a given allele
fcGENE can code genotypes as counts of a given reference allele. This type of coding is convenient for a variety of statistical analyses such as regression models [14]. PLINK supports the generation of files containing counts of minor alleles. However, it only supports this transformation for raw genotype calls, resulting in the values 0, 1 and 2 for the minor allele dose. Complementary to this option, fcGENE facilitates the transformation of genotype probability distributions into PLINK's recodeA-formatted files, but using expected allele doses of a given reference allele. More precisely, the files generated by fcGENE can contain not only 0, 1 and 2 as reference allele counts but also expected allele doses of the reference allele, i.e. numbers between 0 and 2. The default reference allele is the minor allele. However, users can force fcGENE to take the first, the second or the major allele as reference. GWA analysis with this type of coding is useful especially if the uncertainty of the imputation results is high [14].
To facilitate analyses of imputed genotype data with the statistical package R, fcGENE can convert genotype data sets from different formats into standard text files with genotypes coded either as counts of a reference allele (0, 1 or 2) or as the expected dose of the reference allele. The text files contain rows for samples and columns for SNPs, with a simple header of SNP identifiers (e.g. rs-IDs) and a first column identifying subjects (i.e. sample ids). In order to write SNPs as rows and individuals as columns, the additional command option ''--transpose'' can be used. The command options ''--oformat r'' and ''--oformat r-dose'' are used to write the allele counts and the expected doses of the reference alleles, respectively. Again, the default reference allele is the minor allele. However, one can alter the reference allele using commands like ''--force ref-allele=major''. Re-import of these types of allele-count data is also possible (command option ''--rgeno''). More information on these formats can be found in the supplementary file.

An example workflow
The following example workflow demonstrates how fcGENE can be applied at different stages of GWA analysis, and how it interacts with PLINK. We assume that the original genotype data are in PLINK format and saved as the files ''example.ped'' and ''example.map''. If we plan to create a PCA plot after quality control, we can use the following command to convert the data into EIGENSOFT format.
./fcgene --ped example.ped --filter-snp hwe=1e-6,crate=0.95,maf=0.10 \
    --map example.map --oformat eigensoft --out example_eigensoft
This command filters out SNPs with low quality and low minor allele frequencies, saves the converted files and writes the scripts necessary to run EIGENSOFT at different stages. In addition, the command also creates R scripts to plot PCA results on the basis of the outputs of SMARTPCA.
In the next step, we aim to impute the PLINK-formatted data with IMPUTE using HapMap reference panel and to analyse the imputation output with SNPTEST after performing post-imputation quality control. We start this work by comparing the strand similarities between study genotypes and IMPUTE-formatted reference panels as explained in section ''strand alignment''. A typical command line for this process is given below.

Discussion
Different analysis tools with their own specific input and output formats are in use in modern GWA analysis. Motivated by the recurrent format conversions required during GWA analysis, we developed the open-source format converting tool fcGENE. This software automates the transformation of genotype data among the most common formats required by different GWA tools, with emphasis on imputation software. We also provide a number of helpful features facilitating the data management of comprehensive data analysis pipelines, such as quality control on the basis of common measures of genotype and sample quality, splitting and merging of genotype data sets, exclusion of SNPs and individuals, updating of SNP annotation and sample information, and generation of the command templates necessary to execute specific tools. Rather than constructing another self-contained software for statistical analyses, we developed fcGENE in order to make the use of existing GWA packages easier. Through this, we simplify and automate the process of imputation-based GWA studies. The tool has been developed in C++, which allows large data sets to be handled quickly. The syntax of fcGENE is similar to that of PLINK. Therefore, PLINK users may find fcGENE easy and intuitive to apply. fcGENE has gained many regular users worldwide, and positive feedback has encouraged us to further improve and develop the software.
The current version of the software is able to perform data format conversion between the analysis packages EIGENSOFT, HAPLOVIEW, PLINK, R (GenABEL package), SNPTEST and the imputation packages BEAGLE, BIMBAM, IMPUTE, MaCH, PHASE, fastPHASE and PLINK. Conversions involving the phasing software SHAPEIT [22] can also be addressed since it accepts PLINK- or IMPUTE-formatted data as input and outputs IMPUTE-formatted data. Moreover, fcGENE can convert SHAPEIT-formatted phased data (*.haps and *.sample files) into other formats.
There are only a few publicly available tools for the purpose of data transformation. GTOOL [23] is one such program and is provided by the IMPUTE developers. This program solely supports transformations between PLINK-formatted ped/map files and IMPUTE-formatted gen/sample files. Similarly, the developers of MaCH provide some templates of Perl scripts that can deal with the transformation of MaCH-imputed data. However, as mentioned previously, such template scripts are difficult to edit without sufficient programming knowledge. GenGen is another Perl program [24] for genotype format conversions. However, this program supports only conversions of MaCH-imputed data into PLINK-formatted ped/map files and into the file formats required by SNPTEST. Aside from these tools, one may find some private R or Perl scripts for selected software-specific format conversions at different websites. Hence, to our knowledge, there is no software with a comprehensiveness comparable to that of fcGENE.

Future plans for software extensions
The current version of fcGENE allows export of genotype data into the variant call format (VCF). As a next step, an option to import this data format will be added. We plan to extend fcGENE into a format converter for family data and to add a graphical user interface (GUI) for Windows users. We are also committed to updating fcGENE if necessary, especially in case of changes in the input and output formats of the addressed software packages.

Supporting Information
Table S1 Commands to read SNP data of different formats. Table S1 summarizes command options necessary to upload genotype data of different formats into fcGENE. In the table, we used the name ''example'' as file name combined with different extensions specific for different data formats. (DOCX)