SNPpy - Database Management for SNP Data from Genome Wide Association Studies

doi:10.1371/journal.pone.0024982

Figure 1.

Workflow Chart.

This figure shows the data workflow. First, the genotypic and phenotypic data are loaded into the database. The data are then exported from the database as standard-format files, optionally after a filtering and/or merging step. Finally, the output files are analyzed further using third-party tools.

Figure 2.

Database Schema.

Geno Single database schema for the Affymetrix platform. In this diagram, the rectangles correspond to database tables, and the rows in each rectangle correspond to database table columns. The four columns in a row correspond to, from left to right, the column name (column 1), data type (column 2), primary key indicator (column 3), and foreign key indicator (column 4). The arrows correspond to foreign keys. Note that the number of arrows leaving a table equals the number of columns in that table that are foreign keys.
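
As a rough illustration of the foreign-key reading of the diagram, the sketch below builds a tiny three-table schema and lists the foreign-key columns of each table. Everything in it is an assumption made for illustration: the table and column names (`patient`, `snp`, `geno`) are hypothetical and are not taken from the actual SNPpy schema, and SQLite from the Python standard library is used only as a convenient stand-in for the database.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables; NOT the actual SNPpy schema.
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE snp (id INTEGER PRIMARY KEY, rsid TEXT);
CREATE TABLE geno (
    patient_id INTEGER REFERENCES patient(id),
    snp_id INTEGER REFERENCES snp(id),
    genotype TEXT,
    PRIMARY KEY (patient_id, snp_id)
);
""")


def foreign_key_columns(conn, table):
    """Return the sorted names of the columns in `table` that are foreign keys."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Each row is (id, seq, referenced_table, from_column, to_column, ...).
    return sorted(row[3] for row in rows)


for table in ("patient", "snp", "geno"):
    print(table, foreign_key_columns(conn, table))
```

In a diagram of this toy schema, `geno` would have two outgoing arrows (one per foreign-key column), while `patient` and `snp` would have none.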

Figure 3.

Database Layout.

Datasets for different platforms are stored in separate databases, represented here by cylinders. Every dataset is stored in a separate database schema (a namespace within a database). The same dataset can be stored in multiple schemas that differ in the options selected when the dataset was loaded. To illustrate this, the figure shows the schemas in red and the datasets in black. Each of the datasets HapMap 6 and CEU HapMap 610 is stored in two schemas. For further details, see the manual.

Figure 4.

Dataset load timings.

Timings for loading simulated datasets for the Illumina platform into the database, for the Geno Single layout, and the Geno Shard layout with degree of parallelism and . For all these datasets, the number of SNPs is 620,901.

Figure 5.

PED file write timings.

Timings for writing PED files from simulated datasets for the Illumina platform, for the Geno Single layout with degrees of parallelism and Geno Shard layout with degree of parallelism and . For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.

Figure 6.

PED file merged write timings.

Timings for writing the PED file for the merger of the 2,000-patient simulated Illumina dataset with the corresponding HapMap datasets, compared with the timings for writing the PED file for the 2,000-patient simulated dataset alone and for the HapMap dataset alone. All these timings are for the Geno Shard layout. For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.

Table 1.

Software timing comparisons.
