GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations

Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.


Running the testing suite
GEMINI comes with a full test suite to make sure that everything has installed correctly on your system. We strongly encourage you to run these tests.

Functional annotation tools
GEMINI depends upon external tools to predict the functional consequence of variants in a VCF file. We currently support annotations produced by both SnpEff and VEP. Recommended instructions for annotating existing VCF files with these tools are available here. In addition, we have attempted to standardize the terms used to describe the functional consequence of a given variant, as each annotation tool uses different vocabulary.
The variant consequence columns in the variant table are populated either by snpEff or VEP as defined by the user using the -t option while running pop load (To populate these columns the input VCF file should have been annotated either by snpEff or VEP):

Quick start
gemini is designed to allow researchers to explore genetic variation contained in a VCF file. The basic workflow for working with gemini is outlined below.

Importing VCF files into gemini.
Assuming you have a valid VCF file produced by standard variation discovery programs (e.g., GATK, FreeBayes, etc.), one loads the VCF into the gemini framework with the load submodule: $ gemini load -v my.vcf my.db In this step, gemini reads and loads the my.vcf file into a SQLite database named my.db, whose structure is described here. While loading the database, gemini computes many additional population genetics statistics that support downstream analyses. It also stores the genotypes for each sample at each variant in an efficient data structure that minimizes the database size.
Loading is by far the slowest aspect of gemini. Using multiple CPUs can greatly speed up this process.

Querying the gemini database.
If you are familiar with SQL, gemini allows you to directly query the database in search of interesting variants via the -q option. For example, here is a query to identify all novel, loss-of-function variants in your database: $ gemini query -q "select * from variants where is_lof = 1 and in_dbsnp = 0" my.db Or, we can ask for all variants that substantially deviate from Hardy-Weinberg equilibrium: $ gemini query -q "select * from variants where hwe < 0.01" my.db Bio). This script is useful for those who do not have all the modules in their system required by VEP, specifically DBI and DBI::mysql. Use this link for alternate options of the installer script.

Stepwise installation and usage of VEP
Users (e.g mac users) who have a problem installing through this script should go for a manual installation of the latest Ensembl API (68) and bioperl-1.2.3 and follow all other installation instructions here.
The appropriate pre-build caches should be downloaded to the .vep directory under home from this link.
To use the cache, the gzip and zcat utilities are required. VEP uses zcat to decompress cached files. For systems where zcat may not be installed or may not work, the following option needs to be added along with the --cache option: --compress "gunzip -c" You may run the script as: $ perl variant_effect_predictor.pl [OPTIONS] We recommend running VEP with the following options as currently we support VEP fields specified as below: Note: Choose the annotator as per your requirement! Some gene/transcript annotations are available with only one tool (e.g. Polyphen/Sift with VEP and amino_acid length/biotype with SnpEff). As such these values would be set to None, if an alternate annotator is used during the load step.
Instructions for installing and running these tools can be found in the following section: Annotation with snpEff or VEP

The basics
Before we can use GEMINI to explore genetic variation, we must first load our VCF file into the GEMINI database framework. We expect you to have first annotated the functional consequence of each variant in your VCF using either VEP or snpEff (Note that v3.0+ of snpEff is required to track the amino acid length of each impacted transcript). Logically,the loading step is done with the gemini load command. Below are two examples based on a VCF file that we creatively name my.vcf. The first example assumes that the VCF has been pre-annotated with VEP and the second assumes snpEff. As each variant is loaded into the GEMINI database framework, it is being compared against several annotation files that come installed with the software. We have developed an annotation framework that leverages tabix, bedtools, and pybedtools to make things easy and fairly performant. The idea is that, by augmenting VCF files with many 10 Chapter 2. Table of contents gemini Documentation, Release 0.3.0b informative annotations, and converting the information into a sqlite database framework, GEMINI provides a flexible database-driven API for data exploration, visualization, population genomics and medical genomics. We feel that this ability to integrate variation with the growing wealth of genome annotations is the most compelling aspect of GEMINI. Combining this with the ability to explore data with SQL using a database design that can scale to 1000s of individuals (genotypes too!) makes for a nice, standardized data exploration system.

Using multiple CPUS for loading
Now, the loading step is very computationally intensive and thus can be very slow with just a single core. However, if you have more CPUs in your arsenal, you specify more cores. This provides a roughly linear increase in speed as a function of the number of cores. On our local machine, we are able to load a VCF file derived from the exomes of 60 samples in about 10 minutes. With a single core, it takes a few hours.
Note: Using multiple cores requires that you have both the bgzip tool from tabix and the grabix tool installed in your PATH.

Using LSF, SGE and Torque clusters
Thanks to some great work from Brad Chapman and Rory Kirchner, one can also load VCF files into GEMINI in parallel using many cores on LSF, SGE or Torque clusters. One must simply specify the type of job scheduler your cluster uses and the queue name to which your jobs should be submitted.
For example, let's assume you use LSF and a queue named preempt_everyone. Here is all you need to do: If you use SGE, it would look like: If you use Torque, it would look like: (you guessed it):

Describing samples with a PED file
GEMINI also accepts PED files in order to establish the familial relationships and phenotypic information of the samples in the VCF file.

Load GERP base pair conservation scores
By default, GERP scores at base pair resolution are not computed owing to the roughly 2X increasing in loading time. However, one can optionally ask GEMINI to compute these scores by using the --load-gerp-bp option.

Loading VCFs without genotypes.
To do.

Querying the GEMINI database
The real power in the GEMINI framework lies in the fact that all of your genetic variants have been stored in a convenient database in the context of a wealth of genome annotations that facilitate variant interpretation. The expressive power of SQL allows one to pose intricate questions of one's variation data.
Note: If you are unfamiliar with SQL, sqlzoo has a decent online tutorial describing the basics. Really all you need to learn is the SELECT statement, and the examples below will give you a flavor of how to compose base SQL queries against the GEMINI framework.

Basic queries
GEMINI has a specific tool for querying a gemini database that has been load''ed using the ''gemini load command. That's right, the tool is called gemini query. Below are a few basic queries that give you a sense of how to interact with the gemini database using the query tool.

Extract the nucleotide diversity for each variant:
$ gemini query -q "select chrom, start, end, pi from variants" my.db 4. Combine GEMINI with bedtools to compute nucleotide diversity estimates across 100kb windows: $ gemini query -q "select chrom, start, end, pi from variants \ order by chrom, start, end" my.db | \ bedtools map -a hg19.windows.bed -b --c 4 -o mean

12
Chapter 2. Table of contents gemini Documentation, Release 0.3.0b

Selecting sample genotypes
The above examples illustrate ad hoc queries that do not request or filter upon the genotypes of individual samples.
Since GEMINI stores the genotype information for each variant in compressed arrays that are stored as BLOBs in the database, standard SQL queries cannot directly access individual genotypes. However, we have enhanced the SQL syntax to support such queries with C "struct-like" access. For example, to retrieve the alleles for a given sample's (in this case, sample 1094PC0009), one would add gts.1094PC0009 to the select statement.
Here is an example of selecting the genotype alleles for four different samples ( You can also add a header so that you can keep track of who's who: Let's now get the genotype and the depth of aligned sequence observed for a sample so that we can assess the confidence in the genotype:  G/G  7  chr1  30866  30869  CCT  C  FAM138A CCT/CCT 8  chr1  30894  30895  T  C  FAM138A T/C  8  chr1 30922

Filtering on genotypes
Now, we often want to focus only on variants where a given sample has a specific genotype (e.g., looking for homozygous variants in family trios). Unfortunately, we cannot directly do this in the SQL query, but the gemini query tool has an option called -gt-filter that allows one to specify filters to apply to the returned rows. The rules followed in the -gt-filter option follow Python syntax.
Tip: As you will see from the examples below, appropriate use of the -gt-filter option will allow you to compose queries that return variants meeting inheritance patterns that are relevant to the disease model of interest in your study.
As an example, let's only return rows where sample 1094PC0012 is heterozygous. In order to do this, we apply a filter to the gt_types columns for this individual: variant_samples is a list of all of the samples with a variant, HET_samples is the subset of those heterozygous for the variant and HOM_ALT_samples is the subset homozygous for the variant.

comp_hets: Identifying potential compound heterozygotes
Many recessive disorders are caused by compound heterozygotes. Unlike canonical recessive sites where the same recessive allele is inherited from both parents at the _same_ site in the gene, compound heterozygotes occur when the individual's phenotype is caused by two heterozygous recessive alleles at _different_ sites in a particular gene.
So basically, we are looking for two (typically loss-of-function (LoF)) heterozygous variants impacting the same gene at different loci. The complicating factor is that this is _recessive_ and as such, we must also require that the consequential alleles at each heterozygous site were inherited on different chromosomes (one from each parent). As such, in order to use this tool, we require that all variants are phased. Once this has been done, the comp_hets tool will provide a report of candidate compound heterozygotes for each sample/gene. Assuming you have defined the familial relationships between samples when loading your VCF into GEMINI, one can leverage a built-in tool for identifying de novo (a.k.a spontaneous) mutations that arise in offspring.

default behavior
By default, the de novo tool will report, for each family in the database, a list of mutations that are not found in the parents yet are observed as heterozygotes in the offspring.

-d
Unfortunately, inherited variants can often appear to be de novo mutations simply because insufficient sequence coverage was available for one of the parents to detect that the parent(s) is also a heterozygote (and thus the variant was actually inherited, not spontaneous). One simple way to filter such artifacts is to enforce a minimum sequence depth for each sample. For example, if we require that at least 50 sequence alignments were present for mom, dad and child, two of the above variants will be eliminated as candidates:

interactions: Find genes among variants that are interacting partners.
Integrating the knowledge of the known protein-protein interactions would be useful in explaining variation data. Meaning to say that a damaging variant in an interacting partner of a potential protein may be equally interesting as the protein itself. We have used the HPRD binary interaction data to build a p-p network graph which can be explored by Gemini. Meaning to say return all LoF gene TGM2 (in sample M128215) interacting partners to a 3rd order of interaction.

--var
An extended variant information (chrom, start, end etc.) for the interacting gene may be achieved with the -var option for both the interactions and the lof_interactions

annotate: adding your own custom annotations
It is inevitable that researchers will want to enhance the gemini framework with their own, custom annotations. gemini provides a sub-command called annotate for exactly this purpose. As long as you provide a tabix'ed annotation file in either BED or VCF format, the annotate tool will, for each variant in the variants table, screen for overlaps in your annotation file and update a new column in the variants table that you may specify on the command line. This is best illustrated by example.
Let's assume you have already created a gemini database of a VCF file using the load module.

-t boolean Did a variant overlap a region or not?
Now, you can use this TABIX'ed file to annotate which variants overlap your crucial regions. In the example below, the results will be stored in a new column called "crucial". The -t boolean option says that you just want to track whether (1) or not (0) the variant overlapped one or more of your regions.
$ gemini annotate -f crucial.bed.gz -c crucial -t boolean my.db Since a new columns has been created in the database, we can now directly query the new column. In the example results below, the first and third variants overlapped a crucial region while the second did not.

-t count How many regions did a variant overlap?
Instead of a simple yes or no, we can use the -t count option to count how many crucial regions a variant overlapped. It turns out that the 3rd variant actually overlapped two crucial regions. When we use -t list, the resulting column can store a comma-separated list of the region names (column 4). You can choose whatever column you want to store in the database, but in this example, we will use the 4th column (the name). We specify which column to store in the list with the -e option.

region: Extracting variants from specific regions or genes
One often is concerned with variants found solely in a particular gene or genomic region. gemini allows one to extract variants that fall within specific genomic coordinates as follows: --reg $ gemini region --reg chr1:100-200 my.db

--gene
Or, one can extract variants based on a specific gene name.
gemini includes a convenient tool for computing variation metrics across genomic windows (both fixed and sliding).
Here are a few examples to whet your appetite. If you're still hungry, contact us.
Compute the average nucleotide diversity for all variants found in non-overlapping, 50Kb windows.
$ gemini windower -w 50000 -s 0 -t nucl_div -o mean my.db Compute the average nucleotide diversity for all variants found in 50Kb windows that overlap by 10kb.

--tstv-coding
Compute the transition/transversion ratios for the snps in the coding regions.

--tstv-noncoding
Compute the transition/transversion ratios for the snps in the non-coding regions.
Compute the type and count of the snps.

--summarize
If none of these tools are exactly what you want, you can summarize the variants per sample of an arbitrary query using the -summarize flag. For example, if you wanted to know, for each sample, how many variants are on chromosome 1 that are also in dbSNP: $ gemini stats --summarize "select * from variants where in_dbsnp=1 and chrom='chr1'" my.db sample total num_het num_hom_alt M10475 1 1 0 M128215 1 1 0 M10478 2 2 0 M10500 2 1 1

db_info: List the gemini database tables and columns
Because of the sheer number of annotations that are stored in gemini, there are admittedly too many columns to remember by rote. If you can recall the name of particular column, just use the db_info tool. It will report all of the tables and all of the columns / types in each

The GEMINI browser interface
Currently, the majority of GEMINI's functionality is available via a command-line interface. However, we are developing a browser-based interface for easier exploration of GEMINI databases created with the gemini load command.
Ironically, as of now, one must launch said browser from the command line as follows (where my.db should be replaced with the name of the GEMINI database you would like to explore).

Genotype information
gts BLOB A compressed binary vector of sample genotypes (e.g., "A/A", "A|G", "G/G") gt_types BLOB A compressed binary vector of numeric genotype "types" (e.g., 0, 1, 2) gt_phases BLOB A compressed binary vector of sample genotype phases (e.g., False, True, False) gt_depths BLOB A compressed binary vector of the depth of aligned sequence observed for each sample