All of gene expression (AOE): An integrated index for public gene expression databases

Gene expression data have been archived as microarray and RNA-seq datasets in two public databases, Gene Expression Omnibus (GEO) and ArrayExpress (AE). In 2018, the DNA DataBank of Japan started a similar repository called the Genomic Expression Archive (GEA). These databases are useful resources for the functional interpretation of genes, but they are maintained separately and may lack RNA-seq data whose original sequence data are available in the Sequence Read Archive (SRA). We constructed an index for those gene expression data repositories, called All Of gene Expression (AOE), to integrate publicly available gene expression data. AOE provides a graphical web interface for querying data, in addition to an application programming interface. By collecting gene expression data from RNA-seq in the SRA, AOE also includes data not included in GEO and AE. AOE is accessible as a search tool from the GEA website and is freely available at https://aoe.dbcls.jp/.


Introduction
After the invention of the microarray, it became possible to measure the abundance of all transcripts at the genomic scale, a set now called the transcriptome. Following the development of the Minimum Information About a Microarray Experiment (MIAME) standard [1], gene expression data from such experiments have been archived in a MIAME-compliant manner in two public repositories: the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) [2] and the EBI ArrayExpress (AE; https://www.ebi.ac.uk/arrayexpress/) [3].
Unlike the International Nucleotide Sequence Database [4], these two gene expression databases do not exchange data with each other. AE once imported data from GEO, but stopped doing so in 2017 (https://www.ebi.ac.uk/arrayexpress/help/GEO_data.html). GEO data already archived in AE remain available from AE, but new data archived in GEO are no longer available from AE. Because the databases are maintained separately, users need to search both to obtain comprehensive public gene expression data of interest. Furthermore, the DNA DataBank of Japan (DDBJ) recently started a similar repository called the Genomic Expression Archive (GEA; https://www.ddbj.nig.ac.jp/gea/) [5]. Integration of these public gene expression databases is therefore required.
In addition, these databases may lack sequencing-based transcriptome data (RNA-seq) whose original sequence data are accessible in the Sequence Read Archive (SRA) [6], the nucleotide sequence repository for high-throughput sequencing platforms. This is because data deposition to GEO or AE is not mandatory when the original sequencing data are deposited to the SRA.
We therefore developed an index of public gene expression databases, called All Of gene Expression (AOE). The aim of AOE is to integrate gene expression data and make them all searchable together. We have maintained AOE for five years, and it has been useful for functional genomics research. Here, we report a detailed description and utility of AOE. AOE is freely accessible from https://aoe.dbcls.jp/.

Status of gene expression databases
Gene expression data in NCBI Gene Expression Omnibus (GEO) used to be continuously imported into EBI ArrayExpress (AE), and thus we were theoretically able to get all data deposited to GEO from AE. Therefore, All Of gene Expression (AOE) was originally indexed for AE only.
Unfortunately, AE discontinued GEO data import in 2017. At that point, we compared data-series entries in the two databases by matching GEO series IDs: IDs starting with GSE in GEO correspond to IDs starting with E-GEOD in AE; for example, GSE52334 in GEO corresponds to E-GEOD-52334 in AE. The comparison showed that over thirty thousand entries were missing in AE (Fig 1).
Furthermore, even GEO did not cover all public transcriptome data, as over ten thousand entries in AE were missing in GEO. Thus, we decided to include those missing entries in AOE. In other words, we started indexing GEO data and other public transcriptome data, including the DDBJ Genomic Expression Archive (GEA), to allow all public gene expression data to be searched.
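The ID matching described above can be sketched as follows. This is a minimal illustration, not the authors' implementation (their parsers are written in Perl5 and shell): the function names are hypothetical, and it assumes that any AE accession without the E-GEOD prefix has no GEO counterpart. Accessions other than GSE52334 and E-GEOD-52334 are placeholders.

```python
import re

# GEO series IDs look like GSE52334; GEO-derived AE accessions like E-GEOD-52334.
GSE_PATTERN = re.compile(r"^GSE(\d+)$")
EGEOD_PATTERN = re.compile(r"^E-GEOD-(\d+)$")


def gse_to_egeod(gse_id: str) -> str:
    """Convert a GEO series ID to the corresponding AE accession."""
    match = GSE_PATTERN.match(gse_id)
    if match is None:
        raise ValueError(f"not a GEO series ID: {gse_id!r}")
    return f"E-GEOD-{match.group(1)}"


def egeod_to_gse(ae_id: str) -> str:
    """Convert a GEO-derived AE accession back to the GEO series ID."""
    match = EGEOD_PATTERN.match(ae_id)
    if match is None:
        raise ValueError(f"not a GEO-derived AE accession: {ae_id!r}")
    return f"GSE{match.group(1)}"


def compare_series(geo_ids, ae_ids):
    """Return (missing_in_ae, missing_in_geo) for two lists of accessions.

    Assumption: AE accessions without the E-GEOD prefix (AE-native
    submissions) are treated as having no GEO counterpart.
    """
    geo, ae = set(geo_ids), set(ae_ids)
    missing_in_ae = {g for g in geo if gse_to_egeod(g) not in ae}
    missing_in_geo = {
        a for a in ae
        if not (EGEOD_PATTERN.match(a) and egeod_to_gse(a) in geo)
    }
    return missing_in_ae, missing_in_geo


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    geo_list = ["GSE52334", "GSE100001"]
    ae_list = ["E-GEOD-52334", "E-MTAB-1234"]
    print(compare_series(geo_list, ae_list))
```

Using sets keeps the comparison linear in the number of accessions, which matters at the scale of tens of thousands of entries reported in Fig 1.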

An index of gene expression data series from metadata
AOE was originally developed to give a graphical web interface to search EBI AE, which is one of the public gene expression databases described above. We call this dataset from AE only 'AOE level 1' (Fig 2). Data at this level basically contain only IDs for AE, and the entries imported from GEO contain IDs for both BioProject and GEO.
After the import of GEO data to AE was discontinued, AOE began importing GEO data directly, using the DBCLS SRA application programming interface (API) [7]. GEO entries already present in AE were subtracted, and only the new entries were added to AOE. We call the merged dataset that includes GEO data 'AOE level 2' (Fig 2). Data at this level contain IDs for BioProject and GEO, not for AE.
Some gene expression data were still missing: they were included in neither AE nor GEO, but were registered as transcriptome sequencing data in SRA. The final merged dataset, which adds these data, is called 'AOE level 3' and represents a real public gene expression dataset (Fig 2). Data at this level contain BioProject IDs only. As AOE was designed to index public gene expression data, data are indexed for search at the level of the experimental series. Individual hybridization data for microarray and run data for RNA-seq are directly linked to the original databases. All code to parse public databases and construct a web service is accessible from the DBCLS AOE GitHub repository (https://github.com/dbcls/AOE/). It is free and open source software, and can be installed anywhere.
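The three-level construction can be illustrated with a short sketch. The record type, field names and function below are hypothetical and the accessions are placeholders; the sketch only assumes what the text states, namely that level 2 is GEO minus what AE already holds and level 3 is SRA transcriptome data minus what levels 1 and 2 already hold.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Series:
    """One experimental series; any ID may be absent (hypothetical record)."""
    ae_id: Optional[str] = None
    geo_id: Optional[str] = None
    bioproject_id: Optional[str] = None
    level: int = 0  # 0 means "not yet assigned to an AOE level"


def build_index(ae: Iterable[Series],
                geo: Iterable[Series],
                sra: Iterable[Series]) -> List[Series]:
    """Merge three sources into one index, tagging each entry with its level."""
    index: List[Series] = []
    seen_geo = set()
    seen_bioproject = set()

    def add(entry: Series, level: int) -> None:
        index.append(replace(entry, level=level))
        if entry.geo_id:
            seen_geo.add(entry.geo_id)
        if entry.bioproject_id:
            seen_bioproject.add(entry.bioproject_id)

    # Level 1: every AE entry (some carry GEO and BioProject IDs as well).
    for entry in ae:
        add(entry, 1)
    # Level 2: GEO entries that AE does not already hold.
    for entry in geo:
        if entry.geo_id not in seen_geo:
            add(entry, 2)
    # Level 3: SRA transcriptome entries whose BioProject is still unseen.
    for entry in sra:
        if entry.bioproject_id not in seen_bioproject:
            add(entry, 3)
    return index


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    ae = [Series(ae_id="E-GEOD-52334", geo_id="GSE52334",
                 bioproject_id="PRJNA000001")]
    geo = [Series(geo_id="GSE52334", bioproject_id="PRJNA000001"),
           Series(geo_id="GSE999999", bioproject_id="PRJNA000002")]
    sra = [Series(bioproject_id="PRJNA000002"),
           Series(bioproject_id="PRJNA000003")]
    for series in build_index(ae, geo, sra):
        print(series)
```

Subtracting by GEO ID at level 2 and by BioProject ID at level 3 mirrors the IDs that the text says are available at each level.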

Graphical web interface
Gathering all three levels of data described above, AOE enables visualization and exploration of gene expression data. AOE provides an interactive web interface (https://aoe.dbcls.jp/) to retrieve data of interest (Fig 3). Users can see overall statistics of the data stored in AOE (Fig 3A). The histogram ranking the quantification methods can be redrawn dynamically by clicking a technology name. Fig 3B shows the numbers of data in AOE for sequencing assays (RNA-seq) only.
Users can easily limit data by organism and by quantification method of gene expression. For example, users can search by the keyword 'hypoxia' (Fig 3A). AOE currently reports 524 items with three histograms (by year, by organism, and by quantification method; Fig 3C). After looking at the histograms, the user can limit the data to 'Homo sapiens' by dragging over the bar in the histogram by organism. AOE then redraws the histograms with the selected data (Fig 3D). Additionally, the user can limit the data to 'Illumina' by dragging over the bar in the histogram by quantification method (Fig 3E). The selected data (58 records currently) can be retrieved by clicking the 'Retrieve' button (Fig 3F). Users can browse the retrieved data and jump to the original data by clicking IDs in the table (ArrayExpress, BioProject and GEO; Fig 3G). Optionally, users can also download the list of IDs with the 'Download ID list' button.
A shortcut to retrieve the list for a specific organism is to click the species icon labeled with its name and then the 'Retrieve' button (Fig 3A). The top 30 species in AOE are listed and can be accessed in this way.
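The search-then-narrow workflow above amounts to a keyword filter followed by per-facet counts and equality filters. The sketch below illustrates that logic only; it is not the AOE implementation (which draws the histograms in the browser), and the record fields `year`, `organism`, `method` and `description` are assumed names.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, object]


def search(records: List[Record], keyword: str) -> List[Record]:
    """Case-insensitive keyword match against the description field."""
    needle = keyword.lower()
    return [r for r in records if needle in str(r["description"]).lower()]


def histograms(records: List[Record]) -> Dict[str, Counter]:
    """Count records per year, organism and quantification method."""
    return {
        facet: Counter(r[facet] for r in records)
        for facet in ("year", "organism", "method")
    }


def narrow(records: List[Record], **criteria: object) -> List[Record]:
    """Keep only records whose fields equal all the given facet values."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Toy records for illustration; not real AOE entries.
    records = [
        {"year": 2015, "organism": "Homo sapiens", "method": "Illumina",
         "description": "hypoxia response in cultured cells"},
        {"year": 2016, "organism": "Mus musculus", "method": "Illumina",
         "description": "hypoxia in liver"},
        {"year": 2016, "organism": "Homo sapiens", "method": "Affymetrix",
         "description": "Hypoxia time course"},
    ]
    hits = search(records, "hypoxia")
    print(histograms(hits)["organism"])
    selected = narrow(hits, organism="Homo sapiens", method="Illumina")
    print(len(selected))
```

Recomputing the histograms after each call to `narrow` corresponds to the redrawing step shown in Fig 3D and Fig 3E.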

Application programming interface
Users can also query AOE via an API. AOE provides a simple Representational State Transfer (REST) API that enables users to perform searches with their client programs in an automated manner. The search results can be retrieved as JSON-formatted output through the following URI:

Discussion
We have developed and maintained an index of public gene expression databases, called All Of gene Expression (AOE). AOE originally began as an index for the ArrayExpress (AE) database maintained at EBI (we call this 'AOE level 1'), because AE had imported gene expression data from Gene Expression Omnibus (GEO), which is the largest gene expression database and is maintained at NCBI. That meant that AE contained all gene expression data, including those deposited to GEO. AE stopped importing data from GEO in 2017. While GEO data already archived in AE remain available from AE, new data archived in GEO are no longer available from AE. Thus, we started indexing GEO data directly by making use of the DBCLS SRA API (AOE level 2). In 2018, the DNA DataBank of Japan (DDBJ) started the Genomic Expression Archive (GEA), which is a repository for gene expression quantification data. Integration of these public gene expression databases is needed to increase the reusability of gene expression data. Newly submitted data contain BioProject IDs, and this feature makes it possible to integrate multiple levels of indices and resolve complicated relationships among IDs, although old AE entries do not have BioProject IDs.
The existence of a great deal of data at AOE level 3 shows that not all sequencing-based gene expression data are stored in GEO. This indicates that GEO alone is insufficient as a complete public gene expression database. Much of the data at AOE level 3 are heterogeneous, and their metadata can lack several descriptions that are curated and cleanly described in GEO and AE.
A similar approach has also been undertaken by EBI, called the Omics Discovery Index (OmicsDI; https://www.omicsdi.org/), which provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics) [9]. OmicsDI aims to integrate various types of omics data and is not focused on gene expression data.
AOE is focused on gene expression data. It is also designed to be a search interface for DDBJ GEA, and a link to AOE can be found on the official GEA website. In this role, AOE is expected to remain in continuous use on the DDBJ website.
The web interface for AOE is simple and user friendly, and so AOE can be used by biologists who are not familiar with database searching. AOE can also be used by professionals to construct reference expression datasets for specific organisms. We have also developed Reference Expression dataset (RefEx) for humans and mice [10]. We are planning to implement RefEx for other organisms, making use of these reference expression datasets retrieved by AOE.
For future development, we plan to use not only metadata but also quantified expression data, which will allow users to search data based on the similarity of gene expression profiles. We also plan to use the results of quality control by the FASTQ program to screen RNA-seq data.

Methods
Acquisition of public gene expression data
AOE consists of two major types of data source. One is EBI ArrayExpress (AE), and the other is data in NCBI, including the Gene Expression Omnibus (GEO).
For the AE data type, several files are required to make an AOE index. These files are in a simple spreadsheet-based, MIAME-supportive format, called MicroArray Gene Expression Tabular (MAGE-TAB).

Organizing metadata from different sources
For the AE type of data, Array Design Format (ADF), Investigation Description Format (IDF) and Sample and Data Relationship Format (SDRF) files are required to make an index for AOE.
Data from the DDBJ Genomic Expression Archive (GEA) also consist of the AE type of data, and are available from its FTP site (ftp://ftp.ddbj.nig.ac.jp/ddbj_database/gea/). We made use of the AE type of data to construct an initial AOE index set (called AOE level 1).
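As an illustration of how AE-type metadata can be reduced to an index row, the sketch below reads an IDF file, in which each tab-delimited line starts with a field name followed by its values, and emits one tab-delimited line per series. It is an assumption-laden Python sketch rather than the authors' Perl5 parser: the function names and the choice of output columns are hypothetical, while 'Investigation Title' and 'Public Release Date' are field names from the MAGE-TAB format.

```python
import csv
import io
from typing import Dict, List


def parse_idf(text: str) -> Dict[str, List[str]]:
    """Parse IDF text into {field name: [values]}, skipping blanks/comments."""
    fields: Dict[str, List[str]] = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        tag = row[0].strip()
        values = [value.strip() for value in row[1:] if value.strip()]
        fields.setdefault(tag, []).extend(values)
    return fields


def first(fields: Dict[str, List[str]], tag: str, default: str = "") -> str:
    """Return the first value of a field, or a default when it is absent."""
    values = fields.get(tag, [])
    return values[0] if values else default


def index_line(accession: str, fields: Dict[str, List[str]]) -> str:
    """Build one tab-delimited index row (hypothetical column layout)."""
    return "\t".join([
        accession,
        first(fields, "Public Release Date"),
        first(fields, "Investigation Title"),
    ])


if __name__ == "__main__":
    # Toy IDF content for illustration; not a real AE or GEA entry.
    idf_text = (
        "Investigation Title\tHypoxia response\n"
        "Public Release Date\t2014-01-01\n"
        "SDRF File\texample.sdrf.txt\n"
    )
    print(index_line("E-XXXX-0", parse_idf(idf_text)))
```

Because GEA distributes the same file types, the same parsing step would apply to files fetched from its FTP site.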
GEO data in the Sequence Read Archive (SRA), BioProject and BioSample are used to make an index for AOE. These data are stored in the DBCLS SRA as JSON-LD, and DBCLS also maintains an application programming interface (API) for their metadata. AOE uses this API to retrieve the data needed to make the index (AOE level 2).
Finally, we collected RNA-seq data in SRA, making use of the DBCLS SRA API. Most of these data are already in AOE level 2, but many entries are found only through this filter (AOE level 3).
Data parsers that produce a tab-delimited text file for visualization are implemented in Perl5 and UNIX shell commands. All of these shell and Perl5 scripts are accessible from GitHub (https://github.com/dbcls/AOE/).

Visualization of datasets
For visualizing datasets, we employed custom Python3 scripts together with D3.js, a JavaScript library for manipulating documents based on data (https://d3js.org/). This enables data selection by mouse operation. For example, the user can select data by release date by dragging over the histogram generated in the keyword search.