High density genotype storage for plant breeding in the Chado schema of Breedbase

Modern breeding programs routinely use genome-wide information for selecting individuals to advance. The large volumes of genotypic information required present a challenge for data storage and query efficiency. Major use cases require genotyping data to be linked with trait phenotyping data. In contrast to phenotyping data that are often stored in relational database schemas, next-generation genotyping data are traditionally stored in non-relational storage systems due to their extremely large scope. This study presents a novel data model implemented in Breedbase (https://breedbase.org/) for uniting relational phenotyping data and non-relational genotyping data within the open-source PostgreSQL database engine. Breedbase is an open-source, web-database designed to manage all of a breeder’s informatics needs: management of field experiments, phenotypic and genotypic data collection and storage, and statistical analyses. The genotyping data is stored in a PostgreSQL data-type known as binary JavaScript Object Notation (JSONb), where the JSON structures closely follow the Variant Call Format (VCF) data model. The Breedbase genotyping data model can handle different ploidy levels, structural variants, and any genotype encoded in VCF. JSONb is both compressed and indexed, resulting in a space and time efficient system. Furthermore, file caching maximizes data retrieval performance. Integration of all breeding data within the Chado database schema retains referential integrity that may be lost when genotyping and phenotyping data are stored in separate systems. Benchmarking demonstrates that the system is fast enough for computation of a genomic relationship matrix (GRM) and genome wide association study (GWAS) for datasets involving 1,325 diploid Zea mays, 314 triploid Musa acuminata, and 924 diploid Manihot esculenta samples genotyped with 955,690, 142,119, and 287,952 genotype-by-sequencing (GBS) markers, respectively.


Introduction
Routine genotyping is now possible with the advent of low-cost, high-throughput genotyping platforms, giving rise to enormous amounts of data but presenting challenges for data management and queriability [1]. Plant breeding programs routinely genotype for 1) quality control Materials and methods

I. Chado schema modifications
The Natural Diversity (ND) module [17] in the Chado schema provides the foundation for the database schema used in the following work. The ND module allows for storage and querying over many projects, and the evaluation of many stocks, genotyping experiments, and phenotyping experiments. Only three modifications to the ND schema are required to accommodate the new non-relational genotyping data storage model: 1) the value field in the genotypeprop table and 2) the value field in the nd_protocolprop table are both converted to the JSON (JSONb in PostgreSQL) column type instead of text. Fig 1 shows the core of the schema with the modifications highlighted in red. Additionally, 3) a Generalized Inverted Index (GIN) is applied to the JSONb column in the genotypeprop table, allowing for faster queries of keys within the JSON structures. Genotype data in this schema are linked to a 'genotyping protocol'; a 'genotyping protocol' is given a name, description, and controlled vocabulary type as an entry in the nd_protocol table. The 'genotyping protocol' aggregates all meta-information provided in a VCF file header, including variant calling software, genotypes calling format (i.e. allele depth, depth of coverage, quality), the reference genome used for alignment, and marker-related information (i.e. applied filters and information fields). It stores the aforementioned information and all meta-data lines from the VCF file header in a JSON formatted value entry of the nd_protocolprop table linked to nd_protocol. The 'genotyping protocol' is central to grouping and delineating marker information in the database. The JSON data structure storage in the value field of the nd_protocolprop table is described in more detail in the 'Genotype Storage JSON Structures' section below. A 'genotyping project' is defined as a group of 'genotyping protocols', hierarchically; this organizational structure is useful for large breeding programs where multiple genotyping events are occurring. The 'genotyping project' is stored as an entry in the project table.
The samples genotyped are stored as entries in the stock table; in Breedbase, the stock entries that can be genotyped have controlled vocabulary terms of either 'tissue_sample', 'plot', 'plant', or 'accession'. The genotype for a sample under a specific 'genotyping protocol' is stored as an entry in the genotype table with a link to the genotypeprop table; the JSON formatted value field in the genotypeprop table represents the sample's complete 'callset' under a specific 'genotyping protocol'. A 'callset' represents all genotypes for all markers in the 'genotyping protocol' for a single sample as defined in BrAPI [8]. More information on the JSON data structure stored in the value field of the genotypeprop table is given in the 'Genotype Storage JSON Structures' section below. Phenotypes for samples are linked using the phenotype table, where the variable being measured is a controlled vocabulary type linked using the cvalue_id field and the phenotypic result is stored as a string in the value field. The nd_experiment table connects the project, nd_protocol, stock, phenotype, and genotype tables under a controlled vocabulary type, enabling querying across all entities involved.

II. Genotype storage JSON structures
The JSON value fields stored in the genotypeprop and nd_protocolprop tables take their forms from the VCF data model. Intuitively, the first nine columns of information in the VCF are aggregated into a single JSON object and stored as an entity in the nd_protocolprop table, while a new JSON object is aggregated for each of the sample columns (columns 10 to the maximum number of columns in the VCF) and stored as entries in the genotypeprop table. Tables 1 and 2 describe the structure of the value field JSON objects in the nd_protocolprop and genotypeprop tables, respectively, for a hypothetical 'genotyping protocol' involving two markers named 'S2_20032' and 'S2_20033'. [ "##fileformat = VCFv4.0", "##Tassel = <ID = GenotypeTable, Version = 5>", "##FORMAT = <ID = GT, Number = 1, Type = String, Description = 'Genotype'>" ] 'vcf_map_details' marker_names Array of Strings An array containing all marker names used in the genotyping protocol.
The second entry in the nd_protocolprop table is stored using the controlled vocabulary term 'vcf_map_details_markers'. In this entry, the value field JSON object is a key value mapping of marker names to marker information objects (MIOs); each MIO has the following keys: 'chrom', 'pos', 'name', 'ref', 'alt', 'qual', 'filter', 'info', and 'format', signifying values for, chromosome number, base pair position, marker name or unique identifier, reference allele, comma separated alternate alleles, marker quality score, filter status, additional information, and genotype score format, respectively. This terminology is taken directly from the VCF specification and the meanings and data formats are identical. The third entry in the nd_protocolprop table is stored under the controlled vocabulary term 'vcf_map_details_markers_array'. This entry contains an array of MIOs. Note that the previous entry was an object of MIOs and represents redundant information; however, these two data representations allow for flexible and performant construction of JSON SQL queries in PostgreSQL.
The value field JSON object in the genotypeprop table is composed of dynamic top-level keys; the top-level keys are all the marker names tested in the 'genotyping protocol' and are the same marker names stored in the value field JSON object of the nd_protocolprop table described in the previous paragraph. As described in Table 2, the subsequent object under these top-level keys contains the genotype score information, with keys coming dynamically from the 'format' field of the uploaded genotype file; uploading a VCF file allows for storage of all VCF defined genotype attributes e.g. 'GT', 'AD', 'DP', 'GQ', 'PL', and 'GL'. Alternatively, uploading a tab-delimited allele matrix file, allows for storage of only 'GT' in the genotypeprop JSON object. In all cases for every genotype, the uploaded genotype data files must minimally contain the marker name, the reference and alternate alleles, and the genotype encoded in VCF 'GT' form. Additionally in all cases of genotype upload, Breedbase generates and stores two new keys named 'NT' and 'DS'. The 'NT' key contains the nucleotides for the given polymorphism (e.g. 'A', 'T', 'C', or 'G'), separated by a comma; the order of the nucleotides is the same as in the 'GT' key and this is true for both unphased data (the GT key contains the "/"  Table 1, $marker_name would be "S2_20032" or "S2_20033".)

Object
The top-level keys are the marker names. Each value is an object containing all genotype score information in accordance with the VCF data model, and the "NT" key. The "NT" key is generated by Breedbase and is important for allelic interpretation of the genotype. The "DS" key represents the dosage genotype (value of '0', '1', '2', etc., or 'NA'); it is generated by Breedbase if it is not provided during upload.
An important difference between a VCF file and the data structure in the database is that VCF files are 'marker first' (markers define the rows and the samples define the columns) whereas this database implementation stores the data 'sample first' (the database rows define samples and the 'columns' in the JSONb data structure define the markers) [21]. The data in the database is therefore essentially the transpose of the VCF file and accordingly has to be transposed when loaded and when downloaded into certain formats, including VCF. Transposition is both memory and time intensive, with a complexity of O(n 2 ). While it would be easy to store both 'marker first' and 'sample first' matrix representations in the database, we have opted not to do this because it leads to complex operations when data have to be added or removed.

III. Caching of results
Application performance is critical for breeding programs to effectively incorporate genotypic information in decision making. Provided that genotyping protocols do not change after being uploaded into Breedbase, file caching is used to maximize genotypic query performance in the Breedbase system and to minimize system memory requirements. When a user issues a genotype query to the Breedbase system the following actions are executed: (1) the query parameters are encoded into an MD5 hash string (2) the file cache determines whether this string represents a query that has not been served by the system before (3) if the query has been served before then the results will simply be returned to the user from the file cache (4) if the query has not been served before, then the results will be fetched from the PostgreSQL database, written to the file cache, and then returned to the user. The file cache minimizes system memory requirements by iteratively writing results from the PostgreSQL database to the cached file line-by-line; similarly, results can be efficiently read line-by-line.
The file cache is implemented using the Perl Cache::File module. To meet the requirements of different analyses applications, three formats can be retrieved from the file cache system currently: VCF, dosage matrix, and internal JSON. As is discussed in the "Packaged Queries" section below, simple entry-points for retrieving results in any of the three formats from the file cache are provided. The VCF and dosage matrix formats are largely for user-facing actions for serving genotype results directly as files, whereas the internal JSON format is largely for operations and tools within Breedbase. In the dosage matrix format, the first column lists all the genotyped markers, each subsequent column is for a genotyped sample, and the genotypes are dosage values (e.g. '0', '1', '2', etc., or 'NA'). The internal JSON format follows the JSON representations previously described; however, it also returns experimental metadata concerning the genotyped samples.

I. Example SQL queries
Using PostgreSQL's JSON and JSONb query functions, precise queries spanning phenotypes and genotypes can be constructed; however, directly constructing SQL queries is not recommended because it does not take advantage of the file caching system within Breedbase. Furthermore, constructing SQL queries directly may not retrieve all the possible metadata that is available within the Breedbase database. The queries here are for demonstration purposes only; in practice, the examples demonstrated in the "Packaged Queries" section below should be used as templates in production settings.
SQL Example I. To construct a query with the following criteria: (1) stocks genotyped for a marker named either 'S10_0880' or 'S11_0112', and (2)

II. Packaged queries
Perl Moose objects named CXGN::Genotype::Search, CXGN::Genotype::GRM, and CXGN:: Genotype::GWAS are available in Breedbase to facilitate query and analyses construction, and to provide an interface to the file cache system.
For convenience and performance reasons, the CXGN::Genotype::Search object provides three entry-points for retrieving results from the file cache, formatted as either VCF, dosage matrix, or internal JSON. There are two additional entry-points for retrieving genotypes in VCF and dosage matrix formats computed from genotyped parents; the progeny's genotypes are calculated for each marker as an average of the parental dosage genotypes, simulating the inbreeding coefficient of each marker genotype of the hybrid as one-half of each of the two parents [22]. An example instantiation signature is given below with entry-points for retrieving each of the cache file formats. These entry-points have the advantage of being memory efficient by allowing reading of results line-by-line from the result file. The returned file handles can also be returned directly to the user for download, as is used for the Breedbase Search Wizard demonstrated in the "Web Interface Queries'' section below. The GRM is computed using the R rrBLUP package and imputes missing genotypes using an 'Expectation Maximization' (EM) algorithm [23]. The genotypes are filtered using user input for minor allele frequency (MAF) and percent missing data for markers and samples prior to calculating the GRM. Three formats are available for download: a tab-separated matrix format (.tsv), a three-column format (.tsv), and a heatmap figure (.pdf). The three-column format is particularly useful for fitting mixed models in ASReml [24] once the data is exported from Breedbase. The GRM can be computed for accessions whose parents are genotyped, as described previously, using the 'get_grm_for_parental_accessions' boolean attribute. My $result_filehandle_grm = $geno->download_grm(@required_config); C. CXGN::Genotype::GWAS. A genome-wide association study (GWAS) can be computed using the CXGN::Genotype::GWAS Perl Moose object by minimally specifying a list of accessions, a list of phenotypic traits, and a genotyping protocol. The required configuration parameters are 'bcs_schema', 'people_schema', 'cache_root', 'grm_temp_file', 'gwas_temp_file', and 'pheno_temp_file' for the Bio::Chado::Schema database schema connection, the CXGN::Metadata::Schema database schema connection, the directory of the cache file system, and temporary files to process the GWAS result, respectively; all other fields are query parameters and parameters for performing the GWAS. The R rrBLUP package is used to perform imputation using the 'EM' algorithm [23]; rrBLUP performs the GWAS using a mixed linear model including fixed effects for the experimental design (i.e., location and year of field experiment and replicate of tested accession) of the phenotypic measurements and using a kinship matrix calculated from the genotypic data. Genotypes are filtered by MAF and missing marker and sample data prior to calculating the kinship matrix and the GWAS. The GWAS can be computed for accessions whose parents are genotyped, as described previously, using the 'get_grm_for_parental_accessions' boolean attribute. If the provided trait list represents a series of repeated measurements, the boolean 'traits_are_repeated_measurements' attribute can be used. When traits are not to be treated as repeated measurements, results are returned

III. Web interface queries
Breedbase provides a web-interface compatible with all modern internet browsers on any device. A suite of web-pages are available for management of germplasm resources, pedigrees, seed inventories, field trials, experimental locations, phenotypic records, crossing blocks, genotyping storage, and other plant breeding program aspects. Once the information is entered into Breedbase, the primary means of searching and retrieving information is through the Search Wizard (Fig 2).
The Search Wizard (https://breedbase.org/breeders/search) enables construction of queries spanning accessions, field trials, genotyping protocols, locations, years, and phenotypic traits, and also provides an interface for downloading phenotypic and genotypic results as data files in several formats. Genotypic data can be filtered by chromosome, start position, and end position and can be downloaded in VCF or dosage matrix formats. More precise filtering is possible by selecting a marker set; a marker set is a user defined list of markers or a range of physical positions. Once accessions are selected the GRM can be downloaded, and if phenotypic traits are selected then a GWAS can also be downloaded. The Search Wizard internally uses the entry-points previously described in the "Packaged Queries" section. Phenotypic records can be filtered by minimum and maximum values prior to downloading as CSV or Excel files.

IV. Limitations
The maximum number of markers that can be stored is limited by the PostgreSQL maximum data limit of 1GB per a single field [25]. If each marker requires 60 bytes of storage space, the maximum number of markers that can be described in a single genotyping protocol is approximately 17 million. For a 1 gigabase genome, this would represent almost 2 single nucleotide polymorphisms (SNPs) for every 100 base pairs or a polymorphism rate of about 2%. In the case of large genomes that are very polymorphic, these storage limitations may be reached, and additional implementation approaches may be required, such as storing different chromosomes individually in separate genotypeprop entries. Furthermore, storage of polyploid genotypes increases the number of bytes per marker, reducing the maximum number of markers which can be stored in the current implementation.

Fig 2. The search wizard is the primary means of querying Breedbase and provides a means for downloading phenotypic and genotypic records in several formats.
The search consists of four query categories (1) to (4) to filter across every kind of data object in the database. In this example, traits were first selected (1) and 'grain moisture', 'grain yield', and 'plant height' were chosen. Then, accessions were selected (2) and from the 1,404 accessions which met the selected trait criteria 8 accessions were chosen. Then, trials were selected (3) and of the 5 field trials which met the selected trait and accessions criteria, 4 trials were chosen. Then, locations were selected (4) and the two locations which met the previous criteria were chosen. A genotyping protocol can be selected as a filter in (1) to (4); however, a default genotyping protocol is used when one is not explicitly selected. Clicking on "Related Genotype Data" brings a dialog to filter genotype data for the selected accessions by chromosome, start position, and end position prior to downloading in VCF or Dosage Matrix formats (5). Additionally, a marker set can be selected to filter downloaded genotypes further. Genotypes can be computed from parents in the pedigrees of the selected accessions if the parents were genotyped by clicking the "Compute from Parents" checkbox for (5), (6), or (7). The genomic relationship matrix (GRM) can be downloaded (6) for the selected accessions after filtering for minor allele frequency (MAF) and missing data. Three formats for downloading the GRM are available: a tab separated matrix format (.tsv), a three-column format (.tsv), and a heatmap figure (.pdf). A GWAS can be computed by selecting accessions and traits in (1) to (4) and results can be downloaded (7) as Manhattan and QQ plot figures (.pdf) or as a tabular file of the p-values (. tsv). Clicking "Related Trial Phenotypes" brings a dialog to filter phenotypes by minimum and maximum values prior to downloading phenotypic data in CSV or Excel formats (8). https://doi.org/10.1371/journal.pone.0240059.g002

V. Performance benchmark
To test the Breedbase JSON genotype storage system, three datasets were loaded into a test Breedbase instance running on a HP Z820 workstation with 256 GB RAM and 2x Intel Xeon E5-2660v2 CPUs. The first dataset loaded is a VCF containing 314 Musa acuminata samples genotyped with 142,119 genotype-by-sequencing (GBS) markers; this data is available in the Breedbase instance https://musabase.org/breeders_toolbox/protocol/1 and contains triploid genotypes [26,27]. The second dataset loaded is a VCF containing 924 Manihot esculenta samples genotyped with 287,952 GBS markers and is available in the Breedbase instance https:// cassavabase.org/breeders_toolbox/protocol/6 [28]. The third dataset loaded is a VCF containing 1,325 Zea mays samples genotyped with 955,690 GBS markers and is available in the Breedbase instance https://imagebreed.org/breeders_toolbox/protocol/5 [29].
Additionally, four phenotypic datasets were loaded into the Breedbase instance; the phenotypic records include accessions evaluated in field trials for which genotypic records in the aforementioned VCF datasets exist. The first dataset is from https://musabase.org of 75 phenotypic traits evaluated across 3 field trial experiments of Musa acuminata accessions. The second dataset is from https://cassavabase.org of 18 phenotypic traits evaluated across 3 field trial experiments of Manihot esculenta accessions. The third dataset is from https://imagebreed.org of 14 phenotypic traits evaluated across 3 field trial experiments of Zea mays accessions. To test computing genotypes from genotyped parents, pedigrees between hybrid Zea mays progeny accessions and parent accessions are uploaded into Breedbase; a fourth dataset was uploaded of 14 phenotypic traits evaluated across 3 field trial experiments for the hybrid Zea mays accessions.
The Perl test script, the three genotypic data VCF files, the four phenotypic data CSV files, and an SQL dump of the data loaded into the test Breedbase instance are included with this publication in the "Supplemental Information". The phenotypic data files include the field experiment metadata and pedigree information. Note that there exist significant typographical errors between the accession names listed in the genotype VCF files and the accessions listed in the phenotypic information files, both for the tested accessions and for the pedigree accessions; however, Breedbase consolidates these names through a curation interface during upload of new accession names. Typographical errors such as 'Tx303' vs 'TX-303' are flagged by a text similarity score and the interface allows for correctly storing the relationships between identifiers.
Loading genotype data. Retrieving genotype and phenotype data performance. For each of the three species in the test data, a random set of 25 accessions were chosen 10 different times with replacement. Those accessions were then (1) queried for 500 random markers and genotypic data was returned in VCF and dosage matrix formats, (2) the GRM was computed using all genotypes in the genotyping protocol after filtering for 1% MAF, 60% missing marker genotypes, and 80% missing sample genotypes, (3) GWAS was performed for two phenotypic traits using all genotypes in the genotyping protocol after filtering for 1% MAF, 60% missing marker genotypes, and 80% missing sample genotypes, and (4) the accessions were queried for all phenotypic traits evaluated. An additional scenario was tested for Zea mays in which the genotypes were computed from genotyped parents in the pedigree. Table 3 lists the mean query time in seconds for these four scenarios. The same queries were then performed a second time in order to test the file cache performance; Table 4 lists the mean query time in seconds for the four scenarios retrieving data from the file cache. The genotype downloads and computation of the GRM and GWAS were performed using the CXGN::Genotype::Search, CXGN::Genotype::GRM, and CXGN::Genotype::GWAS modules described in the "Packaged Queries" section.

VI. Scalability and continued development
Cassavabase (https://cassavabase.org) is the Breedbase instance currently with the largest genotypic data, nearly 100,000 samples with dense GBS genotypes from genotyping protocols of up to 287,952 markers, as well as thousands of samples with low density genotypes from genotyping protocols of around 20 markers. The system is used routinely to perform genomic selection analysis by affiliated breeding programs using the built-in solGS tool [6]. Further scalability tests and development will be necessary before this solution can accommodate very large breeding programs; however, Cassavabase and the performance benchmark described above show that for small to medium programs the implementation presented here is an appropriate solution. Development will continue on the presented database system to support the many plant breeding communities relying on Breedbase. Software development of Breedbase is streamlined through the Git version control available on Github through https://github.com/ solgenomics/sgn. In this way, issues arising in the software can be posted and new releases to the software can be managed through a review process. Documentation is bundled directly with the software at https://solgenomics.github.io/sgn. For easy deployment, Breedbase is released in a Docker https://github.com/solgenomics/breedbase_dockerfile. The Docker deployment allows launching a Breedbase web-server and database instance with minimal configuration, and provides detailed documentation.