ETDB-Caltech: A blockchain-based distributed public database for electron tomography

Three-dimensional electron microscopy techniques like electron tomography provide valuable insights into cellular structures, and present significant challenges for data storage and dissemination. Here we explored a novel method to publicly release more than 11,000 such datasets, more than 30 TB in total, collected by our group. Our method, based on a peer-to-peer file sharing network built around a blockchain ledger, offers a distributed solution to data storage. In addition, we offer a user-friendly browser-based interface, https://etdb.caltech.edu, for anyone interested to explore and download our data. We discuss the relative advantages and disadvantages of this system and provide tools for other groups to mine our data and/or use the same approach to share their own imaging datasets.


Introduction
Three-dimensional electron microscopy (3D EM) techniques produce large and information-rich datasets about biological samples. In electron tomography (ET), samples are imaged as they are tilted incrementally, typically by 1-2 degrees between images. The resulting tilt-series of 2D projection images can then be computationally combined into a 3D reconstruction, or tomogram, of the sample with nanometer-scale resolution. ET has both biological [1] and materials science applications [2]. ET is frequently performed on frozen samples (cryo-ET) such as intact, small cells. Cryo-ET has revealed many details about cell ultrastructures that are inaccessible by other techniques, either because they cannot be purified intact or because they are not preserved by traditional EM sample preparations [3]. Another 3D EM technique, single particle analysis, also yields 3D information about cellular complexes [4].
Biological applications of 3D EM techniques are rapidly increasing, with an explosive rise in the number of datasets published [5] and excitement about the field (e.g. [6][7][8]). In addition, technological advances such as increased automation for higher-throughput data collection and movie acquisition with direct detectors are increasing the information content of datasets [9,10], which makes management of these datasets a mounting challenge [11]. At the same time, public accessibility is of critical importance [12]. 3D [...] an automatic processing pipeline that uploads and processes datasets as they are acquired by the microscope [28]. The majority of our ET datasets come from cryo-preserved cells. They represent more than 100 unique species of bacteria, archaea, and eukaryotes and have led to dozens of publications about diverse aspects of cell ultrastructure. The nature of whole-cell imaging, though, means that these datasets are far from exhausted. While we collected them for a specific study, they contain information about many other aspects of cell biology that may be useful to other researchers. While we have been sharing our data by publishing papers and depositing representative tomograms in the EMDB, we have also received many requests (from software developers, biologists, and EMPIAR) to share more of our data. We filled these individual requests, but wanted to explore a broader solution to enable our lab and others to share large amounts of data of unverified quality in a persistent and decentralized fashion. The approach we describe here uses a distributed peer-to-peer file network tracked by an ownerless ledger (blockchain) system. We describe how we used this method to release more than 11,000 electron tomography datasets (excluding those that are still part of ongoing studies), representing 85 species and encompassing more than 30 TB. 
We discuss the advantages and drawbacks of our approach, and how it can be adopted by other groups that wish to share their own datasets.

Approach
In recent years, decentralized cryptographic ledgers, or blockchains, have been explored as a method to securely record data (typically cryptocurrency transactions, for which they were first conceived [29]). Rather than relying on a trusted central authority, blockchains employ a security model that builds consensus from a decentralized network of users and miners, none of whom necessarily need to trust one another. Users execute and broadcast transactions to the network. Miners collect these transactions from the network and gather them to build a block. In the original proof-of-work blockchain model [29], miners compete to produce a valid block. For each block, miners use an algorithm to calculate a number, which functions as a cryptographic fingerprint of the block. This number is known as the hash. If the information inside the block changes, the hash also changes. To make a valid block, miners must find a hash smaller than a number set by the blockchain protocol, the target. To search for hashes with a lower value than the target, miners are allowed to change the content of one of the fields of the block header, the nonce. Miners then keep trying different nonces, re-calculating the hash of the block and checking whether it is smaller than the target. The smaller the target, the harder it is to find a nonce that produces a valid hash. The first miner to find a nonce that makes the block's hash smaller than the target has the right to publish the block to the network and collect a reward, which is a pre-established amount of coins. By analyzing the contents of the published block, other miners and users can independently verify whether or not the published block is valid. The new blocks are linked to previous blocks by referencing the hash of the last published block, thus forming a chain of blocks, or a blockchain. Originally developed to solve the problem of double-spending, blockchain technology has since been adapted to other uses. 
For instance, the Republic of Georgia uses the bitcoin blockchain to record land transfer titles, one of several countries using the cryptographic ledger to improve the security of property rights [30]. In the United States, blockchains have been proposed as a way for patients to control access to their digital medical records [31,32]. Blockchains are used by Nasdaq in the U.S. and stock exchanges in other countries to record private securities transactions [33].
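The proof-of-work search described above can be illustrated with a minimal sketch. It assumes a single SHA-256 hash over a text header as a stand-in for the hashing scheme of a real blockchain; the header layout, the function names, and the toy target are hypothetical and chosen only to keep the example fast.

```python
import hashlib


def block_hash(header: str, nonce: int) -> int:
    """Hash the header together with a nonce and return the hash as an integer."""
    digest = hashlib.sha256(f"{header}|{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def mine(header: str, target: int, max_tries: int = 1_000_000):
    """Try nonces 0, 1, 2, ... until the block hash falls below the target."""
    for nonce in range(max_tries):
        if block_hash(header, nonce) < target:
            return nonce
    return None  # no valid nonce found within max_tries


def is_valid(header: str, nonce: int, target: int) -> bool:
    """Verification needs only one hash computation, unlike the search."""
    return block_hash(header, nonce) < target


# Toy target (an assumption): the 256-bit hash must be below 2**244,
# so roughly one nonce in 4096 is expected to succeed.
TARGET = 2 ** 244
previous_hash = "00" * 32  # placeholder for the hash of the last published block
header = f"prev={previous_hash};transactions=example"
nonce = mine(header, TARGET)
```

Lowering `TARGET` makes valid nonces rarer and the search correspondingly longer, while the check in `is_valid` stays equally cheap. That asymmetry is what lets other miners and users verify a published block independently.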
In 2013, an anonymous developer announced a fork from a cryptocurrency called Litecoin to create a new cryptocurrency, FlorinCoin (FLO), whose ledger features a descriptive transaction comment line similar to that found on a traditional check. The text entered in this transaction comment is stored in the FLO blockchain along with the details of the transaction. Each comment can contain up to 528 characters [34]. In 2014, a company called Alexandria proposed to use this feature as a public record of information and developed an open source protocol termed the Open Index Protocol (OIP) [35]. They first used this protocol to record public social media status in the FLO blockchain. Later, they expanded the specifications of the protocol to register the metadata of videos and music in the FLO blockchain while storing the files in the peer-to-peer file-sharing network BitTorrent, allowing artists to prove ownership of these digital assets. From September 2017 to May 2018 FlorinCoin passed through a series of upgrades. It was renamed FLO, its code was updated to version 0.15 of Bitcoin (still retaining the scrypt algorithm for proof-of-work), and the comment field was expanded to 1,040 characters. The current OIP specification (0.42) is optimized for the new FLO comment field size, encompasses a variety of data types, and uses a peer-to-peer file system called the InterPlanetary File System (IPFS) [36] to store files. In cases where the metadata of a record exceeds 1,040 characters, the OIP specification automatically splits the record into multiple transactions. There is no limit on how much metadata a single record can hold. File metadata is thereby cryptographically secured, and completely searchable, allowing anyone to discover and download the files from the IPFS.
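The idea of splitting a long record across several 1,040-character comment fields can be sketched as follows. This is an illustration only: the `part:i/n:` prefix, the reserved envelope size, and the function names are hypothetical and do not reproduce the actual OIP multipart format.

```python
import json

MAX_COMMENT_CHARS = 1040  # comment field size stated in the text


def split_record(record: dict, max_chars: int = MAX_COMMENT_CHARS, envelope_chars: int = 40):
    """Serialize a record and cut it into parts that each fit one comment field.

    Each part carries a hypothetical 'part:i/n:' prefix so that the parts can
    be put back in order; envelope_chars reserves room for that prefix.
    """
    text = json.dumps(record, separators=(",", ":"))
    payload = max_chars - envelope_chars
    pieces = [text[i:i + payload] for i in range(0, len(text), payload)] or [""]
    total = len(pieces)
    return [f"part:{i}/{total}:{piece}" for i, piece in enumerate(pieces)]


def join_record(parts):
    """Reassemble parts received in any order and parse the original record."""
    indexed = []
    for part in parts:
        _, position, piece = part.split(":", 2)
        indexed.append((int(position.split("/")[0]), piece))
    return json.loads("".join(piece for _, piece in sorted(indexed)))


# A record whose metadata is far longer than one comment field.
record = {"title": "example tomogram", "notes": "a" * 3000}
parts = split_record(record)
restored = join_record(list(reversed(parts)))
```

Because every part states its position and the total count, a reader scanning the blockchain can tell when a record is complete before trying to parse it.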
We were curious to see if this blockchain-based data distribution model would be effective to openly and securely share our scientific imaging data. In the scheme, each dataset would be distributed to IPFS and its metadata recorded in the FLO blockchain. Any interested party, typically through a user-friendly front-end in their web browser, could query the blockchain for datasets of interest and retrieve them from IPFS. We called the resulting distributed database the public Electron Tomography Database-Caltech (ETDB-Caltech), and its information flow is schematized in Fig 1. We worked with Alexandria to develop a digital record type tailored to the metadata of our datasets that could be encoded easily in the FLO transaction comment. The result, Research-Tomogram, contains fields corresponding to the information we store about each dataset in our internal database. This information includes details about the user who collected the data, descriptions of the sample and its preparation, and data acquisition and processing parameters. Where appropriate, this information follows standard conventions for the 3D EM field [37]. We wrote a simple GoLang script to automatically read this information from the record in the internal lab database and translate it into an OIP Research-Tomogram record. If other groups want to adopt this approach, they can use a subset of these fields and/or add their own as necessary to match their local recordkeeping. Table 1 lists the currently available fields in the Research-Tomogram record.
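The translation step from an internal database entry to a Research-Tomogram record can be sketched as below. The column names and record field names are hypothetical placeholders (the real fields are those listed in Table 1), and the sketch is written in Python rather than the GoLang of the actual script.

```python
import json

# Hypothetical mapping from internal database columns to record fields.
# A lab adopting the approach would replace this with its own descriptors.
FIELD_MAP = {
    "user_name": "collectedBy",
    "species": "speciesName",
    "tilt_step_deg": "tiltStep",
    "pixel_size_nm": "pixelSize",
}


def to_record(row: dict, field_map: dict = FIELD_MAP) -> str:
    """Translate one database row into a JSON record, skipping empty fields."""
    details = {
        new_name: row[old_name]
        for old_name, new_name in field_map.items()
        if row.get(old_name) is not None
    }
    record = {"type": "Research", "subtype": "Tomogram", "details": details}
    return json.dumps(record, sort_keys=True)


# Example row with one missing value (pixel size was not recorded).
row = {"user_name": "A. Researcher", "species": "Example species",
       "tilt_step_deg": 1.0, "pixel_size_nm": None}
record_json = to_record(row)
```

Keeping the mapping in a single table such as `FIELD_MAP` is one way to let another group use a subset of the fields or add its own without touching the rest of the script.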
As in other peer-to-peer networks, files can be chunked and hosted from multiple nodes in the network. Users who download a file and participate in IPFS can choose to host it in this fashion for other users. This feature makes the distribution model scalable; if many users are downloading a file, multiple seeds speed up those downloads, avoiding a bottleneck from a single server. This scheme is particularly attractive for distributing the contents of larger files because they are divided and transmitted through the network in smaller chunks. The integrity of each chunk is verified as it arrives, and the chunks are then reassembled into the final file. In case of service interruption, the resumed download fetches only missing or corrupted chunks, which avoids having to redownload the entire file. In addition, if the file is hosted by multiple nodes of the network, individual chunks can be served simultaneously, increasing download speed. By using a peer-to-peer network, not only can users and labs contribute to the distribution of files, but institutions with the mission of safekeeping information, such as libraries and repositories, can also download and distribute the datasets by participating in the IPFS network. In our case, we expect relatively light file traffic, so at the current time, files are downloaded solely from our server, as in a traditional distribution model. Therefore, at a minimum, if no other users or institutions participate in the ETDB, we can replicate the typical scheme of centralized repositories, although files remain distributed in servers in participating laboratories.
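Chunk-wise verification and resumable download, as described above, can be illustrated with a small sketch. It assumes plain SHA-256 fingerprints and a tiny fixed chunk size for readability; the function names are hypothetical and the sketch does not reproduce how IPFS itself addresses or transfers chunks.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE):
    """Cut a file's contents into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    """Compute the expected fingerprint of every chunk."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]


def missing_or_corrupt(received: dict, expected: list):
    """Return the indices of chunks that are absent or fail verification."""
    return [
        i for i, digest in enumerate(expected)
        if i not in received or hashlib.sha256(received[i]).hexdigest() != digest
    ]


def reassemble(received: dict, expected: list) -> bytes:
    """Join the chunks in order once every one of them has been verified."""
    if missing_or_corrupt(received, expected):
        raise ValueError("download incomplete: some chunks are missing or corrupt")
    return b"".join(received[i] for i in range(len(expected)))


original = b"tilt-series-data"
chunks = chunk(original)
expected = fingerprints(chunks)

# An interrupted download: chunk 0 arrived intact, chunk 2 arrived corrupted.
received = {0: chunks[0], 2: b"XXXX"}
todo = missing_or_corrupt(received, expected)

# Resuming fetches only the chunks that are still needed.
for index in todo:
    received[index] = chunks[index]
restored = reassemble(received, expected)
```

Only the indices in `todo` are fetched again after the interruption, and each of them could in principle come from a different node hosting the file.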
In the rare event that a dataset is published in error, OIP offers the option of deactivating or editing a published record. Because the blockchain is persistent, editing a record will not modify the original metadata. Instead, edits to a record can be performed through another type of OIP record called an Edit. In this record type, a JSON formatted patch to the metadata containing the difference between the new and old version is recorded in the blockchain. This scheme keeps records up-to-date while also registering changes over time. Likewise, deactivating a record will not erase the metadata published in the blockchain, but the record will no longer be available to anyone using the OIP API to search the blockchain. In that case, if a user were interested in an unavailable tomogram, they would have to search the raw data in the blockchain, and hope that the files were still in the IPFS network.
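The append-only editing scheme can be sketched as follows. The patch layout used here (a `set` and a `remove` entry) is a simplified, hypothetical stand-in for the JSON patch that an OIP Edit record carries; the point is only that the original metadata is never modified and the current state is obtained by replaying the edits.

```python
def make_patch(old: dict, new: dict) -> dict:
    """Record only the difference between two versions of the metadata."""
    return {
        "set": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "remove": sorted(k for k in old if k not in new),
    }


def apply_patch(metadata: dict, patch: dict) -> dict:
    """Return a new metadata dict; the input is left untouched."""
    updated = {k: v for k, v in metadata.items() if k not in patch["remove"]}
    updated.update(patch["set"])
    return updated


def current_version(original: dict, edits: list) -> dict:
    """Replay every edit, in publication order, on top of the original record."""
    state = original
    for patch in edits:
        state = apply_patch(state, patch)
    return state


original = {"species": "Example speceis", "pixelSize": 1.2, "draftNote": "check"}
corrected = {"species": "Example species", "pixelSize": 1.2}

edit = make_patch(original, corrected)
latest = current_version(original, [edit])
```

Because `original` and every patch remain stored, any earlier version of the record can be reconstructed by replaying a shorter prefix of the edit list.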
There are two ways that users can download our datasets. The first is through a direct query of the blockchain and IPFS. We built a command-line application that facilitates this approach; see Materials & Methods for details. To increase public accessibility, we added a second route: a browser-based front-end. This graphical interface, which can be found at https://etdb.caltech.edu, provides an intuitive, interactive experience for anyone to browse ETDB-Caltech datasets, view images and videos they contain, and download part or all of each dataset. A sample dataset display page is shown in Fig 2. The ETDB-Caltech front-end offered us a chance to highlight scientific challenges for two target user groups: cell biologists and software developers. We hope cell biologists will find novel features in the imaged cells, and identify those that remain mysterious. Electron tomograms contain a wealth of information, not all of which is currently interpretable; recently, for instance, we published a paper describing some of the cellular features we have observed in our electron tomograms but could not identify [38]. We hope software developers will use the released datasets to improve image-processing algorithms. In particular, we hope the availability of these datasets contributes to the development of software that can: (1) more reliably find and track the fiducial markers used for alignment in tomographic reconstruction; (2) automatically and accurately segment the boundaries of cells; and (3) automatically segment large macromolecular complexes in cells. In addition to their usefulness to experts in the field, the datasets in ETDB-Caltech may be of interest to students and the general public. 
To welcome these users, we designed the front-end of ETDB-Caltech to be accessible and educational, with information about the data and technology, as well as a Featured Tomograms page highlighting various features of bacterial and archaeal cells that are visible in electron tomograms (Fig 3).

Outlook
Here we tested a new approach to publicly share a large amount of ET data. If our goal was simply to continue honoring requests from the community to make our datasets public, it would have been cheaper and easier to simply host the data from a local MySQL database, as we do for our internal group users. However, we also wanted to make a broader resource that could encompass data from many ET labs into a flexible repository that does not rely on a central authority. If ETDB is ultimately successful in enabling large-scale community data sharing, we believe it will complement (but never replace) the mission of curated repositories like EMDB/EMPIAR.
Compared to more centralized models of data storage, this dissemination model offers several attractive points. The first is flexibility. Multiple file types can be combined in a single OIP record, allowing, for example, light micrographs from correlative light and electron microscopy experiments and annotated segmentations to be included in EM datasets; this has been cited as a key feature lacking in some current repositories [12,39]. Other file types from different imaging modalities can be accommodated with similar ease. The OIP specification of the Research-Tomogram record type requires few mandatory fields (Table 1). These fields can be adapted to the metadata collected by other groups, who may be using different internal databases (e.g. [40,41]). Here we implemented a first version of the Research-Tomogram record type. We expect that other labs will adapt this record type for their needs. However, this flexibility comes at the cost of compatibility among records from different labs. We anticipate that the developers of third party applications that display datasets from multiple labs will be the most impacted by this issue. For this reason, we recommend that, in the future, participating laboratories engage in a discussion to minimize incompatibility between datasets published by different labs. 
This discussion could occur in forums such as the 3DEM mailing list, a highly utilized communication channel for three-dimensional EM researchers to discuss issues related to their field, or on the OIP Wiki [35]. Another downside of this flexibility is that, compared to repositories of validated datasets like EMDB/EMPIAR [26], ETDB entries may be missing information like pixel size or contain errors in metadata. This caveat should be kept in mind when using the data in further studies; information critical to interpretation should be verified with the depositor.
Another appealing feature of distributed file sharing is the distribution of storage and cost. 3D EM datasets are large, as reflected by EMPIAR, which has grown to accommodate >80 TB of stored data in 5 years [42]. These datasets are associated with only 168 studies [43]. The popularity of 3D EM methods, particularly cryo-ET [8], is growing rapidly: the number of entries in the EMDB has more than doubled over the last three years [5,44]. There are currently more than 6,500 entries in the EMDB [44]; if each of these was associated with a similarly-sized dataset in EMPIAR, more than 3 PB of centralized storage space would be required. In a distributed model, each contributing lab is responsible for storing their own data, which they presumably already do. In our case, we could have implemented the system using our existing server, which hosts our internal database, at no added cost. For extra security, we chose to keep the server with the internal database behind a local firewall and mirror the relevant datasets on an additional server outside the firewall hosting ETDB. This second server, which is larger than necessary to accommodate additional applications and future growth, cost US$7,000. In addition to the local server, files should be available from other nodes of the IPFS. This ensures data persistence in the event of, for instance, a local disk failure. Of course, how well this feature works depends on whether the system is widely adopted. In addition to users hosting IPFS nodes, institutions can also easily archive ETDB data through the IPFS. The more nodes that host a file in the IPFS, the higher the bandwidth available for users to download it; this scalability is a major feature of peer-to-peer networks. Currently, however, the IPFS is still experimental and, like many new technologies, unstable. For that reason, we serve the files in our front-end directly from the IPFS node running on our local server, not through the full IPFS peer-to-peer network. 
However, IPFS is in rapid development and we expect soon to update the front-end to fetch and serve the files from the IPFS. Our command line application for bulk download, ETDB-downloads, already retrieves the files from the IPFS network.
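The storage estimate quoted above (more than 3 PB) follows from the figures given in the text. A short worked calculation, assuming 1 PB = 1,000 TB and treating the quoted ">80 TB" as exactly 80 TB, so that the result is a lower bound:

```python
# Figures quoted in the text.
empiar_storage_tb = 80   # EMPIAR holdings (stated as ">80 TB", so a lower bound)
empiar_studies = 168     # studies associated with those holdings
emdb_entries = 6500      # EMDB entries (stated as "more than 6,500")

# Average dataset size per study, then scaled to one dataset per EMDB entry.
tb_per_study = empiar_storage_tb / empiar_studies
projected_tb = tb_per_study * emdb_entries
projected_pb = projected_tb / 1000  # assuming 1 PB = 1,000 TB

print(f"{tb_per_study:.2f} TB per study, {projected_pb:.2f} PB in total")
```

Each study averages a little under half a terabyte, and scaling that to 6,500 entries gives roughly 3.1 PB, consistent with the "more than 3 PB" figure.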
The maintenance of the ownerless ledger used to store the ETDB metadata, the FLO blockchain, depends on a distributed network of miners and users. This feature facilitates adoption as anyone can publish tomograms to the ETDB without having to seek permission from a central authority. However, as in other cryptocurrencies, miners and users have an incentive to participate in the FLO network depending on a combination of factors including the costs of hardware and electricity, and the value of FLO in the cryptocurrency market. Although FLO has been in circulation for over 5 years, a relatively long time by cryptocurrency standards, its eventual success is difficult to predict. Almost 150 million FLO coins are currently in circulation. The maximum supply of 160 million FLO coins is projected to be reached in 2079. At the time of writing, 1 FLO was equivalent to US$0.07, and historically FLO has mirrored the price fluctuations of Bitcoin. If FLO becomes an inviable option, it may be necessary to switch to a different ledger system in the future (Ethereum, Namecoin, and Bitcoin Cash are all capable of storing text). In this scenario, the OIP would have to be adapted to be compatible with the new blockchain and the records re-posted. Note, however, that metadata already published remains accessible as long as at least one copy of the FLO blockchain exists; we host one ourselves.
The main goal of the ETDB is to provide a public space for laboratories to publish large volumes of raw, unpublished electron tomography datasets. The data distribution model we employed has several features that are imperative to achieve this goal. The choice to use a public blockchain to store metadata means that: 1) since most of the data we envision being deposited in the ETDB is unpublished, original ownership of the data can be traced via the cryptographic signatures and timestamps encoded in the process of posting information to the blockchain, 2) different laboratories with different needs can publish their datasets in bulk due to the flexibility of the protocol, and 3) the metadata of datasets in the ETDB is always accessible via the FLO blockchain, which has had 100% uptime since its first transaction over 5 years ago. The choice to use a peer-to-peer filesystem (IPFS) to host the relevant files allows the ETDB to: 1) distribute the costs and burden of server management among participating labs, 2) provide easy adoption for institutions with the mission of safekeeping information to store and aid in distributing datasets, and 3) offer the capacity to improve the download speeds of high-demand datasets. Most importantly, because both FLO and IPFS are distributed systems, the ETDB is ownerless and does not require any central individual or institution to manage the content, the database or the file sharing system.
For us, the project took a few months to complete and the cost for the cryptocurrency transactions we used to publish 11,293 datasets was US$17.89 (see Materials and Methods). Most of the development effort was invested in the user interface as well as the scripts to automatically upload datasets to the IPFS and the metadata to the FLO blockchain using OIP. If other groups wish to adopt the same approach to make their data public, they would only need to slightly modify these scripts (available on GitHub, see Materials & Methods) to match their internal database descriptors. Our front-end code is similarly available on GitHub so that other groups can easily adapt it to taste and use it to display: (1) their own data, (2) all ETDB datasets in the IPFS, or (3) a custom subset (e.g. data from a single species or technique). In addition, individuals interested in web applications for visualization and manipulation of tomograms can use the ETDB as a distributed database of content without needing to host any tomograms themselves. Outlets (e.g. science educators) can stream tomogram videos directly from the IPFS network.
Ultimately, we believe the relationship between the ETDB and curated central repositories like the EMDB is complementary. We will continue to support the invaluable mission of the EMDB and EMPIAR in safeguarding scientific data by submitting representative curated datasets we use in our publications. We hope that the ETDB can in turn help facilitate broader releases of large batches of electron tomography data for community use. If successful, the ETDB could even be integrated into centralized repositories by their hosting an IPFS node, enhancing accessibility of the data. The flexible features of this blockchain-based, distributed scheme of data sharing may also make it useful for other types of scientific data.

ETDB-Caltech distribution
The ETDB-Caltech database is fed by a MySQL database (version 14.14 distribution 5.7.21) hosted on an Ubuntu Server (Artful Aardvark kernel version 4.3.0-37). The MySQL database contains the metadata of entries from the Caltech Tomography Database [28] that have been designated for publication. Associated files are stored in a RAID6 ext4 file system. Each night, the internal server hosting the internal Caltech Tomography Database executes a script to find datasets newly edited or marked for publication and copy them to the external ETDB-Caltech server, updating the MySQL database.
The ETDB-Caltech server runs a full node of the FLO blockchain, a node of the IPFS and a MySQL database. Upon changes in the MySQL database, a custom-built GoLang script (go-etdb, available on GitHub: https://github.com/theJensenLab/go-etdb) makes the new files publicly accessible through the InterPlanetary File System (IPFS, version 0.4.15-dev) [36]. The IPFS daemon calculates a unique identifier for the dataset directory, called a hash, which is cryptographically dependent on the contents of the directory, and makes the directory available to other nodes of the IPFS. This hash is combined with the metadata of each dataset and formatted according to the Open Index Protocol (OIP, version 0.42) specification to create a JSON record (see Table 1). Each record generated this way is signed with a cryptographic key unique to the Jensen lab (the private key associated with public address FTSTq8xx8yWUKJA5E3bgXLzZqqG9V6dvnr) and published to the FLO blockchain by a daemon (OIPd) on the server, attaching the record to the "floData" field of one or more transactions. The cost to publish the full set of 11,293 tomograms (at then-current rates of exchange) was US$17.89.
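The notion of an identifier that depends cryptographically on the contents of a directory can be illustrated with a simplified sketch. It assumes a single SHA-256 digest over file paths and contents, which is not how the IPFS daemon computes its hashes; the `location` field name and the function names are likewise hypothetical, and the signing and publishing steps are omitted.

```python
import hashlib
import json
import os
import tempfile


def directory_fingerprint(root: str) -> str:
    """Digest every file path and file content under root in a fixed order."""
    digest = hashlib.sha256()
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(folder, name)
            relative = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(relative.encode("utf-8") + b"\0")
            with open(path, "rb") as handle:
                for block in iter(lambda: handle.read(65536), b""):
                    digest.update(block)
            digest.update(b"\0")
    return digest.hexdigest()


def build_record(fingerprint: str, metadata: dict) -> str:
    """Combine the content identifier with the dataset metadata as JSON."""
    record = dict(metadata)
    record["location"] = fingerprint  # hypothetical field name
    return json.dumps(record, sort_keys=True)


# Demonstration on a temporary directory standing in for a dataset.
with tempfile.TemporaryDirectory() as dataset:
    file_path = os.path.join(dataset, "tiltseries.mrc")
    with open(file_path, "wb") as handle:
        handle.write(b"raw tilt-series bytes")
    before = directory_fingerprint(dataset)
    unchanged = directory_fingerprint(dataset)
    with open(file_path, "ab") as handle:
        handle.write(b"!")  # alter the contents by a single byte
    after = directory_fingerprint(dataset)

record = build_record(before, {"species": "Example species"})
```

The same contents always yield the same identifier, while a one-byte change yields a different one, so a record that embeds the identifier also pins down exactly which files it describes.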
If other laboratories would like to publish datasets on ETDB, they can do so by following the steps described above. The only difference is that they will need to adapt the go-etdb script to (1) read the metadata fields from their internal database and (2) publish them with an address generated by the full node of the FLO blockchain on their server.
To search for ETDB-Caltech data, any user can use the cryptographic key given above to query the blockchain and retrieve matching ETDB records. This procedure is facilitated by an OIP daemon that scans and indexes the FLO Blockchain and exposes an Application Programming Interface (API) for public use. The API is accessible by a package (oip-js) deposited on the node package manager (npm). We also developed a command-line application for Unix-related environments (ETDB-downloads, manual available on GitHub: https://github.com/theJensenLab/etdb-downloads/blob/master/userManual.md) designed to allow users to download all or a subset of ETDB-Caltech datasets. Unlike the ETDB-Caltech website (see below), this application launches a temporary IPFS node and fetches the files from the IPFS network.
In the future, if users would like to search ETDB records from other laboratories, they can do so by the same method, simply adding the public FLO addresses of the institutions that provided the data. Users can also access the data by simply searching for all OIP records of type Research-Tomogram in the FLO blockchain using the OIP API. Of course, since the publishing step is permissionless, it is not guaranteed that all the records returned will be electron tomography datasets. There is no censorship or control of the publishing step, only verification that the data retrieved is from a trusted source.
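Because publishing is permissionless, a client has to decide which publishers it trusts. The filtering step can be sketched as below, operating on records that have already been retrieved. The record layout (`type`, `subtype`, `publisher`, `title` keys) and the second address are hypothetical, and no call to the actual OIP API is shown.

```python
# Public FLO address of the Jensen lab, as given in the text.
CALTECH_ADDRESS = "FTSTq8xx8yWUKJA5E3bgXLzZqqG9V6dvnr"


def select_tomograms(records, trusted_addresses):
    """Keep Research-Tomogram records published from a trusted address."""
    trusted = set(trusted_addresses)
    return [
        record for record in records
        if record.get("type") == "Research"
        and record.get("subtype") == "Tomogram"
        and record.get("publisher") in trusted
    ]


# Hypothetical search results: one trusted tomogram, one from an unknown
# publisher, and one record of a different type.
records = [
    {"type": "Research", "subtype": "Tomogram",
     "publisher": CALTECH_ADDRESS, "title": "dataset A"},
    {"type": "Research", "subtype": "Tomogram",
     "publisher": "F_UNKNOWN_ADDRESS", "title": "dataset B"},
    {"type": "Video", "subtype": "Basic",
     "publisher": CALTECH_ADDRESS, "title": "dataset C"},
]

selected = select_tomograms(records, [CALTECH_ADDRESS])
```

Searching records from additional laboratories then amounts to extending the list of trusted addresses passed to `select_tomograms`.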

ETDB-Caltech interface
The front-end was built using node.js (version 9.1), react (16.2.0), webpack (4.1.1), and Twitter Bootstrap. It uses the oip-js package (https://github.com/oipwg/oip-js) to connect to an OIPdaemon Representational State Transfer (REST) API, which scans the FLO blockchain for valid OIP records and indexes them into an internal database. Currently, oip-js queries OIPdaemon for a list of records with type "Research" and subtype "Tomogram" published by our lab (the private key associated with public address: FTSTq8xx8yWUKJA5E3bgXLzZqqG9V6dvnr). In the future, queries could also search for the cryptographic keys of different groups. Alternatively, records could be retrieved by a full-node search of the FLO blockchain (available on GitHub: https://github.com/floblockchain/flo) with OIPdaemon. Files are served for download from this interface directly from the IPFS node on the ETDB-Caltech server.
The interface was designed to be easily navigable by scientists and non-scientists, and is optimized for viewing on all common web-enabled devices. We expect that in the future, some users and other labs may wish to customize this web interface. They can either copy and modify our template (available on GitHub: https://github.com/theJensenLab/etdb-react) or develop their own. While the Caltech ETDB interface displays only entries from our lab, other users may wish to build front-ends to display data from all labs sharing data using Open Index Protocol or to display only a subset of interest, for instance only those datasets corresponding to a particular species. In that case, instead of serving the files directly from the ETDB-Caltech IPFS node, those websites would use the peer-to-peer feature of the IPFS to search for the files in multiple nodes.