Fast and Efficient XML Data Access for Next-Generation Mass Spectrometry

Motivation: In mass spectrometry-based proteomics, XML formats such as mzML and mzXML provide an open and standardized way to store and exchange the raw data (spectra and chromatograms) of mass spectrometric experiments. These file formats are being used by a multitude of open-source and cross-platform tools which allow the proteomics community to access algorithms in a vendor-independent fashion and perform transparent and reproducible data analysis. Recent improvements in mass spectrometry instrumentation have increased the data size produced in a single LC-MS/MS measurement and put substantial strain on open-source tools, particularly those that are not equipped to deal with XML data files that reach dozens of gigabytes in size.
Results: Here we present a fast and versatile parsing library for mass spectrometric XML formats available in C++ and Python, based on the mature OpenMS software framework. Our library implements an API for obtaining spectra and chromatograms under memory constraints using random access or sequential access functions, allowing users to process datasets that are much larger than system memory. For fast access to the raw data structures, small XML files can also be completely loaded into memory. In addition, we have improved the parsing speed of the core mzML module by over 4-fold (compared to OpenMS 1.11), making our library suitable for a wide variety of algorithms that need fast access to dozens of gigabytes of raw mass spectrometric data.
Availability: Our C++ and Python implementations are available for the Linux, Mac, and Windows operating systems. All proposed modifications to the OpenMS code have been merged into the OpenMS mainline codebase and are available to the community at https://github.com/OpenMS/OpenMS.

Listing 3 TIC calculation using parallel indexed access. The following code describes our C++ implementation in OpenMS, which calculates the TIC with the "random access" algorithm on an indexed mzML file and makes use of parallelization.
String in = "input.mzML";
Listing 7 TIC calculation using Python in-memory access. The following code describes our implementation in OpenMS, which calculates the TIC using the Python "in memory" algorithm.
Listing 10 TIC calculation using the Python interface. The following Python code calculates the TIC using the "event-based" streaming algorithm.
class TICCalculator:
In Figure 4 we compare the different APIs provided in OpenMS through pyOpenMS [1] in terms of performance. As expected, processing in Python is slightly slower than in C++; however, the new pyOpenMS execution times are also substantially improved over the OpenMS 1.11 kernel. Only the cached implementation in Python offered a substantial speed gain over C++, but the improvement was not as large as that observed for a pure C++ implementation.
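The idea behind the "event-based" streaming algorithm named in Listing 10 can be illustrated with a minimal, standard-library-only sketch. It is not the pyOpenMS code of that listing: the simplified XML layout (plain-text intensities inside an <intensity> element), the method name consume_spectrum and the helper stream_spectra are assumptions made for illustration only. The point it shows is that a consumer object receives one spectrum at a time, so the TIC can be accumulated without holding the whole file in memory.

```python
import io
import xml.etree.ElementTree as ET


class TICCalculator:
    """Hypothetical consumer that accumulates the TIC one spectrum at a time."""

    def __init__(self):
        self.tic = 0.0
        self.n_spectra = 0

    def consume_spectrum(self, intensities):
        # Called once per spectrum; only a running sum is kept.
        self.tic += sum(intensities)
        self.n_spectra += 1


def stream_spectra(source, consumer):
    """Parse the XML incrementally and hand each finished spectrum to the consumer."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "spectrum":
            # Assumed simplified layout: whitespace-separated plain-text intensities.
            text = elem.findtext("intensity", "")
            consumer.consume_spectrum([float(x) for x in text.split()])
            elem.clear()  # drop the parsed children so memory use stays small


if __name__ == "__main__":
    doc = (b"<run><spectrum><intensity>1 2 3</intensity></spectrum>"
           b"<spectrum><intensity>4.5</intensity></spectrum></run>")
    calc = TICCalculator()
    stream_spectra(io.BytesIO(doc), calc)
    print(calc.n_spectra, calc.tic)
```

In a real mzML file the intensities are base64-encoded binary arrays rather than plain text, so a decoding step would sit where the float conversion is done here.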

B. Considerations regarding random access reads in large files
Some algorithms need random access reads into large raw data files that cannot be easily bundled or ordered by retention time. This requirement rules out both the in-memory implementation (due to system memory restrictions) and the event-driven implementation (which only provides sequential access). In these cases, using the indexed data access API, which relies on the mzML idx standard, is the most straightforward way to implement such an algorithm. The mzML idx standard stores binary offsets to the individual data tags inside the mzML file, which allows a file seek to jump to the desired location and read the next XML tag (either a <spectrum> or a <chromatogram> tag).
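The offset-based lookup described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the OpenMS implementation: the helper names build_offset_index and read_spectrum_at are hypothetical, and the index is built here by scanning a toy document, whereas an indexed mzML file ships its offsets inside the file itself.

```python
import io
import re


def build_offset_index(data):
    """Map each spectrum id to the byte offset of its opening <spectrum tag.

    Stand-in for the index stored in an indexed file (assumption for this sketch).
    """
    index = {}
    for match in re.finditer(rb'<spectrum id="([^"]+)"', data):
        index[match.group(1).decode()] = match.start()
    return index


def read_spectrum_at(handle, offset, chunk_size=4096):
    """Seek to a stored offset and read exactly one <spectrum> element."""
    handle.seek(offset)
    end_tag = b"</spectrum>"
    buffer = bytearray()
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            raise ValueError("no closing </spectrum> tag after offset %d" % offset)
        buffer.extend(chunk)
        end = buffer.find(end_tag)
        if end != -1:
            return bytes(buffer[: end + len(end_tag)])


if __name__ == "__main__":
    doc = (b'<run><spectrum id="s1"><v>1</v></spectrum>'
           b'<spectrum id="s2"><v>2</v></spectrum></run>')
    index = build_offset_index(doc)
    # Jump straight to the second spectrum without parsing the first one.
    print(read_spectrum_at(io.BytesIO(doc), index["s2"]))
```

The seek only works on an uncompressed byte stream, which is the first of the two disadvantages discussed next.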
However, using the mzML idx standard has at least two main disadvantages. (i) The file needs to be in decompressed form while reading, since the indices relate to locations in the decompressed file and stream-based compression algorithms such as gzip do not allow random access. (ii) During each read, the raw data has to be converted from a base64 string into a floating point representation in memory, which is generally the most time-consuming step while reading. If many random access operations need to be performed, these two disadvantages necessitate an initial decompression of the file and then only allow relatively slow access to each spectrum.
Therefore, we implemented the "cached" file format, which allows fast caching of the raw data while retaining the meta-data structure of mzML. The file format consists of two linked files: a cachedMzML file which only contains the raw mass spectrometric data, and an associated mzML file which does not contain any raw data (only the meta-data is retained in the XML data structure). Because raw data and meta-data are clearly separated, reading the meta-data into memory and performing search operations on it (for example, collecting all spectra within a certain retention time window or all spectra with precursor masses in a certain range) is extremely fast: the data structures are very small (generally a few MB) and no raw data needs to be loaded into memory for this operation. Once a suitable set of spectra (or a single spectrum) is found, its associated raw data can be loaded from the cachedMzML file on disk for further processing. Loading the raw data of specific spectra from disk can be extremely fast, as shown in Figures 1 and 2 in the main text, which indicate that loading the cached raw structures can be more than 10 times faster than any other access mechanism.
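The two ideas in this passage, the base64 decoding cost and the separation of raw data from meta-data, can be illustrated with a small sketch. The on-disk layout used here (a 64-bit count followed by little-endian doubles) and all function names are assumptions chosen for illustration; the source does not specify the actual cachedMzML layout, and this is not it.

```python
import base64
import io
import struct


def decode_base64_doubles(text):
    """The per-read conversion needed for mzML: base64 text to 64-bit floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


def write_cache(handle, spectra):
    """Write each spectrum's intensities as raw doubles and return their offsets.

    Hypothetical layout: an unsigned 64-bit count, then that many doubles.
    """
    offsets = []
    for intensities in spectra:
        offsets.append(handle.tell())
        handle.write(struct.pack("<Q", len(intensities)))
        handle.write(struct.pack("<%dd" % len(intensities), *intensities))
    return offsets


def read_cached(handle, offset):
    """Load one spectrum's raw data directly, with no text decoding step."""
    handle.seek(offset)
    (count,) = struct.unpack("<Q", handle.read(8))
    return list(struct.unpack("<%dd" % count, handle.read(8 * count)))


def tic_in_rt_window(handle, metadata, rt_min, rt_max):
    """Search the small meta-data first, then load raw data only for the hits."""
    selected = [m for m in metadata if rt_min <= m["rt"] <= rt_max]
    return [(m["rt"], sum(read_cached(handle, m["offset"]))) for m in selected]


if __name__ == "__main__":
    cache = io.BytesIO()
    offsets = write_cache(cache, [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]])
    metadata = [{"rt": rt, "offset": off}
                for rt, off in zip([10.0, 20.0, 30.0], offsets)]
    print(tic_in_rt_window(cache, metadata, 15.0, 35.0))
```

The meta-data list plays the role of the raw-data-free mzML file: the retention time filter touches only that small structure, and the binary file is read only for the spectra that pass it.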
As we describe in the main text, we were able to process the raw data of all spectra of a 60 GB mzML file and compute the TIC on this data in less than 20 seconds using the cached access algorithms.
All tests were performed on the same RHEL system that was used in the main text. The "dev" versions indicate that we used OpenMS with the enhancements described in the main text. Note that "OpenMS" refers to the C++ implementation (shown for comparison), while pyOpenMS refers to the Python implementation.
To assess the performance of our implementation, we compared it to the XML parsing implementation available in the ProteoWizard software, another major open-source data access library [2,3]. We used the ProteoWizard library revision 7261 to build a custom program that calculates the TIC and compared it to the performance measured using the OpenMS implementations. The results of the measurement are shown in the main text, in Figure 1 and in Table I.
Our results indicate that the single-threaded executions of ProteoWizard and OpenMS are on par in terms of processing speed (except for the "cached" implementation, which is an order of magnitude faster). However, when using multiple threads, OpenMS is 30% ("In Memory") or 60% ("event-driven") faster, and faster by a factor of 4 ("indexed") or 50 ("cached").
We also compared our implementation to pymzML [4], which only provides a feature-complete mzML reader. Interestingly, when run on the same machine as the other comparisons, we found pymzML to outperform the Python and C++ OpenMS 1.11 implementations [...]
[...] for the cached implementation. The C++ value in this graph is equivalent to the single-threaded value for the "In Memory" algorithm in Figure 1 of the main text.