GlycCompSoft: Software for Automated Comparison of Low Molecular Weight Heparins Using Top-Down LC/MS Data

Low molecular weight heparins are complex polycomponent drugs that have recently become amenable to top-down analysis using liquid chromatography-mass spectrometry. Even with the open source deconvolution software DeconTools and the automatic structural assignment software GlycReSoft, the comparison of two or more low molecular weight heparins is extremely time-consuming, taking an expert analyst about a week, and provides no guarantee of accuracy. Efficient data processing tools are required to improve analysis. This study uses Visual Basic for Applications, the programming language of Microsoft Excel™, to extend Excel's standard functionality with macro functions and specific mathematical modules for mass spectrometric data processing. The resulting program, GlycCompSoft, enables the comparison of top-down analytical glycomics data on two or more low molecular weight heparins and has a low error rate with good time efficiency in the automatic processing of large data sets. Experimental results based on three lots each of Lovenox®, Clexane® and three generic enoxaparin samples show that the run time of GlycCompSoft decreases from 11 to 2 seconds when the amount of data processed decreases from 18000 to 1500 rows.


Introduction
Heparin is a complex, polydisperse and structurally heterogeneous mixture of linear, anionic polysaccharides that is widely used as a clinical anticoagulant [1]. Discovered 100 years ago in 1916, heparin's use predated the establishment of the U.S. Food and Drug Administration (FDA) [1]. Low molecular weight (LMW) heparins, introduced in the 1990s for their improved pharmacodynamics and bioavailability, are derived from heparin by controlled depolymerization. LMW heparins, including innovator drugs and more recently generic versions, have undergone a great deal of regulatory scrutiny because of their polycomponent nature and their high level of structural complexity [2]. Studies of new approaches to determine the structure of LMW heparins have included bottom-up [3], top-down [4] and combined liquid chromatography (LC)-mass spectrometry (MS) analysis [5][6]. Despite these major advances, the rapid and accurate glycomic analysis of glycosaminoglycans by LC-MS has been plagued by the absence of suitable bioinformatics software.
In contrast to glycomic analysis, proteomic analysis using LC-MS is highly developed [7][8]. The availability of bioinformatics software has made routine proteomic analysis available to non-experts and has allowed expert laboratories to take on more difficult challenges, such as post-translational modifications [9]. Recently, there has been increased interest in developing similar bioinformatics software for glycomic analysis [10][11]. In the area of glycosaminoglycan analysis, GlycReSoft 1.0 software, developed at Boston University by Zaia and coworkers [12], relies on raw mass spectral data after auto-processed charge deconvolution using DeconTools software [13]. GlycReSoft has been applied to the top-down analysis of LMW heparins [4]. While this approach has been quite successful, manually processing the GlycReSoft output to compare the top-down analyses of three lots of LMW heparin takes about one week of a skilled analyst's time. Here we report the GlycCompSoft algorithm, which allows the comparison of three sets of top-down analyses from three different batches of LMW heparin in a few minutes.

Materials and Methods
Data preparation and pre-processing
Lovenox® and Clexane®, the innovator versions of enoxaparin marketed in the U.S. and Europe, were purchased from Sanofi-Aventis (Bridgewater, NJ). Generic versions of Lovenox® were provided by three different manufacturers (three current lots of each). Online hydrophilic interaction chromatography (HILIC) Fourier transform mass spectrometry (FTMS) was performed as previously described [4]. Briefly, enoxaparin injections were diluted to 1 μg/μL and directly injected onto a HILIC column (2.0 mm × 150 mm, 200 Å, Phenomenex, Torrance, CA) by an Agilent 1200 autosampler. The LC column was directly connected online to the standard ESI source of an LTQ-Orbitrap XL FT-MS (Thermo Fisher Scientific, San Jose, CA). The enoxaparin intact chain compositions were analyzed in negative ion mode. Following raw data acquisition, charge deconvolution was auto-processed using DeconTools [14][15] software. Enoxaparin structural assignment was performed by automatic processing with GlycReSoft 1.0 software, developed at Boston University (http://code.google.com/p/glycresoft/downloads/list) [4,12]. The output on enoxaparin composition from GlycReSoft was then processed using GlycCompSoft to provide automated relative quantification of the intact chains present in enoxaparin.
"ΔHexA" was set to "1" and the molecular formula of "ΔHexA" was changed from C6H6O5 to C6H4O4 to account for one additional H2O loss. Other parameters were kept the same. After matching the LC-MS raw data output by DeconTools [13] with the hypothesis, GlycReSoft outputs 15 features [12]. GlycCompSoft then automatically compares and screens the full-scale matching results output by GlycReSoft based on Compound Key, Total Volume, Score and other features.

GlycCompSoft algorithm
The workflow from LC-MS to relative quantification is shown in Fig 1. GlycCompSoft compares and screens the full-scale matching results output by GlycReSoft. Matching results for three batches or replicates output by GlycReSoft can be imported into Microsoft Excel in text format (.txt) and are then automatically compared, screened and computed by GlycCompSoft. First, a new workbook is created and the data from the three text files are imported into three sheets of the workbook. Each sheet is renamed with the text file name, and a new column holding the text file name is added to each sheet, so that each row of the output can be traced to its source during comparison and screening. Second, all rows with an empty value in the Compound Key feature [12] are deleted from each sheet, which separates true signals arising from glycosaminoglycan chains from noise and makes GlycCompSoft more efficient; all remaining rows in the three sheets are then copied to a new sheet named "Total Sheet". Third, the data in the Total Sheet are sorted, compared, screened, computed and merged. The overall flowchart of GlycCompSoft is presented in Fig 2. A full copy of the program and the user guide are presented in S1 File.
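The first two steps can be sketched outside Excel as follows. This is a minimal Python illustration, not the VBA program in S1 File. It assumes the GlycReSoft text output is tab-delimited with a header row containing a "Compound Key" column; the function names load_rows and build_total_sheet and the added "Source" column are hypothetical.

```python
import csv
import io


def load_rows(text, source_name, key_column="Compound Key"):
    """Parse one tab-delimited result table (assumed layout) and keep
    only the rows whose Compound Key cell is not empty."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    kept = []
    for row in reader:
        if (row.get(key_column) or "").strip():
            # Label the row with the file it came from (hypothetical column name).
            row["Source"] = source_name
            kept.append(row)
    return kept


def build_total_sheet(tables):
    """Pool the already-filtered rows of every input table into one list,
    playing the role of the "Total Sheet"."""
    total = []
    for source_name, text in tables.items():
        total.extend(load_rows(text, source_name))
    return total
```

Filtering happens inside load_rows, before pooling, so unmatched rows are never copied into the pooled table.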
In the third step, the value of the Compound Key feature in each row is first extracted. The entire or partial list of Compound Key values, enclosed in square brackets in each cell, is extracted by lazy matching in regex [16][17][18], which instructs the engine to match as few input characters as possible before proceeding to the next token in the regular expression pattern, improving the efficiency of value matching. To compare and check each row of data in the "Total Sheet" accurately and efficiently, all rows are sorted by the extracted Compound Key value. If three or more rows share the same extracted Compound Key value and these rows come from three different sheets (text files), as identified by the newly added column holding the sheet (text file) name, the rows are retained; otherwise they are deleted as noise. The flowchart of the deletion module is shown in Fig 3. Among the retained data, rows that share the same extracted Compound Key value are merged: the row with the largest value in the Score feature is retained, and the sum of the Total Volume values is calculated and adopted. The algorithm of the merging module is shown in Table 1.
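The extraction, deletion and merging logic described above can be illustrated with a short Python sketch. It is not the published VBA code. Rows are assumed to be dictionaries with "Compound Key", "Score", "Total Volume" and a hypothetical "Source" column naming the originating file. The text does not state whether merging runs within each source file or across all of them; the sketch assumes merging within each source file, so that per-sample volumes remain available for comparison.

```python
import re
from collections import defaultdict

# Lazy quantifier: stop at the first closing bracket instead of the last one.
KEY_PATTERN = re.compile(r"\[(.*?)\]")


def extract_key(cell):
    """Return the text inside the first pair of square brackets, or None."""
    match = KEY_PATTERN.search(cell)
    return match.group(1) if match else None


def compare_and_merge(rows, n_sources=3):
    """Keep keys found in at least n_sources files, then merge duplicates."""
    groups = defaultdict(list)
    for row in rows:
        key = extract_key(row["Compound Key"])
        if key is not None:
            groups[key].append(row)

    merged = []
    for key in sorted(groups):
        by_source = defaultdict(list)
        for row in groups[key]:
            by_source[row["Source"]].append(row)
        if len(by_source) < n_sources:
            continue  # missing from at least one file: deleted as noise
        for source in sorted(by_source):
            duplicates = by_source[source]
            # Assumption: merge within one source file only.
            best = max(duplicates, key=lambda r: float(r["Score"]))
            out = dict(best)
            out["Key"] = key
            out["Total Volume"] = sum(float(r["Total Volume"]) for r in duplicates)
            merged.append(out)
    return merged
```

With a greedy pattern such as `\[(.*)\]`, a cell holding two bracketed lists would be captured as one string spanning both; the lazy form returns only the first list.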
The intact chain composition of enoxaparin after LC-MS has been successfully analyzed using GlycReSoft [4]. In this analysis, a total HILIC-FTMS time of 70 min resulted in more than 15,000 components that were first summarized by DeconTools software to obtain acceptable resolution. After matching with the enoxaparin hypothesis generated by GlycReSoft 1.0, about 450 components on average were obtained for each sample. Since the intact chain composition is complicated and the NH3 adduction varies, false positive matching results were present in almost 50% of the components. GlycReSoft provides a set of features and summarizes them into a single score [12], which can be used to identify false positives. However, for LMW heparin data, manual confirmation shows that the scoring system usually also produces false negatives (see S2 Table, where the true positive result [8,8,1,12,0] has a score similar to that of the false positive result [8,8,1,14,1]). The best way to maximize the number of likely components is to retain all the components that are found in three replicate determinations or in three lots of the same drug, an approach we call the "all-presence-principle". The reliability of this "all-presence-principle" approach can be confirmed by manually checking the most abundant oligosaccharides. In the absence of an automatic approach, one can only manually keep the components present in the three replicates, a time-consuming process with a high error rate. The workflow for determining the relative quantification of enoxaparin intact chain composition from LC-MS is shown in Fig 1. The workflow and illustration of GlycCompSoft, which is designed around this process, are shown in Table 2.

Experimental design
A large number of experiments were run on the Windows 8 Professional operating system (Intel i7 2.1 GHz, 8 GB RAM) to verify the accuracy and efficiency of GlycCompSoft. Matching results for three lots each of Lovenox®, Clexane® and generic enoxaparins were used to verify the software. Rows with a non-empty Compound Key value make up a relatively small portion of the entire data set derived from the data files, and the proportion of true data retained after filtering is lower still. The detailed information is shown in Fig 4, where the red, blue and green bars represent the amounts of raw data, the data remaining after rows with an empty Compound Key are deleted, and the true data, respectively. Thus, achieving a low error rate with good time efficiency when comparing, screening and computing these massive data sets manually is challenging, even though the sorting can be done in Excel [20]. While ensuring a low error rate, GlycCompSoft can automatically perform this process within several hundred seconds. Notably, the runtime can be further reduced to about ten seconds by deleting the empty-value rows before copying to the Total Sheet (Table 3). It is therefore important to delete the rows with empty values in the Compound Key feature before copying data to the Total Sheet. The runtime of GlycCompSoft decreases as the number of data rows requiring processing decreases, and at 18000 rows the average runtime is under 11 seconds (Fig 5 and Table 3). Detailed runtimes for the five different samples at different numbers of rows are shown in Fig 5.

Table 2. The overall workflow and illustration of GlycCompSoft.
Step name Illustration

False discovery rate
GlycCompSoft is aimed at comparing complicated oligosaccharide mixtures. The false rate of the software calculation itself is technically 0%, while the false rate of oligosaccharide recognition is instead associated with the GlycReSoft software [12]. GlycReSoft has many parameters that can be optimized, such as the minimum matching abundance and error. Moreover, GlycReSoft provides a scoring system to avoid false discovery as much as possible. However, for very complicated oligosaccharide mixtures, such as enoxaparin, the scoring system is not 100% reliable. Based on our manual interpretation experience (data not shown), the score threshold suggested by GlycReSoft, 0.16 in the software application on HS samples, is too high for the low-abundance oligosaccharides in enoxaparin. Thus, in data analysis we first reduced the score threshold to 0.1, then used GlycCompSoft to compare the oligosaccharide matching results output by GlycReSoft using our "all-presence-principle" before applying the GlycReSoft score threshold filtration. The false discovery rate (FDR) of the total process is essentially the FDR after our "all-presence-principle" combined with the GlycReSoft scoring system. Calculating an accurate FDR from enoxaparin oligosaccharide composition analysis is challenging because of the compositional complications of chain length, substitution, and diffuse abundance ions. Therefore, the FDR was instead tested using the same LC-MS system and software processing by investigating four homogeneous chemoenzymatically synthesized heparin oligosaccharides. After filtering with GlycCompSoft, the number of results was reduced to 5 in each replicate. We then performed a manual confirmation, after which we found that only 3 of these were truly positive. Based on oligosaccharide (A), the false positive rate after GlycCompSoft processing was 40%, which is greatly decreased from an average false positive rate of 85% using only GlycReSoft matching.
Similarly, by using GlycCompSoft the false positive rates for oligosaccharides (B, C and D) decreased from 89% to 62.5%, 87% to 50%, and 89% to 40%, respectively. Detailed GlycCompSoft comparison results are presented in S1, S2, S3 and S4 Tables. Since all expected oligosaccharides as well as unexpected contaminants were observed, no false negatives occurred during the automatic software processing.
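For oligosaccharide (A), the 40% figure follows from the counts reported above if the false positive rate is read as the fraction of retained results that were not confirmed manually. This reading, and the symbols N_retained and N_confirmed, are our interpretation rather than a definition given in the text:

```latex
\[
\text{false positive rate}
  = \frac{N_{\text{retained}} - N_{\text{confirmed}}}{N_{\text{retained}}}
  = \frac{5 - 3}{5}
  = 40\%
\]
```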

GlycCompSoft applications on other complicated carbohydrate mixtures
The capability of extracting useful information from a large LC-MS data set from LMW heparins has been demonstrated in the current study. GlycCompSoft is a powerful tool when used together with DeconTools and GlycReSoft on heterogeneous glycan products, including both intact chain compositions (as demonstrated here) and potentially compositions of enzymatic or chemical cleavage components. Data can be processed in a short time as long as good LC-MS resolution is obtained. Different saccharide compositions (i.e., polysaccharides containing galactosamine, galacturonic acid, fucose, etc.), different glycosidic bonds (1→2, 3, or 4) or the presence of side chains or branches do not present difficulties for GlycReSoft, as matching relies only on accurate molecular weight data. Because extraction follows the "all-presence-principle", the GlycCompSoft output is highly accurate for high-abundance components. While automated data processing software, such as DeconTools, GlycReSoft, and GlycCompSoft, greatly reduces data processing times, curation by a skilled analyst is sometimes required. For example, where low-abundance components are present in only some of the batches, manual examination of the data may be required.

Conclusions
GlycCompSoft is an algorithm that automates the comparison of complex data sets generated in the top-down analysis of LMW heparins. An "all-presence-principle" makes GlycCompSoft a highly accurate method for the analysis of components present in high and moderate abundance. Automated data processing software, such as DeconTools, GlycReSoft, and GlycCompSoft, improves the speed and reliability of interpreting these complex data sets. Manual curation by a skilled analyst is still required in cases where low-abundance components are present in only some of the batches of LMW heparins.
GlycCompSoft utilizes macro functions and specific mathematical modules for comparing, screening and computing massive data sets. Work is ongoing to include additional algorithms and error calculation methods to optimize the performance of this software. Ultimately, machine learning methods should be applied that can accept training data as input and generate more intelligent output based on different samples and the relationships among their features. Future studies are also planned to examine the automated processing of LC-MS/MS data when reliable data sets become available.