Identification of potential antimicrobials against Salmonella typhimurium and Listeria monocytogenes using Quantitative Structure-Activity Relation modeling

The shelf-life of fresh carcasses and produce depends on the chemical and physical properties of the antimicrobials currently used for treatment. For many years the gold standard among these antimicrobials has been cetylpyridinium chloride (CPC), a quaternary ammonium compound (QAC). CPC is very effective at removing bacterial pathogens from the surface of chicken but has not been approved for other products because of a toxic residue left behind after treatment. There is also a rising trend in QAC-resistant bacteria. To find new compounds that can address both antimicrobial resistance and the toxic residue, we have developed two Quantitative Structure-Activity Relationship (QSAR) models, one for Salmonella typhimurium and one for Listeria monocytogenes. These models have been shown to be accurate and reliable through multiple internal and external validation techniques. In building these models we have also identified important descriptors and structures that may be key to producing a viable compound. With these models, development and testing of new compounds should be greatly simplified.


Introduction
Foodborne illness presents a considerable risk to both public and personal health in the United States. Every year, approximately one in six United States citizens will contract some form of foodborne illness [1]. The severity and duration of the disease depend on the causative agent. According to foodsafety.gov, the majority of foodborne disease reported in the United States is caused by pathogenic bacteria. Among the many pathogenic bacterial species spread by food products, two particularly noted for their burden on public health are Salmonella enterica serovar Typhimurium (commonly known as S. typhimurium) and L. monocytogenes.
S. typhimurium causes the gastroenteritic disease salmonellosis, which is characterized by a period of one to four days of abdominal pain, fever, and diarrhea. S. typhimurium is also capable of entering the bloodstream through the intestines and causing bacteremia. Salmonellosis org), and the European Bioinformatics Institute Database (www.ebi.ac.uk). The molecules used in these sets included QACs, as well as tertiary nitrogen-based compounds with antimicrobial actions similar to those of QACs. The selected compounds were effective bacteriostatic agents against S. typhimurium, L. monocytogenes, or both, and covered a wide range of Minimum Inhibitory Concentration (MIC) values. The Simplified Molecular Input Line Entry System (SMILES) notation and MIC value for each compound were reported in separate Excel files: one document for S. typhimurium and the other for L. monocytogenes. The collected data were sorted into either the training set (85% of compounds) or the test set (15% of compounds).

Calculating descriptors
The molecular descriptors for all data sets were calculated using the ochem.eu chemoinformatics database (https://ochem.eu). The "Calculate Descriptors" bar was accessed under the "Models" tab, and the Excel files were uploaded directly to the server. The uploaded molecules were then pre-processed; during this procedure, the salts associated with each compound structure were removed. After this stage, the molecular descriptors were selected (unless otherwise stated, the default settings for the descriptor types were not changed). The descriptor types used were: ALogPS, GSFragment, QNPR, ISIDA fragments (fragment length was set as 2 to 10), and E-state (all boxes checked apart from "Extended indices-experimental"). These types were selected from the entire set because they do not contain any 3D descriptors, which avoids the errors that can occur during 3D descriptor calculation. Any compounds still experiencing errors in calculation were deleted from the descriptor sets, and the remaining sets were saved as .csv files, resulting in 1356 descriptors for each compound.

Descriptor output modification and data normalization
The .csv files for S. typhimurium and L. monocytogenes were modified before QSAR use. Columns with no empirical data (e.g. Comments) were deleted from the files, and a new column containing the log of each compound's MIC (logMIC) was inserted into the documents. Finally, each compound was randomly assigned to either the training or test group for each pathogen: 85% of the molecules in each data set were used as the training group and the remaining 15% as the test group. In all, the S. typhimurium file contained 26 compounds in the training group and 6 compounds in the test group; the L. monocytogenes file contained 37 compounds in the training group and 7 compounds in the test group. The compounds were labeled by status in a new "split" column: "1" indicated assignment to the training group, and "2" indicated assignment to the test group. These modified .csv files were then uploaded to the normalizeTheData (v.1.0) data normalization tool [34] (http://teqip.jdvu.ac.in/QSAR_Tools/#ADInHouse), which produced a new version of the .csv files with adjusted molecular descriptor values. Last, the .csv files were reformatted as .txt files for compatibility with the QSAR program.
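The log transform and random 85/15 assignment described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used in the study: the function name `add_log_mic_and_split`, the record keys, the fixed random seed, and the choice of a base-10 logarithm are all assumptions, since the text does not state the log base or how the random assignment was made.

```python
import math
import random


def add_log_mic_and_split(compounds, train_fraction=0.85, seed=0):
    """Return copies of the records with 'logMIC' and 'split' added.

    compounds: list of dicts that each carry a positive 'MIC' value.
    split == 1 marks the training group, split == 2 the test group.
    The base-10 logarithm is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_train = round(train_fraction * len(compounds))

    # Shuffle the row indices and take the first n_train as the training group.
    order = list(range(len(compounds)))
    rng.shuffle(order)
    train_idx = set(order[:n_train])

    out = []
    for i, rec in enumerate(compounds):
        new = dict(rec)  # leave the caller's records untouched
        new["logMIC"] = math.log10(rec["MIC"])
        new["split"] = 1 if i in train_idx else 2
        out.append(new)
    return out
```

Because the group sizes come from rounding 85% of the row count, the exact sizes can differ by one compound from a manual assignment.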

QSAR modeling
After the file modifications, the .txt data files for S. typhimurium and L. monocytogenes were imported separately into the QSARINS program [35,36] (qsar.it). All compounds in the test set were manually deleted from the uploaded file. Following this, the training compounds were run through the software's internal filters. The internal filters were used to remove descriptors that had under 80% consistency throughout the data set and those that were 95% correlated. The internal filters removed the majority of the descriptors for all QSAR models. The next step was a setup procedure for the QSARINS equation in which the remaining molecular descriptors were selected as the variables and the log(MIC) was selected as the response. The in-program Genetic Algorithm was then applied to select the top models for each iteration of descriptors based on their Q² (average leave-one-out fit) values. The number of iterations (number of times the Genetic Algorithm is run) is equal to 1/5th the total number of compounds in the test set (as recommended by Eriksson et al. [37]); the number of combinations (of molecular descriptor variables) is equal to the number of molecular descriptors that remain after the preliminary internal filtering step. After the Genetic Algorithm is applied, QSARINS displays the top 5 models from each iteration. The algorithm was set to run for 20 iterations, with 500 generations processed for each iteration.
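A descriptor pre-filter of the kind described above could look like the following sketch. It is one possible reading of the filtering step, not the QSARINS implementation: here a descriptor is dropped when a single value accounts for more than 80% of the compounds, or when its absolute Pearson correlation with an already-kept descriptor exceeds 0.95. The function names and the dict-of-lists input format are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prefilter(descriptors, max_constant=0.80, max_corr=0.95):
    """Return the names of descriptors that survive both filters.

    descriptors: dict mapping descriptor name -> list of values,
    one value per training compound (assumed layout for this sketch).
    """
    kept = {}
    for name, values in descriptors.items():
        # Near-constant filter: share of the single most common value.
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) > max_constant:
            continue
        # Correlation filter: of each correlated pair, keep the earlier one.
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue
        kept[name] = values
    return list(kept)
```

Keeping the earlier member of a correlated pair is an arbitrary choice made here for simplicity; a real tool may instead keep the descriptor that correlates better with the response.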

Internal selection
These models were then sorted according to their respective R² and R² - Q² values. The default model parameters were the same for S. typhimurium and L. monocytogenes; however, because QSARINS produced models with different value ranges for each, the parameters were adjusted slightly to reduce the overall number of models considered. Our cutoff values were R² ≥ 0.75 and R² - Q² ≤ 0.10. These cutoffs are stricter than the less conservative values used by most other researchers [27,28,31,37]. Therefore, only the top models selected by internal statistics were used for further analysis.
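For readers who want to reproduce the internal statistics outside QSARINS, the sketch below computes R² and a leave-one-out Q² for a multiple linear regression model and applies the two cutoffs. It assumes the common definition Q² = 1 - PRESS / SS_tot and uses NumPy; the function names are hypothetical, and QSARINS may compute these quantities with different numerical details.

```python
import numpy as np


def r2_and_q2_loo(X, y):
    """Return (R2, leave-one-out Q2) for an MLR model with intercept.

    X: (n_compounds, n_descriptors) array, y: (n_compounds,) log(MIC).
    Q2 is taken as 1 - PRESS / SS_tot, an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # add intercept column
    ss_tot = np.sum((y - y.mean()) ** 2)

    # Fit on all compounds for R2.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - np.sum((y - A @ coef) ** 2) / ss_tot

    # Refit n times, each time predicting the single held-out compound.
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        c, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ c) ** 2
    q2 = 1.0 - press / ss_tot
    return r2, q2


def passes_internal_cutoffs(r2, q2, min_r2=0.75, max_gap=0.10):
    """Apply the cutoffs R2 >= 0.75 and R2 - Q2 <= 0.10."""
    return r2 >= min_r2 and (r2 - q2) <= max_gap
```

A small gap between R² and Q² indicates that the fit does not depend heavily on any single training compound, which is why the gap is used as a filter alongside R² itself.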

External validation
The saved model files for the selected top models were retrieved from the "models" bar and loaded into QSARINS. The number of rows was adjusted to reflect the number of compounds within the test set. The compounds and relevant molecular descriptor data were then loaded into the model. The model predicted the log(MIC) of the test compounds; these predicted log(MIC) values were compared against the experimental log(MIC) values by percent error analysis and by the R² (linear trend fit) of the predicted compounds.
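The two external validation statistics can be illustrated as follows. The text does not give the exact formulas, so this sketch assumes that percent error is the mean absolute percent error of the predicted log(MIC) relative to the experimental log(MIC), and that the linear trend fit R² is the squared Pearson correlation between the two. Both function names are hypothetical.

```python
def mean_percent_error(experimental, predicted):
    """Mean absolute percent error of predictions, in percent.

    Assumes no experimental value is zero (a log(MIC) of 0, i.e. MIC = 1,
    would need separate handling).
    """
    errors = [abs(p - e) / abs(e) for e, p in zip(experimental, predicted)]
    return 100.0 * sum(errors) / len(errors)


def trend_r2(experimental, predicted):
    """R2 of the least-squares trend line through (experimental, predicted),
    which equals the squared Pearson correlation of the two sequences."""
    n = len(experimental)
    me = sum(experimental) / n
    mp = sum(predicted) / n
    sxy = sum((e - me) * (p - mp) for e, p in zip(experimental, predicted))
    sxx = sum((e - me) ** 2 for e in experimental)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)
```

Note that a trend-line R² measures only how linear the relationship is; a model with a systematic offset can still score highly, which is one reason to report percent error alongside it.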

Predictions
Once the final top models were selected and combined as described above, each model was applied to the prediction set collected by the Bai lab [33]. This set of 835 compounds was collected from PubChem from the top results of a similarity search using CPC as the query. These compounds were then filtered by the applicability domain of each model; any compound outside the applicability domain was removed from the final results. The consensus models for each bacterium were also applied to this data set. Any compound within the applicability domain of at least 75% of the combined models was kept; all other compounds were removed from the final set.
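The text does not specify how the applicability domain was computed. A common choice for MLR-based QSAR models is the leverage approach, in which a query compound is considered inside the domain when its leverage does not exceed the warning value h* = 3(p + 1)/n, where p is the number of model descriptors and n the number of training compounds. The sketch below assumes that approach and uses NumPy; the function names and the 75% voting helper are illustrative only.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query compound with respect to the training set."""
    Xt = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Xq = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    xtx_inv = np.linalg.pinv(Xt.T @ Xt)
    # h_i = x_i^T (X^T X)^-1 x_i for every query row x_i
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)


def in_domain(X_train, X_query):
    """Boolean array: True where leverage <= h* = 3(p + 1)/n (assumed rule)."""
    n, p = np.asarray(X_train).shape
    h_star = 3.0 * (p + 1) / n
    return leverages(X_train, X_query) <= h_star


def keep_by_model_vote(domain_flags, min_fraction=0.75):
    """Keep compounds inside the domain of at least min_fraction of models.

    domain_flags: one boolean sequence per model, all of equal length.
    """
    fraction_inside = np.asarray(domain_flags, dtype=float).mean(axis=0)
    return fraction_inside >= min_fraction
```

For a consensus of several models, `in_domain` would be called once per model with that model's own descriptors, and the resulting flags passed to `keep_by_model_vote`.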

Consensus modeling
The consensus models were built using the average of all predicted values for the models that met the following criteria. First, the single worst model of each set was removed (77 for S. typhimurium and 84 for L. monocytogenes). Models were then included in the selected consensus if they met the following cutoffs: external validation R² ≥ 0.60 and percent error ≤ 10% for S. typhimurium, and external validation R² ≥ 0.90 and percent error ≤ 35% for L. monocytogenes. This technique was also used to combine the predictions from the final model identified for each bacterial species.
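The consensus step amounts to averaging per-compound predictions over the models that pass the external validation cutoffs. A minimal sketch follows; the record layout (`ext_r2`, `percent_error`, `predictions`) and the function name are hypothetical, and the example values in the usage note are invented purely for illustration.

```python
from statistics import mean


def consensus_prediction(models, min_r2, max_percent_error):
    """Average the predicted log(MIC) over models that pass both cutoffs.

    models: list of dicts with keys 'ext_r2', 'percent_error', and
    'predictions' (one predicted log(MIC) per compound, same order
    in every model). This layout is assumed for the sketch.
    """
    selected = [
        m for m in models
        if m["ext_r2"] >= min_r2 and m["percent_error"] <= max_percent_error
    ]
    if not selected:
        raise ValueError("no model meets the consensus criteria")

    n_compounds = len(selected[0]["predictions"])
    return [
        mean(m["predictions"][i] for m in selected)
        for i in range(n_compounds)
    ]
```

For example, calling `consensus_prediction(models, min_r2=0.60, max_percent_error=10)` would apply the S. typhimurium cutoffs, and `min_r2=0.90, max_percent_error=35` the L. monocytogenes cutoffs.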

Structural similarity comparison
To better understand the relationship between the top compounds identified by the QSAR models developed for this project and for our previous study of E. coli, we examined the similarity and substructures shared between the top 50 compounds from each predicted set using SIMCOMP2 [38][39][40]. To use this tool, all SMILES structures were converted into .mol files using molview [41]. The .mol files for each prediction set were then concatenated to produce a single .mol file with all 50 structures. Each set was then analyzed pairwise against the other two sets using SIMCOMP2 with the following parameters: global search, bond-based docking matched by KEGG atom, post-processing for all SCCSs matched by atom, and a cutoff value of 0.60. These data were further filtered to a cutoff score of 0.75 and limited to structures that matched structures across all data sets [38][39][40].

S. typhimurium models
Detection of potentially new food-safe antimicrobials starts with looking at the best technologies currently in use. We performed a literature search to find all available data on QACs and QAC-like compounds that have been tested against S. typhimurium. Specifically, we looked for studies reporting minimum inhibitory concentrations (MIC) of compounds against this bacterium. MIC measures the effectiveness of an antimicrobial as the lowest concentration that inhibits growth, so the most effective compounds have the lowest MIC. Although individual studies would provide more stable and accurate model-building sets, no single study had a wide enough range of MIC values or a wide enough range of different molecules. Therefore, we decided to use a group of 32 compounds that were collected from 8 different sources [42][43][44][45][46][47][48]. The compounds we chose had at least a single nitrogen with 3 or more constituent groups; beyond that, these compounds had variable numbers of carbons, benzene rings, and oxygens, and other distinct structural differences. Their MIC values ranged from 3 μg/ml to 1071.1 μg/ml. Having a wide range of MIC values and a wide range of structural differences increases our confidence in our results. These structures were used to develop over 300 models using the QSARINS software. This software produces predictive models using a multiple linear regression (MLR) based approach paired with a genetic algorithm for descriptor selection. In each iteration, the algorithm builds models whose number of descriptors is set by the iteration number. It then substitutes descriptors in each individual model according to the given mutation rate. The top 10 models, ranked by the average leave-one-out fit (Q²) of each model, are kept for the next round of mutations. Each iteration cycles through 500 generations. After each iteration, a list of the top 5 models of that iteration is kept for further analysis.
Twenty iterations were run for each model. The top models, as determined by internal selection (Table 1), were applied to external datasets to produce predicted MIC values. The best of these models were selected by filtering on their R² (linear trend fit) and R² - Q² values (difference between linear trend fit and average leave-one-out fit). Within these models, the most frequent descriptors shared between all of the top models included a cyclic SP2 hybridized 5 carbon ring attached to a methyl group and a minimum 6 carbon chain (Fig 1A). However, the frequency of use of these descriptors does not seem to be correlated with the magnitudes of their coefficients. Our data show that the descriptors with the highest coefficient values are cyclic carbons attached to a nitrogen atom and a carbon chain containing oxygen attached to a nitrogen. These descriptors appear in only ~25% and ~10%, respectively, of the top models. As such, no single descriptor can be identified as most important in predicting MIC values of potential QACs.
Using external validations and our own internal filters, a final list of 14 models was collected for S. typhimurium (Table 1). A number of these models were then used to develop three separate consensus models (Table 1). These consensus models were developed in the hope of creating a more accurate model. After reviewing the data, we determined that model 76 was the best model, having the highest R² (by 0.04) and the lowest percent error (by 0.21%) on the external validation set. Its internal statistics were not the best of the set; however, they were comparable to those of the other models. The regression of this model can be seen in Fig 2. Of all the candidate models, model 76 stood out as the most accurate and consistent. It is interesting to note that the R² values of these models diminished across the validation sets. This could be due to the selected test compounds lying at the high end of the applicability domain, unlike the L. monocytogenes set. Having determined the optimal model, we began producing predictions using the QSARINS program.

S. typhimurium predictions
Predictions from the QSAR model that we selected were run on the dataset previously curated by Rath et al. based on a substructure search of CPC [33]. This list contains no samples that have a previously predicted or experimentally validated MIC, unlike the test set used for external validations. Out of the 834 compounds, we identified the top 10 compounds that fit within the applicability domain of model 76 (Table 2). These compounds are similar in that they all contain one or more ring structures and some form of nitrogen. It is interesting to note that not every molecule contains a positively charged nitrogen.

L. monocytogenes models
As with S. typhimurium, no single study with enough antimicrobial information was found for L. monocytogenes. The final list of compounds was taken from 12 sources and contained 49 unique compounds [49][50][51][52][53][54][55][56][57]. These compounds were similar in structural depth to the previous set of compounds (above) but ranged in MIC from 0.0005 μg/ml to 50 μg/ml. This resulted in an even larger list of potential models than the list for S. typhimurium. Among the top 11 models, four descriptors were involved in > 90% of the models. These include a constituent group that starts with a single carbon, a 6 member carbon ring, a 6 carbon chain, and two carbons leading to a nitrogen (Fig 1B). The two descriptors that showed the greatest coefficients were the two descriptors containing sulfur. These two descriptors were involved in only ~10% of the top models. From these data we cannot identify a single descriptor that is most important for MIC prediction across the top models. Using the same internal filters and an external validation set, the final list of 11 models was curated (Table 1). These models were used to create three additional consensus models to be considered for the final prediction. These models have a much higher margin of error than the S. typhimurium models; this could be due to a greater variance in the structures of the compounds used in the training and test sets. Model 90 was selected as the optimal model of this set. It was the top single model for external validation R² and had the second lowest percent error, by only 0.30% (Fig 3). Although the selected consensus model had reduced error and a very close R², we chose a single model for ease and efficiency in prediction.

L. monocytogenes predictions
We ran the predictions of the same 835 compounds through the selected model in order to determine the best possible compounds from this set. The top ten compounds that lay within the applicability domain of model 90 are shown in Table 3. These compounds have many structural similarities to those detected by the S. typhimurium based model.

Combining all predictions
Having three separate QSAR models (the two detailed here and one from a previous study for E. coli [33]), one for each target species, is effective in providing highly accurate and reliable predictions. Alone, however, these models cannot produce a direct consensus on a single set of potential compounds, and they cannot be directly combined without making incomplete assumptions about the effectiveness of known compounds. To gain a better understanding of the predicted compounds' effectiveness across a range of pathogenic bacteria, we combined the predictions through two separate approaches. The first approach uses the average predicted MIC across all models, and the latter uses similarity scoring to find general structural similarities among the top compounds from each model.
One approach to combining the predictions of the models is to follow our consensus approach. By averaging the log(MIC) for each compound across all three predictions, we are able to draw conclusions about the general effectiveness of the compounds. The top 5 compounds are reported in Table 4. These compounds have a structure similar to CPC, with an aromatic head and a long hydrophobic tail.
When trying to define possible compounds that can affect multiple species of pathogenic bacteria, it is important to look at the structural similarities between the top compounds from each predictive model. To find the regions of similarity and the degree of similarity between each set, we turned to a pairwise comparison through SIMCOMP2, an online tool provided by KEGG. To broaden the pool of similar compounds, we expanded our list to the top 50 compounds predicted for each species. Multiple similar structural components were identified within pairwise comparisons; however, only one major structure was identified for the compounds with at least 0.75 similarity between all three predictions (Fig 4). A search of the most similar compounds returned 1 L. monocytogenes compound, 3 S. typhimurium compounds, and 9 E. coli compounds (Table 5). The actual structures of these compounds can be found in S1 Table. These compounds provide a good launching point for experimental validation or rational design of new compounds.

Discussion
The development of an accurate and reliable computational tool for the development and identification of potential food-safe antimicrobials is paramount for increasing the efficiency of this process, especially for those under the QAC and QAC-like umbrella. Using currently available QAC antimicrobial data, we developed two models that focus on the effectiveness of these compounds against S. typhimurium and L. monocytogenes, respectively. These models were produced using an optimized descriptor set and the built-in genetic algorithm (GA) within the QSARINS software. The models we have identified in this study have shown both accuracy (percent error) and reliability (Q²) against the test and training sets. We are confident that these are the optimal models for the datasets that are currently available.
We have identified the most frequently used descriptors for each data set, potentially providing insight into the important structural components of effective QACs. From the data collected, long chains of carbon (at least 5 carbons), followed by cyclic rings and the inclusion of heteroatoms, were shown to be the most frequent descriptors for top models (Fig 1D). Consequently, we can assume that the carbon structure of these compounds is very important to their antimicrobial effects. Given the mechanism that QACs rely on, this relationship not only makes sense but also adds to our confidence in the accuracy of our models.
These models and the model produced for a previous publication have uncovered a number of top-rated compounds that have potential as food-safe antimicrobials [33]. No compounds were shared among the top ten compounds for each organism; however, many structural similarities were observed (Tables 2 and 3). Each top ten set contained at least one straight chain compound with a hexane ring in the head region, which is similar to most commonly used QACs. These compounds may be a good immediate replacement for CPC in meat processing but may not provide much difference in terms of solubility, residue, and overcoming potential antimicrobial resistance.
There are other compounds in these sets that do show potential for overcoming these issues. Within these sets there is a group of compounds that have a larger "head", usually composed of two fused rings with one to three nitrogens (not usually charged), and a large set of fused carbon rings as the "tail". This tail is slightly reminiscent of the structure of most cholesterols and other steroids. This structure gives us confidence that these compounds might be able to interact with bacterial membranes in a manner similar to their straight-chained cousins. If this is the case, it is possible that these compounds could overcome QAC bacterial resistance and could reduce the issues of residue. Unfortunately, these compounds would be much harder to dissolve in water, as they have very large hydrophobic regions. Regardless, these compounds have potential to replace the current QACs being used for food-safe decontamination, as they have a chance to overcome bacterial resistance.
Through the use of both a consensus between the different bacterial models and the structural similarity detected between the top 50 predicted compounds from each model, we were able to determine the most important structural elements to help narrow the list of potential compounds. All compounds have a structure that is nearly identical to the structure of CPC, which gives us a high degree of confidence in our ability to detect top compounds using either method. Unfortunately, these compounds do not give us any structures that will be greatly effective in reversing and overcoming antimicrobial resistance. We attempted to further limit this list by looking at the toxicity of each compound, the more toxic compounds being less likely to be good food-safe antimicrobials. None of the compounds in Table 4 or Table 5 had any available toxicity data, and all predictions of toxicity from open-source prediction tools were too varied to draw any adequate conclusions. Future experimental validation of our computationally identified compounds would provide a final list of candidate food antimicrobials. While our current predictions are based on a wide-ranging list, these models could be reapplied to a more targeted list of potential compounds. This would provide an extra filtering step for any study that wishes to experimentally test potential QAC/QAC-like compounds. Furthermore, our top potential compounds could also be tested in a similar manner.
Our study focuses on a more accurate model for a narrower range of structures. We believe that these structures will still be able to overcome some of these issues, even immediate resistance (as their binding affinity to efflux pumps may be reduced), although this approach would not be able to overcome long-term resistance. In terms of lipophilicity, some of our predicted compounds contain more heteroatoms, which leads to increased polarity that would affect their ability to be removed from fattier surfaces using water-based washes/rinses. The environmental impact of these compounds cannot currently be ascertained, however. The discovery and development of new QAC/QAC-like compounds is vital to the preservation of food and the management of pathogenic microbes on the surface of foods. Without more research into potential compounds, antibiotic resistance and other current problems will continue to be a detriment to the food industry and, in turn, the consumer. Any new compound that can overcome the issues of current food-safe antimicrobials will become a gold standard for all other antimicrobials. Although our current list is not perfect for replacement of CPC, our model can be applied to other QAC and QAC-like compounds to increase the viability of future studies and reduce the cost of bulk sampling of these lists.
Supporting information S1