COMBSecretomics: A pragmatic methodological framework for higher-order drug combination analysis using secretomics.

Multi-drug treatments are increasingly used in the clinic to combat complex and co-occurring diseases. However, most drug combination discovery efforts today focus on anticancer therapy and rarely examine the potential of using more than two drugs simultaneously. Moreover, there is currently no reported methodology for performing second- and higher-order drug combination analysis of secretomic patterns, meaning protein concentration profiles released by the cells. Here, we introduce COMBSecretomics (https://github.com/EffieChantzi/COMBSecretomics.git), the first pragmatic methodological framework designed to search exhaustively for second- and higher-order mixtures of candidate treatments that can modify, or even reverse, malfunctioning secretomic patterns of human cells. This framework comes with two novel model-free combination analysis methods: a tailor-made generalization of the highest single agent principle and a data mining approach based on top-down hierarchical clustering. Quality control procedures to eliminate outliers and non-parametric statistics to quantify uncertainty in the results obtained are also included. COMBSecretomics is based on a standardized reproducible format and could be employed with any experimental platform that provides the required protein release data. Its practical use and functionality are demonstrated by means of a proof-of-principle pharmacological study related to cartilage degradation. COMBSecretomics is the first methodological framework reported to enable secretome-related second- and higher-order drug combination analysis. It could be used in drug discovery and development projects, clinical practice, as well as basic biological understanding of the largely unexplored changes in cell-cell communication that occur due to disease and/or associated pharmacological treatment conditions.


Quality control explained
Quality control (QC) procedures pre-process the raw measurement values collected per experimental plate/batch, in order to eliminate noise and exclude outliers that may trigger misinterpretations. The sequential QC steps currently employed by COMBSecretomics are described below in separate subsections.

Imputation
Missing data (as obtained from the previous step) are imputed in the horizontal direction (i.e., from the closest row in the raw N × d data matrix) using the Euclidean distance as the metric. For this task, the function knnimpute, as implemented in MATLAB R2019b, is used with its default settings. In our case study, no imputations were needed, as shown in Fig. S2.
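The row-wise nearest-neighbour idea can be illustrated with a minimal Python sketch. It is not a re-implementation of MATLAB's knnimpute: the function name `impute_nearest_row`, the choice to compute distances only over columns observed in both rows, and the example values are all assumptions made for illustration.

```python
import numpy as np

def impute_nearest_row(X):
    """Fill each NaN with the value from the closest row that has that entry.

    Distance is Euclidean, computed over the columns observed in both rows
    (an assumption of this sketch; knnimpute may differ in detail).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        best_dist, best_val = np.inf, np.nan
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue  # a donor row must have the missing entry observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue  # no common columns to compare on
            dist = np.sqrt(np.sum((X[i, shared] - X[r, shared]) ** 2))
            if dist < best_dist:
                best_dist, best_val = dist, X[r, j]
        filled[i, j] = best_val
    return filled

# Made-up 4 x 3 matrix (wells x proteins) with one missing value
raw = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.1],
    [5.0, 6.0, 7.0],
    [5.2, 6.1, 7.3],
])
imputed = impute_nearest_row(raw)
```

In this toy matrix the second well is closest to the first one, so its missing entry is filled with the first well's value for that protein.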

Coefficient of variation
A last QC step is employed to ensure that there is limited technical variability between intra-plate replicate measurement values for the protein releases (i.e., columns in the N × d data matrix). More specifically, for a particular protein k, the coefficient of variation across all intra-plate replicate wells of each cell state is calculated. Since there are up to 6 different cell states, each protein k has up to 6 different coefficients of variation, one for each state. Protein k is kept for further analysis only if the median across all these coefficients of variation is below a user-defined threshold. For this particular case study, this threshold was set to 25% (Fig. S3).

Figure S3: Intra-plate coefficient of variation. Median coefficient of variation for the 9 proteins that remained after all previous quality control steps. Given the cut-off threshold of 25%, S100A6 was also excluded from further analyses.
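The coefficient-of-variation filter can be sketched in Python as follows. The function names, the state labels "A" and "B", the use of the sample standard deviation (ddof=1) and the example values are assumptions of this sketch, not details taken from the COMBSecretomics implementation.

```python
import numpy as np

def median_cv_per_protein(X, states):
    """Median (across cell states) coefficient of variation in %, per protein."""
    X = np.asarray(X, dtype=float)
    states = np.asarray(states)
    cvs = []
    for s in np.unique(states):
        wells = X[states == s]
        if wells.shape[0] < 2:
            continue  # CV is undefined without replicate wells
        # Sample standard deviation (ddof=1) is an assumption of this sketch
        cvs.append(100.0 * wells.std(axis=0, ddof=1) / wells.mean(axis=0))
    return np.median(np.vstack(cvs), axis=0)

def keep_low_cv_proteins(X, states, threshold_pct=25.0):
    """Boolean mask of proteins whose median CV is below the threshold."""
    return median_cv_per_protein(X, states) < threshold_pct

# Made-up data: 2 cell states x 3 replicate wells, 2 proteins
X = np.array([
    [10.0, 10.0],
    [11.0, 20.0],
    [9.0, 30.0],
    [20.0, 5.0],
    [22.0, 10.0],
    [18.0, 15.0],
])
states = ["A", "A", "A", "B", "B", "B"]  # hypothetical state labels
median_cv = median_cv_per_protein(X, states)
mask = keep_low_cv_proteins(X, states, threshold_pct=25.0)
```

With these values the first protein has a median CV of 10% and is kept, while the second has a median CV of 50% and is dropped.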

Figure S5: Normalized release differences between D, untreated (UT), unstimulated (US) cells and H, untreated (UT), unstimulated (US) cells.
Small differences in protein releases are observed for all but two proteins. In particular, the release levels of ICAM1 and TNFSF12 seem to increase by ≈ 13% and drop by ≈ 14%, respectively, in D compared to H cells.

Figure S6: Normalized release differences between D, treated (T), unstimulated (US) cells and D, untreated (UT), unstimulated (US) cells.
A particular row shows how much a particular treatment Tx has affected the protein releases of D cells. For example, T12 seems to have resulted in ≈ 12% less release of ZG16 and ≈ 30% more release of GROA, while the remaining 6 proteins seem to be relatively unaffected. Each row corresponds to a particular stimulation (S1, S2, S3). Compared to the corresponding normalized protein release differences for the unstimulated cells (Fig. S5), stimulating with S1 and S3 is associated with noticeable changes for several proteins.

Example raw data file
This section explains how to provide a valid data file (i.e., a file with raw protein release measurements). The raw data file must be in CSV format and use the barcode of the microtiter plate as its filename (<barcode.csv>). Its content should be structured as described below. The first column and the first row should contain the annotations of the collected dataset, namely the cell state of the experimental wells and the measured protein releases. The remaining rows and columns correspond to the numeric part of the dataset, which is stored in the form of an N × d matrix. N and d denote the total number of wells and the number of measured proteins per experimental well, respectively. In the example raw data file, N = 9 and d = 3. More precisely, each row corresponds to a particular experimental well and contains the raw release measurement values for proteins PEDF, CXCL11 and IL13. The cell state of a particular experimental well is described in column "Sample" of the corresponding row.
COMBSecretomics requires the following naming convention for the cell states included in column "Sample". Blank wells should be declared as "BLANK" or "blank". For wells containing cells, a three-field naming convention is required, with the fields <Cells>, <Treatment> and <Stimulation>. The field <Cells> should contain an identifier related to the type of cells in a particular well. For example, here we have used D for disease-related cells, but the user is free to choose this identifier. Healthy cells must be denoted H. The field <Treatment> must be set either to "UT" for untreated cells or to "TX" for cells treated with TX, where X must consist of the numeric identifier for a particular treatment. For instance, T1 corresponds to the treatment consisting of compound 1, T12 corresponds to the combination treatment consisting of compounds 1 and 2, while T123 corresponds to the combination treatment consisting of compounds 1, 2 and 3. Therefore, the user has to annotate the treatments using numbers instead of names. Finally, the field <Stimulation> must be set either to "US" for unstimulated cells or to "SY" for stimulated cells, where Y is the numeric identifier of the corresponding stimulation. For example, S1 corresponds to stimulation 1, S2 corresponds to stimulation 2, etc. Similarly, the user has to annotate the stimulations using numbers instead of descriptive names (see also https://github.com/EffieChantzi/COMBSecretomics.git).
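As an illustration of the naming convention, the following Python sketch validates and decodes the individual fields. The helper names are hypothetical, and the sketch assumes that each digit in the treatment field denotes one compound, as the examples T12 and T123 suggest. It takes the fields separately and makes no assumption about how they are delimited within the "Sample" column.

```python
import re

def is_blank(sample):
    """True for wells declared as blank."""
    return sample in ("BLANK", "blank")

def parse_treatment(field):
    """'UT' -> (); 'T123' -> (1, 2, 3)."""
    if field == "UT":
        return ()
    match = re.fullmatch(r"T([0-9]+)", field)
    if match is None:
        raise ValueError(f"invalid treatment field: {field!r}")
    # Assumption: each digit is one compound, as in T12 = compounds 1 and 2
    return tuple(int(ch) for ch in match.group(1))

def parse_stimulation(field):
    """'US' -> None; 'S2' -> 2."""
    if field == "US":
        return None
    match = re.fullmatch(r"S([0-9]+)", field)
    if match is None:
        raise ValueError(f"invalid stimulation field: {field!r}")
    return int(match.group(1))

compounds = parse_treatment("T123")   # (1, 2, 3)
untreated = parse_treatment("UT")     # ()
stimulation = parse_stimulation("S1") # 1
```

A descriptive name such as "Tdrug" raises a ValueError here, mirroring the requirement that treatments and stimulations are annotated with numbers.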

User-defined inputs
In this section, recommendations on how to select the user-defined inputs to COMBSecretomics are provided (see also https://github.com/EffieChantzi/COMBSecretomics.git):
(1) Raw data file, selected interactively by the user. Details on how to provide a valid raw data file with the collected protein release data and the corresponding annotations are given in the previous section "Example raw data file".
(2) Cut-off threshold (%) for the blank filtering, as part of the quality control (see section "Blank filtering", above). By setting a low value, one ensures that proteins with low levels of noise are kept for further analysis.
(3) Cut-off threshold (%) for the coefficient of variation for the measured protein releases, as part of the quality control (see section "Coefficient of variation", above). By setting a low value, one ensures that proteins with low levels of technical variability are kept for further analysis.
(4) Number of resampling-based validation datasets to be created (see section "Resampling statistics", main article text). The higher this number, the more datasets are created for validation, which is advisable especially if several intra-plate replicate measurements are included in the experimental design. Here, the trade-off between quality and quantity should be taken into account. It may be that only a few of the obtained results appear stable, but those few are likely to be reproducible. Despite being experimentally expensive, multiple replicate measurements are highly recommended for this type of validation approach.
(5) Option for employing exhaustive subset search when visualizing the combination analysis results from the hierarchical clustering (see section "Top-down hierarchical clustering", main article text). We highly recommend this option, especially for large exhaustive experimental setups, as it provides a very helpful way of disentangling higher- from lower- and single-order treatment effects and of summarizing the prototypical chemically induced protein release patterns.
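To make the notion of lower- and single-order treatments in input (5) concrete, the Python sketch below enumerates every lower-order sub-treatment contained in a given combination. It only illustrates the enumeration; the subset search itself is described in the main article, and the function names here are hypothetical.

```python
from itertools import combinations

def lower_order_subsets(compounds):
    """All non-empty proper subsets of a combination, ordered by size."""
    compounds = tuple(sorted(compounds))
    subsets = []
    for order in range(1, len(compounds)):
        subsets.extend(combinations(compounds, order))
    return subsets

def label(subset):
    """Format a tuple of compound numbers as a treatment label, e.g. T12."""
    return "T" + "".join(str(c) for c in subset)

# Sub-treatments of the third-order combination T123
labels = [label(s) for s in lower_order_subsets((1, 2, 3))]
# labels == ['T1', 'T2', 'T3', 'T12', 'T13', 'T23']
```

For a combination of n compounds there are 2^n - 2 such sub-treatments, which is why an exhaustive experimental design grows quickly with the number of candidate compounds.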