Personalized pathology test for Cardio-vascular disease: Approximate Bayesian computation with discriminative summary statistics learning

Cardio/cerebrovascular diseases (CVD) have become one of the major health issue in our societies. But recent studies show that the present pathology tests to detect CVD are ineffectual as they do not consider different stages of platelet activation or the molecular dynamics involved in platelet interactions and are incapable to consider inter-individual variability. Here we propose a stochastic platelet deposition model and an inferential scheme to estimate the biologically meaningful model parameters using approximate Bayesian computation with a summary statistic that maximally discriminates between different types of patients. Inferred parameters from data collected on healthy volunteers and different patient types help us to identify specific biological parameters and hence biological reasoning behind the dysfunction for each type of patients. This work opens up an unprecedented opportunity of personalized pathology test for CVD detection and medical treatment.

Introduction Cardio/cerebrovascular diseases (CVD) were the first cause of mortality worldwide in 2015, causing 31% of deaths according to World Health Organization [1]. Pathology tests for detection of CVD rely on testing functionality of blood platelets, which play a key role in the occurrence of these cardio/cerebrovascular accidents in addition to complex process of blood coagulation, involving adhesion, aggregation on the vascular wall to stop a hemorrhage while avoiding the vessel occlusion. In a comprehensive bio-medical evaluation study [2], the correlation between the clinical biological measures using platelet function tests and the occurrence of a cardiovascular event was found to be null for half of the techniques and rather modest for others, indicating the evident need for a more efficient tool or method to monitor patient platelet functionalities. The inadequacy of these tests can be explained by the fact that no current test allows for the analysis of the different stages of platelet activation or the prediction of the in-vivo behavior of those platelets [3,4]. In addition, the current clinical tests do not take into account the dynamic aspect of the process of platelet aggregation and the role that red blood cells can have in this process. To address these issues, we extend here the stochastic model proposed in [5] that simulates numerically the deposition pattern of platelets observed in the Impact-R device [6], namely the sizes and number of aggregates as a function of time for a layer of whole blood subject to a controlled shear rate. This model is characterised by biophysically meaningful parameters the adhesion rate p Ad , the aggregation rates p Ag (of depositing on an already deposited platelet) and p T (of depositing on an existing cluster of platelets), the deposition rate of albumin p F , the attenuation factor a T , and the flux of activated (AP) and non-activated platelets (NAP) obtained from their characteristics velocities (v AP z and v NAP z ). The value of these parameters can be inferred by matching the simulation output with the corresponding in-vitro deposition pattern.
Our main claim here is that the values of some of these model parameters (eg. adhesion and aggregation rates) are precisely the information needed to assess various possible pathological conditions and to quantify their severity with reference to CVD.
To support this claim, we develop a methodology to identify medically interpretable parameters differentiating between patients and healthy volunteers. To infer the estimates of the biologically interpretable parameters of the stochastic platelet deposition model from the deposition patterns observed in the Impact-R device of platelet collected for a patient, we use approximate Bayesian computation (ABC) [7,8] and report the estimated mode of the approximate posterior learned by ABC as the estimates, which can also be interpreted as approximate maximum likelihood estimates (MLE) when non-informative uniform priors are considered on the parameters. We note that ABC inferential algorithms depend on the choice of the summary statistics extracted from the datasets [7], which can be inferred via metric learning [9,10] -a methodology that can provide the summary statistics able to maximally discriminate different patient groups. Leveraging on this crucial link between ABC and metric learning [11], we are able to identify medically meaningful parameters, which can distinguish between different types of patients and, at the same time, to estimate those parameters for each patient with the aim of developing a test for the pathology. We further notice that the proposed approach can be applied on each patient, in a systematic way. This reduces the bias of a human operator.
Finally, to verify our proposed methodology, we perform a four stage experiment: 1) Collect blood or platelet from 32 patients (16 patients needing dialysis and 16 patients with Chronic Obstructive Pulmonary Disease-COPD) and 16 healthy volunteers; 2) Study the deposition patterns observed in the Impact-R of platelet collected for each patient; 3) Learn the summary statistics from this dataset which is able to maximally distinguish between the 3 types of patients; 4) Estimate the model parameters for each of the patients, using ABC with the use of the learned discriminatory summary statistics from Stage 3. Studying the inferred parameters from each of the patients, we were able to identify medically meaningful parameters which are able to distinguish between patients of different types.
This study relies only on a small data set and is meant as a proof of concept. A forthcoming clinical study will provide a much larger data set on which we plan to demonstrate the potential of our approach convincingly.

Dataset
The collected dataset (characterizing the deposition pattern in the Impact-R) can be divided into three groups: healthy volunteers (Group 1), patients needing dialysis (Group 2) and patients affected by Chronic Obstructive Pulmonary Disease (COPD) (Group 3). We collected blood samples from 16 volunteers or patients from each of the three groups. Half of the 16 patients having dialysis were also affected by diabetes and all the volunteers and patients were chosen from a broad age group.

Summary statistics learning and inference of parameters
Using Large Margin Nearest Neighbor Metric Learning (LMNN) [10], we first learn a 2-dimensional projection of the collected dataset, which maximally discriminates between different types of patients. We notice that the two features defining this two-dimensional space do not have a meaningful biological representation as they are just weighted non-linear combinations of the observed time-series of the platelet deposition pattern. We use these two features as summary statistics for ABC and thus the fact that they have no biological interpretation is not relevant since they are only used to facilitate the estimation process of biologically meaningful model parameters that are, in turn, used to define the clinical tests. Next we use the Euclidean distance on this projection space for ABC to infer the parameters of the stochastic platelets depositions model for each of the volunteers (or patients), using the corresponding deposition and aggregation pattern of the platelets in their blood displayed in Impact-R machine. This provides us with a posterior distribution of the parameters given the data from each individual volunteer (patient). After learning the posterior distribution, to provide an estimate of the parameters for each of the patients, we calculate the maximum a posteriori (MAP) estimate of the parameters. Here we note that the Bayesian point estimates are minimizers of posterior loss (eg. posterior mean minimizing squared error loss or MAP minimizing 0-1 loss) [12]. These ABC inferred parameters provide valuable biological interpretation. In Fig 1, we illustrate the inferred posterior distribution for a patient with COPD and the corresponding maximum a posteriori (MAP) estimate of the parameters.

Uncertainty in MAP estimation
We illustrate the uncertainty of the MAP estimates in each patient group through the boxplots of MAP estimates for each patient in each of the three groups in Fig 2 for  One of our main assumption to build the test for pathology is that the parameter values for each of the patients in a specific group are centered around a true value of the parameters with , v AP a small variance. Under this assumption, using consistency theorem for MAP estimates [13] we can argue that the MAP estimates for each patient in that group would converge to the true value if we have increasingly many samples (eg. repeated measurements using Impact-R machine) for each of the patients. Given we only have one sample from each of the patient, we can argue that the distribution of the MAP estimates of the patients in a group would be centered around the true value. To justify the standard deviation observed in the MAP estimates of the patients in a group, we simulated 10 dataset using a true parameter value and computed the standard deviation of the MAP estimates for each of the simulated dataset (ŝ p Ag : 23:4;ŝ a T : 0:95;ŝ v NAP z : 2:71e À 05). We notice that the variance in the MAP estimates of the real data in a group [ Fig 2] is similar, in its order of magnitude, to the one observed in our simulation study.
Finally, we notice that the mean/median of the MAP estimates in each group of patients are significantly different among the three groups for the parameters p Ag ; a T ; v NAP z , possibly indicating a discriminative behaviour among the three groups, which helped us to identify the most discriminative parameters and devise a test for pathology.

Kruskal-Wallis H-test
Our main claim is based on the assumption that the median of the MAP estimates of the three patient groups are different among the groups for some of the parameters. We notice that the distribution of the MAP estimates of patients from different groups can still overlap due to the uncertainty in MAP estimation. To test whether the patient specific estimated parameters of the model can distinguish patients from healthy volunteers and furthermore discriminate between different types of the patients, we use the Kruskal-Wallis H-test [15] which tests the null hypothesis that the median of the MAP estimates of the different types of patients are equal versus a bilateral alternative. The computed test statistics and P-values between all the three groups (healthy volunteers, patients needing dialysis and patients with COPD) are reported in the first column of Table 1. Considering a cutoff for significance of 0.05, the null hypothesis gets rejected for the parameters p Ag and v NAP z indicating that the values of these parameters significantly differ between the groups.

PLOS COMPUTATIONAL BIOLOGY
We notice that the rejection of the null hypothesis does not indicate which of these groups differ. Hence, we perform post-hoc Kruskal-Wallis H-test between healthy volunteers and patients needing dialysis, healthy volunteers and patients with COPD and between patients needing dialysis and patients with COPD [columns 2, 3 and 4 of Table 1 respectively]. This indicates that p Ag can clearly differentiate patients with COPD from both healthy volunteers and patients having dialysis. Further a T and v NAP z being able to discriminate patients needing dialysis correspondingly from healthy volunteers and patients with COPD. We list these parameters which are capable to distinguish between the corresponding groups in Table 2.

Discriminating parameters and pathology test
According to our analysis, the biologically meaningful parameters which are able to discriminate between different patient types up to some accuracy are p Ag , a T and v NAP z , as shown in Table 2 and Fig 2. Further, these parameters can be divided into two distinct types of pathologies related to biochemical and conformational changes correspondingly in platelets and red blood cells (RBCs).
Pathological changes in platelets. The first group of parameters (p Ag , a T ) represent explicit intrinsic changes in platelets associated with the presence of pathology, which causes a change in their patterns of adhesion and aggregation. The common thread between dialysis patients and COPD patients is the existence of chronic systemic inflammation implicated in the development of cardiovascular disease. In response to inflammation, it is well known that platelets in COPD and dialysis patients are activated in the bloodstream, altering their hemostatic properties [16][17][18][19] and therefore the process of adhesion and aggregation.
Biochemical and conformational changes in RBCs. v NAP z reflects another aspect of the presence of pathology, where changes in the velocity of platelets are caused via their interaction with red blood cells (RBCs), which may have undergone biochemical and conformational changes altering blood rheology [20] under pathological situations. These changes in RBCs are known as spherization and have been observed in sepsis, dialysis patients, COPD patients and Table 2. Biologically meaningful discriminating parameters. The parameters which are significantly different between the corresponding two groups of patients (volunteers).

Discriminating parameters
Healthy vs Dialysis a T

Table 1. Statistics (P-values, P-values corrected for multiple testing) of Kruskal-Wallis test using maximum a posteriori (MAP) estimates of the parameters.
A higher value of the statistics and P-values smaller than 0.05 (bold) indicates that the medians of the estimated parameters of the corresponding groups are statistically different with a significance cutoff of 0.05. Healthy indicates healthy volunteers, Dialysis indicates patients under dialysis, and COPD stands for patients with COPD. The correction of P-values for multiple testing has been done using the Benjamini-Hochberg procedure [14]. PLOS COMPUTATIONAL BIOLOGY other pathologies causing chronic or acute systemic inflammation [21]. Recently, [22] reported that the RBC spherization induces an increase in platelet adhesion and aggregation processes and an increase in platelet transport to the wall. Hence, the changes observed in platelets velocities under pathologies may indicate medical conditions which causes biochemical and conformational changes to RBCs. Pathology test. Based on the most discriminative parameters (a T and p Ag ) identified by our analysis, we devise a test for pathology to identify diseased patients (correspondingly for patients with COPD and patients having dialysis). An individual is identified as healthy or belonging to COPD group (similarly to dialysis patients) if the MAP estimate of the parameter a T (correspondingly p Ag ) is closer to the median of the MAP estimates of the healthy volunteers or closer to the corresponding value estimated on patients in the COPD group (respectively dialysis patients). The sensitivity and specificity [23] of our two tests are reported in the Table 3. This shows that we can identify patients having COPD with higher degree of accuracy given that our analysis only depends on a relatively small number of patients/volunteers.

Ethics statement
Volunteers were recruited at the nephrology and pneumology units of the CHU-Charleroi, ISPPC Hôpital Vésale in Belgium. Written informed consent was obtained from each patient and healthy donor included in the study. The protocol of the study was in conformity with the ethical guidelines of the Helsinki Declaration of 1975 (revised in 2000) and was approved by the institution's ethics committee (No: OM008; P17/49_27/09). Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Based on detailed in vitro experiments using the Impact-R device mimicking platelet adhesion-aggregation in blood vessels, first we provide a model which is an in silico counterpart for an in-depth description and understanding of the phenomenon and the underlying mechanisms.

Impact-R experiment
Impact-R [6], a well-known platelet function analyzer, is a cylindrical device whose bottom wall is a fixed disc (deposition substrate), while the upper wall is a rotating disc (shaped as a cone with a small angle). The height of the device is 0.82 mm and due to the motion of the upper wall a pure shear flow is created. A controlled shear rate _ g is produced in a given observation window of 1 × 1 mm 2 , where we track the formation of clusters resulting from the deposition and aggregation of platelets. Blood was drawn from both healthy and diseased donors with different hematocrit (volume fraction of red blood cells (RBC)). Before starting the tests, a sample is recovered and analysed, to determine the concentration of activated (AP) and nonactivated (NAP) platelets. Serum albumin, the most abundant protein in human blood plasma, antagonises with the platelets, preventing them from adhering to the substrate. The quantities Table 3. Sensitivity and specificity of the proposed test using the most discriminative parameters to identify diseased patients compared to healthy volunteers.

Stochastic model of platelet deposition
The deposition process in Impact-R was successfully described with a mathematical model for the first time in [24,25] accounting for the following observations: (i) AP adhere to the deposition surface, forming a seed for a new cluster, (ii) NAP and AP can deposit at the periphery or on top of an existing cluster and (iii) Albumin (Al) deposits on the surface, thus reducing locally the adhesion and aggregation rates of platelets. A sketch of the situation shown in Fig 4. In [25], the platelets reach the bottom layer due to a RBC-enhanced shear-induced 1D diffusion. In the present work, we propose a fully stochastic model of platelet deposition, by substituting the 1D diffusion systems with a 3D random walk, while keeping the deposition dynamics the same. The reason of this new approach is to avoid the introduction of a boundary layer (denoted by Δz in [25]) when coupling platelet transport with platelet deposition. With this particle based approach, a platelet becomes a candidate for deposition whenever it hits the deposition surface. In [25] the diffusion coefficient D and the thickness of the boundary layer Δz where determined independently of the other model parameters, precisely by assuming a random motion of the platelets. Now, we keep the same level of description everywhere and, instead of D and Δz, two new parameters are considered, namely the characteristic velocities of activated and non-activated platelets. They will be inferred from the data, together with the adhesion and aggregation rates, and other quantities defined below. The random walk of the platelets can be described by their jump (Δz(t), Δx(t), Δy(t)) at each iteration t: DxðtÞ ¼ v xy js xy j cosð2prÞ dt; DyðtÞ ¼ v xy js xy j sinð2prÞ dt; ð3Þ

PLOS COMPUTATIONAL BIOLOGY
where v z,xy is a speed unit, r is a random variable uniformly distributed in [0, 1], s z,xy 2 (−1, 1) is a random variable distributed as a standard normal distribution, λ 2 {−1, 1} with probability 1/2 for each outcome, and dt is the time step of the simulation. Superimposing the stochastic motion with the velocity field of the pure shear flow, the positions of the platelets are updated as It should be noted that this stochastic model is physically equivalent to the shear-induced diffusion model used in [25]. From the random motion of particles, diffusion constants emerge both along the flow direction and perpendicular to it. The equivalence of random walk

PLOS COMPUTATIONAL BIOLOGY
and diffusion is a well known result, see for instance [26]. It has been checked that the platelet deposition pattern obtained with this new model is the same, up to statistical fluctuations, as with the model used in [25], when the physical properties are the same. The present model is computationally more costly than that of [25], but provides more flexibility for parameter inference, in particular to infer, from in vitro data, the transport properties of platelets towards the deposition surface. Additionally the present transport model no longer assumes the same transport properties for activated and non-activated platelets. Actually, this feature allows us to better explain the Impact-R data as activated platelets turn out to move faster, probably due to their increased effective hydrodynamic size. As a consequence, the final results of both models cannot be compared.
Our interest here is to model the transport of platelets in the direction perpendicular to the blood flow so as to obtain the flux of platelets that will reach the so-called Cell Free Layer (the layer without red blood cells near a wall), at the bottom of the Impact-R device and be candidate for deposition. The stochastic transport along the blood flow direction is not expected to have an impact on the deposition, due to the fast mixing of platelet in the horizontal plan. This assumption made in [25] is confirmed by the present model which includes explicitly this horizontal transport. Note finally that here we can exclude any drift in the vertical direction due to the up-down symmetry in the Impact-R setting. This absence of a drift, sometimes proposed to describe platelet transport in a tube, has been confirmed by full resolved blood flow simulations, in which deformable red blood cells and platelets are in suspension in a plasma subject to a shear flow [27].
Owing to the different dynamics and physics governing the activated and non-activated platelets, they have in principle different speed units (v AP z;xy ; v NAP z;xy ). Regarding the motion of albumin, its abundance allows us to neglect the small density gradients due to its deposition, and thus albumin can deposit at any time at a maximum value of deposition rate.
The AP and NAP that cross the lower boundary of the computational domain are removed from the bulk if the deposition is successful, or get trapped at the Cell Free Layer (CFL) for a future deposition attempt (they never get re-injected into the bulk). Further, periodic conditions are applied at the x, y directions, and bounce back boundary condition for the platelets that cross the upper boundary. Assuming a good horizontal mixing in the xy-plane due to the rotating flow, and given its low impact on the deposition process, the stochastic part of the motion in the x, y directions can be fixed to the same order of magnitude as the z direction velocity. Therefore, we consider v AP and s xy = s z . Next we describe the deposition rates. Let us denote by N i,j (t) the number of candidate particles for deposition above the cell at position i, j of the discretised substrate. The deposition of the platelets and albumin on the substrate follow the stochastic rules described in [25], i.e., based on the N i,j (t) and on the occupancy of the i, j-th cell at time t. Albumin that reaches the substrate at time t deposits with a probability P(t) which depends on the local density ρ al (t) of already deposited albumin. We assume that P is proportional to the remaining free space in the cell, where p F is a parameter to be determined and ρ max is given by the constraint that at most 100,000 albumin particles can fit in a deposition cell of area ΔS = 5 (μm) 2 , corresponding to the size of a deposited platelet (obtained as the smallest variation of cluster area observed with the microscope). An activated platelet that hits a platelet-free cell deposits with a probability Q, where Q decreases as the local concentration ρ al of albumin increases. We assumed that where p Ad and a T are parameters to be determined. This expression can be justified by the fact that a platelet needs more free space than an albumin to attach to the substrate, due to their size difference. In other words, the probability of having enough space for a platelet, decreases roughly exponentially with the density of albumin in the substrate (more details in [5]). In our model, AP and NAP can deposit next to already deposited platelets. From the above discussion, the aggregation probability R is assumed to be with p Ag another unknown parameter. We also introduce p T the rate at which platelets deposit on top of an existing cluster. Fig 4 presents coarsely the competing adhesion-aggregation process between albumin and platelets. More details on the stochastic deposition rules can be found in [25].
For the purpose of the present study, the platelet deposition model M is parametrized in terms of the seven quantities introduced above, namely the adhesion rate p Ad , the aggregation rates p Ag and p T , the deposition rate of albumin p F , the attenuation factor a T , and the velocities of AP and NAP v AP z and v NAP z . Collectively, we define If the initial number of AP and NAP at time t = 0 (N platelet ð0Þ and N actÀ platelet ð0Þ), as well as the concentration of albumin are known from the experiment, we can forward simulate the deposition of platelets over time using model M for the given values of these parameters θ = θ � :

Estimation of model parameters
As the likelihood function induced by the platelets deposition model is analytically intractable due to the need of computing a very high-dimensional integral, we can not compute the maximum likelihood estimate of the model parameters or perform traditional Bayesian inference. In setting where the likelihood function is not available, approximate Bayesian computation (ABC) [7] offers a way to sample from an approximate posterior distribution of the parameter p(θ|x 0 ) � π(θ)p(x 0 |θ) given the observed data x 0 , where π(θ) and p(x 0 |θ) are correspondingly the prior distribution on the parameter θ and the likelihood function. Further we note that the mode of this approximate posterior distribution (i.e. the maximum-a-posteriori -MAP-estimate) is also the approximate maximum likelihood estimate if we assume the prior distribution on the parameter to be uniform. Following this, given the observed data, we compute the MAP estimates using the ABC approximate posteriors of the parameters of the stochastic platelet deposition model.

Approximate Bayesian computation (ABC)
The fundamental ABC rejection sampling scheme iterates the following steps: 1. Draw θ from the prior π(θ).
2. Simulate a synthetic dataset x sim from the simulator-based model MðθÞ.

See Fig 5 for a visualization of the above algorithm.
Here, the metric on the dataspace d(x sim , x 0 ) measures the closeness between x sim and x 0 . The accepted (θ, x sim ) pairs are thus jointly sampled from a distribution proportional to π(θ) p d,ϵ (x 0 |θ), where p d,ϵ (x 0 |θ) is an approximation to the likelihood function p(x 0 |θ): where K ϵ ðdðx sim ; x 0 ÞÞ is in this case a probability density function proportional to 1ðdðx sim ; x 0 Þ < ϵÞ and 1ð�Þ is used as an indicator function. Besides this choice for K ϵ ðdðx sim ; x 0 ÞÞ, that has been exploited in several papers (for instance [28][29][30][31]), ABC algorithms relying on other choices exist, for instance with the kernel K being exp(−d(x sim , x 0 )/ϵ) as in simulated-annealing ABC (SABC) [32]. More advanced algorithms than the simple rejection scheme detailed above are possible, for instance ones based on Sequential Monte Carlo [30,31], in which various parameter-data pairs are considered at a time and are evolved over several generations, while ϵ is decreased towards 0 at each generation to improve the approximation of the likelihood function, so that we are able to approximately sample from the true posterior distribution. For inference of parameters of the platelets deposition model, here we choose the SABC algorithm, based on its suitability to high performance computing systems [33]. For practical implementation, SABC was run for 20 iterations generating 510 samples from the posterior distribution of the model parameters given data from each patient, keeping all other parameters fixed to the default values proposed in the Python package 'ABCpy' [34]. Next, we explain how the distance between datasets used for ABC was chosen via a discriminative summary statistics learning approach and finally how the MAP estimates (MAP) of the parameters was computed.

Discriminative summary statistics learning (DSSL)
Traditionally, distance between x sim and x 0 is defined by summing over Euclidean distances between all possible pairs composed by one simulated and one observed datapoint in the PLOS COMPUTATIONAL BIOLOGY corresponding datasets. Recently, distances for ABC have also been defined through classification accuracy [35], Kullback-Liebler divergence [36], maximum mean discrepancy [37] or by Wasserstein distance [38], under the assumption that the datapoints in each datasets are identical and independently distributed and they are available in a large number in both x sim and x 0 . We notice that in our setting we only have one datapoint in the observed dataset corresponding to the platelet deposition pattern of a patient and also due to the very expensive nature of our simulator model (simulation of one datapoint takes around 15 minutes) we can only have few datapoints in the simulated dataset. Hence, here we concentrate on the definition of distances through Euclidean distance on summary statistics extracted from the dataset when we only have one data-point in both x sim and x 0 . When the data x is high-dimensional, a common practice in ABC literature is to define d as Euclidean distance between a lower-dimensional summary statistics s : x sim 7 ! s(x sim ). Reducing the data to suitably chosen summary statistics may also yield more robust inference with respect to noise in the data. Moreover, if the statistics is sufficient, then the above modification provides us with a consistent posterior approximation [39], meaning that we are still guaranteed to converge to the true posterior distribution in the limit ϵ ! 0. As sufficient summary statistics are not known for the majority of the complex models, the choice of summary statistics remains a problem [40] and they have been previously constructed using neural networks trained on a 'pilot' simulated dataset from the simulator model as in neural network based semi automatic summary statistics learning (SASL) [41,42] or summary statistics learning minimizing triplet loss (TLSL) [11]. Detailed description of SASL and TLSL can be found in S1 Appendix.
These procedures are computationally expensive due to the need of the simulation of the 'pilot' dataset, further the learned summary statistics using these methods are not able to discriminate between datasets from different patient types. As the main goal of the present research is to learn parameter values which are able to differentiate between different patient types, here we propose a methodology to learn such summary statistics and name it as discriminative summary statistics learning (DSSL). Learning summary statistics which is most discriminative between datasets with different labels falls under a well-developed field of research in metric-learning [9]. We use Large Margin Nearest Neighbor Metric Learning (LMNN) [10], one of the metric-learning approaches, which learns a Mahalanobis distance between data from any two patients x 1 and x 2 able to discriminate between datasets with different labels, d M ðx 1 ; x 2 Þ ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where M is a d × d positive semi-definite matrix. We note that learning of the Mahalanobis distance here corresponds to learning a summary statistics of the data. It is sufficient to recall that for each positive semidefinite matrix M there exists a square matrix L such that M = L T L. Therefore, we can write Eq 11 in the following way: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where ||�|| 2 is the Euclidean distance, from which it is clear that the above corresponds to learning the transformation s : x 7 ! s(x) = Lx which is able to discriminate between different patient groups.
To learn the transformation, LMNN solves the following optimization problem: where x i is a data from a patient, x j is one of that patient's k-nearest neighbors belonging to the same group of patients, and x l are all the other data from patients of different type within the neighborhood, η ij , y ij 2 {0, 1} are indicators, η ij = 1 if x j is one of the k-nearest neighbors (conditioned on being of the same type of patient as x i ) of x i , y ij = 0 indicates x i , x j are different types of patients, [�] + = max(0, �) is the Hinge loss. Intuitively, LMNN tries to learn a metric able to keeps k-nearest neighbors from the same patient group close together, while keeping patients from the other groups well separated. Further we note LMNN does not make any assumptions about the distribution of the data. To learn the projection by using LMNN, we consider the original data, its second and third order polynomial expansion and the cross products between them. Further, we manually tuned the tuning parameters for the LMNN algorithm provided in Python package 'metric-learn' [43] to maximize the rand index [44] between the true patient clusters in the dataset and the clustering achieved using agglomerative hierarchical clustering with the Euclidean distance on the learned discriminative summary statistics. Euclidean distance between the learned summary statistics from the data of different patient types were able to cluster the dataset with 100% accuracy. In Fig 6 we illustrate the learned discriminative summary statistics space in which the 3 groups of patients/volunteers are accurately clustered.
To validate our DSSL approach, we compare DSSL with SASL and TLSL by using posterior predictive checks for a experimental simulated dataset from the stochastic platelet deposition model. Further experimental details can be found in S1 Appendix. The main goal here is to analyze the degree to which the experimental data deviate from the data generated from the inferred posterior distribution of the parameters. Hence we want to generate data from the model using parameters drawn from the posterior distributions learned using the three different summary statistics learned via SASL, TLSL and DSSL. To do so, we first draw 500 parameter samples from the corresponding inferred approximate posterior distribution and simulate 500 data sets, each using a different parameter sample. This simulated dataset is considered as the predicted dataset from our inferred posterior distributions. In Fig 7, we plot the experimental data (solid line), 95% predictive credibility interval (shaded area) and the median prediction (dashed line) for SASL (red), DSSL (blue) and TLSL (green). The experimental data falls inside the 95% predictive credibility interval (PCI) for all the three summary learning approach, where DSSL producing the tightest PCI. For N aggÀ clust and S aggÀ clust the median prediction is closer to the true experimental data, whereas SASL and TLSL performs better for N platelet .
Further, to measure the deviation of the predicted dataset from the experimental data used for inference we use a Monte Carlo estimate of the energy score, which is a strictly proper scoring rule used to measure predictive performance of probabilistic predictions [45], where x i is the i-th data simulated using a posterior sample, x 0 is the experimental data used for inference, β 2 (0, 2). In Table 4, we report the energy score (fixing β = 1) computed correspondingly for the three inferential schemes using summary statistics learned via SASL, TLSL and DSSL for the three observed quantities N aggÀ clust , S aggÀ clust and N platelet . The values in Table 4 are in agreement with Fig 7, allowing us to conclude that the proposed DSSL approach works better in predicting N aggÀ clust and S aggÀ clust , whereas other two approaches perform better for the prediction of N platelet . This illustrates that the discriminative summary statistics learned by DSSL did not compromise in the overall predictive performance of the model, whereas it also encapsulates discriminative information between the three groups of patients/volunteers.

Maximum a posteriori estimate (MAP)
Given an observed dataset x 0 , we want to estimate the corresponding θ. SABC inference scheme provides us with Z samples ðθ i Þ Z i¼1 from the ABC approximated posterior distribution p(θ|x 0 ) given the data for each patient. Given these samples we construct a smooth approximation of the posterior distribution (given data for a specific patient) of the parameters using Gaussian kernel density estimator with a bandwith equal to 0.45 and compute the mode of the smoothed posterior distribution using Nelder-Mead algorithm [46] as the estimate of parameters for each specific patient. This estimate will be considered as the MAP of the parameters. The Gaussian kernel density and Nelder-Mead algorithm were used as implemented in Python package 'scipy' [47].
In Fig 8 we illustrate the discriminative projection of the parameter space θ learnt using LMNN, by considering the MAP estimates, its second, third and fourth order polynomial expansion and the cross products between them. The tuning parameters for the LMNN algorithm were tuned as before to maximize the rand index [44] between the true patient clusters in the dataset and the clustering achieved using agglomerative hierarchical clustering with the Euclidean distance on the learned discriminative projection of parameter values from the MAP estimates. Euclidean distance between this learned parameter projection from the MAP estimates for different patient types were able to cluster the patients with 100% accuracy. This illustrates that we didn't lose any discriminative information inherent in the patient (volunteer) dataset by learning the MAP estimates, justifying our main claim that the values of the model parameters are precisely the information needed to assess various possible pathological conditions.
Here we further note that we could have used the discriminative summary statistics space or the discriminative projection of the parameter space correspondingly in Figs 6 and 8, which are not biologically meaningful, to construct the test to determine pathology in the patients with 100% accuracy. Instead we choose here to construct the test for pathology based on the biologically meaningful parameters of our stochastic deposition model, doing so our methodology somewhat looses accuracy but gains significant interpretability.

Conclusion
A numerical model of platelets deposition introduced in [25] was illustrated to successfully predict the platelets deposition and aggregation patterns in Impact-R device. [8] proposed an inferential scheme based on ABC using a distance based on expert knowledge to quantify the uncertainty of the model parameters of the model presented in [25] given data from a healthy patient. Due to the modeling of platelets reaching the bottom layers by a 1D diffusion, it was not possible to accurately calibrate the diffusion coefficient and the thickness of the boundary layers using the inferential scheme described in [8] and hence these quantities were manually tuned. To solve this problem, here we introduced a fully stochastic model which replaces the

PLOS COMPUTATIONAL BIOLOGY
above parameters by the characteristic velocities of activated and non-activated platelets. We were able to estimate these velocities in addition to the other model parameters using the inferential scheme we propose in the present manuscript. Hence the proposed methodology completely removes the need to manually tune any parameters of the numerical model of platelets deposition. Secondly, we adapt our inferential scheme to estimate parameters of the model which not only achieves good prediction performance but also can accommodate the discriminative information available in the dataset of different types of patients and volunteers. This is done by learning a summary statistic which is maximally discriminative between the three groups (healthy volunteers, patients with COPD and patients needing dialysis) considered and finally defining an Euclidean distance on this summary statistics space to use as distance between observed and simulated data in approximate Bayesian computation. We show that the three groups cluster accurately in this learned summary statistics space. This discriminative summary statistics was also able to get comparable predictive performance to the existing summary learning approaches when used in ABC to learn approximate posterior distribution of the model parameters given the data.
Finally, we evaluate the maximum a posteriori of the model parameters for each of the 48 patients by computing the mode of the joint approximate posterior distribution inferred by ABC. We notice estimated values of some of the parameters were able to distinguish between PLOS COMPUTATIONAL BIOLOGY different types of patients and healthy volunteers. We would like to note here that the original learned discriminative summary statistics could be used to differentiate between different types of patients and volunteers, but those summary statistics are not interpretable in a biological sense. Hence, our main contribution lies in this ability to estimate biologically meaningful parameters which can also discriminate between different types of patients. This may serve as an illustration to make some of the machine learning models used in biology interpretable.
Supporting information S1 Appendix. Additional mathematical details. Details on Semi Automatic Summary Statistics Learning and Summary Statistics Learning by minimizing triplet loss. (PDF)