The Power of Heterogeneity: Parameter Relationships from Distributions

Complex scientific data is becoming the norm: many disciplines are growing immensely data-rich, and higher-dimensional measurements are performed to resolve complex relationships between parameters. Inherently multi-dimensional measurements, such as in nuclear magnetic resonance and optical spectroscopy, can directly provide information on both the distributions of individual parameters and the relationships between them. However, when data originates from different measurements and comes in different forms, resolving parameter relationships is a matter of data analysis rather than experiment. We present a method for resolving relationships between parameters that are distributed individually and also correlated. In two case studies, we model the relationships between diameter and luminescence properties of quantum dots and the relationship between molecular weight and diffusion coefficient for polymers. Although resolving complicated correlated relationships is expected to require inherently multi-dimensional measurements, our method constitutes a useful contribution to the modelling of quantitative relationships between correlated parameters and measurements. We emphasise the general applicability of the method in fields where heterogeneity and complex distributions of parameters are obstacles to scientific insight.


Introduction
Increasing the dimensions of an experiment allows the measurement of dependence between parameters. For example, introducing multi-dimensional nuclear magnetic resonance (NMR) has opened new possibilities for studying heterogeneous structures and complex phenomena by correlating different parameters describing transport properties and identifying different populations based on diffusion and relaxation properties [1,2]. In optical (electronic) spectroscopy in the infrared, visible, and ultraviolet ranges, obtaining 2D spectra that describe correlations of excitation and detection wavelengths recently enabled the mapping of energy transfer pathways in the light-harvesting mechanism of certain bacterial species [3][4][5]. The key notion for both techniques is that a joint probability distribution (or 'spectrum' or 'map') of different parameters and their relationships can be estimated, instead of just the marginal (or lower-dimensional) distributions for individual parameters.
Multi-dimensional intra-modality measurements (within the confines of a single technique) inherently provide probabilistic relationships between parameters. More often than not, though, different parameters of systems are studied using different techniques, i.e. a multi-dimensional inter-modality measurement; revealing parameter relationships then becomes a matter of data analysis post-experiment, rather than a matter of experiment per se. Unfortunately, individual parameters in almost all real systems are distributed, and heterogeneity is a major obstacle to data interpretation. One common approach to this problem is to rely on purification to minimise sample heterogeneity and then estimate relationships using standard methods such as regression by least-squares fitting. However, altering samples is typically not viable in the industrial processing of heterogeneous raw materials, and it can be a substantial hurdle even in controlled laboratory settings.
This dilemma has prompted us to move to a setting where experimental input and output originate from different measurement techniques and comprise distributions rather than single values. Combining information from different sources, often called data integration or data fusion [6], is broadly addressed in computer science, machine learning, and mathematics. Modelling probabilistic relationships when data is obtained as distributions of individual parameters is almost unexplored; the closest work has discussed 'distribution to distribution regression' in a different context [7].
Here we develop a widely applicable general-purpose statistical method for estimating probabilistic relationships from distributions of data for individual parameters. Critically, these distributions can be represented by different types of data, i.e. statistical samples (sets of values), decay curves, spectra, function curves, and histograms, and therefore involve solving inverse problems.

Results and Discussion
Consider the probabilistic relationship between an input parameter $x$ and an output parameter $y$, modelled through the joint distribution describing the dependence structure of $x$ and $y$, based on measurements from $K$ samples. The data from the $k$th sample represent the marginal distributions $p_{x;k}(x)$ and $p_{y;k}(y)$ (directly as a statistical sample, or indirectly through an inverse problem) of the joint distribution for the $k$th sample, $p_{x,y;k}(x, y)$. Generally, nothing can be said about the dependence structure of $x$ and $y$ from a single measurement alone: finding a joint distribution from marginal distributions is by itself an inverse problem that requires assumptions or a priori knowledge. However, assuming that the conditional distribution of $y$ given $x$ is common to all samples, the joint distribution of $x$ and $y$ for the $k$th sample can be written $p_{x,y;k}(x, y) = p_{y|x}(x, y)\,p_{x;k}(x)$, so that its marginal in $y$ is $p_{y;k}(y) = \int p_{y|x}(x, y)\,p_{x;k}(x)\,dx$. Hence, a 'link' between the measurements for different samples is provided through the conditional distribution $p_{y|x}(x, y)$, interpreted physically as the distribution of $y$ caused by unobserved parameters of the system other than $x$. That it is independent of the sample index $k$ corresponds to assuming that sorting into $K$ samples is conducted solely on the basis of the value of $x$. The conditional distribution $p_{y|x}(x, y)$ is estimated indirectly by fitting the distributions $p_{y;k}(y)$ to the measurements, yielding an estimate of the probabilistic relationship between $x$ and $y$.
Thus, a space of distributions of x is mapped onto a space of distributions of y, in the process of modelling the relationship. The distribution models may be parametric as well as nonparametric.
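The mapping from a distribution of x to a distribution of y can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical normal conditional distribution with mean 2x + 1 and standard deviation 0.3, two hypothetical normal input distributions, and a simple rectangle rule on uniform grids. All function names and numerical values are chosen for illustration only.

```python
import math

def normal_pdf(v, mean, sd):
    """Density of a normal distribution evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def grid(start, stop, n):
    """n equally spaced points from start to stop (inclusive)."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def marginal_y(y_values, x_values, p_x, conditional):
    """p_{y;k}(y) = integral of p_{y|x}(x, y) p_{x;k}(x) dx (rectangle rule)."""
    dx = x_values[1] - x_values[0]
    return [sum(conditional(x, y) * px for x, px in zip(x_values, p_x)) * dx
            for y in y_values]

def conditional(x, y):
    """Hypothetical conditional p_{y|x}(x, y), shared by all samples."""
    return normal_pdf(y, 2.0 * x + 1.0, 0.3)

x_values = grid(-2.0, 12.0, 281)
y_values = grid(-5.0, 30.0, 351)
dy = y_values[1] - y_values[0]

# Hypothetical input distributions p_{x;k}: (mean, standard deviation) per sample k.
samples = {1: (3.0, 0.5), 2: (5.0, 1.0)}

marginals = {}
for k, (mu, sd) in samples.items():
    p_x = [normal_pdf(x, mu, sd) for x in x_values]
    marginals[k] = marginal_y(y_values, x_values, p_x, conditional)

# Mean of y for each sample, computed from the mapped distribution.
means = {k: sum(y * p for y, p in zip(y_values, p_y)) * dy
         for k, p_y in marginals.items()}
```

In an actual fit the parameters of the conditional distribution would be unknown and adjusted until the mapped distributions match the measured ones for all samples simultaneously; here they are fixed so that the forward mapping alone is visible.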
First, we study probabilistic relationships between material parameters for colloidal semiconductor quantum dots (QDs), which are luminescent nanoparticles with applications in imaging [8][9][10], catalysis [11], photovoltaics [12][13][14], and many more fields. QDs are a prototypical complex material with properties dependent on the interaction between multiple parameters. In particular, the emission spectral profile and excited-state decay dynamics, and their relation to size, are vital for understanding the electro-optic and catalytic properties of these materials [15]. Additionally, QD size polydispersity is often inherent to synthesis, and further purification to reduce the polydispersity is often impractical. Four batches of standard ZnS-coated CdSe QDs are synthesised, and diameter distributions are estimated from transmission electron microscopy (TEM) image data. Steady-state emission spectra of the QDs are acquired and interpreted as distributions of wavelengths $p_{\lambda;k}(\lambda)$ convolved with a Lorentzian line-broadening function, $\gamma/(\pi((\lambda - \lambda_0)^2 + \gamma^2))$. The model for the $k$th spectrum $S_k(\lambda)$ is therefore of the form
$$S_k(\lambda) \propto \int p_{\lambda;k}(\lambda_0)\,\frac{\gamma}{\pi\left((\lambda - \lambda_0)^2 + \gamma^2\right)}\,d\lambda_0 .$$
The distribution $p_{\lambda;k}(\lambda)$ is the marginal distribution of $\lambda$ from the joint distribution
$$p_{d,\lambda;k}(d, \lambda) = p_{\lambda|d}(d, \lambda)\,p_{d;k}(d).$$
This factoring corresponds to assuming that the QDs are sorted only according to size, and that other variation for a given diameter $d$ is the same regardless of the sample. The distribution $p_{\lambda|d}(d, \lambda)$ constitutes a summary of the influence of all system parameters other than $d$. If it is degenerate, $p_{\lambda|d}(d, \lambda) = \delta(\lambda - \lambda_0)$ with $\lambda_0$ depending only on $d$, then $\lambda$ is completely determined by $d$, and thus there is no other relevant parameter in the relationship. Fig 1 shows the spectra, the model fits, and the estimated joint distributions of diameters and emission wavelengths.
We now acquire fluorescence lifetime distributions of the QDs and interpret them as superpositions of exponential lifetime distributions, governed by a 'characteristic' (average) lifetime distribution for the $k$th sample, $p_{\tau;k}(\tau)$, such that the distribution of photon arrival times $t$ is
$$p_{t;k}(t) = \int p_{\tau;k}(\tau)\,\frac{1}{\tau}\,e^{-t/\tau}\,d\tau .$$
The distribution $p_{\tau;k}(\tau)$ is the marginal distribution of $\tau$ coming from the joint distribution
$$p_{d,\tau;k}(d, \tau) = p_{\tau|d}(d, \tau)\,p_{d;k}(d).$$
Otherwise, the analysis is identical to the previous one for wavelengths. Fig 2 shows the fluorescence lifetime distributions, the model fits, and the estimated joint distributions of diameters and characteristic lifetimes. The results show that the larger particles have narrower lifetime distributions. This may be due to the presence of fewer defects related to a lower surface-to-volume ratio.
Second, we study the probabilistic relationship between molecular weight $M$ and diffusion coefficient $D$ for polymers. A well-known empirical scaling law for linear polymer chains states that
$$D = K M^{-\nu}, \tag{6}$$
where $\nu \approx 0.5$ to $0.8$ for uncharged polymers in the dilute regime, depending on solvent quality and temperature [16]. The parameter $\nu$, analogous to the Flory exponent [17], is a measure of how polymers spread in space [18,19]. Knowledge of both parameters $K$ and $\nu$ allows determination of molecular weight distributions from diffusion coefficient distribution measurements [20]. Polymers are a prototypical example of the case where samples are often purified to minimise heterogeneity, with the scaling law parameters $K$ and $\nu$ typically estimated by taking logarithms and using linear least-squares on a set of many samples [21]. Considering a polydisperse sample with marginal distributions $p(M)$ and $p(D)$, the scaling law relationship can be seen as a degenerate joint distribution with non-zero density only on the curve described by Eq 6, because no other parameter influences the relationship; hence it is a special case of the more general example above.
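The conventional route mentioned above (taking logarithms and applying linear least-squares) can be sketched as follows, assuming the scaling law has the usual power-law form D = K M^(-ν), so that ln D = ln K − ν ln M. The molecular weights are those of the monodisperse standards listed in the methods; the diffusion coefficients are synthetic, generated from hypothetical values of K and ν, and are not measured data.

```python
import math

def fit_scaling_law(M, D):
    """Least-squares fit of ln D = ln K - nu * ln M; returns (K, nu)."""
    x = [math.log(m) for m in M]
    y = [math.log(d) for d in D]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

# Molecular weights of the eight standards (from the methods section).
M_values = [2430, 3680, 13700, 18700, 29300, 44000, 212400, 382100]

# Hypothetical scaling-law parameters, used only to generate synthetic D values.
K_true, nu_true = 3.0e-8, 0.55
D_values = [K_true * m ** (-nu_true) for m in M_values]

K_hat, nu_hat = fit_scaling_law(M_values, D_values)
```

With noise-free synthetic input the fit returns the generating parameters; with real data each point would be one monodisperse sample, which is what makes this route measurement-intensive.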
We use the fact that, under Eq 6, if $p(M)$ is a lognormal distribution with parameters $(\mu_M, \sigma_M)$, then $p(D)$ is also a lognormal distribution, with parameters $(\mu_D, \sigma_D)$. On a polydisperse polystyrene sample, we measure the molecular weight distribution by gel permeation chromatography (GPC), obtained as a distribution over $\log_{10} M$. We measure the diffusion coefficient distribution of the polystyrene, diluted in deuterated chloroform (CDCl$_3$), by NMR, obtaining a signal attenuation described by the Stejskal-Tanner equation [22],
$$I(b) = I_0 \int_0^\infty p(D)\,e^{-bD}\,dD, \tag{7}$$
for a lognormal distribution $p(D)$ with parameters $(\mu_D, \sigma_D)$. Because Eq 6 implies $\mu_D = \ln K - \nu\,\mu_M$ and $\sigma_D = \nu\,\sigma_M$, the scaling law parameters $K$ and $\nu$ can be obtained from these parameters by
$$\nu = \frac{\sigma_D}{\sigma_M}, \qquad K = \exp\left(\mu_D + \nu\,\mu_M\right).$$
As a validation, we use the values of $M_w$ reported by the manufacturer for eight well-defined monodisperse polystyrene standards and the corresponding values of $\langle D \rangle$, measured by NMR for dilute amounts of the samples in CDCl$_3$, and perform the conventional linear least-squares fitting to estimate $K$ and $\nu$. The resulting estimate agrees with the conventional method, based on a 95% nonparametric confidence interval of the latter. That $K$ and $\nu$ can be determined from a single sample has been suggested before [23], but is here performed for the first time.
We emphasise that, when the functional form of the relationship is known, a single measurement on a polydisperse sample yields a very similar result to the conventional approach, which requires a painstakingly large number of measurements on monodisperse samples (and possibly purification steps to obtain those samples).
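The step from lognormal parameters to scaling-law parameters can be checked with a short round trip. This sketch assumes the scaling law D = K M^(-ν) and lognormal parameters defined on the natural-logarithm scale, so that μ_D = ln K − ν μ_M and σ_D = ν σ_M. The function names and all numerical values are hypothetical.

```python
import math

def lognormal_params_of_D(mu_M, sigma_M, K, nu):
    """Lognormal parameters of D implied by D = K * M**(-nu)."""
    return math.log(K) - nu * mu_M, nu * sigma_M

def scaling_params(mu_M, sigma_M, mu_D, sigma_D):
    """Invert the relations above: returns (K, nu)."""
    nu = sigma_D / sigma_M
    K = math.exp(mu_D + nu * mu_M)
    return K, nu

# Hypothetical values, for illustration only.
K_true, nu_true = 2.5e-8, 0.6
mu_M, sigma_M = 12.0, 0.5

mu_D, sigma_D = lognormal_params_of_D(mu_M, sigma_M, K_true, nu_true)
K_rec, nu_rec = scaling_params(mu_M, sigma_M, mu_D, sigma_D)
```

The inversion needs a non-zero σ_M: for a perfectly monodisperse sample the width carries no information about ν, which is why polydispersity is what makes a single-sample estimate possible.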

Conclusions
We demonstrate a general-purpose mathematical method for estimating probabilistic relationships from measured distributions of individual parameters. The distributions can be represented directly by statistical samples, or indirectly through an inverse problem by decay curves, spectra, function curves, and histograms. The result is an estimate of a joint probability distribution of parameters that describes relationships and correlations. In the applications to quantum dots, measurements from multiple heterogeneous samples are utilized, allowing for joint probability distributions (Figs 1 and 2) for which the effects of diameter are observed through the dependence of both the mean and the variance of the conditional distribution on diameter. In the application to polymers, using measurements on a single polydisperse polymer sample, the parameters K and ν of the scaling law relationship in Eq 6 are estimated, and this gives rise to a joint probability distribution (Fig 3) (with conditional distributions being in fact delta functions).
In a world ever more data-rich, with a plethora of measurement techniques and sensors used in the minerals, pharmaceutical, food, and chemical industries, as well as in fundamental science, we urgently need to improve the ways in which we analyse the complex data sets that are becoming the norm. This trend will only continue, in particular because natural raw materials need to be used to their full potential to meet future sustainability and cost demands; these materials are naturally heterogeneous, complex, and challenging to master. We expect our general-purpose mathematical method to be applicable in many fields where heterogeneity and complex distributions of parameters are obstacles to scientific insight. Single values of parameters can be replaced by distributions, taking advantage of the heterogeneity and letting it work for us, not against us. Naturally, multi-dimensional experiments deliver a wealth of information which is literally impossible to access with lower-dimensional experiments. Though more complicated parameter relationships, involving e.g. non-monotonicity, discontinuity, or multimodal conditional distributions, cannot be resolved with our method, we expect the method to be useful for the understanding of many heterogeneous systems.

Preparation of quantum dots
Four samples of ZnS-coated CdSe core-shell quantum dots (QDs) with different nominal diameters in the range d ≈ 2 − 5 nm are synthesized using a previously published method [24] and dissolved in toluene.

Transmission electron microscopy measurements
Diameters of QDs are estimated using transmission electron microscopy (TEM). The analysis is performed on a JEOL 2100F microscope operated at 200 kV. All experiments are carried out at room temperature. The spatial resolutions (physical pixel sizes) are Δx = 0.05 − 0.17 nm for the image data. Particle contours (n = 95, 127, 88, and 106) for the four samples are identified manually from raw TEM images.

Luminescence measurements
Steady-state emission spectra of the QDs are acquired in the range 450-650 nm with resolution Δλ = 1 nm using an Edinburgh Photonics FLS 980 time-resolved photoluminescence spectrometer. At each respective emission intensity maximum, fluorescence lifetime measurements by time-correlated single photon counting (TCSPC) are performed using the same instrument. Excitation lifetimes are recorded in a 500 ns range using 8192 channels.

Nuclear magnetic resonance measurements
Polydisperse polystyrene (PS) (190,000 $M_w$, Sigma-Aldrich) is mixed to 0.02% w/w in CDCl$_3$ (99.8 atom% deuterium, Sigma-Aldrich). Monodisperse PS standards (2,430 $M_w$, 3,680 $M_w$, 13,700 $M_w$, 18,700 $M_w$, 29,300 $M_w$, 44,000 $M_w$, 212,400 $M_w$, and 382,100 $M_w$, Sigma-Aldrich) are each mixed to 0.1% w/w, also in CDCl$_3$. These concentrations are chosen in order to be within the dilute polymer regime [18] and to avoid microscopic averaging effects [25]. NMR tubes are filled with the PS solutions and flame-sealed to avoid convection due to solvent evaporation. Pulsed Gradient Stimulated Echo (PGSTE) measurements of $^1$H self-diffusion [26] are performed at ambient conditions using a Bruker Avance III HD NMR spectrometer with an Ascend 600 MHz superconducting magnet, a micro5 probe, and a diff30 gradient coil. The PGSTE parameters for all measurements are repetition time TR = 10 s, gradient pulse duration δ = 1.576 ms, observation time Δ = 50 ms, and τ = 2.6 ms. The PGSTE measurements on the 0.02% w/w 190,000 $M_w$ polydisperse PS sample, performed in triplicate, use 192 averages and 32 linearly spaced gradient steps with maximum gradient value 10 T/m. The PGSTE measurements on the PS standard samples use 16 averages and 16 linearly spaced gradient steps with maximum gradient values chosen to attenuate 30% of the PS signals. The spectrally resolved PS signal attenuations are analysed using the Stejskal-Tanner equation, Eq 7, with an attenuation factor $b$ that scales as $(\gamma g \delta)^2$ for a sinusoidal gradient pulse shape, where $\gamma$ is the gyromagnetic ratio and $g$ is the gradient strength.
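As a rough illustration of the analysis step, the signal attenuation for a lognormal distribution of diffusion coefficients can be evaluated numerically. This is a sketch under stated assumptions: Eq 7 is taken to have the usual form I(b) = I0 ∫ p(D) exp(−bD) dD, the integral is computed in the variable u = ln D with the trapezoidal rule on 1000 grid points, and the integration range of ±8 standard deviations as well as all numerical values are hypothetical choices.

```python
import math

def attenuation(b, mu_D, sigma_D, I0=1.0, n_grid=1000, width=8.0):
    """I(b) = I0 * integral of p(D) exp(-b D) dD for lognormal p(D).

    With u = ln D, p(D) dD becomes a normal density in u, which is
    integrated by the trapezoidal rule over mu_D +/- width * sigma_D.
    """
    lower = mu_D - width * sigma_D
    upper = mu_D + width * sigma_D
    h = (upper - lower) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        u = lower + i * h
        density = math.exp(-0.5 * ((u - mu_D) / sigma_D) ** 2) / (
            sigma_D * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * density * math.exp(-b * math.exp(u)) * h
    return I0 * total

# Hypothetical lognormal parameters (D in m^2/s) and b values (s/m^2).
D0 = 5.0e-11
mu_D, sigma_D = math.log(D0), 0.4
b_values = [i * 5.0e9 for i in range(11)]
signal = [attenuation(b, mu_D, sigma_D) for b in b_values]

# Nearly monodisperse limit: the decay approaches a single exponential.
narrow = attenuation(1.0 / D0, mu_D, 0.01)
```

Fitting would then adjust (μ_D, σ_D) and I0 until the computed curve matches the measured attenuation, for example by nonlinear least-squares.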

Gel permeation chromatography measurements
Gel permeation chromatography (GPC) molecular weight distribution measurements of the 190,000 $M_w$ polydisperse PS are performed on an Agilent PL-GPC 220 system flowing trichlorobenzene at 1 ml/min through three PLgel 10 μm mixed-B columns at 150°C. Measurements are performed in triplicate on 1.5 mg/ml samples using a 200 μl injection volume and conventional calibration with Agilent EasiVial PS-H standards. The data is obtained as 400-500 points equidistant in $\log_{10} M$.

Data analysis for quantum dots
All particle contours identified from the TEM images are near-circular, and an equivalent diameter is computed by
$$d = 2\sqrt{\frac{A}{\pi}} = 2\,\Delta x\,\sqrt{\frac{N}{\pi}},$$
where $A = N\,\Delta x^2$ is the area and $N$ is the number of pixels within the contour. We make parametric model assumptions and fit lognormal distributions,
$$p_{d;k}(d) = \frac{1}{d\,\sigma_{d;k}\sqrt{2\pi}}\exp\left(-\frac{(\ln d - \mu_{d;k})^2}{2\sigma_{d;k}^2}\right),$$
to the diameters using maximum likelihood [27], resulting in parameter estimates $(\mu_{d;k}, \sigma_{d;k})$ for each sample.
Steady-state emission spectra are interpreted as probability distributions of wavelengths convolved with a Lorentzian curve (a Cauchy distribution with parameter $\gamma$, corresponding to a FWHM of $2\gamma$), accounting for line broadening. All spectra are fit simultaneously using nonlinear least-squares, thus simultaneously estimating $p_{\lambda|d}(d, \lambda)$ and $p_{\lambda;k}(\lambda)$ for fixed $p_{d;k}(d)$ by minimizing the global sum of squares summed over all spectra and all wavelengths.
For fluorescence lifetime measurements by TCSPC, time zero is determined by finding the channel with the highest photon count. The lifetime distributions are interpreted as superpositions of exponential distributions, governed by 'characteristic' (average) lifetime distributions $p_{\tau;k}(\tau)$. Data is acquired in discrete time, as a histogram, and fitting of the discrete-time counterpart of Eq 14 with an additional baseline (dark photon count) probability is performed using maximum likelihood and 'discretized' channel probabilities corresponding to the histogram binning, $q_{k,l}$ for bin $l$. For further details on fitting, refer to Röding (2014) [28]. Because the photon counts differ between lifetime measurements ($N = 2.6 \times 10^5$ to $2.1 \times 10^6$), weighted maximum likelihood is employed. All lifetime distributions are fit simultaneously using weighted maximum likelihood, thus simultaneously estimating $p_{\tau|d}(d, \tau)$ and $p_{\tau;k}(\tau)$ for fixed $p_{d;k}(d)$ by maximizing the weighted global log-likelihood function, summed over all lifetime distributions and all bins, in which $n_{k,l}$ is the number of photons in bin $l$ of distribution $k$ and $N_k$ is the total number of photons in distribution $k$.
Integrals in $(d, \lambda)$ and $(d, \tau)$ space are computed using Monte Carlo integration with $n_{\mathrm{mc}} = 10^4$ quasi-random samples [29]. As a model for $p_{\lambda|d}(d, \lambda)$ we choose a lognormal distribution with diameter-dependent mean $m_\lambda(d)$ and standard deviation $s_\lambda(d)$. We let both $m_\lambda(d)$ and $s_\lambda(d)$ be power-law type functions of $d$.
Data analysis is performed using Matlab R2015a (Mathworks, Natick, MA, US).
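The discretized channel probabilities and the weighted log-likelihood can be sketched as follows. This is an illustrative reading, not the published fitting code: the lifetime distribution is replaced by a hypothetical discrete mixture of two lifetimes, the decay is renormalised to the finite recording window, the baseline is spread uniformly over all bins, and the weight is assumed to be 1/N (one over the total photon count). Only the 500 ns range and the 8192 channels are taken from the text.

```python
import math

def bin_probabilities(taus, weights, n_bins, t_max, baseline):
    """Probability q_l that a detected photon falls in histogram bin l."""
    width = t_max / n_bins
    # Fraction of the mixture decay that falls inside the window [0, t_max].
    norm = sum(w * (1.0 - math.exp(-t_max / tau))
               for tau, w in zip(taus, weights))
    q = []
    for l in range(n_bins):
        t0, t1 = l * width, (l + 1) * width
        # Exact integral of the exponential mixture over the bin.
        decay = sum(w * (math.exp(-t0 / tau) - math.exp(-t1 / tau))
                    for tau, w in zip(taus, weights))
        q.append((1.0 - baseline) * decay / norm + baseline / n_bins)
    return q

def weighted_loglik(counts, q):
    """Log-likelihood sum of n_l * log(q_l), weighted by 1/N (assumed form)."""
    total = sum(counts)
    return sum(n * math.log(p) for n, p in zip(counts, q) if n > 0) / total

n_bins, t_max = 8192, 500.0          # channels and range (ns) from the text
q_true = bin_probabilities([15.0, 25.0], [0.5, 0.5], n_bins, t_max, 0.01)
q_wrong = bin_probabilities([20.0, 40.0], [0.5, 0.5], n_bins, t_max, 0.01)

# Expected counts for a hypothetical total of one million photons.
counts = [1.0e6 * p for p in q_true]
ll_true = weighted_loglik(counts, q_true)
ll_wrong = weighted_loglik(counts, q_wrong)
```

Dividing by the total count puts histograms with very different photon numbers on a comparable scale, so that no single measurement dominates a simultaneous fit.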
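The quasi-random Monte Carlo integration mentioned above can be illustrated with a two-dimensional Halton sequence. The text does not state which quasi-random sequence is used, so the Halton construction here is an assumption, as are the integrand and the function names. The integration ranges mirror the nominal diameter range (2 to 5 nm) and the spectral range (450 to 650 nm), and the sample count is the stated 10^4.

```python
def van_der_corput(i, base):
    """i-th element of the van der Corput sequence in the given base."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        x += (i % base) * f
        i //= base
        f /= base
    return x

def halton_point(i, bases=(2, 3)):
    """i-th point of a 2D Halton sequence on the unit square."""
    return tuple(van_der_corput(i, b) for b in bases)

def qmc_integrate(f, d_range, lam_range, n_mc=10_000):
    """Estimate the integral of f(d, lam) over a rectangle."""
    (d0, d1), (l0, l1) = d_range, lam_range
    area = (d1 - d0) * (l1 - l0)
    total = 0.0
    for i in range(1, n_mc + 1):
        u, v = halton_point(i)
        total += f(d0 + u * (d1 - d0), l0 + v * (l1 - l0))
    return area * total / n_mc

# Test integrand with a known integral: f(d, lam) = d * lam.
estimate = qmc_integrate(lambda d, lam: d * lam, (2.0, 5.0), (450.0, 650.0))
exact = 0.5 * (5.0 ** 2 - 2.0 ** 2) * 0.5 * (650.0 ** 2 - 450.0 ** 2)
```

Because the points are deterministic, the integral estimate is a smooth function of the model parameters, which is convenient when it sits inside a least-squares or maximum likelihood objective.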

Data analysis for polymers
Molecular weight distributions measured by GPC are fit using a normal distribution in $\log_{10} M$ space with parameters $(\mu_M/\ln 10, \sigma_M/\ln 10)$, using nonlinear least-squares and minimizing the sum of squared residuals $SS$ between the measured and the modelled distribution. By a well-known scaling property of the lognormal distribution family, the scaling law in Eq 6 implies that $D$ is also lognormally distributed, with parameters
$$\mu_D = \ln K - \nu\,\mu_M, \qquad \sigma_D = \nu\,\sigma_M .$$
This derivation is fairly straightforward by use of the standard change-of-variables technique for probability distributions. Hence, a lognormal distribution model for $p(D)$ is used in the Stejskal-Tanner equation in Eq 7, using numerical integration with $n_{\mathrm{grid}} = 10^3$ grid points to compute the values of the signal attenuation. The model is fit to the experimental NMR signal attenuation using nonlinear least-squares [20]. Bootstrapping [30] is used for obtaining confidence bounds for the conventionally estimated scaling law relationship. The eight measurement points are resampled with replacement $n_{\mathrm{boot}} = 10^5$ times, and the 2.5% and 97.5% percentiles are computed pointwise along the $M$ axis to yield a 95% nonparametric confidence interval.
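The bootstrap procedure for the conventional fit can be sketched as below. The data are synthetic: the molecular weights are those of the eight standards, while the diffusion coefficients are generated from hypothetical scaling-law parameters with hypothetical multiplicative noise. The number of resamples is reduced from 10^5 to 2000 to keep the example fast, and the percentile interpolation rule is one common choice among several.

```python
import math
import random

def fit_loglog(points):
    """Fit ln D = a + s * ln M; returns (a, s), or None if degenerate."""
    x = [math.log(m) for m, _ in points]
    y = [math.log(d) for _, d in points]
    if len(set(x)) < 2:          # all resampled points share the same M
        return None
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope

def percentile(sorted_values, q):
    """Percentile by linear interpolation between order statistics."""
    pos = q / 100.0 * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def bootstrap_band(points, M_grid, n_boot=2000, seed=0):
    """Pointwise 95% percentile band for the fitted curve along M_grid."""
    rng = random.Random(seed)
    curves = [[] for _ in M_grid]
    done = 0
    while done < n_boot:
        resample = [rng.choice(points) for _ in points]
        fit = fit_loglog(resample)
        if fit is None:
            continue             # skip degenerate resamples
        intercept, slope = fit
        for j, m in enumerate(M_grid):
            curves[j].append(math.exp(intercept + slope * math.log(m)))
        done += 1
    band = []
    for values in curves:
        values.sort()
        band.append((percentile(values, 2.5), percentile(values, 97.5)))
    return band

# Synthetic data: hypothetical K, nu and 5% multiplicative noise.
noise = random.Random(1)
M_values = [2430, 3680, 13700, 18700, 29300, 44000, 212400, 382100]
points = [(m, 3.0e-8 * m ** (-0.55) * math.exp(noise.gauss(0.0, 0.05)))
          for m in M_values]

M_grid = [10 ** (3.0 + 0.15 * j) for j in range(20)]
band = bootstrap_band(points, M_grid)
```

The band is computed pointwise, so it describes uncertainty in the fitted curve at each value of M separately rather than a simultaneous region for the whole curve.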

Supporting Information
S1 Code and Data Sets. Matlab programs and data sets used in this study. (ZIP)