A metric learning method for estimating myelin content based on T2-weighted MRI from a de- and re-myelination model of multiple sclerosis

Myelin plays a critical role in the pathogenesis of neurological disorders but is difficult to characterize in vivo using standard analysis methods. Our goal was to develop a novel analytical framework for estimating myelin content using T2-weighted magnetic resonance imaging (MRI) based on a de- and re-myelination model of multiple sclerosis. We examined 18 mice with lysolecithin induced demyelination and spontaneous remyelination in the ventral white matter of thoracic spinal cord. Cohorts of 6 mice underwent 9.4T MRI at days 7 (peak demyelination), 14 (ongoing recovery), and 28 (near complete recovery), as well as histological analysis of myelin and the associated cellularity at corresponding timepoints. Our MRI framework took an unsupervised learning approach, including tissue segmentation using a Gaussian Markov random field (GMRF), and myelin and cellularity feature estimation based on the Mahalanobis distance. For comparison, we also investigated 2 regression-based supervised learning approaches, one using our GMRF results, and another using a freely available generalized additive model (GAM). Results showed that GMRF segmentation was 73.2% accurate, and our unsupervised learning method achieved a correlation coefficient of 0.67 (top quartile: 0.78) with histological myelin, similar to 0.70 (top quartile: 0.78) obtained using supervised analyses. Further, the area under the receiver operator characteristic curve of our unsupervised myelin feature (0.883, 95% CI: 0.874–0.891) was significantly better than any of the supervised models in detecting white matter myelin as compared to histology. Collectively, metric learning using standard MRI may prove to be a new alternative method for estimating myelin content, which ultimately can improve our disease monitoring ability in a clinical setting.


Introduction
Myelin plays an important role in maintaining neurological functions. It can enhance signal conduction speed directly by 20-100 times, and provides critical neuroglial support for the underlying axons [1]. The integrity of myelin is implicated in normal development, ageing, and numerous pathological conditions of the nervous system such as multiple sclerosis (MS) [2]. Thus the availability of a non-invasive method to reliably characterize myelin content is critical.
There have been significant effort in deriving myelin assessment methods previously using magnetic resonance imaging (MRI) techniques. Example measures include: myelin water fraction [3,4], quantitative T1 or T2 [3][4][5], radial diffusivity [4,6], and magnetization transfer [3][4][5]7]. An alternative approach is assessing the distribution pattern (texture) of MRI pixels. Image texture refers to the unique inter-pixel relationships associated with a specific tissue. In MRI, such relationships highlight the intrinsic biophysical property of the underlying tissue as reflected by its signal intensity and relaxation [8]. Therefore, macroscopic analysis of MRI texture provides an integral estimation of the biochemical properties of tissue microstructure as done in histology. Combining with pertinent characterization techniques, texture analysis may enable robust evaluation of myelin content using myelinsensitive MRI. Indeed, texture analysis of T2-weighted MRI has shown considerable potential to detect changes in myelin integrity using various methods: first and second order statistics [9], directional statistics [10,11], or spectral analysis [12], although their sensitivity and specificity require further verification.
Analysis of MRI texture also includes methods that combine parameter estimation with tissue classification, such as Markov random fields (MRFs) [13,14]. In a MRF, image texture is modelled as a local interaction between neighbouring pixels. The outcome of a MRF classification model can directly serve as input for subsequent regression, or for latent variable analysis based on Mahalanobis distances [15], if the MRF is Gaussian (GMRF). Mahalanobis distance is a measure of the distance between a point and a distribution, characterized by the number of standard deviations that a point is away from the mean of a distribution. It takes account the covariance between variables in multi-dimensional measures and is scale invariant. Coregistering the Mahalanobis metric space to a physically meaningful metric space, such as histological myelin density, may allow direct, unsupervised prediction of tissue microstructural properties. Notably, advances in machine learning technologies over the past 20 years [16] have made supervised training of statistical models highly attractive, particularly given the availability of robust non-linear image co-registration algorithms [17,18]. Supervised learning is also typically easier to conduct and may perform better than unsupervised learning [19], but the ground truth is not always available [20].
In this study, we proposed an unsupervised learning approach for predicting histological myelin, and its counter-staining tissue, cellularity, based on T2-weighted MRI acquired from a mouse model of de-and re-myelination of MS [21,22]. Feature estimation used the Mahalanobis distance, following GMRF segmentation. Subsequently, we compared the derived MRI features to histology, and to results from two supervised regression models trained using histological myelin and cellularity standards; one model based on our GMRF segmentation results. To ensure accuracy, we also compared segmentation between our GMRF and an established MRF [14].

Materials and methods
This study was approved by the Health Sciences Animal Care Committee at the University of Calgary (ID: AC13-0246). During animal model creation, imaging, and tissue sampling, anesthetic agents including ketamine and xylazine were used to ensure optimal comfort of the animals and quality of the data.

Animal model
The source data used in this experiment have been previously reported in another study [12]. In brief, we examined 18 female C57BL/6 mice aged 8-10 weeks (Charles River, Quebec, Canada). Demyelinated lesions were induced by an expert (KSR) to the ventral white matter of thoracic spinal cord (T3/4) using a chemical toxin, lysolecithin. This model featured peak demyelination at 7 days post lysolecithin injection, active remyelination at 14 days, and nearly complete recovery by 28 days, as established in various studies [3,21,22].

MR imaging
Cohorts of 6 mice underwent MRI at each time-point (7, 14 and 28 days), totaling 18 animals. Image acquisition used a 35-mm-diameter volume coil at our 9.4T animal scanner (Bruker Biospin, Germany), following head and limb fixation with dedicated tapes and intraperitoneal injection of anesthesia (100 mg/kg ketamine/xylazine) immediately before MRI to keep the animals still. Throughout the experiment, the animals were under respiratory gating, and temperature and heartrate monitoring. The MRI protocol included a RARE T2 sequence with the following parameters: TR/TE = 2500/15 ms, echo train length = 8, slice thickness = 0.75 mm for 7 slices, field of view = 1.5x1.5 cm 2 , matrix = 256x256, and pixel size = 0.06x0.06 mm. As lesion induction was stereotactically controlled at a single injection site in each animal, we focused only on the MRI slice at the centre showing the largest lesion area.

Histology procedure
Post imaging, the animals were sacrificed immediately through transcardial perfusion with 4% paraformaldehyde. The thoracic spinal cords were then sampled and prepared for histology. Specifically, 20 μm-thick transverse slices were cut using a cryostat centered at the lesion, totaling 10-80 sections per animal to ensure complete coverage of the thickness of an individual MRI slice at the location (750 μm-thickness). The sections were stained with eriochrome cyanine for myelin, and neutral red counter stain for nuclei/cellularity [23], and digitalized using a confocal microscope (Olympus BX51, Japan). The section with the largest lesion area in each animal was used for MRI correspondence. Overall, 4 of the 18 animals were excluded from statistical analysis: 3 were sliced longitudinally (one per time-point), and one (1) did not show a MRI-matching lesion (mouse 4).

Image pre-processing for MRI and histology
Image processing was primarily performed using R, version 3.4.3 [24], via the EBImage package [25]. For histological images, we first manually removed background artifacts and loose tissue strands using Adobe Photoshop CC (2017.1.0). Then, we separated the myelin and cell nuclei staining as blue and red channels using R, and performed image normalization in 2 steps: pixel-wise, which divided each pixel value in a channel by the total pixel intensity across colors, and channel-wise, to the range [0, 1]. Standardization of MRI included masking of the spinal cord using ImageJ 1.5 [26] and intensity normalization to [0, 1]. The next step was image co-registration. This included primarily affine transformations [18] to align the longitudinal images in both MRI and histology, and to align histology and MRI cross-sectionally per subject. For the latter, we first embedded individual histological images in a white background such that each had a 4096x4096 dimension, and then down sampled them to 64x64 to match the scale of MRI using the Gaussian pyramid approach [27]. Subsequently, we segmented the MRI and histology images and affine co-registered the segmentation results from histology to MRI [18]. To further improve their alignment, we conducted a second, non-linear co-registration [17]. This co-registration included a pre-processing step that regressed MRI signal intensity against affine co-registered histology density using a random forest approach [28]. The purpose of the random forest was to linearize the relationship between MRI and histology to enhance the quality of co-registration (Fig 1).

Texture analysis
For both supervised and unsupervised approaches, we incorporated local textural information, derived from the pairwise relationships between an individual pixel and its neighbours, namely, Markov texture. Our neighbourhood structure, N , was symmetric about the central pixel and was sorted by radial distance then by angle with respect to the x-axis (Fig 2). In this study, we chose 20 symmetric neighbours, which gave rise to 10 unique parameters. This choice had shown to be highly effective previously in discriminating natural textures [29].

Unsupervised segmentation
The segmentation process seeks to convert a set of random variables, fx * i g, such as pixel intensity values in an image, into a hidden label membership field Z il , defined by: Based on histopathology, we a priori identified 4 major segmentation groups: background (BG), white matter (WM), grey matter (GM), and cell body (CB). In particular, we combined the substantia gelatinosa (SG) component with induced lesions to form the CB group, given their similarity in staining properties. For MRI, we included a 5 th label, partial volume (PV), to account for the potential partial volume effect between WM and GM [30]. In this study, we used a Gaussian Mixture Model (GMM) to segment histology images, and our customized GMRF to segment MRI, with the latter further compared with a different MRF-based opensource algorithm (FAST, Oxford, UK) as described below (Section 2.6.2).
2.6.1 Segmentation with GMMs. GMMs segment data into sub-populations with Gaussian statistics. We used GMMs for two purposes. First, segment the color histology images into 4 tissue types (BG, WM, GM, and CB), to generate the ground truth for validating MRI segmentations; calculate tissue means; and perform receiver operator characteristic (ROC) curve analysis. Second, initialize the GMRF segmentation by preliminary segmenting the MRI into 5 labels using T2 signal intensity alone. All GMM estimations were performed using the Mclust algorithm with default prior, which automatically selected the best model using the Bayesian information criterion [31].

Segmentation with an open-source MRF.
The FAST algorithm implemented in the open-source FSL library is a common method for tissue segmentation in neuroimaging [14], which uses a MRF akin to our GMRF model. For comparison, we used the R implementation [32] of FAST. Using the best recommended parameter values, the FAST segmented MR images into 3 options: BG/CB/GM/PV/WM, BG/CB/GM/WM, and BG/GM/WM. Based on our preliminary results, the FAST segmentation was not as accurate as our GMRF, so metric learning analysis used the latter only; our final results also confirmed the lack of competency of FAST in the current study (Table 1).

2.7
Unsupervised segmentation and metric learning with a GMRF 2.7.1 GMRF theory. Given a vector of image intensities: x * , and a possibly non-constant mean, m * , any GMRF can be written as: with inverse covariance Q [33]. Note that Q ij = 0 if x i and x j are not neighbours and are therefore conditionally independent [33]. We hypothesized that tissues were classifiable based on their differences in m * and Q. This also implied mathematically that these variables are hiddenlabel-dependent. We used the following parameterization: where (i, j) denotes image indices before vectorization.

Hidden label field estimation/GMRF segmentation.
Parameter estimation with the GMRF included 2 steps: 1) estimating the hidden (tissue) label field (GMRF segmentation), and then 2) estimating the parameters for each associated tissue label. These steps iterated until convergence. The latter was done using least-squares as loss functions [29], with the background label fixed for optimal segmentation at: μ = 0, β 00 = 10 −30 , and β ij = 0.
We used results from the GMMs to initialize our GMRF segmentation for T2-weighted MRI as described above. Iterative refinement of the label field took a classical approach known as Gibbs sampling achieved pixel-wise, using the simulated annealing technique [34,35]. Specifically, using an arithmetic coding structure [36], we iteratively updated the label field of all non-neighbouring pixels according to the proposal distribution (S1 File): where {−i} is the set of pixels excluding the ith one. This distribution allowed to calculate the probability of a pixel belonging to any associated label and then randomly sample a label for the pixel based on its probability. According to Eq (3), the first term on the right-hand-side of Eq (4) is: To validate the fitting results of our GMRF model, we also performed simulation via Gibbs sampling (S2 File) using Eq (5). Next, the probability of each label pixel was set a priori to: where g * ¼ ½0:010; 0:045; 0:030; 0:061; 0:854� T ¼ ½CB; GM; PV; WM; BG� T . These values were estimated from the median of the day 28 MRI scans that were essentially lesion free. Fixing g * allowed our algorithm to have a priori knowledge, which discouraged biologically unlikely outcomes to have high likelihood values (e.g CB group).
Finally, the temperature at the nth iteration, tðnÞ, was set using the heuristic schedule, which had the ability to converge to a physically realistic and stable solution [13,35]: where the number of iterations was determined by the condition that the final temperature be t f ¼ 0:25. When the temperature was 0 we picked the maximum of Eq (7), making it equivalent to iterated conditional modes [34], where the most likely label was picked at each iteration. MR images were initially segmented into 5 labels. When comparing to a segmentation with fewer labels (e.g. histology), we removed obsoleted labels (e.g. PV) and re-normalized Eq (4).

Myelin and cellularity feature estimation.
Our metric learning process enabled feature estimation. It involved constructing a metric space for MRI based on the output of GMRF segmentation, and then co-registering the MRI metric space to the corresponding histological metric space. The co-registration used tissue means as landmarks, which were estimated using our down-sampled GMM segmentation of histology (literature standards could be used here instead). For the MRI metric, we computed the Mahalanobis distance (d) between each pixel and the mean of each label as: d WM , d GM , and d CB . Each of these Mahalanobis distances was scaled by a constant optimized by Nelder-Mead, which gave the minimal root-mean-squared distance between tissue means of the MRI and histology metric spaces. The scaled distances were then used for numerical (Nelder-Mead) multilateration, which enabled co-registering the MRI-space position (WM, GM, CB) to the histology-space position (myelin, cellularity), and the histology-aligned MRI space positions for myelin and cellularity became our correspondent MRI features, respectively. As this process required at least three reference points or tissue groups, all of our segmentations contained three or more labels. The Mahalanobis distance of the ith pixel from the lth tissue mean was calculated using the mean, μ l , variance, β l,00 , neighbourhood parameters, β l,j and neighbour means μ j as: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi ffi ffi ffi ffi ffi ffi ffi

Supervised learning
Using histology results as the ground truth, we considered two non-linear regression models: segmentation regression, based on outputs from our GMRF segmentation, and a blackbox generalized additive model (GAM). Regarding the latter, we also included a linear regression as the simplest type of GAM (linear GAM), for quantitative comparison. Furthermore, we considered two loss functions for each model: root-mean-squared error (RMSE) and meanabsolute-error (MAE). Where not explicitly stated, we refer to MAE. Finally, we explored the impact of focal lesions by explicitly including a manual lesion mask, where for optimal reliability, we focused only on the four day 7 mice because they had the largest lesions.

Segmentation regression.
We extended our GMRF segmentation algorithm to include a response variable, the staining density of myelin/cellularity, and then used the GMRF outcomes and MRI signal intensity as predictors of the response. For the RMSE loss, we used a linear random effects model [37], while for the MAE, we performed median regression independently for each segmented label [38]. Our final model included the manual lesion masks as a new label using the modified GMRF segmentation achieved by overriding existing labels.

Generalized Additive Models (GAMs).
GAMs can capture non-linearities by allowing arbitrary relationships between predictor variables. Each observation y i takes the form [16]: where the unknown function, f, must be expanded along a user-specified basis. The linear model (linear GAM) represents the identity case f(x) = x. Our other GAM models used a thinplate spline with the default number of knots [39,40]. To understand the impact of predictors on our blue and red histology standards, we performed step-wise regression using: 1) T2 signal intensity alone (Markov-GAM), 2) T2 signal intensity plus texture (Markov GAM), and then 3) adding the manually identified lesion masks as a binary dummy variable, l, to #2 (Markov + GAM). The Markov texture here also used a 20-neighbourhood setting per pixel to ensure comparability with our GMRF-based approach. The full GAM model was:

Time cohort analysis
To understand how MRI estimates performed over time, we conducted a group-level analysis at each time-point for all three learning approaches and compared that with the corresponding histology. This was done without inclusion of lesion masks. The estimated outcomes for each learning method were combined per time-point after initial longitudinal co-registration of the source images. For both MRI and histology, the median maps per time-point were calculated.

Model validation and statistical analysis
All models were validated using the 14 mice with corresponding histology. Unsupervised approaches were assessed directly using Pearson correlation. Supervised approaches were leave-one-out cross-validated, with error estimated by the standard deviation. In addition, pairwise comparisons used the Mann-Whitney test, and segmentation accuracies used the binomial test. To further test the ability of our unsupervised MRI myelin feature versus regression, ROC curve analysis was performed [41], using the co-registered histology myelin segmented in the WM as ground truth. Comparison of the area under the ROC (AUC) curves used the Delong test. All statistics were computed using R [24]. A p < 0.05 was considered significant. All reported confidence intervals were 95%. Members in the background label in either histology or MRI were excluded in calculating summary statistics.

Segmentation outcomes
3.1.1 Qualitative results. Overall, we analyzed the MRI of all 18 animals, and MRI versus histology of 14 animals. Visually the GMM segmentation for histology was consistent with the expected anatomy (Fig 3). In MRI, our GMRF segmentation appeared better than the FAST segmentation in all animals. The GMRF correctly identified more tissue, and had smoother and more continuous labels for the main tissues, such as WM and GM, than FAST (Fig 4). The segmentation performance was the best for MR images that had relatively good visual contrast.   . Tissue-wise, the FAST appeared to be more specific to CB than the GMRF, showing 2.8 times as many true positive CB labels. However, the FAST also had 4.7 times as many false positive CB due to mislabeling of GM (Table 1). Overall, the total number of CB pixels was small (382/6307 = 6.1%) as compared to the WM (3092/6307 = 49.0) and GM (2833/6307 = 44.9%).
Out of the 12 texture parameters in the GMRF model, 4 were significantly different from zero (Fig 5): the mean (μ), conditional variance (β 00 ) and the first 2 of the 10 neighbouring parameters (β 01 and β 01 ). Moreover, using the parameters generated by the GMRF segmentation (S1 Table), our simulated images showed a high similarity to the initial MRI in all

Visual inspection.
The estimated myelin content of GM and WM was consistent with the histological findings from all methods, except for mice with poor MRI contrast: 13 and 14, or with large lesion cores: 2, 11 and 16 (Fig 6). Compared to histology segmentation, there appeared to be a systematic under-fit in MRI models, especially using RMSE, as indicated by the residual correlations (Table 2), and residual images (S2 Fig). In addition, the segmentation regression and the Markov GAM outcomes appeared smoother and with smaller lesion core discrepancies than the unsupervised GMRF features, with the segmentation regression predictions having less intra-mouse variability than the GAMs (e.g. mouse 14). Between loss functions, all MAE regressions showed better contrast than RMSE regressions in supervised learning (S3 and S4 Figs).

Quantitative results.
The correlation between the GMRF myelin feature and histology myelin was 0.67 (top quartile: 0.78). It was 0.70 (top quartile: 0.78) for segmentation regression and 0.70 (top quartile: 0.78) for the Markov GAM (Table 2; Fig 6). The AUC analysis showed that the GMRF myelin feature had a stronger diagnostic ability than any of the regression models (p<10 −15 ; Table 3). In addition, segmentation regression was more diagnostic than the GAMs in general (p = 10 −2 ), where the Markov GAM was more diagnostic than Markov-GAM based on T2 intensity alone (p<10 −11 ). With inclusion of manual lesion masks into the regression models (S2 Table), the GMRF myelin feature was still more diagnostic than the best segmentation regression+ (p<0.0010), but not significantly greater than the Markov

Estimation outcomes for the cellularity feature
The patterns and trends in the estimation of tissue cellularity were similar to those of myelin quantitatively. Overall, the correlation was 0.60 (top quartile: 0.70) between the predicted feature and histology cellularity using our GMRF, 0.69 (top quartile: 0.77) using segmentation regression, and 0.69 (top quartile: 0.77) using Markov GAM. Based on the R2 values (MAE models), the supervised models performed better with inclusion of the texture parameters than without (p = 0.0023 for Markov GAM). The Markov GAM was also better than the linear GAM (p = 0.0067), and tended to be better with inclusion of lesion masks than without (p = 0.25). Visually, the regression methods showed an under estimation in all lesions, and Markov GAM appeared to be the least accurate in mice 10 and 14 predictions (Fig 7 and Table 4).

Myelin and cellularity outcomes by time cohort
Preliminary group-level analyses between time-points from all associated animals showed that there was a continual recovery of myelin estimates, and loss of cellularity estimates over time, consistent with the evolving pattern in histology. The MRI outcomes appeared to recover faster than histology outcomes in both myelin and cellularity, particularly in lesion cores, and this observation seemed to be present in all three of the MRI-based learning methods (Fig 8).   These observations were consistent with the quantitative outcomes as seen in the affected ventral white matter. The median myelin feature increased by 31 ± 12% and 78 ± 14% at days 14 and 28 from day 7, and cellularity decreased by 28 ± 5% and 46 ± 3%. The degree of changes in segmentation regression and Markov GAM were similar over time for both the tissue types. On the other hand, myelin content increased by 18 ± 8% and 47 ± 11%, and cellularity decreased by 15 ± 4% and 43 ± 3% in histology at days 14 and 28 from day 7, less than MRI changes. Correlation analyses showed a similar pattern, where the correlation coefficient between MRI and histology was much higher at days 7 and 14 than at day 28, but all values remained >0.8 (Tables 5 and 6).

Discussion
In this study, we developed a new unsupervised metric learning method for assessing myelin content using T2-weighted MRI and compared the outcomes against histology and 2 supervised approaches. In general, T2-weighted MRI is pathologically non-specific in MS [42]. However, based on a well-established de-and re-myelination model of MS, we discovered that Markov texture in T2 MRI was highly associated with the myelin integrity in mouse spinal cord. Further, integrating Markov texture into either unsupervised or supervised machine learning models helped extract valuable myelin content information from standard MRI.
Our prediction models were anatomically robust for most tissue groups segmented except for a few lesion cores. In particular, while lacking smoothness, our unsupervised GMRF features correlated strongly with histology, similar to the supervised models. In addition, the GMRF features were less susceptible to inter-subject variability than the GAM-based models, and more robust in identifying myelin rich structures than all supervised models as shown by the AUC. The two regression-based supervised learning approaches performed similarly, and both appeared to be better with use of: Markov texture, the MAE loss function, and lesion masks, than without. The non-significant difference between results with or without inclusion of lesion masks in the regression models may be due to the small number of animals (only 4), and the relatively small number of lesion pixels, deserving further verification. Notably, our GMRF model contributed to both unsupervised and supervised learning in the present study, where it performed metric learning directly in the former, and provided robust results enabling segmentation regression in the latter. Therefore, the GMRF can be a useful technique for both. Compared to previously reported myelin predicting metrics using advanced MRI, our results were equally accurate. Prior R2 values were 0.49-0.61 (quantitative T1), 0.32-0.59 (quantitative T2), 0.67 (MWF), and 0.2-0.71 (MTR) in MS brain, and 0.56 (MWF) in rat sciatic nerve [43]. Our R2 values were similar to those with T1 and T2, and likely better with inclusion of manual lesion masks in supervised learning. In fact, our best fit (mouse 5, MAE segmentation regression) achieved a much higher R2 than all of the above reports, even without the use of lesion masks. In addition, based on the same experiment, a prior study has investigated the utility of high angular resolution diffusion imaging (HARDI), which is hypothesized to be more specific to myelin and axonal changes than the traditional diffusion tensor imaging metrics (e.g. radial and axial diffusivity). In that study, a HARDI measure termed orientation dispersion index showed a considerable increase with demyelination and recovery with remyelination in mouse spinal cord, with continuing increase of myelin content as reflected by the neurite density index [12]. Our cellularity predictions performed similarly to but somewhat worse than that of myelin in the present study, and appeared to be more specific when including lesion masks, as indicated by the high cellularity versus myelin contrast in lesions and the GM. It is worth noting that at least 5 of the 6 studies done previously used ROI-based analyses in histology-MRI correlation, which may have overestimated the R2 values relative to the pixel-wise approach done in our present study.
Tissue segmentation was a critical intermediate step in this study. Our GMRF segmentation was 15% more accurate than the initiating GMM, indicating the importance of texture information. In contrast, while also using a MRF, FAST segmentation performed 6% worse than the GMM. Technically, the FAST differs from our GMRF in three key respects: 1) FAST is optimized using iterative conditional modes, 2) it does not allow a priori tissue proportions, and 3) it uses a Markov hidden label field [14] whereas we used the Markov property directly on source MRI signal intensity. The increased performance of GMRF was likely due to our use of the optimizer, which applied increased iterations, and the temperature parameter that prevented convergence to a local minimum. The use of a priori tissue proportions might also help explain the difference, although our ad hoc tests suggested that the main contribution of including tissue proportions was preventing empty labels. Notably, the FAST showed a higher sensitivity to CB than the GMRF, although at the cost of a much greater false positive rate from GM to CB simultaneously. One potential explanation is that the two CB populations (SG and lesion pixels) were primarily similar in staining identity rather than texture. The GMRF is much stronger texture-dependence, making it more sensitive to the differences between SG and lesions than the FAST, and hence a lower detectability to the CB as a group. Our overall outcomes also indicate that proper lesion detection in MRI is critical for improving model accuracy. While we applied manual lesion masks, there is extensive literature on the topic of lesion identification using unsupervised approaches [20], which however is out of the scope of the present study.
The main challenge of our GMRF model was in classifying lesions with a relatively large core that contained unusually dark signal intensity on T2 MRI. Most of these lesion pixels were classified as GM rather than CB, which likely also propagated errors to the estimation of GMRF features and segmentation regression. This was most obvious in a few day 7 lesions, which were mistakenly characterized as having increased myelin and decreased cellularity before mask use. The exact cause of low T2 signal intensity within the lesions remained unclear. As our lesions were induced by the injection of a chemical toxin, the decreased signal could be due to the following factors: mechanical damage, the impact of degraded myelin (debris), and an iron susceptibility effect associated with the myelin-forming cells and inflammatory cells [44]. Given the purpose of the current study, degraded myelin was not evaluated (eriochrome cyanine detects only healthy myelin). Previous work [45] reported the same problem when segmenting lesions using a similar model. Nonetheless, such intra-lesion signal loss on T2-weighted MRI is not common in the MRI of humans with MS, so the potential limitations related to the lesion core are not expected to have a major effect on future testing of our method in a clinical setting. Moreover, with application of lesion masks in supervised learning, our results demonstrate the feasibility to achieve reasonable estimates of myelin and cellularity features for these lesions.
Using co-registered longitudinal data, we also explored the changes in myelin and cellularity at a cohort level. By 'averaging' (median) outcomes across individuals per time-point, this analysis allowed to minimize random errors while preserving anatomy, including the lesions. We observed a significant increase in Pearson correlation between our models and histology in both myelin and cellularity. Over time, there was a reduction of cellularity in lesions, and a recovery of WM in lesion periphery (decreasing lesion size) in all model outcomes. Moreover, the MRI-based myelin and cellularity estimates seemed to recover faster than histology, as confirmed both visually and quantitatively: there was a near-complete resolution of lesions by day 28 in MRI measures in contrast to histology, and a sudden drop in Pearson correlation between them at day 28 versus days 7 and 14. These observations may suggest that the MRI features reflect more physical changes than lipid content (blue) and cell nuclei (red) as highlighted in histology [23]. T2-weighted MRI is sensitive primarily to water content, but that may also depend on the presence of macromolecules such as lipids, and proton-proton interactions [46]. We suspect that due to the recovery differences between MRI and histology metrics, the MRI models could no longer distinguish tiny lesion areas from the WM when the amount of macromolecules dropped below a certain threshold such as that occurred at day 28.
We note a few limitations in this study. First, the number of animals per time-point was small, limiting broad conclusions. Second, the inherent challenges in histological analysis may have affected the results. The presence of tears and bubbles in stained images is not uncommon, requiring pre-processing to optimize segmentation accuracy. Moreover, it is difficult to wash the blue staining out of nuclei without also washing the stain off of myelin [23], causing the nuclei to be stained with both red and blue (purple). While almost unavoidable in this type of studies, it may have created ambiguity in the ground truth. Variations in staining density is another common issue, for which we performed image normalization to compensate. Third, accurate MRI and histology correspondence has been a long-term challenge, potentially adding uncertainties to the analyses. One issue comes from the resolution differences between the modalities. In this study, we applied several image preparation steps to improve alignment, including 2 iterations of image co-registration. Another issue relates to the ex vivo nature of histology versus in vivo MRI. However, immediately before MRI, we applied high-dose anesthesia to keep the animals still, making optimal acquisition of images possible. In addition, all animals had their tissue samples prepared immediately after imaging for histology, which allowed best preservation of tissue integrity. Further, histological validation of MRI measures has been one of the standard approaches in both animal and human studies, and our current observations are highly consistent with results from prior studies using the same model [3,12]. Fourth, our image contrast in MRI is relatively modest. Our in-house experiments show that images with better quality would result in greater accuracy in both segmentation and feature estimation with the GMRF, consistent with the rule-of-thumb stating that good contrast facilitates good segmentation [14]. Therefore, our current results seemed to set a low threshold for the requirement of the quality of MRI scans, and our findings are rather conservative. Finally, we did not have a true longitudinal cohort, limiting systemic time series analyses. However, cohort-level analysis following strict longitudinal co-registration served as an alternative approach, particularly for this established animal model, which is highly predictable in both disease course and lesion location [21,47]. In the future, we seek to extend our GMRF model to a full Gaussian random field or beyond [36,48], replace it with a multi-scale approach [49], or modify it with outlier detection [20,50] to improve performance. Notably, our approach was purposefully designed as stepwise and modular, to enable further validation and extension. In addition, we plan to test alternative metrics, such as an exact geodesic distance [51] to compare with the Mahalanobis distance, and test additional imaging sequences [45] and anatomical regions (e.g. the brain) to further validate our method.

Conclusions
Using standard T2-weighted MRI and a robust model for de-and re-myelination, we implemented a novel GMRF algorithm for both unsupervised and supervised learning of myelin and cellularity. Our MRI myelin feature performed similarly well to the literature that used advanced MRI, and to supervised learning approaches. Likewise, our GMRF segmentation outperformed a common FSL FAST model by >20% in accuracy. Collective results suggest that image analysis using adequate mathematical models has the ability to extract valuable myelin information from standard MRI. With further confirmation, the GMRF method can be similarly applied to interrogate myelin characteristics concealed in clinical MRI of humans with MS or similar diseases. This will be achievable as long as there is a dedicated tissue group that has high cellularity in the images, such as focal lesions, which in fact are the hallmark pathology of many demyelinating disorders. Integrating such novel approaches in clinical practice can advance the use of standard MRI, provide new information about myelin injury and repair, and add new quantitative measures in disease diagnosis and management. Zhang.