Multidimensional analysis and detection of informative features in human brain white matter

The white matter contains long-range connections between different brain regions and the organization of these connections holds important implications for brain function in health and disease. Tractometry uses diffusion-weighted magnetic resonance imaging (dMRI) to quantify tissue properties along the trajectories of these connections. Statistical inference from tractometry usually either averages these quantities along the length of each fiber bundle or computes regression models separately for each point along every one of the bundles. These approaches are limited in their sensitivity, in the former case, or in their statistical power, in the latter. We developed a method based on the sparse group lasso (SGL) that takes into account tissue properties along all of the bundles and selects informative features by enforcing both global and bundle-level sparsity. We demonstrate the performance of the method in two settings: i) in a classification setting, patients with amyotrophic lateral sclerosis (ALS) are accurately distinguished from matched controls. Furthermore, SGL identifies the corticospinal tract as important for this classification, correctly finding the parts of the white matter known to be affected by the disease. ii) In a regression setting, SGL accurately predicts “brain age.” In this case, the weights are distributed throughout the white matter indicating that many different regions of the white matter change over the lifespan. Thus, SGL leverages the multivariate relationships between diffusion properties in multiple bundles to make accurate phenotypic predictions while simultaneously discovering the most relevant features of the white matter.


Introduction
Non-invasive methods for measuring human brain structure and function have revolutionized our understanding of brain function. These measurements have demonstrated that interactions between networks of brain regions give rise to coordinated information processing and to the complex adaptive behavior that characterizes human cognition. Diffusion-weighted Magnetic Resonance Imaging (dMRI) provides a unique view into the physical properties of the connections that comprise these networks, by sensitizing the measurement to the directional diffusion of water in each voxel [1,2]. Methods for computational tract-tracing from diffusion MRI, or tractography, combine the estimates of fiber orientations in each voxel to form streamlines that traverse the volume of the white matter [3,4]. A variety of methods can be used to delineate the trajectory of major neural pathways among these streamlines [5]. Tractometry uses the results of tractography and models of tissue biophysics based on the patterns of diffusion in each measurement voxel to assess the physical properties of the white matter along specific pathways [6,7]. In some previous tractometry-based studies, tissue properties along the length of each tract were summarized by taking the mean along each bundle, but there is a large body of evidence showing that there is systematic variability in the values of diffusion metrics along the trajectory of each bundle. This justifies retaining the individual samples along the length of each bundle [5,6,8,9], hereafter referred to as tract profiling. While this retains important information about each individual's white matter, it also presents statistical challenges due to the dimensionality of the data. In past work, comparisons between groups or across individuals were done independently at each node of each bundle, for each diffusion metric. This approach is exhaustive, but statistical power is compromised by a multiple comparison problem [8,10-12].
An alternative that circumvents the multiple comparison problem is to select just a few tracts to compare in each individual, or even segments of these tracts based on a priori hypotheses. This approach is appropriate when the biological basis of the process of interest is relatively well understood (for a recent example, see [13]). Sometimes, these approaches are combined: a bundle is selected based on a priori knowledge, and all the data in the bundle of interest are used together to fit a model that can predict differences between individuals [14].
The present work aims to balance predictive accuracy with descriptive power [15,16] by capitalizing on all of the available data across all bundles, while also retaining and elucidating spatial information about the locations that are most informative for discriminative performance. Unlike many previous analyses of white matter (e.g., using TBSS [17] and connectometry [18]), which focus on classical inference about the differences between groups or individuals in terms of their white matter properties, the focus of the present work is on the combination of features that facilitate accurate prediction of the individual differences in a particular phenotype. For example, an accurate prediction of which individuals in a group have a particular disease, or a prediction of their age with small error. The distinct goals of inference and prediction are not necessarily incongruent, but can sometimes be in tension [19]. This distinction is also similar to the distinction between "encoding" and "decoding" used in the functional MRI literature [20]. In the present work, we predict the phenotypical variance in a group of subjects, or classify group membership, based on a linear combination of the features estimated with tract profiling.
Using this approach, we first need to deal with the large and asymmetric dimensionality of the data: tract profile data usually has many more features (i.e., number of measurements per individual) than samples (number of subjects), which makes inferences from the data about phenotypical differences between individuals ill-posed. This regime is the target of several statistical learning techniques, and is often solved by various forms of regularization.
The Lasso algorithm minimizes the sum of the absolute values of contributions of each feature [21]. This tends to shrink to zero the contributions of many of the features, providing results that are both accurate and interpretable. When additional structure is available in the organization of the data, regularization algorithms can take advantage of this structure. For example, if the features lend themselves to a natural division into different groups, the Group Lasso (GL) can be used to select groups of features, rather than individual features [22]. The Sparse Group Lasso (SGL) elaborates on this idea by providing control both of group sparsity, as well as overall sparsity of the solutions [23]. Because the features measured with tractometry lend themselves to grouping based on the tracts from which each measurement is taken, GL and SGL provide useful tools for linear model fitting in problems of this form. Here we develop an implementation of SGL that is well suited to the analysis of tract profile data. In addition, we demonstrate the power and flexibility of this approach by applying it to both classification and continuous prediction problems. Our data flow is represented in Fig 1 and explained in further detail in the Methods section.
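In standard notation [23] (the exact normalization of the paper's Eq (3) may differ), the SGL estimate solves:

```latex
\hat{\beta} = \operatorname*{arg\,min}_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta^{(g)} \rVert_2
  \;+\; \alpha\,\lambda\,\lVert \beta \rVert_1 ,
```

where β^(g) denotes the coefficients of group (bundle) g, p_g is the size of that group, λ sets the overall regularization strength, and α ∈ [0, 1] mixes the two penalties: α = 0 recovers the Group Lasso and α = 1 recovers the lasso.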
One more approach to multivariate analysis of neuroimaging data is to perform inference or prediction in a transformed feature space rather than the original diffusion metric feature space. For example, in Lasso PCR, diffusion features can be projected onto a principal components (PC) basis for use in a sparsity constrained model [24,25]. While these approaches achieve high predictive performance and may even yield biologically interpretable principal components [12], they neglect anatomical grouping information. We view these approaches as complementary to the SGL-based approach: PCR based approaches seek to model brain-phenotype relationships using the most parsimonious representation of variance in the diffusion measures. In contrast, our SGL-based approach seeks to establish whether prior knowledge of anatomical grouping improves modeling of brain-phenotype relationships. Motivated by this distinction, we also introduce a union of the two approaches called PCR-SGL, in which each group of features is independently transformed into its PC basis, thereby retaining the anatomical grouping information of the original feature space. We demonstrate circumstances in which this approach both helps and hinders predictive performance.
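A minimal sketch of the group-wise projection underlying PCR-SGL (the function name is hypothetical; AFQ-Insight's actual implementation may differ in details):

```python
import numpy as np
from sklearn.decomposition import PCA

def groupwise_pca_transform(X, groups):
    """Project each anatomically defined group of columns onto its own
    principal-component basis, preserving the group structure.

    X      : (n_subjects, n_features) feature matrix
    groups : list of column-index arrays, one per bundle/metric group
    """
    transformed = []
    for idx in groups:
        pca = PCA()  # full PC basis fit within this group only
        transformed.append(pca.fit_transform(X[:, idx]))
    # Concatenate; each block of columns is still one anatomical group,
    # so SGL's group penalty can be applied to the transformed features.
    return np.hstack(transformed)

# Toy example: 20 subjects, 3 groups of 10 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 30))
groups = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
X_pc = groupwise_pca_transform(X, groups)
print(X_pc.shape)  # (20, 30)
```

Because each group is projected separately, within-group variance is compacted into leading components, but covariance *between* bundles is not represented, which is the limitation discussed in the Results.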

Results
We developed a method for analyzing dMRI tract profile data that uses the Sparse Group Lasso (SGL) to select features that are sparse both at the group (bundle) level, as well as overall. We demonstrate the use of this method on four different datasets in both a classification setting and a regression setting.

SGL accurately detects ALS from tractometry data
Using data from a previous study of patients with amyotrophic lateral sclerosis (ALS) [26], we tested the performance of SGL in a classification setting. The previous study predicted ALS status with a mean accuracy of 80% using a random forest algorithm based on a priori selection of features only within the CST bundle-of-interest. SGL delivers improved predictive performance, with a cross-validated accuracy of 83% and an area under the receiver operating characteristic curve (ROC AUC) of 0.88, without the need for a priori feature engineering. We also predicted ALS diagnosis using the PCR-SGL, wherein each group of features is independently transformed to its PC basis and an SGL model is fit to the transformed features. This model achieves 88% accuracy and an ROC AUC of 0.9 (Fig 2a). In addition to this classification performance, both SGL and PCR-SGL also identify the white matter tracts most important for ALS classification. Fig 2b shows that PCR-SGL identified as potential disease biomarkers the diffusion measures in the CST from the cerebral peduncle to the corona radiata, agreeing with the previous study from which these data were extracted [26]. The relative importance of white matter features is captured in the β coefficients from Eq (3). Fig 2c depicts these coefficients along the right CST, plotted over the FA values for the control and ALS subject groups (see S1 and S2 Figs for tract profiles of all 18 tracts in these groups). We find that SGL and PCR-SGL select FA metrics in the corticospinal tract and particularly in the right corticospinal tract as most important to ALS classification, confirming previous findings [27-36] and identifying the portions of the brain that were selected a priori in the previous study from which we obtained the data [26].
To assess the added value of incorporating anatomical knowledge into our model, we compared our classification results to those obtained from four other models: a pure lasso model, an elastic net model, a bundle-mean lasso (i.e., a lasso model trained only on the mean metric values from each bundle), and a principal components regression lasso (Lasso PCR) model [24,25]. These models achieved accuracies of 76%, 76%, 79.5%, and 71.5%, respectively, and ROC AUC values of 0.76, 0.75, 0.82, and 0.71, respectively, using the same cross-validation strategy used to assess the SGL model. This difference in performance justifies the additional complexity of SGL over simpler sparsity-inducing regression strategies. The relatively poor performance of Lasso PCR suggests that features with small variance are nonetheless relevant for predicting ALS diagnosis, or, equivalently, that group differences between ALS diagnoses are not the dominant source of variance in the diffusion metrics.
The β coefficients exhibit high bundle-level sparsity; only some bundles are important. This can be confirmed by observing the value of α, the regularization hyperparameter that controls the mixture of the Group Lasso and lasso penalties, selected through nested cross-validation (see Eq (3)). If α is closer to zero, it indicates that the phenotype in question preferentially correlates with only a few groups of covariates. For the ALS dataset, the SGL model has α = 0.21 and the PCR-SGL model has α = 0.4, confirming that the white matter correlates of ALS reside mostly in one bundle, namely the CST.
Analyzing the ways in which the model mislabels individuals also provides insight. We found that mislabelled subjects are outliers relative to their group with respect to diffusion features of the CST (Fig 2d and 2e). The false positive classifications have reduced FA in one or more of the three sections of the CST where ‖β‖ is large in Fig 2c. The false negative subjects have FA profiles that oscillate between the two group means. Thus, when the SGL method predicts incorrectly, it does so in a comprehensible manner.

SGL accurately predicts age from tractometry data
To test the performance of SGL in a continuous regression task, we focus here on the prediction of biological age in three datasets named WH, HBN, and Cam-CAN (see the Methods section for a description of each dataset). Prediction of "brain age" is a commonly undertaken task in neuroimaging machine learning, in part because these predictions, and deviations therefrom, may be diagnostic of overall brain health (for a review, see Cole et al. [37]). The three datasets used here contain data from 76, 1651, and 640 subjects, respectively, ranging from 6-50, 5-21, and 18-88 years of age, respectively. In each case, biological age was used as the predicted variable y. SGL was fit to the tract profile features FA and mean diffusivity (MD) in 18 major brain tracts, with diffusion metrics extracted from diffusion tensor imaging (DTI) for the WH dataset and diffusion kurtosis imaging (DKI) [43] for the HBN and Cam-CAN datasets (see S3-S8 Figs for the full tract profile information in all tracts/datasets).
To evaluate the fit of the model, we used a nested cross-validation procedure. In this procedure, batches of subjects are held out. For each batch (or fold), the model is fully fit without these data. Then, once the parameters are fixed, the model is applied to predict the ages of the held-out subjects based on the linear coefficients. This scheme automatically finds the right level of sparsity and fits the coefficients of the ill-posed linear model, while guarding against overfitting. SGL accurately predicts the age of the subjects in this procedure, with a median absolute error of 2.67, 1.45, and 6.02 years for the WH, HBN, and Cam-CAN datasets, respectively, and coefficients of determination R² = 0.52, 0.57, and 0.77, respectively (see Fig 3, top panels). The predictions for Cam-CAN are competitive with a recent state-of-the-art prediction [44], which used streamline density to estimate the brain's structural connectivity and achieved R² = 0.63. The median absolute errors are also lower than the results of a recent study that predicted age in a large sample that included the Cam-CAN data, and was based on diffusion MRI features [45].
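The nested scheme can be sketched as follows (synthetic data; the lasso stands in for SGL and the hyperparameter grid is illustrative): an inner loop selects hyperparameters on the training folds only, and an outer loop scores predictions on subjects never seen during tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                               # stand-in features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=100)   # synthetic "age"

# Inner loop: tune the regularization strength on training folds only.
inner = GridSearchCV(Lasso(max_iter=10_000),
                     {"alpha": [0.01, 0.1, 1.0]},
                     cv=KFold(n_splits=3, shuffle=True, random_state=0))

# Outer loop: each fold is held out, the model is tuned and fit without it,
# and predictions for the held-out subjects are scored.
scores = cross_val_score(inner, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print(round(scores.mean(), 2))
```

Because hyperparameters are chosen inside each outer training fold, the outer R² estimates generalization performance rather than fit to the tuning data.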
In contrast to the ALS classification case, the selected α values indicate high global sparsity over group sparsity, with α = 0.83, 0.67, and 0.68, for the WH, HBN, and Cam-CAN datasets, respectively. The model weights are distributed over many different tracts and dMRI tissue properties (see Fig 3d-3f and S3-S8 Figs). This demonstrates that SGL is not coerced to produce overly sparse results when a more accurate model requires a dense selection of features. Furthermore, inspecting the portions of bundles with larger coefficients in Fig 3g-3i reveals that SGL selects informative regions where diffusion properties are different between the age quintiles.
As with ALS classification, we also compared SGL performance with results obtained using the pure lasso, elastic net, bundle-mean lasso, and Lasso PCR. The pure lasso models achieved R² = 0.47, 0.54, and 0.70 for the WH, HBN, and Cam-CAN datasets, respectively, using the same cross-validation strategy used to assess the SGL. The elastic net models achieved R² = 0.46, 0.59, and 0.73, respectively. The bundle-mean lasso models achieved R² = 0.33, 0.22, and 0.55, respectively. And the Lasso PCR models achieved R² = 0.54, 0.57, and 0.74, respectively. The SGL models' performance improvement over the lasso is more modest than in the ALS classification case, which accords with the selected values of the α hyperparameter: these SGL models were more lasso-like in their sparsity penalties, so they are more lasso-like in their predictive performance.
Model performance across all four datasets and all six model types is summarized in Fig 4. In contrast to the ALS classification case, Lasso PCR performs competitively in age regression, suggesting that aging is a significant source of global variance in the tract profiles. On the other hand, the PCR-SGL models destroy cross-bundle covariance information in their initial group-wise PC projection step (see Eq (4)), limiting performance for globally distributed phenomena. Conversely, the SGL models are able to adapt to this regime.

Discussion
We present here a novel method for analysis of dMRI tract profile data that relies on the Sparse Group Lasso [46] to make accurate predictions of phenotypic properties of individual subjects while, simultaneously, identifying the features of the white matter that are most important for this prediction. The method is data-driven and it is broadly applicable to a wide range of research questions: it performs well in predicting both continuous variables, such as biological age, as well as categorical variables, such as whether a person is a patient or a healthy control.

Fig 4. Because aging affects the white matter globally, group structure-blind methods like the elastic net and Lasso PCR perform well. Nonetheless, the SGL models show competitive predictive performance, adapting to a problem where group structure is not as informative. PCR-SGL performs poorly in this regime because its initial group-wise PC projection destroys between-bundle covariance. The bundle-mean lasso performs poorly, demonstrating the value of along-tract profiling.

PLOS COMPUTATIONAL BIOLOGY
In both of these cases, SGL out-performs previous algorithms that have been developed for these tasks [26, 44,45]. The nested cross-validation approach used to fit the model and make both predictions and inferences from the model guards against overfitting and tunes the degree of sparseness required by the algorithm. This means that SGL can accurately describe phenomena that are locally confined to a particular anatomical location or diffusion property (e.g., FA in the CST) as well as phenomena that are widely distributed amongst brain regions and measured diffusion properties.
Specifically, we demonstrated that the algorithm correctly detects the fact that ALS, which is a disease of motor neurons, is localized to the cortico-spinal tract. This recapitulates the results of previous analysis of these same data, using a bundle-of-interest approach [26]. As with the original study, it is unclear whether our strong predictive performance is based on image properties that would be visible in patient scans or whether it is identifying a new subclinical disease biomarker. In contrast, for the analysis of biological age, the coefficients identified by the algorithm are very widely distributed across many parts of the white matter, mirroring previous results that show a large and continuous distribution of life-span changes in white matter properties [39]. We present age quintile bundle profiles and SGL β coefficients for all four datasets in S3-S8 Figs. For age regression, substantial differences between the datasets preclude generalization across all three: (i) different acquisition parameters, which can be challenging to harmonize [47], (ii) different diffusion models, with DTI for the WH dataset and DKI for the HBN and Cam-CAN datasets, (iii) different age ranges and distributions (which is evident in the figure legends for Fig 3), with HBN being a developmental dataset, while WH and Cam-CAN are lifespan maturation datasets, and (iv) different anatomical extents, with the WH streamlines truncated to remain within the bundle's bounding regions of interest (the default behavior in the legacy mAFQ) and the HBN and Cam-CAN streamlines allowed to retain their full extent from whole-brain tractography (the default behavior in pyAFQ). For a biological characteristic that has widespread effects in the brain, validity is difficult to assess [48], so it is both unsurprising and vexing that we can do so well at predicting within each dataset, and yet do poorly in interpreting informative features across instruments.
We also implemented a union of the Lasso PCR and SGL approaches by first projecting each group onto its PC basis. This PCR-SGL approach transformed the feature space into a more parsimonious representation of its variance while also preserving group structure. But it was unable to efficiently represent cross-bundle covariance, which is an advantage of Lasso PCR. As a result, PCR-SGL performed well in ALS classification, where the white matter interactions are highly localized, and poorly in age regression, where the white matter interactions are global.
One drawback of our approach is also evident in the age regression results. There are portions of the bundle profiles in Fig 3g-3i for which the β coefficients are zero but for which the bundle profiles clearly contain an age-differentiating signal. While SGL correctly identifies certain features as important, it is not guaranteed to identify all of the important features. It will identify a parsimonious set of important features, dense enough to predict the phenotype and sparse enough (at either the group or global level) to satisfy the sparsity constraints. Another limitation of our approach is its computational execution time, which, in our experience, is roughly five times slower than simpler models like the lasso. This slow-down is likely to be insignificant for "production" cycles in which researchers report their findings, but may be prohibitive for "development" cycles in which researchers repeatedly train models while adjusting their analysis pipeline. In practice, we circumvent this limitation by developing our analysis infrastructure with fast, simple models, and substituting slower, more performant models afterwards. The consistent scikit-learn-based application programming interface of AFQ-Insight provides a good basis for this pattern.
Taken together, our results demonstrate the promise of the group-regularized regression approach. Even at the scale of dozens of subjects, the results provided by SGL are both accurate and interpretable [15]: tract profiling capitalizes on domain knowledge to engineer meaningful features; SGL scores these features based on their relative importance, enables a visualization of these feature importance scores in the anatomical coordinate frame of the bundles (e.g., Figs 2b and 3d-3f), and provides a means to understand model errors (e.g., Fig 2d and 2e). Thus, this multivariate analysis approach achieves high cross-validated accuracy for precision medicine applications of dMRI data and identifies relevant features of brain anatomy that can further our neuroscientific understanding of clinical disorders.
Neuroscience has entered an era in which consortium efforts are amassing large datasets of high-quality dMRI measurements to address a variety of scientific questions [40,49-52], but the volume and complexity of these data pose substantial challenges. Tract profiling followed by analysis with SGL provides a promising approach to distill meaningful insights from the wealth of data measured in these efforts.
SGL has many other potential applications in neuroscience, because of the hierarchical and grouped nature of many data types that are collected in multiple sample points within anatomically-defined areas. For example, this method may be useful to understand the relationship between fMRI recordings and behavior, where activity in each voxel may co-vary with voxels within the same anatomical region and form features and groups of features. Similarly, largescale multi-electrode recordings of neural activity in awake behaving animals are becoming increasingly feasible [53,54] and these recordings can form features (neurons) and groups (neurons within an anatomical region).
The results we present here also motivate extensions of the method using more sophisticated cost functions. For example, the fused sparse group lasso (FSGL) [55] extends SGL to enforce additional spatial structure: smoothness in the variation of diffusion metrics along the bundles. As brain measurements include additional structure (for example, bilateral symmetry), future work could also incorporate overlapping group membership for each entry in the tract profiles [56]. For example, a measurement could come from the corpus callosum, but also from the right hemisphere. This would also require extending the cost function used here to incorporate these constraints. Similarly, unsupervised dimensionality reduction of tractometry data (e.g., [12]) could also benefit from constraints based on grouping, as our implementation of PCR-SGL suggests.
The method is packaged as open-source software called AFQ-Insight that is openly available at https://github.com/richford/AFQ-Insight, and provides a clear API to allow for extensions of the method. The software integrates within a broader automated fiber quantification software ecosystem: AFQ [5] and pyAFQ [57], which extract tract profile data from raw and processed dMRI datasets, as well as AFQ-Browser, which visualizes tract profile data and facilitates sharing of the results of dMRI studies [58]. To facilitate reproducibility and ease use of the software, the results presented in this paper are also provided at https://github.com/richford/afq-insight-paper as a series of Jupyter notebooks [59].

Data
Four different datasets were used here:

The WH data were acquired using a twice refocused spin echo sequence, in which there is sufficient time for eddy currents to subside between the application of the gradients and the image acquisition; therefore, no eddy current correction was applied, but motion correction was applied before fitting the diffusion tensor model [60] in every voxel using a robust fit [61] on the b = 1000 s/mm² shell only. We will refer to this dataset as WH.

• Diffusion data preprocessing: Any images with a b-value less than 100 s/mm² were treated as b = 0 images. MP-PCA denoising as implemented in MRtrix3's dwidenoise [68] was applied with a 5-voxel window. After MP-PCA, B1 field inhomogeneity was corrected using dwibiascorrect from MRtrix3 with the N4 algorithm [64]. After B1 bias correction, the mean intensity of the DWI series was adjusted so that the mean intensity of the b = 0 images matched across each separate DWI scanning sequence. FSL (version 6.0.3:b862cdd5)'s eddy was used for head motion and eddy current correction [69]. Eddy was configured with a q-space smoothing factor of 10, a total of 5 iterations, and 1000 voxels used to estimate hyperparameters. A linear first-level model and a linear second-level model were used to characterize eddy current-related spatial distortion. q-space coordinates were forcefully assigned to shells. An attempt was made to separate the field offset from subject movement. Shells were aligned post-eddy. Eddy's outlier replacement was run [70]. Data were grouped by slice, only including values from slices determined to contain at least 250 intracerebral voxels. Groups deviating by more than 4 standard deviations from the prediction had their data replaced with imputed values. Data were collected with reversed phase-encode blips, resulting in pairs of images with distortions going in opposite directions.
Here, b = 0 reference images with reversed phase encoding directions were used along with an equal number of b = 0 images extracted from the DWI scans. From these pairs, the susceptibility-induced off-resonance field was estimated using a method similar to that described in [71]. The fieldmaps were ultimately incorporated into the eddy current and head motion correction interpolation. Final interpolation was performed using the jac method. Several confounding time series were calculated based on the preprocessed DWI, including framewise displacement (FD), using the implementation in Nipype, following the definitions of [72]. The DWI time series were resampled to ACPC, generating a preprocessed DWI run in ACPC space.
Many internal operations of QSIPrep use Nilearn 0.6.2 [79], RRID:SCR_001362 and DIPY [80]. For more details of the pipeline, see the section corresponding to workflows in QSIPrep's documentation. We will refer to this dataset as HBN.
4. Diffusion MRI data from the Cambridge Centre for Ageing and Neuroscience (Cam-CAN) "CC700" dataset [41,42], containing data from 640 subjects with ages 18-88. These data were measured on a 3T Siemens TIM Trio system and written informed consent was obtained from each participant. Voxel resolution was 2 × 2 × 2 mm³, with 30 non-colinear directions measured for each of b = 1000 s/mm² and b = 2000 s/mm². The diffusion weighted images were acquired with a twice refocused spin-echo sequence, and the same preprocessing and reconstruction pipelines used for the HBN dataset were applied to these data. We will refer to this dataset as Cam-CAN.
Data from the ALS and WH studies were processed in a similar manner, using the Matlab Automated Fiber Quantification toolbox (mAFQ, version 1.1 for WH and version 1.2 for ALS) [5]: streamlines representing fascicles of white matter tracts were generated using a deterministic tractography algorithm that follows the principal diffusion direction of the diffusion tensor in each voxel (STT) [81]. Eighteen major tracts, which are enumerated in S1 Fig, were identified using multiple criteria: streamlines are selected as candidates for inclusion in a bundle of streamlines that represents a tract if they pass through known inclusion ROIs and do not pass through exclusion ROIs [82]. In addition, a probabilistic atlas is used to exclude streamlines which are unlikely to be part of a tract [83]. Each streamline is resampled to 100 nodes and the robust mean at each location is calculated by estimating the 3D covariance of the location of each node and excluding streamlines that are more than 5 standard deviations from the mean location in any node. Finally, a bundle profile of tissue properties in each bundle was created by interpolating the value of MRI maps of these tissue properties to the location of the nodes of the resampled streamlines designated to each bundle. In each of 100 nodes, the values are summed across streamlines, weighting the contribution of each streamline by the inverse of the Mahalanobis distance of the node from the average of that node across streamlines. This means that streamlines that are more representative of the tract contribute more to the bundle profile, relative to streamlines that are on the edge of the tract.
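The inverse-Mahalanobis weighting described above can be sketched as follows (a simplified, hypothetical reimplementation; mAFQ's actual code differs in details such as the outlier-exclusion step):

```python
import numpy as np

def weighted_bundle_profile(node_values, node_coords):
    """Weight each streamline's contribution at a node by the inverse
    Mahalanobis distance of that node from the node's mean position.

    node_values : (n_streamlines, n_nodes) tissue-property samples
    node_coords : (n_streamlines, n_nodes, 3) node positions
    """
    n_streamlines, n_nodes = node_values.shape
    profile = np.empty(n_nodes)
    for j in range(n_nodes):
        pts = node_coords[:, j, :]               # positions at node j
        cov_inv = np.linalg.pinv(np.cov(pts.T))  # 3x3 inverse covariance
        d = pts - pts.mean(axis=0)
        maha = np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))
        w = 1.0 / np.maximum(maha, 1e-6)         # inverse-distance weights
        profile[j] = np.average(node_values[:, j], weights=w)
    return profile

# Example: 50 streamlines, 12 nodes, all tissue values equal to 1.0
rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 12, 3))
values = np.ones((50, 12))
profile = weighted_bundle_profile(values, coords)
print(profile.shape)  # (12,)
```

Streamlines near the core of the bundle receive large weights, so the resulting profile reflects the representative trajectory rather than stray edges.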
Data from the HBN and Cam-CAN studies were processed using the updated Python Automated Fiber Quantification toolbox (pyAFQ; [57]). In addition to demonstrating that our analysis pipeline is robust to changes in tractometry software, the use of the updated pyAFQ capitalized upon the following improvements over the legacy Matlab version: (i) the ability to ingest data provided in the BIDS format [84], and (ii) the calculation of diffusion kurtosis imaging (DKI [43]) metrics. We will refer to the mAFQ and pyAFQ pipelines collectively as AFQ.
These processes create bundle profiles, in which diffusion measures are quantified and averaged along eighteen major fiber tracts, which are enumerated in S1 Fig. See S3 Fig of [57] for a depiction of these white matter bundles. Here, we use only the mean diffusivity (MD) and the fractional anisotropy (FA) of the diffusion tensor, but additional dMRI-based maps or maps based on other quantitative MRI measurements can also be used. The resulting feature space was the same for all four datasets, with the FA and MD metrics at each of 100 nodes in eighteen bundles comprising 3600 features per subject. These bundle profiles, along with the phenotypical data we wish to explain or predict, form the input to the SGL algorithm. In a domain-agnostic machine learning context, the phenotypical data comprise the target variables while the bundle profiles form the feature or predictor variables (see Fig 1e).
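Under the dimensions stated above (18 bundles × 100 nodes × 2 metrics = 3600 features), the flattening of bundle profiles into a feature row and the group indices handed to SGL can be sketched as (function and variable names are hypothetical, not AFQ-Insight's API):

```python
import numpy as np

n_bundles, n_nodes, n_metrics = 18, 100, 2        # FA and MD
n_features = n_bundles * n_nodes * n_metrics      # 3600 features per subject

def to_feature_row(profiles):
    """Flatten one subject's profiles[metric][bundle] (each a length-100
    array of node values) into a single feature row."""
    return np.concatenate([profiles[m][b]
                           for m in range(n_metrics)
                           for b in range(n_bundles)])

# Group indices for SGL: one group per (metric, bundle) pair of 100 nodes.
groups = [np.arange(g * n_nodes, (g + 1) * n_nodes)
          for g in range(n_metrics * n_bundles)]
print(n_features, len(groups))  # 3600 36
```

Stacking one such row per subject yields the wide-and-short feature matrix described in the next section, with the group list encoding the anatomical structure that SGL exploits.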
Data harmonization for HBN. For the multisite HBN study, we use the ComBat harmonization method to robustly adjust for site effects in the tract profiles. Initially designed to correct for batch effects in gene expression studies [85], ComBat employs a parametric empirical Bayes approach and has since been applied to multi-site cortical thickness measurements [86], multi-site DTI studies [87], and functional MRI data in the Adolescent Brain Cognitive Development Study (ABCD) [88]. We rely on the neurocombat_sklearn library [89] to apply ComBat in our scikit-learn analysis pipeline and present bundle profile site differences and ComBat correction in S10-S13 Figs.

Sparse group lasso
Before fitting a model to the data, imputation and standardization are performed. Missing node values (e.g., in cases where AFQ designates a node as not-a-number) are imputed via linear interpolation. Care is taken not to interpolate across the boundaries between different bundles. Some diffusion metrics will have naturally larger variance than others and may therefore dominate the objective function and make the SGL estimator unable to learn from the lower variance metrics. For example, fractional anisotropy (FA) is bounded between zero and one and could be overwhelmed by an unscaled higher-variance metric like the mean diffusivity (MD). To prevent this we remove each feature's mean and scale it to unit variance (z-score) using the StandardScaler from scikit-learn [90]. The scaling parameters are learned only from the training data and then applied equally to the training and test data in order to prevent leakage of information between the testing and training sets [91].
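The leakage-free scaling described above can be sketched in a few lines: the z-score parameters are learned on the training split only and then applied to both splits. The toy feature values below are illustrative, chosen only to mimic the scale mismatch between FA (unitless, 0-1) and MD (on the order of 10^-3 mm^2/s).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy feature matrix: 20 subjects x 6 features with very different scales,
# mimicking FA-like columns (~0-1) and MD-like columns (~1e-3).
X = np.hstack([rng.uniform(0, 1, (20, 3)), rng.uniform(5e-4, 3e-3, (20, 3))])
X_train, X_test = X[:15], X[15:]

# Learn mean and variance from the training split only, then apply the same
# transform to both splits, so no information leaks from test into train.
scaler = StandardScaler().fit(X_train)
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)

# Training features are now z-scored; test features use the *training* stats.
assert np.allclose(Z_train.mean(axis=0), 0, atol=1e-10)
assert np.allclose(Z_train.std(axis=0), 1, atol=1e-10)
```

Fitting the scaler inside the training fold (rather than on the full dataset) is what prevents the leakage discussed in [91].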
After scaling and imputation, the tract profile data and target phenotypical data can be organized in a linear model:

y = Xβ + ε,    (1)

where y is the phenotype, either categorical, such as a clinical diagnosis, or continuous numerical, such as the subject's age. The tract profile data are represented in the feature matrix X, with rows corresponding to different subjects and columns corresponding to diffusion measures at different nodes within each bundle. The relationship between tractometric features and the phenotypic target is characterized by the coefficients in β. The error term ε is an unobserved random variable that captures the error in the model. We denote our prediction of the target phenotype as ŷ and the coefficients that produce this prediction as β̂, so that without the error term,

ŷ = Xβ̂.    (2)

In general, the feature matrix X has S rows and B × N × M columns, where S is the number of subjects, B is the number of white matter bundles, N is the number of nodes in each bundle, and M is the number of diffusion metrics calculated at each node. Typically, B = 18, N = 100, and 2 ≤ M ≤ 8, resulting in ≈4,000-14,400 features. Conversely, many dMRI studies have between a few dozen and a few hundred subjects, yielding a feature matrix that is wide and short. Even in cases where more than a thousand subjects are measured (e.g., in the Human Connectome Project, where 1,200 subjects were measured [52]), the problem is ill-posed: the high dimensionality of these data requires regularization to avoid overfitting and to generate interpretable results.
We propose that in addition to regularizing the coefficients in β̂, we can also capitalize on our knowledge of the group structure of the bundle profile features in X. The bundle-metric combinations form a natural grouping. For example, we expect that MD features within the left arcuate fasciculus will co-vary across individuals; likewise for FA values within the right corticospinal tract (CST), and so on. This group structure is represented in Fig 1e, which depicts the linear model ŷ = Xβ̂. Thus, we seek a regularization approach that will fit a linear model with anatomically-grouped covariates, where we expect to observe both groupwise sparsity, where the number of groups (bundle/metric combinations) with at least one non-zero coefficient is small, as well as within-group sparsity, where the number of non-zero coefficients within each non-zero group is small. The sparse group lasso (SGL) is a penalized regression technique that satisfies these criteria [46]. It solves for a coefficient vector β̂ that satisfies

β̂ = argmin_β { (1/2) ‖y − Σ_{ℓ=1}^{G} X^(ℓ) β^(ℓ)‖₂² + (1 − α) λ Σ_{ℓ=1}^{G} √p_ℓ ‖β^(ℓ)‖₂ + α λ ‖β‖₁ }.    (3)

Here, G is the number of groups, X^(ℓ) is the submatrix of X corresponding to group ℓ, β^(ℓ) is the coefficient vector for group ℓ, and p_ℓ is the length of β^(ℓ). In the tractometry setting, G = B × M and, for all ℓ, p_ℓ = 100. The first term is the mean squared error loss, L_mse, as in the standard linear regression framework. The second and third terms encourage groupwise sparsity and overall sparsity, respectively. If α = 1, the SGL reduces to the traditional lasso [92]. Conversely, if α = 0, the SGL reduces to the group lasso [93]. Thus, the model hyperparameter α controls the combination of group lasso and lasso, while the hyperparameter λ controls the strength of the regularization.

SGL with principal components. SGL may be combined with principal components regression (PCR-SGL) by performing dimensionality reduction separately for each group of covariates.
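The group structure that SGL consumes can be encoded explicitly as index arrays, one per bundle/metric combination. The sketch below assumes a contiguous, metric-major feature ordering (all N nodes of one bundle/metric block before the next); this ordering is an assumption of the sketch, chosen to match the groups-as-index-lists format used by groupyr-style solvers.

```python
import numpy as np

B, N, M = 18, 100, 2  # bundles, nodes per bundle, metrics (FA and MD)
n_features = B * N * M

# Assuming features are stored in contiguous blocks of N nodes per
# bundle/metric combination, each group is one such block of indices.
groups = [np.arange(g * N, (g + 1) * N) for g in range(B * M)]

assert len(groups) == B * M                 # G = B x M = 36 groups
assert all(len(g) == N for g in groups)     # p_l = 100 for every group
assert groups[-1][-1] == n_features - 1     # groups tile all 3,600 features
```

With α = 1 every group is ignored and the penalty collapses to the plain lasso over all 3,600 coefficients; with α = 0 entire 100-node blocks are kept or discarded together.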
Let X^(ℓ) = U S V^T be the compact singular value decomposition (SVD) of X^(ℓ), the n × p_ℓ submatrix of X corresponding to group ℓ. Here, S is an r × r matrix, where r = min(n, p_ℓ), that contains the nonzero singular values of X^(ℓ). V^T is an r × p_ℓ semi-orthogonal matrix containing the principal axes of X^(ℓ). The product Z = US is an n × r matrix containing the principal component row vectors needed to reproduce X^(ℓ) in the basis provided by V^T. Since this decomposition is performed separately for each group of covariates, the grouping information is preserved in the transformation from X to Z. We may then build an SGL model relating y and Z, with coefficients γ̂ estimated in the principal component basis. The PCR-SGL coefficients γ̂ may be projected back onto the original feature space using β̂ = Vγ̂.
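The per-group SVD transform and back-projection can be verified numerically for a single group. This is a minimal sketch on random data; the symbol gamma_hat stands for the coefficients fit in the principal component basis (denoted γ̂ above).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 10                      # subjects, features in one group
Xg = rng.normal(size=(n, p))       # submatrix X^(l) for a single group

# Compact SVD: X^(l) = U S V^T with r = min(n, p) components.
U, s, Vt = np.linalg.svd(Xg, full_matrices=False)
Z = U * s                          # principal-component scores, U @ diag(s)

# The transform is lossless: Z V^T reproduces X^(l) exactly.
assert np.allclose(Z @ Vt, Xg)

# Coefficients fit on Z map back to the original feature space via
# beta_hat = V @ gamma_hat, so predictions agree in either basis.
gamma_hat = rng.normal(size=min(n, p))
beta_hat = Vt.T @ gamma_hat
assert np.allclose(Xg @ beta_hat, Z @ gamma_hat)
```

Because X^(ℓ) β̂^(ℓ) = U S V^T V γ̂ = Z γ̂, the model's predictions are identical whether expressed in the component basis or the original node basis.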
Bagging meta-estimators. The previous section describes a single SGL model. To further improve model performance, we create ensemble models composed of m = 20 individual SGL models using bootstrap aggregation (bagging) [94]. Bagging relies on the underlying assumption that some of the error in a single SGL model's prediction stems from a mismatch between the distribution of the training data used to fit the model and that of the test data used to evaluate its performance. To overcome this, bagging invokes the same base estimator (e.g., SGL) many times with different training sets, which are created by sampling the original training samples with replacement. The bagging meta-estimator's prediction is then the average of its constituent estimators' predictions. Likewise, when we report the hyperparameter values α and λ, or regression coefficients β̂, we are referring to these values averaged over the 20 estimators in the bagging meta-estimator.
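The bagging scheme can be sketched with scikit-learn's BaggingRegressor. Here an ordinary Lasso stands in for SGL purely to keep the sketch dependency-free; the data, the choice of alpha, and the averaging of coefficients across the ensemble are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 40))
beta = np.zeros(40)
beta[:5] = 1.0                               # sparse ground truth
y = X @ beta + 0.1 * rng.normal(size=60)

# Bagging meta-estimator: 20 base estimators, each fit on a bootstrap
# resample (sampling rows with replacement); predictions are averaged.
bag = BaggingRegressor(Lasso(alpha=0.05), n_estimators=20, random_state=0)
bag.fit(X, y)

# Coefficients averaged over the ensemble, as reported in the text.
coef_avg = np.mean([est.coef_ for est in bag.estimators_], axis=0)
assert coef_avg.shape == (40,)
```

Averaging over bootstrap fits reduces the variance of both the predictions and the reported coefficient profiles.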
Incorporating target transformations. Often, the target variable y is not in the domain in which the linear model can best be fit. Eq (2) can be slightly modified as

ŷ = f(Xβ̂),

where the transformation function f^{-1} characterizes the transform applied to the data before fitting the linear coefficients. This is similar to the use of a link function in a generalized linear model, but without the adoption of an exponential probability distribution [95]. For the WH, HBN, and Cam-CAN datasets, we use a logarithmic transform, implemented using scikit-learn's TransformedTargetRegressor meta-estimator [90] with numpy's log and exp as the transform and inverse transform functions, respectively [96].

Classification of categorical targets. When the phenotypical target variable is categorical, as in the case of explaining or predicting the presence of a clinical diagnosis, the SGL is readily adapted to logistic regression, where the probability of the target belonging to an arbitrarily defined "true" class is the logistic function of the result of the linear model. Equivalently, the mean squared error loss function in Eq (3) is replaced with the cross-entropy loss, which for binary classification is the negative log likelihood of the SGL classifier giving the "true" label:

L_mse → L_log = −(y log(p) + (1 − y) log(1 − p)).    (11)
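The target transformation can be sketched with the same scikit-learn meta-estimator named above; LinearRegression stands in for SGL, and the synthetic "age" target is an assumption of the sketch (positive and right-skewed, so a log transform is sensible).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
# Synthetic positive, skewed target standing in for subject age.
age = np.exp(X @ np.array([0.3, 0.2, -0.1, 0.05]) + 3.5)

# The linear model is fit to log(age); predictions are mapped back
# through exp, mirroring the transform used for WH, HBN, and Cam-CAN.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, age)
pred = model.predict(X)

assert pred.shape == (50,)
assert np.all(pred > 0)  # the inverse transform guarantees positive ages
```

A useful side effect of the log/exp pair is that predictions can never be negative, which a plain linear model on raw age does not guarantee.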

SGL implementation, cross-validation and metaparameter optimization
Because the SGL is not specific to tractometry, we implemented its solution as a general-purpose Python package called groupyr [97]. Groupyr solves the cost function in Eq (3) using proximal gradient descent [98], implementing a custom proximal operator and relying on the C-OPT optimization library [99], and provides the fitted SGL model as a scikit-learn compatible estimator [100]. Groupyr also selects the hyperparameters α and λ that yield the best cross-validated performance using either: (i) an exhaustive grid search of hyperparameter combinations, or (ii) sequential model-based optimization using the scikit-optimize library [101].

To objectively evaluate model performance and guard against overfitting, we used a nested cross-validation scheme, which is depicted for the binary classification case in Fig 5. The subjects (i.e., rows of the feature matrix X in Fig 1e and Eq (1)) are first shuffled and then decomposed into k_0 batches, hereafter referred to as folds. For the ALS and WH datasets, we used k_0 = 10, and for the HBN and Cam-CAN datasets, k_0 = 5. For each unique fold, we hold that fold out as the test_0 set and let the remaining data comprise the train_0 set, with the subscript indicating the depth of the nested decomposition. We further decompose each train_0 set into three folds, and again, for each unique fold, we hold out that fold as the test_1 set and let the remaining data comprise the train_1 set. At level-1 of the decomposition, we fit an SGL model using fixed regularization hyperparameters α and λ, training the model on train_1 and evaluating the fit on test_1. We find the optimal values of α and λ using sequential model-based optimization, implemented with the scikit-optimize BayesSearchCV class [101]. For continuous numerical y, BayesSearchCV searches for hyperparameter values that maximize R² averaged over the test sets. For binary categorical y, BayesSearchCV seeks to maximize the classification accuracy.

Fig 5 caption: We evaluate model quality using a nested k-fold cross-validation scheme. At level-0, the input data are decomposed into k_0 shuffled folds and optimal hyperparameters are found for each level-0 training set. To avoid overfitting, the optimal hyperparameters are themselves evaluated using a cross-validation scheme taking place at level-1 of the decomposition, where each level-0 training set is further decomposed into k_1 = 3 shuffled folds. In the classification case, the training and test splits are stratified by diagnosis. For the ALS and WH data, k_0 = 10, while for the HBN and Cam-CAN data, k_0 = 5. https://doi.org/10.1371/journal.pcbi.1009136.g005
Using hyperparameter optimization, we find the optimal regularization parameters and β̂ for each train_0 set and then use those to predict phenotype values for the data in test_0. Thus, each subject in the dataset has a predicted phenotype derived from a model that never saw that subject's data. In the classification case, the shuffling into folds is stratified so that each fold preserves the percentage of each class found in the larger input data.
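The nested scheme can be sketched with standard scikit-learn tools. GridSearchCV stands in for scikit-optimize's BayesSearchCV in this sketch, and Lasso stands in for SGL; the synthetic data and the alpha grid are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=60, n_features=30, noise=5.0, random_state=0)

# Inner loop (level-1): 3-fold search over the regularization strength,
# seeing only the level-0 training data it is handed.
inner = GridSearchCV(Lasso(max_iter=5000), {"alpha": [0.1, 1.0, 10.0]}, cv=3)

# Outer loop (level-0): each fold is held out in turn as test_0 while the
# inner search runs on the corresponding train_0 split.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer)  # one R^2 per outer fold

assert scores.shape == (5,)
```

Because the hyperparameter search never touches the outer test fold, the outer R² scores are an honest estimate of generalization performance.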
For each dataset, we also perform a randomization test by training similar models on copies of the data in which the rows of the target variable y have been shuffled while the feature matrix X remains the same. This effectively destroys any relationship between the diffusion data and the outcome. Indeed, all of the models trained on shuffled targets perform no better than random guessing. This is the expected result: better-than-chance performance on shuffled targets would have indicated data leakage between train and test sets [91] or other common machine learning pitfalls.
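This style of randomization test is available directly in scikit-learn as permutation_test_score, shown below with LogisticRegression standing in for the SGL classifier on synthetic data (both are assumptions of the sketch, not the paper's pipeline).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=80, n_features=20, random_state=0)

# Score the model on the true labels, then re-score it after shuffling y
# many times while X is left untouched; the shuffled-label scores should
# hover around chance, and a small p-value means the true score beats them.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    n_permutations=50, cv=5, random_state=0,
)

assert perm_scores.shape == (50,)
assert 0.0 <= pvalue <= 1.0
```

If the permuted scores were comparable to the true score, that would flag leakage or another pipeline fault rather than genuine signal.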

Software implementation
The full software implementation of the SGL approach presented here is available in a Python software package called AFQ-Insight, which is developed publicly at https://github.com/richford/afq-insight. The version of the code used to produce the results herein is also available at https://doi.org/10.5281/zenodo.4316000. AFQ-Insight reads the target and feature data that have been processed by AFQ from comma-separated value (CSV) files conforming to the AFQ-Browser data format [58] and represents them internally as DataFrame objects from the pandas Python library [102]. The software provides different options for imputing missing data in the feature matrix. Missing interior nodes are imputed using linear interpolation. For missing exterior nodes, the user may choose between linear extrapolation and constant forward- or back-fill. Imputation uses only values from adjacent nodes in the same white matter bundle in the same subject, so there is no danger of data leakage from other subjects. AFQ-Insight uses the scikit-learn [90] library to decompose input data into separate test and train datasets, to scale each feature to have zero mean and unit variance, and to evaluate each fit in the hyperparameter search using appropriate classification and regression metrics such as accuracy, area under the receiver operating characteristic curve (ROC AUC), and the coefficient of determination (R²). Internally, AFQ-Insight uses the groupyr software library [97] mentioned above to solve the SGL.

FA is plotted on the left y-axis, while the β̂ coefficients are displayed on the twin axis on the right-hand side. SGL selected the right corticospinal tract (CSTR) as important and regularized away the coefficients in the left corticospinal tract (CSTL). Yet there are also group FA differences in the CSTL. This highlights a potential drawback of the SGL method, discussed in the main text in the context of age regression: SGL is not guaranteed to identify all important features.
In this case, if the diagnostic signal in the CSTL is redundant with that in the CSTR, SGL will regularize away the CSTL features, thereby reducing its sparsity penalty without any corresponding increase in loss. This parsimony cuts both ways: it is a feature of the method when one seeks an efficient predictive model, but a disadvantage when one wants an exhaustive explanation of feature importance. We use the phrase "parsimony pitfall" to refer to the case in which SGL regularizes away redundant but obviously important features.

The β̂ coefficients are distributed widely throughout the brain, and SGL behaves more like the lasso than the group lasso. As before, one must be cautious about comparing bundle profiles and β̂ coefficients between models. While the HBN and Cam-CAN datasets share the same diffusion model and refrain from clipping streamlines, the age distributions of the two are roughly disjoint, with the WH age distribution straddling the two.