Multidimensional analysis and detection of informative features in human brain white matter

doi:10.1371/journal.pcbi.1009136

Fig 1.

Tractometry data flow.

(a) Whole brain tractography generates streamlines approximating the trajectories of white matter connections. (b) Tractometry classifies these streamlines into anatomical bundles. In this case, we show the left corticospinal tract (CSTL) and the left arcuate fasciculus (ARCL) over a mid-sagittal anatomical slice. Tract profiling further extracts bundle profiles, quantifications of various diffusion metrics along the length of the fiber bundle. Here, we show one subject’s fractional anisotropy (FA) profile for (c) the CSTL and (d) the ARCL. (e) The phenotypical target data and tract profile features can be organized into a linear model, ŷ = Xβ̂. The feature matrix X is color-coded to reveal a natural group structure: the left (orange) group contains k features from the CSTL, the middle (green) group contains k features from the left cingulum cingulate (CGCL), and the right (blue) group contains k features from the ARCL. The coefficients in β̂ follow the same natural grouping. Panels (a) and (b) are adapted from https://figshare.com/articles/figure/example_tractography-segmentation/14485350, and reproduced under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).
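
The layout in panel (e) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the helper `build_feature_matrix`, the `profiles` dictionary, and the choice of 5 subjects with k = 100 nodes per bundle are all assumptions made for the example.

```python
import numpy as np

def build_feature_matrix(profiles):
    """Concatenate per-bundle profiles column-wise and record group indices.

    profiles: dict mapping bundle name -> array of shape (n_subjects, k).
    Returns X with one column per (bundle, node) feature, plus a dict that
    maps each bundle name to the column indices it occupies in X.
    """
    blocks, groups, start = [], {}, 0
    for name, block in profiles.items():
        block = np.asarray(block, dtype=float)
        blocks.append(block)
        groups[name] = np.arange(start, start + block.shape[1])
        start += block.shape[1]
    return np.hstack(blocks), groups

# Synthetic stand-in data (assumed sizes, random values in place of real FA).
rng = np.random.default_rng(0)
n_subjects, k = 5, 100
profiles = {name: rng.uniform(0.2, 0.8, size=(n_subjects, k))
            for name in ("CSTL", "CGCL", "ARCL")}

X, groups = build_feature_matrix(profiles)

# One coefficient per feature; the coefficients inherit the column grouping.
beta_hat = np.zeros(X.shape[1])
beta_hat[groups["CSTL"]] = 0.01   # illustrative: only one group is non-zero
y_hat = X @ beta_hat              # prediction of the linear model
```

Keeping the column indices per bundle alongside X is what lets a group-aware penalty treat each bundle as a unit.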

Fig 2.

PCR-SGL accurately and interpretably predicts ALS diagnosis.

(a) Classification probabilities for ALS diagnosis, with controls on the left, patients on the right, predicted controls in blue, and predicted patients in orange. That is, orange dots on the left represent false positives, while blue dots on the right represent false negatives. We achieve 83% accuracy with an ROC AUC of 0.88. (b) PCR-SGL coefficients are presented on the core fibers of major fiber bundles. They exhibit high group sparsity and are concentrated in the FA of the corticospinal tract (CST). The brain is oriented with the right hemisphere in the foreground and anterior to the right of the page. The CSTL, CSTR, callosum forceps anterior (CFA), left arcuate (ARCL), and right arcuate (ARCR) bundles are indicated for orientation. (c) PCR-SGL identifies three portions of the CST as important, where the coefficient vector β̂ (dashed line, right axis) has large values. These are centered around nodes 30, 65, and 90, corresponding to locations of substantial differences in FA between the ALS and control groups (shaded areas indicate standard error of the mean). (d) Bundle profiles for false positive classifications. Line colors correspond to the marker edge color in the top left plot. These individuals have reduced FA in the CST portions that SGL identified as important. Their misclassification is coherent with the feature importance and the group differences in FA. (e) Individual bundle profiles for false negative classifications. These individuals have bundle profiles which oscillate between the group means.
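
For readers unfamiliar with the two summary numbers in panel (a), the sketch below computes accuracy and ROC AUC from predicted probabilities and lists the false positives and false negatives. The labels and probabilities are made up for illustration, and the 0.5 decision threshold is an assumption, not a value stated in the caption.

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of subjects whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def roc_auc(y_true, y_prob):
    """Probability that a random patient scores above a random control.

    Computed over all patient/control pairs; ties count as one half.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Toy data: 0 = control, 1 = patient (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.7])

acc = accuracy(y_true, y_prob)   # 0.75 for this toy data
auc = roc_auc(y_true, y_prob)    # 0.9375 for this toy data

# Controls predicted as patients, and patients predicted as controls.
false_pos = np.flatnonzero((y_prob >= 0.5) & (y_true == 0))
false_neg = np.flatnonzero((y_prob < 0.5) & (y_true == 1))
```

Accuracy depends on the threshold, whereas ROC AUC summarizes ranking quality across all thresholds, which is why the two are reported together.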

Fig 3.

Predicting age with tractometry and SGL.

(top) The predicted age vs. true age of each individual from the test splits (i.e., when each subject’s data was held out in fitting the model) for the (a) WH, (b) HBN, and (c) Cam-CAN datasets; an accurate prediction falls close to the y = x line (dashed). The mean absolute error (MAE) and coefficient of determination R² are presented in the lower right of each scatter plot. (middle) Feature importance for predicting age from tract profiles in the (d) WH, (e) HBN, and (f) Cam-CAN datasets. The orientation of the brain is the same as in Fig 2b; however, because the coefficients exhibit high global sparsity (as opposed to group sparsity), we plot the mean of the absolute value of the coefficients β̂ for each bundle on the core fiber. The global distribution of the coefficients reflects the fact that aging is not confined to a single white matter bundle. (bottom) Age quintile bundle profiles for the (g) WH, (h) HBN, and (i) Cam-CAN datasets.
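
The quantities named in this caption can be written out explicitly. The sketch below, on invented numbers, computes MAE and R² for a set of age predictions and the per-bundle mean absolute coefficient used for the middle row. The function names, the toy ages, and the three two-feature groups "A", "B", and "C" are hypothetical.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def bundle_importance(beta_hat, groups):
    """Mean of |coefficient| within each bundle's block of features."""
    return {name: float(np.mean(np.abs(beta_hat[idx])))
            for name, idx in groups.items()}

# Toy held-out predictions (illustrative values, in years).
true_age = np.array([8.0, 12.0, 20.0, 35.0, 50.0])
pred_age = np.array([10.0, 11.0, 22.0, 33.0, 47.0])
mae = mean_absolute_error(true_age, pred_age)   # 2.0 for this toy data
r2 = r_squared(true_age, pred_age)

# Toy coefficients split into three hypothetical bundles of two features each.
beta_hat = np.array([0.5, -0.5, 0.0, 0.0, 0.2, -0.4])
groups = {"A": np.array([0, 1]), "B": np.array([2, 3]), "C": np.array([4, 5])}
importance = bundle_importance(beta_hat, groups)
```

Taking the absolute value before averaging keeps positive and negative coefficients within a bundle from cancelling out.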

Fig 4.

Model performance across all datasets.

Each panel shows model performance measured on the test set for each cross-validation split, with each black dot representing a split, box plots representing the quartiles, and white diamonds representing the mean performance. The y-scale varies in each subplot. (a) Accuracy of test set predictions for the ALS dataset. Because group differences in ALS diagnosis are mostly confined to a single bundle, the group structure-preserving methods, SGL and PCR-SGL, outperform the other models. The remaining frames show the coefficient of determination, R², in test sets for the (b) WH, (c) HBN, and (d) Cam-CAN datasets. Because aging affects the white matter globally, group structure-blind methods like elastic net and PCR Lasso perform well. Nonetheless, the SGL models show competitive predictive performance, adapting to a problem where group structure is not as informative. PCR-SGL performs poorly in this regime because its initial group-wise PC projection destroys between-bundle covariance. The bundle-mean lasso performs poorly, demonstrating the value of along-tract profiling.
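
Two of the feature reductions contrasted in this caption can be sketched on synthetic data. The block below is an illustration under assumptions rather than the authors' implementation: `bundle_means` collapses each bundle profile to a single number, and `groupwise_pca` projects each bundle's columns onto that bundle's own principal components. The function names, the number of components, and the data sizes are hypothetical.

```python
import numpy as np

def bundle_means(X, groups):
    """Collapse each bundle's along-tract profile to its mean value.

    Maps X of shape (n, n_bundles * k) to shape (n, n_bundles).
    """
    return np.column_stack([X[:, idx].mean(axis=1) for idx in groups.values()])

def groupwise_pca(X, groups, n_components=2):
    """Project each bundle's columns onto that bundle's leading PCs.

    Each bundle is centered and decomposed on its own, so the projection
    is computed without reference to the other bundles.
    """
    scores = []
    for idx in groups.values():
        block = X[:, idx] - X[:, idx].mean(axis=0)
        u, s, _ = np.linalg.svd(block, full_matrices=False)
        scores.append(u[:, :n_components] * s[:n_components])
    return np.hstack(scores)

# Synthetic data: 20 subjects, 3 hypothetical bundles of 10 nodes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 30))
groups = {"b0": np.arange(0, 10), "b1": np.arange(10, 20), "b2": np.arange(20, 30)}

X_mean = bundle_means(X, groups)      # shape (20, 3)
X_pcs = groupwise_pca(X, groups, 2)   # shape (20, 6)
```

The bundle-mean reduction discards all variation along the tract, while the group-wise projection keeps within-bundle structure but decomposes each bundle in isolation.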

Fig 5.

Nested cross-validation.

We evaluate model quality using a nested k-fold cross-validation scheme. At level-0, the input data is decomposed into k₀ shuffled groups and optimal hyperparameters are found for the level-0 training set. To avoid overfitting, the optimal hyperparameters are themselves evaluated using a cross-validation scheme taking place at level-1 of the decomposition, where each level-0 training set is further decomposed into k₁ = 3 shuffled groups. In the classification case, the training and test splits are stratified by diagnosis. For the ALS and WH data, k₀ = 10, while for the HBN and Cam-CAN data, k₀ = 5.
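
The two-level scheme can be sketched in a few lines. The example below is a minimal illustration for the regression case only (no stratification), and it substitutes a closed-form ridge regression without an intercept for the models used in the paper. The function names, the candidate `alphas`, and the synthetic data are assumptions.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression (a stand-in model for this sketch)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, beta):
    return float(np.mean((X @ beta - y) ** 2))

def nested_cv(X, y, alphas, k0=5, k1=3, seed=0):
    rng = np.random.default_rng(seed)
    outer_scores, chosen = [], []
    for test_idx in kfold_indices(len(y), k0, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Level-1: pick the hyperparameter using the level-0 training set only.
        inner_folds = kfold_indices(len(train_idx), k1, rng)
        mean_err = []
        for alpha in alphas:
            errs = []
            for val_pos in inner_folds:
                val = train_idx[val_pos]
                fit = np.setdiff1d(train_idx, val)
                errs.append(mse(X[val], y[val], ridge_fit(X[fit], y[fit], alpha)))
            mean_err.append(np.mean(errs))
        best = alphas[int(np.argmin(mean_err))]
        # Level-0: refit on the whole training set, score on the held-out fold.
        beta = ridge_fit(X[train_idx], y[train_idx], best)
        outer_scores.append(mse(X[test_idx], y[test_idx], beta))
        chosen.append(best)
    return outer_scores, chosen

# Synthetic regression problem (illustrative sizes and weights).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=60)
alphas = [0.01, 1.0, 100.0]
scores, chosen = nested_cv(X, y, alphas, k0=5, k1=3)
```

Because the held-out level-0 fold is never used when selecting the hyperparameter, each entry of `scores` estimates performance on data the tuned model has not seen.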
