Concussion classification via deep learning using whole-brain white matter fiber strains

Developing an accurate and reliable injury predictor is central to the biomechanical studies of traumatic brain injury. State-of-the-art efforts continue to rely on empirical, scalar metrics based on kinematics or model-estimated tissue responses explicitly pre-defined in a specific brain region of interest. These explicit metrics may suffer from a loss of information. In addition, performance has typically been evaluated on a single training dataset without cross-validation. In this study, we developed a deep learning approach for concussion classification using implicit features of the entire voxel-wise white matter fiber strains. Using reconstructed American National Football League (NFL) injury cases, leave-one-out cross-validation was employed to objectively compare injury prediction performances against two baseline machine learning classifiers (support vector machine (SVM) and random forest (RF)) and four scalar metrics via univariate logistic regression (Brain Injury Criterion (BrIC), cumulative strain damage measure of the whole brain (CSDM-WB) and the corpus callosum (CSDM-CC), and peak fiber strain in the CC). Feature-based machine learning classifiers including deep learning, SVM, and RF consistently outperformed all scalar injury metrics across all performance categories (e.g., leave-one-out accuracy of 0.828–0.862 vs. 0.690–0.776, and .632+ error of 0.148–0.176 vs. 0.207–0.292). Further, deep learning achieved the best cross-validation accuracy, sensitivity, AUC, and .632+ error. These findings demonstrate the superior performance of deep learning in concussion prediction and suggest its promise for future applications in biomechanical investigations of traumatic brain injury.


Introduction
Traumatic brain injury (TBI) resulting from blunt head impact is a leading cause of morbidity and mortality in the United States [1]. The recent heightened public awareness of TBI, especially of sports-related concussion [2,3], has prompted the Institute of Medicine and National Research Council of the National Academies to recommend immediate attention to address the biomechanical determinants of injury risk and to identify effective concussion diagnostic metrics and biomarkers, among others [4].
Impact kinematics such as linear and rotational accelerations are convenient ways to characterize impact severity. Naturally, these simple kinematic variables and their more sophisticated variants have been used to assess the risk and severity of brain injury. As head rotation is thought to be the primary mechanism for mild TBI (mTBI) including sports-related concussion, most kinematic metrics include rotational acceleration or velocity, either solely (e.g., power rotational head injury criterion (PRHIC) [5], brain injury criterion (BrIC) [6], and rotational velocity change index (RVCI) [7]) or in combination with linear acceleration [8]. Kinematic variables, alone, do not provide regional brain mechanical responses that are thought to initiate injury [9]. Validated computational models of the human head are, in general, believed to serve as an important bridge between external impact and tissue mechanical responses. Model-estimated, response-based injury metrics are desirable, as they can be directly related to tissue injury tolerances. Commonly used tissue response metrics include peak maximum principal strain and cumulative strain damage measure (CSDM; [10]) for the whole or sub-regions [11] of the brain. More recently, white matter (WM) fiber stretch [12][13][14] is also being explored as a potential improvement. There is growing interest in utilizing model-simulated responses to benchmark the performance of other kinematic injury metrics [6,15].
Regardless of these injury prediction approaches (kinematic or response-based), they share some important common characteristics. First, they have utilized a single injury dataset for "training" and performance evaluation. Often, this was performed by fitting a logistic regression model to report the area (AUC) under the receiver operating characteristic curve (ROC) [5,6,8,14,16]. However, without cross-validation using a separate "testing dataset", there is uncertainty about how the metrics would perform when deployed to predict injury on fresh, unseen impact cases in clinical applications [13,17]. This is an important issue that seems under-appreciated, given that AUC provides an average or aggregated performance of a procedure but does not directly govern how a clinical decision, in this case, injury vs. non-injury diagnosis, is made.
Second, an explicit, pre-defined kinematic or response metric is necessary for injury prediction. While candidate injury metrics are typically motivated by known or hypothesized injury mechanisms (e.g., strain), they are derived empirically. Response-based injury metrics are also often pre-defined in a given, specific brain region of interest (ROI) such as the corpus callosum and brainstem. However, they do not consider other anatomical regions or functionally important neural pathways. Even when using the same reconstructed American National Football League (NFL) head impacts, studies have found inconsistent "optimal" injury predictors (e.g., maximum shear stress in the brainstem [16], strain in the gray matter and CSDM 0.1 (using a strain threshold of 0.1) in the WM [18], peak axonal strain within the brainstem [14], or tract-wise injury susceptibilities in the superior longitudinal fasciculus [19]). These previous efforts are essentially "trial-and-error" in nature, attempting to pinpoint a specific variable for injury prediction, but they have failed to reach consensus on the most injury-discriminative metric or ROI.
Injury prediction is a binary classification. Besides logistic regression relying on an explicit metric, recently there have been numerous algorithmic advances in classification, including machine learning and deep learning. They have achieved remarkable success in a wide array of science domains, including cancer detection (see [20] for a recent review). However, their application in TBI biomechanics is extremely limited or even non-existent at present. A recent study utilized a support vector machine to predict concussion [21]. However, it was limited to kinematic variables (vs. brain responses) and two injury cases, which did not allow for cross-validation.
In this study, we employed state-of-the-art deep learning for concussion classification based on implicit features of voxel-wise WM fiber strains of the entire brain (vs. an explicit injury metric in a given ROI). Repeated random subsampling was also employed to train and cross-validate the concussion classifier to objectively evaluate and compare injury prediction performances [19]. These injury prediction strategies are important extensions to previous efforts, which may provide important fresh insight into how best to predict injury, including concussion, in the future.

The Worcester Head Injury Model (WHIM) and WM Fiber Strain
We used the Worcester Head Injury Model (WHIM; Fig. 1; formerly known as the Dartmouth Head Injury Model or DHIM [12,22]) to simulate the reconstructed NFL head impacts [23,24]. Descriptions of the WHIM development, material property and boundary condition assignment, and quantitative assessment of the mesh geometrical accuracy and model validation performances have been published previously. Briefly, the WHIM was created based on high-resolution T1-weighted MRI of an individual athlete. Diffusion tensor imaging (DTI) of the same individual provided averaged fiber orientations at each WM voxel location [12].
The 58 reconstructed head impacts include 25 concussion and 33 non-injury cases. As in previous studies [18,19,24,25], head impact linear and rotational accelerations were preprocessed before being applied to the WHIM head center of gravity (CG) for brain response simulation. The skull and facial components were simplified as rigid bodies as they did not influence brain responses.
Peak WM fiber strain, regardless of the time of occurrence during impact, was computed at each DTI WM voxel (N = 64272; [22]). For voxels not corresponding to WM, their values were padded with zeroes. This led to a full 3D image volume encoded with peak WM fiber strains (with surface rendering of the segmented WM shown in Fig. 1c), which served as classification features for deep network training and concussion prediction. Fiber strain was chosen instead of the more commonly used maximum principal strain because of its potentially improved injury prediction performance [13,14,22].
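The zero-padding step can be sketched as follows; the voxel coordinates, strain values, and volume dimensions below are hypothetical placeholders for illustration, not the actual WHIM data.

```python
import numpy as np

def encode_strain_volume(wm_voxels, peak_strains, shape):
    """Embed peak WM fiber strains into a full 3D image volume.

    wm_voxels: (N, 3) integer voxel coordinates of the WM mask
    peak_strains: (N,) peak fiber strain at each WM voxel
    shape: (nx, ny, nz) dimensions of the image volume
    Non-WM voxels are zero-padded, as described in the text.
    """
    vol = np.zeros(shape, dtype=np.float32)
    vol[tuple(wm_voxels.T)] = peak_strains
    return vol

# Toy example: 3 WM voxels embedded in a 4x4x4 volume
vox = np.array([[0, 1, 2], [1, 1, 1], [3, 0, 0]])
strains = np.array([0.12, 0.30, 0.05], dtype=np.float32)
vol = encode_strain_volume(vox, strains, (4, 4, 4))
```

The resulting volume (at full resolution, one value per DTI WM voxel) is what serves directly as the network input.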
Fig. 1 The WHIM head exterior (a) and intracranial components (b), along with peak fiber strain-encoded rendering of the segmented WM outer surface (c). The x-, y-, and z-axes of the model coordinate system correspond to the posterior-anterior, right-left, and inferior-superior directions, respectively. The strain image volume, which was used to generate the rendering within the co-registered head model for illustrative purposes, directly served as input signals for deep learning network training and concussion classification (see Fig. 2).

Deep Learning: Background
Deep learning has dramatically improved the state-of-the-art in numerous research domains (see a recent review in Nature Methods [20]). However, its application in TBI biomechanics is nonexistent at present. This technique allows models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [20]. A deep learning network uses a collection of logical units and their activation statuses to simulate brain function. It employs an efficient layer-wise supervised update method [26] or an unsupervised network training strategy [27]. This makes it feasible to train a "deep" (e.g., more than 3 layers) neural network, which is ideal for learning large-scale and high-dimensional data.
For a deep learning network, the l-th layer transforms an input vector from its lower layer, x^(l-1), into an output vector, x^(l), through the following forward transformation:

x^(l) = f^(l)(W^(l) x^(l-1) + b^(l)),     (1)

where matrix W^(l) is a linear transform describing the unit-to-unit connection between two adjacent, l-th and (l-1)-th, layers, and b^(l) is a bias offset vector. Their dimensions are configured to produce the desired dimensionality of the input and output, with the raw input data represented by x^(0) (Fig. 2). The nonlinear normalization or activation function, f^(l), can be defined as either a Sigmoid or a TanH function [28], or Rectified Linear Units (ReLU) [29], in order to suppress the output values for discriminant enhancement [30] and to achieve non-linear approximation [31]. Upon network training convergence, the optimized parameters, W^(l) and b^(l), are used to produce predictions for the cross-validation dataset. More details on the mathematics behind and procedures of deep network training are provided in the Appendix.
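A minimal numerical sketch of this forward transformation, with hypothetical dimensions, randomly initialized W and b, and TanH as the activation:

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def layer_forward(x_prev, W, b, f=tanh):
    """One forward transformation: x^(l) = f(W^(l) x^(l-1) + b^(l))."""
    return f(W @ x_prev + b)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(6)             # raw input x^(0), toy dimension
W1 = 0.1 * rng.standard_normal((4, 6))  # maps 6 input units to 4 output units
b1 = np.zeros(4)                        # bias offset vector
x1 = layer_forward(x0, W1, b1)          # layer-1 output, squashed into (-1, 1)
```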

Deep Learning: Network Design and Implementation
A systematic approach to designing an "optimal" deep learning network is still an active research topic [32]. Often, trial-and-error is used to determine the appropriate number of layers and the numbers of connecting units in each layer, as a clear rule is currently lacking. In this study, we empirically developed a network structure composed of five fully connected layers (i.e., each unit in a layer was connected to all units in its adjacent layers; Fig. 2), similar to that used before (e.g., [33]). The number of network layers was chosen to balance the trade-off between network structure nonlinearity and regularity. Adding or removing 1 or 2 layers yielded largely comparable prediction performances, suggesting that our injury classification problem was relatively insensitive to the number of network layers.
The numbers of connecting units in each layer, or the network dimension, also followed a popular pyramid structure [33] to sequentially halve the number of connecting units in subsequent layers (i.e., a structure of 2000-1000-500-250 units for layers 1 to 4; Fig. 2). Each layer performed a feature condensation transform (Eqns. 1 and 2) independently. The final feature vector, x^(4), served as the input for injury classification. Table 1 summarizes the dimensions of the weights, W^(l), and offset vectors, b^(l), as well as the normalization functions, f^(l), used to define the deep network. In total, the network contained over 1.31×10^8 independent parameters. For the first three layers (i.e., layers 1 to 3 in Fig. 2), a hyperbolic tangent normalization function, TanH, was used to squash each unit value into [-1, 1]. This was to preserve the zero mean and to facilitate non-biased weighting updates in training [28]. A batch normalization technique was also used to avoid internal covariate shift as a result of non-normal distributions of the input and output values. This enhanced the network robustness [34]. In contrast, the last layer prior to classification (layer 4 in Fig. 2) adopted a Sigmoid function to normalize output values to [0, 1], which was necessary to facilitate the Softmax classification [35].
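The pyramid structure and layer-wise activations can be sketched as below. The dimensions are toy-sized stand-ins (the full 64272-input, 2000-1000-500-250 network would be too large for an illustration), and the random initialization and final Softmax classifier are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def tanh(z):
    return np.tanh(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy-sized halving pyramid; the full-scale network follows the same pattern.
dims = [64, 16, 8, 4, 2]
acts = [tanh, tanh, tanh, sigmoid]  # TanH for layers 1-3, Sigmoid for layer 4

params = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(acts))]
Wc, bc = 0.1 * rng.standard_normal((2, dims[-1])), np.zeros(2)  # classifier

def predict(x):
    for (W, b), f in zip(params, acts):
        x = f(W @ x + b)           # forward transformation per layer (Eqn. 1)
    return softmax(Wc @ x + bc)    # two-unit concussion probability score

p = predict(rng.standard_normal(64))
```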
Table 1. Summary of the dimensions of the weights and offset parameters, along with the normalization functions used to define the deep learning network. See Appendix for details regarding the normalization functions.
The network was trained via an iterative gradient descent optimization scheme through Caffe [36]. The gradient descent step size or learning rate was set to 0.00001 for all network layers. The gradient descent momentum (i.e., the weight to multiply the gradient from the previous step in order to augment the gradient update in the current step) was set to 0.9. The regularization factor was 0.0005 in order to prevent the weights from growing too fast. The training dataset was divided into a batch size of 10 for training (randomly resampled cases were added when the remaining batch was fewer than 10).
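These hyperparameters correspond to a standard SGD update with momentum and weight-decay regularization. A minimal sketch of one such update (the function name and toy values are illustrative, not Caffe's internal implementation):

```python
import numpy as np

# Hyperparameters reported in the text
LR, MOMENTUM, DECAY = 1e-5, 0.9, 5e-4

def sgd_step(w, grad, velocity):
    """One momentum SGD update with weight decay:
    v <- momentum * v - lr * (grad + decay * w); w <- w + v."""
    velocity = MOMENTUM * velocity - LR * (grad + DECAY * w)
    return w + velocity, velocity

w = np.ones(3)                    # toy weight vector
v = np.zeros(3)                   # momentum buffer
g = np.array([1.0, -2.0, 0.5])    # toy gradient
w, v = sgd_step(w, g, v)
```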

Concussion Classification and Performance Evaluation
To objectively evaluate the concussion classification performance, we employed a repeated random subsampling technique to split the injury cases into independent and non-overlapping training and cross-validation datasets, as employed recently [19].
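A minimal sketch of repeated random subsampling, assuming the 58 cases are indexed 0-57 and using a hypothetical 39/19 training/validation configuration:

```python
import numpy as np

def random_subsample_splits(n_cases, n_train, n_trials, seed=0):
    """Repeatedly split case indices into non-overlapping training and
    cross-validation sets (repeated random subsampling)."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_trials):
        perm = rng.permutation(n_cases)
        splits.append((perm[:n_train], perm[n_train:]))
    return splits

# 58 reconstructed cases, 39 for training and 19 for cross-validation
splits = random_subsample_splits(58, 39, n_trials=50)
train, val = splits[0]
```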

"Optimal" Training Iteration
An optimal training iteration achieves the best cross-validation accuracy at the minimum computational cost. However, this is not feasible to determine for fresh, unseen cases. Therefore, we adopted the following empirical approach based on subsets of known cross-validation samples.
For each training/validation configuration (Table 2), additional random trials (N=3 for each configuration; independent of those used in performance evaluation) were generated to observe the convergence behaviors of the training and validation error functions (Eqn. A4 in Appendix; Fig. 3). The training error function asymptotically decreased with increasing iterations. The cross-validation error function initially decreased, as expected, but could start to increase after sufficient iterations, indicating possible overfitting. This was more evident when the cross-validation accuracy began to decrease after a plateau. These observations were consistent regardless of the training/testing configuration.
Therefore, an empirical threshold on the training error function was established to ensure sufficient training, which depended on the specific training/validation configuration (Table 3; 0.45 for the illustrated trials in Fig. 3). In addition, an admissible range of training iterations, [10000, 15000], was also set to specify the minimum and maximum number of iterations. This empirical stopping criterion allowed sufficient training while minimizing potential overfitting.
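The stopping rule can be sketched as a simple predicate; the threshold and iteration bounds below are those reported for the illustrated configuration, and the function name is illustrative:

```python
def stop_training(iteration, train_error, err_threshold=0.45,
                  min_iter=10000, max_iter=15000):
    """Empirical stopping criterion from the text: stop once the training
    error falls below a configuration-specific threshold, but only within
    the admissible iteration range [min_iter, max_iter]."""
    if iteration >= max_iter:
        return True  # hard cap on iterations
    return iteration >= min_iter and train_error < err_threshold
```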

Performance Comparison against Other Injury Prediction Metrics
We selected the following four injury metrics to assess the deep learning concussion classification performance: Brain Injury Criterion (BrIC [6]; a kinematic metric found to correlate the best with strain-based metrics in diverse automotive impacts [15]), CSDM for the whole brain (CSDM-WB) and the corpus callosum (CSDM-CC) based on maximum principal strain, as well as peak WM fiber strain in the corpus callosum (Peak-CC; [14]).
Critical angular velocity values along the three major axes were necessary to define BrIC. They corresponded to a 50th percentile probability of concussion based on the resulting maximum principal strain, and depended on the head FE model for impact simulation and the injury dataset used to fit a logistic regression model. Adopting an earlier approach [14], the critical values for WHIM were determined to be 30.4 rad/s, 35.6 rad/s, and 23.5 rad/s along the three major axes, respectively, based on the reconstructed NFL injury dataset. For CSDM, an "optimal" strain threshold of 0.2 was used, which was identified to maximize the significance of the injury risk-response relationship for the group of 50 deep WM regions using the same reconstructed NFL injury dataset [19].
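Assuming BrIC takes its usual root-sum-of-squares form over peak angular velocity components normalized by the critical values above [6], and CSDM its usual volume-fraction form, a minimal sketch (function names and toy inputs are illustrative; CSDM is shown as a simple element-count fraction, ignoring element volume weighting):

```python
import numpy as np

# Critical angular velocities for WHIM reported in the text (rad/s)
W_CRIT = np.array([30.4, 35.6, 23.5])

def bric(peak_omega):
    """BrIC as the root-sum-of-squares of peak angular velocity components,
    each normalized by its axis-specific critical value."""
    return float(np.sqrt(np.sum((np.asarray(peak_omega) / W_CRIT) ** 2)))

def csdm(strains, threshold=0.2):
    """CSDM: fraction of elements/voxels whose peak maximum principal
    strain exceeds the threshold (0.2 used in the text)."""
    return float(np.mean(np.asarray(strains) > threshold))

b = bric([15.2, 0.0, 0.0])          # half the x-axis critical value
c = csdm([0.05, 0.25, 0.30, 0.10])  # 2 of 4 toy strains exceed 0.2
```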
For each injury prediction method, AUCs were produced for both the training and cross-validation datasets, separately, and for each random trial. For each training/validation configuration (Table 2), this led to 50 AUC values for the training datasets, and another batch of 50 values for the cross-validation datasets (to allow computing the mean and standard deviation). Computing the AUCs was straightforward for the explicit injury metrics subjected to logistic regression analysis. For the deep learning approach, the concussion probability score (Fig. 2; Eqn. A1 in Appendix) was first extracted. With the known concussion/non-injury labels, an ROC curve was readily generated to report an AUC (perfcurve.m in Matlab). Similarly, this was repeated for each of the 50 random trials of training and cross-validation datasets, and for each configuration. Finally, the injury prediction methods were further compared in terms of cross-validation accuracy, sensitivity, and specificity, for all of the training/validation configurations (Table 2), in terms of mean and standard deviation.
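As a stand-in for the perfcurve.m step, the AUC can be computed directly from the concussion probability scores via the equivalent Mann-Whitney rank statistic (the toy labels and scores below are illustrative):

```python
import numpy as np

def auc_from_scores(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a randomly
    chosen positive (concussion) case receives a higher score than a randomly
    chosen negative (non-injury) case, counting ties as 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 0, 0, 0]              # concussion / non-injury labels
s = [0.9, 0.4, 0.6, 0.2, 0.1]    # concussion probability scores
auc = auc_from_scores(y, s)
```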

Data Analysis
Simulating each head impact of 100 ms duration in Abaqus/Explicit (Version 2016; Dassault Systèmes, France) required ~50 min on a 12-CPU Linux cluster (Intel Xeon E5-2680v2, 2.80 GHz, 128 GB memory) with a temporal resolution of 1 ms. An additional 9 min was needed to obtain element-wise cumulative strains (single-threaded). The classification framework was implemented on Windows (Xeon E5-2630 v3, 8 cores, 16 GB memory) with GPU acceleration (NVidia Quadro K620, 384 cores, 2 GB memory). Training each deep network typically required an hour, but subsequent injury prediction was real-time (<0.01 sec).
Concussion classification performances were compared in terms of AUC for both the training and cross-validation datasets. In addition, cross-validation accuracy, sensitivity, and specificity were also compared. Statistical significance was reached when p<0.05. All data analyses were conducted in MATLAB (R2016b; Mathworks, Natick, MA).
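The three cross-validation performance measures can be sketched as follows (the toy labels are illustrative):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) from binary labels and predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)     # concussion correctly predicted
    tn = np.sum(~y_true & ~y_pred)   # non-injury correctly predicted
    acc = (tp + tn) / len(y_true)
    sens = tp / np.sum(y_true)
    spec = tn / np.sum(~y_true)
    return acc, sens, spec

acc, sens, spec = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```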

Strain-encoded whole-brain image volume
Fig. 4 illustrates and compares peak WM fiber-strain-encoded images on three orthogonal planes for a pair of striking and struck (non-injured and concussed, respectively) athletes involved in the same head collision. Deep learning utilized all of the strain-encoded WM image features for training and concussion classification, directly, without thresholding.

Fig. 4 Cumulative WM fiber strains on representative orthogonal planes for a pair of striking (non-injury) and struck (concussed) athletes.

AUC and ROC
Fig. 5 summarizes the average AUCs for each prediction method and training/validation configuration, separately for the training and cross-validation datasets. The deep learning approach significantly, and consistently, outperformed all other injury metrics for the training datasets (minimum AUC of 0.95 vs. maximum value of 0.87; Fig. 5a). In addition, its AUC standard deviation (STD) was also typically the smallest. For the cross-validation datasets, we observed a sharp drop in AUCs for the deep learning approach (range: 0.82-0.86). Nevertheless, they were still statistically comparable to BrIC and CSDM-WB (p>0.5), and significantly better than CSDM-CC and Peak-CC (p<0.01; Fig. 5b). Fig. 6 and Fig. 7 illustrate the ROCs corresponding to the highest and lowest AUCs for each configuration. Consistent with earlier findings, deep learning had the highest AUC for the best and worst cases when evaluating on the training datasets (Fig. 6). For the cross-validation datasets, this remained true when the training sample size was low (i.e., 19 or 29, corresponding to validation sample sizes of 39 or 29, respectively; Fig. 7).

Cross-Validation Performance Metrics
Fig. 8 compares the cross-validation accuracy, sensitivity, and specificity for each training/validation configuration. Deep learning consistently and statistically outperformed CSDM-WB, CSDM-CC, and Peak-CC (p<0.05) in cross-validation accuracy. Its performance was also comparable to, albeit not statistically better than, BrIC (p>0.05). However, deep learning consistently had the smallest STD among all approaches, suggesting a more stable classification performance. For sensitivity, deep learning consistently outperformed Peak-CC (p<0.05). It also had a higher sensitivity than BrIC when the training size was 19 and 39 (p<0.01), and higher than CSDM-WB when the training size was 19 (p<0.05). However, other comparisons were inconclusive. For specificity, BrIC was consistently better than the other injury metrics (p<0.05). Deep learning ranked second, and was comparable to BrIC when the training size was 49.
For cross-validation accuracy, Table 4 compares the best and worst cases for the five injury prediction methods.Deep learning and BrIC were virtually comparable, and they largely outperformed others.

Discussion
The biomechanical mechanisms of TBI, including mTBI and concussion, have been an active research focus for more than 70 years. Developing an accurate and reliable metric for injury prediction is one of the cornerstones in TBI biomechanics research. Unfortunately, despite decades of efforts, an "optimal" or best injury predictor has yet to be developed/identified. Instead of similarly attempting to pinpoint an explicit response measure pre-defined in a specific ROI, here we utilized voxel-wise WM fiber strains from the entire brain. The classical injury prediction problem was then formulated into a supervised classification via deep learning that automatically distilled the most discriminative features implicitly from the strain-encoded image volumes for concussion classification.
Thanks to its supervised training, the deep learning approach significantly outperformed all other injury metrics in terms of AUC based on the training datasets. The mean AUC (0.95-0.99, Fig. 5a, with the highest of 1.0 for some trials; Fig. 6) was comparable to the "best" performers found in other studies (e.g., 0.9655 using axonal strain in the brainstem [14], up to 0.982 using the combination of linear/rotational acceleration [8]). However, the high levels of AUC achieved in the training dataset did not necessarily translate into the same level of cross-validation performance (Fig. 5b). While not surprising, this suggests the need to cross-validate with samples separate from the training dataset to objectively evaluate and compare performances.
Without iterative training, the other injury metrics subjected to logistic regression did not present a sharp drop in AUCs between the two datasets. However, although CSDM-WB had higher mean AUCs than BrIC in training and cross-validation for most of the configurations (Fig. 5), the opposite was true for cross-validation accuracy and specificity (Fig. 8a). As AUC prescribes an average performance measure over all possible probability thresholds (0-1), it does not govern how a clinical decision (i.e., injury vs. non-injury) is made. In contrast, the cross-validation accuracy, sensitivity, and specificity measures offer more insight into the performance in clinical diagnostic decision-making. This reinforces the need for cross-validation, beyond training or fitting alone, to evaluate the injury prediction performance more objectively.
Regardless, using performance metrics based on the cross-validation datasets, deep learning had a statistically comparable AUC relative to BrIC and CSDM-WB (p>0.5), while significantly higher than CSDM-CC and Peak-CC (p<0.01). In terms of cross-validation accuracy, deep learning had a comparable (and often higher) performance relative to BrIC in average, best, and worst cases (Fig. 8 and Table 4). It also consistently had the smallest STD, suggesting a more stable prediction performance. Further, deep learning was comparable to, and often statistically better than, CSDM-WB in cross-validation accuracy, while it consistently outperformed CSDM-CC and Peak-CC. This was expected, given that deep learning utilized features of the entire WM, whereas empirical metrics pre-defined in the CC were limited to this specific region only.
Nevertheless, these findings suggested that deep learning had, perhaps, a relatively mild improvement in injury prediction performance. Yet, this came at a somewhat higher cost in technique sophistication. The simpler kinematics-based BrIC even had a slightly better specificity. Justification of our seemingly more "cumbersome" approach mainly rests on its desirable advantage of utilizing estimated brain mechanical responses for injury prediction, as envisioned [37,38]. This may enable a more graded understanding of "injury" to specific brain regions or functionally important neural tracts, as illustrated here (Fig. 4) and envisioned before [12,39,40]. In contrast, kinematic injury metrics, including BrIC and RVCI, are limited to a binary injury prediction for the entire brain at present.
Further, deep learning utilized strain-encoded images of the entire brain directly without the need to pinpoint a specific ROI for concussion classification. Often, this was otherwise conducted empirically and without consensus (e.g., corpus callosum, brainstem, gray matter, or white matter [14,16,18]). In addition, no thresholding was necessary.
Potentially, this could be another notable advantage over approaches in which strain thresholding was typically conducted empirically as well (e.g., either 0.1 [14] or 0.25 [15]), and for the entire brain without accounting for possible inter-regional tolerance differences. As this technique is also applicable to other conventional neuroimages for injury prediction [41], it may conceivably enable a multi-modal injury prediction scheme combining both biomechanical responses (e.g., the strain-encoded image volume in Fig. 4, as well as strain rate, which is known to be important to brain injury but was not included in the current study) and corresponding neuroimages such as DTI to improve injury prediction performance. This is beyond the capabilities of any other kinematic or strain-based injury metric currently in use.
To summarize, our concussion classification performances using deep learning suggested that this approach is an attractive and highly competitive alternative to other injury metrics currently in use. Further explorations and larger well-documented injury datasets will reveal whether this technique has the potential to extend its application in TBI research (biomechanics and neuroimaging).

Explicit Injury Metrics
A previous study identified Peak-CC to considerably outperform BrIC in AUC using all of the reconstructed NFL impacts as a single training dataset (0.9488 vs. 0.8629 [14]). In contrast, here we reported the opposite (e.g., AUC of 0.7707 and 0.8293 for Peak-CC and BrIC in the training dataset, respectively, for the 49/9 configuration; Fig. 6). This suggested disparities between the two head models and their analysis approaches. Perhaps most notably, the two models differ in material properties (isotropic, homogeneous vs. anisotropic for the WM). In addition, they have different brain-skull boundary conditions (nodal sharing via CSF, as in SIMon, vs. frictional sliding), mesh resolution (average size of 3.2 mm vs. 5.8 mm), method to calculate fiber strain (projection of a strain tensor vs. assigning averaged fiber directions to FE elements), and even segmentation of the CC [12].
Nevertheless, improving a model's injury predictive power is a constant process. Together with more well-documented real-world injury cases, further comparison of injury prediction performances across models is important to understand how best to improve. A high AUC in a training dataset does not necessarily indicate the same high level of AUC or other performance metrics in cross-validation, as found in this study. Therefore, it is important that future studies utilize cross-validation, rather than training or fitting, performances for objective evaluation and comparison. In this case, the random subsampling technique presented here may help.

Limitations
An important limitation of the WHIM lies in its use of isotropic, homogeneous material properties for the brain [22], despite the well-known anisotropy of the WM. Incorporating WM material anisotropy appears relatively straightforward (e.g., via an existing Holzapfel-Gasser-Ogden model [42] in Abaqus). However, a fresh set of model validation is necessary, which is beyond the scope of our current study. Nevertheless, importantly, the deep learning framework presented here would remain applicable when, presumably, more accurate estimates of WM fiber strains become available.
Extensive discussions also exist on errors in the impact kinematics [23], the resulting uncertainties in model results, and implications in injury prediction [14,16,18]. Unfortunately, errors in head angular motion kinematics were only available for acceleration but not for angular velocity, which is considered to be the main driver for brain strains [6,25]. This may place model response estimates at an even greater uncertainty than previously thought, which may have partly contributed to the drop in deep learning cross-validation performances despite the high AUCs achieved for the training datasets.
The relatively small injury dataset available for analysis was another major limitation in this study that precluded a more conclusive and statistically significant performance comparison. In addition, for each training dataset, all the remaining cases were used for cross-validation. This created unbalanced validation datasets (i.e., varying sizes to maintain a sum of 58 cases). Increasing the training dataset sample size did appear to slightly increase the cross-validation accuracy for deep learning (Fig. 8a). However, a fairer performance comparison across different sizes of training/validation datasets would be to use an identical cross-validation dataset. This was possible (e.g., using 9 cross-validation samples) but challenging, due to higher STDs with smaller cross-validation datasets (as observed in Figs. 5 and 8). Nevertheless, more well-documented injury cases will certainly facilitate better evaluation and comparison of injury prediction performances. This remains an important task for researchers at present and in the future.
Further, we have used a generic model with the corresponding neuroimages of one individual to study a population, as no images were available for the athletes. Inter-subject variation in neuroimaging does exist, which could lead to uncertainties in strain response on an individual basis [43]. Nevertheless, a generic model is a critical stepping stone at present towards developing individualized models coupled with their own neuroimages for more personalized investigation in the future. This is analogous to the typical 50th percentile head models currently in use that do not yet directly correspond to detailed neuroimages [19].
Finally, for the deep learning approach itself, a clear guideline is lacking on how best to design the network structure, and we empirically adopted a similar structure proven to be successful before [33]. However, a more systematic investigation may be necessary to fine-tune the network structure in the context of concussion classification to improve the performance further. Another well-recognized "limitation" of the deep learning approach, perhaps, is that the technique behaves much like a "black box", without an obvious physical interpretation of its internal decision mechanism. The features used to arrive at the injury-vs.-non-injury decision were determined by the computer algorithm, rather than based on explicit features as in a traditional regression method (i.e., "white box"). Despite the reliance on "experience" in designing the network structure and the difficulty of "interpretability", the deep learning framework based on implicit strain features of the entire WM for concussion classification is an attractive and competitive alternative that has not been explored before. Therefore, findings from this study may offer important fresh insight into how best to predict brain injury, including concussion, in the future.

Conclusion
We have developed a deep learning approach to classify concussion based on voxel-wise WM fiber strains of the entire brain. Unlike other methods that rely on an explicit, pre-defined injury metric, this technique implicitly utilizes strain-encoded image features to automatically distill the most discriminative feature vector for concussion classification. Based on the reconstructed NFL injury cases and a repeated random subsampling technique for objective performance evaluation, we showed that this technique is an attractive and highly competitive alternative for brain injury prediction.

Fig. 2
Fig. 2 Structure of the deep learning network. The network contained five fully connected layers that progressively compressed the fiber-strain-encoded image features into, ultimately, a two-unit feature vector for concussion classification.
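The layer-wise compression described in the caption can be sketched as a plain feed-forward pass. This is a minimal illustrative sketch only: the paper specifies five fully connected layers ending in a two-unit output, but the intermediate layer widths, the input dimensionality, and the ReLU/softmax choices below are placeholder assumptions, not the published architecture.

```python
import numpy as np

# Hypothetical layer widths: five fully connected layers compressing a
# strain-encoded feature vector down to a two-unit (injury vs. non-injury)
# output. Only the two-unit output is taken from the paper.
LAYER_SIZES = [1024, 256, 64, 16, 4, 2]

def init_network(sizes, rng):
    """Random weights and zero biases for each fully connected layer."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(params, x):
    """Progressively compress the feature vector through the layers."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return softmax(x @ W + b)  # two-unit classification score

rng = np.random.default_rng(0)
net = init_network(LAYER_SIZES, rng)
probs = forward(net, rng.standard_normal(LAYER_SIZES[0]))
```

With trained weights, the larger of the two output units would give the predicted class; here the weights are random and the sketch only demonstrates the shape of the computation.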

Fig. 3
Fig. 3 Illustration of error functions for the training and cross-validation datasets (top), along with the corresponding cross-validation accuracy (bottom), vs. training iteration for three randomly generated trials (39/19 configuration of training and cross-validation samples). Maximum cross-validation accuracies were achieved when the training error fell below 0.45 and after 10000 iterations for the illustrated configuration. CV: cross-validation
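The empirical stopping rule implied by the caption (and tabulated per configuration in Table 3) can be written as a simple predicate. A hedged sketch, assuming the rule is simply "training error below the threshold AND a minimum number of iterations elapsed"; the default values below are those quoted for the illustrated 39/19 configuration and would differ for the other configurations.

```python
def should_stop(train_error, iteration,
                error_threshold=0.45, min_iterations=10000):
    """Empirical stopping rule: stop training once the training error
    falls below the configuration-specific threshold and at least the
    minimum number of iterations has elapsed."""
    return train_error < error_threshold and iteration >= min_iterations
```

In a training loop, this predicate would be checked each iteration to decide when to freeze the network weights for evaluation.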

Fig. 5
Fig. 5 Comparison of the mean AUC values over the training (a) and cross-validation (b) datasets. The AUCs of the deep learning approach were consistently and significantly larger than those of the other methods over the training datasets, thanks to its supervised training (a). They remained comparable to, and often higher than, those of the other methods for the cross-validation datasets (b).

Fig. 6 Fig. 7
Fig. 6 ROCs corresponding to the highest (blue) and lowest (red) AUCs for each injury prediction method and training/validation configuration over the training datasets. For all cases shown, deep learning outperformed all other methods thanks to its supervised training. TP: true positive rate (sensitivity); FP: false positive rate (1 − specificity)

Fig. 8
Fig. 8 Comparison of cross-validation accuracy (a), sensitivity (b), and specificity (c) of the five injury classification methods. For all measures, the STD increased as the validation dataset sample size decreased, as expected.

Table 2. Summary of training/cross-validation dataset configurations based on a total of 58 injury cases. A total of 50 random trials were generated for each configuration.
Both training and cross-validation datasets could randomly select concussed and non-injury cases. Given the relatively small sample size (N=58), we chose a range of training dataset sample sizes (19, 29, 39, and 49, respectively; Table 2) and used all of the remaining cases for cross-validation. For each configuration (referred to as 19/39, 29/29, 39/19, and 49/9, respectively), a total of 50 random trials were generated for evaluation.
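The repeated random subsampling protocol described above can be sketched directly: for each training size, each trial draws a fresh random training set and assigns all remaining cases to cross-validation. A minimal sketch with stdlib `random`; the case indices and seed are placeholders, and class balancing between concussed and non-injury cases is not enforced here (consistent with the note that both sets "could randomly select" either class).

```python
import random

TOTAL_CASES = 58            # reconstructed NFL injury cases
TRAIN_SIZES = [19, 29, 39, 49]  # configurations 19/39, 29/29, 39/19, 49/9
N_TRIALS = 50               # random trials per configuration

def make_splits(n_total, n_train, n_trials, seed=0):
    """Repeated random subsampling: each trial draws a fresh training
    set; all remaining cases form the cross-validation set."""
    rng = random.Random(seed)
    cases = list(range(n_total))
    splits = []
    for _ in range(n_trials):
        train = rng.sample(cases, n_train)
        held_out = set(train)
        cv = [c for c in cases if c not in held_out]
        splits.append((train, cv))
    return splits

splits = {n: make_splits(TOTAL_CASES, n, N_TRIALS) for n in TRAIN_SIZES}
```

Each classifier would then be trained and scored on every one of the 50 splits per configuration, with performance reported as the mean and STD across trials.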

Table 3 .
Summary of the empirical training error thresholds used to determine the number of training iterations.

Table 4 .
The best and worst cross-validation accuracies for the five injury prediction methods.