Extreme Learning Machine-Based Classification of ADHD Using Brain Structural MRI Data

Background Effective and accurate diagnosis of attention-deficit/hyperactivity disorder (ADHD) is currently of significant interest. ADHD has been associated with multiple cortical features from structural MRI data. However, most existing learning algorithms for ADHD identification have obvious drawbacks, such as time-consuming training and the need for parameter selection. The aims of this study were as follows: (1) to propose an ADHD classification model using the extreme learning machine (ELM) algorithm for automatic, efficient and objective clinical ADHD diagnosis; and (2) to assess the computational efficiency and the effect of sample size on both ELM and support vector machine (SVM) methods and to analyze which brain segments are involved in ADHD. Methods High-resolution three-dimensional MR images were acquired from 55 ADHD subjects and 55 healthy controls. Multiple brain measures (cortical thickness, etc.) were calculated using a fully automated procedure in the FreeSurfer software package. In total, 340 cortical features were automatically extracted from 68 brain segments with 5 basic cortical features. F-score and sequential forward selection (SFS) methods were adopted to select the optimal features for ADHD classification. Both ELM and SVM were evaluated for classification accuracy using leave-one-out cross-validation. Results We achieved ADHD prediction accuracies of 90.18% for ELM using eleven combined features, 84.73% for SVM-Linear and 86.55% for SVM-RBF. Our results show that, for ADHD classification, ELM has better computational efficiency than SVM and is more robust to changes in sample size. The most pronounced differences between ADHD and healthy subjects were observed in the frontal lobe, temporal lobe, occipital lobe and insula. Conclusion Our ELM-based algorithm for ADHD diagnosis performs considerably better than the traditional SVM algorithm. This result suggests that ELM may be used for the clinical diagnosis of ADHD and the investigation of different brain diseases.


Introduction
Attention-deficit/hyperactivity disorder (ADHD) is one of the most prevalent behavioral disorders in childhood and adolescence. Approximately 5% of school-age children and 2-4% of adults are diagnosed with ADHD or have ADHD-associated symptoms [1]. ADHD is typically characterized by inattention, hyperactivity, impulsivity and impaired executive function, and its diagnosis is normally made on the basis of these behavioral symptoms. However, there is currently no diagnostic laboratory test for ADHD. ADHD diagnosis may include psychological tests, such as the ADHD Rating Scale (ADHD-RS), Conners Parent Rating Scale and Brown Attention Deficit Disorder Scale (BADDS). The efficiency of the diagnostic process is generally low because testing requires a long, tedious clinical interview. In addition, traditional ADHD diagnosis methods commonly lead to misdiagnosis. For instance, approximately 20% of children are misdiagnosed because they are younger than their classmates [2,3]. Therefore, a rapid, accurate and objective diagnostic tool is needed to improve the understanding, prevention and treatment of ADHD.
To aid the development of a new ADHD diagnostic method, objective experimental differences between ADHD and control subjects (CS) should be defined. To date, most studies have explored differences in the connectivity of complex human brain networks between ADHD and normal children [4][5][6][7][8]. Most of these studies employ electroencephalographic (EEG) or magnetoencephalographic (MEG) detection technology to record electromagnetic brain activity. However, these recordings are subject to electromagnetic interference from the external environment, such as 50 Hz power-line interference, or signal reductions by the human skull [9][10][11][12]. Structural imaging tools, such as magnetic resonance imaging (MRI) and functional MRI, have been extensively utilized to study the anatomical aspects of human brain disorders and to identify the fundamental differences between ADHD and normal subjects [13][14][15][16]. Additionally, brain imaging technologies have been applied to ADHD diagnosis and classification. In early work, researchers used single-photon emission computed tomography (SPECT) to compare the pattern of regional cerebral perfusion in groups of children with ADHD during a computerized performance test [17]. With the development of imaging techniques, a growing number of noninvasive imaging technologies have begun to be applied in ADHD classification, in particular two prominent kinds of imaging information: morphological information based on brain MRI data and brain connectivity based on functional MRI [18,19].
In the past several years, numerous anatomic imaging studies have accrued evidence for structural brain abnormalities in ADHD. Results for children with ADHD from recent findings showed a decrease in total cortical volume of over 7 and 8% and a decrease in surface area of over 7% bilaterally [20]. Anatomical abnormalities have also been observed in cortical thickness and folding, especially in posterior brain regions and anterior brain regions, including left/right superior temporal and parietal lobes, temporoparietal junction, and insula [21,22]. All these abnormalities in ADHD suggest that structural MRI data of human brain should be a kind of ideal classification feature for ADHD diagnosis.
Moreover, structural MRI has a high resolution and uses relatively stable imaging technology. Several studies using structural MRI have demonstrated anatomical differences between ADHD and normal children [23][24][25]. Anatomical MRI showed that the maturation of cortical thickness and the surface area developmental trajectory of the right prefrontal cortex is delayed in ADHD children relative to typically developing children [7]. Additionally, machine pattern recognition techniques based on structural MRI data have been extensively applied to diagnose many diseases. For example, brain tumor volume can be obtained from structural MRI data using computer-aided diagnosis [26,27]. Outstanding Alzheimer's disease (AD) classification accuracy has been achieved using whole-brain anatomical MRI with SVM, which can aid early AD diagnosis [28][29][30][31]. These successful examples of brain disease diagnosis prompted us to develop a method that combines brain morphological MRI with a learning machine method, which may be used to supplement existing cognitive batteries during diagnostic procedures.
Figure 1 (caption, partially recovered). The upper and lower images refer to the pial vertices (outer gray surface) and white vertices (inner gray surface), respectively, that were extracted and reconstructed in stereotaxic space from (A). (C) Five basic cortical features, including thickness, surface area, folding index, curvature and volume, were measured from the divisional cortical surfaces, comprising a total of 340 brain features for each subject. (D) All the brain features were normalized to the range from 0 to 1. (E) The normalized data were rearranged in accordance with the F-score in descending order. (F) The SFS method was used to further select the features that enhance the classification accuracy. (G) The classification accuracy of both ELM and SVM learning algorithms was tested using the leave-one-out cross-validation method. doi:10.1371/journal.pone.0079476.g001
To date, traditional machine learning techniques have been utilized to distinguish the MRI data of two groups of subjects, but these techniques have several obvious defects. These include time-consuming training on the experimental dataset, loss of classification efficiency as the sample size changes, and the need to select one or more parameters for the classifier [29,30]. For example, when classifying mild cognitive impairment subtypes using a support vector machine, Haller and colleagues had to iteratively explore the parameter gamma from 0.01 to 0.09 [32]. In addition, the testing accuracy is not always satisfactory for practical classification applications [31].
In this study, we focused on developing an automatic, effective, rapid and accurate ADHD diagnosis method to overcome the deficiencies of traditional methods. We first proposed an ADHD classification model using the extreme learning machine (ELM) with F-score and SFS feature selection methods to provide objective clinical diagnosis. The simple and efficient ELM method was introduced to build a robust model for ADHD classification. It is based on 5 basic cortical properties: thickness, surface area, folding index, curvature and volume. Our findings demonstrate that the ELM learning model achieves higher accuracy than the commonly used SVM learning algorithm and also performs better in terms of computing efficiency and dependence on experimental dataset size. We also found that the surface area (SA) and volume (V) data of the human brain provide the most salient information for discriminating between ADHD and CS.

Subjects
The data used in the present study were part of the dataset from the Peking University (Peking_1 and Peking_2) ADHD-200 Global Competition Test Dataset (http://fcon_1000.projects.nitrc.org/indi/adhd200/). The dataset contains a total of 152 subjects, including 59 ADHD subjects and 93 healthy controls. Fifty-five of the 59 ADHD subjects were selected for the current study according to the age range from 9 to 14 years (mean age 11.8), and 4 overage subjects were excluded. Fifty-five of the 93 healthy adolescents, age-matched to the ADHD group, were selected to form the control group (mean age 11.5). Patients with a history of medication use were also included. The inclusion criteria were as follows: 1) right-handedness; 2) no lifetime history of head trauma with loss of consciousness; 3) no history of neurological disease, and no diagnosis of schizophrenia, affective disorder, pervasive development disorder, or substance abuse; and 4) full-scale Wechsler Intelligence Scale for Chinese Children-Revised (WISCC-R) score of greater than 80.

MRI
MRI data were downloaded from the ADHD-200 Global Competition website (http://fcon_1000.projects.nitrc.org/indi/adhd200/). A description of the Peking University ADHD-200 Global Competition data acquisition can be found in the scan parameters item of the website. Briefly, the MRI data were collected using a SIEMENS TRIO 3-Tesla scanner. The MRI protocol included acquiring a high-resolution T1-weighted MPRAGE volume (voxel size $1.3 \times 1.0 \times 1.3$ mm$^3$) using a custom pulse sequence with the following parameters: 2530/3.39 ms (TR/TE) and 1.33 mm (slice thickness).

MRI Data Processing
The FreeSurfer 5.10 software package was utilized for cortical reconstruction and volumetric segmentation (FreeSurfer v5.10, http://surfer.nmr.mgh.harvard.edu/fswiki). For processing, the original MRI data were first subjected to a series of preprocessing steps, including motion correction, T1-weighted image averaging, registration of the volume to Talairach space and stripping the skull with a deformable template model ( Figure 1A). By encoding the shape of the corpus callosum and pons in the Talairach space and following the intensity gradients from the white matter to the cerebrospinal fluid, the white surface and the pial surface were generated for each hemisphere ( Figure 1B). Once these surfaces were known, a cortical surface-based atlas was mapped to a sphere aligning the cortical folding patterns, which provided accurate matching of the morphologically homologous cortical locations across subjects. The average shortest distance between white and pial surfaces denoted the cortical thickness at each vertex of the cortex. Surface area was calculated by computing the area of every triangle in a standardized spherical surface tessellation. The local curvature was computed using the registration surface based on the folding patterns. The folding index over the whole cortical surface was measured using the method developed by Schaer. In the present study, the FreeSurfer pipeline was used to automatically generate the five basic cortical features. Each basic feature was divided into 68 components based on brain segments, which comprise a total of 340 cortical features for each subject ( Figure 1C). The indexes of 340 cortical features are briefly presented in Table 1.

Feature Selection
After normalizing all the brain feature data to the range from 0 to 1 (Figure 1D), we utilized the F-score method (Figure 1E) and the sequential forward selection (SFS) method (Figure 1F) to select, from the 340 cortical features, those that achieve a high classification accuracy. We then set the selected features as the experimental dataset for ADHD classification. The basic principles of these two feature selection methods are briefly described below.
4.1. F-Score. F-score (Fisher score) is a simple and efficient feature selection criterion obtained by measuring the discrimination between two sets of real numbers [33]. Given training vectors $x_i$, $i = 1, \ldots, l$, the F-score of the $j$th feature is defined as
$$F(j) = \frac{\left(\bar{x}_j^{(+)} - \bar{x}_j\right)^2 + \left(\bar{x}_j^{(-)} - \bar{x}_j\right)^2}{\dfrac{1}{n_+ - 1}\sum_{i=1}^{n_+}\left(x_{i,j}^{(+)} - \bar{x}_j^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{i=1}^{n_-}\left(x_{i,j}^{(-)} - \bar{x}_j^{(-)}\right)^2}$$
where $n_+$ and $n_-$ are the number of positive and negative instances, respectively, $x_{i,j}^{(+)}$ and $x_{i,j}^{(-)}$ are the $j$th feature of the $i$th positive and negative instance, respectively, and $\bar{x}_j$, $\bar{x}_j^{(+)}$ and $\bar{x}_j^{(-)}$ are the averages of the $j$th feature over the whole, positive and negative datasets, respectively. A larger F-score indicates that the feature is more discriminative, because the numerator measures the variance between the two classes and the denominator measures the variance within each class.
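The F-score ranking described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' MATLAB code: the function names `f_score` and `rank_features` and the +1/-1 label encoding are assumptions made for the example.

```python
from statistics import mean


def f_score(pos, neg):
    """F-score of a single feature.

    pos, neg: values of that feature in the positive and negative class.
    Assumes each class has at least two samples and the feature is not
    constant within both classes (otherwise the denominator is zero).
    """
    mean_all = mean(list(pos) + list(neg))
    mean_pos, mean_neg = mean(pos), mean(neg)
    # Numerator: spread of the class means around the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the two within-class sample variances.
    within = (sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1))
    return between / within


def rank_features(X, y):
    """Return (feature indices in descending F-score order, list of scores).

    X: list of samples, each a list of feature values.
    y: list of labels, +1 for one class and -1 for the other (assumed encoding).
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        scores.append(f_score(pos, neg))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores
```

For example, `f_score([1, 2, 3], [4, 5, 6])` gives 2.25: the class means differ by 3 while each class has unit sample variance.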
4.2. SFS. Sequential forward selection (SFS) is a simple and efficient feature selection approach [34]. Starting from an empty set, a subset is built by iteratively adding one feature at a time, each time choosing the feature that gives the maximum intermediate criterion value. The procedure is repeated until a subset of d features has been generated.
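A minimal sketch of the greedy SFS loop is given below. The function name `sequential_forward_selection` and the `criterion` callback are hypothetical; in the setting of this paper the criterion would presumably be the cross-validated classification accuracy obtained with the candidate subset, which is an assumption here rather than something the text specifies in detail.

```python
def sequential_forward_selection(n_features, criterion, d):
    """Greedy sequential forward selection.

    n_features: number of candidate features (indexed 0..n_features-1).
    criterion:  function mapping a list of feature indices to a score
                to be maximized (hypothetical callback).
    d:          size of the subset to build.
    Returns the selected indices in the order they were added, plus the
    criterion value reached after each addition.
    """
    selected = []
    remaining = list(range(n_features))
    history = []
    while remaining and len(selected) < d:
        # Try each remaining feature and keep the one that helps most.
        best = max(remaining, key=lambda j: criterion(selected + [j]))
        selected.append(best)
        remaining.remove(best)
        history.append(criterion(selected))
    return selected, history
```

Because the search is greedy, a feature that has been added is never removed, so the result is not guaranteed to be the globally best subset of size d.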

Classification
As shown in Figure 1G, both the SVM and ELM classifiers were applied to the experimental dataset of 110 subjects using leave-one-out cross-validation. In this validation scheme, the features of a single subject from the whole experimental dataset are used for testing and the remaining subjects are used to train the classifier. This process is repeated for all the subjects. We then evaluated the ADHD classification efficiency of both learning algorithms by comparing their average testing accuracy and classification time. The descriptions of these two learning algorithms are given below.
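The leave-one-out procedure can be sketched as follows. The `fit` and `predict` callbacks are hypothetical placeholders for any classifier; the nearest-centroid classifier included here is only a stand-in to keep the example self-contained and is not one of the classifiers used in the study.

```python
def leave_one_out_accuracy(X, y, fit, predict):
    """Leave-one-out cross-validation accuracy.

    X: list of samples (each a list of feature values); y: list of labels.
    fit(X_train, y_train) -> model; predict(model, x) -> predicted label.
    """
    correct = 0
    for i in range(len(X)):
        # Hold out subject i, train on everyone else.
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))
```

With 110 subjects this loop trains the classifier 110 times, which is why the per-fit training cost matters for the timing comparison.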
5.1. SVM learning algorithm. Support vector machines (SVM) are popular machine learning methods for classification and regression that are based on the learning theory originally developed by Vapnik and his colleagues in 1995 [35]. In SVM, an n-class problem is converted into n two-class problems. For each two-class problem, the original m-dimensional input vector x is mapped into the l-dimensional ($l \geq m$) dot product space (feature space) using a nonlinear vector function to enhance linear separability. In this high-dimensional feature space, the optimal separating hyperplane that has the maximal margin to the nearest training datum needs to be found. Once processing is completed, the testing data can also be mapped into the feature space, and then a class is assigned to the testing data.
In the present study, the LIBSVM software package was applied to implement the SVM algorithm, and the simple and efficient linear function and radial basis function (RBF) were respectively selected as the kernel functions. LIBSVM, an integrated software package that is extensively used for regression and classification in machine learning, was developed by Dr. Chih-Jen Lin and his colleagues (LIBSVM v3.12 available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
5.2. ELM learning algorithm. Extreme learning machine (ELM) is an extremely fast learning algorithm with good generalization performance that was developed by Huang and his research group [36]. Traditional learning algorithms for single hidden-layer feedforward neural networks (SLFNs), such as the back propagation (BP) learning algorithm, have been extensively used for research in many fields. These methods may require a search for the specific input weights and hidden layer biases to minimize the cost function, which usually makes it difficult to keep the computing speed and classification accuracy within an acceptable range. According to Theorem 1 and Theorem 2 shown in Appendix S1, the input weights $w_i$ and the hidden layer biases $b_i$ of SLFNs for ELM can be randomly assigned if the activation functions in the hidden layer are infinitely differentiable [37,38]. Therefore, training an SLFN is equivalent to finding a least squares solution $\hat{\beta}$ of the linear system $H\beta = T$. However, in most cases the number of hidden nodes is far less than the number of distinct training samples ($\tilde{N} \ll N$), which means that $H$ is not a square matrix and there may not exist a $\beta$ such that $H\beta = T$. According to Theorem 3, the smallest norm least squares solution of the linear system is
$$\hat{\beta} = H^{\dagger} T,$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $H$.
With the completion of the model of the ELM algorithm, the testing data could be efficiently classified.
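The ELM training step described above reduces to one random projection and one pseudo-inverse. The following NumPy sketch illustrates this under stated assumptions: the function names are hypothetical, the uniform range [-1, 1] for the random weights and biases is an assumption (the text only says they are assigned randomly), and targets are encoded as +1/-1 so that a threshold of 0 on the output separates the classes.

```python
import numpy as np


def elm_hidden(X, W, b):
    """Hidden-layer output matrix H with sigmoid activation g(x) = 1/(1+exp(-x))."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def elm_train(X, T, n_hidden=20, seed=0):
    """Train a single-hidden-layer ELM.

    X: (N, m) feature matrix; T: (N,) targets, e.g. +1 / -1.
    Input weights and biases are drawn at random and never tuned;
    only the output weights beta are solved for.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # assumed range
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # assumed range
    H = elm_hidden(X, W, b)
    beta = np.linalg.pinv(H) @ T  # smallest-norm least squares solution
    return W, b, beta


def elm_predict(model, X):
    """Continuous ELM output; the sign gives the class at threshold 0."""
    W, b, beta = model
    return elm_hidden(X, W, b) @ beta
```

A call such as `np.sign(elm_predict(elm_train(X_train, T_train), X_test))` would then yield class labels. No iterative optimization is involved, which is the source of the speed advantage discussed in the text.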

Selection of Classification Algorithm Parameters
Our extreme learning machine (ELM) training and classification program was written in MATLAB based on the research theories of Dr. Huang. In this study, we selected a simple sigmoidal activation function $g(x) = 1/(1 + \exp(-x))$ and set the number of hidden nodes to 20. The SVM classification simulations were carried out using the MATLAB interface to the C-coded LIBSVM package developed by Dr. Lin's team. In our experiments, two parameters, $C$ and $\gamma$, for the radial basis function (RBF) SVM and one parameter, $C$, for the linear SVM needed to be determined according to the LIBSVM user guidelines. Because the SVM algorithm performs particularly poorly on the experimental dataset when the default parameter settings are used, we applied the grid-search method on $C$ and $\gamma$ to obtain suitable parameters for the SVM algorithm before training. A practical method of identifying good parameters involves trying exponentially growing sequences of $C$ and $\gamma$. The pair of $(C, \gamma)$ values with the best cross-validation accuracy is selected as the best setting. In the present study, the search ranges of these two parameters were set to $C = [2^{-5}, 2^{-4}, \ldots, 2^{8}]$ and $\gamma = [2^{-4}, 2^{-3}, \ldots, 2^{12}]$. In addition, it is worth noting that, although the grid-search method may improve the classification accuracy of the SVM algorithm, it also significantly increases the total training time of SVM. This will be discussed below in the computational efficiency section.
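To make the cost of the grid search concrete, the sketch below builds the two exponential parameter sequences stated above and loops over every pair. The `evaluate` callback is a hypothetical placeholder for "train an SVM with these parameters and return its cross-validation accuracy"; no LIBSVM call is shown.

```python
from itertools import product

# Exponentially growing sequences, as given in the text.
C_grid = [2.0 ** k for k in range(-5, 9)]       # 2^-5, 2^-4, ..., 2^8
gamma_grid = [2.0 ** k for k in range(-4, 13)]  # 2^-4, 2^-3, ..., 2^12


def grid_search(evaluate, C_values, gamma_values):
    """Exhaustive search over all (C, gamma) pairs.

    evaluate(C, gamma) -> cross-validation accuracy (hypothetical callback).
    Returns (best_C, best_gamma, best_accuracy); ties keep the first pair seen.
    """
    best = None
    for C, gamma in product(C_values, gamma_values):
        acc = evaluate(C, gamma)
        if best is None or acc > best[2]:
            best = (C, gamma, acc)
    return best
```

With 14 values of C and 17 values of gamma, the RBF kernel requires 238 full cross-validation runs before the final model is trained, whereas the linear kernel needs only the 14 values of C.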
Additionally, because the threshold of each binary decision function may substantially affect classification performance, it should be determined according to the receiver operating characteristic (ROC) curves. In the current study, the thresholds of all three algorithms were set to the default value of 0, because at this threshold the classifiers showed a balanced trade-off between true positive rate and false positive rate.

Permutation Tests
Permutation tests have been adopted to assess the statistical significance of a classifier and its performance in many research fields [39,40]. The procedure is briefly as follows: choose a statistic of the classifier, randomly permute the class labels of the training data before training, perform cross-validation on the permuted training set, and repeat the procedure as many times as needed. In this study, the generalization rate was selected as the statistic and the number of repetitions was set to 10000. The null hypothesis was that the classifier could not learn the relationship between data and labels reliably. The P-value $P(GR_{ELM})$ represents the probability of observing, on randomly relabeled data, a prediction rate no less than $GR_{ELM}$, the rate obtained by the classifier trained on the real labeled data. If the generalization rate $GR_{ELM}$ exceeded the 95% confidence interval of the classifier trained on randomly relabeled data, the null hypothesis was rejected and the classifier was considered to have learned the relationship with a probability of being wrong of at most $P(GR_{ELM})$.
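A minimal sketch of such a label-permutation test is shown below. The function name `permutation_p_value` and the `score_fn` callback are hypothetical; `score_fn` stands for "run cross-validation with these labels and return the generalization rate". The "+1" correction in the estimate is a common convention assumed here, not something stated in the text.

```python
import random


def permutation_p_value(observed, score_fn, labels, n_permutations=10000, seed=0):
    """Estimate the P-value of an observed generalization rate.

    observed:  rate obtained with the true labels.
    score_fn:  function(labels) -> cross-validated rate (hypothetical callback).
    labels:    the true class labels; a shuffled copy is passed to score_fn.
    Returns the fraction of permutations whose rate is no less than
    `observed`, with the +1 correction so the estimate is never exactly 0.
    """
    rng = random.Random(seed)
    permuted = list(labels)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(permuted)  # break the link between data and labels
        if score_fn(permuted) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)
```

Each permutation requires a full cross-validation run, so the fast training of ELM is what makes 10000 repetitions practical.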

Performance of ELM, SVM-Linear and SVM-RBF in ADHD Classification based on F-score Feature Selection
The F-score feature ranking method was used to arrange the 340 features of ADHD and CS in descending order according to the F-score value. We combined each feature with all preceding features to form an experimental dataset. For example, the seventh feature ($F_7$) would be combined with the previous six features ($F_1, F_2, \ldots, F_6$) to build the seventh experimental dataset ($ED_7$). This process was repeated for all the features in sequential order to generate 340 experimental datasets ($ED_1, ED_2, \ldots, ED_{340}$). Next, leave-one-out cross-validation was applied to compare the performance of the ELM and SVM methods in ADHD classification. The results are shown in Figure 2.
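The construction of the nested experimental datasets can be sketched as follows. The generator name `nested_datasets` is hypothetical, and `ranked` is assumed to be a list of feature indices already sorted by descending F-score.

```python
def nested_datasets(X, ranked):
    """Yield (k, ED_k) for k = 1..len(ranked).

    X:      list of samples, each a list of feature values.
    ranked: feature indices in descending F-score order (assumed given).
    ED_k contains, for every sample, the k top-ranked features.
    """
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        yield k, [[row[j] for j in cols] for row in X]
```

Each `ED_k` would then be passed to the cross-validation routine for each classifier, giving one accuracy value per classifier for every k.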
The overall testing accuracy of the ELM algorithm in ADHD classification was significantly higher than that of both SVM algorithms. Because the highest accuracies of these methods were obtained mainly with the earlier experimental datasets, we list the detailed results of the first 50 experimental datasets in Table 2. The ELM learning algorithm achieved a maximum classification accuracy of 70% at the forty-sixth experimental dataset ($ED_{46}$). The SVM-Linear and SVM-RBF algorithms reached maxima of 67.27% at the twenty-seventh experimental dataset ($ED_{27}$) and 66.36% at the eighth experimental dataset ($ED_8$), respectively. Thus, we concluded that ELM has a better accuracy in ADHD classification than both SVM algorithms.
For the SVM algorithm, we considered the grid-search time separately from the SVM training time because it is much longer than the normal training time (more than 1000 times longer). For both SVM variants, the ratio of SVM grid-search time to ELM training time for the first 50 experimental datasets increased rapidly with increasing experimental dataset size (Figure 3). This means that the ELM algorithm is much faster at ADHD classification than the SVM algorithm, especially when the experimental dataset is very large.

ADHD Classification Accuracy Enhancement by SFS
The results of ADHD classification show that all three classification algorithms reach their maximum accuracy by the forty-sixth experimental dataset. To further enhance the classification accuracy, the sequential forward selection (SFS) method was executed on the first 46 features ranked by the F-score method, and the results are shown in Figure 4.
The testing accuracies of all three methods in ADHD classification improved, as detailed in Table 3. The ELM algorithm achieved a maximum testing accuracy of 90.18% at the eleventh experimental dataset ($ED_{11}$), while the SVM-Linear and SVM-RBF algorithms reached maxima of 84.73% at the fifteenth experimental dataset ($ED_{15}$) and 86.55% at the nineteenth experimental dataset ($ED_{19}$), respectively. The ELM algorithm performs significantly better than both SVM-Linear (paired t-test, $p < 0.001$) and SVM-RBF (paired t-test, $p < 0.001$).
To further compare the three methods, receiver operating characteristic (ROC) curves were generated by varying a threshold applied to the continuous prediction score that each of the algorithms generated (Figure 5). The area under the ROC curve (AUC) is 0.8757 for ELM, 0.7792 for SVM-Linear and 0.8258 for SVM-RBF. Therefore, ELM performs the best for discriminating ADHD patients from healthy controls.
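The AUC can be computed directly from the continuous prediction scores without drawing the curve, using its equivalence to the probability that a randomly chosen positive subject scores higher than a randomly chosen negative one. The sketch below is one standard way to do this; the function name `roc_auc` and the +1 encoding of the positive class are assumptions for the example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from continuous scores.

    scores: continuous classifier outputs, one per subject.
    labels: true labels; +1 marks the positive class (assumed encoding).
    Counts, over all positive/negative pairs, how often the positive
    subject receives the higher score (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the three values reported above.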

Permutation Tests for ELM
The permutation distribution of the estimate using the ELM classifier is shown in Figure 6. With the generalization rate as the statistic, cross-validation was performed on the 11 most discriminating features and the permutation test was repeated 10000 times. Figure 6 indicates that the ELM classifier learned the relationship between the data and the labels with a probability of being wrong of $< 0.0001$.

Discussion
In this study, we established an automatic and efficient ADHD classification method using the ELM learning algorithm on structural MRI data to provide accurate, objective clinical diagnosis. There were two main findings. First, our results indicate that it is possible to classify ADHD and control subjects with a high degree of accuracy using an automatic procedure that combines structural MRI features with ELM. Classification of ADHD and control subjects achieved a prediction accuracy of 90.18%. Such a high testing accuracy could improve the accuracy of computer-aided diagnosis in practice. Second, we demonstrated that the ELM method is much faster (more than 1000 times faster) than other prediction models, such as SVM, making the ELM algorithm a highly efficient method for ADHD diagnosis.

Efficient Brain Structure Features in ADHD Classification
The cortex can be divided into five major segments according to the anatomical structure and function of the human brain: the frontal lobe, the occipital lobe, the parietal lobe, the temporal lobe and the cingulate. To further understand the relationship between different brain segments and the etiology of ADHD, we took the 11 most discriminative brain structure features from the classification results and categorized them by major lobe, as shown in Table 4.
The cuneus and lingual gyrus are portions of the occipital lobe. Both are linked to receiving and processing visual information, especially information related to letters. Disorders of these regions can lead to confusion of visual information, which may in turn cause inattention. Additionally, the insular cortex is a portion of the cerebral cortex folded deep within the lateral sulcus, which separates the temporal lobe from the frontal lobe. Numerous studies have established that the frontal lobe, temporal lobe and insula are mainly associated with attention, motivation, sensory processing, emotion and memory, which are likely to be involved in ADHD behavioral symptoms, such as inattention, hyperactivity, impulsivity and impaired executive function. In addition, since the ELM classification relied heavily on the anatomical MRI data of these regions, these findings could indicate that the cortical regions mentioned above show the most ADHD-related structural changes in the human brain.

Computational Efficiency of ADHD Classification
The computational efficiency of a pattern recognition method directly influences the performance of ADHD diagnosis in practice. An ideal ADHD machine classification method should achieve both high discrimination accuracy and fast classification speed. In the data presented in Figure 3, the ADHD classification time of ELM was significantly lower than that of both SVM algorithms. This may be because the SVM algorithm requires several user decisions, including the choice of the parameters $C$ and $\gamma$, which usually take considerable extra training time. In contrast, the ELM learning algorithm chooses hidden nodes randomly and determines the output weights of the feedforward neural network analytically by calculating the Moore-Penrose generalized inverse $H^{\dagger}$ of the hidden layer output matrix $H$. This has important implications. In particular, it indicates that the ELM algorithm does not need to spend extra training time on parameter searches and is nearly unaffected by changes in experimental dataset size. Another major strength of our ADHD classifier is its relatively high classification accuracy (a maximum prediction accuracy of 90.18%). All of these results suggest that ELM achieves higher computing efficiency than SVM and can be efficiently applied to ADHD classification. It is also worth noting that, although the ELM algorithm generalizes better than conventional learning methods, choosing too many hidden layer nodes may lead to overfitting and degrade performance in practical applications. Therefore, it is essential to determine the optimal number of nodes before training to avoid overfitting.
Table 3. Comparison of the training and testing accuracy of ELM and SVM in ADHD classification.

Influence of Subject Sample Sizes
For traditional pattern recognition methods, a large training sample is usually necessary to ensure classification accuracy because most common pattern recognition algorithms are probabilistic and use statistical inference to determine the best label for a given instance [41][42][43][44]. For example, several recent reports have demonstrated good performance in AD classification using different modalities of features. One of the common practices in these previous studies is the utilization of hundreds of training samples to achieve better classification accuracy [44][45][46][47]. The dependency of a classifier on training sample size is also an important criterion for evaluating the performance of a classifier. To further compare the ADHD classification performance of ELM and SVM for different experimental dataset sizes, we randomly extracted and combined data from the preprocessed MRI datasets of all 110 subjects into eleven new experimental datasets containing 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 and 110 subjects, respectively. Each new experimental dataset consists of half ADHD subjects and half healthy controls. All three algorithms were used to evaluate the eleven ADHD experimental datasets. The results are shown in Figure 7.
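The balanced subsampling described above can be sketched as follows. The function name `balanced_subsample`, the fixed random seed and the +1 encoding for ADHD subjects are assumptions made for the illustration; the text does not say how the random draws were made.

```python
import random


def balanced_subsample(labels, size, seed=0):
    """Draw `size` subject indices, half from each class.

    labels: list of class labels, +1 for ADHD and any other value
            for controls (assumed encoding).
    size:   even number of subjects to draw.
    Returns a sorted list of indices into `labels`.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    half = size // 2
    # Sample without replacement within each class.
    return sorted(rng.sample(pos, half) + rng.sample(neg, half))
```

Calling this for sizes 10, 20, ..., 110 on the 110 labels yields the eleven experimental datasets, the last of which contains every subject.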
The average training and testing rates of the three methods are all influenced to a certain extent by the experimental dataset size, while the overall ADHD classification accuracy of ELM is significantly higher than that of both SVM algorithms throughout the experiment. In contrast to SVM, the accuracy of the ELM algorithm varies more smoothly as the experimental dataset size changes (Figure 7B). This suggests that the ELM algorithm is more robust and adaptable across different experimental dataset sizes. Together with advanced feature selection methods, ELM is likely to be a powerful imaging-based pattern recognition method for ADHD diagnosis.

Effect of Medication
In our study, thirty of the 55 adolescents with ADHD had received medication. Stimulant medications are the most frequent choice of pharmaceutical treatment for ADHD. There are a number of non-stimulant medications, such as atomoxetine, that may be used as alternatives [48]. Some research shows that patients with ADHD and a medication history present abnormal brain activation in prefrontal and striatal brain regions during cognitive challenge. Atomoxetine improved inhibitory control and increased activation in the right inferior frontal gyrus [49,50]. This may be caused by atomoxetine increasing extracellular (EX) concentrations of norepinephrine and dopamine in the prefrontal cortex [51]. However, to the best of our knowledge, there is a lack of evidence that medication changes the brain structure of ADHD patients. Additionally, psychostimulant medications were withheld for at least 48 hours prior to scanning in our study. Therefore, we ignored the influence of drugs on brain structure in the current study. More work will be needed to understand the influence of ADHD medication in future studies.

Limitations
The current study only considers structural MRI data from the subjects in the ADHD-200 Global Competition. Several resting state functional connectivity studies suggest that ADHD is associated with dysfunction of large-scale brain sub-networks [52,53]. In the future, we will use additional modalities (e.g., fMRI, PET and DTI) with our current classification method to further improve ADHD classification performance. Moreover, since classification accuracy is directly affected by the selected features, an efficient feature selection method may greatly improve the performance of a learning algorithm. In our current study, two conventional feature selection methods, F-score and SFS, were combined to obtain the optimal classification features. This approach, which is based simply on geometry theory, can effectively select the optimal features. However, it cannot take into account the interrelationships among different patterns of data when classifying with data from multiple modalities. Sparse representation, one of the latest feature selection methods, has recently been demonstrated to be an efficient feature selection method in pattern recognition of structural MRI scans [54]. It has become popular because of its ability to represent high-dimensional data with compressed samples, especially in multivariate pattern analysis. Therefore, we will combine the advanced sparse representation method with multimodal data and efficient learning methods for ADHD classification in the future.

Table 4. Most discriminative brain structure features for ADHD classification (column headers: Lobe; Segmentation Feature).

Conclusion
To our knowledge, this is the first study to propose an ADHD classification model using the extreme learning machine (ELM) with F-score and SFS feature selection methods to perform objective diagnosis. Our results show that the ELM algorithm performs well and is highly efficient in discriminating ADHD subjects from healthy controls. Compared with traditional ADHD diagnosis methods, ELM has the following advantages: 1. extremely fast discrimination speed and satisfactorily high classification accuracy; 2. ADHD discrimination using objective MRI data; 3. excellent ADHD classification performance with small training sample sizes and robustness to changes in sample size; and 4. no need to select training parameters, because the hidden nodes are randomly chosen. Moreover, we observed that the frontal lobe, temporal lobe, occipital lobe and insula are potentially involved in ADHD-related structural changes in the human brain. These findings suggest that our ADHD classification method using the ELM learning algorithm is not only a promising method for aided ADHD diagnosis and the study of disease etiology but can also identify which features of the brain are involved in different diseases.