Discrimination of alcohol dependence based on the convolutional neural network

In this paper, 20 single nucleotide polymorphism (SNP) sites on the serotonin 3 receptor A gene (HTR3A) and B gene (HTR3B) are fused with age, education and marital status information, and the grid search-support vector machine (GS-SVM), the convolutional neural network (CNN) and the convolutional neural network combined with long short-term memory (CNN-LSTM) are used to discriminate between alcohol-dependent patients (AD) and a non-alcohol-dependent control group. The results show that 19 SNPs combined with education level give the best discrimination. The area under the receiver operating characteristic (ROC) curve (AUC) is 0.87 for the GS-SVM and 0.88 for the CNN-LSTM, and the CNN model performs best, with an AUC of 0.92. This study shows that the CNN model discriminates AD more accurately than the SVM, which could help patients receive treatment in time.


Introduction
Alcohol dependence is the third most serious public health problem after cardiovascular disease and malignant tumors. It is a hereditary, chronic, recurrent disease caused by long-term excessive drinking [1][2][3]. According to statistics from the National Institute on Alcohol Abuse and Alcoholism (NIAAA), more than 17 million people in the United States abuse or depend on alcohol, causing losses to American society of more than $180 billion [3]. In addition, the World Health Organization estimates that the global mortality rate caused by alcohol from 2000 to 2013 was 4.64% [4], and the global disability rate was 4.15% [1]. A large number of studies have shown that alcohol dependence not only damages personal health, causing conditions such as cirrhosis [5], gastrointestinal diseases [6,7] and brain aging [8,9], but also affects an individual's mental well-being and family life, for example through domestic violence [10,11] and depression [12]. Alcohol dependence severely harms individuals, families and the social order; therefore, rapid and efficient AD diagnosis methods are indispensable for clinical practice. However, current clinical diagnostic measures are based only on a combination of clinical interviews, questionnaires, blood tests, etc. [7,13,14]. Although these measures can reflect the state of patients to some extent, they suffer from problems such as a long diagnostic cycle and highly uncertain factors [7]; therefore, a more efficient diagnostic tool is needed to support the rapid detection of AD. An SNP is a DNA sequence polymorphism caused by the mutation of a single nucleotide in the genome and is one of the most common types of genetic variation in the human genome [15]. Because of their large data volume and uniform distribution, SNPs are widely used in the study of genetics and complex diseases [16], for example as potential biomarkers for the diagnosis and treatment of many cancers [15].
In recent years, much research has focused on SNP information related to diseases [17][18][19], and some results have been achieved. Examples include the clinical application of SNP array analysis in early pregnancy abortion [20], and single nucleotide polymorphism loop-mediated isothermal amplification (SNP-LAMP), which assists in the treatment of artemisinin-resistant malaria by reducing the toxicity caused by an overly long course of medication [21][22][23]. One study found that the serotonin receptor 5-HT3 plays an important role in rapid neurotransmission in the brain [24]. In addition, the HTR3A and HTR3B genes of the 5-HT3 receptor [25], age [26] and education [27] are closely related to alcohol dependence. In this study, based on NCBI and literature reports, we selected 10 SNP sites on each of HTR3A and HTR3B [28][29][30] and combined them with the age, education and marital status of the participants as the features for AD diagnosis.
Using machine learning methods for data mining of SNP pathogenic sites has become a research hotspot in the field of bioinformatics [16,31]. Because it can flexibly model different data sources [32], the SVM algorithm is used for many complex diseases [33][34][35]. In addition, the applicability of deep learning methods to complex diseases has been continuously explored in recent years, especially that of convolutional neural networks [36][37][38][39], while CNN-LSTM has achieved good results in medical disease classification [40,41]. However, most studies on alcohol dependence only use genetic information combined with basic mathematical statistics to find genes related to alcohol-dependent patients [25,28]. To the best of our knowledge, this study is the first to propose feature fusion based on SNPs, age, education and marital status, using the GS-SVM and CNN algorithms, as an exploratory study on the discrimination of alcohol-dependent patients. Analogous to other studies that use SNPs combined with machine learning to classify diseases [42][43][44], this study collected 154 cases of alcohol-dependent patients and 163 control samples. Using the SVM algorithm combined with grid search optimization, models were trained on combinations of the 20 SNP sites with age, education and marital status. Among these combinations, the fusion of 19 SNP sites and the education feature gives the best overall discrimination, with an accuracy, specificity and sensitivity of 87.50%, 91.30% and 83.33%, respectively. When these 20 features are used to train the CNN-LSTM model, the evaluation indicators are more evenly distributed, with an accuracy, specificity and sensitivity of 87.50%, 86.96% and 88.10%, respectively. Training the CNN model using deep learning [47] gives a better classification effect, with an accuracy, specificity and sensitivity of 92.05%, 93.48% and 90.48%, respectively.

Experimental materials
Participants were recruited to the study from January to May 2017. We collected 154 cases of alcohol-dependent patients, each diagnosed by two chief psychiatrists according to the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [48]. A further 163 control samples were randomly selected by a psychiatrist using the Alcohol Use Disorders Identification Test (AUDIT) to evaluate the severity of alcohol dependence (scores > 8 points were excluded [49]). All 317 samples were screened to exclude mental health problems and a history of major disease. A 3-5 ml peripheral blood sample was drawn from each participant, and the collection of whole blood samples and demographic data was completed by a professional physician at the Fourth People's Hospital of Urumqi.

Data analysis
Before training the model, statistical analysis was performed to remove outliers so that age, education and marital status were evenly distributed between the AD and control groups. To preserve generality while obtaining a high discriminant accuracy, we chose the Kennard-Stone (KS) algorithm to divide the data; it is widely used in classification problems [52,53] and is based on the Euclidean distance [54]. We used the SVM algorithm (LIBSVM toolbox, MATLAB R2014a) to find the features with the best discriminant effect, and then performed CNN training (Keras, Python 3.7.4) to try to improve the discriminant effect.
The model evaluation indicators of this study are the accuracy, which is the proportion of correctly predicted patients and controls among all samples; the precision, which is the proportion of correctly predicted patients among all samples predicted to be patients; the sensitivity, which is the proportion of correctly predicted patients among all true patients; the specificity, which is the proportion of correctly predicted controls among all true controls; and the area under the receiver operating characteristic curve [55]. The relevant code for this study can be obtained from GitHub (https://github.com/ChenFfha/CNN/).
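As a minimal sketch of how these indicators follow from confusion-matrix counts, the function below computes accuracy, precision, sensitivity and specificity. The function name and the example counts are assumptions made for illustration (the paper does not report a confusion matrix); the counts describe a hypothetical 88-sample test set.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return the evaluation indicators (as fractions) from confusion-matrix counts.

    tp/fn: patients predicted correctly / incorrectly
    tn/fp: controls predicted correctly / incorrectly
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only (not taken from the paper).
example = classification_metrics(tp=38, fn=4, tn=43, fp=3)
for name, value in example.items():
    print(f"{name}: {value:.2%}")
```

The AUC is not included here because it requires the ranked prediction scores rather than a single confusion matrix.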

Ethics statement
This study has been approved by the Ethics Committee of the Fourth People's Hospital of Urumqi, Xinjiang. Each participant in the case-control study signed an informed consent form before conducting the experimental test. In addition, this study was conducted in accordance with relevant regulations.

Data pre-processing and analysis
To make age, education and marital status evenly distributed between the patient and control groups, statistical analysis was performed (Fig 1). The two samples with the highest education level appeared only in the AD group, and only 18 participants were between 60 and 90 years old, with an uneven distribution [56]; these samples were treated as outliers and removed. The details of the participants after excluding outliers are shown in Appendix A. Among the remaining 297 participants, there are 141 cases of alcohol-dependent patients (AD001-AD141) and 156 control samples (CONTROL001-CONTROL156). Their ages are concentrated in the range of 24-59 years. To facilitate observation, ages of 21-30 years were coded as 1, and so on, up to ages of 50-60 years coded as 4. The participants' education ranged from primary school to undergraduate, and marital status covered married, unmarried and other situations. Visualizing these samples (Fig 2) shows that age, education and marital status are almost evenly distributed in the AD and control groups. Because both groups cover the range of ages, education levels and marital statuses among the participants, we consider the data representative.
In the figures, the horizontal axis is coded 0 for CONTROL and 1 for AD. Education level ranges from 0 to 7, representing elementary school to a master's degree; marital status ranges from 1 to 3, representing married, unmarried and other, respectively; and age ranges from 1 to 7, representing ages from 24 to 81 years quantified at 10-year intervals [57]. The red dot represents the mean, and the blue dotted line represents the median.
The SNP data of these 297 samples (141 AD and 156 controls) were encoded using PLINK, with genotype AA coded as 0, genotype AB coded as 1 and genotype BB coded as 2 [58], and the missing values of the encoded data set were then imputed at random. Fig 3 shows the Pearson correlation coefficients for the data after removing the outliers. The correlation among most features is very low, while that between rs4938056 and rs10789970 is the highest at 0.98. Taking a correlation coefficient greater than 0.8 as indicating a very strong correlation [59], only rs4938056 and rs10789970 are strongly correlated, and the linear regression plots of the SNP sites (Fig 4) show a linear relationship between them [60].
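The additive genotype coding and the Pearson correlation used above can be sketched in a few lines. This is an illustrative sketch only: the genotype strings are toy values for two hypothetical SNPs, the helper names are assumptions, and missing calls are skipped pairwise here rather than imputed at random as in the study.

```python
import math

# Additive coding: AA -> 0, AB -> 1, BB -> 2 (BA is treated the same as AB).
GENOTYPE_CODE = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}


def encode(genotypes):
    """Map genotype strings to additive codes; unknown calls become None."""
    return [GENOTYPE_CODE.get(g) for g in genotypes]


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


# Toy genotypes for two hypothetical SNPs (not data from the study).
snp_a = encode(["AA", "AB", "BB", "AB", "AA", "BB", "NN"])
snp_b = encode(["AA", "AB", "BB", "AB", "AB", "BB", "AB"])
r = pearson(snp_a, snp_b)
print(f"r = {r:.2f}")
```

A pair of SNPs whose coefficient exceeds the 0.8 threshold would be flagged as strongly correlated, which is the criterion applied to rs4938056 and rs10789970.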
Before the model training, the KS algorithm was used to extract about 70% of the data from the control group and the AD group as the training set, comprising 110 cases from the control group and 99 cases from the AD group. The remaining 46 cases of the control group and 42 cases of the AD group form the test set.
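The Kennard-Stone selection rule can be sketched as follows, assuming the standard formulation: start from the two samples that are farthest apart in Euclidean distance, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name and the toy points are assumptions for illustration; this is not the authors' implementation.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances (math.dist requires Python 3.8+).
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select:
        # Pick the candidate whose nearest selected neighbour is farthest away.
        best = max(sorted(remaining), key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D feature vectors (hypothetical, not study data).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
train_idx = kennard_stone(points, n_select=3)
test_idx = [i for i in range(len(points)) if i not in train_idx]
print(train_idx, test_idx)
```

To mirror the split described above, the function would be called once per group (control and AD) with `n_select` set to the desired training size for that group, and the unselected samples would form the test set.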

SVM model
From the data pre-processing analysis, we know that rs4938056 and rs10789970 have a very strong correlation. We therefore divide the data set for three feature sets: all 20 SNPs, the 19 SNPs with rs4938056 removed, and the 19 SNPs with rs10789970 removed. The search ranges of the penalty parameter C and the kernel function parameter g are both set to $[2^{-10}, 2^{10}]$, and the parameters are optimized by grid search with five-fold cross-validation. Finally, the SVM model is trained, and the results are shown in Fig 5. With the original genotype data the model overfits, and the accuracy is only 70.46%. When we remove rs10789970, the results do not change much. When rs4938056 is excluded, the overfitting is suppressed, the accuracy is 72.73%, the specificity is 82.61%, the sensitivity is 61.90%, and the overall effect is better. In this case, the best C is 181.019, the best g is 0.004, and the kernel function is the Gaussian kernel function.
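The parameter grid described above can be sketched as powers of two between 2^-10 and 2^10. The exponent step of 0.5 is an assumption (the paper does not state it), although the reported optimum C = 181.019 equals 2^7.5 to three decimals, which is consistent with a half-step grid. The scoring function below is a placeholder rigged so that the example runs; in a real search it would return the mean five-fold cross-validation accuracy of a Gaussian-kernel SVM.

```python
import itertools


def exponent_grid(low=-10.0, high=10.0, step=0.5):
    """Powers of two from 2**low to 2**high inclusive, spaced by `step` in the exponent."""
    count = int(round((high - low) / step)) + 1
    return [2.0 ** (low + k * step) for k in range(count)]


def five_fold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into n_folds interleaved validation folds."""
    return [list(range(k, n_samples, n_folds)) for k in range(n_folds)]


def grid_search(score_fn, c_values, g_values):
    """Return (best_score, C, g) maximising score_fn(C, g) over the full grid."""
    return max((score_fn(c, g), c, g) for c, g in itertools.product(c_values, g_values))


c_grid = exponent_grid()
g_grid = exponent_grid()


def toy_score(c, g):
    # Placeholder scorer (hypothetical): peaks at C = 2**7.5 and g = 2**-8 by
    # construction. It stands in for cross-validated SVM accuracy and says
    # nothing about the study data.
    return -abs(c - 2 ** 7.5) - abs(g - 2 ** -8)


best_score, best_c, best_g = grid_search(toy_score, c_grid, g_grid)
print(round(best_c, 3), round(best_g, 3))
```

Each (C, g) pair would be evaluated on the folds produced by `five_fold_indices`, training on four folds and validating on the fifth, before the best pair is used to train the final model.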
The SNPs remaining after excluding rs4938056 are combined with age, education and marital status and input into the GS-SVM model. The model evaluation indicators (Table 1) show that feature fusion improves the discrimination accuracy, but the sensitivity of the SNPs combined with age is low. Although the accuracy of the SNPs combined with education and marital status is the highest at 88.64%, its sensitivity is only 80.95%, and the distribution of the various indicators is uneven (± 14.7). In contrast, the overall discrimination effect of the SNPs combined with education is the best, with an accuracy, specificity and sensitivity of 87.50%, 91.30% and 83.33%, respectively, and the distribution of the various indicators is relatively uniform (± 7.97).

CNN model
The GS-SVM results show that the sensitivity obtained with the SNPs and education level still needs to be improved. To obtain a better discrimination effect, we adjusted the GoogLeNet model [61], using convolutional and pooling layers to try to extract more effective features and thereby improve the patient discrimination accuracy.
The 19 SNPs remaining after excluding rs4938056 from the original data, together with the educational background feature, are used as the input of the GoogLeNet model. The adjusted model structure is shown in Fig 6. To prevent vanishing/exploding gradients and overfitting, batch normalization is added after each convolutional layer [62], and a dropout layer with a rate of 0.5 is added to the fully connected layer [63]. The number of epochs is 200, the batch size is 32, the learning rate is 0.01, the loss function is the cross-entropy loss for binary classification, and the optimizer is Adam, which performs well in practice [64]. The results showed that the accuracy, precision, specificity and sensitivity were 92.05%, 92.68%, 93.48% and 90.48%, respectively.
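For reference, the binary cross-entropy loss named above can be written as a short function. This is a plain-Python sketch of the standard definition, not the Keras implementation used in the study; the function name, the clipping constant and the toy labels and probabilities are assumptions for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]         # 1 = AD, 0 = control (toy labels)
probs = [0.9, 0.2, 0.6, 0.4]  # hypothetical predicted probabilities of AD
loss = binary_cross_entropy(labels, probs)
print(f"{loss:.4f}")
```

Confident correct predictions contribute little to the loss, while predictions near 0.5 or on the wrong side contribute more, which is what the optimizer minimizes during training.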

CNN-LSTM model
In recent years, the CNN-LSTM model has performed well in classification tasks [65][66][67]. This study adds an LSTM layer after the second Inception block in Fig 6 is fully connected, sets the number of hidden layer units to 512, adds L2 regularization to prevent overfitting, and sets the regularization parameter to 0.005. The results showed that the accuracy, precision, specificity and sensitivity were 87.50%, 86.05%, 86.96% and 88.10%, respectively.

PLOS ONE

Comparison of three models
From the comparison of various model indicators of the three models in Fig 7,

Discussion
Here, we propose a method that uses SNPs on HTR3A and HTR3B together with the age, education level and marital status of participants, combined with a convolutional neural network, to screen for patients with alcohol dependence, which is of great significance for the prevention and treatment of related diseases caused by alcohol dependence. However, the participants in this study were all male, and women were not analyzed. In addition, the participants' educational backgrounds were limited to primary school, junior high school, technical school, technical secondary school, senior high school, junior college and undergraduate; other educational background groups were not studied.

Conclusions and prospects
In this study, we fused SNPs of HTR3A and HTR3B with age, education level and marital status, and we used the GS-SVM, CNN and CNN-LSTM models to discriminate patients. rs10789970 is strongly correlated with rs4938056, and the combination of the SNPs excluding rs4938056 with education level can effectively distinguish AD, with an accuracy, specificity and sensitivity of 87.50%, 91.30% and 83.33%, respectively. Training with the adjusted GoogLeNet model improved the performance indicators: the accuracy, specificity and sensitivity are 92.05%, 93.48% and 90.48%, respectively, and the distribution is relatively uniform. The classification effect of the CNN-LSTM model is slightly better than that of the SVM, but worse than that of the CNN. This study is a preliminary exploratory study of alcohol-dependent patients, and it still has some limitations, such as the small sample size. In follow-up research, we will add phenotypic traits (blood type, weight, etc.) for feature fusion, and establish a CNN model for rapid AD discrimination based on SNPs, education and phenotypic trait information.