Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection

Objectives Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Methods Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports belong to nine different accident-related causes of death. A master feature vector was prepared by extracting features from the collected autopsy reports by using unigrams with lexical categorization. This master feature vector was used to detect the cause of death [according to the International Classification of Diseases, version 10 (ICD-10)] through five automated feature selection schemes, the proposed expert-driven approach, five subset sizes of features, and five machine learning classifiers. Model performance was evaluated using Precision_M, Recall_M, F-measure_M, accuracy, and area under the ROC curve. Four baselines were used to compare the results with the proposed system. Results Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest values, approximately 85% to 90% for most metrics, with a feature subset size of 30. The proposed system also showed approximately 14% to 16% improvement in the overall accuracy compared with the existing techniques and four baselines. Conclusion The proposed system is feasible and practical to use for automatic classification of ICD-10-related cause of death from autopsy reports. 
The proposed system assists pathologists to accurately and rapidly determine underlying cause of death based on autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports.


Introduction
In the ontology-based automatic text classification (ATC) technique, domain-related medical ontologies are developed and used to identify named entities. For instance, to classify pathology reports as cancer positive or negative with an ontology-based ATC technique, an ontology related to cancer identification is created first. This cancer-related ontology includes a set of concepts, such as things or events, which are specified to develop an agreed-upon vocabulary for information exchange [8,9,15]. Ontology-based ATC techniques can utilize suitable entities that represent the concept of interest. However, medical ontologies often lack entities for certain conditions or diseases. Moreover, extensive domain knowledge and human effort are required to develop ontologies. Achieving acceptable accuracy in plaintext clinical report classification with ontology-based ATC techniques remains a significant bottleneck [16][17][18]. In addition, ontologies usually evolve because of constant changes in domain terminologies; thus, they require manual effort to reflect the changing vocabulary [17]. These limitations show that the ontology-based ATC technique is ill-suited to the medical domain because of its high maintenance and manual effort.
Supervised ATC techniques can be used to overcome the challenges associated with ontology-based ATC techniques. In supervised ATC, the domain expert first assigns a category or a class to each plaintext clinical report to create a training set. For instance, in the case of an autopsy report, the pathologist assigns a cause of death to each report based on autopsy findings, such as "multiple injury" or "acute myocardial infarction," after performing an autopsy examination. Second, all autopsy reports in the training set are tokenized into unique words to form a master feature vector. Third, various text classifiers can be applied to the master feature vector to build a decision model by classifying named entities from the source data being analyzed. Finally, the constructed decision model can be used on a new autopsy report to assist pathologists in automatically determining the cause of death. The constructed decision model may determine the cause of death accurately or inaccurately for numerous reasons [19]. Of these, the most important is the selection of the most discriminative features in the decision model, that is, the features that correlate well with a specific cause of death. Picking the features with the highest discriminative power among the various classes or causes of death is a complicated task and involves substantial effort in constructing a decision model [19]. Hence, this paper aims to develop an automatic multi-class classification system for predicting accident-related causes of death from free text autopsy reports by using expert-driven feature selection.
In this study, an expert-driven feature selection approach, along with supervised ATC techniques, is used to achieve high-performance prediction of nine different accident-related causes of death from free-text medical autopsy reports. Five different ATC classifiers were compared with the automated and expert-driven feature selection approaches to evaluate classifier performance by using macro precision (Precision_M), macro recall (Recall_M), macro F-measure (F-measure_M), and accuracy. This paper also investigates the effect of feature reduction on the overall performance of the decision model. In addition, the proposed expert-driven feature selection approach is compared with four baselines to show its significance. To the best of our knowledge, this paper is the first to use complete medical autopsy reports in determining accident-related causes of death.
This paper is organized in the following manner. In Section 2, the related work is presented. In Section 3, the methodology of this work is described. This section also includes the experimental setup and the proposed feature selection approach. Section 4 presents the findings of the experiments. In Section 5, the findings are discussed. Section 6 shows the significance of the dataset and the proposed feature selection approach. In addition, the proposed approach is compared with four baselines. Section 7 presents some limitations and future work. Finally, Section 8 concludes this work.

Related work
Supervised ATC techniques have been widely employed in the biomedical domain [20][21][22][23][24][25][26][27][28]. For instance, the authors in [21] used a support vector machine (SVM) text classifier to identify cancer-related causes of death from death certificates. They extracted term-based and concept-based features and used information gain to discover the discriminative ones. Moreover, the authors employed SVM and a rule-based approach to classify death certificates they obtained for diabetes, influenza, pneumonia, and HIV using term-based and concept-based features. Researchers in [29] developed a freely available graphical tool for biomedical text classification using various machine learning and text classification techniques. In [30], the authors constructed and evaluated classifiers by employing SVM and Naive Bayes (NB) supervised ATC techniques to classify pathology reports.
In [31], researchers investigated the applicability and suitability of automatic text classification in epidemiological studies. The authors performed a comparative evaluation of a concept recognition approach and a variety of supervised ATC techniques, and they achieved 96.7% accuracy. The authors in [32] explored the feasibility of using supervised ATC techniques in classifying clinical reports using NB and SVM text classifiers, and they achieved 95% accuracy. The studies mentioned above provide concrete evidence that supervised ATC techniques are appropriate for biomedical text documents, such as pathology reports, epidemiology reports, and cancer-related reports. However, very few studies employing supervised ATC techniques in predicting the cause of death from autopsy reports are available. This lack of related studies motivated us to contribute to this domain.
Perhaps the work most related to our proposed work is that of [14]. In that study, the authors used a case-based reasoning (CBR) approach coupled with the NB classifier and a feature weight learning technique to support decision-making in forensic autopsy reports to determine the cause of death. Experimental results revealed that the CBR method, along with an NB classifier, is a feasible approach for predicting the cause of death from forensic autopsy reports. Another important related work is that of [33], in which the authors used existing supervised ATC techniques to determine the cause of death using verbal autopsy reports. Here, the authors used different combinations of linguistic and statistical features, such as unigrams, bigrams, and parts of speech tagging, to extract useful features from verbal autopsy reports. They used an SVM text classifier with various feature representation schemes to assist coroners and pathologists in determining the cause of death from the collected verbal autopsy reports, and they achieved 82.8% F-measure in predicting the time of death and 58.7% F-measure in predicting the cause of death. However, the study only considered verbal autopsy reports in determining the cause of death.
However, both of these studies had two major limitations. First, the dataset used for cause-of-death prediction lacks features, mainly because it contains only a brief summary or history-related information on the entire autopsy report. The authors did not consider other aspects, such as internal examination reports, external examination of the case, injury-related information, and other possible information. History-related features alone are insufficient and not discriminative enough to analyze the dynamics of an autopsy report. Therefore, considering further features from autopsy findings, such as injury-related findings, external examination findings, and internal examination findings, can drastically improve the performance of prediction. The second limitation is the quality of features, which depends on the source data being analyzed. Pathologists may use different terms, synonyms, and vocabulary while preparing autopsy reports. Hence, the consideration of various similar words using expert knowledge can further increase the accuracy of prediction. Therefore, an accurate, robust, and efficient system must be developed to predict the cause of death by considering history-related features, internal examination-related features, and external examination-related features from free text autopsy reports. Moreover, an efficient and accurate technique for feature selection is needed to enhance the performance of a classifier.

Data collection
The experiments involved 2200 autopsy reports on nine different leading causes of death related to accidents in Kuala Lumpur, Malaysia. These autopsy reports were collected from Rasmi Pusat Perubatan Universiti Malaya (PPUM) Hospital, one of the largest hospitals in Kuala Lumpur, Malaysia. The ethical letter provided by PPUM is attached in the supplementary files. The detailed distribution of all nine classes is shown in Table 1. The causes of death in these reports were labeled manually and unanimously by a team of pathologists. Each report consisted of the detailed examination of the dead body, including the deceased's personal information, external examination, injury-related information, internal examination, history-related information, and information on histopathology reports. The details of these attributes are discussed below. In addition, a sample autopsy report is shown in S2 Appendix (please refer to the supporting information).
• Personal information: This section includes the name of the deceased, unique identity number, gender, date of birth, date of death, age upon death, and nationality.
• External examination: This part includes the information about the deceased's external body parts, such as height, weight, eyes, ear, hands, feet, legs, nose, mouth, lips, teeth, and reproductive organs. Furthermore, information about rigor mortis, hypostasis, and decomposition signs is also recorded here. In addition, any specific symbols or patterns on the body are noted.
• Injury-related information: This portion includes injury-related information, such as the size, location, and pattern of abrasions, lacerations, and wounds on the body.
• Internal examination: This segment includes the anatomical examination of the brain, neck, thorax, cardiovascular system, respiratory system, gastro-intestinal tract, liver, spleen, pancreas, endocrine system, kidneys, and urinary bladder.
• Histopathology reports: This section includes the result of histopathology reports.
• History: This section records the previous history of the deceased and the history of the day of death.
• Cause of death: This portion is the output variable of the autopsy report. Here, the experts process the autopsy findings, correlate the findings with previous cases, use their experience, and finally, decide the primary cause of death according to ICD-10 classification.
In the experiments, personal information was not used because these features do not contribute to the prediction of the cause of death. All other features, such as external examination, internal examination, history, and injury-related features, were concatenated into one string for the sake of simplicity. Next, all the reports were tokenized into words.

Master feature vector creation
A program was coded in Python version 3.4.3 using NLTK [34] to parse and pre-process each plaintext autopsy report to determine the distinctive tokens present in all the autopsy cases. In pre-processing, four basic steps were performed. First, a spell checker was used to correct all the misspelled words using PyEnchant and the NLTK library [34]. Second, the whole report was converted into lower case and broken down into sentences, and each sentence was tokenized into words or tokens to form unigrams. The less common tokens, which appeared only once or twice in the reports, were discarded because of their low occurrence. In addition, stop words were removed using the stop word list [35]. Third, each identified unique token was also represented by its lexical category using a parts of speech tagger (POS tagger) to identify the semantics of the token. For instance, after applying the POS tagger, the token or word "knee" was converted to "knee/Noun." Finally, the tokens with their lexical categories were stored and represented in the master feature vector. This master feature vector was used for classification modeling by employing various supervised ATC techniques.
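The pre-processing steps above can be illustrated with a minimal sketch. It uses only the Python standard library rather than NLTK and PyEnchant, and it omits spell checking and POS tagging, which need external resources. The stop word list, the function names, and the three sample reports are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; the paper uses the list in [35]).
STOP_WORDS = {"the", "of", "and", "a", "was", "on", "in", "is", "were"}


def tokenize(report):
    """Lower-case a report, split it into sentences, then into unigram tokens."""
    sentences = re.split(r"[.!?]+", report.lower())
    return [tok for s in sentences for tok in re.findall(r"[a-z]+", s)]


def build_vocabulary(reports, min_count=3):
    """Return the sorted tokens kept after rare-token and stop word filtering.

    Tokens that occur only once or twice in the whole corpus are dropped,
    which corresponds to min_count=3.
    """
    counts = Counter(tok for r in reports for tok in tokenize(r))
    return sorted(t for t, c in counts.items()
                  if c >= min_count and t not in STOP_WORDS)


# Hypothetical report fragments, for illustration only.
reports = [
    "Laceration on the left knee. Fracture of the skull.",
    "Abrasion on the right knee. The skull was intact.",
    "Fracture of the skull and bruising of the knee.",
]
vocabulary = build_vocabulary(reports)
print(vocabulary)
```

In this toy corpus only "knee" and "skull" occur at least three times and are not stop words, so they are the only tokens that survive the filtering.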

Feature engineering
To extract the highly discriminative features from the master feature vector, an expert-driven feature selection approach was used. Moreover, five automated feature selection techniques were also used to compare the performance of the proposed expert-driven feature selection approach.

Expert-driven feature selection.
The algorithm for the proposed expert-driven feature selection approach is given in Fig 1. Suppose we want to classify n different causes of death, each having a unique ICD-10 cause-of-death code σ. Each cause of death comprises m autopsy reports, which are available in a raw file rf. For each cause of death, one expert feature set, E, exists. E contains the most discriminative features, in ranked order, across all n. This E was prepared independently by two experienced domain experts. In E, the possible synonyms and alternative words for the selected features were also added. Both domain experts created a prioritized list of features that would predict the accurate cause of death from medical autopsy reports. Afterward, the domain experts compared their feature rankings and resolved their conflicts. A third pathologist was consulted to resolve the conflicts in case of disagreement. In this manner, n expert feature sets E were created.
Given the n raw files rf, the m autopsy reports in each rf, and the n expert feature sets E, we then created n master feature vectors M. To create M, we first loaded one rf into memory and performed five pre-processing tasks on each m in the rf to extract useful features from it. First, the function I_s was applied to each m in the rf to correct the misspelled words. Second, the function I_l was applied to convert all words into lower case. Third, the function I_t was applied to tokenize the autopsy reports into unique tokens. Fourth, the function I_w was applied to remove the most common words, which do not contribute to the classification task. Fifth, the function I_p was applied to assign a lexical category (parts of speech tag) to each token. Finally, the processed m in the rf was stored in M. As such, all n raw files rf were converted into n master feature vectors M that contained the processed autopsy reports m.
After M was created and E was prepared, both were loaded into memory. Eq 1 was then applied to M and E to further process M and form an ARFF (Attribute-Relation File Format) file for classification. Eq 1 matches the tokens of each report m in M against each E and maintains a frequency count of the features of E that match the tokens of m. Afterward, the unique ICD-10 cause-of-death code (σ) was assigned to each m in M, and m was added to the ARFF file to create the training set.
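The matching step described above can be sketched as follows. Eq 1 itself is not reproduced in this text, so the code only illustrates the described behavior: counting how often each expert feature, or one of its synonyms, occurs in a report and emitting a comma-separated data row labeled with the ICD-10 code. The feature names, synonyms, tokens, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical expert feature set E for one cause of death: each ranked
# feature maps to the synonyms supplied by the experts (illustrative values).
EXPERT_FEATURES = {
    "skull": {"skull", "cranium"},
    "fracture": {"fracture", "fractured", "break"},
    "haemorrhage": {"haemorrhage", "hemorrhage", "bleeding"},
}


def expert_feature_counts(tokens, expert_features):
    """Count how often each expert feature (or one of its synonyms) occurs."""
    token_counts = Counter(tokens)
    return {feature: sum(token_counts[s] for s in synonyms)
            for feature, synonyms in expert_features.items()}


def to_arff_row(counts, feature_order, icd10_code):
    """Format one report as a comma-separated data row ending in its class."""
    return ",".join(str(counts[f]) for f in feature_order) + "," + icd10_code


# Tokens of one pre-processed report (hypothetical).
tokens = ["fractured", "cranium", "with", "bleeding", "and", "skull", "fracture"]
counts = expert_feature_counts(tokens, EXPERT_FEATURES)
row = to_arff_row(counts, ["skull", "fracture", "haemorrhage"], "S06")
print(row)
```

Grouping synonyms under one canonical feature keeps the feature vector short while still capturing the vocabulary differences between pathologists that the expert lists are meant to absorb.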

Automated feature selection.
To compare the performance of the proposed expert-driven feature selection approach, five automated feature selection schemes were also used to rank the most discriminative features from the master feature vector. These five automated feature selection techniques are: Chi-square [36], information gain [36], Pearson Correlation [36], Fisher Markov Selector [37], and improved global feature selection [38]. In the automated feature selection techniques, the expert guidance was eliminated. Instead, all tokens were ranked across all autopsy reports using these five automated feature selection schemes. In all five schemes, subsets of highly discriminative tokens with maximum score were used as features for classifying the cause of death.
(1) Information Gain: Information gain, which identifies the importance of a given attribute in a feature vector, is the measure of reduction in uncertainty once the value of an attribute is known. If the uncertainty is largely reduced, knowing the value of an attribute provides a lot of information, and thus, we have a large information gain [8,15]. We assumed that the autopsy data set, D, has two classes, i.e., head injury (H) and multiple injury (M), with a total number of reports, d, containing h and m reports belonging to classes H and M, respectively. The amount of information is defined as shown in Eq 2.

$$I = -\frac{h}{d}\log\frac{h}{d} - \frac{m}{d}\log\frac{m}{d} \tag{2}$$
With base-2 logarithms, I is equal to 1 if h = m, and I is equal to 0 if h = d (that is, if all reports belong to class H). The information gain for autopsy dataset D using attribute A is then defined as shown in Eq 3.

$$\mathrm{Gain}(D, A) = I - \sum_{i \in \mathrm{values}(A)} \frac{t_i}{d} I_i \tag{3}$$
Here, I is the amount of information before the split and $\sum_{i \in \mathrm{values}(A)} \frac{t_i}{d} I_i$ is the sum of information after the split, where I_i is the information of node i, t_i is the number of objects in node i, and d is the total number of reports. Once the information gain was computed for every remaining attribute, the attribute with the highest information gain was selected.
(2) Chi-Square: The Chi-square test is a statistical test that can be used in determining whether the observed frequencies of the tokens are significantly different from their expected frequencies [8,15]. For instance, consider the two classes, head injury (H) and multiple injury (M), in our autopsy dataset D having a total number of reports of 100. Out of the 100 reports, 50 belonged to class H, and the remaining 50 belonged to class M. Suppose both the classes contained the word "skull." Therefore, the expected value for skull would be 50% for class H and 50% for class M, and the observed value would be the number of times the word "skull" appeared in class H and M. Therefore, Chi-square was defined as shown in Eq 4. Here, O i refers to the observed or collected data, and E i refers to the expected values.
(3) Pearson Correlation: Pearson correlation measures the linear correlation between two variables. Eq 5 shows the mathematical representation of Pearson Correlation. The value of r lies between -1 and +1, where +1 represents perfect positive correlation, -1 represents perfect negative correlation, and 0 represents no linear correlation [36,39].
(4) Fisher Markov Selector (FMS): Fisher Markov Selector (FMS) is an automated feature selection scheme that was proposed in [37]. FMS globally selects the optimal subset of features among the classes. This method is useful for handling high-dimensional data efficiently. In our experimental setup, we use FMS with linear polynomial kernel with d = 1, where d denotes the degree parameter [37].
(5) Improved Global Feature Selection (IGFS): The improved global feature selection (IGFS) scheme is an ensemble method in which the power of a global feature selection method and a one-sided local feature selection method are combined in a different manner [38]. In our experimental setup, we combined Odds Ratio as the one-sided local feature selection scheme and information gain as the global feature selection scheme [38].
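The three statistical scores in the list above (information gain, Chi-square, and Pearson correlation) can be sketched in a few lines. The code assumes the standard textbook forms of these measures: base-2 entropy, the sum of (O_i - E_i)^2 / E_i, and the product-moment correlation coefficient. Eqs 4 and 5 are not reproduced in this text, so their exact form in the paper is an assumption, and all counts below are hypothetical. FMS and IGFS are not covered.

```python
import math


def entropy(class_counts):
    """Base-2 entropy of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - after


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# 100 reports, 50 in class H and 50 in class M, split on whether a
# hypothetical token is present (40 H / 10 M) or absent (10 H / 40 M).
gain = information_gain([50, 50], [[40, 10], [10, 40]])

# "skull" occurs 80 times: 60 in class H and 20 in class M, whereas
# balanced classes would lead us to expect 40 in each.
chi2 = chi_square([60, 20], [40, 40])

# Token frequency in six reports versus a binary class indicator.
r = pearson_r([3, 2, 4, 0, 1, 0], [1, 1, 1, 0, 0, 0])

print(round(gain, 3), chi2, round(r, 3))
```

Each function returns one score per token, so ranking the tokens by score and keeping the top 10, 20, 30, or 40 gives feature subsets of the sizes discussed in the next section.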

Feature subset size
We hypothesized that various subsets of features would produce different performance results in terms of Precision_M, Recall_M, F-measure_M, and overall accuracy. To evaluate this proposition, we selected feature subset sizes of 10, 20, 30, 40, and "all" after performing the sensitivity analysis (discussed in Section 6). In addition, these subsets were extracted because of their implementation feasibility, thereby allowing the evaluation of classifier performance within a suitable operating range.

Text classification techniques
The features extracted from medical autopsy reports were used to build a decision model for accurately predicting the cause of death. Several machine learning classifiers were tested to select the best classifier. We tested SVM, NB, k-nearest neighbor (KNN), decision tree (DT), and random forest (RF) using the Weka toolkit [8,40]. Five different text classifiers were employed because each classifier has a different philosophy behind the learning process. Moreover, these five classifiers have been successfully employed in the text classification literature. These classification techniques are discussed in the following paragraphs.

Naive Bayes (NB).
NB is one of the most popular inductive learning classifiers in supervised machine learning and is considered an efficient and effective decision model. The classifier has been widely employed in the classification of free text clinical reports [13,14,41,42]. NB is derived from Bayes' theorem with strong independence assumptions among features [43]. NB is simple to use and fast, and it often achieves accuracy that is competitive with other classifiers. Given a class variable C and dependent features f_1 through f_n, Bayes' theorem states their relationship as shown in Eq 6.
Using the naive independence assumption, we obtained Eq 7.
For all i, this relationship is simplified as shown in Eq 8.
This technique is further discussed in [43].
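Eqs 6 to 8 are not reproduced in this text. As a hedged reconstruction, the standard Naive Bayes derivation that matches the description above is given below; the numbering and notation are assumptions based on the surrounding sentences rather than the authors' typeset equations.

```latex
% Assumed form of Eq 6: Bayes' theorem for class C and features f_1, ..., f_n
P(C \mid f_1, \dots, f_n) = \frac{P(C)\, P(f_1, \dots, f_n \mid C)}{P(f_1, \dots, f_n)}

% Assumed form of Eq 7: naive independence assumption, for every i
P(f_i \mid C, f_1, \dots, f_{i-1}, f_{i+1}, \dots, f_n) = P(f_i \mid C)

% Assumed form of Eq 8: simplified relationship
P(C \mid f_1, \dots, f_n) = \frac{P(C) \prod_{i=1}^{n} P(f_i \mid C)}{P(f_1, \dots, f_n)}
```

Because the denominator does not depend on C, the predicted class is the one that maximizes the numerator $P(C) \prod_{i=1}^{n} P(f_i \mid C)$.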

Support Vector Machines (SVM).
SVM is a popular supervised machine learning classifier based on statistical learning theory [44]. SVM has proved to be an accurate classifier in many application areas, such as image classification [45] and classification of biomedical documents [17,21,46]. SVMs find hyperplanes that separate the training examples by a maximal margin [14]. Suppose we have two types of accident-related cause of death reports, i.e., liver rupture (L) and abdominal injury (A), in our autopsy reports. The training data consist of autopsy reports (x_1, x_2, ..., x_n), which are expressed in the master feature vector in a certain space X ⊆ R^d. These instances are labeled (c_1, ..., c_n), where c_i ∈ {L, A}. Class L is on one side of the hyperplane, and class A is on the other side.

K-Nearest Neighbor (KNN).
KNN employs instance-based learning. It is one of the simplest classification algorithms and is also termed a lazy learning classifier because it stores all the training instances and classifies a new instance using a similarity measure, such as the Euclidean distance shown in Eq 9 [47,48].
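A minimal sketch of this procedure is given below, assuming Euclidean distance and a simple majority vote among the k nearest stored instances. The feature vectors are hypothetical frequency counts; the labels reuse two of the ICD-10 codes from the dataset purely as class names.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Return the majority label among the k stored instances nearest to query.

    train is a list of (feature_vector, label) pairs; nothing is learned in
    advance, which is why the method is called lazy.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training instances: three feature counts per report.
train = [
    ([5, 0, 1], "S06"),
    ([4, 1, 0], "S06"),
    ([0, 4, 3], "T07"),
    ([1, 5, 2], "T07"),
    ([0, 3, 4], "T07"),
]
prediction = knn_predict(train, [4, 0, 0], k=3)
print(prediction)
```

The two closest instances to the query belong to "S06", so the vote returns "S06" even though the third neighbor belongs to "T07".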

Decision Tree (DT).
J48 is a popular DT classifier; it is an implementation of the C4.5 decision model and is used to create pruned or unpruned decision trees [49]. DT is the most commonly used algorithm for the task of classification and prediction [50]. The DT represents rules that can be easily understood by humans and constructs the classifier in hierarchical form. The J48 classifier uses entropy to compute the homogeneity of an autopsy report. J48 is discussed in detail in [49,51].

Random Forest (RF).
RF is an ensemble supervised machine learning classifier that constructs a multitude of decision trees from training data using randomly selected features [52]. A new instance is classified by all decision trees in the forest, and the forest then chooses the classification decision by majority vote or by averaging the predictions using Eq 10.
RF typically performs better than a single DT. The classifier also reduces the problem of overfitting. The major issue with the RF classifier is its complexity, which reduces interpretability and slows it down. RF was chosen because it was the best performing classification model in a previous text classification study [53] and in biomedical studies [54][55][56][57]. Moreover, Fernández-Delgado et al. [58] compared 179 classifiers on 121 different datasets and found that RF was the best of the classifiers used in the study.
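The two aggregation rules mentioned for RF, majority voting and averaging of predictions, can be sketched as follows. Eq 10 is not reproduced in this text, so the averaging function assumes a plain mean of the per-tree class probabilities; the tree outputs and function names are hypothetical, and the training of the trees themselves is not shown.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_probabilities(tree_probabilities):
    """Average the per-class probabilities returned by the individual trees."""
    n_trees = len(tree_probabilities)
    classes = tree_probabilities[0].keys()
    return {c: sum(p[c] for p in tree_probabilities) / n_trees
            for c in classes}


# Hypothetical outputs of three trees for one autopsy report.
votes = ["S06", "T07", "S06"]
probabilities = [
    {"S06": 0.8, "T07": 0.2},
    {"S06": 0.4, "T07": 0.6},
    {"S06": 0.9, "T07": 0.1},
]
voted_class = majority_vote(votes)
averaged = average_probabilities(probabilities)
averaged_class = max(averaged, key=averaged.get)
print(voted_class, averaged_class)
```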

Experiments
The complete flow of this research study is shown in Fig 2. By combining six feature selection schemes (five automated feature selection schemes and the proposed expert-driven feature selection scheme) with five feature subset sizes (10, 20, 30, 40, and "all" features), thirty (6 × 5) feature sets were extracted for the construction and evaluation of text classifiers. Five different text classification techniques (NB, SVM, KNN, DT, and RF) were applied to each of these 30 feature sets, giving a total of 150 (5 × 30) analyses. All experiments were performed using 10-fold cross validation [59,60].
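The experimental grid can be enumerated with a short sketch, which also shows one simple way of forming ten folds. The scheme and classifier names are only labels, the function name is hypothetical, and the fold assignment by index is an assumption: it does not stratify by class, which an evaluation on imbalanced classes would normally do.

```python
from itertools import product

feature_selectors = ["expert-driven", "chi-square", "information gain",
                     "Pearson correlation", "FMS", "IGFS"]
subset_sizes = [10, 20, 30, 40, "all"]
classifiers = ["NB", "SVM", "KNN", "DT", "RF"]

# 6 schemes x 5 subset sizes = 30 feature sets; x 5 classifiers = 150 analyses.
feature_sets = list(product(feature_selectors, subset_sizes))
analyses = list(product(feature_selectors, subset_sizes, classifiers))


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


folds = list(k_fold_indices(2200, k=10))
print(len(feature_sets), len(analyses), len(folds))
```

With 2200 reports, each of the ten folds holds out 220 reports for testing and trains on the remaining 1980.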

Evaluation metrics
In all 150 analyses, the results obtained by each classifier with each feature selection scheme and feature subset were compared. For performance evaluation, Precision_M, Recall_M, F-measure_M, and overall accuracy were used. These performance measures were used because the class distribution is imbalanced and macro-averaged metrics give equal weight to each cause of death category [61]. In addition, the receiver operating characteristics (ROC) curve and the area under the ROC curve (AUC) were also used to compare the performance of each cause of death category because of class imbalance [8]. These evaluation metrics are discussed briefly in the following paragraphs.

Macro precision (Precision_M).
Precision_M is the average of the per-class precision values, where the precision of a class is the probability that a case labeled by the system as belonging to that class truly belongs to it. The mathematical definition of Precision_M is shown in Eq 11.

Macro recall (Recall_M).
Recall_M is the average of the per-class recall values, where the recall of a class is the probability that the system finds a positive case among all the cases of that class. Recall is also known as sensitivity. The mathematical definition of Recall_M is given in Eq 12.

Overall accuracy.
Overall accuracy is the fraction of classification results predicted correctly among all the classes. Mathematical definition of overall accuracy is shown in Eq 14.
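The macro-averaged metrics and overall accuracy described above can be computed as in the sketch below. Eqs 11 to 14 are not reproduced in this text; the code assumes the usual per-class definitions based on true positives, false positives, and false negatives, and it computes F-measure_M from Precision_M and Recall_M, which is one common convention (averaging the per-class F-measures is another). The labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return (Precision_M, Recall_M, F-measure_M, accuracy) for label lists."""
    classes = sorted(set(y_true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    # Every class contributes equally, regardless of how many reports it has.
    precision_m = sum(precisions) / len(classes)
    recall_m = sum(recalls) / len(classes)
    f_measure_m = (2 * precision_m * recall_m / (precision_m + recall_m)
                   if precision_m + recall_m else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision_m, recall_m, f_measure_m, accuracy


# Hypothetical true and predicted causes of death for six reports.
y_true = ["S06", "S06", "S06", "T07", "T07", "T71"]
y_pred = ["S06", "S06", "T07", "T07", "T07", "T71"]
precision_m, recall_m, f_measure_m, accuracy = macro_metrics(y_true, y_pred)
print(round(precision_m, 4), round(recall_m, 4),
      round(f_measure_m, 4), round(accuracy, 4))
```

In this example the single "T71" report counts as much toward Precision_M and Recall_M as the three "S06" reports, which is the equal weighting that motivates macro averaging under class imbalance.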

Area Under ROC Curve (AUC).
The area under the ROC curve, or simply AUC, has been recently introduced to evaluate machine learning algorithms [62,63]. This measure is very useful in analyzing the performance of the decision model with respect to a specific class. The AUC provides a good summary of the performance shown by the ROC curves. The ROC is a method to assess the quality of a machine learning classifier by plotting a specific curve and calculating the area under this curve. Intuitively, a well-performing algorithm has an AUC close to 1, whereas a value of 0.5 or less indicates poor performance [63][64][65]. Hand and Till [64] presented a simple method to calculate the AUC of a decision model using Eq 15. Here, n_0 and n_1 represent the number of positive and negative examples, respectively, and $S_0 = \sum r_i$, where r_i is the rank of the i-th positive example in the ranked list. The details of AUC are discussed in [63][64][65].
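The rank-based calculation can be sketched as follows. Eq 15 is not reproduced in this text; the code assumes the rank-sum form commonly attributed to Hand and Till, AUC = (S_0 - n_0(n_0 + 1)/2) / (n_0 n_1), with examples ranked in increasing order of their score for the positive class and tied scores sharing their mean rank. The scores and labels are hypothetical.

```python
def auc_from_ranks(scores, labels):
    """AUC via the rank sum: (S0 - n0 * (n0 + 1) / 2) / (n0 * n1).

    scores are classifier scores for the positive class; labels are 1 for
    positive examples and 0 for negative examples.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of examples tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based; ties share the mean
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    n0 = sum(labels)
    n1 = len(labels) - n0
    s0 = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s0 - n0 * (n0 + 1) / 2) / (n0 * n1)


# Hypothetical scores for six reports, three of which are positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_from_ranks(scores, labels)
print(round(auc, 4))
```

Here eight of the nine positive-negative pairs are ordered correctly, so the function returns 8/9. For a multi-class problem such as the nine causes of death, this calculation would be repeated per class by treating that class as positive and all others as negative.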

Results
From all 2200 autopsy reports, after applying the pre-processing steps, a total of 19164 unique tokens were identified. From the 19164 tokens, 9543 tokens were removed because they appeared once or twice in the whole autopsy reports set. The remaining tokens were stored in the master feature vector and evaluated using the aforementioned six feature selection approaches with subsets of 10, 20, 30, 40, and "all" tokens using five classifiers (please refer S1 Appendix in supporting information). Results of each of the 150 analyses in terms of accuracy, Precision M , Recall M , and F-measure M are as follows. Moreover, the AUC graphs of the best performing technique which produced the highest accuracy and highest F-Measure M are also shown in the results. Fig 3 shows the overall accuracy of all 150 analyses. In the figure, the expert-driven feature selection scheme significantly outperformed all five automated feature selection schemes, followed by IGFS scheme and FMS scheme. In addition, information gain and Chi-square produced the almost same results. The lowest results were shown by Pearson Correlation scheme. A fluctuating trend was found in the feature subset size. However, the lowest accuracy was observed in the "all" and 10 feature subset sizes. The reasonable accuracy was found in the feature sub set sizes of 20, 30, and 40, respectively. J48 classifier outperformed in all automated feature selection schemes excluding FMS by producing the highest accuracy of 75.94% (with a feature subset size of 30), followed by SVM (73.15%, with feature subset size 30) and RF (73.05%, with feature subset size 40). Moreover, the lowest performance was observed in NB classifier which produced 67.31% accuracy in the "all" feature subset size, followed by KNN (70.36%, with feature subset sizes of 20 and 10). In FMS feature selection scheme, SVM outperformed all other classifiers. 
Conversely, in the expert-driven feature selection scheme, the RF and J48 classifiers performed best, producing the highest accuracies of 90.09% and 89.50%, respectively, using a feature subset size of 30. KNN also showed an accuracy lower than that of J48. The lowest accuracies of 80.31% and 82.31%, both with the "all" feature subset size, were found for NB and SVM, respectively.
Fig 4 shows the precision M of all 150 analyses. As shown here, among the automated feature selection schemes, IGFS produced the highest precision M , followed by FMS. The information gain and Chi-square schemes yielded roughly the same results, and the lowest precision M was observed for Pearson Correlation. Furthermore, the J48, SVM, and RF classifiers produced the highest precision M of 85.93%, 83.83%, and 80.65%, respectively, with a feature subset size of 30. In addition, NB and RF produced the lowest precision M of 68.71% and 78.20%, respectively, with the "all" feature subset. However, in expert-driven feature selection scheme, RF,
The figure shows that the expert-driven feature selection scheme outperformed the automated feature selection schemes. Among the automated feature selection schemes, the highest recall M was observed for the IGFS scheme. Moreover, only minor differences were observed among the results produced by the FMS, information gain, and Chi-square schemes. The lowest recall M was observed for Pearson Correlation. The majority of the developed models yielded the lowest recall M with feature subset sizes of "all" and 10 and the highest recall M with feature subset sizes of 20 and 30. In the automated feature selection schemes, the J48 decision model performed best, yielding a recall M of 75.93% with a feature subset size of 30. The recall M produced by RF with a feature subset size of 30 was slightly lower than that of J48. In addition, the NB and KNN decision models showed the lowest recall M of 69.10% and 71.81% with feature subset sizes of "all" and 40, respectively.
However, in the expert-driven feature selection scheme, RF, J48, and KNN produced the highest recall M of 90.10%, 89.50%, and 89.10%, respectively, with a feature subset size of 30. Moreover, SVM and NB showed the lowest recall M of 82.30% and 80.30%, respectively, with the "all" feature subset size.
Fig 6 shows the F-measure M of all 150 analyses. Here, the highest F-measure M was produced by the expert-driven feature selection approach. Among the automated feature selection schemes, the IGFS scheme outperformed the other four, followed by the FMS scheme. Moreover, in many decision models, both information gain and Chi-
In Fig 7, the AUC of all nine classes (T07, S06, S38, S17, S28, S36, T71, T75, and G40) using the expert-driven feature selection approach is shown. The results revealed that in class "T71," 100% AUC was achieved by all five classifiers with all five feature subset sizes. Moreover, in class "S36," all five classifiers produced 100% AUC with all subset sizes except the feature subset of 40, where an irregular trend was observed in the AUC of all five classifiers. In class "S17," RF, J48, and KNN yielded an almost 100% AUC with all five feature subset sizes. However, in "S17," the AUC observed in NB and SVM is higher than the AUC of RF, J48, and KNN. In class "S38," an AUC of 98% was observed for all five classifiers with all five feature subset sizes. A similar irregular trend in AUC was found in classes "T75," "G40," and "S28," where all classifiers yielded an AUC between 95% and 98% with all five feature subset sizes. The lowest AUC (87%-93%) was produced by NB and SVM in classes "S06" and "T07" with all five feature subset sizes. Nevertheless, RF, J48, and KNN achieved approximately 95% AUC in "S06" and "T07" with all five feature subset sizes.
From Fig 7, the RF decision model parameterized with the expert-driven feature selection approach using a feature subset size of 30 correctly classified approximately 95% of the "S06" and "T07" causes of death and misclassified 5% into other classes. Moreover, the same model correctly classified approximately 98% of the "T75," "S38," "S28," and "G40" causes of death and misclassified 2% into other classes. Furthermore, it correctly classified approximately 100% of the "S17," "S36," and "T71" causes of death. Therefore, RF parameterized with the expert-driven feature selection approach using a feature subset size of 30 can be considered a feasible solution for predicting ICD-10 causes of death from free text autopsy reports.
In text classification tasks, one of the crucial performance measures is the computational time taken by a classifier to build the classification model. Fig 8 shows the average computational time for all five classifiers across all five feature subset sizes using the six feature selection schemes. All 150 analyses were run on a Core i7 system with a 2.80 GHz clock speed and 16 GB of memory. As shown here, the proposed expert-driven approach is much faster than the automated feature selection schemes. Moreover, among the automated feature selection schemes, information gain proved to be faster than the other four. In all six

Discussion
The experimental results of this study show that supervised ATC techniques can identify the cause of death from free text medical autopsy reports with performance measures between 70% and 90%. Furthermore, performance differed considerably across most of the analyses. From the experimental findings, different combinations were determined to optimize each performance measure.
To optimize the overall accuracy, precision M , and F-measure M , the RF decision model built with the expert-driven approach using a subset of 30 features is recommended. The experimental results also indicate that, among the automated feature selection approaches, the IGFS scheme showed the highest performance, followed by FMS. Moreover, in most of the experiments, information gain produced results almost identical to those returned by Chi-square. Pearson Correlation showed the lowest performance in all of the experiments. However, the expert-driven feature selection approach produced much better results than all five automated feature selection approaches. Two primary reasons might account for the better performance of the expert-driven feature selection approach. First, all 2200 cases belonged to the group with the "accident" manner of death. All nine classes under consideration were very similar in nature, and numerous tokens were common across all the autopsy reports. For instance, the tokens "abrasion" and "laceration" are highly related to all types of accident under consideration. Second, different pathologists might have used different synonyms and vocabulary while preparing the autopsy reports. For instance, many pathologists used the tokens "abrasion," "graze," and "trauma" interchangeably. Therefore, during the creation of the expert-driven features, we asked the experts to select the features that were the most discriminative for a particular cause of death. In addition, the experts were asked to provide a possible set of synonyms for the selected features. Hence, the resultant expert-based feature space comprised a rich set of discriminative features for each cause of death under consideration. Therefore, researchers should not rely only on the results produced by automated feature selection but should also explore additional features with the help of domain experts.
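As a rough illustration of how an expert-provided term list with synonyms could be turned into frequency-count features, consider the minimal sketch below. It assumes one attribute per cause of death, computed as the total number of occurrences of that class's expert terms in a report. The lexicon `EXPERT_TERMS`, its terms, and the function name `expert_features` are hypothetical placeholders and are not the expert lists used in this study.

```python
import re
from collections import Counter

# Hypothetical expert lexicon (illustrative only): ICD-10 code -> terms and synonyms.
EXPERT_TERMS = {
    "T71": ["asphyxia", "suffocation"],
    "T75": ["drowning", "electrocution"],
    "S06": ["intracranial", "subdural"],
}


def expert_features(report, lexicon=EXPERT_TERMS):
    """Return one frequency-count attribute per cause of death."""
    # Lowercase unigram tokenization; a real pipeline would apply its own pre-processing.
    tokens = Counter(re.findall(r"[a-z]+", report.lower()))
    # Sum the counts of every expert term (including synonyms) listed for each class.
    return {code: sum(tokens[term] for term in terms)
            for code, terms in lexicon.items()}


print(expert_features("Subdural haemorrhage with intracranial bleeding; subdural clot."))
```

Because the terms are fixed in advance, the sketch needs only one pass over each report and no corpus-wide statistics, which is consistent with the small attribute count described for the expert-driven classification file.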
Moreover, the proposed expert-driven feature selection approach was much faster than all five automated feature selection approaches. This is because the automated feature selection algorithms, namely IGFS, FMS, Chi-square, information gain, and Pearson Correlation, consider the whole dataset when determining the most discriminative features by applying various computational methods.
However, in the expert-driven feature selection approach, the features were already provided and ranked by experts. Hence, this approach only computed the weights of the expert-provided features from the autopsy reports and prepared the classification data using frequency counts. Furthermore, the classification file prepared by the expert-driven feature selection approach was much smaller than those of the five automated feature selection schemes. The classification file prepared by the expert-driven approach contained only as many attributes as there were classes. Conversely, the five automated feature selection schemes counted each token as one feature after tokenization, so the number of attributes was equal to the number of unique tokens. Therefore, the automated feature selection techniques required longer classification time. Finally, the proposed expert-driven feature selection can be used to classify any kind of clinical report. The only requirement of this approach is a set of ranked features from an expert.
The accuracy of a classification task usually depends on the quality of the feature set. Inadequate, extraneous, and irrelevant features may generate less accurate and incomprehensible results. Therefore, it is important to remove irrelevant and non-discriminative features from the master feature set by using feature subset selection algorithms prior to classification [66]. The purpose of feature subset selection is to decide which features to include in classification and which to remove. For this research, we also hypothesized that various subsets of features would produce different performance results in terms of Precision M , Recall M , F-measure M , and overall accuracy. To evaluate this proposition, we aimed to determine the best feature subset size for the classification of autopsy reports to improve the classification performance.
To discover the best feature subset size, a subset of 10 features was initially selected using all six aforementioned feature selection schemes to evaluate the performance of all five classifiers. The number of features was then increased up to the point where no further improvement in performance was found. In addition, we evaluated the performance of all five classifiers using "all" features. In most of the experiments, we noticed that increasing the size of the feature subset from 10 to 30 led to considerable improvements in the experimental results. Conversely, increasing the size of the feature subset from 30 to 40 and then to "all" did not cause considerable improvements in the results. As a result, we can infer that a larger feature subset may not positively affect the results. Thus, to determine an optimum number of features in the feature vector, researchers are advised to perform a sensitivity analysis that examines a range of feature subset sizes from 10 up to the point where no improvement in accuracy is observed.
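The sensitivity analysis described above can be sketched as a simple loop that grows the feature subset until accuracy stops improving. In this sketch, `best_subset_size`, the `evaluate` callback, the `min_gain` threshold, and the toy accuracy values are all assumptions made for illustration; in practice `evaluate` would train and validate a classifier on the top-ranked features of the given size.

```python
def best_subset_size(sizes, evaluate, min_gain=0.01):
    """Return (size, accuracy) at the point where accuracy stops improving.

    sizes: candidate subset sizes in increasing order, e.g. [10, 20, 30, 40].
    evaluate: hypothetical callback mapping a subset size to an accuracy in [0, 1].
    min_gain: smallest accuracy increase that still counts as an improvement.
    """
    best_size, best_acc = None, float("-inf")
    for size in sizes:
        acc = evaluate(size)
        # Stop as soon as a larger subset fails to improve accuracy by min_gain.
        if best_size is not None and acc - best_acc < min_gain:
            break
        best_size, best_acc = size, acc
    return best_size, best_acc


# Toy accuracies (made up for illustration, not results from this study).
toy_accuracy = {10: 0.60, 20: 0.70, 30: 0.75, 40: 0.75}
print(best_subset_size([10, 20, 30, 40], toy_accuracy.get))
```

Stopping at the first size that fails to improve keeps the search cheap, at the cost of possibly missing a later gain; evaluating every candidate size and taking the maximum is the more exhaustive alternative.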
According to the "no free lunch" theorem [67], no single machine learning algorithm performs best in all application areas. Hence, a variety of decision models should be tested. Therefore, we evaluated the performance of five classifiers (NB, SVM, KNN, J48, and RF) with the six aforementioned feature selection approaches to classify free text autopsy reports. In four automated feature selection approaches, namely, information gain, Chi-square, Pearson Correlation, and IGFS, the J48 classifier produced the most promising results, followed by SVM and RF. In the FMS feature selection scheme, SVM produced the highest results, followed by J48 and RF. Conversely, in the expert-driven feature selection approach, RF outperformed the other classifiers, followed by J48. There are several possible reasons for the strong performance of J48 in automated feature selection: J48 does not need any domain knowledge or parameter setting, and it can handle data with high dimensionality. Moreover, J48 can handle datasets with errors and missing values. Furthermore, it is considered a nonparametric classifier, which means that it does not make any assumptions about the space distribution or the classifier structure. The main disadvantage of J48 is that it can easily overfit. A possible reason for the strong performance of RF is its ensemble nature. RF constructs a multitude of decision trees from the training data using randomly selected features. To classify a new autopsy report from an input dataset, RF passes the input vector down each of the trees in the forest. Each tree predicts the CoD for the given autopsy report, and the forest chooses as the final CoD the class having the most votes. SVM, KNN, and NB showed considerably lower performance than RF and J48 in all the experiments. We speculate that NB assumes conditional independence among features, which is possibly inappropriate for the collected autopsy reports [68].
Furthermore, as the number of features increases, the conditional dependence among the features becomes more complicated, and this can negatively affect the performance of the NB classifier. The poor performance of the KNN classifier may be due to its default assumption of linear scaling of features, which might have led to inaccurate computation of the KNN distance measures. In addition, this assumption becomes misleading with features that have very low discriminating power. The performance of SVM depends on the choice of kernel [69]. The selection of a proper SVM kernel and kernel function parameters, such as the width or sigma parameter, may further increase the SVM performance [70]. In future optimization work, we may search for the optimal parameter values for the SVM decision model.
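The voting step described above for RF, in which each tree predicts a CoD and the forest returns the class with the most votes, can be sketched as follows. This is a minimal illustration of majority voting only, not of tree construction; the function name `forest_predict` and the toy "trees" (plain functions over a hypothetical feature dictionary) are assumptions made for this sketch.

```python
from collections import Counter


def forest_predict(trees, features):
    """Return the class predicted by the largest number of trees.

    trees: callables mapping a feature dict to a class label (stand-ins for decision trees).
    Ties are resolved in favor of the label that was predicted first.
    """
    if not trees:
        raise ValueError("the forest must contain at least one tree")
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy "trees" (hypothetical rules over made-up attributes, for illustration only).
toy_trees = [
    lambda f: "S06" if f["head_terms"] > 0 else "T07",
    lambda f: "S06" if f["head_terms"] > 1 else "T07",
    lambda f: "T07",
]
print(forest_predict(toy_trees, {"head_terms": 2}))
```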

Significance of dataset and proposed feature sets
In medical autopsy, suitably annotated and statistically independent samples of autopsy reports for the construction and evaluation of classifiers are scarce and expensive to obtain. In addition, ethical considerations often restrict the number of autopsy reports that can be collected. Thus, sample size planning is an important aspect of the design of experiments. Hence, to find the optimum sample size for each class, various experiments were performed to examine a range of sample sizes from 25 up to the number of instances at which no further improvement in accuracy was observed. All of these experiments were performed using the expert-driven feature selection approach with the "all" feature subset size and the RF classifier. The expert-driven feature selection approach with the RF classifier was used because it produced the best results in all the performed experiments. The results of these experiments are shown in Fig 9. The lowest accuracy of 52%-73% was noted when the number of reports was 25 to 50. Accuracy between 79.80% and 80.05% was observed with 75 to 100 autopsy reports. Only extremely slight variation in accuracy was observed when the number of cases increased from 100 to 200, and a consistent accuracy of 80% was found with 125 to 200 autopsy reports. Thus, we concluded that a minimum of 75 to 100 autopsy reports is a reasonable number for constructing and evaluating an accurate model for the classification of medical autopsy reports.
Given the restrictions brought about by privacy and ethical considerations, no public dataset was available for testing the significance of the proposed approach. To examine such significance, four baselines were created from the dataset collected for this research, namely, bag of words (BoW) and the feature engineering techniques proposed in [14], [41], and [71,72]. In [14], a rule-based feature weight learning technique was used to select the features from forensic autopsy reports. In [41], a locally-semi-supervised feature selection technique was used to select the most discriminative features from verbal autopsy reports belonging to each CoD. In [71,72], "Ensemble-based Multi-Filter Feature Selection (EMFFS)" was used to select the most important and most discriminative features for the classifier by combining several filter-based feature selection schemes. To compare our proposed approach with EMFFS, four feature selection methods, namely, Chi-square, information gain, Pearson correlation, and gain ratio, were combined using the Weka tool. Finally, we combined the output of these four feature selection schemes using the fusion-based rank aggregation method proposed in [73] to generate a final global feature ranking list.
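To make the idea of fusing several feature rankings into one global list concrete, the sketch below uses a Borda count. This is a swapped-in, generic aggregation rule chosen for illustration; it is not claimed to be the fusion-based method of [73], whose details are not given here. The function name `borda_aggregate` and the toy rankings are assumptions.

```python
def borda_aggregate(rankings):
    """Fuse ranked feature lists (best first) into one global ranking.

    A feature at zero-based position p in a list of length m earns m - p points;
    features missing from a list earn nothing from it. Ties are broken alphabetically.
    """
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for pos, feature in enumerate(ranking):
            scores[feature] = scores.get(feature, 0) + (m - pos)
    # Highest total score first.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Toy rankings (made up), standing in for the outputs of several filter methods.
toy_rankings = [
    ["fracture", "drowning", "burn"],
    ["drowning", "fracture", "burn"],
    ["drowning", "burn", "fracture"],
]
print(borda_aggregate(toy_rankings))
```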
Experiments were conducted to measure the overall accuracy of all five classifiers using the features from these four baselines. The baseline accuracy was compared with the accuracy of the proposed expert-driven feature selection approach using the "all" feature subset size. The accuracy of the baseline feature selection schemes and the proposed expert-driven feature selection scheme is shown in Table 2. As shown in the table, in all four baselines, the J48 and RF decision models consistently showed promising accuracy, and the lowest accuracy was observed for KNN, NB, and SVM. Across all four baselines, the highest accuracies of 73.85% and 73.18% were obtained by J48 and RF, respectively, in baseline 4. However, the proposed expert-driven feature selection technique achieved higher accuracy than all four baselines.

Limitations and future work
Some challenges were also identified in the proposed expert-driven feature selection approach. First, the results of the proposed expert-driven approach depend heavily on the domain knowledge of the experts and their familiarity with autopsy findings. We believe that, in the current study, the engagement of pathologists yielded experimental results that can be reproduced across other medical systems. Second, the presented findings are specific to the free text autopsy reports obtained from PPUM, one of the largest hospitals in Kuala Lumpur, Malaysia. We also believe that the extracted reports are sufficiently heterogeneous, diverse, and comprehensive compared with the data gathered by other medical systems, and the approach should therefore produce acceptable results across other healthcare systems.
Third, the developed model can only detect nine accident-related causes of death. However, the system can be extended to other accident-related causes of death by following similar steps. Finally, this paper proposes the use of supervised ATC techniques to predict the cause of death from autopsy reports. An ontology-based approach may produce better results than our proposed approach. However, because the proposed method resulted in performance measures exceeding 90%, slightly better results may not justify the significant effort required for ontology development.
Various opportunities for improving the system were also identified, which will require future work in the domain under consideration. First, having pathologists review autopsy reports and assign a cause of death to each report to prepare a training set for classification is a time-consuming and challenging task. Therefore, we suggest evaluating the results produced by the proposed classification approach against those of clustering techniques, which work on unlabeled data. Moreover, compared with the classification approach, clustering is easier to implement and requires less involvement of experts in system implementation. Second, although our proposed expert-driven feature selection approach produced satisfactory performance, in future work we aim to employ an ontology-based approach and compare its results with our findings. Finally, in future work, we aim to apply the supervised ATC and ontology-based techniques to autopsy reports to predict heart-related and homicide-related causes of death.

Conclusion
In this paper, an expert-driven feature selection approach was proposed to predict the cause of death from free text medical autopsy reports. State-of-the-art supervised ATC techniques with automated and expert-driven feature selection approaches were used to classify the cause of death from free text medical autopsy reports. We found that the proposed expert-driven feature selection approach outperformed the automated feature selection approaches, with performance measures exceeding 90%. Moreover, the RF and J48 classifiers were found to be suitable for the classification of autopsy reports with a feature subset size of 30. Based on the results, the proposed system proved to be more robust and more accurate than the four baselines. Furthermore, the promising results indicate that pathologists can use the proposed system as a source of second opinion, assisting them in determining the cause of death more accurately and rapidly. In addition, this research can be extended to other types of clinical reports. The proposed technique has the potential to cut down the time and effort needed for public healthcare reporting.