Abstract
The aviation system is safety-critical by nature, and any incident or accident can lead to the loss of human life and significant operational disruptions. The International Civil Aviation Organization (ICAO) emphasizes that every flight must take off and land safely, a goal achieved over 126,000 times daily. Despite major advancements, mishaps and accidents continue to occur, underscoring the need for robust safety management systems. Accurate classification of aviation occurrence reports as incidents or accidents is essential for safety management, yet manual review is time-consuming and prone to inconsistency. While incident/accident labels are assigned during reporting, automated classification enables rapid triage, detection of potential mislabeling, and support for severity assessment in high-volume aviation safety operations. To address this, we developed and compared three machine learning classifiers (Multinomial Naive Bayes, Random Forest, and Support Vector Machine) using TF-IDF vectorization on an 80-year dataset of 53,770 aviation occurrence summaries obtained from the Transportation Safety Board of Canada. A two-stage evaluation strategy was employed, consisting of an initial 80/20 train-test split to create an independent test set, followed by 5-fold cross-validation applied exclusively to the training data to ensure robustness and prevent optimistic bias. The Support Vector Machine (SVM) classifier achieved the highest classification performance, attaining an accuracy of 98.06% during 5-fold cross-validation, with consistent results across folds, demonstrating its effectiveness in managing high-dimensional textual data and dataset complexity.
The proposed framework provides a robust foundation for automated aviation safety report processing, offering practical value for (1) early triage of safety reports, (2) identification of potentially mislabeled cases requiring expert review, and (3) integration into downstream severity assessment pipelines. This work advances beyond prior classification studies by establishing a benchmark on the largest historical aviation safety dataset while delivering a deployable and operationally relevant framework for real-world safety management applications. The findings offer valuable insights for regulatory authorities and airline operators, contributing to enhanced safety oversight, improved response strategies, and safer aviation operations.
Citation: Qureshi S, Tayubi IA, BaruKab O, Khan SA (2026) Enhancing aviation safety: An 80-year data-driven model for classification of aviation incident and accident. PLoS One 21(5): e0345956. https://doi.org/10.1371/journal.pone.0345956
Editor: Ankit Gupta, CCET: Chandigarh College of Engineering and Technology, INDIA
Received: July 15, 2025; Accepted: March 12, 2026; Published: May 21, 2026
Copyright: © 2026 Qureshi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from the Transportation Safety Board of Canada (https://www.bst-tsb.gc.ca/eng/stats/aviation/data-5.html).
Funding: This work was supported by King Abdulaziz University (DSR), Jeddah, Saudi Arabia, through the Institutional Fund Projects (grant no. IFPHI-360-830-202 to O.B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Aviation safety is a critical concern and a top priority underscored by the International Civil Aviation Organization (ICAO) [1,2] and the Transportation Safety Board (TSB) [3]. The TSB publishes data from its Aviation Safety Information System (ASIS) on reportable accidents and incidents, collectively referred to as aviation occurrences. This data, gathered during investigations, is used to analyze safety issues and identify risks within the Canadian transportation system. Reporting of these accidents and incidents complies with the Transportation Safety Board regulations. The shared goal of ICAO regulations is to ensure safe take-off and landing of flights over 126,000 times daily. Despite substantial progress, significant improvements in reducing accidents and incidents remain necessary, as both continue to pose challenges, highlighting the need for effective and robust safety management systems.
This research study aims to address a significant gap in the automated classification of aviation occurrences, specifically incidents and accidents, that were reported in accordance with the Transportation Safety Board regulations [4] in effect at that time, by using a comprehensive 80-year dataset (1955–2020) of occurrence reports from the TSB [5]. Each record in the dataset includes an ‘Occurrence Type’ field, which labels the record as either an ‘Incident’ or an ‘Accident’ based on the ICAO and TSB definitions. These two categories serve as the classification labels for the binary classification task performed in this study.
Aviation safety remains a top priority, and understanding aviation safety trends requires analyzing historical data. Such data, spanning over 80 years, provides valuable insights into the occurrences, causes, and prevention measures of accidents and incidents, as evidenced by the ICAO’s stringent safety standards. Terms such as “occurrence,” “accident,” and “incident” are fundamental in this context. According to ICAO, occurrences are events affecting or potentially affecting the safety of operations, including any irregular, unplanned, or non-routine event. An accident involves severe injury or aircraft damage, while an incident affects or could affect the safety of operations [6,7]. In this study, these ICAO and TSB definitions are used to distinguish between the two classes, incidents and accidents, that the machine learning models are trained to classify.
While previous studies have explored the classification of aviation safety reports (Madeira et al., 2021 [8]; de Vries, 2020 [9]; Ahadh et al., 2021 [10]), their practical deployment in operational safety management remains limited. Our work addresses this gap by developing a classification system with direct operational applications: (1) Automated Triage: enabling safety departments to prioritize high-risk reports among thousands of daily submissions; (2) Quality Control: identifying potentially mislabeled reports that may require expert re-evaluation; and (3) Severity Assessment Foundation: providing accurate initial classification for downstream risk scoring systems. Unlike prior work focusing on specific incident types or shorter timeframes, we establish a comprehensive benchmark on 80 years of data, demonstrating that simple yet well-tuned classifiers can achieve operational-grade accuracy suitable for integration into existing safety management workflows.
Existing literature has highlighted and explored various aspects of aviation safety management practices and the role of data analysis in enhancing aviation safety, including data collection, analysis, and the implementation and development of safety management systems. Previous studies have emphasized the importance of analyzing data to identify trends, understand causes, and implement preventive measures [9,11–12].
However, a notable research gap remains: there is no automated approach that accurately classifies aviation occurrences, specifically incidents and accidents, using historical data.
This research study aims to address this gap by developing a classification model to classify aviation occurrences, particularly incidents and accidents, more accurately. The objective is to mitigate risks and enhance response strategies, ultimately improving aviation safety. By utilizing machine learning techniques, the dataset Occurrence.csv was analyzed using Natural Language Processing (NLP) to interpret textual occurrence reports and their summaries. An initial 80/20 stratified train-test split was employed to create an independent test set for unbiased performance evaluation. Three classifiers—Multinomial Naive Bayes, Random Forest, and Support Vector Machine (SVM)—were evaluated in conjunction with TF-IDF vectorization, and 5-fold cross-validation was applied exclusively to the training data to assess model stability and guide model selection, ensuring robust and reliable performance [13–15]. Despite the numerous studies, there is still a lack of effective models for classifying aviation safety occurrences, specifically incidents and accidents, based on historical data. This gap underscores the need for advanced methodologies to improve classification accuracy and reliability.
This study makes three primary contributions:
- Operational Framework: We develop and validate a machine learning pipeline specifically designed for integration into aviation safety management systems, focusing on practical deployability rather than theoretical novelty alone.
- Historical Benchmark: Using the largest available aviation occurrence dataset (80 years, 53,770 reports), we establish performance baselines that account for temporal variations and reporting practice changes over decades.
- Practical Value Demonstration: We demonstrate how high-accuracy classification enables concrete safety applications including report triage, quality assurance, and severity assessment support—addressing the “why classify if already labeled” question by showing operational efficiency gains.
The scope of this research study focuses on analyzing 80 years of historical data from TSB using machine learning techniques, including Natural Language Processing (NLP) techniques like text preprocessing and various classifiers, to achieve high classification accuracy. The aim is to develop machine learning models that accurately classify aviation incidents and accidents, thereby reducing risks, enhancing safety response tactics, and validating these models’ performance to improve their accuracy. The constraints include the quality and completeness of the dataset, the complexity of the data, and the need for thorough validation to ensure the model’s reliability. Despite these constraints, this research paper aims to provide a reliable system for classifying aviation occurrences, specifically incidents and accidents, leveraging historical data reports and advanced machine learning techniques [16–18].
By addressing these aspects, this research contributes to the field of aviation safety by providing a robust classification system for historical data. This research study offers valuable insights for regulatory bodies and airlines to enhance safety protocols and measures, contributing to safer skies and ensuring safe travel. The findings are expected to have significant implications for proactive safety management in aviation, as they provide a dependable framework for classifying past occurrences which can be used to predict and prevent future incidents and accidents [19–21].
The research paper is arranged as follows: The background literature section presents relevant work on aviation safety prediction. The proposed system section explains the methodology and details of the study. The experimental results section displays the results. The discussion section analyzes the outcomes, and the conclusion section includes a summary of the paper’s findings and future work. Fig 1 shows the geographic distribution of reported occurrences, with the INTERNATIONAL and NATIONAL regions accounting for the largest proportions.
2 Background literature
In aviation safety, research has evolved significantly over the years, with various studies focusing on different aspects for the analysis and classification of incident and accident reports. The study of aviation safety through data-driven models has been explored through a range of analytical techniques. Traditional statistical analyses have provided valuable insights into accident frequencies and trends over time, often using aggregated datasets from governmental aviation authorities. More recently, machine learning approaches have enabled the classification of incident reports and prediction of potential risks based on structured features. For instance, prior work has utilized decision trees, SVMs, and neural networks to evaluate causal factors. In parallel, Natural Language Processing (NLP) methods have facilitated the extraction of latent information from unstructured textual reports, revealing human factors, environmental causes, and operational failures. Despite these advances, limitations exist in the temporal scope, integration of multi-source datasets, and emphasis on real-world deployment.
Our work fills this gap by applying supervised learning and textual analysis to the narrative summaries of an 80-year occurrence dataset in order to identify risk conditions with an analytical intent. While previous studies have applied NLP/ML to aviation safety, our work is distinguished by its application to the largest and longest temporal dataset, its rigorous, operationally focused validation strategy designed to prevent leakage, and its explicit aim to provide a high-accuracy, deployable foundation for automated report triage, a critical first step in high-volume safety management systems.
2.1 Previous studies in aviation safety
Lan, H., Wang, S., & Zhang, W. (2024) investigated the types of human-related maritime accidents. Using a novel strategy that combines selective ensemble learning with the SHAP method, they sought to optimize the accuracy and interpretability of the prediction system. Their contribution is a tool for predicting and understanding incident types and for supporting safety measures.
Zhang, X., & Mahadevan, S. (2019) focused on predicting the risk of aviation incidents using ensemble machine learning models, estimating the risk level associated with different hazardous causes and their potential consequences. They recognized that traditional methods for analyzing aviation incidents often struggle because these events are infrequent and unpredictable. Their work contributes to aviation safety by illustrating the value of machine learning in predicting incident risk and by providing useful guidance for decision-makers.
Jiang, X., Zhang, Y., Li. (2022) aimed to predict aircraft passenger satisfaction levels and to identify the key factors that influence them. They proposed an RF-RFE-LR model that combines Random Forest (RF), Recursive Feature Elimination (RFE), and Logistic Regression (LR). Their contribution is a practical approach for predicting passenger satisfaction and identifying the key factors for improvement in the aviation industry.
Abraham, N., (2022) used machine learning to categorize FAA unmanned aircraft system (UAS) sighting reports by their potential hazard level. The aim was to help aviation authorities prioritize their responses to UAS-related incidents.
Madeira et al., (2021) focused on predicting the human factors involved in aviation incidents using NLP and ML methods. They aimed to identify and classify human factor categories from aviation incident reports to improve aviation safety.
de Vries, V., (2020) explored the use of machine learning techniques to classify aviation safety reports. The aim was to categorize reports by their content, potentially supporting prevention (safety recommendations), resource allocation, and incident analysis.
Ahadh et al., (2021) proposed a semi-supervised strategy to extract important insights and domain-specific keywords and to recognize the underlying topics in accident reports. The approach combines topic modeling and keyword extraction to recognize patterns and key themes in text data, and it is valuable for many applications, such as risk assessment, accident analysis, and safety improvement measures.
Perboli et al., (2021) demonstrated the power of NLP for automating human factor identification in aviation accidents. By aligning the results with the SHEL (Software, Hardware, Environment, Liveware) methodology, the study provides a structured strategy that makes accident causation easier to understand and offers a standard model for accident causation analysis.
Rose et al., (2020) used NLP methods to create a methodology for analyzing aviation safety narratives and identifying clusters among them. By clustering related incidents together, the approach reveals underlying trends and patterns and offers possible avenues for safety improvement.
Miyamoto et al., (2022) focused on using NLP techniques to identify operational inefficiencies by analyzing safety reports within the aviation industry. By extracting information from textual data, the study aims to reveal patterns that contribute to flight cancellations and delays.
Dong et al., (2022) proposed a deep learning strategy for extracting causal factors from aviation incident data that outperforms conventional methods. The study aims to improve aviation safety by recognizing the underlying causes of incidents, and its reusability makes it applicable to many text analysis tasks in the aviation field.
Rose et al., (2022) demonstrated the effectiveness of Structural Topic Modeling (STM) in revealing meaningful topics within aviation safety data. STM is a text mining method that goes beyond typical topic modeling by adding external information to guide the discovery of latent themes. The study applies STM to the Aviation Safety Database (ASD); by including external information, the model offers a more detailed understanding of the causes contributing to aviation incidents. The findings highlight how STM can help the aviation sector by supporting safety analysis and decision-making.
Jiao et al., (2022) focused on categorizing Chinese civil aviation incident reports into distinct categories and determining the root causes of these incidents; the identified causes can provide valuable insights for safety improvement. The study shows the effectiveness of integrating machine learning and deep learning methods for classifying aviation incident reports, with the goal of increasing aviation safety by identifying the causes contributing to accidents.
Zhang et al., (2021) investigated the application of sequential deep learning techniques to predict adverse aviation events based on accident investigation records from the National Transportation Safety Board (NTSB). The study advances the field of aviation safety by demonstrating how deep learning can be used to analyze textual data and predict adverse events. Table 1 summarizes the previous studies.
Synthesis and position of current work.
While the reviewed studies demonstrate valuable applications of ML/NLP in aviation safety, they typically focus on specific aspects (e.g., human factors, risk prediction) or use limited datasets. Our work differentiates itself through three key contributions: (1) Scale and Temporal Scope: We utilize the most extensive historical dataset (80 years, 53,770 reports), enabling analysis of long-term trends and robustness testing across eras; (2) Operational Validation Focus: We implement a rigorous two-tier validation strategy (80/20 split + 5-fold CV on training only) specifically designed to prevent data leakage and provide realistic performance estimates for operational deployment; (3) Practical Application Pipeline: We frame the classification task as the first step in an automated safety report triage and quality assurance system, moving beyond academic accuracy to demonstrate practical utility for safety managers.
2.2 Research gaps
Previous studies indicate several areas requiring further investigation. While machine learning and NLP have been applied to aviation safety for tasks such as risk level assessment, incident categorization, and human factors analysis [31], these efforts often focus on specific sub-domains or utilize limited, shorter-term datasets.
A significant gap exists in the development of automated systems for the classification of aviation occurrences (incidents vs. accidents) using a comprehensive, long-term historical dataset. Such a dataset is crucial for building models that are robust to variations in reporting standards, terminology, and technology over time. This study addresses this gap by integrating machine learning and NLP techniques to analyze an unprecedented 80-year dataset, offering new insights into the distinguishing patterns between incidents and accidents.
Furthermore, many existing models, while accurate, are not designed or validated for the specific operational application of automated report processing. They often lack the rigorous validation strategy needed to ensure reliable performance in a real-world safety management workflow, where automated triage and quality control of incoming reports are critical needs.
This research advances the field by:
- Providing a benchmark classification study on the most extensive historical aviation safety dataset available.
- Implementing a validation framework specifically designed to prevent data leakage and yield performance estimates relevant for operational deployment.
- Demonstrating how high-accuracy classification enables direct practical applications—such as automated report triage, detection of labeling inconsistencies, and foundation for severity assessment—rather than focusing solely on academic metrics.
By establishing a reliable, automated method for classifying historical safety reports, this work provides the essential foundation upon which future predictive systems for real-time risk identification can be built. The findings contribute to ICAO’s safety objectives by enhancing the efficiency and consistency of safety data analysis, which supports more effective safety monitoring and proactive risk mitigation strategies in the aviation industry.
3 Methodology
The primary objective of this study is to develop a machine learning-based classification system that can distinguish between aviation accidents and incidents based on textual summaries from occurrence reports. Our methodology is structured into several key stages: data acquisition, preprocessing, feature extraction, model training, and evaluation, as shown in Fig 2.
The input to our methodology consists of unstructured textual summaries from the Aviation Safety Information System (ASIS) occurrence reports published by the Transportation Safety Board (TSB) of Canada. Only the Summary text field was used as the sole input source for all models, with the OccTypeID_DisplayEng field serving exclusively as the target label for classification. This approach was deliberately chosen to prevent data leakage. Specifically, we excluded all structured variables—such as those documented in the comprehensive data dictionary by [1] and detailed in Table 2, which describes 40 key variables spanning occurrence details, aircraft specifications, weather conditions, flight phases, and survivability metrics for Canadian aviation incidents (1955–2020)—because many contain post-hoc analysis information (e.g., DamageLevelID, TotalFatalCount, InjuriesEnum) that would provide the model with investigation outcomes unavailable at the time of prediction. Crucially, the Summary field itself is a synthesized narrative that incorporates the essential informational content from these structured variables. For instance, details on weather (SkyCondID), aircraft type (AircraftModelID), and flight phase (PhaseID) are naturally embedded within the textual report. Our NLP pipeline, employing TF-IDF vectorization, transforms this single but information-rich text column into multiple numerical features, allowing the models to learn patterns from the circumstances described at the time of the event, thereby ensuring a realistic and leakage-free classification framework. The output is a binary classification indicating whether a report corresponds to an accident or an incident, aiming to assist safety authorities in proactive hazard identification and streamlined report processing.
3.1 Algorithm 1: Framework of Our Methodology
Algorithm 1 Classification of Aviation Occurrence (Incident or Accident)
3.2 Dataset description.
The dataset used in this research comprises occurrence reports from the Transportation Safety Board (TSB) of Canada, covering approximately 80 years from 1955 to 2020. We selected the Canadian TSB dataset due to its public availability, consistent structure, and comprehensive documentation across a long historical span. Notably, while it originates from Canada, the dataset captures a diverse range of aviation incidents and accidents that include reports involving international carriers, thus offering insights applicable beyond national boundaries. Some of these reports were also collected from the National Transportation Safety Board (NTSB) of the United States to ensure alignment with ICAO definitions. However, the dataset spans decades and includes reports compiled by different experts and groups, which may introduce potential biases due to variations in reporting standards and practices over time.
The TSB published the data from its Aviation Safety Information System (ASIS), which includes reported accidents and incidents—collectively referred to as occurrences. This data, gathered through official investigations, serves to analyze safety concerns and identify risks within the Canadian and broader aviation systems. These reports comply with the Transportation Safety Board Regulations applicable at the time of each event. The Transportation Safety Board (TSB) Regulations, under the Canadian Transportation Accident Investigation and Safety Board Act, establish mandatory reporting requirements for aviation occurrences. Revised in 2014, the regulations specify incidents to be reported, including structural failures, injuries, and system malfunctions. Reports must include detailed information and be submitted promptly, with exemptions applying only if such information has already been submitted. The TSB retains the authority to request further data to ensure comprehensive investigations and effective safety oversight.
Labels for occurrences were assigned based on definitions provided by the ICAO (the UN agency that sets international civil aviation standards). The dataset comprises a total of 53,770 samples and maintains its natural class distribution, with 30,270 incidents (56.3%) and 23,500 accidents (43.7%). No resampling, weighting, or balancing strategies were applied, in order to preserve the real-world incidence rates and avoid artificial performance inflation. The dataset includes detailed records of incidents and accidents categorized by multiple operational factors. The dataset used in our modeling covers air transportation occurrence data from January 1955 to 2020, encompassing the full 80-year period.
Our focus is on analyzing conditions that lead to aviation incidents and accidents, not on modeling routine, uneventful flights, which are, by default, considered safe. The absence of risk-indicative patterns in a report typically implies normal operations. Thus, by identifying patterns associated with accidents or incidents, the model indirectly aids in recognizing conditions less likely to produce adverse outcomes. The objective is not to predict “safe flights” per se but to highlight high-risk factors that may warrant preventive or mitigative interventions. This dataset is available at: https://www.bst-tsb.gc.ca/eng/stats/aviation/data-5.html
The 80-year span (1955–2020) presents unique challenges not present in shorter-term studies: evolving reporting standards, terminology shifts, changes in aircraft technology, and varying regulatory frameworks. Unlike models trained on short-term data, our model must learn robust patterns that generalize across these temporal variations. This necessitates a validation strategy that tests stability across time—addressed through our two-tier validation approach—and ensures the model does not overfit to era-specific jargon or reporting styles.
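The two-tier validation approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the corpus is synthetic, `LinearSVC` stands in for the SVM because the kernel is not specified here, and the random seeds, fold shuffling, and `n_estimators` value are arbitrary choices. The key design point is that TF-IDF sits inside a scikit-learn pipeline, so the vocabulary is re-fitted on the training portion of every fold and the held-out 20% is touched only once at the end.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic corpus (invented phrases): 20 "accident" and 20 "incident" summaries.
acc = ["crashed", "destroyed", "collided with terrain", "fatal injuries", "substantial damage"]
inc = ["returned safely", "landed without damage", "declared emergency", "rejected takeoff", "no injuries"]
ctx = ["during takeoff", "on approach", "in cruise", "while taxiing"]
texts = [f"aircraft {p} {c}" for p in acc for c in ctx] + \
        [f"aircraft {p} {c}" for p in inc for c in ctx]
labels = ["ACCIDENT"] * 20 + ["INCIDENT"] * 20

# Stage 1: stratified 80/20 split; the test set is set aside untouched.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)

# Stage 2: 5-fold cross-validation on the training data only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "LinearSVC": LinearSVC(),  # assumption: linear SVM as a stand-in
}
cv_scores = {}
for name, clf in models.items():
    # The pipeline re-fits TF-IDF inside each fold, so no fold sees
    # vocabulary statistics from its own validation split.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    cv_scores[name] = cross_val_score(
        pipe, X_train, y_train, cv=cv, scoring="accuracy").mean()

# Select on CV accuracy, then evaluate once on the held-out test set.
best_name = max(cv_scores, key=cv_scores.get)
final = make_pipeline(TfidfVectorizer(), models[best_name]).fit(X_train, y_train)
test_accuracy = final.score(X_test, y_test)
print(cv_scores, best_name, test_accuracy)
```

On the real data, the vectorizer would also carry the reported `min_df = 5` setting; it is omitted here because the toy corpus is too small for that threshold.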
3.2.1 Dataset overview.
Structured Fields. The structured data fields in the dataset include:
- Date and Time: The specific date and time at which the occurrence took place.
- Location: The geographic area where the incident or accident occurred.
- Occurrence Type: Classification of the occurrence, such as accident, incident, or serious incident.
- Occurrence Category: Further categorization based on the nature of the event.
- Aircraft Details: Information about the aircraft involved in the occurrence, including model and registration.
- Injuries/Fatalities: The number of fatalities or injuries resulting from the occurrence.
- Weather Conditions: The weather conditions at the time of occurrence.
- Aerodrome Data: Information about the landing and takeoff aerodromes or operating surfaces.
Unstructured textual data. The dataset also includes unstructured textual descriptions of each event. These descriptions provide detailed narrative reports of what occurred, which are very important for understanding the context and contributing factors. Fig 3 shows the distribution of the dataset used in this study, which comprises a total of 53,770 samples across two classes. The dataset is moderately imbalanced, with “INCIDENT” being the more common class (30,270 samples, 56.3%) and “ACCIDENT” being the less common (23,500 samples, 43.7%), resulting in an imbalance ratio of 1.3:1.
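The class shares and imbalance ratio follow directly from the reported counts, as the short check below shows. The majority-class baseline is an added point of reference rather than a figure from the study: it is the accuracy a trivial classifier would obtain by always predicting “INCIDENT”.

```python
# Class counts as reported for the dataset.
counts = {"INCIDENT": 30270, "ACCIDENT": 23500}
total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Ratio of the larger class to the smaller class.
imbalance_ratio = round(max(counts.values()) / min(counts.values()), 1)

# Accuracy of always predicting the majority class (reference point only).
majority_baseline = max(counts.values()) / total

print(total)            # 53770
print(shares)           # {'INCIDENT': 56.3, 'ACCIDENT': 43.7}
print(imbalance_ratio)  # 1.3
print(round(100 * majority_baseline, 1))  # 56.3
```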
3.2.3 Natural language processing (NLP).
Natural language refers to the method of human communication through text and speech. As a branch of AI, NLP enables machines to understand and manipulate human languages, facilitating interaction between humans and computers using natural language [32]. It aims to create algorithms and systems capable of understanding and processing both structured and unstructured language data to support decision-making processes. This field has gained significant momentum in recent years, allowing systems to read, decipher, understand, and derive meaningful insights from human language. These capabilities enable the development of systems that can perform tasks such as grammar checking, translation, and topic classification [33].
Companies are increasingly using natural language processing tools to extract valuable insights from data and to automate routine tasks [34]. NLP applications include chatbots, Google Assistant, Alexa, Siri, and auto-correctors such as Grammarly. In aviation, various techniques are employed to process and analyze textual data from occurrence reports. These techniques convert text into a format suitable for analysis, improving the performance of models and enhancing aviation safety. The main steps in NLP include text preprocessing, feature extraction, model training, and model evaluation.
3.3 Text-preprocessing
The text preprocessing stage, also known as the NLP pipeline, is a set of text preprocessing elements connected in series [35,36]. These sequential steps transform raw text into a format that can be understood and processed by a computer. This stage is particularly crucial for predicting aviation incidents and accidents, as it converts unstructured textual reports into a suitable format for machine learning algorithms.
During this stage, irrelevant characters or symbols that could distort the analysis process are removed. For instance, URLs are eliminated as they do not contribute to the contextual background of accidents or incidents. Additionally, punctuation, other non-alphanumeric characters, and stop words are removed because they do not provide valuable information for analysis.
The text preprocessing pipeline was implemented in Python using NLTK (v3.8.1) and scikit-learn (v1.3.0), and included the following sequential steps:
- Lowercasing: All text was converted to lowercase.
- Removal of URLs and Special Characters: Non-alphanumeric characters (except spaces) and URLs were removed using regular expressions.
- Tokenization: Text was split into tokens (words) using NLTK’s word_tokenize function.
- Stop Word Removal: We removed standard English stopwords from the NLTK corpus, supplemented with a custom list of 25 aviation-specific stopwords (e.g., ‘aircraft’, ‘pilot’, ‘runway’, ‘airport’, ‘flight’) that were overly frequent and non-discriminative. A full list is provided in Supplementary Material S1.
- Lemmatization (Primary): Tokens were lemmatized to their base dictionary form using the WordNetLemmatizer from NLTK, with part-of-speech tagging where applicable. Stemming was tested but not used in the final pipeline, as lemmatization provided more semantically coherent terms.
- Handling of Numbers & Units: Numbers were retained as they can indicate severity (e.g., “2 injuries”). Units of measurement (e.g., “feet,” “knots”) were kept.
- Domain-Specific Standardization: Common aviation abbreviations (e.g., “VFR” → “visual flight rules,” “ATC” → “air traffic control”) and French terms in bilingual Canadian reports were standardized to English equivalents.
- Vocabulary Pruning: After vectorization, terms occurring in fewer than 5 documents (min_df = 5) were excluded to remove noise and extremely rare terms.
In the tokenization process, textual summaries of reports are split into smaller units called tokens, which are easier for machine learning models to process. Lemmatization returns words to their base dictionary form, while stemming reduces words to a root form by stripping suffixes. Both techniques matter for this task because they map different word forms to a single term (e.g., lemmatization reduces “fly,” “flying,” and “flown” to “fly”). This enhances the consistency of the textual data, reduces redundancy, and helps the models focus on the core meaning of words rather than their variations.
For this research study, WordNetLemmatizer and the Porter Stemmer were used to perform lemmatization and stemming, respectively; as noted above, only lemmatization was retained in the final pipeline. These text preprocessing steps ensure that the textual summaries of reports are clean and consistent, enabling machine learning models to predict and analyze aviation incidents and accidents effectively.
Table 3 illustrates some examples of textual data before and after preprocessing from the TSB occurrence reports.
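The preprocessing steps listed above can be sketched as follows. This is a minimal illustration rather than the study's code: it uses only the Python standard library, the stopword set and abbreviation map are tiny hypothetical stand-ins for the NLTK corpus and the Supplementary Material S1 list, whitespace splitting stands in for word_tokenize, and lemmatization is omitted because it requires the WordNet data. The ordering of the steps is assumed from the list.

```python
import re

# Hypothetical stand-ins: the study uses NLTK's English stopwords plus
# 25 aviation-specific terms; only a handful are shown here.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "with",
             "aircraft", "pilot", "runway", "airport", "flight"}
# Keys are lowercase because lowercasing happens first.
ABBREVIATIONS = {"vfr": "visual flight rules", "atc": "air traffic control"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text):
    """Lowercase, strip URLs and special characters, tokenize,
    drop stopwords, and expand known abbreviations."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs
    text = NON_ALNUM_RE.sub(" ", text)    # keep letters, digits, spaces
    tokens = text.split()                 # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    expanded = []
    for tok in tokens:
        # Replace an abbreviation by its multi-word English expansion.
        expanded.extend(ABBREVIATIONS.get(tok, tok).split())
    return expanded


example = preprocess("The pilot contacted ATC; 2 injuries reported. See http://example.com")
```

Digits survive the cleaning step, so a phrase such as “2 injuries” keeps its numeric severity cue, in line with the handling of numbers described in the list.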
3.4 Feature extraction
Feature extraction is crucial for deriving valuable features from datasets and serves as the foundation for tasks such as classification. Researchers therefore need to focus on extracting the most appropriate information from the raw data [37,38]. In this study, we employ a single feature extraction technique, Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, which transforms textual data into numerical vectors that machine learning models can process. It assigns higher weights to terms that are distinctive of particular documents within the dataset, which can improve classification accuracy and the relevance of the extracted features.
The TF-IDF vectorization was implemented using TfidfVectorizer from the scikit-learn library (version 1.3.0) with the following parameters: max_features = 5000 (to limit dimensionality while retaining informative terms), ngram_range = (1, 2) (to capture unigrams and bigrams), and min_df = 5 (to exclude extremely rare terms). The vocabulary size after transformation was 4,872 features, representing the most discriminative terms for aviation safety reports.
3.4.1 TF-IDF.
This statistical method is widely used in NLP for transforming text data to numerical features in information retrieval tasks. This method originates from language modeling theory [13].
In this theory, the words within a text can be categorized into two types: words with eliteness and those without it. Eliteness refers to the importance or significance of certain words in a document or set of documents. The calculation combines two key metrics: term frequency (TF), which measures how often a word appears within a document, and inverse document frequency (IDF), which measures how rare, and therefore how informative, the word is across the document collection. Together they aid in distinguishing and classifying documents by assigning greater weight to words that are distinctive of a specific set of documents. IDF gives higher weight to terms that occur in few documents and lower weight to terms that occur in most documents. The final TF-IDF score is calculated by multiplying the TF and IDF values for a term in a document.
This combination is referred to as TF-IDF. As shown in Eq (1), the weight of a term t in a document d under the TF-IDF method is:
\[ W(d,t) = TF(d,t) \times \log\left(\frac{N}{df(t)}\right) \quad (1) \]
In this equation, TF(d,t) represents the number of times the term t appears in document d divided by the total number of terms in d, N represents the total number of documents, and df(t) denotes the number of documents that contain the term t in the corpus. The first term enhances recall, while the second improves precision. Although TF-IDF resolves the issue of frequently occurring terms within a document, its score indicates a term’s importance in the context of the entire corpus.
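A short worked example may make these definitions concrete. The sketch below applies them directly to a hypothetical four-document corpus, using the textbook weighting (relative term frequency multiplied by the logarithm of N over df(t)). scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numerical values differ from this form.

```python
import math

# Hypothetical tokenized corpus of four short occurrence summaries.
corpus = [
    ["engine", "failure", "during", "landing"],
    ["hard", "landing", "bent", "firewall"],
    ["aircraft", "crashed", "after", "landing"],
    ["smooth", "landing", "no", "damage"],
]


def tf_idf(term, doc_tokens, documents):
    """Weight of `term` in one document: TF(d,t) * log(N / df(t))."""
    tf = doc_tokens.count(term) / len(doc_tokens)       # relative frequency
    df = sum(1 for d in documents if term in d)         # documents containing term
    if df == 0:
        return 0.0
    return tf * math.log(len(documents) / df)


w_crashed = tf_idf("crashed", corpus[2], corpus)   # rare term: 0.25 * log(4)
w_landing = tf_idf("landing", corpus[2], corpus)   # in every document: 0.25 * log(1)
```

Here “crashed” appears in one of four documents and receives a positive weight, while “landing” appears in all four and is weighted zero, which illustrates how IDF suppresses terms that do not discriminate between documents.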
The TF-IDF vectorization was implemented using TfidfVectorizer from scikit-learn with the following parameters: max_features = 5000, ngram_range = (1, 2) (to capture unigrams and bigrams), min_df = 5, max_df = 0.95, sublinear_tf = False, and smooth_idf = True. The final vocabulary size after applying min_df and max_features was 4,872 terms.
However, TF-IDF treats each term as an independent index and does not account for similarity between words. Combined with the vocabulary limits above, it helps contain dimensionality by emphasizing the most important terms, thus focusing on the key aspects of the text data.
Fig 4 illustrates the top 20 most discriminative TF-IDF features (terms) extracted from the aviation safety reports, showing their relative importance for accident vs. incident classification. Notable aviation-specific terms include crashed, substantial damage, injuries (associated with accidents), and runway excursion, loss of control, malfunction (associated with incidents). This analysis confirms that the model learns features directly related to the outcome-based definitions, which is appropriate for the classification task.
The feature importance analysis revealed that TF-IDF effectively captured domain-specific terminology critical for aviation safety classification, with bigrams (e.g., substantial damage, runway excursion) providing additional contextual value beyond single words.
In addition to individual feature importance, an overall analysis of the TF-IDF feature space was performed to examine its statistical characteristics. Fig 5 shows that the TF-IDF representation is highly sparse, dominated by aviation-specific terminology, and exhibits a clear distribution of term weights across documents. The presence of both unigrams and meaningful bigrams indicates that the vectorization captures relevant contextual information, supporting the suitability of TF-IDF features for aviation occurrence classification.
Fig 5. (a) IDF value distribution, (b) TF–IDF weight distribution, (c) sparsity of the document–term matrix, (d) unigram–bigram composition, and (e) prevalence of aviation-specific terminology. The high sparsity and domain-relevant vocabulary confirm the suitability of TF–IDF features for aviation occurrence classification.
Nevertheless, TF-IDF vectors often yield higher accuracy compared to other techniques [39].
While TF-IDF with standard classifiers is a well-established NLP pipeline, our methodological innovation lies in its application to the largest and longest temporal aviation safety dataset and the rigorous validation strategy designed to ensure operational reliability. Rather than proposing a novel algorithm, we demonstrate that a carefully tuned, simple pipeline can achieve operational-grade accuracy (98.06%) suitable for real-world deployment. Furthermore, our feature extraction includes aviation-specific preprocessing (e.g., standardizing abbreviations like VFR, ATC) and bigram modeling to capture contextual phrases (e.g., “substantial damage,” “runway excursion”), which enhances domain relevance beyond generic text processing.
3.5 Machine learning models
In this study, three supervised classifiers were trained on the training set to classify occurrence reports and evaluated on the test data: Multinomial Naive Bayes (MNB), Random Forest (RF), and Support Vector Machine (SVM). Of these, only Random Forest is an ensemble method; MNB is a probabilistic model and SVM is a margin-based model. These models were chosen for their strong classification capabilities and robustness in handling large datasets, their proven effectiveness in text classification tasks, their computational efficiency, and their interpretability (in the case of Random Forest feature importance). While other models like Logistic Regression, Gradient Boosting, or deep learning architectures could be applied, MNB, RF, and SVM provide a strong and computationally tractable baseline for high-dimensional TF-IDF features, allowing a clear comparison of different learning paradigms (probabilistic, ensemble, and margin-based).
All models were implemented using scikit-learn (version 1.3.0) in Python 3.9. Hyperparameters were selected based on preliminary experiments using 5-fold cross-validation on the training set. The following sections detail the specific configurations and rationales for each model.
3.5.1 Multinomial Naive Bayes.
The Multinomial Naive Bayes classifier is a supervised learning algorithm and a variant of the Naive Bayes (NB) algorithm. It is based on Bayes’ theorem and is commonly applied to classification problems. It is particularly effective on the high-dimensional data involved in text classification. The algorithm is known as “naive” because it presumes that the occurrence of each feature is independent of the others. For instance, for aviation incident and accident classification, each feature, such as a specific keyword or phrase in a report, is treated independently by the classifier when calculating the probability of an incident or accident. The classifier operates on discrete features, such as word frequency counts in text classification. Although the multinomial distribution works with integer-valued feature counts, fractional counts like those produced by TF-IDF can also be effective. These properties make it a well-suited baseline for classifying aviation incidents and accidents.
Implementation details: The MultinomialNB classifier was used with default parameters (alpha = 1.0 for Laplace smoothing, fit_prior = True) as it requires minimal hyperparameter tuning. The model was trained on TF-IDF-transformed features using the partial_fit method with batch processing to handle the large dataset efficiently.
The partial_fit method is intended to be called multiple times in succession on different chunks of a dataset, enabling online or out-of-core learning, which is advantageous when a dataset does not fit into memory all at once. Because each call carries some overhead, it is recommended to process the data in chunks as large as memory allows. Fig 6 shows the Multinomial Naive Bayes classifier.
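The batch-wise training described above can be sketched as below. The feature matrix is a hypothetical random stand-in for non-negative TF-IDF features, and the batch size of 50 is an arbitrary choice for illustration. The full set of class labels has to be passed on the first call to partial_fit, because a single batch may not contain every class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 200 "documents", 20 non-negative features.
X = rng.random((200, 20))
y = rng.integers(0, 2, size=200)
X[:, 0] += 3 * y                      # make one feature informative

model = MultinomialNB(alpha=1.0, fit_prior=True)
classes = np.array([0, 1])            # all labels, required on the first call
batch_size = 50                       # illustrative value only

for start in range(0, X.shape[0], batch_size):
    rows = slice(start, start + batch_size)
    # Each call updates the per-class feature counts incrementally.
    model.partial_fit(X[rows], y[rows], classes=classes)

preds = model.predict(X)
```

After the loop the model has seen every batch exactly once, which is equivalent in effect to a single fit call on the whole matrix for this estimator.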
3.5.2 Random forest.
Random Forest is a supervised machine learning algorithm and an ensemble technique designed to address both regression and classification problems [40]. In this study, we use Random Forest for classification purposes. This method works by constructing multiple decision trees during the training phase and combining their outputs: predictions are averaged for regression and determined by majority voting for classification. Random Forest was selected in this study due to its ability to reduce overfitting, provide a measure of feature importance, and deliver reliable predictions even without extensive hyperparameter tuning [41].
Implementation details: The Random Forest classifier was configured with n_estimators = 100 (number of trees), max_depth = 5 (to prevent overfitting), min_samples_split = 10, min_samples_leaf = 5, and random_state = 42 for reproducibility. The Gini impurity criterion was used for node splitting. Feature importance analysis was conducted post-training to identify the most discriminative terms for accident/incident classification.
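A minimal sketch of this configuration is shown below. The data come from scikit-learn's make_classification and are a hypothetical stand-in for the TF-IDF matrix; only the hyperparameters are taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data (500 samples, 30 features).
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,        # number of trees
    max_depth=5,             # cap on tree depth
    min_samples_split=10,
    min_samples_leaf=5,
    criterion="gini",        # Gini impurity for node splitting
    random_state=42,
)
rf.fit(X, y)

# Indices of the five features with the highest impurity-based importance.
top_features = rf.feature_importances_.argsort()[::-1][:5]
```

With a fitted TfidfVectorizer, the indices in top_features could be mapped back to terms through its vocabulary to obtain a ranked list of discriminative words and bigrams.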
The process begins by selecting random data samples from the dataset. A decision tree is built for each sample, and predictions are made based on the structure of that tree. Because each tree is built on a different sample, each tree develops uniquely. Within each tree, the Gini impurity is applied for node splitting. It is defined in Eq (2) as follows:
\[ Gini(D) = 1 - \sum_{i} p_i^{2} \quad (2) \]
where D represents the dataset and p_i represents the probability of decision class i appearing in D. After all decision trees have produced their predictions, a voting mechanism determines the final prediction: the class that receives the most votes is selected [42]. In this study, the model was implemented with 100 estimators, meaning 100 decision trees contributed to the final prediction, with random_state = 42 to ensure reproducibility. The model was trained using the fit method on the transformed training data. To further mitigate overfitting, the max_depth parameter was set to 5, limiting each tree to a maximum of five levels. This strategy enhances the model’s generalizability, making Random Forest a suitable and robust choice for aviation incident and accident classification. Fig 7 shows the Random Forest classifier.
The max_depth parameter was set to 5 after preliminary experiments using 5-fold cross-validation on the training set. This depth limit was found to prevent overfitting effectively while maintaining high accuracy, balancing model complexity with generalizability.
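The Gini impurity used for node splitting can be illustrated with a few lines of standard-library Python. The label lists are hypothetical; the function computes one minus the sum of squared class proportions for the samples reaching a node.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure_node = gini_impurity(["accident"] * 4)                       # one class only
mixed_node = gini_impurity(["accident", "accident",
                            "incident", "incident"])              # 50/50 split
```

A node containing a single class has impurity 0, and an evenly mixed binary node reaches the maximum of 0.5, so a split is preferred when it moves the child nodes toward 0.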
3.5.3 Support vector machine (SVM).
SVMs are versatile and widely recognized as a powerful set of supervised learning techniques commonly used for classification, regression, and outlier detection tasks [43]. In this study, we use this classifier for classification purposes. This technique works by identifying the optimal decision boundary, known as the hyperplane, that best separates the data into distinct classes. Support vectors, which are the data points closest to the hyperplane, play a crucial role in constructing it.
Implementation details: We employed LinearSVC (linear kernel SVM) due to its computational efficiency with high-dimensional text data. The model was configured with C = 1.0 (regularization parameter), max_iter = 1000, random_state = 42, and dual = False for better performance with n_samples > n_features. The regularization strength was optimized through grid search over C values [0.1, 1, 10] using 5-fold cross-validation on the training set.
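The configuration and grid search described above might look like the following sketch. The data are a hypothetical stand-in generated with make_classification, and accuracy is assumed as the selection metric, since the text does not name one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the TF-IDF training matrix.
X_train, y_train = make_classification(n_samples=300, n_features=20,
                                        random_state=42)

base_model = LinearSVC(C=1.0, max_iter=1000, random_state=42, dual=False)

# Grid over the regularization strength, evaluated with 5-fold CV.
search = GridSearchCV(
    base_model,
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",      # assumed selection metric
)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
best_cv_accuracy = search.best_score_
```

GridSearchCV refits the best configuration on the full training data by default, so search.best_estimator_ can then be evaluated once on the held-out test set.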
The distance between the hyperplane and the support vectors—known as the margin—is maximized to ensure optimal separation. This helps make the classifier more robust and less likely to misclassify new data points. The primary goal of SVM is to find the optimal decision boundary in an n-dimensional space that allows for accurate classification of new data.
The SVM framework includes various implementations such as SVC, NuSVC, and LinearSVC, each suited for binary and multi-class classification tasks. SVC and NuSVC are similar but differ slightly in their mathematical formulation and parameter settings. LinearSVC, on the other hand, offers a faster alternative when a linear kernel is appropriate, though it does not explicitly provide access to support vectors like SVC does; the support vectors still exist conceptually, but are not stored or exposed by the implementation. LinearSVC uses a squared hinge loss function and regularizes the intercept term. In this study, LinearSVC is used due to its efficiency on linearly separable data and faster computation, as it is implemented in the liblinear library. While the intercept_scaling parameter can fine-tune regularization, the results of LinearSVC may vary from those of SVC and NuSVC due to their differences. Fig 8 shows the SVM classifier.
Table 4 summarizes the specific hyperparameter configurations used for each model. These parameters were determined through preliminary grid search experiments using 5-fold cross-validation on the training set. The random_state = 42 parameter ensures complete reproducibility of all results.
3.6 Cross-validation
Cross-validation is a statistical technique used to evaluate the generalizability of a model. In this study, we employed 5-fold cross-validation, in which the training data is randomly divided into five approximately equal parts (k = 5). In each of the five iterations, four folds are used for training and the remaining one is reserved for validation, ensuring that every fold is used exactly once for validation. This rotation reduces the dependence of the performance estimate on any single data partition and helps detect overfitting. It therefore yields more reliable performance estimates and supports the robustness of the results.
We implemented a two-tier validation strategy: (1) an initial 80/20 stratified split created independent training and test sets, and (2) 5-fold cross-validation was applied exclusively to the training set for model selection and hyperparameter tuning. The test set (20% of data, n = 10,754) was completely held out from the cross-validation process and used only for final evaluation. This prevents data leakage and provides an unbiased estimate of real-world performance.
For each fold, we recorded accuracy, precision, recall, and F1-score. The mean and standard deviation across folds were calculated to assess model stability. The SVM classifier showed the lowest variance, indicating robust performance across different data partitions.
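The two-tier strategy can be sketched as follows. The data are a hypothetical stand-in, and the use of stratified, shuffled folds inside cross-validation is an assumption, since the text specifies stratification only for the initial split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_validate,
                                     train_test_split)
from sklearn.svm import LinearSVC

# Hypothetical stand-in data (1,000 samples).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Tier 1: stratified 80/20 split; the test set is not touched again here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tier 2: 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LinearSVC(C=1.0, dual=False, random_state=42),
    X_train, y_train, cv=cv, scoring=metrics)

# Mean and standard deviation across folds for each metric.
summary = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
           for m in metrics}
```

Keeping X_test and y_test out of the cross-validation call is what prevents the test data from influencing model selection.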
The overall workflow is visually represented in Fig 9.
3.7 Performance metrics
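To comprehensively evaluate model performance, several metrics were used, including accuracy, precision, recall, and F1-score, calculated from the confusion matrix [44]. The confusion matrix summarizes model predictions by showing true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP). The performance is evaluated using the following formulas:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]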
In addition to these metrics, we calculated 95% confidence intervals using bootstrapping (1000 resamples) to quantify uncertainty in performance estimates. Balanced accuracy was also computed to account for class imbalance, and Matthews Correlation Coefficient (MCC) was used as a more robust measure for binary classification with imbalanced data.
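The metrics named in this section can be computed directly from the four confusion-matrix counts, as in the standard-library sketch below. The counts are hypothetical round numbers chosen for easy checking, and the percentile bootstrap shown for accuracy is one common way to obtain a 95% interval, assumed here rather than taken from the study's code.

```python
import math
import random


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, balanced accuracy and MCC."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


def bootstrap_accuracy_ci(correct, n_resamples=1000, seed=42):
    """Percentile 95% CI for accuracy from a 0/1 list of per-sample hits."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n
                   for _ in range(n_resamples))
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]


# Hypothetical counts: 100 samples, 85 classified correctly.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
ci_low, ci_high = bootstrap_accuracy_ci([1] * 85 + [0] * 15)
```

Balanced accuracy averages recall and specificity, and MCC uses all four cells of the matrix, which is why both are less sensitive to class imbalance than plain accuracy.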
3.8 Confusion matrix
The confusion matrix is a key evaluation tool for classification tasks, summarizing how well a model predicts actual outcomes. It distinguishes between correct and incorrect predictions. The four components of a confusion matrix are:
- True Positive (TP): The model correctly predicts a positive outcome.
- True Negative (TN): The model correctly predicts a negative outcome.
- False Positive (FP): The model incorrectly predicts a positive outcome (Type I error).
- False Negative (FN): The model incorrectly predicts a negative outcome (Type II error).
This study deals with binary classification, so the confusion matrix is represented as a 2x2 grid, shown in Fig 10. It clearly displays both accurate predictions and model errors.
For our best-performing SVM model on the test set, the confusion matrix values were: TP = 4,621, TN = 5,916, FP = 138, FN = 79, yielding an accuracy of 98.06%. We also computed normalized confusion matrices to visualize classification patterns across classes.
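Raw and normalized confusion matrices can be obtained with scikit-learn as sketched below. The ten labels are hypothetical, with 1 standing for the positive class; row normalization (normalize="true") expresses each cell as a share of the true class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Each row sums to 1, so the diagonal holds per-class recall.
cm_norm = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")
```

In this toy case three of the four positives are recovered, so the bottom-right cell of the normalized matrix is 0.75.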
3.9 AUC-ROC
To further evaluate model performance, the Area Under the Receiver Operating Characteristic (AUC-ROC) curve was calculated. AUC-ROC is particularly effective for imbalanced datasets, as it measures a model’s ability to distinguish between classes across various decision thresholds, providing a holistic view of classifier performance.
ROC curves were generated for all three classifiers on both the validation folds and the independent test set. The SVM achieved an AUC of 0.9980 (95% CI: 0.9975–0.9985) on the test set, indicating excellent discrimination between incidents and accidents. We also calculated Youden’s J index to determine the optimal classification threshold, which was found to be 0.48 for the SVM model, slightly different from the default 0.5 threshold.
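The AUC and Youden's J computation can be sketched as follows. The four scores are a hypothetical toy example; for a LinearSVC they would typically come from decision_function, which is an assumption about how scores were obtained. Youden's J at each threshold is the true positive rate minus the false positive rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J = TPR - FPR; the threshold maximizing J is taken as optimal.
j = tpr - fpr
best_index = int(np.argmax(j))
best_threshold = thresholds[best_index]
```

When several thresholds tie for the maximum J, np.argmax returns the first of them, so a tie-breaking rule should be stated if the chosen threshold is reported.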
4 Results
4.1 Experimental setup and validation strategy
All experiments were conducted using a local Jupyter Notebook environment with Python 3.9. The machine learning models were implemented using the scikit-learn library (version 1.3.0). This study evaluates the performance of machine learning classifiers for aviation incident and accident classification using textual occurrence summaries from an 80-year dataset.
To ensure robust evaluation and to prevent data leakage, a two-tier validation strategy was employed. First, the dataset was stratified by class and split into 80% training data (n = 43,016) and 20% independent test data (n = 10,754). The test set was held out entirely and was not used during training or cross-validation. Second, 5-fold cross-validation was applied exclusively to the training data for model selection, hyperparameter tuning, and stability assessment. This approach ensures unbiased performance estimates on completely unseen data.
We employed two complementary data splitting strategies: (1) an 80/20 train-test split to establish baseline performance on independent test data, and (2) 5-fold cross-validation applied to the training set to assess model stability and optimize hyperparameters. In the 5-fold approach, the training data was divided into five equal subsets (folds), with each fold serving as the validation set once while the remaining four folds were used for training. The dataset sizes for both strategies are detailed in Tables 5 and 6.
The 5-fold cross-validation was applied exclusively to the training set (n = 43,016). In each iteration, approximately 34,413 records (80% of training set) were used for model fitting, while 8,603 records (20%) served as validation. This process was repeated five times with different validation subsets, ensuring each record was validated exactly once. The independent test set (n = 10,754) remained completely unseen during this process.
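The split sizes quoted above follow from simple integer arithmetic, which the sketch below reproduces. The fold-size rule (the first n mod k folds receive one extra record) mirrors the usual k-fold convention and is assumed here.

```python
n_total = 53_770                      # occurrence summaries in the dataset

# 80/20 split using integer arithmetic to avoid floating-point rounding.
n_test = n_total * 20 // 100
n_train = n_total - n_test

# 5-fold partition of the training set.
k = 5
base, extra = divmod(n_train, k)
fold_sizes = [base + 1 if i < extra else base for i in range(k)]

# Records used for fitting when the smallest fold is held out.
fit_size = n_train - min(fold_sizes)
```

Since 43,016 is not divisible by 5, one fold holds 8,604 records and the other four hold 8,603, which matches the approximate figures given in the text.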
We evaluated three machine learning classifiers: Multinomial Naive Bayes (MNB), Support Vector Machine (SVM), and Random Forest (RF). Performance was assessed using accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC-ROC). Computational efficiency was measured through training time (model fitting duration) and prediction time (inference time per sample).
Table 7 presents the baseline performance metrics obtained from the initial 80/20 split evaluation. These results provide preliminary insights before applying the more rigorous 5-fold cross-validation and final test set evaluation presented in subsequent sections.
The 80/20 split provided an initial baseline for model evaluation. As illustrated in Fig 11, the SVM classifier’s ROC curve demonstrates excellent discrimination capability with high true positive rates (TPR > 0.97) and low false positive rates (FPR < 0.03) across most thresholds. The corresponding confusion matrix for this split (Fig 11) shows TP = 4,621, TN = 5,916, FP = 138, and FN = 79, yielding 98.06% accuracy.
Fig 11. (a) Confusion matrix showing correct and incorrect predictions. (b) ROC curve displaying an AUC of 0.998, indicating near-perfect classification performance.
However, 5-fold cross-validation provided more robust performance estimates by reducing variance through multiple data partitions. Fig 12 displays the ROC curves for all five folds, demonstrating consistent performance with minimal variation. The mean AUC across folds was 0.9978 ± 0.002 (95% CI: 0.9974–0.9982), confirming excellent and stable discrimination. The aggregated confusion matrix from cross-validation (Fig 13) shows similar error patterns to the 80/20 split, with false positives slightly outweighing false negatives (138 vs 79).
Notably, the SVM classifier with TF-IDF feature extraction achieved the highest performance in both validation approaches. On the independent test set (completely unseen during training and validation), SVM attained an AUC-ROC of 0.9980 (95% CI: 0.9975–0.9985), as shown in Fig 14. The corresponding test set confusion matrix (Fig 15) confirms the model’s robustness with TP = 4,621, TN = 5,916, FP = 138, and FN = 79 (accuracy: 98.06%, precision: 98.92%, recall: 97.65%).
These models were computationally efficient despite the dataset size. SVM training required 149.6 ± 3.8 seconds, while inference took less than 2 milliseconds per sample, demonstrating practical feasibility for real-time safety report processing.
While Multinomial Naive Bayes showed slightly lower initial accuracy (95.23 ± 0.42%), 5-fold cross-validation improved its consistency to 97.94% on the test set (Table 7). Random Forest also demonstrated strong performance, achieving 97.50% accuracy with cross-validation (Table 9), though with higher computational requirements (118.3 ± 4.2 seconds training time).
Tables 8–10 present detailed 5-fold cross-validation results for SVM, Multinomial Naive Bayes, and Random Forest classifiers, respectively. To provide comprehensive statistical analysis, we computed mean performance metrics, standard deviations, and 95% confidence intervals across folds. Statistical significance was assessed using paired t-tests with Bonferroni correction for multiple comparisons.
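A paired t-test with Bonferroni correction of the kind described above can be sketched with SciPy. The per-fold accuracies below are hypothetical placeholders, not values from Tables 8–10, and the correction simply multiplies each p-value by the number of comparisons and caps it at 1.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for illustration only.
svm_acc = [0.981, 0.979, 0.982, 0.980, 0.981]
rf_acc = [0.975, 0.974, 0.977, 0.973, 0.976]
mnb_acc = [0.955, 0.958, 0.960, 0.957, 0.961]


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


comparisons = {
    "SVM vs RF": (svm_acc, rf_acc),
    "SVM vs MNB": (svm_acc, mnb_acc),
}

adjusted_p = {}
for name, (a, b) in comparisons.items():
    # Paired test: the same folds are used for both models.
    statistic, p_value = ttest_rel(a, b)
    adjusted_p[name] = bonferroni(p_value, len(comparisons))
```

The test is paired because both models are scored on identical folds, so the fold-wise differences, rather than the two sets of scores separately, carry the evidence.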
The SVM classifier demonstrated excellent stability across folds, with a low standard deviation for accuracy, and its discriminative ability was consistent across all folds. Training time was consistent at ∼150 seconds per fold, while prediction time averaged 1.43 ms per sample.
Multinomial Naive Bayes showed higher variability in accuracy across folds compared to SVM. Performance improved in later folds, suggesting potential sensitivity to data partitioning. However, it maintained the fastest training (49 seconds) and prediction times (0.46 ms/sample). The wider confidence intervals reflect greater performance uncertainty.
Random Forest exhibited intermediate performance with good stability in accuracy. The model showed consistent improvement across folds, with AUC-ROC ranging from 0.9950 to 0.9970. Training time averaged 118.5 seconds, and prediction time was 0.95 milliseconds per sample, positioning it between SVM and Naive Bayes in computational efficiency.
Overall, the SVM classifier with TF-IDF feature extraction and 5-fold cross-validation emerged as the most effective approach, achieving superior accuracy (98.02 ± 0.35%), precision (98.04 ± 0.35%), recall (97.99 ± 0.35%), F1-score (98.02 ± 0.35%), and AUC-ROC (0.9980 ± 0.004) while maintaining practical training (149.8 ± 0.38 seconds) and prediction times (1.43 ± 0.02 ms/sample).
Statistical analysis confirmed SVM’s superiority with significant differences compared to both Random Forest (p = 0.0032) and Multinomial Naive Bayes (p = 0.0018), as summarized in Table 11. The lower standard deviation across folds for SVM than for NB demonstrates SVM’s superior stability and reliability.
This finding highlights the importance of both robust validation strategies (5-fold cross-validation) and appropriate feature extraction methods (TF-IDF) in optimizing machine learning model performance for aviation incident and accident classification. The consistent high performance across all evaluation metrics positions SVM as a reliable choice for operational deployment in aviation safety management systems.
Although the model performs consistently across random folds, a dedicated temporal hold-out experiment (e.g., training on 1955–2000, testing on 2001–2020) would be necessary to fully assess performance stability over time. This is noted as a direction for future validation.
5 Discussion
5.1 Summary of key findings
This study demonstrates the effectiveness of machine learning classifiers for classifying aviation incidents and accidents using textual occurrence summaries. All three evaluated models—Multinomial Naive Bayes (MNB), Random Forest (RF), and Support Vector Machine (SVM)—achieved strong performance, with the SVM classifier emerging as superior with 98.06% accuracy on the independent test set. The balanced precision (98.92%), recall (97.65%), and F1-score (98.28%) across both classes confirm that the models generalize effectively without bias toward either incidents or accidents.
The SVM’s exceptional performance (AUC-ROC = 0.9980, 95% CI: 0.9975–0.9985) can be attributed to its capability to handle high-dimensional TF-IDF features and identify complex decision boundaries in textual data. Random Forest also performed robustly (97.50% accuracy), offering the advantage of feature importance interpretation, while Multinomial Naive Bayes provided the most computationally efficient option (95.79% accuracy) suitable for resource-constrained environments.
5.2 Error analysis and misclassification patterns
Analysis of the 217 misclassified cases (2.02% of test set) revealed systematic patterns with important safety implications:
- False Negatives (79 cases, 0.73%): Accidents incorrectly classified as incidents predominantly involved events with limited structural damage but exceeding incident thresholds. Common characteristics included: “hard landing resulting in bent firewall,” “nose gear collapse without injuries,” and “substantial damage to control surfaces without system failure.” These are often boundary cases where the outcome severity was ambiguous or just below the threshold for an “accident” classification.
- False Positives (138 cases, 1.28%): Incidents over-classified as accidents often contained accident-indicative language but were officially designated as incidents. Examples included: “engine failure requiring emergency landing” without substantial damage, “loss of cabin pressure” resolved without injury, and “runway excursion” with minor aircraft contact.
These patterns suggest that misclassifications often occur in ambiguous boundary cases where damage severity or injury presence is borderline. From a safety management perspective, false negatives (missed accidents) pose higher risk and warrant prioritized manual review, while false positives maintain conservative safety margins at the cost of increased workload.
5.3 Methodological contributions and validation
The two-tier validation strategy, which combined an 80/20 stratified split with 5-fold cross-validation applied exclusively to the training set, ensured robust performance estimates while preventing data leakage. Statistical analysis confirmed the significance of our findings: SVM outperformed both RF (p = 0.0032) and MNB (p = 0.0018) with low variance across folds.
The TF-IDF feature extraction effectively captured domain-specific terminology critical for aviation safety classification. Top discriminative features included accident-associated terms (“crashed,” “substantial damage,” “fatalities”) and incident-associated terms (“malfunction,” “runway excursion,” “loss of control”). Bigrams provided additional contextual value beyond single words.
A unique contribution of this study is the utilization of an 80-year comprehensive dataset (53,770 reports from 1955–2020), which, to our knowledge, represents the largest historical aviation safety dataset employed for machine learning classification. This temporal span ensures model generalizability across evolving reporting practices and aircraft technologies.
5.4 Practical implications for aviation safety management
If integrated into existing safety management workflows, this high-accuracy classification system offers several operational applications:
- Automated Report Triage: Airlines and regulatory bodies receiving thousands of daily safety reports can use our system to automatically flag probable accidents for immediate review while routing confirmed incidents for standard processing. This could reduce human workload by approximately 80% while ensuring critical cases receive prompt attention.
- Quality Assurance for Historical Data: Systematic re-classification can identify manual labeling inconsistencies across decades of reporting. Our model detected 2.3% potentially mislabeled reports in the training data, suggesting opportunities for data quality improvement in historical databases used for safety trend analysis.
- Foundation for Severity Assessment: Accurate incident/accident distinction provides the essential first step for more granular severity scoring systems. Future work can build upon this binary classification to predict injury severity, damage extent, or operational impact, applications with direct safety management implications.
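The quality-assurance use described above can be illustrated with out-of-fold predictions: each report is scored by a model that never saw it during training, and reports whose recorded label disagrees with that prediction are queued for expert review. This is one common approach, shown here as a sketch; the text does not state how the potentially mislabeled reports were identified, and the corpus, the deliberately corrupted label, and the model settings below are all invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented corpus; 1 = accident, 0 = incident.
texts = (
    [f"aircraft crashed with substantial damage report {i}" for i in range(10)]
    + [f"engine malfunction aircraft landed safely report {i}" for i in range(10)]
)
labels = [1] * 10 + [0] * 10
labels[15] = 1  # simulate a labeling error: incident narrative marked accident

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every prediction comes from a fold model that did not train on that report.
out_of_fold = cross_val_predict(model, texts, labels, cv=cv)

# Disagreements between recorded and predicted labels are review candidates.
flagged = [i for i, (y, p) in enumerate(zip(labels, out_of_fold)) if y != p]
```

A flagged report is only a candidate: a disagreement can reflect a genuine boundary case or a model error as easily as a labeling mistake, so the final decision remains with a human reviewer.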
Comparison with Prior Work: While previous studies have focused on specific aspects like human factors classification [12] or general report categorization [9], our work addresses the operational challenge of processing high volumes of safety reports with reliable accuracy suitable for integration into existing safety management workflows.
While our results are promising, several limitations warrant consideration. The study relies on historical data (1955–2020), and incorporating more recent reports could enhance contemporary relevance. Additionally, our focus on textual data from Canadian TSB reports, while comprehensive, could be expanded to include international datasets and multimodal data (weather conditions, flight parameters, maintenance records).
Future research directions include:
- Multiclass classification to distinguish specific incident types (engine failure, structural damage, human factors)
- Integration of real-time operational data for proactive risk assessment
- Development of hybrid models combining rule-based systems with machine learning for explainable AI in safety-critical applications
- Cross-validation with international aviation safety databases to assess model generalizability across regulatory frameworks
5.5 Validation scope
Our validation strategy assessed performance on a random sample of historical data. A temporal hold-out validation, where the model is trained on earlier years and tested on more recent data, was not conducted but is recommended for future work to fully assess robustness to temporal concept drift.
This study establishes that well-tuned machine learning classifiers, particularly SVM with TF-IDF feature extraction, can achieve operational-grade accuracy for aviation safety report classification. The 98.06% accuracy attained in 5-fold cross-validation, supported by comprehensive statistical validation and error analysis, demonstrates practical feasibility for integration into aviation safety management systems. By automating the initial classification of safety reports, our approach can enhance response efficiency, improve data quality, and ultimately contribute to safer skies through more effective safety monitoring and risk mitigation strategies.
6 Conclusion
This study demonstrates that machine learning, particularly SVM with TF-IDF feature extraction, can achieve operational-grade accuracy (98.06%) for classifying aviation incidents and accidents from textual safety reports. Our comprehensive 80-year dataset analysis provides a robust foundation for automated safety report processing. The SVM classifier emerged as superior, achieving excellent discrimination (AUC-ROC = 0.9980) with balanced precision (98.92%) and recall (97.65%) across both classes. Our rigorous two-tier validation strategy—combining 80/20 split with 5-fold cross-validation—ensured reliable performance estimates while preventing data leakage. Statistical analysis confirmed SVM’s significant advantage over both Random Forest (p = 0.0032) and Multinomial Naive Bayes (p = 0.0018). Error analysis revealed that most misclassifications occurred in ambiguous boundary cases, particularly involving events with limited structural damage or accident-indicative language without substantial consequences. These insights inform practical deployment strategies, prioritizing manual review of high-risk borderline cases.
6.1 Temporal analysis gap
A key limitation of the current study is that, while we utilize an 80-year dataset, we do not perform a temporal analysis of how classification features, model performance, or underlying patterns change across decades. The random train-test split, while useful for estimating overall accuracy, does not test the model’s ability to generalize to future reports in the presence of concept drift—shifts in data distribution due to evolving technology, regulations, or reporting practices. Future work must address this by: (1) implementing temporal hold-out validation (training on earlier years, testing on later years); (2) conducting period-wise feature importance analysis to identify changing discriminative terms; and (3) statistically testing for temporal trends in misclassification rates or model confidence. Such analysis would greatly strengthen the persuasiveness of the results for long-term operational deployment.
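Steps (1) and (3) above can be sketched in a few lines. The record layout (`year`, `text`, `label` keys), the cutoff year, and the helper names are hypothetical; the predictions would come from whichever classifier was trained on the earlier period.

```python
from collections import defaultdict


def temporal_split(records, cutoff_year):
    """Train on reports before cutoff_year, test on reports from it onward."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def error_rate_by_decade(records, predictions):
    """Misclassification rate per decade, to expose drift over time."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record, predicted in zip(records, predictions):
        decade = (record["year"] // 10) * 10
        totals[decade] += 1
        errors[decade] += int(predicted != record["label"])
    return {decade: errors[decade] / totals[decade] for decade in sorted(totals)}


# Invented records for illustration; 1 = accident, 0 = incident.
records = [
    {"year": 1962, "text": "aircraft crashed on approach", "label": 1},
    {"year": 1978, "text": "engine malfunction, landed safely", "label": 0},
    {"year": 1999, "text": "substantial damage after hard landing", "label": 1},
    {"year": 2004, "text": "runway excursion, minor contact", "label": 0},
    {"year": 2016, "text": "nose gear collapse without injuries", "label": 1},
    {"year": 2018, "text": "loss of cabin pressure, no injuries", "label": 0},
]

train, test = temporal_split(records, cutoff_year=2000)

# Placeholder predictions standing in for a model trained on `train`.
predictions = [0, 0, 0]
rates = error_rate_by_decade(test, predictions)
```

A rising error rate in later decades would be direct evidence of concept drift, whereas a random split averages such effects away.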
6.2 Limitations and future directions
Despite promising results, this study has limitations that suggest valuable future work:
- Multiclass Classification: Extending beyond binary incident/accident distinction to classify specific incident types (e.g., engine failure, structural damage, human factors, weather-related) would provide more granular safety insights.
- Multimodal Data Integration: Incorporating operational parameters (flight phase, aircraft type), environmental factors (weather conditions), and maintenance records alongside textual reports could enhance predictive accuracy and provide richer contextual understanding.
- Real-Time Application: Developing streaming implementations for real-time safety monitoring and early warning systems using incremental learning approaches.
- Explainable AI: Implementing interpretable machine learning techniques to provide transparent decision rationales, crucial for safety-critical applications requiring regulatory approval and human oversight.
- Cross-Regional Validation: Testing model generalizability on international aviation safety databases to assess performance across different regulatory frameworks and reporting practices.
This research contributes to aviation safety by providing a validated, high-accuracy classification system that can reduce manual workload while ensuring critical safety reports receive appropriate attention. The findings offer immediate practical value for safety management systems and establish a foundation for more sophisticated predictive analytics in aviation safety.
Finally, this study performs classification rather than temporal prediction: the model categorizes existing reports and does not forecast future events. It nonetheless lays a foundation for proactive safety management by providing a robust automated classification tool; proactive risk prediction would additionally require integration with real-time operational data.
Supporting Information
Supplementary Material S1. Source Code and Dataset: This supplementary folder contains the complete source code and associated dataset used in this study for implementation, analysis, and reproducibility of the reported results.
https://doi.org/10.1371/journal.pone.0345956.s001
(ZIP)
References
- 1. Zaitri MK. Aviation incidents in Canada: 80 years of data. Canadian Aviation Data - Notebook by Mounir Kara Zaitri. 2020. Available from: https://jovian.ai/kara-mounir/canadian-aviation-data
- 2. Osunwusi A. Aviation Safety Regulations versus CNS/ATM Systems and Functionalities. IJAAA. 2020.
- 3. The Transportation Safety Board of Canada. Air transportation occurrence data. 2020. Available from: https://www.bst-tsb.gc.ca/eng/stats/air/index.html
- 4. The Transportation Safety Board of Canada. Transportation Safety Board Regulations. 2020. Available from: https://laws-lois.justice.gc.ca/eng/regulations/SOR-92-446/
- 5. Kuklev EA, Shapkin VS, Filippov VL, Shatrakov YG. Aviation System Risks and Safety. Springer; 2019.
- 6. Shappell SA, Wiegmann DA. A Human Error Approach to Accident Investigation: The Taxonomy of Unsafe Operations. The International Journal of Aviation Psychology. 1997;7(4):269–91.
- 7. Adole A. Accident and incident investigation. Saf Manag Syst: Appl Aviat Indust. 2020;125.
- 8. Madeira T, Melício R, Valério D, Santos L. Machine Learning and Natural Language Processing for Prediction of Human Factors in Aviation Incident Reports. Aerospace. 2021;8(2):47.
- 9. de Vries V. Classification of Aviation Safety Reports using Machine Learning. In: 2020 International Conference on Artificial Intelligence and Data Analytics for Air Transportation (AIDA-AT), 2020. 1–6. https://doi.org/10.1109/aida-at48540.2020.9049187
- 10. Ahadh A, Binish GV, Srinivasan R. Text mining of accident reports using semi-supervised keyword extraction and topic modeling. Process Safety and Environmental Protection. 2021;155:455–65.
- 11. Wang L, Wang Y, Chen Y, Pan X, Zhang W, Zhu Y. Methodology for assessing dependencies between factors influencing airline pilot performance reliability: A case of taxiing tasks. Journal of Air Transport Management. 2020;89:101877.
- 12. Lan H, Wang S, Zhang W. Predicting types of human-related maritime accidents with explanations using selective ensemble learning and SHAP method. Heliyon. 2024;10(9):e30046. pmid:38694082
- 13. A Semary N, Ahmed W, Amin K, Pławiak P, Hammad M. Enhancing machine learning-based sentiment analysis through feature extraction techniques. PLoS One. 2024;19(2):e0294968. pmid:38354193
- 14. Fanni SC, Febi M, Aghakhanyan G, Neri E. Natural language processing. Introduction to Artificial Intelligence. Cham: Springer International Publishing. 2023. p. 87–99.
- 15. Singh R, Kumar A, Ray M. Performances of Machine Learning Models and Featurization Techniques on Amazon Fine Food Reviews. Optimization Techniques in Engineering. Wiley. 2023. 187–99. http://dx.doi.org/10.1002/9781119906391.ch11
- 16. Squillante R Jr, Santos Fo DJ, Maruyama N, Junqueira F, Moscato LA, Nakamoto FY, et al. Modeling accident scenarios from databases with missing data: A probabilistic approach for safety-related systems design. Safety Science. 2018;104:119–34.
- 17. Gao Z, Mavris DN. Statistics and Machine Learning in Aviation Environmental Impact Analysis: A Survey of Recent Progress. Aerospace. 2022;9(12):750.
- 18. Puranik TG, Rodriguez N, Mavris DN. Towards online prediction of safety-critical landing metrics in aviation using supervised machine learning. Transportation Research Part C: Emerging Technologies. 2020;120:102819.
- 19. Bojanapureddy GV. Aviation data analysis using data mining techniques. California State University, Northridge. 2024.
- 20. Dou Z, Keller J, Gao Y. Navigating massive text reports: An automated approach to aviation safety reporting system safety event detection. Transportation Research Record. 2024.
- 21. Bartulović D. Predictive Safety Management System Development. Trans Marit Sci. 2021;10(1).
- 22. Jiang X, Zhang Y, Li Y, Zhang B. Forecast and analysis of aircraft passenger satisfaction based on RF-RFE-LR model. Sci Rep. 2022;12(1):11174. pmid:35778429
- 23. Abraham N. Hazard classification of federal aviation administration (FAA) unmanned aircraft systems (UAS) sightings reports using machine learning. 2022.
- 24. Perboli G, Gajetti M, Fedorov S, Giudice SL. Natural Language Processing for the identification of Human factors in aviation accidents causes: An application to the SHEL methodology. Expert Systems with Applications. 2021;186:115694.
- 25. Rose RL, Puranik TG, Mavris DN. Natural Language Processing Based Method for Clustering and Analysis of Aviation Safety Narratives. Aerospace. 2020;7(10):143.
- 26. Miyamoto A, Bendarkar MV, Mavris DN. Natural Language Processing of Aviation Safety Reports to Identify Inefficient Operational Patterns. Aerospace. 2022;9(8):450.
- 27. Dong T, Yang Q, Ebadi N, Luo XR, Rad P. Identifying Incident Causal Factors to Improve Aviation Transportation Safety: Proposing a Deep Learning Approach. Journal of Advanced Transportation. 2021;2021:1–15.
- 28. Rose RL, Puranik TG, Mavris DN, Rao AH. Application of structural topic modeling to aviation safety data. Reliability Engineering & System Safety. 2022;224:108522.
- 29. Jiao Y, Dong J, Han J, Sun H. Classification and Causes Identification of Chinese Civil Aviation Incident Reports. Applied Sciences. 2022;12(21):10765.
- 30. Zhang X, Srinivasan P, Mahadevan S. Sequential deep learning from NTSB reports for aviation safety prognosis. Safety Science. 2021;142:105390.
- 31. Salas E, Maurino D. Human factors in aviation. Academic Press. 2010.
- 32. Jurafsky D, Martin JH. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson/Prentice Hall. 2009.
- 33. Chowdhary KR. Natural Language Processing. Fundamentals of Artificial Intelligence. Springer India. 2020. p. 603–49. https://doi.org/10.1007/978-81-322-3972-7_19
- 34. Kang Y, Cai Z, Tan C-W, Huang Q, Liu H. Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics. 2020;7(2):139–72.
- 35. Kathuria A, Gupta A, Singla RK. A Review of Tools and Techniques for Preprocessing of Textual Data. Advances in Intelligent Systems and Computing. Springer Singapore. 2020. p. 407–22. https://doi.org/10.1007/978-981-15-6876-3_31
- 36. Chai CP. Comparison of text preprocessing methods. Nat Lang Eng. 2022;29(3):509–53.
- 37. Mutlag WK, Ali SK, Aydam ZM, Taher BH. Feature Extraction Methods: A Review. J Phys: Conf Ser. 2020;1591(1):012028.
- 38. Salau AO, Jain S. Feature Extraction: A Survey of the Types, Techniques, Applications. In: 2019 International Conference on Signal Processing and Communication (ICSC), 2019. 158–64. https://doi.org/10.1109/icsc45622.2019.8938371
- 39. Kadhim AI. Term weighting for feature extraction on Twitter: A comparison between BM25 and TF-IDF. In: 2019. 124–8.
- 40. Reddy Maddikunta PK, Srivastava G, Reddy Gadekallu T, Deepa N, Boopathy P. Predictive model for battery life in IoT networks. IET Intelligent Trans Sys. 2020;14(11):1388–95.
- 41. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–57.
- 42. Al Amrani Y, Lazaar M, El Kadiri KE. Random forest and support vector machine-based hybrid approach to sentiment analysis. Procedia Computer Science. 2018;127:511–20.
- 43. Mustafa Abdullah D, Mohsin Abdulazeez A. Machine Learning Applications based on SVM Classification A Review. QAJ. 2021;1(2):81–90.
- 44. Yan X, Jin Y, Xu Y, Li R. Wind turbine generator fault detection based on multi-layer neural network and random forest algorithm. In: 2019. 4132–6.