Abstract
Oil and gas pipeline security is critical to national infrastructure, yet existing monitoring systems often lack the sensitivity and real-time responsiveness required to detect subtle intrusion events. This study presents a novel multimodal sensing and interaction framework that integrates phase-sensitive optical time-domain reflectometry (φ-OTDR)–based distributed acoustic sensing (DAS) with an optimized one-dimensional convolutional neural network (1-D CNN) architecture. The approach leverages both raw fiber optic vibration signals and carefully selected handcrafted features, enabling robust automatic intrusion classification across multiple event types including manual tapping, mechanical excavation, and human footsteps. By incorporating transfer learning from publicly available human activity datasets, the model achieves enhanced feature generalization, resulting in a classification accuracy exceeding 95%. This work demonstrates the potential of combining advanced multimodal sensing technologies with deep learning-based interactive analytics for real-time pipeline security monitoring, paving the way for intelligent infrastructure protection systems. Future efforts will focus on expanding dataset diversity, integrating multi-sensor fusion, and enhancing adaptive interaction capabilities for field deployment.
Citation: Qin H, Huang X, Wang X, Zhou Z (2025) Identification and classification of oil and gas pipeline intrusion events based on 1-D CNN network. PLoS One 20(12): e0338205. https://doi.org/10.1371/journal.pone.0338205
Editor: Muhammad Ahsan, Sepuluh Nopember Institute of Technology: Institut Teknologi Sepuluh Nopember, INDONESIA
Received: August 11, 2025; Accepted: November 18, 2025; Published: December 23, 2025
Copyright: © 2025 Qin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This study utilizes two distinct datasets: the first is a publicly available dataset of six basic human activities, obtained from the official ACT dataset repository (Human Activity Recognition Using Smartphones - UCI Machine Learning Repository; stored at https://data.mendeley.com/datasets/n7xwn4rr79/1), while the second comprises oil and gas pipeline intrusion events collected by the author's affiliated company. The dataset has been uploaded to the website (https://data.mendeley.com/datasets/w7nzxs593c/1), DOI 10.17632/w7nzxs593c.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
As the main energy source in several industrial sectors, oil and gas are of great strategic importance, so pipeline safety has become a top priority. Therefore, ensuring the safety of oil and gas pipeline transportation is an essential requirement for the energy industry. The safety issues involved in the transportation of oil and gas, especially the real-time online monitoring and identification of current or potential pipeline sabotage, have become an important goal of safe production [1]. Destructive behaviors such as manual excavation or mechanical excavation are particularly important objects in pipeline safety monitoring, because these activities can easily lead to pipeline damage and oil and gas leakage. Once such a safety accident occurs, it will not only seriously affect local production activities and the daily life of residents, causing major economic losses, but also may cause secondary disasters such as water or air pollution, fire and even explosion, seriously threatening the personal safety and property of relevant personnel. The consequences of such accidents far exceed the cost of preventive maintenance, which highlights the urgency of developing effective preventive measures [2].
Earlier studies mainly used traditional machine learning methods such as linear regression, naive Bayes classifiers, and decision tree algorithms for the preliminary detection and classification of pipeline leakage and corrosion anomalies. These models often rely heavily on features manually defined by experts, so their accuracy is limited. Subsequently, more advanced algorithms were gradually introduced, including support vector machines (SVMs) [3] and Random Forest [4], to achieve more accurate pipeline state assessment and prediction. Later, researchers used recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) [5] to process data sequences of real-time changes in pipes, such as pressure and temperature, to predict potential failures and anomalies earlier. Recent work in natural gas analytics has shown that hybrid or composite pipelines can yield sizable gains: a PCA–CPSO–SVR hybrid reduced error by .6% for multiphase pipeline corrosion-rate prediction (Peng et al.) [6], while an LMD–WTD–LSTM composite achieved excellent performance for daily gas-load forecasting (Peng et al.) [7]. Motivated by these findings, we likewise adopt a hybrid design (feature-level fusion plus a transfer-learned 1-D CNN) to balance robustness and efficiency.
On the basis of summarizing previous studies, this paper proposes a comprehensive framework combining multimodal DAS signals with an optimized 1-D CNN architecture enhanced [8] through transfer learning from human activity recognition datasets. The integration enables robust, scalable, and interactive pipeline intrusion detection, supporting real-time monitoring and response. Our contributions include (1) a multimodal feature extraction and selection pipeline leveraging both raw and handcrafted inputs, (2) a fine-tuned 1-D CNN model architecture tailored for fiber optic vibration signals, and (3) demonstration of transfer learning efficacy in cross-domain intrusion classification [9]. The work lays a foundation for intelligent interactive systems in pipeline security and other critical infrastructure monitoring applications. Finally, the method proposed in this study significantly improves the recognition and classification accuracy, from the .% and .2% of the two-dimensional CNN method [10] to more than %.
We initialize the 1-D CNN from a human activity recognition (HAR) model to reuse low-level temporal primitives such as onset/offset detectors and band-pass/rhythmic patterns. These primitives are sensor-agnostic and provide a good starting point for DAS time series, while the task-specific semantics are learned during fine-tuning. All layers are fine-tuned end-to-end. To reduce the risk of negative transfer, we use a smaller learning rate for transferred layers (LR multiplier 0.1), weight decay, and label smoothing. Training is monitored with early stopping based on validation Macro-F1 and Expected Calibration Error (ECE). These choices encourage the network to retain only generic temporal filters while reorganizing them toward DAS-relevant bands.
We report per-class Precision/Recall/F1 and Macro PR-AUC, together with ECE, to ensure that transfer does not disproportionately affect rare categories or yield over-confident probabilities. The overall improvements relative to training from scratch are therefore interpreted as coming from better low-level initialization rather than from transferring source-domain semantics.
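As a concrete illustration of the calibration metric, ECE bins predictions by confidence and compares each bin's accuracy with its mean confidence. The minimal stdlib sketch below assumes equal-width bins over top-1 confidences; it is illustrative, not the exact monitoring code used during training.

```python
def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE = sum over bins b of (|B_b|/N) * |accuracy(B_b) - mean_confidence(B_b)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confidences, predictions, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # assign to an equal-width confidence bin
        bins[idx].append((c, p == y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(hit for _, hit in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated, fully correct model yields an ECE of 0; an over-confident wrong prediction contributes its full confidence gap.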
This paper is organized in the following manner: The first section provides an introduction. Section 2 describes the technologies and methods utilized. Section 3 validates the effectiveness of the proposed 1-D CNN approach through experiments conducted on two publicly available datasets containing simple human activity data. Section 4 evaluates the performance of the method on oil and gas pipeline intrusion event datasets, comparing the results against several alternative approaches, thus confirming the superiority of the improved 1-D CNN method in pipeline intrusion event detection. Section 5 presents the experimental conclusions.
The overall experimental workflow is illustrated in Fig 1:
First, the experimental platform or algorithm framework (green box) is constructed to provide the infrastructure for subsequent data processing and analysis. This is followed by data preprocessing (orange box), which consists of three key steps: data acquisition, data preprocessing, and data visualization. In the data preprocessing stage, the original data is cleaned, denoised, and standardized to make the subsequent analysis and feature extraction process more effective [11]. Finally, the preprocessed data is displayed visually by means of data visualization, which helps the model to understand the data features more deeply.
During offline training, two datasets are involved: the "six basic human activities" dataset and the "pipeline intrusion event" dataset. After data preprocessing, the recognition and six-class classification algorithm is applied (blue box). This involves feature extraction (drawing useful features from the visualized data), label classification (labeling the data features to prepare for classification), and finally recognition and six-class classification, which classifies the human activity dataset into six activity categories based on the extracted and labeled features [12].
Once this algorithmic process is complete, transfer learning and feature optimization are performed (red box). Previously trained models or features are used for transfer learning [13] to facilitate rapid training and optimization on the new pipeline intrusion event dataset. Lastly, a one-dimensional convolutional neural network (1-D CNN) is used specifically to process the pipeline intrusion event dataset, automatically extracting effective features. The extracted features are subsequently provided to a classification model to reliably detect and identify intrusion incidents within the pipeline. In the final stage, the classifier identifies the events based on the extracted features, producing the classification results that determine the exact type of intrusion event.
2. Techniques and methods used
We extracted a total of 43 handcrafted features from each 1-second signal segment, comprising the following components. Time-domain features (20 dimensions): mean, standard deviation, skewness, kurtosis, root mean square (RMS), crest factor, peak-to-peak value, kurtosis coefficient, among others. Frequency-domain features (15 dimensions): spectral centroid, spectral bandwidth, spectral roll-off, spectral entropy, dominant frequency, second-order central moment, etc. Wavelet packet energy features (8 dimensions): energy of sub-band signals decomposed from levels 1–4 using the db4 wavelet packet. In the handcrafted feature experiments, these 43-dimensional statistical and frequency-domain features were standardized and LASSO regression was employed to select the most significant features, which were then used as input for traditional classifiers such as Support Vector Machines (SVM) and Random Forest (RF). In contrast, the 1-D CNN is capable of automatically learning features directly from raw signals, eliminating the need for manual feature engineering. Through end-to-end training, it optimizes convolutional kernels and activation functions to achieve more robust and expressive representations.
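For illustration, several of the time-domain descriptors listed above can be computed directly from a raw segment. The sketch below covers only a small subset (mean, standard deviation, RMS, peak-to-peak, crest factor, kurtosis) and is not the authors' exact 43-feature extractor.

```python
import math

def time_domain_features(x):
    """Compute a small, illustrative subset of the time-domain descriptors
    mentioned above for one signal segment x (a list of floats)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n          # population variance
    std = math.sqrt(var)
    rms = math.sqrt(sum(v * v for v in x) / n)          # root mean square
    peak_to_peak = max(x) - min(x)
    crest = max(abs(v) for v in x) / rms if rms else 0.0  # peak / RMS
    kurt = (sum((v - mean) ** 4 for v in x) / n) / (var ** 2) if var else 0.0
    return {"mean": mean, "std": std, "rms": rms,
            "peak_to_peak": peak_to_peak, "crest_factor": crest,
            "kurtosis": kurt}
```

Frequency-domain and wavelet packet features would follow the same per-segment pattern, typically via an FFT and a wavelet library, respectively.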
2.1. Phase-sensitive optical time domain reflectometry (φ-OTDR)
The working principle of φ-OTDR [14] is shown in Fig 2. A probing optical pulse is launched into the sensor fiber, and then the backscattered Rayleigh light generated during the propagation of this pulse within the sensing fiber is measured.
When external disturbances (such as vibration or intrusion) act on the sensing fiber, the characteristics of the optical signal transmitted inside the fiber will change. The first half of the diagram shows the structure of the fiber in a simplified manner, clearly showing how the disturbance affects the propagation of the optical signal. The diagram below visually shows how the optical power signal in the fiber varies with length before and after the disturbance. The red curve represents the normal signal without disturbance, and the blue curve represents the signal state after disturbance. The difference between these two curves forms the difference curve (black), and by precisely analyzing the position and amplitude of this difference curve, the exact location and severity of the disturbance can be effectively determined.
The principle of scattered light is illustrated in Fig 3 below. To enhance sensitivity to vibrations, φ-OTDR utilizes a highly coherent light source to reinforce the interference among backscattered Rayleigh light signals, thus increasing sensitivity to phase changes. Ideally, when the sensing fiber is undisturbed, the Rayleigh scattering waveform remains constant. However, when strain or vibration occurs on the sensing fiber, changes in fiber length and refractive index at the disturbed location cause phase variations in the Rayleigh scattered waves, resulting in fluctuations in the Rayleigh scattering waveform at that location. By comparing the Rayleigh backscattering signals obtained prior to and following the disturbance, the difference can be determined, and vibration detection and localization can be achieved [15].
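The difference-curve localization described above reduces to a trace subtraction: given backscatter traces before and after a disturbance, the position of the largest deviation indicates the disturbance location. The sketch below is schematic only; real demodulation also involves phase extraction, averaging, and noise handling.

```python
def locate_disturbance(baseline, disturbed):
    """Subtract the undisturbed Rayleigh backscatter trace from the disturbed
    one and return the index (fiber position bin) and magnitude of the
    largest deviation. Illustrative sketch, not a full demodulator."""
    diff = [abs(a - b) for a, b in zip(disturbed, baseline)]
    pos = max(range(len(diff)), key=diff.__getitem__)
    return pos, diff[pos]
```

Each index corresponds to a position along the fiber, so the returned index maps directly to a distance once the pulse timing and refractive index are known.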
2.2. One-dimensional convolutional neural network (1-D CNN)
A one-dimensional convolutional neural network (1DCNN) is a neural network designed to process time-series data by applying convolution kernels along the time axis to extract features. The 1DCNN is widely applied in fields such as time-series analysis, speech recognition, and natural language processing [16].
The key component of a 1DCNN is its convolutional layer, which captures local patterns from sequential data by applying one-dimensional convolution filters across the temporal dimension. Fig 4 illustrates a typical two-dimensional convolutional neural network (2DCNN). Unlike 2DCNNs [17], a 1DCNN's convolutional filters move exclusively along a single dimension, making this architecture especially effective for analyzing time-series data.
In one-dimensional convolutional neural networks (CNN-1D), the main purpose of the convolution operation is to learn the best convolution kernel that can minimize the model loss function. The convolution kernel size is usually set according to the requirements of the specific task. For example, when the model input is the data collected by a three-axis acceleration sensor, the convolution kernel size can be set to 9, that is, each convolution kernel covers the data of 9 consecutive time steps.
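The sliding-kernel operation itself reduces to a short loop. A minimal "valid"-mode 1-D convolution (implemented, as in most deep learning frameworks, as cross-correlation) can be sketched as:

```python
def conv1d_valid(x, kernel, bias=0.0):
    """'Valid' 1-D convolution: the kernel slides along the time axis of x,
    producing one output per fully overlapping position."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]
```

With a kernel of size 9, an input of length N yields N − 8 outputs, matching the "9 consecutive time steps" coverage described above.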
In implementing a 1DCNN, different deep learning frameworks may have different data format requirements. For example, Keras requires that the 0th axis of the time series data represents the time steps and the 1st axis represents the data points, whereas in MATLAB the order of these two axes is reversed. However, regardless of the framework, the basic principle of a 1DCNN remains the same.
In the forward computation process of a 1DCNN, the input data first passes through the initial convolutional layer, where the convolution kernel slides along the time axis to extract features. These features can then be further processed by additional convolutional layers or downsampled through pooling layers. Ultimately, the extracted features are utilized for classification or regression tasks via fully connected layers. Fig 5 below shows the 1-D CNN network architecture diagram designed in this paper [18].
First, the input layer receives the raw signal (blue waveform on the left in the figure). Next, the signal passes through the first convolution layer (conv1) and pool1 layer to extract preliminary features. Subsequently, the second and third convolution layers (conv2, conv3) further dig into the deep features in the signal, and then pass through another pooling layer (pool2) to reduce the dimension of the feature map. The resulting feature maps are then flattened into vectors and input into the fully connected layers (fc1, fc2) for high-level feature integration. Finally, the network outputs target classification or prediction results (output layer). The whole network structure is designed to extract local and global features from time domain signals layer by layer to realize efficient recognition and analysis of one-dimensional data.
Our 1-D CNN model is architected to effectively process the multimodal input, employing specialized convolutional kernels designed to capture local temporal dependencies and hierarchical signal patterns within fiber optic data. Successive convolution and pooling layers progressively abstract signal features, while nonlinear activation functions (e.g., ReLU) introduce essential model expressivity. The architecture is optimized through empirical tuning of kernel sizes, network depth, and activation strategies to maximize discriminative power for intrusion event classification.
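One practical check when tuning kernel sizes and depth is to trace how the temporal length shrinks through the conv/pool stack. The helper below applies the standard output-length formula; the layer parameters in the test are illustrative assumptions, not the paper's exact configuration.

```python
def conv_out_len(n, k, stride=1, pad=0):
    """Output length of a 1-D conv/pool layer: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

def stack_shapes(n, layers):
    """Trace the temporal length through a conv/pool stack.
    Each layer spec is ('conv' | 'pool', kernel, stride); padding assumed 0."""
    shapes = [n]
    for _, k, s in layers:
        n = conv_out_len(n, k, stride=s)
        shapes.append(n)
    return shapes
```

For example, a hypothetical 1000-sample input passed through conv(7)-pool(2,2)-conv(7)-conv(7)-pool(2,2) shrinks from 1000 to 242 time steps before flattening into the fully connected layers.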
Convolutional front-ends primarily learn short-range temporal motifs (e.g., transients, envelopes, rhythmic energy) that recur across sensing modalities. Our results are consistent with this view: performance gains are observed without evidence of class-specific degradation, and calibration is explicitly monitored. Thus, HAR pretraining acts as a generic initializer; task-specific meaning is acquired from DAS data during fine-tuning.
Our model consists of the following layers with parameters optimized through grid search:
We also studied the sensitivity of the model to hyperparameter selection and evaluated automated search configurations. The results are shown in Fig 6 below:
Placed together, the two heatmaps show a consistently flat optimum. On the architectural side, varying kernel length (3–11) and channel width (64–256) changes Macro-F1 only marginally (≈ 95.4–95.8%), with a shallow maximum around kernel length 7 and ≥ 96 channels; wider/longer settings yield ≤ 0.4-pp gains, indicating diminishing returns. On the optimization side, learning rate and weight decay form a broad, stable plateau: LR ≈ 1e-4–3e-4 with WD ≈ 3e-5–1e-4 attains ≈ 95.8–95.9%, while very small LR/WD underfit slightly and very large LR (≥ 1e-2) or WD (≥ 1e-3) degrade performance. Together, these trends indicate the model is robust to moderate hyperparameter misspecification; practical defaults like {kernel = 7, channels = 128–192, LR = 3e-4, WD = 1e-4} lie on the plateau and are suitable anchors for lightweight automated search (e.g., coarse grid or random search with early stopping) rather than exhaustive tuning (Table 1).
The table summarizes three searches: (i) kernel length × channels, (ii) learning rate × weight decay (grid means), and (iii) Bayesian optimization (60 trials). Architectural hyperparameters are weakly sensitive (0.9543–0.9585 Macro-F1; spread ≈ 0.42 pp) with a shallow optimum around k = 7 and c = 128–256. Optimization hyperparameters show a broad plateau (0.9430–0.9591; spread ≈ 1.61 pp), peaking near LR = 1e-4–3e-4 and WD = 3e-5–1e-4. Bayesian optimization discovers a setting on the same plateau (k = 7, c = 256, LR = 3e-4, WD = 3e-4, Macro-F1 ≈ 0.9575), very close to the grid maximum (≈ 0.9591), confirming that lightweight automated search suffices to reach near-optimal performance without exhaustive tuning. Based on these results, we adopt kernel = 7, channels = 128–192, LR = 3e-4, WD = 1e-4 as robust defaults and search locally around them (Fig 7).
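The lightweight automated search recommended above can be as simple as random sampling over the plateau region with early stopping on stagnation. A minimal sketch follows; the search space and scoring function are placeholders supplied by the caller, not the actual training pipeline.

```python
import random

def random_search(evaluate, space, n_trials=20, patience=5, seed=0):
    """Random hyperparameter search with early stopping:
    sample configs from `space` (dict of name -> candidate list), score
    them with the caller-supplied `evaluate` (e.g. validation Macro-F1),
    and stop once `patience` consecutive trials fail to improve."""
    rng = random.Random(seed)
    best, best_cfg, stale = -float("inf"), None, 0
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(cfg)
        if score > best:
            best, best_cfg, stale = score, cfg, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_cfg, best
```

Because the Macro-F1 surface is a broad plateau, even a handful of trials typically lands within a fraction of a percentage point of the grid maximum.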
Supplementary notes on the training workflow when using this model:
Compared to 2DCNN and 3DCNN, 1DCNN is more efficient in handling time-series data. 2DCNN is typically used for image data, where the convolution kernel slides in two dimensions, while 3DCNN is used for processing three-dimensional data, such as medical images, where the kernel slides in three dimensions [19].
In general, one-dimensional convolutional neural networks (CNN-1D) are important tools for effectively extracting useful features from time-series data, and they have been applied in many fields. Whether in accelerometer data analysis or in voice and text processing tasks, one-dimensional CNNs demonstrate unique advantages.
2.3. Transfer learning
To enhance cross-domain knowledge transfer and foster interactive multimodal learning, transfer learning is employed by pretraining the 1-D CNN on a publicly available human activity recognition dataset. This approach leverages shared temporal and spatial signal patterns to enable the model to adaptively fuse and interpret multimodal inputs, thereby accelerating training convergence and improving the model's responsiveness and adaptability to diverse pipeline intrusion scenarios within an interactive sensing environment.
The core idea is to apply the knowledge gained from a previous task (called the source task) to another related but different new task (called the target task), thereby facilitating and speeding up the learning process of the target task [20]. The basic assumption of this approach is that there is some similarity between the source task and the target task, such as sharing some similar characteristics or patterns, so that the knowledge obtained from the source task can be used as useful prior information to solve the target task.
Specifically, fine-tuning is the re-adjustment of some or all parameters of the original model using the data of the target task [21].
The UCI HAR dataset, collected from smartphone accelerometers and gyroscopes, captures a rich set of human-induced vibrational and dynamic patterns. While the specific activities (walking, jogging) differ from pipeline intrusions, both domains share underlying low-level signal characteristics, such as transient impulses (from footsteps or taps), periodic oscillations (from jogging or machinery), and varying energy distributions across frequency bands. The primary goal of pretraining on this dataset is not to directly recognize human activities in the pipeline context, but to initialize the 1-D CNN with a strong set of generic feature detectors for time-series data. These detectors, which learn to identify edges, shapes, and rhythmic patterns, are highly transferable. The subsequent fine-tuning stage on pipeline-specific data then specializes these general-purpose feature extractors to the distinct signatures of excavation, tapping, and footsteps on or near the pipeline, as evidenced by the significant performance improvement shown in Table 2.
2.4. Feature-level fusion strategies
We employ a multi-branch, early-stage feature-level fusion strategy. One branch uses the original 1-D vibration sequence as the input to a 1-D convolutional neural network (1-D CNN) for automatic learning of feature representations. The other branch takes 43-dimensional manually engineered features (including time-domain, frequency-domain, and wavelet packet energy features). After standardization and dimensionality reduction (performed through LASSO selection), these features are fed into several fully connected layers to acquire low-dimensional embeddings.
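A minimal sketch of this early, feature-level fusion: each branch's embedding is standardized and the two vectors are concatenated before the classification head. The vector dimensions below are illustrative assumptions, not the actual branch widths.

```python
def early_fusion(cnn_embedding, handcrafted_embedding):
    """Early (feature-level) fusion sketch: z-score-normalize each branch's
    embedding, then concatenate into one vector for the classification head."""
    def zscore(v):
        m = sum(v) / len(v)
        s = (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5 or 1.0  # guard: zero std
        return [(x - m) / s for x in v]
    return zscore(cnn_embedding) + zscore(handcrafted_embedding)
```

Late fusion would instead combine per-branch class probabilities (e.g., a weighted vote), and attention-gated fusion would learn per-dimension mixing weights; the concatenation above is the simplest of the three paradigms compared here.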
The two branches are concatenated prior to the classification head. Additionally, we compare three fusion paradigms: Early (stacking and concatenation), Late (late-stage weighted voting), and Attention-Gated fusion. Fig 8 below shows t-SNE visualizations of class separability before and after fusion:
The “Before Fusion” embedding exhibits substantial inter-class mixing. Clusters of different colors interpenetrate, with poor intra-class compactness and blurred margins, particularly between Class 0 and Class 2. This indicates that a single-branch representation (raw-only or handcrafted-only) is not sufficiently discriminative, forcing the classifier to rely on complex boundaries and increasing the risk of misclassification. The “After Fusion” embedding shows markedly improved separability and compactness: class clusters are more coherent and the overlap area shrinks, yielding wider inter-class margins. Compared to the pre-fusion view, Class 1 transitions from fractured/mixed regions to a more continuous cluster, and the Class 0–Class 2 overlap is largely resolved. This indicates that feature-level fusion (raw DAS + handcrafted) provides complementary information, improves class separability, and reduces the classifier's decision complexity, consistent with the observed gains in Macro-F1 and PR-AUC.
Table 3 below presents a comparison table of the effects of each fusion strategy:
As the table shows, feature-level fusion (original DAS + manual features) significantly improves the discriminative metrics and achieves a more balanced deployment configuration in terms of accuracy, latency, and calibration.
3. Experimental research
The experimental framework involves collecting multimodal data from fiber optic DAS sensors capturing vibrational disturbances alongside manually extracted handcrafted features. The data undergoes systematic preprocessing including denoising, normalization, and temporal segmentation to facilitate effective feature fusion. The integrated multimodal feature set forms the input to the 1-D CNN classifier.
Initial training is conducted on a public human activity recognition dataset to develop foundational feature representations, establishing an interactive knowledge base. Subsequently, transfer learning fine-tunes the network using pipeline intrusion event data collected in controlled environments. The interactive training process enables the model to adaptively learn the complex multimodal signal patterns associated with different intrusion types, demonstrating stable convergence and high accuracy in both validation and testing phases.
Classification results highlight the efficacy of multimodal fusion and the interactive learning framework, with improved discrimination especially in classes exhibiting overlapping signal characteristics, underscoring the value of the multimodal and transfer learning approach for real-world pipeline security applications.
3.1. Data sources
3.1.1. Public dataset.
In order to obtain the dataset required for this supervised learning project, we invited multiple participants to carry Android smartphones during a variety of daily activities. Given that the experiment involved real people and possible safety risks (such as participants falling while jogging or going up and down stairs), we obtained prior approval from Fordham University's Institutional Review Board (IRB).
After receiving ethical approval, we recruited 29 volunteers to participate in this experiment. Each participant placed an Android smartphone in the front pocket of their pants and performed specific actions such as walking, jogging, going up and down stairs, sitting, and standing for a specified period of time to collect the required data.
Traditional classification techniques cannot be directly applied to raw accelerometer data presented in the form of time series. Therefore, the first step in the classification task requires the conversion of raw sensor data into structured instances suitable for classification. To achieve this, we divided the continuous data into 10-second intervals and extracted effective features from the 200 sensor data points collected in each interval. We define these interval lengths as the example duration (ED). The reason for choosing 10 seconds as the ED is that this time span is long enough to cover many typical repetitive motion cycles across the six activities studied.
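The 10-second windowing can be sketched as fixed-length, non-overlapping slicing; at the 20 Hz effective sampling rate implied above (200 points per 10 s), each window is one ED.

```python
def segment(samples, window=200, hop=200):
    """Split a continuous sensor stream into fixed-length instances.
    With a 20 Hz stream, window=200 gives the 10-second example duration (ED);
    hop=window yields non-overlapping segments (hop < window would overlap)."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, hop)]
```

Each returned segment is then passed to the feature extractor (or fed raw to the 1-D CNN) as one classification instance; any trailing partial window is discarded.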
The final dataset contained a total of 43 features extracted from 29 participants. In addition, the bottom row of Table 4 shows the proportion of specific activities to examples in the overall data set.
3.1.2. Oil and gas pipeline intrusion event dataset.
The dataset used in this experiment was manually collected by the company during an internship, utilizing a DAS fiber optic sensor. Our field sampling is illustrated in Fig 9 below.
While low-level primitives are reusable, site-specific effects (soil/backfill, burial depth, coupling) may still induce distribution shifts. We therefore recommend on-site calibration (threshold tuning or light re-training) for deployment. Our conclusions do not rely on transferring HAR semantics but on reusing general temporal filters.
To ensure the representativeness of this dataset collected under real-world conditions, we construct a parameterized simulator matching the real DAS configuration (sampling rate, segment length, channel count). Disturbance archetypes (footstep, light tapping, periodic mechanical impact) are synthesized with physically plausible envelopes and spectral content, mixed with environmental noise (microtremor, wind, device noise) at target SNRs. To quantify representativeness against an anonymized field subset, we compute (i) power spectral density (PSD) and band energy profiles, (ii) time-domain statistics (kurtosis, zero-crossing rate), (iii) autocorrelation decay constants, and (iv) spectral kurtosis. Similarity is summarized via two-sample KS tests (per statistic) and relative errors of band energies. The PSD superposition comparison and the frequency band energy bar comparison chart are shown in Fig 10 below:
The scatter plots of kurtosis, zero-crossing rate and AC attenuation are shown in Fig 11 below:
The scatter plots reveal the same correlation structure for Real and Sim segments: (i) Kurtosis vs. ZCR is positively associated—segments with sharper, burstier waveforms (higher kurtosis) tend to cross zero more frequently; (ii) Kurtosis vs. τ_AC shows a negative trend—burstier segments decorrelate faster (shorter τ_AC); (iii) ZCR vs. τ_AC is also negatively related—more rapid sign changes coincide with faster autocorrelation decay. The simulated points overlap the real clusters well, with only slightly fewer extremes, suggesting the simulator reproduces the distribution and coupling of time-domain features rather than just their means (Fig 12).
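Two of the time-domain statistics used in this comparison can be computed as sketched below; the 1/e threshold for the autocorrelation decay lag is an illustrative convention, not necessarily the exact definition of τ_AC used for the figures.

```python
def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    return sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0)) / (len(x) - 1)

def autocorr_decay_lag(x, threshold=1.0 / 2.718281828459045):
    """Smallest lag at which the normalized autocorrelation drops below 1/e,
    a simple stand-in for the decay constant tau_AC discussed above."""
    n = len(x)
    m = sum(x) / n
    xc = [v - m for v in x]                       # remove the mean
    r0 = sum(v * v for v in xc)
    if r0 == 0:                                   # constant signal: no decay defined
        return 0
    for lag in range(1, n):
        r = sum(xc[i] * xc[i + lag] for i in range(n - lag)) / r0
        if r < threshold:
            return lag
    return n
```

Bursty, rapidly alternating segments score high ZCR and short decay lags, reproducing the negative ZCR–τ_AC coupling seen in the scatter plots.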
The last comparison is the mean spectral kurtosis curve:
Spectral kurtosis curves from Real and Sim exhibit coincident, narrow peaks at the same frequencies (e.g., ~ 11–13 Hz and additional harmonics/lines near ~40, ~ 55, ~ 68, and ~90–95 Hz), indicating that both contain similarly intermittent, line-like spectral components at those locations.
Peak heights are very close; the Sim curve is marginally higher at ~ 13 Hz while remaining within the Real μ ± σ envelope across most of the spectrum. This alignment supports that the simulator replicates not only broadband energy but also the frequency-localized impulsiveness of the real data.
Taken together, the average PSD and band-energy analyses show that both datasets are strongly low-frequency dominated: the PSD decays steeply above 2–3 Hz and exhibits a small, reproducible resonance around ~ 12–13 Hz. The simulated signal closely tracks the real one over 0–100 Hz, including the low-frequency roll-off and the narrow ~ 12 Hz peak, with only minor amplitude deviations at the peak. Consistently, the band-energy results (mean ± std) confirm that most energy lies in [0, 5] Hz for both sources, with much smaller contributions in [5,10] and [10,20] Hz and negligible energy above 20 Hz. Across all bands, Real and Sim fall within each other’s standard deviation ranges; small differences—e.g., in [10,20] Hz—are within variability and do not alter the ranking of bands. Overall, the simulator captures both the gross spectral envelope and the way energy is partitioned across frequency.
Despite the alignment in aggregate statistics, the simulator cannot fully capture the variability induced by soil composition, burial depth, backfill, and local coupling conditions. Extreme operating scenarios (heavy machinery, rigid pavement) may exhibit spectral peaks absent in our current library. We recommend site-specific calibration and threshold tuning for deployment. The data collection process is illustrated in Fig 13.
As laser pulses propagate through the optical fiber, the molecules in the fiber material are stimulated and cause scattering, such as Raman scattering, Rayleigh scattering, and Brillouin scattering. Among these, Rayleigh scattered light is affected by external environmental vibrations. Thus, the principle behind DAS distributed fiber optic vibration sensing is to acquire external environmental information by detecting the intensity and phase of the backscattered Rayleigh light signals within the fiber.
Φ-OTDR Signal Demodulation Technique: To implement distributed fiber optic sensing, Φ-OTDR technology amplifies and filters the backscattered Rayleigh signals in the sensing fiber using a weak-signal amplifier (EDFA). These optical signals are then detected optoelectronically, converted into digital signals by a high-speed data acquisition card, and finally processed by a computer to recover the external environmental vibration signals.
The raw data is the scattered light intensity along the fiber, demodulated by the DAS device. Data is stored in binary (.bin) format as 16-bit unsigned integers (USHORT), with 32768 data points per frame.
Automatic collection works by automatically sampling the signal at location points that meet the set threshold conditions. When the channel sending option is enabled, the abnormal signals from the corresponding channel that satisfy the threshold conditions are sent to the configured network IP and port address.
The sampled data is a time-domain signal of 1 second duration at a selected location, which is sent through a network port. A proper network IP address and port number must be set for the server or other receiving devices to receive the sampled data.
3.2. Public data six-class classification test
In this study, the publicly available ACT human activity dataset—comprising six activity classes: Walk, Jog, Up, Down, Sit, and Stand—was utilized in the pretraining phase of the 1-D CNN to obtain initial convolutional kernel weights. Subsequently, fine-tuning and six-class intrusion classification were performed using the self-collected pipeline intrusion signals from 29 volunteers, incorporating both the 43 handcrafted features and the corresponding raw time-series segments. In Fig 8, the labels 0–5 correspond to the aforementioned six activity/intrusion classes. First, we conduct a study on six-class classification recognition using the ACT public dataset (Fig 14):
This diagram shows the distribution of the raw data labels. Label 5 has the largest sample size, with label 1 second only to it. As a result, the model usually performs better for labels with larger sample sizes, such as labels 1 and 5.
Fig 15 presents the distribution of 10 selected representative features (refer to Table 5) across the six intrusion classes. Each box plot illustrates the inter-class variability of a specific feature, demonstrating its discriminative capability.
The chart above shows the distribution of each feature. These histograms reflect the feature-value distributions of the three channels in the dataset (channels 1, 2, and 3):
Channel 1 (blue): The values are roughly distributed between −3 and 3, following a nearly normal distribution. The distribution is symmetrical, the data is concentrated in the middle region, and there are a few extreme values at both ends.
Channel 2 (green): The values range from about −4 to 2, and the distribution is clearly left-skewed (negatively skewed). The data is mainly concentrated near 0 and slightly below it; the left tail carries significant negative extreme values, while the right tail is relatively short.
Channel 3 (red): The values range from −4 to 4, with an overall approximately normal distribution that is slightly right-skewed (positively skewed). The data is mainly concentrated around 0, while some positive extreme values appear in the right tail.
In general, the data of all channels has been standardized to ensure a consistent scale.
The apparent asymmetry of channel 2 may be more valuable for feature extraction, as it may be easier for the model to capture anomalies or significant differences in this channel (Fig 16).
The figure shows the overall data distribution after redimensioning (combining data across all channels) and standardizing. The data is mainly concentrated between −4 and 4, with the highest density located near 0. Overall, the distribution is approximately standard normal (Gaussian) and symmetrical, with few extreme values at either end and no obvious skew (Fig 17).
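The standardization described above is a per-channel z-score. A minimal sketch, assuming data shaped (samples, time steps, channels) as in the three-channel layout discussed here:

```python
import numpy as np

def standardize(X):
    """Per-channel z-score: zero mean, unit variance per channel.

    X has shape (n_samples, n_timesteps, n_channels); statistics are
    pooled over samples and time steps for each channel.
    """
    mu = X.mean(axis=(0, 1), keepdims=True)
    sigma = X.std(axis=(0, 1), keepdims=True) + 1e-12
    return (X - mu) / sigma
```

In practice the channel means and standard deviations would be estimated on the training split only and reused for the test split, to avoid information leakage.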
This time series graph visually shows the trend, periodicity, and volatility of the data over time. By observing the chart, one can quickly grasp the overall trend, periodic pattern, and fluctuation range of the data. The figure shows data from three different samples (sample 1, sample 2, and sample 3), each containing three channels (channel 1, channel 2, and channel 3). All samples carry the same label (label 1), indicating that they belong to the same class of data.
The horizontal axis represents the time steps in the data acquisition process, with a total of approximately 90 time steps, each corresponding to a specific data value. The vertical axis represents the signal strength or amplitude of each channel at each time step. The amplitude values fluctuate roughly between −2 and 2, indicating that these signals have been normalized or standardized.
Each sample presents markedly different fluctuations across its three channels, and the waveforms are irregular overall, showing a certain randomness or complexity. Although the three samples share the same label, their specific waveforms differ considerably. This suggests that any method used for analysis or modeling must be able to identify common features among samples of the same label in order to cope with the diversity of waveforms (Fig 18).
The figure shows a single time series sample (sample 0), labeled 0, with data that has been normalized. The chart reflects the fluctuations of three channels (channel 1, channel 2, and channel 3). The fluctuation of each channel is significant, and there is no obvious periodicity or regularity overall:
Channel 1 shows significant extreme fluctuations at specific time points (e.g., around time steps 20, 40, and 60).
Channel 2 also exhibits similar extreme values (e.g., around time steps 35, 45, and 85).
Channel 3 appears relatively more stable, yet it still presents random variations. Overall, the sample presents a random and irregular signal pattern that may represent a complex signal or noise-like data. This implies that the category corresponding to label 0 may reflect an atypical or more random signal feature (Fig 19).
The figure illustrates the training process of the 1-D CNN network applied to the ACT pub-lic dataset, showing the trends in loss and accuracy over the training epochs.
Left Chart (Loss):
The horizontal axis represents the training epochs (100 epochs in total).
The vertical axis represents the loss value of the model, and a lower value usually means better prediction performance.
The loss decreases rapidly during the first 20 epochs, demonstrating that the model quickly learned the basic features of the data.
Around the 60th epoch, the rate of loss decline began to slow, finally stabilizing in a lower range (about 0.2 to 0.4), indicating that the model has gradually converged and the training process has become stable.
Right Chart (Accuracy):
The horizontal axis still represents the number of training epochs, while the vertical axis represents the accuracy of the model.
In the first 30 epochs, the training accuracy rose rapidly, then continued to increase more slowly, finally exceeding 95%, which indicates that the model fits the training data well.
The test accuracy remains stable at around 95% and is very close to the training accuracy, demonstrating good generalization with no significant overfitting.
Overall, the continuous decrease in loss values and steady improvement in accuracy indicate that the model has successfully extracted key features from the data. In addition, the similarity between training accuracy and testing accuracy indicates that the model has strong generalization ability when facing new, unseen data (Fig 20).
The receiver operating characteristic (ROC) curve is one of the key metrics for evaluating the performance of classification models. The curve graphically shows how the true positive rate (TPR) changes with the false positive rate (FPR) under different decision thresholds.
Horizontal axis (X-axis): False Positive Rate (FPR), which represents the proportion of negative samples that are incorrectly classified as positive.
Vertical axis (Y-axis): True Positive Rate (TPR, also known as recall), which indicates the proportion of positive samples that are correctly classified.
AUC (area under the curve) quantifies the classification performance of the model, with values ranging from 0.5 to 1; the closer the value is to 1, the better the model's performance. From the figure, it can be seen that the 1-D CNN model performs well in identifying and classifying the six label types, with AUC values close to 1 (Fig 21).
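The AUC has a convenient probabilistic reading: it equals the chance that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counting half). A self-contained sketch of that definition, illustrative rather than the study's evaluation code:

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC via its probabilistic definition.

    Equals the area under the ROC curve: the probability that a random
    positive sample receives a higher score than a random negative one,
    with ties counted as 0.5.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return float((gt + 0.5 * eq) / (len(pos) * len(neg)))
```

For the six-class task, the figure's per-class curves correspond to applying this one-vs-rest for each label.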
The figure shows a confusion matrix heat map generated by a classification task consisting of six different categories (labels 0–5). The graph directly reflects the prediction effect of the model for the various categories. The horizontal axis represents the labels predicted by the model, and the vertical axis represents the actual labels. Diagonal elements from the top left to the bottom right represent samples that have been correctly identified, while values off the diagonal represent samples that have been misidentified by the model as other classes.
Overall, the model performed very well, with most samples correctly classified, especially on label 1 (2,199 samples) and label 5 (2,834 samples). However, there is still some confusion between certain categories; for example, label 0 is often misclassified as label 4 or label 5. This suggests a high degree of similarity between these categories, and more data or further feature engineering may be needed to enhance the model's ability to discriminate between them (Fig 22).
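The heat map's layout (rows = actual labels, columns = predicted labels) can be built directly from label pairs; per-class recall is then each diagonal entry divided by its row sum. A minimal sketch:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=6):
    """Rows are actual labels, columns are predicted labels,
    matching the orientation of the heat map described above."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

Misclassifications such as label 0 predicted as label 4 appear as off-diagonal counts in row 0.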
3.3. Three-class classification test on oil and gas pipeline data with transfer learning incorporated
The figure shows the trends of the loss value and accuracy during training of the three-class classification task. As the number of training epochs increased, the loss gradually decreased from about 2.5 to below 0.5, stabilizing after about 60 epochs and indicating that the model has gradually converged. At the same time, the training accuracy rose rapidly from 0.4 to more than 0.9 and stabilized at about 0.9 after roughly the 30th epoch. Overall, the model shows stable and strong learning ability during training. In order to disentangle the influences of manually engineered features, original data, and transfer learning, this paper additionally conducts controlled statistical comparisons and ablation studies, with the experimental outcomes detailed below:
Transfer learning improves accuracy by 5.2%, demonstrating its value for small datasets. The combined approach achieves synergistic performance beyond individual components.
4. Experimental comparison
4.1. Comparison with traditional machine learning algorithms
To rigorously evaluate the proposed multimodal 1-D CNN framework, comparative experiments were conducted against several classical machine learning algorithms, including Support Vector Machines (SVM), Random Forests (RF), Gradient Boosting, and Logistic Regression. These models were trained on the same multimodal feature sets to ensure a fair comparison (Fig 23).
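The comparison protocol can be sketched with scikit-learn defaults. The hyperparameters and split used in the study are not stated, so the synthetic 43-feature data and settings below are stand-ins for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the 43-dimensional handcrafted feature set
# over three intrusion classes.
X, y = make_classification(n_samples=300, n_features=43, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baselines = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
# Fit each baseline on identical features and score on the held-out split.
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in baselines.items()}
```

Holding the feature set and split fixed across models is what makes the accuracy comparison fair.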
The four comparative experimental results are integrated into Table 6 as shown below.
Four comparative experimental result charts.
Performance metrics—recall, precision, F1-score, and support—were calculated for each model across different intrusion event classes. Results indicate that while traditional classifiers achieve moderate performance on certain classes, their overall expressive power and adaptability fall short compared to the deep learning approach. In particular, the 1-D CNN exhibits superior recall and precision across all classes, highlighting its ability to effectively learn complex temporal and spectral features inherent in multimodal DAS data.
Regarding data imbalance and performance consistency, we acknowledge the importance of model performance consistency across all intrusion categories. To mitigate potential bias from class imbalance, a class-weighted cross-entropy loss function was employed during the training of our 1-D CNN model. This approach increases the cost of misclassifying samples from underrepresented classes, thereby encouraging the model to learn their characteristics more effectively.
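A class-weighted cross-entropy of this kind can be sketched in NumPy. This follows the weighted-mean convention used by common deep learning frameworks; the actual class weights used in the study are not given, so the values in the example are illustrative:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy (weighted-mean convention).

    probs: (n, n_classes) softmax outputs; labels: (n,) integer classes;
    class_weights: (n_classes,), e.g. inverse class frequencies, so that
    misclassifying a rare class costs more.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    w = np.asarray(class_weights, dtype=float)[labels]
    return float(np.sum(-w * np.log(p)) / np.sum(w))
```

With all weights equal, this reduces to the ordinary mean cross-entropy; raising a class's weight shifts the loss toward that class's samples.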
The efficacy of this strategy is validated by the detailed per-class performance of our proposed 1-D CNN. On the test set, the model achieved the following recall rates: [0.94] for Class 0 (Manual Tapping), [0.92] for Class 1 (Mechanical Excavation), and [0.96] for Class 2 (Footsteps). The high and balanced recall across all classes demonstrates that the model does not sacrifice the detection of any specific category, particularly the potentially rarer events like Class 1, to achieve high overall accuracy. Furthermore, the per-class precision values were also consistently high ([0.95] for Class 0, [0.93] for Class 1, and [0.95] for Class 2), resulting in strong F1-scores for each class. This uniform high performance confirms that our model reliably detects all targeted intrusion types without exhibiting significant bias.
The comparative analysis underscores the advantages of integrating multimodal sensing with deep interactive learning models, reinforcing the proposed framework’s suitability for real-time pipeline security applications.
4.2. Comparison with current deep learning algorithms
After comparing with traditional machine learning algorithms, we also made comparisons with existing deep learning algorithms (LSTM/Transformer/hybrid models). The comparison results are shown in Fig 24 and Table 7 as follows:
Table 7 shows test accuracy versus parameter count for six backbones trained under the same protocol. The 1-D CNN attains the highest accuracy (~94.6%) with a compact footprint (~1.5M params). TCN is a close second (~94.5% at ~1.7M), suggesting that local temporal convolutions already capture the dominant structure of the data. Heavier hybrids and sequence models—CNN-BiLSTM (~94.1% at ~2.1M), Bi-LSTM (~93.7% at ~2.8M), GRU (~93.6% at ~2.5M), and a lite Transformer (~93.9% at ~3.2M)—do not surpass the simpler CNNs despite larger capacity. The spread across all methods is modest (≈1.1 percentage points), indicating diminishing returns beyond lightweight convolutional designs.
The accompanying table confirms this trend across additional metrics: the 1D-CNN and TCN deliver the best Macro-F1 with the lowest FLOPs and shortest inference latency, whereas the recurrent and Transformer-style models incur higher compute/latency without accuracy gains. The hybrid CNN-BiLSTM narrows the gap but remains below pure CNNs, implying that long-range recurrence/attention is not the bottleneck for this task. Taken together, these results justify using a compact CNN (kernel ≈ 7, 128–192 channels) as the default backbone; more complex LSTM/Transformer/Hybrid variants offer limited benefit relative to their cost.
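The quoted parameter counts can be sanity-checked with simple arithmetic: a 1-D convolutional layer contributes in_channels × out_channels × kernel_size weights plus one bias per output channel, so a handful of kernel-7 layers at 128–192 channels plus a classifier head lands in the low millions, consistent with the ~1.5M figure (the exact layer stack is not specified here, so this is only an order-of-magnitude check):

```python
def conv1d_params(in_ch, out_ch, kernel):
    """Weights (in_ch * out_ch * kernel) plus one bias per output channel."""
    return in_ch * out_ch * kernel + out_ch

# e.g., one kernel-7 layer mapping 128 -> 192 channels:
layer = conv1d_params(128, 192, 7)
```

Stacking several such layers multiplies this count accordingly, which is why CNN backbones stay well below the recurrent and Transformer variants in Table 7.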
5. Conclusions
This study presents a novel multimodal sensing and interaction framework for oil and gas pipeline intrusion detection, integrating distributed acoustic sensing data with handcrafted feature fusion and a finely optimized 1-D CNN model enhanced by transfer learning. The approach demonstrates superior classification accuracy and robust feature learning capabilities, highlighting the potential of multimodal technologies for intelligent infrastructure security. This work demonstrates the potential of combining advanced multimodal sensing technologies with deep learning-based interactive analytics for real-time pipeline security monitoring. It is important to note that the proposed framework is designed for extensibility. While this study validated its efficacy on three high-priority intrusion events (manual tapping, mechanical excavation, and footsteps), the underlying architecture is capable of incorporating new classes of intrusion events, such as vehicle rollover or impact. Future deployment will focus on continuously expanding the event library by collecting new data and fine-tuning the model, thereby enhancing the system’s comprehensiveness and practical utility. Future research will explore deeper integration of heterogeneous sensor modalities to build richer multimodal interaction frameworks, develop adaptive interactive learning algorithms capable of real-time anomaly detection and human-in-the-loop decision support, and implement the system in operational pipeline networks. These efforts aim to realize fully interactive and intelligent infrastructure monitoring systems that seamlessly fuse multimodal data streams, support dynamic interaction with operators, and enable proactive security interventions.
Supporting information
S1 Appendix. The six fundamental human activities recorded in the dataset are as follows.
https://doi.org/10.1371/journal.pone.0338205.s001
(TIF)
Acknowledgments
This research was supported by the “Bojun F3 Production Line Informatization Construction Project.” The authors gratefully acknowledge the project’s support, particularly in the area of fiber optic sensing technology, which provided essential technical resources and a practical application environment that significantly contributed to the development and validation of this work.