One-pass-throw-away learning for cybersecurity in streaming non-stationary environments by dynamic stratum network

In recent years, cybersecurity problems have occurred in various business applications. Although previous researchers have proposed methods to cope with cybersecurity issues, these methods repeat the training process several times to classify the datasets of these problems in streaming non-stationary environments. In such dynamic environments, conventional methods may degrade the adaptive solution needed to prevent these issues. This research proposes one-pass-throw-away learning using a dynamic network structure to solve these problems in dynamic environments. Furthermore, to reduce the computational time and to maintain minimum space complexity for streaming data, new learning concepts in the form of recursive functions are introduced. Information gain-based feature selection is also applied to reduce the learning time during the training process. The experimental results indicate that the proposed algorithm outperformed the compared incremental-like and online ensemble learning algorithms in terms of classification accuracy, space complexity, and computational time.


Introduction
Current machine learning systems are generally designed under the assumption that data are stationary, with training and testing data drawn independently from the same distribution. This assumption is often violated in cybersecurity problems, because the systems typically operate in a non-stationary environment, a phenomenon called concept drift [1]. When machine learning systems are deployed online, an attacker may attempt to find their vulnerabilities and degrade their predictive performance by manipulating data in a variety of scenarios [2,3]. In Internet of Things (IoT) security, risks and threats may arise on devices connected to the Internet in uncertain situations [4,5]. Moreover, cloud computing systems can be threatened by intruders through vulnerabilities in data integrity, authentication, and denial of service [6,7]. Therefore, recent security issues of machine learning systems have led to complications in several real-life applications such as network traffic analysis, fraud detection, spam filtering, studied to enhance the predicted performance [18][19][20]. For a streaming non-stationary environment, these models cannot be appropriately applied because the context changes over time. Learning for cybersecurity problems under streaming non-stationary environments must adapt the structure of the network to these environments to maintain its performance. An example of such changes is found in spam filtering, in which a user's topics of interest change over a period of time, as discussed in [16].
To overcome cybersecurity problems in a streaming non-stationary environment, a new dynamic stratum network is proposed to provide an adaptive structure. It is composed of a versatile elliptic function, recursive functions for adjusting the parameters of the network, and an expandable and shrinkable adaptive network that captures the characteristics of the available data. The proposed learning algorithm can learn data without retaining any previously learned data, which is called one-pass-throw-away learning. Data that have already been learned are permanently removed from the learning process. To reduce memory space and computational time, feature selection based on information gain is also applied in this study.
The remainder of this paper is organized as follows. Section 2 briefly summarizes the relevant theoretical background. Section 3 describes the proposed method, and the experimental results are shown in Section 4. Section 5 concludes the study.

Relevant background
This section provides background related to the studied problem and the proposed algorithm. The proposed algorithm adapts some concepts of one-pass-throw-away learning to create and expand hidden neurons that capture data during the learning process. One-pass-throw-away learning in [36,37] is summarized as follows.
The output of the neuron k with respect to an input x is computed from a rotated elliptic function as shown in Eq (1).
where $C_k = [c^k_1\ c^k_2\ \ldots\ c^k_d]^T$ is the center vector of the ellipse represented by a hidden neuron $O_k$. Two hidden neurons $O_a$ and $O_b$ can be combined into one new hidden neuron $O_c$ under a particular condition. After combining, the new parameters are computed as follows.
where $\lambda_i$ is the $i$-th eigenvalue of the covariance matrix $U_c$. After successful combining, $O_a$ and $O_b$ are removed from the network.
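Eq (1) itself is not reproduced in this text. As a minimal sketch, assume the rotated elliptic function takes a commonly used form: the centered input is projected onto a set of orthonormal basis vectors, each projection is scaled by a width, and 1 is subtracted from the sum of squares, so that a value less than or equal to 0 means the datum lies inside the ellipse. The function name `psi`, the argument layout, and this exact form are assumptions for illustration, not the paper's definition.

```python
def psi(x, center, basis, widths):
    """Assumed rotated elliptic function: <= 0 inside the ellipse, > 0 outside.

    basis  : list of orthonormal direction vectors (one per axis of the ellipse)
    widths : list of positive widths, one per direction
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    total = 0.0
    for direction, width in zip(basis, widths):
        # Length of the centered input along this axis of the ellipse.
        projection = sum(d * u for d, u in zip(diff, direction))
        total += (projection / width) ** 2
    return total - 1.0


center = [0.0, 0.0]
basis = [[1.0, 0.0], [0.0, 1.0]]   # axis-aligned ellipse for simplicity
widths = [2.0, 1.0]

inside = psi([1.0, 0.0], center, basis, widths)    # (1/2)^2 - 1 = -0.75
outside = psi([0.0, 2.0], center, basis, widths)   # (2/1)^2 - 1 = 3.0
```

Under this assumed form, the sign of the output alone decides whether a neuron captures a datum.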
The procedure for initializing a width vector given in [38] can be briefly described as follows.

Proposed method
This study focuses on one-pass-throw-away learning to keep space complexity and computational time low in streaming non-stationary environments of cybersecurity problems. The learning can adjust the structure of the network as the environment changes. Furthermore, if there exists a sub-network in a class i whose data have all eventually changed class for some unknown reason, the relevant neurons and their links must be entirely removed. An overview of the proposed method for classifying data in a streaming non-stationary environment is shown in Fig 1. The method maintains a dynamic network that adjusts its structure according to the incoming data, while all learned data are removed permanently after the learning process. The proposed method addresses two issues. The first issue is that the dynamic network structure must adjust to unexpected situations that arrive continuously. The second issue is the dimensionality reduction of features to speed up the training process. The details of each issue are described in the following sections.

Dynamical structure of proposed network
The proposed Dynamic Stratum (Dyn-Stratum) network is developed as a streaming incremental framework for handling concept drift in cybersecurity problems. There are three layers in the Dyn-Stratum network. The first layer is the input layer, consisting of a set of neurons whose size is equal to the number of dimensions (attributes) of the input vector. The second layer (hidden layer), connected to the first layer, consists of a set of neurons for each class. The sub-network of this layer is composed of two strata with timestamps of the streaming data in the training process. Without loss of generality, assume that there are four classes, 1, 2, 3, and 4; the network of each class, which has only one output denoted by $y_i$, can then be defined. All classes in each stratum are learned by a 3-layer feed-forward network. The hidden layer of this network includes many groups of neurons, i.e. one group for each class.

Proposed learning algorithm
To clarify the proposed Dyn-Stratum concept, an example of how the proposed learning algorithm works is shown in Fig 3. At the initial time $t_1$, assume that there are two neurons, of Class 1 and Class 2, denoted by thick dots and stars, respectively. There are three data within the neuron of Class 1 and two data within the neuron of Class 2, all with the recent timestamp. All classes are stored in stratum 1. At the second time $t_2$, the label of a datum in Class 1 changes to Class 2. This causes the neuron of Class 1 to shrink and to be assigned to stratum 2, which corresponds to the older timestamp, while the neuron of Class 2 expands because the class label of a datum of Class 1 stored in stratum 1 has changed. At the last time $t_3$, one new datum, denoted by the square, is captured by a neuron of Class 3 stored in stratum 1, while nothing changes in Class 1 and Class 2. The three main steps of the proposed algorithm are described as follows.

The first step is to define the appropriate parameters for creating a hidden neuron, since these parameters influence the computational time and classification accuracy of the learning process. In particular, if the width vector is initially computed with appropriate values, the structure of the network will also be updated efficiently. A suitable width vector is calculated from the distances among some training data points, as introduced in [38]. The method for computing the initial width vector is given in the Relevant background section.
The second step is to adjust the proposed Dyn-Stratum network, which can expand and shrink according to the incoming data or class-changed data. When an incoming datum is added, the center vector and the covariance matrix are recursively computed as follows. For any hidden neuron $O_k$, the parameters $\{C_k, U_k, N_k, W_k, l_k\}$ are already learned. Suppose an incoming datum $x_j$ arrives and falls inside the boundary of $O_k$, that is, the value of $\psi_k$ in Eq (1) is less than or equal to 0. The new center $C_k^{(new)}$ of $O_k$ is computed from the old center $C_k^{(old)}$ by the following recursive function.
The new covariance matrix $U_k^{(new)}$ of $O_k$ is recursively computed from the old covariance matrix $U_k^{(old)}$ and the new center $C_k^{(new)}$ as follows.
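The recursive equations are not reproduced in this text. As a minimal sketch, assume the standard recursive update of a mean and a population covariance matrix (dividing by the number of data), which may be arranged differently from the paper's own formulas. The covariance update below uses the old covariance and the new center, as the description above suggests. The function name `expand_neuron` and its argument layout are hypothetical.

```python
def expand_neuron(center, cov, n, x):
    """Add datum x to a neuron that already holds n >= 1 data.

    Returns the updated (center, cov, n + 1) without revisiting old data:
      C_new = C_old + (x - C_old) / (n + 1)
      U_new = n / (n + 1) * U_old + (x - C_new)(x - C_new)^T / n
    """
    dim = len(x)
    new_center = [c + (xi - c) / (n + 1) for c, xi in zip(center, x)]
    diff = [xi - c for xi, c in zip(x, new_center)]
    new_cov = [
        [n / (n + 1) * cov[i][j] + diff[i] * diff[j] / n for j in range(dim)]
        for i in range(dim)
    ]
    return new_center, new_cov, n + 1


# A neuron holding the single datum (0, 0) receives the datum (2, 4).
center, cov, n = [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], 1
center, cov, n = expand_neuron(center, cov, n, [2.0, 4.0])
# center is now [1.0, 2.0] and cov is [[1.0, 2.0], [2.0, 4.0]],
# the population covariance of the two points.
```

Because only the center, the covariance matrix, and the count are stored, the datum itself can be discarded once the update is done.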
The total number of data captured by $O_k$ becomes $N_k + 1$. The details of this process are given in the ExpandingNeuron function. In the case of removing a datum from a hidden neuron $O_k$, the center vector and the covariance matrix are computed as follows.
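The removal equations are likewise not reproduced in this text. A minimal sketch, assuming the algebraic inverse of the standard recursive mean and population covariance update, is given below. The function name `shrink_neuron` and the convention of returning `None` for an emptied neuron are hypothetical.

```python
def shrink_neuron(center, cov, n, x):
    """Remove datum x from a neuron that currently holds n data.

    Returns the updated (center, cov, n - 1), or None when the neuron becomes
    empty and should be deleted together with its links:
      C_new = C_old - (x - C_old) / (n - 1)
      U_new = n / (n - 1) * U_old - (x - C_new)(x - C_new)^T / n
    """
    if n <= 1:
        return None
    dim = len(x)
    new_center = [c - (xi - c) / (n - 1) for c, xi in zip(center, x)]
    diff = [xi - c for xi, c in zip(x, new_center)]
    new_cov = [
        [n / (n - 1) * cov[i][j] - diff[i] * diff[j] / n for j in range(dim)]
        for i in range(dim)
    ]
    return new_center, new_cov, n - 1


# Removing (2, 4) from a neuron built from (0, 0) and (2, 4)
# leaves the single datum (0, 0) with zero covariance.
state = shrink_neuron([1.0, 2.0], [[1.0, 2.0], [2.0, 4.0]], 2, [2.0, 4.0])
```

In floating-point arithmetic, repeated additions and removals can accumulate rounding error, so a practical implementation may need to guard against slightly negative variances.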
The remaining number of data captured by $O_k$ becomes $N_k - 1$. However, if no data are captured by $O_k$, its links must be entirely removed. The details of this process are given in the ShrinkingNeuron function.

The third step is to compute the timestamp of $O_k$ by using the number of data captured by $O_k$ and the previous learning time. The value of the timestamp $t_k$ of the neuron $O_k$ is computed recursively, where $N_k$ and $N_k^{(new)}$ are the present number and the updated number of data captured by $O_k$, respectively.

Furthermore, the testing step is to predict the class of any testing datum $x_j$, denoted as $y(x_j)$. Let $H$ be the set of hidden neurons in the two strata.
Here, $\psi_k(x_j)$ according to Eq (1) is the output of a neuron $O_k$ in any stratum, and a separate case applies when $x_j$ is outside all neurons. The proposed learning algorithm, named the Dyn-Stratum algorithm, is summarized with three main sub-processes as follows:
1. Creating a new hidden neuron, called CreatingNewNeuron.
2. Expanding and updating relevant parameters of the hidden neuron, called ExpandingNeuron.
3. Removing data captured by the hidden neuron, called ShrinkingNeuron.
Within these sub-processes, the center vector is updated to $C_k^{(new)}$, and the updated neuron $\Omega_k$ and its timestamp $t_k$ are returned.

Algorithm 1: Dyn-Stratum
Find the closest neuron with the same class as $x_j$ in each stratum by using the Mahalanobis distance.
ExpandingNeuron($x_j$, $C_{k_2}$, $U_{k_2}$, $N_{k_2}$, $W_{k_2}$, $l_{k_2}$, $t_{k_2}$). % in stratum 2
EndIf
Find the closest neuron of $x_j$ in stratum 1.
If class($x_j$) $\neq$ the class of neuron $\Omega_{k_1}$ then
Else
ExpandingNeuron
Find the closest neuron of $x_j$ in stratum 2.
Combine $\Omega_a$ and $\Omega_b$ neurons into new neuron $\Omega_c$ with Eqs 2-5.
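The search for the closest same-class neuron by Mahalanobis distance can be sketched as follows. This is an illustration under stated assumptions, not the paper's listing: each neuron is represented by a hypothetical dictionary with `label`, `center`, and `cov` keys, the covariance matrix is assumed to be non-singular, and the small linear solver is included only to keep the example self-contained.

```python
import math


def solve(a, b):
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (m[i][n] - tail) / m[i][i]
    return y


def mahalanobis(x, center, cov):
    """sqrt((x - c)^T cov^{-1} (x - c)), computed without forming the inverse."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    y = solve(cov, diff)
    return math.sqrt(sum(d * yi for d, yi in zip(diff, y)))


def closest_neuron(x, label, neurons):
    """Return the neuron of class `label` nearest to x, or None if none exists."""
    candidates = [nrn for nrn in neurons if nrn["label"] == label]
    if not candidates:
        return None
    return min(candidates, key=lambda nrn: mahalanobis(x, nrn["center"], nrn["cov"]))


neurons = [
    {"label": 1, "center": [0.0, 0.0], "cov": [[4.0, 0.0], [0.0, 1.0]]},
    {"label": 1, "center": [3.5, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]]},
    {"label": 2, "center": [2.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]]},
]
best = closest_neuron([2.0, 0.0], 1, neurons)
```

In this toy setting the first neuron is selected even though the second is closer in Euclidean terms, because the first neuron's data spread more widely along the first axis.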

Ranking relevant attributes
As described above, the main processes of the proposed learning algorithm depend on the computation of the center vector and the covariance matrix (also known as the dispersion matrix). When the number of features (attributes) is high, these computations consume most of the learning time. Although the features may describe a large number of characteristics, not all of them are essential. Some features are redundant or irrelevant. Redundant features are highly correlated with other features and add no information to the target learning task, whilst irrelevant features carry no helpful information with regard to the context [39,40]. Thus, the main objective of ranking the most important features is to enhance the classification accuracy and to correctly represent the characteristics of patterns. An example introduced in [40] demonstrates the grouping and clustering of alerts for detecting attacks by using the similarity of features.
Attribute ranking is a filter method of feature selection. Because of its simplicity, the method is successfully used in practical applications. The attribute ranking method is implemented by applying Information Gain (IG) before classification to filter out the less relevant attributes [39]. Information Gain is frequently used as a term-goodness criterion in different classification problems. The proposed algorithm for ranking relevant attributes uses information entropy to compute the Information Gain and returns the attributes sorted from the most useful (highest Information Gain) to the least. The process of ranking attributes is detailed in the following algorithm.
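The ranking algorithm itself is not reproduced in this text. A minimal sketch of Information Gain ranking, assuming discrete-valued attributes (continuous attributes would first need to be discretized), is shown below. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels):
    """Return attribute indices sorted from highest to lowest gain, plus the gains."""
    n_attrs = len(rows[0])
    gains = [information_gain([row[j] for row in rows], labels) for j in range(n_attrs)]
    order = sorted(range(n_attrs), key=lambda j: gains[j], reverse=True)
    return order, gains


# Attribute 0 separates the classes perfectly; attribute 1 carries no information.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["spam", "spam", "ham", "ham"]
order, gains = rank_attributes(rows, labels)
```

Keeping only the top-ranked attributes reduces the dimension of the center vector and the covariance matrix, which is where most of the learning time is spent.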

Experiments
The benchmark datasets were collected from three real-world concept drift datasets in cybersecurity studies, as shown in Table 1. These datasets have a large number of attributes and are popular, as they have been used in many machine learning studies. The Spam dataset taken from [41] represents the task of separating malicious spam emails from legitimate ones. The Phishing websites dataset [42] was collected from malicious web pages. Phishing websites are one of many challenging worldwide security problems. The NSL-KDD dataset [43] was obtained from the application of intrusion detection systems, where the main focus is on filtering malicious network traffic. All attributes in the datasets are converted to numeric type before applying the proposed algorithms. These datasets exhibit concept drift, but the type of drift cannot be determined in advance. The datasets are described in detail in the next subsection.

Dataset description
Three real-world concept drift datasets were chosen from the domain of cybersecurity. Table 1 summarizes the number of instances, the number of features (attributes), and the number of class labels of each dataset. The Spam dataset contains 9,324 instances and was constructed from the email messages of the Spam Assassin Collection. Each instance has 500 attributes including the class label, such that each attribute stands for the occurrence of a single word in an instance (e-mail). As previously mentioned in [41], the spam message characteristics in this dataset gradually change over time.

The Phishing websites dataset was acquired from the UCI Machine Learning Repository [42]. There are 11,055 website samples, each consisting of 30 attributes. The 6,157 legitimate websites are assigned the class label "+1", while the 4,898 phishing websites are assigned the class label "-1". All of them were collected from the PhishTank archive, the MillerSmiles archive, and Google's searching operator.
The NSL-KDD dataset is a modified version of the KDD Cup 99 dataset, which is studied as a benchmark dataset for cybersecurity problems. This dataset includes TCP connection records that consist of 41 informational attributes and one labelling attribute, which classifies each record as one of four types of attacks or as a normal connection. Among the 41 attributes, there are 32 continuous attributes and 9 nominal attributes. In addition, for further evaluation, this experiment also transformed the dataset into a two-class problem consisting of abnormal and normal classes.

Performance evaluation
This study evaluates the performance of the proposed Dyn-Stratum algorithm in comparison with several existing classification algorithms, including incremental learning for non-stationary environments (Learn++.NSE) [27], weighted majority vote (WMV) [30], Anticipative Dynamic Adaptation to Concept Drift (ADACC) [34], and Adaptive Random Forests (ARF) [35]. These compared algorithms were implemented with chunk-based and online learning modes for non-stationary streams. Chunk-based mode processes incoming data in chunks, where each chunk contains a fixed number of training instances. Online mode learns each incoming datum separately, rather than in chunks, and then discards it. The proposed Dyn-Stratum, Learn++.NSE, WMV, ADACC, and ARF methods were implemented with MATLAB and the Massive Online Analysis (MOA) framework [44]. All experiments were evaluated under two settings. The first evaluation comprises space and time complexities as well as overall classification measures (e.g., accuracy, precision, recall, F-measure, and geometric mean). The measures are evaluated on the whole data stream based on the 5-fold cross-validation technique. In cross-validation, the whole dataset was sequentially divided into five subsets of instances. In each iteration, four subsets were used for training and the remaining subset was used for testing. The process was repeated five times. In addition, to evaluate other important measures, this study employs the true positive (TP), false negative (FN), false positive (FP), and true negative (TN) counts, collectively called the confusion matrix, as shown in Table 2. The confusion matrix is used to calculate Precision, Recall, F-measure, and Geometric mean (G-mean).
In this setting, comparison results with attributes ranked by Information Gain, as introduced by Algorithm 2 in an earlier section, are also reported in our experiments.
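The equations for these measures are not reproduced in this text. The sketch below uses their standard definitions from the confusion matrix; in particular, G-mean is assumed here to be the square root of recall times specificity, which is a common convention but may differ from the definition used in the paper. The function name and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Standard measures from a 2x2 confusion matrix.

    Assumes every denominator is non-zero; a practical implementation
    would need to handle empty rows or columns explicitly.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Made-up counts for illustration only.
m = binary_metrics(tp=40, fn=10, fp=20, tn=30)
```

G-mean is useful on imbalanced data because it drops toward zero when either class is poorly recognized, whereas accuracy can stay high.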
The second evaluation is an incremental learning curve of the benchmark datasets, which are separated into chunks and evaluated using the test-then-train strategy with respect to the classification accuracy of each algorithm.
The experimental set-up was designed to evaluate the performance of the algorithms fairly. The training data were randomly partitioned into several chunks to test our concept of one-pass-throw-away learning. The same set of data used by the compared algorithms was also used in our experiment. The initial width value calculated in Procedure 1 was used for setting the appropriate width vector of the proposed Dyn-Stratum algorithm. As introduced in [27], the sigmoid parameters of Learn++.NSE were set to 0.5 and 1.0. Classification and regression trees (CART) were used as the classifiers for both the Learn++.NSE and WMV methods. On the other hand, the Naive Bayes classifier was used as the base learner of the ADACC method, since it usually learns incrementally and is frequently employed in online learning mode. The threshold and ensemble size parameters of the ARF method were set according to the conditions previously mentioned in [35].

Experimental results by using cross-validation strategy
In this section, the performance of each algorithm is evaluated by using the cross-validation technique. The percentage of the average accuracy with standard deviation, the number of neurons, and the training time are shown in Tables 3-6. Note that the other existing methods do not define neurons as the structure of the network; for the compared methods, the word neuron refers to the number of classifiers (or trees). Table 3 shows the comparison results on the Spam dataset. The comparison results of the different methods for the Phishing and NSL-KDD datasets are shown in Tables 4 and 5. The result for the NSL-KDD dataset transformed into binary classification is shown in Table 6. Moreover, with attributes ranked by Information Gain, the average accuracy of Dyn-Stratum was also maintained, boosting the performance for all benchmark cybersecurity datasets.
To evaluate the performance of the algorithms based on the confusion matrix, G-mean is reported for the cybersecurity datasets in terms of binary classification. In addition, G-means with and without the attribute ranking process are compared. The G-mean obtained by each algorithm on all benchmark cybersecurity datasets is shown in Table 7. Dyn-Stratum achieved the highest G-mean on all benchmark datasets. Additional performance measures, i.e. precision, recall, and F-measure, are reported for all benchmark cybersecurity datasets in Tables 8-10. These results show that Dyn-Stratum obtained the highest values for almost all benchmark datasets.

Experimental results of streaming data
Three real-world cybersecurity datasets without attribute ranking were used to evaluate the performance of the algorithms in streaming scenarios. These datasets were divided into data chunks to evaluate the performance based on the test-then-train strategy. The percentage of classification accuracy of each algorithm on all benchmark cybersecurity datasets is illustrated in

Discussion
In this study, we focused on one-pass-throw-away learning for solving problems of the cybersecurity domain in non-stationary scenarios. All experiments were conducted with several data chunks in one-pass-throw-away learning mode. Therefore, data that have already been learned can be discarded permanently after the learning process. The comparison results of the experiments on three real-world cybersecurity datasets in non-stationary environments are presented in the previous section. The highest average accuracy of the proposed Dyn-Stratum on all benchmark datasets indicates that the Dyn-Stratum network can flexibly adjust its parameters according to consecutive training data. The other methods adjusted the parameters of the network based on the entire dataset. This implies that considering the data distribution throughout the space is crucial for improving both computational time and classification accuracy. For the additional measures, the proposed Dyn-Stratum algorithm achieved the highest precision, recall, F-measure, and G-mean compared with the other methods. The computational time of the proposed Dyn-Stratum algorithm depends mainly on the number of times the covariance matrix computation is performed. For data with a large number of features, the learning process consumes a large amount of learning time. In our experiments, feature selection based on Information Gain is used to reduce this effect. Consequently, the dimensionality reduction of features to speed up the learning process and the efficiency of classification should be considered simultaneously to improve the performance of the method in streaming non-stationary environments.

Conclusion
Real-world cybersecurity problems in streaming non-stationary environments were studied. A new one-pass-throw-away learning algorithm using a dynamic stratum network, named the Dyn-Stratum network, was proposed to learn these data. The Dyn-Stratum network comprises two strata. The first stratum forms the dynamic structure by adding a new neuron or expanding the neuron with the most recent timestamp according to the incoming data. The second stratum adjusts the structure by expanding neurons and by removing a neuron's connections from the network when the classes of all of its data have changed. The center vector and the covariance matrix are calculated by recursive functions to reduce the computational time and memory space when data are removed or change class. Feature selection based on Information Gain was also applied to reduce the computational time. The comparison between the proposed Dyn-Stratum and the other methods on several benchmark cybersecurity datasets indicates that Dyn-Stratum achieved the minimum structural space complexity, less computational time, and higher classification accuracy than the other methods.