Mid-price prediction based on machine learning methods with technical and quantitative indicators

Stock price prediction is a challenging task, in which machine learning methods have recently been successfully used. In this paper, we extract over 270 hand-crafted features (factors) inspired by technical indicators and quantitative analysis and test their validity on short-term mid-price movement prediction for Nordic TotalView-ITCH stocks. The suggested feature list represents one of the most extensive studies in the field of financial feature engineering. We focus on a wrapper feature selection method using entropy, least-mean squares, and linear discriminant analysis. We also introduce a novel quantitative feature based on adaptive logistic regression for online learning. The proposed feature is consistently selected as the first feature among a large number of indicators used in this study. We further examine the best combinations of features using a high-frequency limit order book Nordic database. Our results suggest that sorting methods and classifiers can be used in such a way that one can reach the best classification performance with a combination of only a few advanced hand-crafted features.


Introduction
The problem under consideration in this paper is the prediction of a stock's mid-price movement (i.e., up, down, or stationary state) during high-frequency trading (HFT). At a given time instance, the mid-price of a stock is defined as the average of the best ask and bid prices. The mid-price is considered as vital information for market makers who continuously balance inventories as well as for traders who need to be able to correctly predict the direction of market movements. Moreover, the mid-price facilitates the process of monitoring the markets' stability (i.e. spoofing identification). The concept of mid-price prediction can be described as follows: at a given time instance t, the state of the stock is encoded in a vector-based representation calculated using a multi-dimensional time series information from a short-term time window of length T. Given this representation, the direction of the mid-price is predicted at a horizon of Δt. Over the past few years, several methods, such as those described in [1][2][3][4][5][6], and [7], have been proposed for analyzing stock market data. All of these methods follow the standard classification pipeline formed by two processing steps. Given a time instance during the trading process, the state of the market is described based on a (usually short) time window preceding the current instance. A set of hand-crafted features is selected to describe the dynamics of the market, leading to a vector representation. Based on such a representation, a classifier is then employed to predict the state of the market at a time instance within a prediction horizon, as illustrated in Fig 1. The majority of the studies, as discussed in the Literature section, utilize a limited number of features without providing any motivation about their selection. In this paper, we employ a large number of the technical indicators, state-of-the-art limit order book (LOB) features, and quantitative indicators [8]. We further propose a novel quantitative feature that is selected first among several features for the task of mid-price movement prediction. The use of different hand-crafted features leads to encoding different properties of the financial time-series and excluding some of these features can result in failing to exploit the relevant information. The definition of a good set of features is directly connected to the performance of the subsequent analysis since any discarded information at this stage cannot be recovered later by the classifier.
A common approach to address this problem is the use of feature selection methods (e.g., [9,10]) which can be performed in a wrapper fashion using various types of criteria for feature ranking. While the use of transformation-based dimensionality reduction techniques such as principal component analysis (PCA) or linear discriminant analysis can lead to a similar processing pipeline, in this paper, we are interested in defining the set of features that convey most of the information in the data. PCA is not considered in this study since it converts existing features to new ones which are not interpretable. That means that we will not be able to provide insight into which specific features are suitable for the task of mid-price movement prediction. The use of feature selection using unsupervised criteria, and in particular, the maximum entropy criterion, has been used in [11] and [12]. The motivation behind this approach is the fact that as the entropy of a feature increases (when it is calculated in a set of data), the data variance and, thus, the information it encodes, also increases. However, the combination of many high-entropy features in a vector-based representation does not necessarily lead to good classification performance. This is because different dimensions of the adopted data representation need to encode different information.
The main contribution of our work are three-fold. The first contribution is the use of an extensive list of technical indicators for high-frequency trading. The second contribution is a novel quantitative feature, named adaptive logistic regression feature, which was selected first among several feature selection metrics. The third contribution is an extensive evaluation of three feature sets (i.e., technical, quantitative, and LOB indicators) via the conversion of (i) entropy, (ii) linear discriminant analysis (LDA), and (iii) linear mean-square (LMS) as feature selection criteria. LMS, LDA, and radial basis function network (RBFN) are used as classifiers for the task of mid-price movement prediction task. Our findings suggest that the best performance is reached by using only a few (advanced) features derived from both quantitative and technical hand-crafted feature sets.
These different realizations (i.e., entropy, LMS, and LDA) of the feature selection method are applied to a wide pool of hand-crafted features, which are selected to cover both basic and advanced features from two different trading approaches (i.e., those focusing on technical and quantitative analyses). Technical analysis is based on the fact that price prediction can be achieved by monitoring price and volume charts, while quantitative analysis focuses on statistical models and parameter estimation. For the technical indicators, we calculate basic and advanced features accompanied by digital filters, while for the quantitative indicators, we primarily focus on time series analysis. The features and their respective descriptions are provided and used as input in twelve feature selection models (each corresponding to a different criterion and classifier combination) for the classification task. We present the best combinations of these two types of features and provide a comparison of the two trading styles of feature sets in terms of F1 performance. F1 score is a common test used to measure performance and is calculated as the harmonic mean of precision and recall. To the best of our knowledge, this is the first study to define which type of information needs to be used for high-frequency time series description and classification.
The remainder of the paper is organized as follows. We first provide a comprehensive literature review of the technical and quantitative features followed by the problem statement and data description. We then provide various realizations of the wrapper method adopted in our analysis, together with the empirical results. A detailed description of all features used in our experiments, as well as all ranking lists for each method, can be found in the Appendix section.

Related work
Algorithmic trading uses computers, under specific rules, to rapidly perform accurate calculations based on statistical analysis. A trader using machine learning (ML) techniques can use several tools based on this analysis in order to select the best trading strategy. However, a number of challenges remain to be solved. First, how can one determine which indicators (i.e. features) are able to secure a profitable move? second, do past and present prices contain all the relevant information? Several authors utilized technical indicators and quantitative analysis for several tasks using only a limited set of these features. Hidden patterns extracted from past data as well as statistical models can provide relevant information to the ML trader.
Technical analysis (e.g., [13]) has traditionally received less academic scrutiny than quantitative analysis. Nevertheless, several studies employ technical indicators as the main mechanism for signal analysis and price prediction. In the sphere of HFT, authors in [14] utilize seven trading rule families as a measure of the impact of trading speed, while in [15] the authors provide only a few technical indicators for high-speed trading. In the current ML era, authors in [1] used six basic technical indicators as feature representations for a decision support system based on artificial neural networks (ANN). Only ten technical indicators are utilized in [16] as input features for several ML algorithms (i.e. ANN, Support Vector Machines, Random Forest, and Naive Bayes) to predict stock trends. However, one can also resort to quantitative analysis, which involves ML traders using complex mathematical and statistical indicators when making trading decisions. Quantitative finance is a broad field, including portfolio optimization (e.g., [17,18]), asset pricing (e.g., [19,20]), risk management (e.g., [21,22]), and time series analysis (e.g., [23,24]). In this work, we focus on time series analysis and use ideas from financial quantitative time series analysis that have been adopted to Machine Learning. For example, authors in [25] use support vector machines and decision trees via correlation analysis for stock market prediction. Another aspect of quantitative analysis is building trading strategies such as mean-reversion as tested in [26]). An additional aspect of quantitative analysis is the calculation of order book imbalance for order imbalance strategies. This idea is used as one of the features in a deep neural network in [4].
In the present work, we focus on extracted hand-crafted features based on technical and quantitative analysis. We show that a combination of features derived from these groups can improve the forecasting ability of the algorithms. A combined method is employed by [27] for asset returns predictability based on technical indicators and time series models. To the best of our knowledge this is the first attempt to compare these trading schools using several feature selection methods in a wrapper fashion in HFT.

Problem formulation
HFT requires continuous analysis of market dynamics. One way to formulate these dynamics is by constructing a limit order book (LOB), as illustrated in Table 1. LOB is the cumulative order flow representing valid limit orders, that are not executed nor cancelled, which are listed in the so-called message list, as illustrated in Table 2. LOBs are multi-dimensional signals described by stochastic processes, and their dynamics are described as càdlàg functions (i.e., Table 1. Limit order book example: Wartsila Oyj on 01 June 2010. LOB is divided in 10 levels, where each level consists of four columns. These four columns refer to the ask price and volume and the bid price and volume, respectively. The best level is Level 1. It contains the best Ask price which is the minimum price that a seller is willing to accept for a share of stock and the best Bid price which is the maximum price that a buyer is willing to pay for a share of stock. Next to Level 1 is Level 2 and so on, up to Level 10 with the same formation but with worst Ask and Bid prices.  [2]). The functions are formulated for a specific limit order (i.e. an order with specific characteristics in terms of price and volume at a specific time t), as t), as: order = (t, Price t , Volume t ) that becomes active at time t holds that: order 2 LðtÞ; order = 2 lim order 0 "order x Lðorder 0 Þ. Depending on how the LOB is constructed, we treat the new information according to event arrivals. The objective of our work is to predict the direction (i.e. up, down, or stationary) of the midprice (i.e. (p a + p b )/2, where p a is the ask price and p b is the bid price at the first level of LOB). The goal is to utilize informative features based on the order flow (i.e. message list or message book [MB]) and LOB, which will help the ML trader improve the accuracy of mid-price movement prediction.

Feature pool
Limit Order Book (LOB) and Message Book (MB) are the main sources from which features are extracted. We provide a comprehensive list of features explored in the literature for technical and quantitative trading in Table 3. The description of the hand-crafted features, except the newly introduced feature, named Adaptive Logistic Regression feature, is provided in the Appendix where the description of the newly introduced feature, named Adaptive Logistic Regression feature, follows. The motivation for choosing the suggested list of features is based on an examination of all the basic and advanced features from technical analysis and comparisons with advanced statistical models, such as adaptive logistic regression for online learning. The present work has identified a gap in the existing literature concerning the performance of technical indicators and comparisons with quantitative models. This work sets the ground for future research since it provides insight into the features that are likely to achieve a high rank on the ordering list in terms of predictability power. To this end, we divide our feature sets into three main groups. The first group of features is extracted according to [28] and [29]. This group of features aims to capture the dynamics of the LOB. This is possible if we consider the actual raw LOB data and relative intensities of different look-back periods of the trade types (i.e. order placement, execution, and cancellation). The second group of features is based on technical analysis. The suggested list describes many of the existing technical indicators (basic and advanced). Technical indicators might help traders spot hidden trends and patterns in their time series. The third group of features is derived according to quantitative analysis, which is mainly based on statistical models; it can provide statistics that are hidden in the data. This can be verified by the ranking process, where the proposed advanced online feature (i.e. adaptive logistic regression) is ranked first in most of the feature selection metrics (i.e. four out of five feature lists). The suggested features are fully described in the Appendix; whereas, the proposed adaptive logistic regression feature is described next.

Proposed adaptive logistic regression feature
We introduce a novel logistic regression model that we use as a feature in our experimental protocol. The motivation for this model is the work done in [4] where the focal point is the local behavior of LOB levels. We extend this idea by doing online learning with an adaptive learning rate. The new feature operates under an online learning mechanism by taking into consideration the latest trading event of the 10-event message book block which means that the forecasting of the next mid-price move is updated according to the latest information flow. More specifically, we use the Hessian matrix as our adaptive rate. We also report the ratio of the logistic coefficients based on the relationship of the LOB levels close to the best LOB level and the ones which are deeper in the LOB. Since 0 � h θ (V)�1 and V are the stock volumes for the first best six levels of the LOB, we formulate the model as follows: be the logistic function (i.e. Hypothesis function) and for m training samples and the cost function, based on this probabilistic approach, is as follows: À y ðiÞ logðh y ðV ðiÞ ÞÞ À ð1 À y ðiÞ Þlogð1 À h y ðV ðiÞ ÞÞ The next step is the process of choosing θs for optimizing (i.e. minimizing) J(θ). To do so, we will use Newton's update method: where the gradient is: ðh y ðV ðiÞ Þ À y ðiÞ ÞV ðiÞ ð5Þ and the Hessian matrix is: with V ðiÞ ðV ðiÞ Þ T 2 R ðnþ1Þ�ðnþ1Þ and y (i) are the labels which are calculated as the differences of the best level's ask (and bid) prices. The suggested labels describe a binary classification problem since we consider two states, one for change in the best ask price and another one for no change in the best ask price. We perform the above calculation in an online manner. The online process considers the 9 th element of every 10 MB block multiplied by the θ coefficient first-order tensor to obtain the probabilistic behavior (we filter the obtained first-order tensor through the hypothesis function) of the 10 th event of the 10 MB block. The output is the feature representation expressed as a scalar (i.e. probability) of the bid and ask price separately.

Wrapper method of feature selection
Feature selection is an area that focuses on applications with multidimensional datasets. A ML trader performs feature selection for three primary reasons: to reduce computational complexity, to improve performance, and to gain a better understanding of the underlying process. Feature selection, as a pre-processing method, can enhance classification power by adding features that contain information relevant to the task at hand. There are two metaheuristic feature selection methods: the wrapper method and the filter (i.e. transformation) -based method. We choose to perform classification based on the wrapper method since it considers the relationship among the features while filter methods do not.
Our wrapper approach consists of five different feature subset selection criteria based on two linear and one non-linear methods for evaluation with a general description in Algorithm 1 where: l opt is the optimal feature list per criterion, Feature_Select is the name of the function algorithm, X is the input feature set, labels represent the movements of mid-price, l is the current feature list up to D feature dimensions, N is the sample size, curr_X is the current input data up to the current number i of the features together with the optimal ones up to i, and best_d is a vector that saves the best feature out of the entire list of features. More specifically, we convert entropy, LMS, and LDA as feature selection criteria. For the last two cases (i.e., LMS and LDA) we provide two different selection criteria as follows: i) for LMS the first metric follows the L 2 -norm and the second metric is a statistical bias measure, and ii) for LDA the first metric is based on the ratio of the within-class scatter matrix while the second metric is derived according to the between-class scatter matrix. For classification evaluation we utilize LMS, LDA and a radial basis function network (i.e., [30]) as these were used in [29] and [31]. Our choice to apply these linear and non-linear classifiers is informed by the amount of data in our dataset (details will be provided in the following section). We measure classification performance according to accuracy, precision, recall, and F1 scores for every possible combination of the hand-crafted features by using LMS, LDA, and RBF ANN. These perfromance measures are defined as follows: Recall where TP and TF represents the true positives and true negatives, respectively, of the midprice prediction label compared with the ground truth, where FP and FN represents the false positives and false negatives, respectively. Algorithm 1 Wrapper-Based Feature Selection

Feature sorting
We convert sample entropy, LMS, and LDA (for the latter two methods we use two different criteria for feature evaluation) into feature selection methods. A visual representation of the feature sorting method can be seen in Fig 2. It can be briefly described as follows: the process of feature sorting and classification is based on the wrapper method. From left to right: 1) Raw data is converted to LOB data via superclustering, 2) the feature extraction process follows (i.e., three feature sets are extracted), 3) then every feature set is normalized based on z-score in a rolling window, 4) Wrapper method: there are two main blocks in this process-the first block refers to sorting (i.e., five different sorting criteria based on entropy, LMS1, LMS2, LDA1, and LDA2) of the feature sets independently (i.e., LOB features, technical indicators, and quantitative indicators are sorted separately) and all together (i.e., the three feature sets are merged and sorted all together) and the second block refers to the incremental classification (i.e., we increase the dimension of each sorting list during classification by one feature in each . The next step is the z-score normalization of the suggested feature representations which are used as inputs to the wrapper protocol. The wrapper protocol is the main block of our experimental analysis. This main block is divided into two secondary blocks (i.e., yellow-colored blocks). The top yellow block acts as an incremental sorting method since it uses five different methods (i.e., entropy, LMS1, LMS2, LDA1, and LDA2) for feature evaluation. When the five sorted feature lists are ready then each sorted feature goes through a classifier (i.e., three different classifiers in our case based on RBFN, LDA, and LMS). This last part of the wrapper topology performs the classification task of the mid-price movement prediction.
Feature sorting with entropy. We employ entropy [32], which is a measure of signal complexity where the signal is the time series of the multidimensional two-mode tensor with dimensions R p�n , and where p is the number of features and n is the number of samples, as a measure of feature relevance. We calculate the bits of each feature in the feature set iteratively and report the order. We measure the entropy as follows: where p(x i ) is the probability of the frequency per feature for the given data samples.
Feature sorting with least-mean-squares. We perform feature selection based on the least-mean square classification rate (LMS1) and L 2 -norm (LMS2). LMS is a fitting method that aims to produce an approximation that minimizes the sum of squared differences between given data and predicted values. We use this approach to evaluate the relevance of our handcrafted features. A hand-crafted feature evaluation is performed sequentially via LMS. More specifically, each of the features is evaluated based on the classification rate, the L 2 -norm of the predicted labels, and the ground truth. The evaluation process is performed as follows: where H 2 R p i �n is the input data with feature dimension p i which is calculated incrementally for the number of training samples n, W 2 R p i �#cl are the weighted coefficients for the number of features p i of the number() of classes (i.e. up, down, and stationary labelling), and T 2 R #cl�n represents the target labels of the training set. The weight coefficient matrix W is estimated via the following formula: where H † is the Moore-Penrose pseudoinverse matrix. Feature sorting with linear discriminant analysis. Linear discriminant analysis (LDA) can be used for classification and dimensionality reduction. However, instead of performing these two tasks, we convert LDA into a feature selection algorithm. We measure feature selection performance based on two metrics. One is the classification rate (LDA1) and the other is based on the error term (LDA2), which we define as the ratio of the within-class scatter matrix and the between-class scatter matrix. The main objective of LDA is finding the projection matrix W 2 R m�#clÀ 1 , where m is the sample dimension, and cl is the number of classes, such that Y = W T X maximizes the class separation. For the given sample set :::;#cl is the class-specific data subsample, we need to find W that maximizes the Fisher's ratio: where and are the between-class and within-class scatter matrices, respectively, with In a similar fashion, we compute the projected samples ' k Y k , while the scatter matrices (i.e. within and between scatter matrices, respec- The above calculations constitute the basis for the two metrics that we use to evaluate 285 the hand-crafted features incrementally. The two evaluation metrics are based on the 286 classification rate and the ratio of the within-class and between-class scatter matrices of the projected space Y.

Classification for feature selection
We perfrom classification evaluation based on three classifiers: LMS, LDA, and RBFN. The basic concept of the first two methods was discussed above while RBFN classifier is described in the following section.
RBFN classifier. We utilize a single-layer feedforward neural network (SLFN) as this suggested by [33]. A detailed description and the implementation of the method can be found in [29]. In order to train this model fast, we follow the procedures outlined in [34], and [35]. We use K-means clustering for K prototype vectors identification, which then are used as the network's hidden layer weights. After determining the the network's hidden layer weights V 2 R D�K , the input data x i , i = 1, . . ., N are mapped to vectors h i 2 R K in the feature space determined by the network's hidden layer outputs R K . Then a radial basis function is used, i.e. h i = ϕ RBF (x i ), calculated in an element-wise manner, as follows: where σ is a hyper-parameter denoting the spread of the RBF neuron and v k corresponds to the k-th column of V. The network's output weights W 2 R K�C are determined by solving the following equation: where H = [h 1 , . . ., h N ] is a matrix formed by the network's hidden layer outputs for the training data and T is a matrix formed by the network's target vectors t i , i = 1, . . ., N. The network's output weights are given by: Then a new (test) sample x 2 R D is mapped to its corresponding representations in spaces R K and R C , i.e. h = ϕ RBF (x) and o = W T h, respectively. Finally, the classification task is based on the maximal network output, i.e.:

Experimental results
In this section, we provide details regarding the conducted experiments. The goal of the experiments is to predict the mid-price state (i.e. up, down, and stationary) for ITCH feed data in millisecond resolution (ITCH is a type of direct data-feed protocol and the acronym according to Nasdaq carry no semantic meaning). Additional information regarding the dataset can be found in [29]). For the experimental protocol, we followed the setup in [29], which is based on the anchored cross-validation format. According to this format, we use the first day as training and the second day as testing for the first fold, whereas the second fold consists of the previous training and testing periods as a training set, and the next day is always used as the test set. Each of the training and testing sets contains the hand-crafted feature representations for all the five stocks from the FI-2010 dataset. Hence, we obtain a mode-three tensor of dimensions 273 × 458, 125. The first dimension is the number of features, whereas the second one is the number of sample events. At this point, we must specify that the process of hand-crafted feature extraction is conducted in the full length of the given information based on MB with 4,581,250 events. The motivation for taking separate blocks of messages of ten events is the time-invariant nature of the data. To keep the ML trader fully informed regarding MB blocks, we use features that convey this information by calculating, among others, price and volume averages, regression coefficients, risk factors, and online learning feedback indicators. The results we present here are the mid-price predictions for the next 10 th , 20 th , and 30 th events (i.e. translated into MB events) or else one, two, and three next events after the current state translated into a feature representation setup. The prediction performance of these events is measured by the accuracy, precision, recall and F1 score, whereas F1 score is further emphasized, as it can only be affected in one direction by skewed distributions for unbalanced classes, as observed in our data. Performance metrics are calculated against mid-price labelling calculation of ground truth extraction. More specifically, we extract labels based on the percentage change of the smoothed mid-price with a span window of 9, for our supervised learning methods, computed as follows: where MP curr is the current mid-price, and MP next is the next mid-price. The percentage change identification is thresholded by an empirically fixed value γ = 0.002 rolling z-score normalization is performed on the dataset to avoid look-ahead bias (look ahead bias refers to the process that future information is injected to the training set). The rolling window z-score normalization is based on the anchored cross-validation setup which means that the normalization of the training set is unaffected by any future information.
We report our results in Tables 4-8 for each possible combination of feature set, sorting and classification method used. RBFN classifier operates under the extreme learning machine model with a slight modification in the initialization process of weights calculation based on the K-means algorithm. The full description of this method can be found in [29]. We provide results based on the whole feature pool (see Table 4), the first feature pool according to [28] and [29] (see Table 5), based only on technical indicators (see Table 6) and the quantitative indicators (see Table 7). More specifically, for the first feature pool, we have 136 features, while for the second pool we have 82 features, and for the last pool we have 55 features; in total, we have 273 features.
The number of top features that used in the above methods is different in each case and can be monitored in Figs 3 and 4. We should point out that we tested all the possible combinations for the five sorting methods and the three classifiers (i.e., 15 different cases) but we report only results that exhibit some variations. For instance, in Tables 5-7 we report results for entropy as sorting method and LMS together with RBFN as classifiers but not with LDA (as a classifier) since the last method reports similar results. We focus on F1 score and particularly on F1-macro (i.e., F1-macro = 1 C P k2C F1 k , with C as the number of classes for the 9-fold experimental protocol) results, based on the total feature pool, for the five sorting lists classified per LMS, LDA, and RBFN for the next T = 10 th , 20 th , and 30 th events, respectively, as the predicted horizon. Again, as the number of top features used in the above methods is different in each case (as seen in Fig 3), it can be briefly described as follows: Bar plots with variance presents the average (i.e. average F1 performance for the 9-fold protocol for all the features) F1 score of the 12 different models for the cases of 5, 50, 100, 200, and 273 number of top features. The order of the models from the left to the right column is:     There is a dual interpretation of the suggested feature lists and wrapper method results. Regarding the feature lists, we have five different feature sorting methods starting from entropy, to LMS1 and LMS2 and continue to LDA1 and LDA2. More specifically, results based on the entropy sorting method reveal that the top 20 features almost entirely come from technical indicators (i.e. only 1 out of 20 comes from the first basic group), while the first 100 top ranked features include 36 quant features, 48 technical features, and 16 features from the first basic group.
For the LMS case, we present two sorting lists where we use two different criteria for the final feature selection. In the LMS1 case, the top 20 features are derived mainly from quantitative analysis (11 out of 20), 7 features come from the first basic group, and only 2 from the technical group of features. The top (best) feature is the proposed advanced feature based on the logistic regression model for online learning. For the same method, the first 100 best features include 25 quant features, 18 technical features, and the remaining 57 features come from the first basic group. In the LMS2 case, the first top 20 features include 7 quant, 9 technical, and only 4 features from the first basic group. LMS2 also selects the advanced feature based on the logistic regression model for online learning as its first top feature.
The last method that we use as the basis for the feature selection process is based on LDA. Similarly, we use two different criteria as a measure for the selection process. In Table 9.
The second interpretation of our findings is the performance of the 12 different classifiers (based on LMS, LDA, and RBFN) used to measure, in terms of F1 score, the predictability of the mid-price movement. Fig 3 provides an overview of the F1 score performance in terms of best feature numbers and classifiers. We can divide these twelve models (pairs based on the sorting and classification method) into three groups according to their response in terms of information flow. The first group, where LMS2-LMS, LDA2-LMS, LMS2-RBFN, and LDA2-RBFN belong, reach their plateau very early in the incremental process of adding less informative features. These models were able to reach (close to) their maximum F1 score performance with approximately 5 top features, which means that the dimensionality of the input matrix to the classification model is quite small. The second group of models, Entropy-LMS, LMS1-LMS, LDA1-LMS, Entropy-RBFN, LMS1-RBFN, and LDA1-RBFN, had a slower reaction in the process of reaching their top F1 score performance. The last group of models, LDA1-LDA and LDA2-LDA, reached their best performance (which is not higher than that of the other models) very early in the process with only five features.
The conducted experiments show that this quantitative analysis can provide significant trading information; however, the results improve when technical features are incorporated in the feature set. All top listed features include the logistic regression model based feature. This shows that more advanced quantitative features may provide the ML trader with vital Table 9. List for the first 10 best features for the 5 sorting methods. information regarding metrics prediction. One implication of the proposed experimental protocol is that the development of advanced hand-crafted features as part of a wrapper framework requires from the ML trader to compare and combine several sets until a target level is reached.

Conclusion
In this paper, we proposed extracted hand-crafted features inspired by technical and quantitative analysis and tested their validity on the mid-price movement prediction task. We introduced a novel quantitative feature based on adaptive logistic regression for online learning and used a wrapper feature selection method by utilizing entropy, least-mean squares, and linear discriminant analysis to guide feature selection combined with linear and non-linear classifiers. This work is the first attempt of this extent to develop a framework in information edge discovery via informative hand-crafted features. Therefore, we provided the description of three sets of hand-crafted features suitable for high-frequency trading (HFT) by considering each 10-message book block as a separate trading unit (i.e., trading days).
We evaluated our experimental framework on five ITCH feed data stocks from the Nordic stock market. The dataset contained over 4.5 million events which were incorporated into the hand-crafted features. The results suggest that sorting methods and classifiers can be combined in such a way that market makers and traders can reach, with only a few informative features, top performance levels. Furthermore, the proposed advanced quantitative feature based on logistic regression for online learning has most of the time been selected as the top feature by the sorting methods. This is a strong indication for future research on developing more advanced features combined with more sophisticated feature selection methods. Classification performance can be easily improved by using more advanced classifiers such as convolutional neural networks and recurrent neural networks. Our work opens avenues for other applications as well. For instance, the same type of analysis is suitable for exchange rates and bitcoin time series analysis. As part of our future work, we also intend to test our experimental protocol on a longer trading periods.

Feature pool
First group of features. This set of features is based on the work in [28] and [29] and is divided into three groups: basic, time-insensitive, and time-sensitive features. These are fundamental features since they reflect the raw data directly without any statistical analysis or interpolation. We calculated them as follows:

PLOS ONE
Mid-price prediction based on machine learning methods

Basic
which represents the raw data of the 10 levels of our LOB, where P ask i , V ask i , P bid i , and V bid i are the Prices and Volumes for the ask and bid sides for every LOB level i, respectively. Time-Insensitive • Price Differences ¼ fP ask n À P ask 1 ; P bid 1 À P bid n ; jP ask iþ1 À P ask i j; jP bid iþ1 À P bid i jg n iþ1 • Price&Volume Means ¼ 1

Time-Sensitive
where λ denotes the average short or longer-term intensity per type and horizon.
Technical analysis. Technical analysis is based mainly on the idea that past data provides all the relevant information for trading prediction. The prediction, based on technical analysis, takes place according to open-close-high and low prices in day-to-day trading. We adjust this idea to the HFT ML problem for every 10-MB block of events. More specifically, we consider every 10-MB block as a 'trading' day (i.e. with t as the current 10-MB block and t-1 the previous 10-MB block), and we extract features according to this formation as follows: Accumulation Distribution Line. Accumulation Distribution Line (ADL) [36] is a volumebased indicator for measuring supply and demand and is a three-step process: with C t , L t , and H t being the closing, lowest, and highest current 10-MB block prices, respectively, and BlockPeriodVolume, ADL t−1 , and MoneyFlowVolume t are the total amounts of 10-MB block volume density, the previous ADL price, and the current MoneyFlowVolume t , respectively.
Awesome oscillator. An awesome oscillator (AO) [37] is used to capture market momentum. Here, we adjust the trading rules according to the previous block horizon investigation to 5 and 34 previous 10-MB blocks as follows: where SMA 5 and SMA 34 are the simple moving averages of the previous 5 and 34 previous blocks, respectively.
Accelerator oscillator. An accelerator oscillator [37] is another market momentum indicator derived from AO. It is calculated as follows: Average directional index. An average directional index (ADX) indicator [38] has been developed to identify the strength of a current trend. The ADX is calculated as follows: Average directional movement index rating. An average directional movement index rating (ADXR) evaluates the momentum change of ADX, and it is calculated as the average of the current and previous price of ADX: Displaced moving average based on williams alligator indicator. A displaced moving average [39] is the basis for building a trading signal named Alligator. In practice, this is a combination of three moving averages (MA). We adjust this idea as follows: where SMA 13 ((H t + L t )/2), SMA 8 ((H t + L t )/2), and SMA 5 ((H t + L t )/2) are the simple moving averages based on the previous 13, 8, and 5 average highest and lowest block prices, respectively.
Absolute price oscillator. An absolute price oscillator (APO) belongs to the family of price oscillators. It is a comparison between fast and slow exponential moving averages and is calculated as follows: where EMA 5 (M t ) and EMA 13 (M t ) are the exponential moving averages of range 5 and 13 periods, respectively, for the average of high and low prices of the current 10-MB block.
Aroon indicator. An Aroon indicator [40] is used as a measure of trend identification of an underlying asset. More specifically, the indicator has two main bodies: the uptrend and downtrend calculation. We calculate the Aroon indicator based on the previous twenty 10-MB blocks for the highest-high and lowest-low prices, respectively, as follows: where H high 20 and L low 20 are the highest-high and lowest-low 20 previous 10-MB block prices, respectively.
Aroon oscillator. An Aroon oscillator is the difference between Aroon Up and Aroon Down indicators, which makes their comparison easier: • Arron Oscillator = Aroon Up -Aroon Down Average true range. Average true range (ATR) [41] is a technical indicator which measures the degree of variability in the market and is calculated as follows: • ATR = (ATR t−1 × (N − 1)+ TR)/N Here we use N = 14, where N is the number of the previous 10-TR values, and ATR t−1 is the previous ATR 10-MB block price.
Bollinger bands. Bollinger bands [42] are volatility bands which focus on the price edges of the created envelope (middle, upper, and lower band) and can be calculated as follows: • BB middle = SMA 20 (CL) where BB middle , BB upper , and BB lower represent the middle, upper, and lower Bollinger bands, SMA 20 (CL) represents the simple moving average of the previous twenty 10-block closing prices, and BB std 20 represents the standard deviation of the last twenty 10-MB blocks.
Ichimoku clouds. Ichimoku clouds [43] are 'one glance equilibrium charts,' which means that the trader can easily identify a good trading signal and is possible since this type of indicator contains dense information (i.e. momentum and trend direction). Five modules are used in an indicator's calculation: where H, L, and CL denote the highest, lowest, and closing prices of the 10-MB raw data, respectively, where subscripts 9, 26, and 52 denote the past horizon of our trading rules, respectively.
Chande momentum oscillator. A Chande momentum oscillator (CMO) [40] belongs to the family of technical momentum oscillators and can monitor overbought and oversold situations. There are two modules in the calculation process: where CL i is the 10-block's closing price with i = 1, and CL t and CL t−19 are the current block's closing price and the 19 previous blocks' closing prices, respectively.
Chaikin oscillator. The main purpose of a Chaikin oscillator [44] is to measure the momentum of the accumulation distribution line as follows: where MFM and MFV stand for Money Flow Multiplier and Money Flow Volume, respectively, V is the volume of each of the trading events in the 10-block MB, and EMA 3 (ADL) and EMA 10 (ADL) are the exponential moving average for the past 3 and 10 10-MB blocks, respectively.
Chandelier exit. A Chandelier exit [45] is part of the trailing stop strategies based on the volatility measured by the ATR indicator. It is separated based on the number of ATRs that are below the 22-period high (long) or above the 22-period low (short) and is calculated as follows: where H 22  Center of gravity oscillator. The center of gravity oscillator (COG) [46] is a comparison of current prices against older prices within a specific time window and is calculated as follows: where M t is the current mid-price of the highest and lowest prices of each of the 10-MB blocks, and r is a weight that increases according to the number of the previous M t−1 prices. Donchian channels. The Donchian channel (DC) [47] is an indicator which bands the signal and notifies the ML trader of a price breakout. There are three modules in the calculation process: where H high 20 and L low 20 are the highest high and lowest low prices of the previous twenty 10-MB blocks.
Double exponential moving average. A double exponential moving average (DEMA) [48] provides a smoothed average and offers a diminished amount of delays as follows: where EMA 20 is the exponential moving average of span 20 of the closing prices under the 10-MB block format.
Detrended price oscillator. A detrended price oscillator (DPO) is an indicator used for short-term and long-term signal identification. The DPO eliminates cycles that are longer than the MA horizon. On day-to-day trading, the closing prices are considered for the calculation, but here, we use the highest 10-MB block price as follows: • DPO ¼ ðH high 10 =ð10 þ 2ÞÞ À SMA 10 ðCLÞ.
Heikin-Ashi. Heikin-Ashi [49] is a candlestick method and is described as a visual technique that eliminates irregularities: where O t−1 and CL t−1 are the open and close prices of the previous 10-MB block.
Highest high and lowest low. Highest high and lowest low creates an envelope of the trading signal for the last twenty 10-MB blocks: Hull MA. A Hull moving average is a weighted moving average that reduces the smoothing lag effect by using the square root of the block period. It is calculated as follows: • HULL MA ¼ WMA ffi ffi ffi where WMA 5 (AHL) and WMA 10 (AHL) denote the weighted moving average of the average high and low 10-MB block for periods 5 and 10, respectively.
Internal bar strength. Internal bar strength (IBS) [50] is based on the position of the day's closing price in relation to the day's range where we adjust this idea to the 10-MB block setup as follows: Keltner channels. Keltner channels [51] are based on Bollinger bands. The main difference, for this volatility-based indicator, is that it uses ATR instead of standard deviation, as follows: Moving Average Convergence/Divergence Oscillator (MACD). A moving average convergence/divergence oscillator [52] is a measure of the convergence and divergence of two moving averages and is calculated as follows: where AHL is the average of high and low prices for 12 and 26 previous 10-MB blocks, respectively, with EMA 12 (AHL) and EMA 26 (AHL) as the exponential moving average of AHL of span 12 and 26, respectively.
Median price. Median price is an indicator which simplifies the price overview. We calculate this indicator based on the 10-MB block highest and lowest average prices: Momentum. A momentum (MOM) indicator measures the rate of change of the selected time series. In our case, we calculate it based on closing prices: Variable moving average. A variable moving average (VMA) [53] is a dynamic indicator which acts as a variable-length moving average with volatility-adaptation capabilities. We calculate VMA based on the efficiency ratio (ER) as follows: where α = 2/(N+ 1), for N = 3 previous 10-MB blocks, is the adaptive parameter. Normalized average true range. A normalized average true range (NATR) normalizes the average true range as follows: Percentage price oscillator. A percentage price oscillator (PPO) displays the convergence and divergence of two moving averages and focuses on the percentage change of the larger moving average, as follows: • MACD = EMA 12 (AHL) − EMA 26 (AHL) • PPO = (MACD/EMA 26 (AHL)) × 100.

Rate of change.
Rate of change (ROC) measures the ascent or descent speed of the time series change: Relative strength index. A relative strength index (RSI) [38] is a measure of the velocity and magnitude of directional time series movements and is calculated as follows: where AG 14 and AL 14 denotes the average gain and loss of the last fourteen 10-MB blocks, respectively.
Parabolic stop and reverse. Parabolic SAR (PSAR) [41] is a trend following indicator which protects profits. There are two main modules for its calculation, the Rising SAR and the Falling SAR, and they are calculated as follows: • Rising SAR • AF = Incremental increase of a predefined step where AF is the acceleration factor, and EP is the extreme point Standard deviation. Standard deviation is a measure of volatility. We calculate this indicator based on the closing prices of every 10-MB block, as follows: • Deviation = CL t − SMA 10 (CL) • SASD ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi SMA 10 ðSVDÞ p where SMA 10 (CL) is the simple moving average of the last 10 closing 10-MB prices, SASD is the squared deviation of the SMA of the standard deviation (SVD) of the last 10 closing values of our 10-MB blocks. Stochastic relative strength index. A stochastic relative strength index (Stoch RSI) [40] is a range-bound momentum oscillator which provides information for the RSI based on the closing prices in terms of high and low stock prices: • Stoch RSI ¼ ðRSI curr À RSI L Low 10 Þ=ðRSI H High 10 À RSI L Low 10 Þ where RSI L Low 10 and RSI H High 10 are the lowest low and highest high of the last ten RSI values.
T3-triple exponential moving average. A triple exponential moving average [54] is a moving average indicator where the main motivation for its development is to reduce lag in the time series response. For this reason, we use the closing prices for our calculation and perform a reversal explanation calculation as follows: with: • EMA 2 = EMA 10 (EMA 1 ) • EMA 3 = EMA 10 (EMA 2 ) • EMA 4 = EMA 10 (EMA 3 ) • EMA 5 = EMA 10 (EMA 4 ) where α is the volume factor, and EMA 10 (CL) is the exponential moving average of the 10 previous 10-MB closing prices.
Triangular moving average. A triangular moving average (TRIMA) is the average of the time series with emphasis placed on the middle region: • TRIMA = SMA 10 (SMA 10 (SMA 10 (CL))) where, for its calculation, we use the closing prices of the last 10 10-MB blocks.
Triple exponential average. A triple exponential average (TRIX) is a momentum oscillator which measures the rate of change of the triple smoothed moving average as follows: • EMA First = EMA 10 (CL) • EMA Double = EMA 10 (EMA First ) • EMA Triple = EMA 10 (EMA Double ) • TRIX = 1-period Rate of Change.
True strength index. A true strength index (TSI) [55] is an indicator which specifies the overbought and oversold levels with market return anticipation. We calculate TSI as follows: • TSI = 100 × EMA 2 /EMA 4 where PC represents the closing price differences for the whole time series lookback period.
Ultimate oscillator. An ultimate oscillator (UO) [56] is a momentum oscillator indicator with a multiple timeframe perspective. There are three main modules as presented in the following calculations: • Average of seven 10-MB blocks • Average of fourteen 10-MB blocks TR l • Average of twenty-eight 10-MB blocks where BP represents buying pressure.
Weighted close. Weighted close (WCL) is the average of the four universal types of prices which are included in each of our 10-MB blocks: [56] is a momentum technical indicator which informs the ML trader whether the market is trading close to the high or low trading range. It is calculated as follows:
Zero-lag exponential moving average. Zero-lag exponential moving average (ZLEMA) belongs to the EMA family of indicators where the main purpose is to reduce or remove the impulse lag by introducing an error term. It is calculated as follows: where lag = (N − 1)/2 with N = 1 in our case.
Fractals. A fractal [39] is an indicator used to detect top and bottom trends by focusing on five consecutive blocks, which, in our case, are five 10-MB blocks used for two different scenarios:

• Buy Fractals
A buy fractal is a sequence of five consecutive 10-MB blocks where the highest high is preceded by two lower highs and is followed by two lower highs.

• Sell Fractals
The opposite framework is a sell fractal. 10-MB blocks can overlap in the quest of these two types of fractals.
Here, we calculate fractals separately for the open, close, lowest, and highest 10-MB block prices.
Linear regression line. Linear regression line (LRL) is a basic statistical method that provides information for a future projection wherein trading is used to capture overextended price trends. We perform LRL for each 10-MB block without any prior stationarity assumptions. The basic calculations are as follows: ðMB prices ðiÞÀ MB prices ÞðPVðiÞÀ PV Þ ! � ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P where n denotes the 10-MB lookback period • Detrend-Least Squares Fitting Line where α and b are the regression coefficients of g, and x represents the 10-MB closing prices. Beta-like calculation. Beta [61] is a volatility indicator which considers market risk. We adjust the notion of beta calculation to our experimental protocol where we index based on the average of the closing prices (Av CL ) with Av t as the current MB block price and Av t−1 as the previous MB block's closing price. Our calculations are as follow: • Index CL = CL t /CL t−1 • Index Av CL ¼ Av t =Av tÀ 1 • Dev CL = Index CL − SMA 10 (Index CL ) • Dev Av CL ¼ Index Av CL À SMA 10 ðIndex Av CL Þ • Beta ¼ cov 10 ðDev CL ; Dev Av CL Þ=var 10 ðDev Av CL Þ where cov 10 ðDev CL ; Dev Av CL Þ represents the covariance between the current closing price and the average of the previous ten 10-MB closing prices, and var 10 ðDev Av CL Þ is the variance of the sum of the ten previous Index Av CL .
Quantitative analysis. Quantitative analysis captures trading activity mainly via statistical modelling. We focus on time series analysis, and more specifically, we examine features such as autocorrelation and partial autocorrelation, among others (e.g., statistical tests), while at the end of the section, we build an ML feature extraction method based on an online learning setup and test the validity of our hypothesis.
Autocorrelation and partial correlation. Autocorrelation and partial correlation [62,63] are key features in the development of time series analysis. We treat our time series (i.e. stock prices and log returns per 10-MB blocks) as stationary stochastic processes since we estimate their local behavior based on 10-MB blocks: • Autocorrelation • ac k ¼ E ðz t À mÞðz tþk À mÞ ½ � ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi E ðz t À mÞ 2 ½ � E ðz tþk À mÞ 2 ½ � p where z t and z t+ k are the time series of lag k, m ¼ E z t ½ � ¼ R 1 À 1 zpðzÞdz and s 2 z ¼ E ðz t À mÞ 2 � � ¼ R 1 À 1 ðz À mÞ 2 pðzÞdz are the constant mean and constant variance respectively.
• Partial Correlation • For the general case of an autoregressive model AR(p), we have: x i+ 1 = ϕ 1 x i + ϕ 2 x i−1 + . . .+ ϕ p x i−p+ 1 + ξ i+ 1 of lag 1 up to p follows: ð� j < x iÀ pþ1 x iÀ jþ1 >Þ by diving with N − 1 and autocovariance of zero separated periods (where the autocovariance function is even), all the lag periods above will be: where r 1 , r 2 , . . ., r k , r p can be described by the matrix operations RF = r, R 2 R p�p , Φ 2 R p�1 and r 2 R p�1 . The symmetric and full rank F are as follows:Φ ¼ R À 1 r.
• Yule-Walker Equations calculation: Cointegration. We investigate time-series equilibrium [64,65] by testing the cointegrated hypothesis. Utilizing the cointegration test will help ML traders avoid the problem of spurious regression. We employ the Engle-Granger (EG) test for the multivariable case of LOB ask (A t ) and bid (B t ) times series. We formulate the EG test for the ask and bid LOB prices as follows: • A t and B t * I(d), where I(d) represents the order of integration • Cointegration equation based on the error term: u t = A t − αB t • EG Hypothesis: u(t)* I(d), d 6 ¼ 0 • Perform ordinary leat squares (OLS) for the estimation ofâ and unit root test for: Order book imbalance. We calculate the order book imbalance [4] based on the volume depth of our LOB as follows: where V a l and V b l are the volume sizes for the ask and bid LOB sides at level l.

Additional results
Tables 5-9 are provided below:

Feature sorting lists
These are the five sorted lists of the 273 hand-crafted features. A detailed feature name list follows.