Systematic assessment of intrinsic factors influencing visual attention performances in air traffic control via clustering algorithm and statistical inference

The intrinsic factors (IF) influencing visual attention performance (VAP) might cause potential human errors, such as “error/mistake”, “forgetting” and “omission”. It is a key issue to develop a systematic assessment of IF in order to distinguish the levels of VAP. Motivated by the Stimulus-Response (S-R) model, we take an interactive cancellation test—Neuron Type Test (NTT)—to explore the IF and present the corresponding systematic assessment. The main contributions of this work include three elements: a) modeling the IF on account of attention span, attention stability, distribution-shift of attention with measurable parameters by combining the psychological and statistical concepts; b) proposing quantitative analysis methods for assessing the IF via its computational representation—intrinsic qualities (IQ)—in the sense of computational model; and c) clustering the IQ of air traffic control (ATC) students in the feature space of interest. The response sequences of participants collected with the NTT system are characterized by three parameters: Hurst exponent, normalized number of decisions (NNoD) and error rate of decisions (ERD). The K-means clustering is applied to partition the feature space constructed from practical data of VAP. For the distinguishable clusters, the statistical inference is utilized to refine the assessment of IF. Our comprehensive analysis shows that the IQ can be classified into four levels, i.e., excellent, good, moderate and unqualified, which has a potential application in selecting air traffic controllers subject to reducing the risk of the inadequacy of attention performances in aviation safety management.


Backgrounds and motivations
Human error is a major source of air traffic control (ATC) accidents and incidents [1][2][3][4][5][6]. Luo [7] showed that 74.3% of the ATC-related accidents in China are caused by air traffic controllers (controllers, for simplicity) rather than by ATC equipment. The Aviation Safety Reporting System (ASRS) indicated that aviation accidents are closely related to visual attention performance (VAP) [8]. The objective of the controller is to resolve flight conflicts and ensure flight safety while dealing with a large amount of visual and auditory information and making decisions accurately, rapidly and stably [9][10][11]. Under hazardous weather, communication failures and/or aircraft mechanical failures, controllers may commit unsafe acts which lead to human errors such as "error/mistake", "forgetting" and "omission". Controllers therefore need to concentrate and allocate their attention properly so as to avoid flight accidents or incidents [12,13]. The history of ATC shows that controllers' misidentification may cause severe aviation accidents or incidents. On 11 October 2016, a serious runway incursion during takeoff occurred at Shanghai Hongqiao International Airport, China, because of a controller's erroneous operations [14,15]: the controller forgot the runway state and issued a mistaken command. Two main factors contributed to this incident: a lapse of attention and the lack of expertise of a novice in ATC.
A key cause of operational errors by novice controllers is inadequate VAP [16]. According to applied attention theory [17], visual misidentification may occur due to individual factors and attentional slips, and it usually engenders errors in visual detection and search. A considerable body of literature on human perception processing [18][19][20][21][22][23] and on visual search and cognition mechanisms [24][25][26][27] emphasizes the mechanism of attention, which plays a fundamental role in human perception, cognition and action. There are two categories of factors influencing VAP: extrinsic factors (EF), i.e., the impersonal factors that affect attention, and intrinsic factors (IF), i.e., the person-related factors that affect attention. Our emphasis is put on the IF influencing VAP and their systematic assessment, since inadequate VAP may cause serious human errors in civil aviation. Descriptions of the EF influencing VAP are abundant, such as budget of search, complexity of task, and noisy or dark environments [21,22]. However, studies on the relation between human errors and the IF influencing VAP, such as attention span, attention stability, and distribution-shift of attention [20][28][29][30], are scarce. Moreover, these IF are conceptual and there is no quantitative method to analyze them mathematically. It is therefore necessary to analyze the VAP quantitatively by converting the conceptual IF into their computational counterparts, i.e., intrinsic qualities (IQ). Intuitively and empirically, controllers can be reasonably classified into four levels, i.e., excellent, good, moderate and unqualified. Theoretically, this classification can be implemented by clustering the features constructed from the IQ and evaluated with statistical inference.
The determination of the specific level to which a controller belongs has a potential application in selecting air traffic controllers, with the aim of detecting human errors and reducing the operational risk.

Visual attention performances and clustering
It is feasible to analyze the VAP quantitatively with the help of our Neuron Type Test (NTT) system [31] and the data collected (see S1 Appendix). Once the response sequences of ATC students are collected with the NTT, we can define, compute and analyze the VAP by modeling the IF with mathematical methods. Technically, we take a three-step strategy to attain this objective:
1. Modeling: Based on the response sequences, the feature space is constructed with measurable parameters proposed for the IF by means of psychological experiments and mathematical and statistical theories.
2. Clustering: The ATC students characterized by the 2-dim or 3-dim features are classified via the K-means clustering algorithm automatically.
3. Assessing: Different clusters are evaluated by statistical inference in order to control the risk of human errors in aviation safety management by selecting potential professionals of ATC.
In civil aviation, the clustering method has been successfully applied to the intelligent classification of air traffic controllers [32] and to aeronautical information network intrusion detection [33]. On the basis of worldwide aircraft accidents between the 1950s and the 2010s, Christopher [34] and Olja [35] adopted the clustering method to predict the warning level of aircraft components and to construct an accident prediction model, respectively. Normal and abnormal operations can be identified by trajectory clustering [36], which helps to monitor the airspace efficiently. In 2012, the main factors of approaching and the quick access recorder (QAR) data were analyzed by Zhou [37] with the gray clustering method for assessing aircrew and improving flight safety. Vitali [38] studied the statistical regularities in ATM. In 2013, Monechi [39] analyzed the interrelation of safety data with flight trajectories and network metrics. In 2017, anomalous flight operations were identified from energy-based metrics and clustering [40], and Li et al. [41][42][43] studied the clustering and classification of the IQ influencing the VAP of controllers.
To control human errors and enhance the safety performance of aviation operations, we should not only monitor abnormal operations in time, but also assess the attention of operators effectively. The history of aviation safety shows that it is vital to pay more attention to operators than to equipment [44]. In this paper, we propose a systematic assessment of the individual's IF via the computational IQ by analyzing the response sequence with R/S analysis [31], the K-means clustering algorithm [45] and statistical inference, so as to identify the relation between human error and the information processing level specified by the IF. We deem that the predictive quantitative assessment of human errors caused by inadequate attention qualities could be used for selecting controller candidates and for risk management.

Roadmap, contents and organization
To quantify the essential differences among controllers, a group of participants (air traffic control students) was invited to take two consecutive 12-minute selective attention tests with the "Ruleout-Ignore-Eliminate" rules. The experiments were carried out with the help of the NTT system. The participants' IF are encoded in the response sequences collected in the experiment. Motivated by psychology and data science, we define 2-dim and 3-dim feature spaces for clustering the participants and assessing the IF with the K-means clustering algorithm and statistical inference. The main ideas, logic and content framework of this work are depicted in Fig 1. The paper is organized as follows: the mathematical preliminaries are presented briefly in Section 2; the computational representation of IF in the sense of attention theory and data science is discussed in Section 3; the experiments on intrinsic factors with the NTT are explained in Section 4; the intrinsic qualities are interpreted with the help of the 2-dim and 3-dim clustering results in Section 5; finally, the conclusions are given in Section 6. In the appendices, we give brief introductions to the NTT system and the K-means clustering algorithm.

R/S analysis and Hurst exponent
In [31], the R/S analysis is used to analyze the Hurst exponent [46][47][48] of the reaction time data collected from the NTT system. For a discrete reaction time sequence $\{X_k\}_{k=0}^{\ell-1}$ with length $\ell = d \times n$, it can be segmented into $d$ subsequences in which the $m$-th subsequence is

$$X^{(m)} = \{X^{(m)}_i : 1 \le i \le n\}, \qquad X^{(m)}_i = X_{(m-1)n+i-1}, \qquad 1 \le m \le d.$$

We can compute the local average $E_m$, bias $Y^{(m)}_i$, cumulative deviate $Z^{(m)}_i$, range $R^{(m)}_n$ and standard deviation $S^{(m)}_n$ respectively as follows:

$$E_m = \frac{1}{n}\sum_{i=1}^{n} X^{(m)}_i,$$

$$Y^{(m)}_i = X^{(m)}_i - E_m, \qquad 1 \le i \le n,$$

$$Z^{(m)}_i = \sum_{j=1}^{i} Y^{(m)}_j, \qquad 1 \le i \le n,$$

$$R^{(m)}_n = \max_{1 \le i \le n} Z^{(m)}_i - \min_{1 \le i \le n} Z^{(m)}_i,$$

$$S^{(m)}_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X^{(m)}_i - E_m\right)^2}.$$

Thus the average of the rescaled range for all subsequences of length $n$ will be

$$(R/S)_n = \frac{1}{d}\sum_{m=1}^{d}\frac{R^{(m)}_n}{S^{(m)}_n}.$$

In [47], Mandelbrot showed that the R/S statistics asymptotically follows the relation:

$$(R/S)_n \sim C\, n^{H}, \qquad n \to \infty. \tag{8}$$

Hence the Hurst exponent $H$ can be estimated via least square approach according to (8). Usually, $H \in (0, 1)$. For the reaction time sequences in NTT experiment, [31] shows that $H > 0.5$ means the process is persistent, $H = 0.5$ means that the process is near random, and $H < 0.5$ implies that the process is anti-persistent.
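The R/S procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in [31]: the function names `rescaled_range` and `hurst_rs`, the choice of window sizes, and the synthetic white-noise input are all assumptions made for the example.

```python
import numpy as np

def rescaled_range(x, n):
    """Average R/S over the d = len(x) // n non-overlapping windows of length n."""
    x = np.asarray(x, dtype=float)
    d = len(x) // n
    ratios = []
    for m in range(d):
        seg = x[m * n:(m + 1) * n]
        z = np.cumsum(seg - seg.mean())   # cumulative deviate of the bias
        r = z.max() - z.min()             # range of the cumulative deviate
        s = seg.std(ddof=1)               # standard deviation with 1/(n-1)
        if s > 0:                         # skip constant windows
            ratios.append(r / s)
    return float(np.mean(ratios))

def hurst_rs(x, window_sizes):
    """Least-squares slope of log (R/S)_n against log n."""
    log_n = np.log(window_sizes)
    log_rs = np.log([rescaled_range(x, n) for n in window_sizes])
    slope, _intercept = np.polyfit(log_n, log_rs, 1)
    return float(slope)

# Synthetic stand-in for a reaction-time sequence (assumption: white noise).
rng = np.random.default_rng(0)
noise = rng.normal(size=4096)
h_est = hurst_rs(noise, [16, 32, 64, 128, 256])
```

For uncorrelated noise the estimate is expected to lie near 0.5, although the R/S estimator is known to be biased for short windows, so the window sizes should be chosen with the sequence length in mind.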

Features and clusters
n-dim feature space and partition. The n-dim feature space is denoted by

$$S^n = \{x = [x_1, \dots, x_n]^\top : x_i \in R_i,\ 1 \le i \le n\},$$

where the component $x_i$ lies in the set $R_i$ for each feature $x$, and each $R_i$ may be discrete or continuous. For any two features $x$ and $y$ in $S^n$, we use the Euclidean distance $\|x - y\|$ to measure their difference, although there are plenty of candidate metrics. If the magnitudes of the components of a feature point span a large scale, then we normalize each component in order to balance the contribution of each component. In this paper, we take the following min-max normalization step:

$$x_i^{\mathrm{new}} = \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}}, \qquad 1 \le i \le n,$$

where $x_i^{\min}$ and $x_i^{\max}$ denote the minimum and maximum of the $i$-th component over the data set. For simplicity, we will use the notation $x$ for both the primitive feature $x$ and its normalized version $x^{\mathrm{new}}$. For the normalized feature space, we have $R_i = [0, 1]$ and each feature is located in an n-dim unit cube

$$C^n = \underbrace{[0,1] \times \cdots \times [0,1]}_{n\ \text{terms}}.$$

The index set $\{1, \dots, N\}$ can be partitioned into $K$ disjoint subsets $F_1, \dots, F_K$ such that $\{1, 2, \dots, N\} = \bigcup_{k=1}^{K} F_k$ and $F_j \cap F_i = \emptyset$ for $i \ne j$. Let $|A|$ be the cardinality of the set $A$ and $|F_k| = N_k$, then we have $\left|\bigcup_{k=1}^{K} F_k\right| = N$. Similarly, the data set $S = \{x_j : 1 \le j \le N\}$, which consists of $N$ discrete n-dim features, can be divided into $K$ disjoint subsets $S^n_1, \dots, S^n_K$ as follows:

$$S^n_k = \{x_j : j \in F_k\}, \qquad 1 \le k \le K. \tag{14}$$

In consequence, $S = \bigcup_{k=1}^{K} S^n_k$, $S^n_j \cap S^n_k = \emptyset$ for $j \ne k$, and $|S| = \sum_{k=1}^{K} |S^n_k| = N$. For the purpose of clustering, we use the disjoint sets $S^n_1, \dots, S^n_K$ to denote the $K$ different clusters.
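As a small illustration of the min-max normalization step, the following Python sketch maps every component of a feature matrix into [0, 1]. The function name `min_max_normalize` and the sample values are hypothetical, and the guard for constant columns is an added assumption that the text does not discuss.

```python
import numpy as np

def min_max_normalize(features):
    """Rescale each column (component) of an N x n feature matrix to [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # assumption: leave constant columns at 0
    return (features - lo) / span

# Hypothetical raw features: one count-like component and one exponent-like component.
raw = np.array([[200.0, 0.55], [500.0, 0.60], [800.0, 0.70]])
unit = min_max_normalize(raw)
```

After normalization both components contribute on the same scale to the Euclidean distance, which is the purpose of this step.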
For each feature point $x_j$ in the data set $\bigcup_k S^n_k$, we use the binary indicator variable $I_{kj}$, known as the 1-of-K coding scheme [49], to label the cluster to which the point $x_j$ is assigned. If $x_j$ is assigned to the $k$-th cluster $S^n_k$, i.e., $x_j \in S^n_k$, then $I_{kj} = 1$; otherwise $I_{kj} = 0$.
Statistics and metrics for clusters. The centroid of a given cluster $S^n_k$ is the average of the features in the cluster $S^n_k$, i.e.,

$$c_k = \frac{1}{|S^n_k|}\sum_{x \in S^n_k} x.$$

For the features $x_i$ and $x_j$ in the feature space $S^n$, their dissimilarity is measured by their Euclidean distance, viz.,

$$d(x_i, x_j) = \|x_i - x_j\|.$$

For the clusters $S^n_k$ and $S^n_i$, we use the Euclidean distance of their corresponding centroids to represent their distance:

$$d(S^n_k, S^n_i) = \|c_k - c_i\|.$$

The distance of clusters is an inter-class metric which describes the degree of separation of two clusters. For each cluster $S^n_k$, the standard deviation of features is defined by

$$\sigma_k = \sigma(S^n_k) = \sqrt{\frac{1}{|S^n_k| - 1}\sum_{x \in S^n_k}\|x - c_k\|^2}.$$

Note that we use the unbiased estimation of the variance since $N_k = |F_k|$ may not be sufficiently large. Intuitively, the smaller the standard deviation is, the better the cluster is. A standard deviation per dimension, which rescales $\sigma_k$ according to the dimension $n$ of the features, can be used to compare clusters with different dimensions when the impact of the dimension of features is considered in practical problems. Although the standard deviation is simple and useful, it can hardly describe the orientation of the features in a cluster.
It is necessary for us to consider the (unbiased) estimation of the covariance matrix of $S^n_k$, i.e.,

$$\Sigma_k = \mathrm{Cov}(S^n_k) = \frac{1}{|S^n_k| - 1}\sum_{x \in S^n_k}(x - c_k)(x - c_k)^\top, \tag{22}$$

where $c_k$ is the centroid of $S^n_k$. A correlation coefficient is also defined for the features $x_i$ and $x_j$ in $S^n$. For the purpose of comparing the difference of clusters, we define the TrLogDet (TLD) divergence for the clusters $S^n_k$ and $S^n_i$ in terms of their covariance matrices $\Sigma_k$ and $\Sigma_i$, and then introduce the TrLogDet (TLD) distance. Note that the expression for TLD in (24) is different from LogDet in [50,51] since the factor $1/n$ is introduced for the purpose of eliminating the impact of the dimension of the data and the trace operation is emphasized here. It is easy to show that the TrLogDet distance satisfies the axioms of distance. The TLD distance measures the dissimilarity of two clusters and it is an inter-cluster metric.
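The cluster statistics discussed here (centroid, unbiased standard deviation, unbiased covariance matrix and centroid distance) can be computed as in the sketch below. The function names and the two toy clusters are hypothetical; the standard deviation is taken as the square root of the summed squared distances to the centroid divided by $|S^n_k| - 1$, which is an assumption about the intended definition.

```python
import numpy as np

def centroid(cluster):
    """Average of the feature vectors in a cluster (rows are features)."""
    return np.asarray(cluster, dtype=float).mean(axis=0)

def cluster_std(cluster):
    """Unbiased standard deviation of the features around the centroid."""
    pts = np.asarray(cluster, dtype=float)
    sq = np.sum((pts - pts.mean(axis=0)) ** 2)
    return float(np.sqrt(sq / (len(pts) - 1)))

def cluster_cov(cluster):
    """Unbiased estimate of the covariance matrix of a cluster."""
    return np.cov(np.asarray(cluster, dtype=float), rowvar=False, ddof=1)

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return float(np.linalg.norm(centroid(a) - centroid(b)))

# Two hypothetical 2-dim clusters in the unit square.
a = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
b = a + np.array([0.6, 0.8])
```

With this convention the squared standard deviation equals the trace of the covariance matrix, so the covariance adds only the orientation information that the scalar statistic cannot express.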
For the clusters $S^n_k$ and $S^n_i$, when $\Sigma_i = \mathrm{Cov}(S^n_i) = I$ is the identity matrix, namely for an "ideal" or "standard" cluster, the TLD expression reduces to a quantity $\rho(S^n_k)$ that depends on $\Sigma_k$ only. In this case, $\rho(S^n_k)$ is the so-called Information-Theoretic Metric (ITM) [50,51] per dimension, which is a positive, strictly convex and spectral function. In particular, if $\Sigma_k = I = \mathrm{diag}(1, \dots, 1)$, then $\rho_k = 0$. Usually, the ITM $\rho_k$ can be used as a divergence metric for the cluster $S^n_k$, just as the standard deviation $\sigma_k$ does.
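To illustrate the idea of a per-dimension divergence between covariance matrices, the sketch below evaluates the standard LogDet divergence divided by the dimension $n$. This is an assumption made for illustration only: the exact TLD expression of this paper may differ, and the function name `logdet_divergence_per_dim` is hypothetical. The sketch does reproduce the stated property that the value is 0 when both covariance matrices are the identity.

```python
import numpy as np

def logdet_divergence_per_dim(cov_k, cov_i=None):
    """(tr(M) - log det(M)) / n - 1 with M = cov_i^{-1} cov_k (assumed form)."""
    cov_k = np.asarray(cov_k, dtype=float)
    n = cov_k.shape[0]
    if cov_i is None:
        cov_i = np.eye(n)                 # reference "standard" cluster
    m = np.linalg.solve(np.asarray(cov_i, dtype=float), cov_k)
    sign, logabsdet = np.linalg.slogdet(m)
    if sign <= 0:
        raise ValueError("covariance matrices must be positive definite")
    return float((np.trace(m) - logabsdet) / n - 1.0)

rho_identity = logdet_divergence_per_dim(np.eye(3))
rho_stretched = logdet_divergence_per_dim(np.diag([2.0, 0.5]))
```

Because the value depends only on the eigenvalues of the matrix product, it is a spectral function, and it grows as the cluster departs from the isotropic reference.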
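The choice of K in the K-means algorithm is a key issue in clustering. The value of K has direct impact on the clustering results, which is related to the sum of the squared errors (SSE) for K clusters:

$$\mathrm{SSE} = \sum_{k=1}^{K}\mathrm{SSE}(S^n_k), \qquad \mathrm{SSE}(S^n_k) = \sum_{x \in S^n_k}\|x - c_k\|^2,$$

where $\mathrm{SSE}(S^n_k)$ is the sum of squared errors for the $k$-th cluster $S^n_k$ with centroid $c_k$.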
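The dependence of the SSE on the partition can be seen in a small Python sketch. The function name `sse` and the four sample points are hypothetical; the cluster labels are supplied by hand here rather than produced by a clustering run.

```python
import numpy as np

def sse(points, labels):
    """Sum over clusters of squared distances between members and their centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

pts = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
one_cluster = sse(pts, [0, 0, 0, 0])     # all points in a single cluster
two_clusters = sse(pts, [0, 0, 1, 1])    # left pair and right pair
```

The SSE can only decrease as K grows, so it is usually inspected for the value of K beyond which the improvement becomes marginal rather than minimized outright.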

Conceptions
Logically, three key issues must be addressed to assess quantitatively the IF influencing VAP:
1. proposing an abstract description of the IF by mathematical modeling;
2. developing an interactive system to collect data which naturally preserve the intrinsic and invariant properties of interest;
3. finding proper ways to analyze the data collected and to give reliable, effective and practical assessment methods for the IF.
Conceptually, a reasonable data collecting system for analyzing the IF should have the following elements:
• STRUCTURE: the stimulation paradigm should be a Stimulus-Response (S-R) model, which can be represented theoretically by an Input/Output (I/O) model;
• INPUT: non-semantic symbols serve as visual stimuli;
• OUTPUT: the response of a participant is the decision, which includes the reaction time and the symbolized description of correctness;
• FUNCTION: the data collected with the system can be used to analyze the attention span, stability of attention and distribution-shift of attention of individuals;
• INTERFACE: the human-machine interaction interface should be friendly and easy to use.
According to these elements, the NTT system has been developed [31] and used to collect the response sequences of individuals, in which the $i$-th response $\langle y_i, \tau_i\rangle_p = \langle y^p_i, \tau^p_i\rangle$ of the $p$-th individual consists of two components:
• symbolic decision, denoted by $y_i$, such that $y_i \in \{Y, N, I\}$, where Y, N and I stand for "Yes", "No" and "Ignore" respectively;
• reaction time, denoted by $\tau_i$, such that $\tau_i > 0$.
The length of the response sequence refers to the count of responses in a time interval specified by the experimental setup. Usually, the length is different for different individuals and it can be regarded as a random variable mathematically.

Modeling the intrinsic factors: Qualitative and quantitative descriptions
The performance of attention is affected by a variety of factors, which can be divided into two fundamental categories [29,30]: extrinsic factors and intrinsic factors. Extrinsic factors include mission objectives, plans, operating procedures and so on. In contrast, intrinsic factors include knowledge, skills, experience, personal attention level and so on.
In terms of human errors in aviation, the rapidness, accuracy and stability of the behaviors of operators are essential for avoiding incidents or accidents in case of emergencies. There are two famous cases in civil aviation in the recent decade: the ditching of US Airways Flight 1549 on the Hudson River on 15 January 2009 (engine failure) [52] and the belly landing of a LOT Polish Airlines aircraft (landing gear failed to extend) on 1 November 2011. These three characteristics are closely related to the IF influencing VAP for aircraft operators.
Currently, a complete test approach for attention performance is lacking, thus we can only assess some aspects of human behaviors characterized by some components of attention performance [53]. In the sense of psychology, the assessment of IF in this work focuses on three aspects: attention span, attention stability and distribution-shift of attention. With the response sequences collected by the NTT system, it is feasible to measure and assess the IF with observable data.
For the convenience of understanding the ideas and methods for modeling the IF influencing VAP and the corresponding IQ of interest, we draw a block diagram for VAP.
Attention span: Number of decisions and its normalization. Attention span is the amount of concentrated time one can spend on a task without becoming distracted. If the concentrated time interval is fixed, then attention span refers to the amount of perceived or recognized objects during a given period for an individual. Intuitively, the more decisions the participant makes in each test via the NTT system, the faster the participant responds in the given time interval (2 × 12 = 24 minutes). Therefore, the number of decisions (NoD) made in the fixed time can be used as a data feature to describe the operational capacity of a single participant. Attention span relates to the individual's information processing intensity, which determines the number of objects perceived. The NoD maps the number of percepts and furthermore reflects the degree of rapidness, which serves as the assessment index for attention span.
Mathematically, the NoD is denoted by $I$ and its value must be a positive integer, i.e., $I \in \mathbb{N}$. In our NTT test, we have $I \in [200, 800]$ according to the setup of the practical experiments of NTT. For the sake of data processing and pre-conditioning in data science, we take the normalized number of decisions (NNoD), obtained with the min-max normalization step according to

$$W = \frac{I - I_{\min}}{I_{\max} - I_{\min}} \in [0, 1],$$

where $I_{\min}$ and $I_{\max}$ denote the minimal and maximal NoD.
Attention stability: Hurst exponent. Stable attention relates to the goal of the activity, the level of understanding of the task, the interest in the object, the degree of motivation to engage in the activity, emotional state, activeness of thought and state of health. Attention stability describes the individual's quality of keeping attention on a certain object or activity for a long duration [54]. In civil aviation, attention stability reflects the ATM controllers' ability to track aircraft among sectors and/or flights en route continuously and smoothly.
Based on the authors' previous work [31], the Hurst exponent [55], denoted by $H$, is a good feature for the reaction time data such that the latent information and long-term correlation can be characterized efficiently. The Hurst exponent $H$ is a natural and intrinsic parameter for representing the level of attention stability. In a heuristic way, there are two aspects which should be noted:
• The available methods for describing attention stability are based on the average, the range of reaction time and the standard deviation independently [56]. By contrast, the Hurst exponent $H$ integrates all of these factors into a single and simple statistical invariant in the sense of affine transformation by its definition.
• This statement is a fundamental hypothesis and its effectiveness should be verified and validated by practical cases and possible applications. In the sense of mathematical modeling of IF, it is an abstraction for describing the attention stability with Hurst exponent H. For our response sequences obtained by NTT system based on Stimulus-Response model, this hypothesis is acceptable and the abstraction leads to interesting and reasonable results when clustering with experimental data.

Distribution-shift of attention: Error rate of decisions.
Distribution of attention means distributing attention over distinct objects simultaneously, which reflects the capability of processing multiple tasks in parallel. Shift of attention refers to intentionally transferring the attention from one object, activity or operation to another. "Error" is usually a conscious behavior of the individual. It is due to the practitioner's lack of ability, behavioral faults, unfamiliarity with the rules of operation, lack of relevant knowledge and other factors that cause operational errors or faults. The job of an air traffic controller is to monitor, control and serve the flight activities of aircraft by using communication, navigation and monitoring techniques to prevent aircraft collisions and guarantee safety, which covers the stages of information perception, analysis, decision-making and implementation. During this information processing, the controller's memory and attention play the dominant roles in the accuracy of the operation. Error, omission and forgetting, three typical forms of human mistakes, are caused by lapses in the distribution-shift of attention. We use the error rate of decisions (ERD), denoted by $E$, to represent the correctness of decisions, which indicates the distribution-shift of attention. In the sense of aviation operation, the ERD is of significance for air traffic controllers. For the tests with the NTT system, it is defined as the proportion of erroneous decisions among all the decisions made by a participant, so that $E \in [0, 1]$.

Representation of IF and feature space
The IF are represented by the IQ, which generally leads to a feature space with high-dimensional features (see Fig 2).
2-dim feature space. In practical problems, some features of the data may not be easily acquired. Hence, the analysis of a low-dimensional sample space may be indispensable. On the basis of the modeling of IF, we have obtained three features and constructed the 2-dim and 3-dim feature spaces of interest.
For the 2-dim feature space, we use the Hurst exponent $H$ and the NNoD $W$ to construct the feature space, which consists of points $[W, H]^\top$ such that $H \in [0, 1]$ according to [31] and $W \in [0, 1]$ by normalization. We remark that it is not a wise choice to use the NoD $I$ instead of the NNoD $W$. Actually, the orders of magnitude of $H$ and $I$ are $10^{-1}$ and $10^{2}$ respectively, while the two components of each feature vector $[I, H]^\top$ carry the same weight. In consequence, the Euclidean distance of the feature vectors is dominated by the NoD $I$. To avoid such a result, it is necessary to use the normalized version $W$ rather than the primitive $I$, as suggested in [57].
Consequently, the $p$-th feature vector $x^2_p = [W_p, H_p]^\top$, which corresponds to the $p$-th individual/participant and is constructed from the NNoD and the Hurst exponent, is located in the normalized feature space $C^2$ (a unit square), i.e.,

$$x^2_p = [W_p, H_p]^\top \in C^2 = [0, 1] \times [0, 1].$$

For the $p$-th participant and the corresponding response sequence $\{\langle y^p_i, \tau^p_i\rangle\}_{i \in \mathbb{N}}$, we can find that:
• $I_p$ is the number of decisions determined by the reaction time sequence $\{\tau^p_i\}_{i \in \mathbb{N}}$;
• $W_p$ can be calculated from $I_p$ with the help of Eqs (10) and (11);
• $H_p$ can be computed from the reaction time sequence $\{\tau^p_i\}_{i \in \mathbb{N}}$.

3-dim feature space.
Meanwhile, 2-dim clustering alone is not sufficient for analyzing our problem. For a deeper understanding of the problem, we also construct a 3-dim feature space by adding the component $E$, i.e., the ERD.
Hence, the $p$-th 2-dim feature $x^2_p$ now turns into a 3-dim vector $x^3_p = [W_p, H_p, E_p]^\top$, which is constructed from the NNoD $W$, the Hurst exponent $H$ and the ERD $E$. Obviously, each $x^3_p$ is located in the 3-dim normalized feature space $C^3$ (a unit cube), i.e.,

$$x^3_p = [W_p, H_p, E_p]^\top \in C^3 = [0, 1] \times [0, 1] \times [0, 1].$$

For the $p$-th participant and the corresponding decision sequence $\{y^p_i\}_{i \in \mathbb{N}}$, the ERD $E_p$ can be determined easily.
Eigen features, measurability and operation risk. Mathematically, the selection of the basis is not unique for the n-dim feature space $C^n$. However, different bases are equivalent and they can be related by an invertible transition matrix. For general features, any choice of the basis is sufficient. However, if the clusters are considered, we need the concepts of eigen-feature and eigen-space.
For the $k$-th covariance matrix $\Sigma^n_k$ of the cluster $S^n_k$ defined by (22) in $C^n$, we consider its eigenvalue decomposition (EVD). When the cluster is not standard, more than one factor/component should be considered simultaneously. On the other hand, this implies that each cluster has its own eigen-features and manifests its own characteristics, thus the corresponding operators of civil aviation may have a specific operation risk. Logically, we have the chain "cluster → eigen-features → operation risk" for the aviation safety about air traffic control.
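A minimal Python sketch of the eigenvalue decomposition of a cluster covariance matrix is given below. It assumes that the eigen-features are the eigenvectors of the unbiased covariance estimate, ordered by decreasing eigenvalue; the function name `eigen_features` and the sample points are hypothetical.

```python
import numpy as np

def eigen_features(cluster):
    """Eigenvalues (descending) and matching eigenvectors of the cluster covariance."""
    cov = np.cov(np.asarray(cluster, dtype=float), rowvar=False, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return eigvals[order], eigvecs[:, order]

# Hypothetical elongated 2-dim cluster lying close to the diagonal.
pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.1]])
vals, vecs = eigen_features(pts)
```

The columns of `vecs` give the orientation of the cluster and `vals` the spread along each direction, which is the information that a single scalar standard deviation cannot carry.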

Levels of intrinsic qualities
Here we define four levels (High, Moderate, Low and Bottom), which order the intrinsic visual attention quality from "great" to "trivial". We now give some remarks on the levels of IQ:
• The bottom $W$ indicates that the participant's reaction is very slow and that making a decision costs much time. Experimentally, the process of making the symbolic decision $y_i \in \{Y, N, I\}$ is long and the speed is very slow in the sense of statistics. Psychologically, it implies that the attention span of the participant is small. On the contrary, a high $W$ means fast reaction and large attention span.
• The moderate $H$, say $H \approx 0.6$, shows that there is a weak positive correlation in the sense of long term time [31]. Therefore, the attention stability of the participants is considerable.

Setup of experiments
We declare that the Ethics Committee of the Research Institute of Civil Aviation Safety approved this research and the consent procedure. Given that the present study only measures simple behavioral responses, it does not elicit adverse physiological or psychological reactions from the participants. Thus, the institutional ethics committee waived the need for written informed consent from the participants. After all procedures had been explained by one of the researchers, the participants who decided to join the study received an invitation code to access the program, which represented their oral consent. Participants had the right to withdraw from the study at any time, and four participants in total withdrew. The participants are college students who are required to read some simple symbols on the computer screen and make some harmless decisions by clicking the mouse [31]. In this work, the 168 participants involved are undergraduate and graduate students (20 to 25 years old) selected from the College of Air Traffic Management of the Civil Aviation University of China. In other words, the data collected for the systematic assessment of IF are based on air traffic control students. It should be pointed out that the testing conditions of the experiments were controlled carefully:
• The participants involved in the NTT test are in a good mental state, i.e., with no sickness or lack of sleep in the past 48 hours.
• The test environment is quiet and isolated from any observable natural and/or artificial interruptions.
In each NTT experiment, each participant is required to finish 2 successive testing tasks [31] within 2 × 12 minutes. In each task, each participant is guided to read many positive/negative non-semantic symbols and is asked to make decisions at the same time. Naturally, different participants will have different response sequences, and also a different NoD in the same time interval due to the different reaction time in each decision operation. After data cleaning, the responses of 143 participants are kept and 25 records are discarded. For the final data set, please see S1 Appendix.

Constructing features for intrinsic qualities
The selected features and the number of features will affect the clustering result. For prototype modeling, the data feature extraction shall be described first. After the pre-processing of the data, the features of interest are extracted from the reaction time sequence of each participant. As a companion of Fig 2, the pseudo-code for the feature extraction for participants in the NTT is given in Algorithm 1.
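The feature extraction described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not a transcription of the paper's pseudo-code: the record layout (a list of `(is_correct, reaction_time)` pairs per participant), the function names, the window sizes, and the use of the sample minimum and maximum NoD for normalization are all hypothetical choices, and the input data are synthetic.

```python
import numpy as np

def hurst_rs(x, window_sizes=(8, 16, 32, 64)):
    """Slope of log (R/S)_n versus log n over non-overlapping windows."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        ratios = []
        for m in range(len(x) // n):
            seg = x[m * n:(m + 1) * n]
            z = np.cumsum(seg - seg.mean())
            s = seg.std(ddof=1)
            if s > 0:
                ratios.append((z.max() - z.min()) / s)
        if ratios:
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(ratios)))
    return float(np.polyfit(log_n, log_rs, 1)[0])

def extract_features(records):
    """Return one row [W, H, E] per participant.

    records: one list of (is_correct, reaction_time) pairs per participant
    (assumed layout). Requires at least two distinct NoD values.
    """
    nod = np.array([len(r) for r in records], dtype=float)       # NoD
    w = (nod - nod.min()) / (nod.max() - nod.min())              # NNoD
    h = np.array([hurst_rs([t for _, t in r]) for r in records]) # Hurst exponent
    e = np.array([sum(1 for ok, _ in r if not ok) / len(r)       # ERD
                  for r in records])
    return np.column_stack([w, h, e])

# Synthetic participants with 200, 450 and 800 decisions.
rng = np.random.default_rng(1)
records = [
    [(rng.random() > 0.1, rng.uniform(0.5, 3.0)) for _ in range(size)]
    for size in (200, 450, 800)
]
features = extract_features(records)
```

Dropping the last column of `features` gives the 2-dim feature vectors, so the same routine can feed both clustering settings.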

Implementation of K-means clustering algorithm
Applicability of K-means clustering algorithm. The advantage of K-means lies in its simple interpretation due to the intuitive and reliable clustering results. However, this algorithm also has some disadvantages: possible convergence to a local optimum and high computational complexity for large-scale sample spaces. In our experiment, the data set collected from the NTT system is small in scale, thus the K-means algorithm works well.
Choice of K. For the implementation of the K-means clustering algorithm, the value of K cannot be too large. The cross-validation method is used to select the K value. In the majority of cases, the selection relies on prior belief or on intrinsic characteristics of the data. According to the characteristics of the NTT data set, the data set is divided into four categories, which aims at classifying all of the participants into four levels: excellent, good, moderate and unqualified. The objective of clustering the feature points corresponding to the controllers (or air traffic control students) is threefold:
• exploring the potential regularity for classifying controllers into different levels;
• predicting the lapse of attention in the practical operations of air traffic control by quantitative analysis;
• improving the IQ of controllers via professional training.
In our recent studies [41][42][43], the characteristics of attention quality of controllers are described and the differences between controllers and ATC students are revealed according to the types of attention quality. The results obtained show that:
• both controllers and ATC students could be classified into four types according to attention quality;
• the qualified controllers concerned show large attention span, strong attention stability, rational attention switching and improper attention distribution;
• there is a significant difference between full time controllers and ATC students.
Pseudocode for clustering algorithm. For the details of the K-means algorithm, please refer to [45]. The 2-dim and 3-dim feature spaces of interest have been obtained from the feature extraction and serve as the input of the K-means algorithm. Mathematically, we have $x_j = x^2_k$ or $x_j = x^3_k$ for the j-th feature when clustering.
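The iteration behind K-means can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not the pseudocode of [45]: the function names `assign` and `kmeans`, the random initialisation, the stopping rule and the synthetic `demo` array (a stand-in for normalised [W, H, E] feature vectors) are assumptions made for the example.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    diff = points[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2).argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd iteration on an (m, n) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct samples drawn at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids no longer move
        centroids = updated
    return assign(points, centroids), centroids

# Synthetic stand-in for 60 participants with 3 normalised features.
demo = np.random.default_rng(1).random((60, 3))
labels, centroids = kmeans(demo, k=4)
```

Because the result depends on the initial centroids, running the sketch with several seeds and keeping the partition with the smallest SSE is a common precaution against convergence to a local optimum.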

Clustering with experimental data from ATC students
The data obtained from the NTT system are analyzed carefully and systematically. The distribution of clusters can be visualized intuitively with 2-dim and 3-dim clustering. It is worth explaining why both 2-dim and 3-dim clustering are valuable for the classification problem in air traffic control. The key issue lies in the level of information processing of the ATC students in the sense of psychology:
• the 2-dim clustering analysis with the K-means algorithm is mainly based on the Hurst exponent and the workload, which reflect the speed and stability of information processing of the ATC students (see Fig 3);
• compared with the 2-dim clustering analysis, the 3-dim clustering analysis also reflects the accuracy of information processing of the ATC students (see Fig 4);
• in general, the 3-dim clustering analysis is more comprehensive than the 2-dim clustering analysis for describing and explaining the level of information processing of the ATC students.
2-dim clustering. The 2-dim feature space is taken for the 2-dim clustering. Eq (14) shows that the set of 2-dim feature vectors can be partitioned into K = 4 disjoint subsets $S^2_1$, $S^2_2$, $S^2_3$ and $S^2_4$. To separate the clusters with a distinct visualization, we use Graham's Scan [58], which has quasi-linear time complexity, to find the convex hull of the feature points of each cluster, i.e., the contours of the clusters. This procedure simplifies the step of separating different clusters and avoids complex methods for determining the borders.
3-dim clustering. The 3-dim feature space is taken for the 3-dim clustering. Eq (14) shows that the set of 3-dim discrete feature vectors can be partitioned into K = 4 disjoint subsets $S^3_1$, $S^3_2$, $S^3_3$ and $S^3_4$. We will find that a higher-dimensional feature space leads to a better clustering result, which is supported by the following analysis.
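As an illustration of the contour step, the sketch below implements a textbook Graham scan for 2-dim points. The helper names and the toy points are hypothetical, and ordering the points by cross product (rather than by explicit angles) is a choice made here to avoid floating-point angle comparisons; the sort dominates the cost, which gives the quasi-linear complexity mentioned above.

```python
from functools import cmp_to_key

def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    """Convex hull vertices in counter-clockwise order, starting at the pivot."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return pts
    # Pivot: lowest point, ties broken by the smallest x.
    pivot = min(pts, key=lambda p: (p[1], p[0]))
    rest = [p for p in pts if p != pivot]

    def by_angle(a, b):
        # Counter-clockwise order around the pivot; nearer point first on ties.
        c = cross(pivot, a, b)
        if c != 0:
            return -1 if c > 0 else 1
        da = (a[0] - pivot[0]) ** 2 + (a[1] - pivot[1]) ** 2
        db = (b[0] - pivot[0]) ** 2 + (b[1] - pivot[1]) ** 2
        return (da > db) - (da < db)

    rest.sort(key=cmp_to_key(by_angle))
    hull = [pivot]
    for p in rest:
        # Drop points that would create a clockwise or collinear turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical feature points of one cluster: square corners plus inner points.
toy = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = graham_scan(toy)
```

Applying the function to the points of each cluster separately yields one contour per cluster.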

Comparison of 2-dim and 3-dim clustering
Centroid. According to the data collected by the NTT system for the ATC students, the centroids of the clusters are obtained, see Table 2. Each centroid is computed as the mean of the feature vectors in its cluster. The clusters clearly possess different centroids, which supports the validity of the clustering algorithm. The centroids serve as the most representative points of each cluster.
Euclidean distance. Figs 5 and 6 are two heat maps of the matrix of Euclidean distances $\mathrm{dist}_E(S^n_k, S^n_i)$ between every two centroids, computed by (18). The distance ranges for the 2-dim and 3-dim clusters are $[0, \sqrt{2}]$ and $[0, \sqrt{3}]$ respectively. The Euclidean distance gives an intuitive view of the clustering result. By comparison, we find that the 3-dim clustering separates the sample space more distinctly. The better the sample space is partitioned into four clusters, the more distinct the characteristics obtained. Consequently, clustering on a higher-dimensional sample space could reveal more of the IF hidden in the participants.
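The centroid and distance computations described above can be sketched in a few lines. The function names and the toy data are hypothetical; the sketch also assumes that every feature has been scaled to [0, 1], which is consistent with the quoted upper bounds, since the largest distance between two points of the unit n-cube is $\sqrt{n}$.

```python
import numpy as np

def centroids_of(points, labels, k):
    """Mean feature vector of each cluster, shape (k, n)."""
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

def centroid_distance_matrix(centroids):
    """Symmetric (k, k) matrix of Euclidean distances between centroids."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2)

# Hypothetical toy data: six 2-dim feature points already scaled to [0, 1].
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8],
                   [0.8, 0.9], [0.1, 0.9], [0.9, 0.1]])
labels = np.array([0, 0, 1, 1, 2, 3])

cents = centroids_of(points, labels, k=4)
dist = centroid_distance_matrix(cents)
print(np.round(dist, 3))  # the kind of matrix a heat map would display
```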
Second order statistics. For each cluster, in either 2-dim or 3-dim clustering, the values of the second order statistics are satisfactory for statistical inference:
• Table 3 shows that the variances and standard deviations of the 2-dim and 3-dim clusters, computed by Eq (20), are small;
• Tables 4 and 5 show that the covariance matrices are diagonally dominant for each 2-dim cluster $S^2_k$ and 3-dim cluster $S^3_k$.
This indicates good inter-cluster statistical characteristics and again supports the validity of the clustering algorithm.
Once two clusters are considered at the same time, we can find the intra-cluster characteristics with the covariance matrix by (22). The covariance matrix represents the relevance among different features, and its eigenvalues indicate the contribution of the corresponding feature to the characteristic of an individual (participant of the NTT). With the help of Table 6 for the 2-dim clustering, we find that the features are weakly dependent, since the correlation coefficients computed by (23) are small: $Cr_{ij} < 0.3$ for $i \neq j$. However, Table 6 also shows that for the 3-dim clustering the dependency among the three features is higher under the same calculation of the correlation coefficients. In other words, the analysis on the 3-dim feature space is more complex than its counterpart in the 2-dim space. The higher the dependency of the features, the stronger the cross-impact of the IQ. Intuitively, the attention span, the stability of attention and the distribution-shift of attention may interplay in some sense. Fortunately, the correlation coefficients of these clusters indicate a weak dependency among different features in both cases, i.e., the features of the IQ corresponding to an excellent participant are nearly independent.
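A small numerical sketch may help to make the dependency check concrete. It assumes that the correlation coefficient in (23) is the usual Pearson coefficient $Cr_{ij} = C_{ij} / \sqrt{C_{ii} C_{jj}}$ derived from the covariance matrix $C$; the function names, the synthetic cluster, and the use of the absolute value in the 0.3 threshold test are assumptions made for the example.

```python
import numpy as np

def feature_correlation(cluster):
    """Covariance and Pearson correlation matrices of an (m, n) cluster."""
    cov = np.cov(cluster, rowvar=False)   # (n, n) sample covariance
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)       # Cr_ij = C_ij / sqrt(C_ii * C_jj)
    return cov, corr

def weakly_dependent(corr, threshold=0.3):
    """True if every off-diagonal |Cr_ij| is below the threshold."""
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return bool(np.all(np.abs(off_diagonal) < threshold))

# Synthetic stand-in for one 3-dim cluster of 40 participants.
cluster = np.random.default_rng(0).random((40, 3))
cov, corr = feature_correlation(cluster)
print(np.round(corr, 2), weakly_dependent(corr))
```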
TrLogDet distance. The TrLogDet distance $\mathrm{dist}_T(S^n_k, S^n_i)$ for each pair of clusters is computed according to Eq (25), see Figs 7 and 8. We find that the TrLogDet values for the 2-dim clusters are smaller than their counterparts in the 3-dim space. For 2-dim clustering, the clusters have similar profiles in which the samples lie within a neighborhood. By contrast, for 3-dim clustering, the samples spread over a wide range. Although clustering in the 3-dim sample space has higher complexity and the assessment of 3-dim objects is more complicated, the 3-dim result is more persuasive in general since it contains more information.
Information theoretic metric. The information theoretic metric (ITM) is calculated via (26), see Table 7. The ITMs of the four 2-dim clusters show no significant difference, which means that their profiles are similar. However, for the 3-dim clusters, the differences between the ITMs are much larger than those in the 2-dim space. This implies two facts:
• the feature points can be distinguished well with the ITM if the dimension is chosen properly;
• the performance of 3-dim clustering is better than that of 2-dim clustering, since more information is involved in distinguishing the feature points.
Sum of squared errors. According to Eq (27) for the SSE, we obtain Table 8. For n-dim clustering, a small SSE per sample reflects a satisfactory clustering outcome for a given K. A good choice of K depends on the characteristics of the data set, experience, and the objective of the data analysis. In our study, we set K = 4.

Clusters and performances of attention
With the four levels (high, moderate, low and bottom) mentioned above, we now propose some interpretations as follows:
• $S^2_1$ with bottom W and moderate H
  • The average W is the smallest among the four clusters, which indicates that the participants react very slowly and spend much time making a decision. Experimentally, the process of making the symbolic decision $y_i \in \{Y, N, I\}$ is slow in the statistical sense. Psychologically, it implies that the attention span of the participants is small.
  • The average H is about 0.64, which shows a good positive long-range correlation [31]. Therefore, the attention stability of the participants is acceptable.
  • This type of controller is not appropriate for busy airports with large traffic flow. On the contrary, only small airports with fewer flights could be considered for these controllers. For the sake of low operation risk, these participants should be allocated a low workload.
• $S^2_2$ with low W and bottom H
  • H fluctuates around 0.5, which indicates that the participants are the most unstable during decision-making, since H = 0.5 means strong randomness and poor correlation [31]. Therefore, the attention of the participants is not stable.
  • Low W shows that more time is needed for decision-making, thus the attention span is small.
  • The participants included in $S^2_2$ are not appropriate for selection as controllers, since neither the attention span nor the attention stability is good in the sense of controlling the operation risk.
• $S^2_3$ with moderate W and high H
  • For a given period, the NoD of the participants in the 3rd cluster is stable. H above 0.6 shows that their response sequences have better long-range correlation and response continuity, as well as positive correlation.
  • Moderate W indicates an ordinary level of decision-making of the participants and a good attention span.
  • Samples in cluster $S^2_3$ and cluster $S^2_2$ are at opposite positions and have opposite characteristics. This suggests that the clustering algorithm is capable of partitioning samples with significant differences.
  • $S^2_3$ includes the best participants separated automatically via the 2-dim clustering process.
• $S^2_4$ with high W and low H
  • High W means high attention span while low H means low attention stability.
  • The corresponding participants could work with those in $S^2_3$ when the workload of controllers is large, as in the case of peak air traffic flow at busy airports.

3-dim clusters of features [W, H, E]
Table 9 shows no distinct difference in H, since $H \in [0.6, 0.7]$ for the four different clusters (which implies good positive long-range correlations and attention stability). We now give some interpretations for these clusters as follows:
• $S^3_1$ with bottom W, low E and poor H
  • The participants in cluster $S^3_1$ need a long time to make decisions, thus the attention span must be small. Meanwhile, sufficient thinking time leads to high accuracy or, equivalently, low E.
  • The values of H spread over a wide range which covers the critical point H = 0.5, thus their long-range correlation is relatively poor. We can infer that the occurrence of mistakes fluctuates rapidly and randomly. In consequence, the attention stability is poor.
• $S^3_2$ with moderate W, low E and a wide range of H
  • Moderate W of the participants in cluster $S^3_2$ indicates a moderate level of information processing and attention span. In particular, 25% of the participants in $S^3_2$ have an excellent attention span, since they have large W values when compared with other clusters.
  • Low E shows high accuracy during decision-making. Therefore, the distribution-shift of attention is good and acceptable.
  • The value of H has a wide range, which implies that the long-range correlation hidden in the response sequence is complex. However, for the participants in $S^3_2$, the box-plot also shows that more than 70% of them have good attention stability since $H \geq 0.6$, and 25% of them have excellent attention stability since $H \geq 0.7$.
  • In general, $S^3_2$ is the best cluster for 3-dim clustering, and the participants of interest can be selected objectively since their IQ are the best.
• The participants characterized by $S^3_3$ have poor work performance: low attention span, low distribution-shift of attention and strong stability. In other words, they respond slowly, make decisions with a high error rate, and find it difficult to improve their work efficiency and the correctness of their decisions.
• $S^3_4$ with high W, low H, and high E
  • Both the W and the H are moderate, thus the attention span and stability of the participants in $S^3_4$ are acceptable.
  • The distribution-shift of attention is very poor, since E > 70% for each participant and $E \approx 100\%$ for some participants. This implies that the participants should be re-evaluated seriously and that they may not be selected as air traffic controllers.

Levels of intrinsic qualities for ATC students and controllers
According to the results of the 2-dim and 3-dim clustering analysis shown in Figs 3 and 4 respectively, we invited five ATC experts, each with more than 10 years of work experience, to evaluate the attention span, attention stability and distribution-shift of attention of the controllers (graded from 1 to 9 points). The higher the score, the better the attention features. In the end, we obtain three intervals: 1 to 3 points, 4 to 6 points and 7 to 9 points, see Table 10.
Based on the assessment of IQ for the 2-dim and 3-dim clusters, the ATC students can be divided into four classes, and this classification could be generalized to the assessment of air traffic controllers, see Table 11. Consequently, we have:
• Level-1 air traffic controllers: Concentrating attention rapidly, distributing attention reasonably, and finishing the specific task with high accuracy under high workload. It is feasible to allocate complex air traffic control tasks to this kind of controller.
• Level-2 air traffic controllers: Processing information swiftly and stably, but with a high error rate. In other words, they may suffer great difficulties when multiple tasks or tasks with multiple objectives are concerned. This kind of controller may make many mistakes under high air traffic flow. Their low performance in distribution-shift of attention might cause omission of traffic conflicts, which may increase the operation risk.
• Level-3 air traffic controllers: The distribution-shift of attention is swift. However, neither the attention span is sufficient nor the attention stability acceptable. The "error/mistake", "forgetting" and "omission" phenomena appear frequently if the controllers' workload is high. As a suggestion, it would be better for this kind of controller to work in periods of average air traffic flow intensity or in a small airport with low air traffic flow intensity.
• Level-4 air traffic controllers: Characterized by untimely reactions and a high probability of making mistakes. Generally, the inferior IQ of Level-4 controllers may cause serious operation risk.

Key results
Aviation safety is closely related to the IF influencing the VAP of air traffic controllers. However, there is a lack of mathematical descriptions and effective analysis of IF based on observable data. We presented the IQ, a mathematical representation of the IF, and constructed the 2-dim and 3-dim feature spaces of interest with the computational parameters NNoD, Hurst exponent and ERD for characterizing the response sequences. The K-means clustering algorithm is applied to the IQ, and the systematic assessment of the 2-dim and 3-dim clusters is performed through statistical inference, by computing and comparing the geometric metrics and statistics defined. Based on the NTT system and the response sequences collected from 143 ATC students, the statistical analysis of the clusters obtained demonstrates some interesting results:
• The data set has been well divided into four clusters, which coincides with the practical observations for air traffic controllers.
• The eigen-features of the clusters and the correlation coefficients for the directly measurable IQ, i.e., NNoD, Hurst exponent and ERD, show that each cluster has its own eigen-features and manifests its unique characteristics, thus the corresponding controllers or professionals may have specific operation risks.
• The participants can be divided into four levels: EXCELLENT, GOOD, MEDIUM, UNQUALIFIED, which implies:
  • potential rules for selecting controllers for specific tasks under different air traffic flow intensities, in order to satisfy the demand for safe operations in civil aviation;
  • the controllers in different clusters should be allocated different workloads when the type of airport and the air traffic flow intensity are considered.
In the sense of methodology, our work illustrates a novel chain of logic for analyzing and reducing the operation risk in civil aviation: professionals → VAP → IF → IQ → clustering → statistical inference → IQ levels → human errors. We hope that the method developed in this work will be helpful for selecting professionals and allocating workload more reasonably, subject to the operation risk allowed, in many fields.

Future work
Generally, many forms of human factors have an impact on aviation safety. "Error/mistake", "forgetting" and "omission" are three of the top factors that can be captured by the NTT system. Statistical inference itself requires that experiments for collecting data be designed as carefully and objectively as possible. In the future, we will pay more attention to the following aspects:
• Explore new IF influencing VAP which can be captured by the NTT system. We take only three intrinsic factors and the corresponding IQ in this work. Seeking measurable VAP and their representations (IQ) is still an open problem.
• Invite left-handers to do the tests with the NTT system. Currently, all of the ATC students invited are right-handers. Thus, the visual attention is mainly focused on the left brain in the sense of neurophysiology and cognitive psychology. In the future, we will also collect data from left-handers.
• Increase the number of participants, so that the statistical theory of large samples can be utilized. The more participants, the more reliable and convincing the levels of IQ for air traffic controllers.
• Determine the parameter K automatically. In our paper, the parameter K for clustering is set as K = 4 according to our experience and knowledge of classifying air traffic controllers, which has a potential impact on the statistical inference. In the future, we will design algorithms to select the parameter K automatically and find a trade-off between the complexity of the clustering algorithm and the difficulty of interpreting the clusters obtained, subject to the facts and requirements of ATC.
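One simple way to approach the last item is to inspect how the SSE changes with K and look for the point where the decrease levels off (the so-called elbow heuristic). The sketch below is only an illustration of that idea under stated assumptions, not the selection algorithm envisaged above: the function names, the random restarts, and the synthetic array standing in for the 143 feature vectors are all hypothetical, and the SSE is taken as the usual sum of squared distances to the nearest centroid.

```python
import numpy as np

def kmeans_sse(points, k, n_iter=100, seed=0):
    """Run the Lloyd iteration once and return the sum of squared errors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Squared distance of every point to its nearest final centroid.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def sse_curve(points, k_values, restarts=5):
    """Smallest SSE over several random restarts for each candidate K."""
    return {k: min(kmeans_sse(points, k, seed=s) for s in range(restarts))
            for k in k_values}

# Synthetic stand-in for 143 participants with 3 normalised features.
X = np.random.default_rng(2).random((143, 3))
curve = sse_curve(X, range(1, 7))
for k, sse in curve.items():
    print(k, round(sse / len(X), 4))  # SSE per sample for each K
```

The restarts are included because a single run may stop at a local optimum; any automatic rule built on such a curve would still need to be checked against the interpretability requirements of ATC.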