Classification and Verification of Handwritten Signatures with Time Causal Information Theory Quantifiers
arXiv:1601.06925v1 [cs.IT] 26 Jan 2016

We present a new approach for handwritten signature classification and verification based on descriptors stemming from time causal information theory. The proposal uses the Shannon entropy, the statistical complexity, and the Fisher information evaluated over the Bandt and Pompe symbolization of the horizontal and vertical coordinates of signatures. These six features are easy and fast to compute, and they are the input to a One-Class Support Vector Machine classifier. The results are better than those of state-of-the-art online techniques that employ higher-dimensional feature spaces, which often require specialized software and hardware. We assess the consistency of our proposal with respect to the size of the training sample, and we also use it to classify the signatures into meaningful groups.


Introduction
The word biometrics is associated with human traits or behaviors that can be measured and used for individual recognition. In fact, biometric recognition, as a form of personal authentication by signal processing, can be used in applications where users need to be securely identified [1]. Clearly, these kinds of systems can either verify or identify.
Two types of biometrics can be defined according to the personal traits considered: physical/physiological or behavioral. Physical/physiological biometrics concerns the biological traits of users, such as fingerprints, iris, face, and hand. Behavioral biometrics takes into account dynamic traits of users, such as voice, handwriting, and signature expressions.
One of the main advantages of biometric systems is that users do not have to remember passwords or carry access keys. Another important advantage lies in the difficulty of stealing, imitating, or generating genuine biometric data, leading to enhanced security [1].
As mentioned, behavioral biometrics is based on measurements extracted from an activity performed by the user, in a conscious or unconscious way, that are inherent to his/her own personality or learned behavior. In this respect, behavioral biometrics has interesting advantages, such as user acceptance and cancelability, but it still lacks some of the uniqueness that physiological biometrics has.
Among the pure behavioral biometric traits, the handwritten signature, together with the way we sign, is the one with the widest social and legal acceptance [2][3][4][5][6]. Identity verification by signature analysis requires no invasive measurements, and people are familiar with the use of signatures in their daily life. Also, it is the modality confronted with the highest level of attacks.
A signature is a handwritten depiction of someone's name or some other mark of identification written on documents and devices as proof of identification. The formation of a signature varies from person to person, and even for the same person, due to the psychophysical state of the signer and the conditions under which the signature apposition process occurs.
Hilton [7] studied how signatures are produced and found that a signature has at least three attributes: form, movement, and variation, movement being the most important because signatures are produced by moving a writing device. The study also noted that a person's signature does evolve over time and that, for the vast majority of users, once the signature style has been established the modifications are usually slight. The movement is produced by the muscles of the fingers, hand, wrist and, in some writers, the arm; these muscles are controlled by nerve impulses. When a person is signing, these nerve impulses are controlled by the brain without any particular attention to detail. The signing process can then be described, at a high level, as the central nervous system (the brain) retrieving information from long-term memory in which parameters such as size, shape, and timing are specified. At the peripheral level, commands are generated for the muscles. Consequently, the signing process is believed to be a reflex action (ballistic action) rather than a deliberate action. The production of genuine signatures is thus associated with ballistic handwriting, which is characterized by a spurt of activity without positional feedback, whereas the production of forged signatures is associated with deliberate handwriting, which is characterized by a conscious attempt to produce a visual pattern with the aid of positional feedback [9,10].
Handwritten signature verification is a problem in which the input signature (a test signature) is classified as genuine or forged. This process is usually performed in three main phases [2][3][4][5][6]:
• Data acquisition and pre-processing. Two different categories of systems can be identified, depending on whether there is electronic access to the handwriting process or not. a) Online or dynamic recognition, in which the pen's instantaneous trajectory, together with information such as pressure, speed, or pen-up movements, can be captured. b) Offline or static recognition, in which signatures are recorded as images on paper that can later be digitized by means of a scanner and processed. In the latter, the pre-processing phase involves filtering, noise reduction, and smoothing. Online signature verification offers reliable identity protection, as it employs dynamic information that is not available in the signature image itself but in the process of signing. As a consequence, online signature verification systems usually achieve better accuracy than offline systems.
• Feature extraction. Two types of features can be used. a) Function features of the signature: time functions whose values constitute the feature set. b) Parameter features: the signature is characterized as a vector of elements, each one representative of the value of a feature. Usually, the latter yields better performance, but it is also time-consuming.
• Classification. In the verification process, the authenticity of the test signature is evaluated by matching it against those stored in the knowledge base developed during the enrollment stage. This process produces a single response that attests to the authenticity of the test signature. When template matching techniques are considered, a questioned sample is matched against templates of authentic/forged signatures. Distance-based classifiers, mostly when parameters are used as features, are usually developed with statistical techniques, e.g., with Mahalanobis and Euclidean distances. The performance of a signature verification system is commonly assessed in terms of the percentage Equal Error Rate.
On the one hand, template matching attempts to find similarities between the input signature and those in a database. Most approaches use Dynamic Time Warping to perform this match [5,6]. On the other hand, distance-based classifiers rely on the use of features derived from the signatures.
Two opposite mechanisms describing the signing process can be found in the literature. The nonlinear character and chaotic behavior of several complex physiological processes are well established [11,12]. In particular, Longstaff and Heath [13] found evidence of chaotic behavior in the underlying dynamics of time series related to velocity profiles of handwritten texts. Taking into account the inherent behavioral nature of the online signing process, the input information could be associated with deterministic (nonlinear low-dimensional chaotic) signals, and the handwritten signature variations with a consequence of chaos (sensitivity to initial conditions). In opposition, most of the research in the field of signature verification considers the input information as well described by a random process [2][3][4][5][6]. The dynamic input information acquired through a time sampling procedure must then be considered a discrete-time random sequence. In any case, signature analysis taken as a time-based sequence characterization process is strongly related to the way in which a reference model is established. From the stochastic point of view, Hidden Markov Models are among the most commonly used in the literature, and the ones with the best performance in signature verification [2][3][4][5][6].
Our proposal relies on the use of time causal quantifiers based on Information Theory for the characterization of online handwritten signatures: normalized permutation Shannon entropy, permutation statistical complexity, and permutation Fisher information measure. These quantifiers have proved to be useful in the identification of chaotic and stochastic dynamics through the associated time series [14,15]. Their evaluation is simple and fast, making them apt for the signature verification problem. We apply our proposal to the well-known MCYT online signature database [16].
The next section describes the database used in this study, followed by a section where we detail the quantifiers employed and by their application to the data. In addition to the usual data flow, we present an exploratory data analysis (EDA) of the features that supports their appropriateness for this problem. The expressiveness and usefulness of these descriptors for online signature classification and verification are then assessed by applying them to the test bed.

Handwritten signatures database
The present study is carried out on the freely available and widely used handwritten signatures database MCYT-100, a subset of 100 persons [16]. Each online signature is acquired dynamically on a WACOM graphics tablet, model INTUOS A6 USB. The tablet resolution is 2540 lines/in (100 lines/mm), and the precision is ±0.25 mm. The maximum detection height is 10 mm (so pen-up movements are also captured), and the capture area is 127 mm (width) × 97 mm (height). This tablet provides the following discrete-time sequences: a) position $x_t$ in the x-axis, b) position $y_t$ in the y-axis, and c) the time series corresponding to the pressure $p_t$ applied by the pen, as well as the azimuth $\gamma_t$ and altitude $\phi_t$ angles of the pen with respect to the tablet, which are not used in the present work. The sampling frequency is set to 100 Hz. Taking into account the Nyquist sampling criterion and the fact that the maximum frequencies of the related biomechanical sequences are always under 20-30 Hz [17], this sampling frequency leads to a precise discrete-time signature representation.
The signature corpus comprises genuine signatures and shape-based, highly skilled forgeries with natural dynamics [16,18]. In order to obtain the forgeries, each contributor is requested to imitate other signers by writing naturally. For this task, they were given the printed signature to imitate and were asked not only to imitate the shape but also to generate the imitation without artifacts such as breaks or slow-downs (see [16,18] for more details of the acquisition procedure). Each signer contributes 25 genuine signatures in five groups of five signatures each, and is forged 25 times by five different imitators. Figure 1 presents examples for six different subjects, the first two columns being genuine signatures and the third column forgeries. Since signers concentrate on a different writing task between genuine signature sets, the variability between client signatures from different acquisition sets is expected to be higher than the variability of signatures within the same set. The total number of contributors in the MCYT database is 330, and the total number of signatures is 16,500, half of them genuine signatures and the rest forgeries [16,18].
As previously mentioned, we used a subset of the database, denominated MCYT-100, which includes 100 subjects and, for each one, 25 genuine and 25 skilled forged signatures. Only the time series corresponding to the x- and y-coordinates of each signature are analyzed. In particular, one must note that the time series' lengths are quite variable. In order to facilitate our Information Theory analysis, we pre-processed each time series as follows: a) the coordinates were re-scaled into the unit square [0, 1] × [0, 1]; b) taking these scaled values as a basis, the original total number of data for each time series was expanded to M = 2000 points using a cubic Hermite polynomial. In this way, for each subject k (k = 1, ..., 100) and associated signatures j (j = 1, ..., 25) we will analyze two time series, denoted by X
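The two pre-processing steps can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the tangents of the cubic Hermite polynomial are chosen, so the sketch uses finite-difference (Catmull-Rom) tangents, and the function names `rescale_unit` and `hermite_resample` are hypothetical. Each coordinate is rescaled independently; note that this kind of interpolant can slightly overshoot the interval [0, 1] between samples.

```python
def rescale_unit(values):
    """Min-max rescale one coordinate series into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hermite_resample(values, m=2000):
    """Resample a series to m points with a piecewise cubic Hermite interpolant.

    Assumption: tangents are finite differences of neighbouring samples
    (Catmull-Rom choice); the paper does not specify this detail.
    """
    n = len(values)
    if n < 2 or m < 2:
        raise ValueError("need at least two input and two output points")
    # Tangents with respect to the sample index.
    tangents = [values[1] - values[0]]
    tangents += [(values[i + 1] - values[i - 1]) / 2.0 for i in range(1, n - 1)]
    tangents.append(values[-1] - values[-2])
    out = []
    for k in range(m):
        u = k * (n - 1) / (m - 1)      # fractional position in the original index
        i = min(int(u), n - 2)         # left knot of the active segment
        t = u - i                      # local coordinate in [0, 1]
        h00 = 2 * t**3 - 3 * t**2 + 1  # cubic Hermite basis functions
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        out.append(h00 * values[i] + h10 * tangents[i]
                   + h01 * values[i + 1] + h11 * tangents[i + 1])
    return out


# Toy x-coordinate series (made-up numbers, not MCYT data).
x_raw = [3.0, 5.0, 4.0, 7.0, 6.5, 2.0]
x_series = hermite_resample(rescale_unit(x_raw), m=2000)
```

The same two calls would be applied to the y-coordinate, giving two series of length M = 2000 per signature.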

Information Theory quantifiers
The basic elements for the study of a system's dynamics, either natural or man-made, are sequences of measurements or observations whose evolution can be tracked through time. Given an observable of such a system, a natural question arises: how much information does this observable encode about the dynamics of the underlying system? The information content of a system is typically evaluated via a probability distribution function (PDF) P obtained from such an observable. We can define Information Theory quantifiers as measures able to characterize relevant properties of the PDF associated with these time series, and in this way we can judiciously extract information on the dynamical system under study.

Shannon entropy, Fisher Information Measure, and Statistical Complexity
Entropy is a basic quantity with multiple field-specific interpretations; for instance, it has been associated with disorder, state-space volume, and lack of information [19].When dealing with information content, the Shannon entropy is often considered the foundational and most natural one [20,21].
Given a continuous probability distribution function (PDF) $f(x)$ with $x \in \Omega \subset \mathbb{R}$ and $\int_\Omega f(x)\,dx = 1$, its associated Shannon entropy S [20,21] is

$$S[f] = -\int_\Omega f(x)\,\ln f(x)\,dx. \tag{1}$$

It is a global measure, that is, it is not too sensitive to strong changes in the distribution taking place on a small-sized region of Ω. Such is not the case with Fisher's Information Measure (FIM) F [22,23], which constitutes a measure of the gradient content of the distribution f, thus being quite sensitive even to tiny localized perturbations. It reads

$$F[f] = \int_\Omega \frac{1}{f(x)}\left[\frac{df(x)}{dx}\right]^2 dx = 4\int_\Omega \left[\frac{d\psi(x)}{dx}\right]^2 dx. \tag{2}$$

The Fisher Information Measure can be variously interpreted as a measure of the ability to estimate a parameter, as the amount of information that can be extracted from a set of measurements, and also as a measure of the state of disorder of a system or phenomenon [23], its most important property being the so-called Cramer-Rao bound. It is important to remark that the gradient operator significantly influences the contribution of minute local f-variations to the Fisher information value; accordingly, this quantifier is called "local" [23]. Note that the Shannon entropy decreases with the distribution skewness, while the Fisher information increases.
Local sensitivity is useful in scenarios whose description necessitates an appeal to a notion of "order". In the previous definition of FIM (Eq. (2)) the division by $f(x)$ is not convenient if $f(x) \to 0$ at certain points of the support Ω. We avoid this if we work with real probability amplitudes $f(x) = \psi^2(x)$, by means of the alternative expression that employs $\psi(x)$ [22,23]. This form requires no divisions, and shows that F simply measures the gradient content in $\psi(x)$.
Let now $P = \{p_i;\ i = 1, \dots, N\}$ be a discrete probability distribution, with N the number of possible states of the system under study. Shannon's logarithmic information measure reads

$$S[P] = -\sum_{i=1}^{N} p_i \ln p_i. \tag{3}$$

This can be regarded as a measure of the uncertainty (information) associated with the physical process described by P. For instance, if $S[P] = S_{\min} = 0$, we are in a position to predict with complete certainty which of the possible outcomes i, whose probabilities are given by $p_i$, will actually take place. Our knowledge of the underlying process described by the probability distribution is maximal in this instance. In contrast, our knowledge is minimal for a uniform distribution $P_e = \{p_i = 1/N,\ \forall i = 1, \dots, N\}$, since every outcome exhibits the same probability of occurrence and the uncertainty is maximal, i.e., $S[P_e] = S_{\max} = \ln N$. In the discrete case, we define a "normalized" Shannon entropy, 0 ≤ H ≤ 1, as

$$H[P] = \frac{S[P]}{S_{\max}}. \tag{4}$$

The concomitant problem of loss of information due to the discretization has been thoroughly studied (see, for instance, [24,25] and references therein) and, in particular, it entails the loss of Fisher's shift-invariance, which is of no importance for our present purposes. For the FIM we take the expression in terms of real probability amplitudes as a starting point; a discrete normalized FIM, 0 ≤ F ≤ 1, convenient for our present purposes, is then given by

$$F[P] = F_0 \sum_{i=1}^{N-1} \left[(p_{i+1})^{1/2} - (p_i)^{1/2}\right]^2. \tag{5}$$

It has been extensively discussed that this discretization is the best behaved in a discrete environment [26]. Here the normalization constant $F_0$ reads

$$F_0 = \begin{cases} 1 & \text{if } p_{i^*} = 1 \text{ for } i^* = 1 \text{ or } i^* = N, \text{ and } p_i = 0\ \forall i \neq i^*, \\ 1/2 & \text{otherwise.} \end{cases} \tag{6}$$

The perfect crystal and the isolated ideal gas are two typical examples of systems with minimum and maximum entropy, respectively. However, they are also examples of simple models and therefore of systems with zero complexity, as the structure of the perfect crystal is completely described by minimal information (i.e., distances and symmetries that define the elementary cell) and the probability distribution for the accessible states is centered around a prevailing state of perfect symmetry. On the other hand, all the accessible states of the ideal gas occur with the same probability and can be described by a "simple" uniform distribution.
According to López-Ruiz et al. [27], and using an oxymoron, an object, a procedure, or a system is said to be complex when it does not exhibit patterns regarded as simple. It follows that a suitable complexity measure should vanish both for completely ordered and for completely random systems, and cannot rely only on the concept of information (which is maximal and minimal for the above-mentioned systems). A suitable measure of complexity can be defined as the product of a measure of information and a measure of disequilibrium, i.e., some kind of distance from the equiprobable distribution of the accessible states of a system. In this respect, Rosso and coworkers [28] introduced an effective Statistical Complexity Measure (SCM) C that is able to detect essential details of the dynamical processes underlying the dataset.
Based on the seminal notion advanced by López-Ruiz et al. [27], this statistical complexity measure [28] is defined through the product

$$C[P] = Q_J[P, P_e] \cdot H[P] \tag{7}$$

of the normalized Shannon entropy H, see Eq. (4), and the disequilibrium $Q_J$ defined in terms of the Jensen-Shannon divergence $J[P, P_e]$. That is,

$$Q_J[P, P_e] = Q_0\, J[P, P_e] = Q_0 \left\{ S\!\left[\frac{P + P_e}{2}\right] - \frac{S[P]}{2} - \frac{S[P_e]}{2} \right\}, \tag{8}$$

the above-mentioned Jensen-Shannon divergence and $Q_0$, a normalization constant such that $0 \le Q_J \le 1$:

$$Q_0 = -2 \left\{ \frac{N+1}{N} \ln(N+1) - 2\ln(2N) + \ln N \right\}^{-1}, \tag{9}$$

which is equal to the inverse of the maximum possible value of $J[P, P_e]$. This value is obtained when one of the components of P, say $p_m$, is equal to one and the remaining $p_j$ are zero. The Jensen-Shannon divergence, which quantifies the difference between probability distributions, is especially useful to compare the symbolic composition between different sequences [29]. Note that the above-introduced SCM depends on two different probability distributions: one associated with the system under analysis, P, and the other the uniform distribution, $P_e$. Furthermore, it was shown that for a given value of H, the range of possible C values varies between a minimum $C_{\min}$ and a maximum $C_{\max}$, restricting the possible values of the SCM [30].
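As a check on the normalization constant, the maximum of the Jensen-Shannon divergence can be worked out by hand. The sketch below assumes the usual definition $J[P, P_e] = S[(P + P_e)/2] - S[P]/2 - S[P_e]/2$ and takes P concentrated on a single state, so that $S[P] = 0$ and $S[P_e] = \ln N$. The mixture $(P + P_e)/2$ then has one component equal to $(N+1)/(2N)$ and $N - 1$ components equal to $1/(2N)$:

$$
\begin{aligned}
S\!\left[\frac{P + P_e}{2}\right] &= -\frac{N+1}{2N}\ln\frac{N+1}{2N} - \frac{N-1}{2N}\ln\frac{1}{2N} = \ln(2N) - \frac{N+1}{2N}\ln(N+1), \\
J_{\max} &= \ln(2N) - \frac{N+1}{2N}\ln(N+1) - \frac{1}{2}\ln N, \\
Q_0 &= \frac{1}{J_{\max}} = -2\left\{\frac{N+1}{N}\ln(N+1) - 2\ln(2N) + \ln N\right\}^{-1}.
\end{aligned}
$$

For instance, with N = 2 this gives $J_{\max} = \ln 4 - \tfrac{3}{4}\ln 3 - \tfrac{1}{2}\ln 2 \approx 0.2158$, hence $Q_0 \approx 4.63$.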
Thus, it is clear that important additional information related to the correlational structure between the components of the physical system is provided by evaluating the statistical complexity measure. In this way, the information plane H × C constitutes a useful tool to visualize and characterize different dynamical systems.
If our system lies in a very ordered state, which occurs when almost all the $p_i$-values are zero except for a particular state $k \neq i$ with $p_k \cong 1$, both the normalized Shannon entropy and the statistical complexity are close to zero (H ≈ 0 and C ≈ 0), and the normalized Fisher information measure is close to one (F ≈ 1). On the other hand, when the system under study is represented by a very disordered state, that is, when all the $p_i$-values oscillate around the same value, we have H ≈ 1 while C ≈ 0 and F ≈ 0. One can state that the general behavior of the present discrete version of the FIM (Eq. (5)) is opposite to that of the Shannon entropy, except for periodic motions. The local sensitivity of the FIM for discrete PDFs is reflected in the fact that the specific "i-ordering" of the discrete values $p_i$ must be seriously taken into account in evaluating the sum in Eq. (5). This point was extensively discussed by Rosso and co-workers [31,32]. The summands can be regarded as a kind of "distance" between two contiguous probabilities. Thus, a different ordering of the pertinent summands would lead to a different FIM value, hence its local nature. In the present work, we follow the Lehmer lexicographic order [33] in the generation of the Bandt and Pompe PDF (see the next section). The local FIM, combined with a global quantifier such as the normalized Shannon entropy, defines the Shannon-Fisher plane, H × F, introduced by Vignat and Bercher [34]. These authors showed that this plane is able to characterize the non-stationary behavior of a complex signal.
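The limiting cases described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard discrete forms of the three quantifiers (Shannon entropy divided by ln N, Jensen-Shannon disequilibrium normalized by its maximum, and the square-root-amplitude Fisher sum with constant 1 or 1/2), and all function names are hypothetical.

```python
import math


def shannon(p):
    """Shannon entropy with natural logarithm; zero-probability terms contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def normalized_entropy(p):
    """H = S[P] / ln N, so that 0 <= H <= 1."""
    return shannon(p) / math.log(len(p))


def disequilibrium(p):
    """Q_J = Q_0 * J[P, P_e], the normalized Jensen-Shannon divergence to the uniform PDF."""
    n = len(p)
    mixture = [(q + 1.0 / n) / 2.0 for q in p]
    js = shannon(mixture) - shannon(p) / 2.0 - math.log(n) / 2.0
    q0 = -2.0 / (((n + 1) / n) * math.log(n + 1) - 2 * math.log(2 * n) + math.log(n))
    return q0 * js


def statistical_complexity(p):
    """C = Q_J * H."""
    return disequilibrium(p) * normalized_entropy(p)


def fisher_information(p):
    """Discrete normalized Fisher information; depends on the ordering of p."""
    f0 = 1.0 if (p[0] == 1.0 or p[-1] == 1.0) else 0.5
    return f0 * sum((math.sqrt(p[i + 1]) - math.sqrt(p[i])) ** 2
                    for i in range(len(p) - 1))


ordered = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # a single state has all the probability
disordered = [1.0 / 6.0] * 6               # uniform distribution
for name, p in (("ordered", ordered), ("disordered", disordered)):
    print(name, normalized_entropy(p), statistical_complexity(p), fisher_information(p))
```

Under these assumptions the ordered case gives H = 0, C = 0, F = 1 and the uniform case gives H = 1, C = 0, F = 0 (up to floating-point rounding), in agreement with the description above.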

The Bandt and Pompe approach to the PDF determination
The evaluation of the Information Theory derived quantifiers, like those previously introduced (Shannon entropy, Fisher information, and statistical complexity), supposes some prior knowledge about the system; specifically, a probability distribution associated with the time series under analysis should be provided beforehand. The determination of the most adequate PDF is a fundamental problem because the PDF P and the sample space Ω are inextricably linked.
Usual methodologies assign to each time point of the series X a symbol from a finite alphabet A, thus creating a symbolic sequence that can be regarded as a non-causal coarse-grained description of the time series under consideration. As a consequence, order relations and the time scales of the dynamics are lost. The usual histogram technique corresponds to this kind of assignment. Causal information may be duly incorporated if information about the past dynamics of the system is included in the symbolic sequence, i.e., symbols of alphabet A are assigned to a portion of the phase space or trajectory.
Many methods have been proposed for a proper selection of the probability space (Ω, P). Bandt and Pompe (BP) [35] introduced a simple and robust symbolic methodology that takes into account the time causality of the time series (causal coarse-grained methodology) by comparing neighboring values in a time series. The symbolic data are: (i) created by ranking the values of the series; and (ii) defined by reordering the embedded data in ascending order, which is tantamount to a phase space reconstruction with embedding dimension (pattern length) D and time lag τ. In this way, it is possible to quantify the diversity of the ordering symbols (patterns) derived from a scalar time series.
Note that the appropriate symbol sequence arises naturally from the time series, and no model-based assumptions are needed. In fact, the necessary "partitions" are devised by comparing the order of neighboring relative values rather than by apportioning amplitudes according to different levels. This technique, as opposed to most of those in current practice, takes into account the temporal structure of the time series generated by the physical process under study. As such, it allows us to uncover important details concerning the ordinal structure of the time series [14,36] and can also yield information about temporal correlation [37,38].
It is clear that this type of analysis of a time series entails losing details of the original series' amplitude information. Nevertheless, by just referring to the series' intrinsic structure, BP achieved a meaningful reduction in the difficulty of describing complex systems. The symbolic representation of time series by recourse to a comparison of consecutive (τ = 1) or nonconsecutive (τ > 1) values allows for an accurate empirical reconstruction of the underlying phase space, even in the presence of weak (observational and dynamic) noise [35]. Furthermore, the ordinal patterns associated with the PDF are invariant with respect to nonlinear monotonic transformations. Accordingly, nonlinear drifts or scaling artificially introduced by a measurement device will not modify the estimation of quantifiers, a nice property if one deals with experimental data (see, e.g., [39]). These advantages make the BP methodology more convenient than conventional methods based on range partitioning, i.e., a PDF based on histograms.
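The invariance under monotonic transformations is easy to verify on a toy series. The sketch below is an illustration only: it labels each window by the tuple of positions that sorts it in ascending order, which is one common labelling convention and an assumption here, and the helper name `ordinal_symbols` is hypothetical. A strictly increasing distortion leaves every pattern unchanged, whereas a decreasing one does not.

```python
import math


def ordinal_symbols(x, d=3, tau=1):
    """Return the ordinal pattern of every window of length d with lag tau.

    Each pattern is the tuple of window positions sorted by value
    (ties are resolved by position, since Python's sort is stable).
    """
    symbols = []
    for s in range(len(x) - (d - 1) * tau):
        window = [x[s + j * tau] for j in range(d)]
        symbols.append(tuple(sorted(range(d), key=lambda j: window[j])))
    return symbols


series = [0.3, 1.2, 0.7, 2.5, 1.9, 0.1, 0.8]        # made-up values
warped = [math.exp(2.0 * v) + 5.0 for v in series]  # strictly increasing distortion
flipped = [-v for v in series]                      # decreasing transformation

same_patterns = ordinal_symbols(series) == ordinal_symbols(warped)
changed_patterns = ordinal_symbols(series) != ordinal_symbols(flipped)
print(same_patterns, changed_patterns)
```

Because only the order relations inside each window enter the symbolization, any quantifier computed from the resulting pattern frequencies is unaffected by such a distortion.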
To use the BP methodology [35] for evaluating the PDF, P, associated with the time series (dynamical system) under study, one starts by considering partitions of the D-dimensional space that will hopefully "reveal" relevant details of the ordinal structure of a given one-dimensional time series $X(t) = \{x_t;\ t = 1, \dots, M\}$ with embedding dimension D > 1 (D ∈ N) and time lag τ (τ ∈ N). We are interested in "ordinal patterns" of order (length) D generated by

$$(s) \mapsto \left(x_{s-(D-1)\tau},\ x_{s-(D-2)\tau},\ \dots,\ x_{s-\tau},\ x_s\right), \tag{10}$$

which assign to each time s the D-dimensional vector of values at times s, s − τ, ..., s − (D − 1)τ. Clearly, the greater D, the more information on the past is incorporated into our vectors. By "ordinal pattern" related to the time (s), we mean the permutation $\pi = (r_0, r_1, \dots, r_{D-1})$ of [0, 1, ..., D − 1] defined by

$$x_{s-r_{D-1}\tau} \le x_{s-r_{D-2}\tau} \le \dots \le x_{s-r_1\tau} \le x_{s-r_0\tau}. \tag{11}$$

We set $r_i < r_{i-1}$ if $x_{s-r_i\tau} = x_{s-r_{i-1}\tau}$ for uniqueness, although ties in samples from continuous distributions have null probability. For all the D! possible orderings (permutations) $\pi_i$ when the embedding dimension is D and the time lag is τ, their relative frequencies can be naturally computed as the number of times this particular order sequence is found in the time series, divided by the total number of sequences,

$$p(\pi_i) = \frac{\#\{s \mid s \le M - (D-1)\tau;\ (s) \text{ is of type } \pi_i\}}{M - (D-1)\tau}, \tag{12}$$

where # denotes cardinality. Thus, an ordinal pattern probability distribution $P = \{p(\pi_i),\ i = 1, \dots, D!\}$ is obtained from the time series.
Figure 2 illustrates the construction principle of the ordinal patterns of length D = 2, 3 and 4 with τ = 1 [40]. Consider the sequence of observations $\{x_0, x_1, x_2, x_3\}$. For D = 2, there are only two possible directions from $x_0$ to $x_1$: up and down. For D = 3, starting from $x_1$ (up), the third part of the pattern can be above $x_1$, below $x_0$, or between $x_0$ and $x_1$. A similar situation can be found starting from $x_1$ (down). For D = 4, for each one of the six possible positions for $x_2$, there are four possible localizations for $x_3$, yielding D! = 4! = 24 different possible ordinal patterns. In Fig. 2, full circles and continuous lines represent the sequence values $x_0 < x_1 > x_2 > x_3$, which leads to the pattern π = [0321]. A graphical representation of all possible patterns corresponding to D = 3, 4 and 5 can be found in Fig. 2 of Parlitz et al. [40].
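The counting procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `bandt_pompe_pdf` is hypothetical, windows are taken forward in time (which yields the same set of windows as the backward-looking description), and each pattern is labelled by the tuple of window positions sorted by value, an assumed convention that is a relabelling of the permutations used in the text.

```python
from collections import Counter
from itertools import permutations


def bandt_pompe_pdf(x, d=3, tau=1):
    """Relative frequency of each of the d! ordinal patterns in the series x."""
    n_windows = len(x) - (d - 1) * tau
    if n_windows < 1:
        raise ValueError("series too short for this embedding dimension and lag")
    counts = Counter()
    for s in range(n_windows):
        window = [x[s + j * tau] for j in range(d)]
        # Positions of the window sorted by value; ties keep temporal order.
        counts[tuple(sorted(range(d), key=lambda j: window[j]))] += 1
    # Report all d! patterns, including those that never occur.
    return {pi: counts[pi] / n_windows for pi in permutations(range(d))}


toy = [4, 7, 9, 10, 6, 11, 3]   # short illustrative series
pdf2 = bandt_pompe_pdf(toy, d=2)
pdf3 = bandt_pompe_pdf(toy, d=3)
print(pdf2)
print(pdf3)
```

For this seven-point series with D = 2, four of the six pairs are increasing and two are decreasing, so the two probabilities are 4/6 and 2/6. With D = 3 there are five windows: two are fully increasing, two have their smallest value last, one has its smallest value in the middle and its largest last, and the remaining three patterns never occur.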
The embedding dimension D plays an important role in the evaluation of the appropriate probability distribution, because D determines the number of accessible states D! and also conditions the minimum acceptable length M ≫ D! of the time series that one needs in order to work with reliable statistics [41]. Regarding the selection of the parameters, Bandt and Pompe suggested working with 4 ≤ D ≤ 6, and specifically considered a time lag τ = 1 in their cornerstone paper [35]. Nevertheless, it is clear that other values of τ could provide additional information. It has been recently shown that this parameter is strongly related, when it is relevant, to the intrinsic time scales of the system under analysis [42][43][44].
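To get a feel for the requirement that the series length be much larger than D!, the short sketch below tabulates, for a series of M = 2000 points (the length used after resampling in this work), the number of accessible states and the average number of embedding vectors per state. The helper name `windows_per_pattern` is hypothetical, and how large the ratio must be for "reliable statistics" is not quantified here.

```python
import math


def windows_per_pattern(m, d, tau=1):
    """Average number of embedding vectors available per ordinal pattern."""
    n_windows = m - (d - 1) * tau
    return n_windows / math.factorial(d)


M = 2000  # series length after resampling
for d in range(3, 8):
    print(d, math.factorial(d), round(windows_per_pattern(M, d), 2))
```

With τ = 1, the ratio drops from about 333 vectors per pattern at D = 3 to fewer than 3 at D = 6, and below 1 at D = 7, which shows how quickly the factorial growth of the state space exhausts a series of this length.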
Additional advantages of the method reside in i) its simplicity (it requires few parameters: the pattern length/embedding dimension D and the time lag τ), and ii) the extremely fast nature of the calculation process. The BP methodology can be applied not only to time series representative of low-dimensional dynamical systems, but also to any type of time series (regular, chaotic, noisy, or reality based). In fact, the existence of an attractor in the D-dimensional phase space is not assumed. The only condition for the applicability of the BP method is a very weak stationarity assumption: for k ≤ D, the probability for $x_t < x_{t+k}$ should not depend on t. For a review of BP's methodology and its applications to physics, biomedical, and econophysics signals, see Zanin et al. [45]. Moreover, Rosso et al. [14] show that the above-mentioned quantifiers produce better descriptions of the dynamics of the associated process when the PDF is computed using BP rather than the usual histogram methodology.
The BP proposal for associating probability distributions to time series (of an underlying symbolic nature) constitutes a significant advance in the study of nonlinear dynamical systems [35]. The method provides a univocal prescription for ordinary, global entropic quantifiers of the Shannon kind. However, as was shown by Rosso and coworkers [31,32], ambiguities arise in applying the BP technique with reference to the permutation of ordinal patterns. This happens if one wishes to employ the BP probability density to construct local entropic quantifiers, like the Fisher information measure, which would characterize time series generated by nonlinear dynamical systems.
The local sensitivity of the Fisher information measure for discrete PDFs is reflected in the fact that the specific "i-ordering" of the discrete values $p_i$ must be seriously taken into account in evaluating Eq. (5). Each summand can be regarded as a kind of "distance" between two contiguous probabilities. Thus, a different ordering of the summands will lead, in most cases, to a different Fisher information value. In fact, if we have a discrete PDF given by $P = \{p_i,\ i = 1, \dots, N\}$, we will have N! possibilities for the i-ordering.
The question is, which arrangement could one regard as the "proper" ordering? The answer is straightforward in some cases, the histogram-based PDF constituting a conspicuous example. For such a procedure, one first divides the interval [a, b] (with a and b the minimum and maximum amplitude values in the time series) into a finite number of non-overlapping sub-intervals (bins). Thus, the division procedure of the interval [a, b] provides the natural order sequence for the evaluation of the PDF gradient involved in the Fisher information measure. In the present paper, we chose the lexicographic ordering given by the algorithm of Lehmer [33], among other possibilities, due to its better distinction of different dynamics in the Shannon-Fisher plane, H × F (see [31,32]).
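The effect of the pattern ordering on the Fisher information can be illustrated with a small example. This sketch rests on assumptions: the exact indexing produced by Lehmer's algorithm [33] is not spelled out in the text, so patterns are ranked here by their Lehmer code read in the factorial number system, which coincides with the lexicographic order of the permutation tuples; the Fisher sum uses the square-root-amplitude form with constant 1 or 1/2; all names are hypothetical.

```python
import math
from itertools import permutations


def lehmer_code(perm):
    """For each entry, count the later entries that are smaller."""
    return [sum(1 for later in perm[i + 1:] if later < perm[i])
            for i in range(len(perm))]


def lexicographic_rank(perm):
    """Index of perm among all permutations of its elements, in lexicographic order."""
    d = len(perm)
    return sum(c * math.factorial(d - 1 - i) for i, c in enumerate(lehmer_code(perm)))


def shannon(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def fisher_information(p):
    """Discrete normalized Fisher information for the given ordering of p."""
    f0 = 1.0 if (p[0] == 1.0 or p[-1] == 1.0) else 0.5
    return f0 * sum((math.sqrt(p[i + 1]) - math.sqrt(p[i])) ** 2
                    for i in range(len(p) - 1))


# itertools.permutations of a sorted input is already in lexicographic order.
ranks = [lexicographic_rank(pi) for pi in permutations(range(3))]

# The same six probabilities arranged in two different orders (made-up values).
p_contiguous = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]
p_interleaved = [0.5, 0.0, 0.3, 0.0, 0.2, 0.0]
print(ranks)
print(shannon(p_contiguous), shannon(p_interleaved))
print(fisher_information(p_contiguous), fisher_information(p_interleaved))
```

Both arrangements have the same Shannon entropy, since the entropy ignores the order of the states, whereas the Fisher value is roughly 0.12 for the contiguous arrangement and 0.75 for the interleaved one. This is why a fixed, reproducible ordering of the patterns has to be chosen before the Fisher information is evaluated.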

Signature features and exploratory data analysis
Online handwritten signature classification and verification is an interesting and challenging classification problem. On the one hand, intra-personal variation can be large. Some people provide signatures with poor consistency. The speed, pressure, and inclination, for example, of the signatures made by the same person can differ greatly in regularity, which makes it quite challenging to extract consistent features. On the other hand, in practice we can only obtain a few samples from one person and no forgeries. This makes it very difficult to determine the reliability of the extracted features. The main idea is to construct an efficient classification scheme for data acquisition, or for the reduction of often unmanageably large datasets to a parsimonious form, without losing important statistical information. We aim at discovering relevant characteristic statistical structures which could be exploited if the key information can be efficiently condensed into a suitable low-dimensional object.
The features we employ in this work are the Information Theory quantifiers already presented. For each of the k subjects (k = 1, ..., 100) in the database and its j associated signatures (25 genuine and 25 skilled forgery), two associated time series X We computed the normalized permutation Shannon entropy H, the permutation statistical complexity C, and the permutation Fisher information measure F from these PDFs, and the obtained values are denoted as:
We perform Exploratory Data Analysis (EDA) on the Information Theory quantifiers, looking for simple descriptions of the data. Apart from simple descriptive univariate measures, we use the Pearson correlation to measure the association between features. This analysis was performed using the R language and platform version 3.2.1 (http://www.R-project.org).
Figure 3 shows a scatterplot of the Entropy for both the genuine and skilled forgery signatures. The 5000 points correspond to 25 genuine signatures (in blue) and 25 skilled forgery signatures (in red) for each of the 100 subjects. Entropies are less dispersed in the genuine than in the skilled forgery signatures, a sign of the separability between them. Marginal density plots show the distribution of entropy for each coordinate of both types of signatures. These plots, however limited by their marginal nature, reveal several modes and suggest both wide and narrow structures in the data.
Figure 4 shows the contour plots of bivariate kernel density estimates for the entropy in genuine and forgery signatures. A number of features are immediately noticeable. The dispersion in the former group is much smaller than in the latter (less than 0.4). The kernel density estimates reveal skewness and a mild multimodality in the joint distribution of the data. There are also quite a few points that lie far from these curves and cluster centers. These points correspond to abnormal local estimates obtained in heterogeneous blocks, possibly induced by the presence of clusters. The modes in genuine signatures are smaller than in forgery signatures, and this may be used as a discriminatory measure. Similar results are obtained for the Complexity and the Fisher information; these are reported in the Supplementary Information, see Figs. S1 to S4.

Signatures classification
As pointed out by Boulétreal et al. [46], a signature is characterized by two aspects: a) a conscious one associated with the signature pattern; and b) an unconscious one which leads to the spontaneous movements constituting the drawing. These two factors produce high variability, the amount of signature variability being strongly writer-dependent. In fact, the signature variability or, conversely, the signature stability can be considered an important indicator for writer characterization [47]. Houmani and Garcia-Salicetti [47] argue that signature stability is required in genuine signatures in order to characterize a writer: the less stable a signature is, the more likely it is that a forgery will be dangerously close to genuine signatures for any classifier. Also, sufficiently complex signatures are required in order to guarantee a certain level of security, in the sense that the more complex a signature is, the more difficult it will be to forge [47].
Boulétreal and collaborators [46,48] propose a signature complexity measure related to signature legibility and based on fractal dimension. They classify writer styles into highly cursive, very legible, separated, badly formed and small writings, using only genuine signatures. Unfortunately, the resulting categories were not tested against classifiers for performance analysis.
We classify the genuine signatures based on causal Information Theory quantifiers: Normalized Permutation Shannon Entropy, Permutation Statistical Complexity and Permutation Fisher Information Measure of both the X and Y trajectories, for each of the one hundred writers in the MCYT database and their 25 original signatures. The mean and standard deviation values were clustered using the neighbor-joining method and an automatic Hierarchical Clustering with the Euclidean distance-based dissimilarity matrix. Each feature was treated independently, and the results are shown as circular dendrograms. Figure 5 shows the results of clustering the Entropy. We distinguish three classes of genuine signatures denoted by H1, H2, and H3.
The H3 group is formed at, approximately, the 43% level by the fusion of two other highly unbalanced subgroups: one, H3A, with only two subjects (44, 46) and the other, H3B, with thirteen subjects (3, 11, 14, 26, 47, 53, 54, 56, 61, 72, 90, 93, 98). These two clusters form at approximately the same level. The former is composed of calligraphic signatures where vertical traces predominate over horizontal ones. The latter is composed of highly cursive signatures, where separation between the surname and the family name predominates.
The same clustering results were obtained with the Manhattan ($L_1$ norm) and Maximum ($L_\infty$ norm) distances, showing that Entropy is an expressive and stable quantifier. Similar analyses were carried out with the Permutation Statistical Complexity and Permutation Fisher Information (presented in Figs. S5 and S6 in the Supplementary Information). Complexity produces the same clusters identified by Entropy, so it adds no new information. The Fisher information measure forms the same H1A group that was identified by the Entropy, but with less cohesion, at about 15%. In other words, these nine subjects are more similar locally than globally. As with Entropy, three main groups form at similar levels. The members of these clusters are slight variations of those identified using Entropy, with very similar structure.
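For reference, the three dissimilarities mentioned above can be sketched in a few lines of Python. The function names are hypothetical, and the input vectors stand for per-writer summaries (for instance, the mean and standard deviation of a quantifier); the clustering step itself is not reproduced here.

```python
import math

def euclidean(u, v):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def maximum(u, v):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def dissimilarity_matrix(points, dist):
    """Full pairwise dissimilarity matrix for a list of feature vectors."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
```

A hierarchical clustering routine would take such a matrix as input; obtaining the same dendrogram under all three choices is what the text reports for the Entropy.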
Table 1 presents the mean and standard deviation of the three quantifiers over the 25 genuine and 25 skilled forgery signatures (X and Y time series) for each of the typical subjects, split into the three aforementioned types H1, H2 and H3. There are interesting tendencies in these data. Genuine signatures present quantifier values lower than those corresponding to forgery signatures, and the latter also exhibit larger standard deviations. This could be explained by the imitative character of these signatures, although it deserves closer study.
The classification into subclasses of genuine signatures was also carried out with the parallelepiped algorithm [49], arguably the simplest model-free classification procedure. Entropy leads to clusters with good interpretability. Figure 6 shows the regions that define the three classes identified by the dendrogram based on Entropy presented in Fig. 5. All subclasses are well separated by disjoint boxes, with the only exception of H1B and H2A, which overlap slightly but without compromising the discrimination. The classes are preserved when this classification is superimposed on the Complexity and Fisher Information features; see Figs. S7 and S8 in the Supplementary Information.
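The box-based classification described above can be sketched as follows. This is an illustration under stated assumptions: each class box is taken as the per-feature minimum and maximum of its training samples (other variants use the mean plus or minus a multiple of the standard deviation), and the function names are hypothetical.

```python
def fit_boxes(samples_by_class):
    """Map each class label to its axis-aligned bounding box (mins, maxs)."""
    boxes = {}
    for label, samples in samples_by_class.items():
        # zip(*samples) iterates over features (columns) instead of samples.
        mins = [min(column) for column in zip(*samples)]
        maxs = [max(column) for column in zip(*samples)]
        boxes[label] = (mins, maxs)
    return boxes

def classify(point, boxes):
    """Return the labels of every box containing the point.

    An empty list means the point is unclassified; more than one label
    means the point falls in a region where boxes overlap.
    """
    return [
        label
        for label, (mins, maxs) in boxes.items()
        if all(lo <= x <= hi for lo, x, hi in zip(mins, point, maxs))
    ]
```

Returning all matching labels, rather than a single one, makes overlaps such as the slight one reported between H1B and H2A explicit.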

Online signature verification
The problem at hand consists in identifying suspicious signatures, given that we only have examples of genuine signatures. This is due to the fact that, in practice, it is too expensive, too hard or even impossible to obtain a significant number of good-quality forgery signatures for every possible individual in the database. This configures a One-Class classification problem. Among the many ways of tackling such problems, Support Vector Machines (SVMs) are suitable for solving machine learning problems even in high-dimensional feature spaces [50-52].
SVMs were introduced by Vapnik and co-workers [51,53], and extended by a number of other researchers. Their remarkably robust performance with respect to sparse and noisy data makes them the method of choice in several applications. An SVM is primarily a method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. SVMs perform both regression and classification tasks and can handle multiple continuous and categorical variables. To construct an optimal hyperplane, an SVM employs an iterative training algorithm, which is used to minimize an error function.
One-Class Support Vector Machines (OC-SVMs) are a natural extension of SVMs [54,55]. The solution consists in estimating a distribution that encompasses most of the observations, and then labeling as "suspicious" those that lie far from it with respect to a suitable metric. An OC-SVM solution is built by estimating a probability distribution function which makes most of the observed data more likely than the rest, and a decision rule that separates these observations by the largest possible margin. The learning phase is computationally intensive because the training of an OC-SVM involves a quadratic programming problem [51], but once the decision function is determined, it can be used to predict the class label of new test data effortlessly.
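In our case, the observations are six-dimensional vectors: Entropy, Complexity and Fisher Information in each of the two directions, horizontal and vertical. Given the training examples $z_1, \dots, z_N$ of genuine signatures and a kernel map $\Phi$ into the feature space $G$, the OC-SVM solves the quadratic programming problem
$$\min_{w,\,\xi,\,b} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - b, \qquad w \cdot \Phi(z_i) \geq b - \xi_i,$$
subject to $\nu \in (0, 1]$, $\xi_i \geq 0$, $\forall i = 1, \dots, N$, and where $\xi_i$ are nonnegative slack variables which allow the procedure to incur errors. Using Lagrange techniques and a kernel function $K(z, z_i) = \Phi(z)^T \Phi(z_i)$ for the dot-product calculations, the decision function $f(z)$ becomes
$$f(z) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i K(z, z_i) - b\right).$$
This method thus creates a hyperplane characterized by $w$ and $b$ which has maximal distance from the origin in the feature space $G$ and separates all the data points from the origin. Here $\alpha_i$ are the Lagrange multipliers; every $\alpha_i > 0$ is weighted in the decision function and thus "supports" the machine; hence the name Support Vector Machine. Since SVMs are considered to be sparse, there will be relatively few Lagrange multipliers with a nonzero value. Our choice for the kernel is the Gaussian Radial Basis function
$$K(z_i, z_j) = \exp\left(-\frac{\|z_i - z_j\|^2}{2\sigma^2}\right),$$
where $\sigma \in \mathbb{R}$ is a kernel parameter and $\|z_i - z_j\|^2$ is the dissimilarity measure; we used the Euclidean distance.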
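The decision rule described in the paragraph above, a kernel expansion over the support vectors compared against an offset, can be sketched in Python as follows. This only evaluates an already trained machine; it does not solve the quadratic program. The kernel normalization exp(-d^2 / (2 sigma^2)) is an assumption, and the function names, support vectors, multipliers and offset in any usage are hypothetical placeholders.

```python
import math

def rbf_kernel(u, v, sigma2=10.0):
    """Gaussian kernel on the squared Euclidean distance (assumed 2*sigma^2 scaling)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma2))

def decision_function(z, support_vectors, alphas, b, sigma2=10.0):
    """Return +1 (typical) or -1 (suspicious) for the feature vector z.

    support_vectors and alphas are the training points with nonzero
    Lagrange multipliers and the multipliers themselves; b is the offset.
    """
    score = sum(
        alpha * rbf_kernel(z, sv, sigma2)
        for alpha, sv in zip(alphas, support_vectors)
    )
    return 1 if score - b >= 0 else -1
```

Only the support vectors enter the sum, which is why prediction is cheap once training has finished.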
The parameter σ² = 10 was selected by 5-fold cross-validation, that is, the dataset is divided into five disjoint subsets, and the method is repeated five times. Each time, one of the subsets is used as the test set and the other four subsets are put together to form the training set. Then the average error across all trials is computed. Every observation belongs to a test set exactly once, and belongs to a training set four times. Accuracy (ACC), Area Under the ROC Curve (AUC) and Equal Error Rate (EER) are used as performance measures [56].
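The splitting scheme just described can be sketched as follows; the function name and the fixed random seed are illustrative assumptions, and the model fitting that would happen inside each fold is omitted.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Return k (train, test) index splits over range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k disjoint folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

With k = 5, each index appears in exactly one test set and in four training sets, matching the description above; the candidate value of σ² with the lowest average error over the five folds would then be retained.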
In the context of one-class signature verification, a false positive occurs when a genuine signature is erroneously classified as atypical. The probability of such a misclassification is the false positive rate, which is controlled by the parameter ν in the aforementioned OC-SVM formulation. The parameter ν can be fixed a priori, and it corresponds to the fraction of observations of the typical data that will be flagged as atypical, i.e., the Type I error.
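As an illustration of these error measures, the sketch below computes the false positive rate in the sense used here (genuine signatures flagged as atypical), the complementary false negative rate (forgeries accepted), and an equal error rate obtained by sweeping a threshold over classifier scores. The function names and the convention that higher scores mean "more typical" are assumptions.

```python
def error_rates(genuine_scores, forgery_scores, threshold):
    """Scores >= threshold are accepted as typical (genuine).

    Returns (fpr, fnr): fpr is the fraction of genuine signatures rejected,
    fnr the fraction of forgeries accepted.
    """
    fpr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    fnr = sum(s >= threshold for s in forgery_scores) / len(forgery_scores)
    return fpr, fnr

def equal_error_rate(genuine_scores, forgery_scores):
    """Approximate EER: error at the threshold where both rates are closest."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine_scores) | set(forgery_scores)):
        fpr, fnr = error_rates(genuine_scores, forgery_scores, threshold)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2.0
    return best_eer
```

Forgery scores are needed only for evaluation; training itself uses genuine signatures alone.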
We used the LIBSVM tool (version 2.0) [57], linked with the R software, which provides support vector classification and regression, including OC-SVM. We used the standard parameters of the algorithm.
In order to assess the consistency of our procedure, and to facilitate comparison with other methods reported in the literature, we evaluate the performance of the proposed verification system for different training samples: random samples of size n (n = 5, 10, 14, 18, 22) of genuine signatures were selected for each user. Table 2 presents the average value of all performance metrics using σ² = 10. ACC suggests that the larger the training sample, the better the performance. AUC presents a similar tendency, and its average is larger than 0.88, indicating that our verification system produces an excellent classification.
As mentioned in the introduction, the two methodologies with the best results are those based on Dynamic Time Warping (DTW) and Hidden Markov Models (HMM). In the following we compare our proposal with these two recent state-of-the-art methods using the EER(%) over the same database. The results of our proposal using five (ten, respectively) training samples are EER(%) = 0.19 (0.17, respectively). Clearly, our system provides better performance using a similar number of training signatures (see Table 2 for more details).
In the following we analyze the performance of the proposed procedure applied selectively to the pre-classified samples. Table 3 presents the performance of the system when applied to genuine pre-classified signatures. For all classes we observe that the larger the training sample, the larger the average ACC. The best average AUC values are observed for the class H2, followed by H1 and H3. This indicates that H2 signatures are easily identifiable. Note that the mean values of EER(%) for H2 are smaller than those for H1 and H3. The EER(%) values in H3 indicate that identifying forgeries in this class is hard.
In the scatterplot of Fig. 3, which shows the genuine (blue) and skilled forgery (red) signatures of each of the 100 subjects, both types of signatures show similar association (Pearson correlation): $\mathrm{Corr}(H^{(k;G)}_{X;j}, H^{(k;G)}_{Y;j}) = 0.9665$ and $\mathrm{Corr}(H^{(k;F)}_{X;j}, H^{(k;F)}_{Y;j}) = 0.9770$. The entropies of both types of signatures overlap and are scattered elliptically. However, the bivariate mean and dispersion values differ.
The parameter ν characterizes the solution, as a) it sets an upper bound on the fraction of outliers (training examples regarded as out-of-class) and b) it is a lower bound on the fraction of training examples used as Support Vectors. We used ν = 0.1 in our proposal.

Figure 1: Signatures of six different subjects from the MCYT database. Two genuine signatures (left, blue) and a skilled forgery (right, red). The first two signatures were classified as types H1A and H1B, the following two as types H2A and H2B, and the last two as types H3A and H3B; cf. Sec. Signatures classification.

Figure 2: Illustration of the construction principle for ordinal patterns of length D [40]. If D = 4 and τ = 1, full circles and continuous lines represent the sequence of values $x_0 < x_1 > x_2 > x_3$, which leads to the pattern π = [0321].

Figure 3: Scatter plot with marginal kernel density estimates of entropy quantifiers in both trajectory coordinates time series X and Y. Genuine (blue) and skilled forgery signatures (red points), 100 subjects.Marginal kernel densities depict the distribution of entropy quantifiers along both axes.

Figure 4: Contour plot superimposed on the scatterplot of entropy quantifiers for genuine (right panel) and skilled forgery signatures (left panel).
The six features are the Entropy, Complexity and Fisher Information in each of the two directions, horizontal and vertical, and we train the OC-SVM with genuine signatures. Let $Z = \{z_1, z_2, \dots, z_N\}$ be the six-dimensional training examples of genuine signatures. Let $\Phi : Z \to G$ be a kernel map which transforms the training examples to another space. Then, to separate the data set from the origin, one needs to solve a quadratic programming problem.

Table 2: Performance of the system trained with a varying number n of samples of genuine signatures. ↑ and ↓ denote measures of quality (the higher the better) and of error (the smaller the better), respectively.

Table 3: Performance of the classification of pre-classified samples, varying the number n of samples of genuine signatures used for training; same coding as in Table 2.