Perspectives of human verification via binary QRS template matching of single-lead and 12-lead electrocardiogram

Objective This study aims to validate the 12-lead electrocardiogram (ECG) as a biometric modality based on two straightforward binary QRS template matching characteristics. Different perspectives of the human verification problem are considered, regarding the optimal lead selection and stability over sample size, gender, age and heart rate (HR). Methods A clinical 12-lead resting ECG database, including a population of 460 subjects with two-session recordings (>1 year apart), is used. Cost-effective strategies for extraction of personalized QRS patterns (100 ms) and binary template matching estimate similarity in the time scale (matching time) and dissimilarity in the amplitude scale (mismatch area). The two-class person verification task, taking the decision to validate or to reject the subject identity, is managed by linear discriminant analysis (LDA). Non-redundant LDA models for different lead configurations (I, II, III, aVR, aVL, aVF, V1-V6) are trained on the first half of the population (230 subjects) by stepwise feature selection until maximization of the area under the receiver operating characteristic curve (ROC AUC). The operating point on the training ROC at equal error rate (EER) is tested on the independent dataset (second half of the population, 230 subjects) to report unbiased validation of test-ROC AUC and true verification rate (TVR = 100-EER). The test results are further evaluated in groups by sample size, gender, age and HR. Results and discussion The optimal QRS pattern projection for single-lead ECG biometric modality is found in the frontal plane sector (60°-0°) with best (Test-AUC/TVR) for lead II (0.941/86.8%) and slight accuracy drop for -aVR (-0.017/-1.4%) and I (-0.01/-1.5%). Chest ECG leads have degrading accuracy from V1 (0.885/80.6%) to V6 (0.799/71.8%). The multi-lead ECG improves verification: 6-chest (0.97/90.9%), 6-limb (0.986/94.3%), 12-leads (0.995/97.5%).
The QRS pattern matching model shows stable performance for verification of 10 to 230 individuals, with insignificant degradation of TVR in women (1.2–3.6%), adults ≥70 years (3.7%), younger adults <40 years (1.9%), HR<60 bpm (1.2%) and HR>90 bpm (3.9%), and no degradation for HR change (0 to >20 bpm).

statistical analysis of one-year distant measurements over an uncommonly large population, is an important asset to provide further evidence about the stability and scalability of the ECG as a biometric modality. The statistical analysis is presented for different perspectives of the human verification problem, i.e. the choice of the optimal single and multi-lead ECG set; the influence of the test database size and different physiological factors (gender, age, heart rate).

Related studies on QRS template matching
Despite the great number of recent studies on ECG biometrics, evidenced in extensive literature surveys and reviews [12,30,38-40], this field is still in a state of active research on different ECG transforms, extracted features and classification methods. We further review different template matching techniques, which utilize the biometric information carried by the beat morphology. Generally, the template matching process involves: pre-processing, template extraction, feature calculation, dimensionality reduction and classification.

Database
This retrospective study considers a proprietary clinical ECG database, provided courtesy of Schiller AG (Switzerland) for the purpose of human biometrics on a large population observed over time:
• Anonymization: The biometric database is anonymized and analyzed under conditions keeping the privacy of the involved subjects.
The person verification scheme for comparison of subjects between S1 and S2 sessions gives a total of N = 460 pairs of subjects with equal identity (ID) and N × (N-1) = 211140 pairs of subjects with different ID. Our approach to handle the imbalance ratio (459:1) of different-to-equal ID pairs considers two independent datasets (Fig 1):
• Training dataset: 230/230 equal/different ID pairs, presuming that the verification classifier should be trained on the first half of subjects using balanced data, not over-fitted to any of the classes.
• Test dataset: 230/210910 equal/different ID pairs, ensuring that unbiased classifier performance is further reported on a big dataset, including all available cases fully independent from the training.
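The pair counts and imbalance ratios quoted above can be checked with a few lines of arithmetic. The Python sketch below is only an illustration (the study itself is implemented in Matlab); the variable names are ours, and it assumes that the test set receives every different-ID pair not used for training, which is consistent with the reported figure of 210910.

```python
# Pair counts for the S1-vs-S2 verification scheme (numbers taken from the text).
n_subjects = 460
equal_pairs = n_subjects                           # one S1/S2 pair per subject
different_pairs = n_subjects * (n_subjects - 1)    # ordered pairs with different ID

train_equal, train_different = 230, 230            # balanced training set
test_equal = equal_pairs - train_equal             # remaining equal-ID pairs
test_different = different_pairs - train_different # assumption: all remaining pairs

print(different_pairs)                  # 211140
print(different_pairs // equal_pairs)   # 459  (imbalance of the full database)
print(test_equal, test_different)       # 230 210910
print(test_different // test_equal)     # 917  (imbalance of the test set)
```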

QRS pattern analysis
The presented method for extraction of subject-specific ECG information is focused on the QRS waveform, which is a prominent feature in many heartbeat classification and automated diagnostic systems. In addition, we consider the stability of the QRS complex with respect to the heart rate, previously shown to outperform the QT-signal for the purpose of ECG biometrics [35]. The main methodological concern is the proper extraction of 12-lead QRS patterns and the subsequent quantification of the lead-specific QRS waveform differences between pairs of recordings. It is presented below as a four-stage QRS pattern analysis process, including: (1) QRS pattern extraction; (2) amplitude normalization; (3) time-amplitude approximation; (4) pattern matching and feature extraction.

QRS pattern extraction.
Each ECG recording is processed by a certified commercial ECG measurement and interpretation module (ETM, Schiller AG, Switzerland) for extraction of a 12-lead average beat with duration of 500ms. The embedded arrhythmia detection and lead quality monitoring algorithms reject beats with abnormal morphologies (e.g. ventricular extrasystoles and artifacts). The average beats are commonly used for measurement of ECG waves with diagnostic precision because they provide higher signal-to-noise ratio (SNR) and are more robust with respect to respiration induced morphology changes than the single beats. We observe a time shift between the average beats from different recordings (Fig 2). Therefore, the task for extraction of aligned QRS patterns is of crucial importance for the correct inter-subject comparisons. In order to provide a more accurate analysis during the subsequent time-alignment and QRS pattern matching calculations, the time resolution of the average beats is increased to 1 ms by resampling from 500 to 1000 Hz. We employ the Matlab function 'resample' (upsampling with a Kaiser window anti-aliasing filter). The time-alignment is performed by maximal cross-correlation between the average beat and a reference pattern. The reference pattern (Fig 2) has been initialized at the beginning of the study as a 'normally' behaving average beat in lead I (with positive P-QRS-T waves), belonging to a subject from the population.
At the next step, the QRS patterns of all subjects are synchronously extracted for all 12 leads, taking the subject's average beat in a window of 30 ms before and 70 ms after the fiducial point, aligned to the R-peak of the reference pattern (see Fig 2). The window length of 100 ms was not tuned for this specific biometric study; rather, it reflects the average length of a normal QRS interval. The short window prevents the selected pattern from including the P and T waves and the ST interval, taking into consideration the findings of our previous study [21]. That study distinguished the biometric potential of the amplitude-temporal features of the R and S waves and rejected the P, ST and T parts due to low intra-subject reproducibility and low inter-subject variability. This is also confirmed in [44], reporting that P-waves are dominated by noise, while T-waves are not distinct for biometrics.
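The alignment and windowing steps described above can be sketched as follows. This is a simplified Python illustration, not the authors' Matlab code: it assumes the average beat is already resampled to 1000 Hz (1 sample = 1 ms), uses plain cross-correlation without baseline removal, and the function name `align_and_extract` and the synthetic Gaussian "beats" are hypothetical.

```python
import numpy as np

def align_and_extract(avg_beat, reference, ref_r_index, pre_ms=30, post_ms=70):
    """Align an average beat to a reference pattern by maximal cross-correlation
    and cut a (pre_ms + post_ms)-sample QRS window around the reference R-peak."""
    beat = np.asarray(avg_beat, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # In 'full' mode, output index (len(ref) - 1) corresponds to zero lag.
    xcorr = np.correlate(beat, ref, mode="full")
    lag = int(np.argmax(xcorr)) - (len(ref) - 1)
    fiducial = ref_r_index + lag          # R-peak position mapped into the beat
    return lag, beat[fiducial - pre_ms:fiducial + post_ms]

# Synthetic example: a 500 ms reference with its peak at 250 ms,
# and a second beat that is delayed by 12 ms.
t = np.arange(500)
reference = np.exp(-0.5 * ((t - 250) / 8.0) ** 2)
avg_beat = np.roll(reference, 12)

lag, qrs = align_and_extract(avg_beat, reference, ref_r_index=250)
print(lag, len(qrs), int(np.argmax(qrs)))   # 12 100 30
```

The extracted window has 100 samples with the peak 30 ms after its start, matching the 30 ms / 70 ms split described in the text.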

QRS pattern amplitude normalization.
In order to compensate for large inter-subject and inter-lead amplitude spans, the amplitudes of 12-lead QRS patterns in any ID pair from sessions Si = (S1, S2) and lead Li = (1, 2, ..., 12) are normalized. The aim of normalization is to further use the same computational range [-1;1] for all individuals, regardless of their signal amplitudes. We note that the re-scaling process is offline and there is no need for any prior settings of the scale factor based on unknown (expected) amplitudes.
Larger values of (Δt, Δa) form a coarse grid, which gives a rougher approximation of QRS^{Si}_{Li}(t) in a smaller binQRS^{Si}_{Li}(t, a) matrix at the cost of potential loss of QRS pattern waveform details. Conversely, smaller values of (Δt, Δa) form a fine grid, which gives a finer approximation of QRS^{Si}_{Li}(t) in a larger binQRS^{Si}_{Li}(t, a) matrix, thus increasing the computation cost. In our application, the settings of both resolutions are:
• Δt = 1 ms delineates the finest resolution in the time-scale, defined by the sampling rate of 1000 Hz.
The size of the binary matrix binQRS^{Si}_{Li} is 100 columns and 80 rows, occupying a memory of 1 kB per lead. On demand, it can be easily re-sized by changing (Δt, Δa) settings. The present settings equalize small variations of the cardiac depolarization process within ±Δt (±1 ms) over time and ±2Δa (±2.5%) over amplitude, as defined in the approximation transform (Eq 2). Fig 3A illustrates the approximation span around QRS^{Si}_{Li}, while it is reproduced in the binary matrix binQRS^{Si}_{Li} (100×80) for 12 ECG leads (Li = 1, 2, ..., 12) and two recording sessions (Si = S1, S2). For most of the leads, the approximation spans (gray area) are considerably overlapped for QRS patterns from the same subject (left side) and substantially distinct for different subjects (right side).
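To make the binary representation concrete, the Python sketch below rasterizes a normalized QRS pattern into an 80×100 binary matrix and confirms the 1 kB memory footprint. It is an illustration under stated assumptions rather than the paper's approximation transform: the tolerance band (`row_tol`, `col_tol`), the mapping of the range [-1;1] onto 80 rows, the normalization of a pair by its common peak magnitude, and the synthetic waveforms are all our own choices.

```python
import numpy as np

N_ROWS, N_COLS = 80, 100   # amplitude bins x time samples, as stated in the text

def to_binary_matrix(qrs_norm, row_tol=2, col_tol=1):
    """Rasterize a normalized QRS pattern (values in [-1, 1], 100 samples)
    into an 80x100 binary matrix with a tolerance band around the waveform.
    row_tol and col_tol are assumed tolerances, not the paper's exact transform."""
    qrs_norm = np.asarray(qrs_norm, dtype=float)
    # Map amplitude -1..1 onto row index 0..79.
    rows = np.round((qrs_norm + 1.0) / 2.0 * (N_ROWS - 1)).astype(int)
    rows = np.clip(rows, 0, N_ROWS - 1)
    mat = np.zeros((N_ROWS, N_COLS), dtype=bool)
    for t, r in enumerate(rows):
        r0, r1 = max(r - row_tol, 0), min(r + row_tol, N_ROWS - 1)
        t0, t1 = max(t - col_tol, 0), min(t + col_tol, N_COLS - 1)
        mat[r0:r1 + 1, t0:t1 + 1] = True
    return mat

# Hypothetical S1/S2 patterns of one lead, normalized by their common
# peak magnitude so that both fall inside [-1, 1] (assumption).
t = np.arange(N_COLS)
qrs_s1 = 1.2 * np.exp(-0.5 * ((t - 30) / 6.0) ** 2)
qrs_s2 = 1.0 * np.exp(-0.5 * ((t - 30) / 6.0) ** 2)
scale = max(np.abs(qrs_s1).max(), np.abs(qrs_s2).max())
bin_s1 = to_binary_matrix(qrs_s1 / scale)
bin_s2 = to_binary_matrix(qrs_s2 / scale)

packed = np.packbits(bin_s1)          # 8 matrix cells stored per byte
print(bin_s1.shape, packed.nbytes)    # (80, 100) 1000
```

Packing 8000 binary cells into bytes gives 1000 bytes, which agrees with the 1 kB per lead quoted above.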

QRS pattern matching and feature extraction.
Simple binary matching operations are applied on the matrices binQRS^{S1}_{Li} and binQRS^{S2}_{Li} to quantify the lead-specific similarity of the QRS pattern waveforms between S1 and S2 sessions by means of two measures:
• Time equality measure (tEQU) counts the time for overlapping of both QRS patterns after binary element-wise multiplication (AND operation) of binQRS^{S1}_{Li} and binQRS^{S2}_{Li}, with lead index Li = (1-12) and amplitude index aj = (1-80). Scaling by the time resolution (Δt) gives a normalized tEQU value that is easily interpretable: 100% corresponds to full-time coincidence, i.e. patterns have at least one overlapping binQRS entry per time step (1 ms), and 0% corresponds to null coincidence, i.e. patterns do not overlap for any binQRS entry over the complete pattern length.
• Area difference measure (aDIF) counts the area enclosed between the non-overlapping amplitudes of both QRS patterns after binary element-wise multiplication and inversion (NAND operation) of binQRS^{S1}_{Li} and binQRS^{S2}_{Li}. The summation interval in the amplitude scale is enclosed between the minimal and maximal QRS amplitudes among the S1 and S2 patterns, measured at each specific time index ti, i.e. a_min(ti) is the smaller and a_max(ti) the larger of the two pattern amplitudes at ti. Scaling by the time (Δt) and amplitude (Δa) resolution gives a normalized aDIF value that is easily interpretable: 0% corresponds to full-amplitude coincidence, i.e. patterns overlap for all binQRS entries over the complete pattern length, and 100% corresponds to pattern differences that cover the full amplitude range, i.e. all binQRS entries.
For better comprehension, the resultant matrices from the binary AND and NAND operations are illustrated in Fig 3. A total set of 24 features (12 leads × 2 features per lead (tEQU_{Li}, aDIF_{Li})) is defined to quantify the QRS pattern differences. Their numerical measurements over the whole population are provided within the supporting information file (S1 File). The signal-processing and feature measurement scheme is implemented in Matlab (The Mathworks Inc.).
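The two matching measures can be illustrated on a toy pair of binary matrices. The Python sketch below follows the verbal description above (AND for matching time, NAND restricted to the span between the lowest and highest occupied amplitude per time step for mismatch area); the function names `t_equ` and `a_dif`, the way the span is derived from the union of both matrices, and the 4×5 toy matrices are our assumptions, so this should be read as one plausible reading rather than the authors' exact formulas.

```python
import numpy as np

def t_equ(b1, b2):
    """Matching time (%): share of time steps (columns) in which the two
    binary patterns have at least one common cell (element-wise AND)."""
    overlap = np.logical_and(b1, b2)
    return 100.0 * overlap.any(axis=0).sum() / b1.shape[1]

def a_dif(b1, b2):
    """Mismatch area (%): cells between the lowest and highest occupied row of
    either pattern, per column, where the patterns do not both hold a 1 (NAND),
    relative to the full matrix size."""
    n_rows, n_cols = b1.shape
    union = np.logical_or(b1, b2)
    nand = np.logical_not(np.logical_and(b1, b2))
    area = 0
    for t in range(n_cols):
        rows = np.flatnonzero(union[:, t])
        if rows.size:                                  # span [a_min(t), a_max(t)]
            area += int(nand[rows[0]:rows[-1] + 1, t].sum())
    return 100.0 * area / (n_rows * n_cols)

# Toy 4x5 matrices (rows = amplitude bins, columns = time steps).
b1 = np.zeros((4, 5), dtype=bool)
b2 = np.zeros((4, 5), dtype=bool)
b1[[1, 1, 2, 2, 1], [0, 1, 2, 3, 4]] = True   # pattern of session S1
b2[[1, 1, 2, 0, 3], [0, 1, 2, 3, 4]] = True   # pattern of session S2

print(t_equ(b1, b2), a_dif(b1, b2))   # 60.0 30.0
print(t_equ(b1, b1), a_dif(b1, b1))   # 100.0 0.0
```

The patterns coincide in three of five time steps (60% matching time), and the two diverging columns enclose six of the twenty cells (30% mismatch area); identical patterns give 100% and 0%.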

Human verification model
The human verification task answers the question: "Is the subject who he/she claims to be?". The designed human verification model takes the binary decision 'verified' or 'rejected' subject ID, comparing pairs of ECG recordings {S1,S2} by means of an LDA classifier whose input feature vector is formed from the lead-specific features (tEQU_{Li}, aDIF_{Li}), where Li = (1-12) is the set of leads involved in the analysis. The human verification performance is estimated with the statistical indices true acceptance rate (TAR), true rejection rate (TRR) and true verification rate (TVR), where TAR is calculated for all equal identity pairs (ID_{S1} = ID_{S2}), TRR is calculated for all different identity pairs (ID_{S1} ≠ ID_{S2}), and TVR (the common mean of TAR and TRR) is reported to equally weight both acceptance and rejection rates in unbalanced data with number of comparisons (ID_{S1} = ID_{S2}) << (ID_{S1} ≠ ID_{S2}), as seen in the test dataset (defined above in section Database). We note that part of the biometric studies report their accuracy in terms of false acceptance rate (FAR), false rejection rate (FRR) and equal error rate (EER), where EER is valid for FAR = FRR. There is a straightforward relationship between both kinds of results, which can be recalculated by the direct conversion: FAR = 100-TRR, FRR = 100-TAR, EER = 100-TVR (valid for TAR = TRR). We further interpret our accuracy results in terms of positive merit maximization (TAR, TRR, TVR), instead of negative error minimization (FAR, FRR, EER).
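A short Python sketch shows how the three rates relate on an unbalanced test set. The pair totals (230 and 210910) come from the Database section; the accepted and rejected counts are hypothetical values chosen only to give round percentages, and the function name is ours.

```python
def verification_rates(accepted_equal, n_equal, rejected_different, n_different):
    """Return (TAR, TRR, TVR) in percent.
    TAR: share of equal-ID pairs that are verified.
    TRR: share of different-ID pairs that are rejected.
    TVR: plain mean of TAR and TRR, so both classes weigh equally
         regardless of how many pairs each class contains."""
    tar = 100.0 * accepted_equal / n_equal
    trr = 100.0 * rejected_different / n_different
    tvr = (tar + trr) / 2.0
    return tar, trr, tvr

# Hypothetical outcome on the test set sizes given in the text.
tar, trr, tvr = verification_rates(
    accepted_equal=207, n_equal=230,
    rejected_different=168728, n_different=210910,
)
print(tar, trr, tvr)   # 90.0 80.0 85.0

# A pooled accuracy would be dominated by the different-ID class instead.
pooled = 100.0 * (207 + 168728) / (230 + 210910)
print(round(pooled, 2))
```

Averaging the two rates keeps the 230 equal-ID pairs from being swamped by the roughly 917 times larger different-ID class, which is the motivation given above for reporting TVR.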
Non-redundant LDA models are trained by stepwise feature selection until maximization of the area under the receiver operating characteristic curve (ROC AUC). The ROC is calculated by changing the operating LDA threshold function through scanning the full range of prior probabilities of equal-to-different identity pairs (ID_{S1} = ID_{S2}):(ID_{S1} ≠ ID_{S2}) ∈ [0;1], using only samples from the training database. We use the test database, fully independent of the training, to finally report the test ROC as unbiased estimation of the human verification model's performance.
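The ROC construction and the choice of a balanced operating point can be sketched in Python as follows. Note the simplification: the study scans LDA prior probabilities, whereas this illustration scans a threshold over a generic one-dimensional classifier score, which traces a ROC in the same spirit. The function names and the toy scores are hypothetical.

```python
import numpy as np

def roc_points(scores_equal, scores_different):
    """Scan every observed score as a threshold (descending);
    a pair is accepted when its score is >= the threshold."""
    thresholds = np.unique(np.concatenate([scores_equal, scores_different]))[::-1]
    tar = np.array([(scores_equal >= th).mean() for th in thresholds])
    diff_accept = np.array([(scores_different >= th).mean() for th in thresholds])
    return thresholds, diff_accept, tar

def auc_trapezoid(diff_accept, tar):
    """Area under the ROC (TAR versus accepted share of different-ID pairs)."""
    x = np.concatenate([[0.0], diff_accept, [1.0]])
    y = np.concatenate([[0.0], tar, [1.0]])
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def balanced_operating_point(thresholds, diff_accept, tar):
    """Threshold where TAR and TRR are closest (the EER operating point)."""
    trr = 1.0 - diff_accept
    i = int(np.argmin(np.abs(tar - trr)))
    return float(thresholds[i]), float(tar[i]), float(trr[i])

# Toy training scores (higher = more likely the same identity).
scores_equal = np.array([0.9, 0.8, 0.7, 0.4])
scores_different = np.array([0.6, 0.5, 0.3, 0.2, 0.1])

th, da, tar = roc_points(scores_equal, scores_different)
auc = auc_trapezoid(da, tar)
op_threshold, op_tar, op_trr = balanced_operating_point(th, da, tar)
print(round(auc, 3), op_threshold, op_tar, round(op_trr, 2))   # 0.9 0.6 0.75 0.8
```

In practice the threshold found on the training scores would then be applied unchanged to the independent test scores, which is where a shift away from TAR = TRR can appear.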

Statistical study
The statistical study is presented for different perspectives of the human verification problem: comparative study of single and multi-lead ECG configurations, influence of the test database size and different physiological factors (gender, age, heart rate). The Statistics toolbox in Matlab (The Mathworks Inc.) has been used for management of the statistical study, including training and evaluation of the forward stepwise LDA models. The non-normal feature distributions (tEQU and aDIF, represented as median value, quartile range) are compared via the non-parametric Wilcoxon signed-rank test. The comparison of the performance rates (TVR, TAR, TRR) within different study groups (by sample size, gender, age, heart rate) has been done with the two-proportion Chi-squared test. A value of p ≤ 0.05 is considered statistically significant.
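For readers unfamiliar with the two-proportion Chi-squared test, a minimal Python sketch is given below. It uses the standard Pearson statistic for a 2×2 table without continuity correction (whether the Matlab analysis applied a correction is not stated, so this is an assumption), and the group counts are hypothetical. For one degree of freedom the p-value equals erfc(sqrt(χ²/2)).

```python
import math

def two_proportion_chi2(success_a, n_a, success_b, n_b):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction)
    comparing the proportions success_a/n_a and success_b/n_b.
    Undefined when both groups are all successes or all failures."""
    a, b = success_a, n_a - success_a
    c, d = success_b, n_b - success_b
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 dof.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical groups: 50 of 60 versus 40 of 60 correctly verified pairs.
chi2, p_value = two_proportion_chi2(50, 60, 40, 60)
print(round(chi2, 3), round(p_value, 3))
print("significant" if p_value <= 0.05 else "not significant")
```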
ECG lead configurations.
The option to include any lead in the feature set (Eq 7) is used to train different LDA models for the following lead configurations, available in 12-lead ECG:
• Single leads: Li = [1, 2, ..., or 12] for independent selection of leads (I, II, III

Statistical analysis of the feature set
The first part of results is focused on statistical evaluation of the introduced QRS pattern matching features, trying to answer the question: "Is there a statistical merit to use any of 12 ECG leads as a biometric modality, regarding high inter-subject differences (distinguishability) and low intra-subject differences (stability)?". In Table 1, the two groups of equal and different ID pairs are compared for all 12 leads, clearly indicating statistically different distributions (p<0.001): • tEQU: the median value for the time equivalence between two QRS patterns is as high as 75-99% for equal IDs and as low as 53-74% for different IDs, with absolute difference in the range 18-31% points, considering all 12 leads.
• aDIF: the median value for the area difference between the two QRS patterns is as low as 0.

Verification models in single and multi-lead configurations
This section presents a comparative study of the training and test performance of LDA verification models for different lead configurations, trying to answer the question: "What is the optimal lead set for human biometrics?".

Table 2 shows the performance of lead-specific LDA verification models in terms of training and test AUC. The test AUC is found to be maximal for the single leads II (0.941) among limb leads and V1 (0.885) among chest leads. The multi-lead sets are ranked in ascending order: 6 chest leads (0.97), 6 limb leads (0.986) and 12 leads (0.995). The respective ROC curves are illustrated in Fig 4. For each lead set, the observed good coincidence between training and test ROC curves (Fig 4) and the comparable training and test AUC values (Table 2) indicate confident training of the LDA model, which is able to adequately evaluate independent test data without a bias.
The settings of the optimal LDA model are defined for the training ROC operating point, which corresponds to balanced acceptance and rejection rates (TAR = TRR), commonly referred to in the literature as the operating point at EER (see the 'o' mark in Fig 4). For the selected operating threshold LDA function, the observed performance on the independent test ROC can be considered as an unbiased assessment of the human verification model (see the filled 'o' mark in Fig 4). The optimal LDA performance for both training and test ROC operating points is reported in Table 3 for all types of lead sets. The training operating point lies at EER (TAR = TRR), while the test operating point has a slight imbalance with TAR>TRR (difference of about 0.6% to 10% points), which is a natural consequence of the imbalanced test set with imbalance ratio (917:1) of different-to-equal ID pairs. The highlighted leads with maximal test set accuracy (Table 3) closely correspond to those with maximal test ROC AUC (Table 2). The TVR profile of the chest leads is about 2% to 15% lower than limb leads, with decreasing trend from septal V1 (80.6%) to lateral V6 (71.8%). Here, we can rather distinguish anterior V3 (76.2%) with severe accuracy drop by 3.3% from the

Influence of the test database size and different physiological factors (gender, age, heart rate)
This section presents results in support of the stability of the LDA-based models' performance, considering different factors that might influence the human verification process.
The influence of the test sample size is evaluated in Fig 6, regarding a broad range of subjects included in the test database (from 10 to 230 subjects). The 12-lead LDA model shows a stable performance with non-significant change of the mean value of all performance metrics (≤1%, p>0.67): TAR (mean value: 98.3-98.7%), TRR (95.3-96.3%), TVR (96.8-97.5%). We observe an inverse relationship between the sample size and the min-max margin of TAR, TRR, TVR values, i.e. the verification accuracy metrics might differ within a span up to 13.3%, 4.4%, 2.2%, 1%, <0.2%, depending on the selected combination of 10, 50, 100, 150, >200 subjects, respectively.

Table 3. Human verification performance of single and multi-lead ECG sets for the EER operating point on the training ROC (Train-TAR = Train-TRR = Train-TVR).
The observed performance on the independent test set has a slight bias Test-TAR>Test-TRR. The bolded values highlight the maximal TVR on the test set for single limb leads, single chest leads, and the multi-lead sets.

The gender-specific performance of the LDA models for all single and multi-lead ECG configurations is evaluated in Fig 7. None of the TVR differences (males vs. females) is significant (p>0.27). Better TVR for females is observed in the lateral leads V6 (by 6.3%), I (by 3.3%) and -aVR (by 1.4%). Better TVR for males (by 1.9-3.6%) is observed in all other limb leads (aVL, II, aVF, III) and chest leads V1, V2, and is emphasized in V3 (by 6.3%). The same TVR trend in favor of males is observed for the multi-lead ECG configurations, which is more prominent in the chest leads (by 3.5%) than in the limb leads (by 1.2%).

The influence of the subject's age is evaluated in Fig 8.
The physiologically related HR differences between individuals (Fig 9A) and between different recording sessions of the same individual (Fig 9B) do not appear to have a great impact on the 12-lead LDA model performance. Both TRR (range 95.9-96.6%) and TVR (range 94-98.3%) remain stable (p>0.05) for the broad range of HR values (<60 bpm to ≥90 bpm), as well as for small (<10 bpm) and large (≥20 bpm) HR changes between the recording sessions. The same is valid for TAR (range 98.5-100% for HR = 60-89 bpm), with an insignificant drop by 3.8% (96.15% vs. 100%, p = 0.087) for the slowest HR<60 bpm and a significant drop by 8.3% (91.67% vs. 100%, p = 0.012) for the rapid HR≥90 bpm.

Discussion
This study reproduces a realistic scenario for the two-class person verification task, taking the decision to validate or to reject the subject identity based on binary QRS pattern matching between two 10 s sessions with 12-lead ECG recordings. The presented cost-effective methodology uses a minimal feature set with only two straightforward QRS matching features per lead. Their statistical study on an uncommonly large population (460 subjects) proves long-term stability within individuals (>1 year basis) and distinguishability across individuals for any among 12 ECG leads (Table 1). We point out a confident LDA classification model with a slight discrepancy of <3.5% between training and test accuracy reported on different datasets (Tables 2 and 3, Fig 4). The statistical analysis is presented for different perspectives of the human verification problem. First, we show the choice of the optimal ECG lead (Tables 2 and 3, Fig 5) for single (in the projection of lead II) and multi-lead scenario (limb leads and 12-leads); second, we show a stable performance without significant influence of the test database size (Fig 6) and different physiological factors: gender (Fig 7), age (Fig 8), heart rate (Fig 9). Finally, a comparison to other published results on human verification is presented, showing the competitive achievements in this study, especially in multi-lead ECG configurations (Table 4).

Fig 9. HR-specific performance of 12-lead LDA model, evaluated for 230 subjects in the test database, divided into: (A) 5 groups based on the absolute HR value in S1 session; (B) 3 groups based on the absolute HR change between S1 and S2 sessions (ΔHR). The differences between groups are not statistically significant (p>0.05), except TAR for ≥90 bpm (*p = 0.012).
https://doi.org/10.1371/journal.pone.0197240.g009
The milestones are further highlighted and discussed. A short-duration recording (10 s) is long enough to accumulate a personalized average beat pattern with biometric significance, relying on the accurate beat extraction by a certified diagnostic ECG measurement and interpretation module (ETM, Schiller AG).

Table 4. Verification accuracy reported in published ECG biometric studies, which use at least two recording sessions per subject (distanced from days to years).
Various accuracy metrics reported in other studies (EER, FAR, FRR, TAR, TRR) are transformed to the common metric TVR, using the direct conversions: TVR = 100-EER, TVR = (TAR+TRR)/2, TVR = 100-(FAR+FRR)/2.

Simple binary matching operations on 2D binary QRS matrices are a cost-effective strategy for computation, using only AND and NAND operations applied to the small binary matrix binQRS_{Li} (100×80), reserving a memory of about 1 kB per lead. A minimal feature set with only two behavioral QRS pattern characteristics per lead is calculated, including:
• tEQU (calculated by binary AND operation) is a pattern similarity measure in the time scale (matching time).
• aDIF (calculated by binary NAND operation) is a pattern dissimilarity measure in the amplitude scale (mismatch area).
The use of normalized values for both metrics [0-100%] gives a subject invariant scale for pattern matching in large biometric databases. A simple visual biometric scheme is shown in Fig 3, where maximization of matching time and minimization of mismatch area in confident leads is a simple indicator for verification of patterns from the same subject (Fig 3, left panel), while the opposite distribution with short matching time and large mismatch area is a sign for dissimilar subjects (Fig 3, right panel). Such techniques for 2D binary computation, normalization and visualization are a cost-effective strategy for a biometric tool in smart portable devices that could optimally work with the minimal lead set, providing non-redundant and most reliable information.
Long-term stability of the personalized QRS pattern in the presented time and amplitude matching scale is statistically validated over a long period (>1 year) across an uncommonly large population (460 subjects). We adopted two strategies against the measurement bias: (i) synchronous QRS pattern extraction in all 12-leads, using time-alignment to a single-lead reference pattern by maximal cross-correlation (Fig 2); (ii) time-amplitude approximation to mitigate the effect of intra-subject variations of the recording conditions across different sessions (Fig 3A, left panel), introducing an approximation tolerance of ±0.5% in the normalized amplitude scale and ±1 ms in the time scale, as defined in Eq (2). Table 1 is a basis for tracking the long-term stability of the personalized QRS pattern in all 12-leads, showing large matching time tEQU = 75-99% median value (64-93% lower quartile) and low mismatch area aDIF = 0.9-10.8% median value (1.5-18.9% upper quartile) for 460 cases with ID_{S1} = ID_{S2}. The statistical evaluation (median values tEQU/aDIF, %) highlights the leads with the most stable QRS patterns, ranked in the order: aVR (99/0.2), II (96/0.9), I (93/1.8) and those with the largest intra-subject instability: V3 (75/10.8), III (75/10.6), V2 (75/9.4), aVL (76/9.0), V4 (82/6.5), aVF (85/4.5), V1 (85/4.1), V6 (88/3.5), V5 (88/3.3). We speculate about technical and biological sources for the observed long-term QRS instability, i.e. changes of the recording conditions across different sessions and physiologically related intra-individual ECG variability. The relatively frequent human uncertainty about the proper landmarks of precordial leads (V1-V6) and the proximity to the signal source make their QRS patterns sensitive to electrode misplacement errors [47-49].
Considering that limb leads are almost invariant to the actual positioning of the electrodes [27], we suggest functional and physiological sources [50] for the observed instability of the inferior leads III, aVF (+90° to +120°) and the high lateral lead aVL (-30°).
Unique personalized QRS patterns with distinctive time and amplitude matching measures across individuals are statistically validated in a large population (211140 inter-subject pairs). Table 1 gives evidence of relatively low matching time tEQU = 48-74% median value (59-85% upper quartile) and high mismatch area aDIF = 8-30.6% median value (4.1-22.7% lower quartile) from statistics of 12-lead QRS patterns in 211140 inter-subject pairs with ID_{S1} ≠ ID_{S2}. Comparing the groups of different-to-equal ID pairs, all leads have significantly distinguishable QRS matching features (p<0.001). Detailed review highlights the leads with the most distinctive QRS patterns across individuals in the time scale (II, I, aVF, III, V1 with the largest inter-to-intra subject reduction of the matching time by 27-31%), and in the amplitude scale (aVL, III with the largest inter-to-intra subject increase of the mismatch area by about 20%).
Straightforward feature selection and optimization of the binary LDA classifier is achieved by ROC AUC maximization on the training dataset, which comprises the first half of subjects in the database (230 subjects). Unbiased validation of the LDA model is reported on the test set from the remaining data, fully independent of the training. As shown in Fig 4, the training and test ROC curves closely coincide for the same LDA model, which is a straightforward indication of reproducible performance that could be expected on other clinical data. Referring to ROC AUC as a statistical index that characterizes the overall predictive power of a binary classifier, unaffected by fluctuations caused by an arbitrarily chosen operating point with a trade-off between TAR and TRR [51,52], the reported AUC values (Table 2) rate the LDA verification model as 'good' (AUC = 0.8-0.9) for single chest leads and 'excellent' (AUC = 0.9-0.995) for single limb leads and all multi-lead configurations. The choice of the optimal LDA setting according to the EER strategy during training is consistent with numerous human verification studies, which weight errors from false verification and false rejection equally [7,9,10,43,45,46]. In addition, our study validates LDA on an independent test set (Table 3). Therefore, a slight imbalance of Test-TAR>Test-TRR (0.6-10% points) is considered a consequence of the imbalance ratio (917:1) of different-to-equal ID pairs (see the shift of the test ROC operating point from the line TAR = TRR in Fig 4). The maximal drop in performance of Test-TVR vs. Train-TVR, <3.5% (single leads) and <0.5% (all multi-lead sets), points to a confident LDA model.
Objective selection of the optimal electrode scenario for ECG biometrics is presented by a comparative study of single limb leads, single chest leads and multi-lead configurations, extracted from clinical standard 12-lead ECG recordings, thus emulating a realistic case. The single-lead vector with the best biometric view over the personalized QRS pattern should present a trade-off between highest long-term stability (leads aVR, II, I as highlighted above) and highest distinctive matching across individuals (leads II, I, aVF, III, V1, aVL as highlighted above), which points to their common intersection (leads II, I). This hypothesis is confirmed by the LDA model performance (Tables 2 and 3) with maximal indices (Test-AUC, Test-TVR) observed for lead II (0.941, 86.8%) and slight accuracy drop for leads I (-0.01, -1.5%) and aVR (-0.017, -1.4%). This has a straightforward geometrical justification (Fig 5), which indicates that the frontal plane sector (60°-0°) encompassed by neighboring leads (II, -aVR, I) can be recognized as the most powerful projection of the cardiac vector for single-lead ECG human identity applications. The placement of the ECG electrodes on the chest is not recommended because a gradual TVR drop from septal V1 (-6.2%) to lateral V6 (-15%) is observed in comparison to the limb lead II (Fig 5). The proximity to the signal source is not confirmed as an advantage for giving a view to unique personalized QRS patterns (only V1 has been highlighted above, however less distinctive than the limb leads). We rather attribute the major V1-V6 problem to the long-term instability of the QRS patterns, which are highly sensitive to electrode misplacement errors across the recording sessions. This effect has not been observed by Zhang and Wei [23], who underline that V1-V2 outperforms I-II by 5.5-10% in a human identification study.
An explanation concerns the use of single-session recordings, not influenced by the real multi-session recording conditions.
We show that multi-lead identity systems could explore a more detailed view of the subject-specific QRS patterns.
• Gender: Fig 7 shows that gender is not a significant factor in human biometrics, with insignificant TVR differences of at most about 6% for males vs. females (p>0.27). The largest differences are observed in chest leads V3 (6.3% in favor of men) and V6 (6.3% in favor of women), which are due to the failure in recognition of same identity subjects. We suggest human error in the placement of V3 in women and V6 in men as the most probable reason for these errors. The better TVR in males for most of the leads (by 1.2-3.6% for II, III, aVL, aVF, V1, V2, all multi-lead sets) is due to the better recognition of different identity subjects. This is an expected consequence of the reported larger range of variation of the QRS amplitudes and durations in men than in women [53-55].
• Age: Fig 7 shows that age is not a significant factor in human biometrics based on 12-lead ECG analysis. Insignificant failure in verification of same identity subjects (TAR drop by 6.7%, p = 0.066) is observed in the oldest group (≥70 years old), which may be attributed to the reported prevalence of aging-associated cardiovascular changes [56]. Insignificant failure in rejection of different identity subjects (TRR drop by 3.7%, p = 0.54) is observed in the youngest group (<40 years), which implies that ECG morphology is less distinctive between younger individuals.
• HR: Fig 9 demonstrates that the proposed 12-lead QRS template matching model for human verification is robust to HR variations between individuals (covering HR range <90 bpm, Fig 9A) and to HR changes between the recording sessions (covering the largest HR changes ≥20 bpm, Fig 9B). The largest problem is observed for verification of same identity subjects, with an insignificant TAR drop by 2.8% for slow HR<60 bpm and a significant TAR drop by 8.3% for rapid HR≥90 bpm. This is an outcome of the reported heart rate dependency of the QRS duration, with a noticeable non-linear increase of QRS duration variations for heart rates >90 bpm [57].
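The group-wise indices discussed above can be sketched as follows, assuming that TAR is the percentage of same identity pairs that are accepted and TRR the percentage of different identity pairs that are rejected at a fixed decision threshold. The function name, the convention that a higher score means a better match, and the summary of TVR as the mean of TAR and TRR are assumptions made for this illustration; they are not taken from the study, which defines TVR = 100-EER at the training operating point.

```python
def verification_rates(scores, same_identity, threshold):
    """Return (TAR, TRR, TVR) in percent for one group of comparisons.

    scores        : similarity scores, higher means a better match (assumption)
    same_identity : True where the pair comes from the same subject
    threshold     : fixed operating point, e.g. chosen on the training ROC
    """
    genuine = [s for s, same in zip(scores, same_identity) if same]
    impostor = [s for s, same in zip(scores, same_identity) if not same]
    # Same identity pairs correctly accepted.
    tar = 100.0 * sum(s >= threshold for s in genuine) / len(genuine)
    # Different identity pairs correctly rejected.
    trr = 100.0 * sum(s < threshold for s in impostor) / len(impostor)
    # Assumed summary index: balanced mean of the two rates.
    tvr = (tar + trr) / 2.0
    return tar, trr, tvr

# Hypothetical toy data: three genuine and three impostor comparisons.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]
same_identity = [True, True, True, False, False, False]
tar, trr, tvr = verification_rates(scores, same_identity, threshold=0.5)
print(f"TAR={tar:.1f}%  TRR={trr:.1f}%  TVR={tvr:.1f}%")
```

Applying the same fixed threshold to each subgroup (gender, age, HR) separates a drop caused by rejected genuine pairs (TAR) from one caused by accepted impostor pairs (TRR).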
Comparative literature research reveals wide variations in ECG authentication accuracy, suggesting dependencies on the database size, experimental conditions, type and number of ECG leads, health status, etc. A comparison to other biometric studies is presented in Table 4, limited to those performed under conditions similar to this study, i.e. two-class person verification and use of multi-session recordings. Most of the studies use private databases without public access; therefore we refer to the accuracy results as originally published. To simplify ECG acquisition in practice, most studies employ a single-lead configuration with lead I between fingers [7][8][9][10] or wrists [13,45]. Based on different feature extraction and classification techniques, all of the above 'lead I' studies report TVR in the range from 84% to 88%, with one superior value of 90.9% for an SVM classifier [8]. We suspect overtraining, because all 'lead I' studies use the entire population for training, or even take training and test data from different windows of the same recording in fewer than 20 subjects with limited intra-subject variation [7,9,13]. We report a comparable TVR range for lead I (train-TVR = 87.4%, test-TVR = 85.3%), while claiming 'unbiased' validation on an up to 40 times larger test population, independent from the training. Our finding for the optimal lead selection suggests room for improvement of 'lead I' studies if the left arm finger/wrist electrode is moved onto the body to form a lead II equivalent. In a bilateral lower rib cage configuration, Odinaka et al [46] reported the highest single-lead accuracy: about 89% or 94% if 128 beats from one or two training sessions are used, respectively. The latter training mode benefits from capturing the long-term variability between the two sessions.
Comparing Odinaka et al [46] and Matos et al [9], who implement the same signal-processing method for single-lead ECG biometrics (TVR = 89% vs. 86%), we might speculate that the electrode configuration and the good adhesion of the ECG electrodes to the body [46] improve the accuracy compared to finger-based biometrics [9], which is largely susceptible to noise. We found three published studies that investigate the feasibility of combined limb leads for human verification, with reported TVR spanning about 20 percentage points, i.e. a minimal value of 78% with morphological PQRST features [21], 87.2% with PQRST cross-correlation [25] and 97.2% with Euclidean distance from the first and second QRS signal derivatives [43]. Our study, based on analysis of the same short QRS template, obtains about 3% lower TVR than the latter superior result. We note that [43] has not been verified on an independent dataset and might be over-trained to the empirical distance threshold of the whole population. Multi-lead ECG sets for human verification in configurations of chest leads only and of 12-lead ECG remain an almost blank area of research. There is evidence that the binary QRS template matching in this study outperforms morphological PQRST features (worse by 2-7% [37], 11-22% [21]), cross-correlation PQRST matching (worse by 1.7-2.6% for all multi-lead sets [22]) and cross-correlation QRS matching (worse by 2.7-6.5% for all multi-lead sets [22]). We note that the comparison to our recent studies [21,22] is straightforward because they use the same large biometric databases for training and validation.
A limitation of the study is that the verification accuracy is reported only for healthy (non-cardiac) individuals at rest. We might expect a slight TAR reduction (failure to verify the same identity subject) in case of cardiovascular disease developed between the reference and test sessions, due to potentially affected ECG morphology, as suggested in [12,25,33,41]. In such cases, the ECG biometric reference database might need to be continually recalibrated over the years.

Conclusions
This study gives straightforward evidence about the questions:
• "Is binary template matching able to capture significant 12-lead QRS pattern differences across individuals, while keeping stable personalized measurements on a long-term basis?"
• "How reliable are these differences seen from different leads in single- and multi-lead verification scenarios?"
• "Could we guarantee a stable biometric performance under different conditions, independent from the number of verified subjects, gender, age and heart rate?"
These questions are answered by statistical validation on an independent subset of a clinically relevant database across a large population, representative of physiologically related long-term ECG changes and multi-session recording conditions. The practical benefit of our findings is the presented cost-effective strategy for 2D binary computation, normalization and visualization as a biometric tool in smart portable devices. Such devices can rely on an effective lead-selection scheme based on ranking of the 12 ECG leads by maximal TVR. Our recommendation for the optimal electrode setting in a single-lead scenario is peripheral lead II (87%). Including one additional electrode on the left arm would increase TVR by 7.5%. The fusion of information from 6 more chest leads, forming the standard 12-lead ECG, would increase TVR by an additional 3%, reaching 97.5%, a verification accuracy that is likely to be acceptable in commercial ECG biometric technologies with potential application for patient validation support and error screening of digital hospital databases. The individual ECG might also be a useful add-on to improve established biometric systems.
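The lead-selection scheme and the quoted TVR gains can be reproduced with a short sketch that ranks configurations by Test-TVR. The values are those reported in this study for a subset of configurations; the values for leads I and -aVR are derived from the reported drops relative to lead II. The dictionary and variable names are hypothetical.

```python
# Test-TVR (%) reported in this study for selected lead configurations.
TEST_TVR = {
    "II": 86.8,
    "-aVR": 85.4,   # derived: 86.8 minus the reported 1.4 drop
    "I": 85.3,      # derived: 86.8 minus the reported 1.5 drop
    "V1": 80.6,
    "V6": 71.8,
    "6 chest leads": 90.9,
    "6 limb leads": 94.3,
    "12 leads": 97.5,
}

# Rank configurations from highest to lowest Test-TVR.
ranked = sorted(TEST_TVR, key=TEST_TVR.get, reverse=True)

# Gain from the best single lead to the 6 limb leads, then to 12 leads.
gain_limb = round(TEST_TVR["6 limb leads"] - TEST_TVR["II"], 1)
gain_12 = round(TEST_TVR["12 leads"] - TEST_TVR["6 limb leads"], 1)

for name in ranked:
    print(f"{name:>14}: {TEST_TVR[name]:.1f}%")
print(f"limb-lead gain: {gain_limb} points, 12-lead gain: {gain_12} points")
```

The differences are percentage points of TVR, which matches the 7.5% and about 3% increments quoted above.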
Supporting information
S1 File. The archive contains all data related to the measurements of the pattern matching features in the 12-lead ECG database, including all pairwise combinations between S1 and S2 sessions of the whole population, grouped by subject identity (equal/different), data subset (training/test), age, gender, HR. (ZIP)