Remembered or Forgotten?—An EEG-Based Computational Prediction Approach

Prediction of memory performance (remembered or forgotten) has various potential applications not only for knowledge learning but also for disease diagnosis. Recently, subsequent memory effects (SMEs)—the statistical differences in electroencephalography (EEG) signals before or during learning between subsequently remembered and forgotten events—have been found. This finding indicates that EEG signals convey the information relevant to memory performance. In this paper, based on SMEs we propose a computational approach to predict memory performance of an event from EEG signals. We devise a convolutional neural network for EEG, called ConvEEGNN, to predict subsequently remembered and forgotten events from EEG recorded during memory process. With the ConvEEGNN, prediction of memory performance can be achieved by integrating two main stages: feature extraction and classification. To verify the proposed approach, we employ an auditory memory task to collect EEG signals from scalp electrodes. For ConvEEGNN, the average prediction accuracy was 72.07% by using EEG data from pre-stimulus and during-stimulus periods, outperforming other approaches. It was observed that signals from pre-stimulus period and those from during-stimulus period had comparable contributions to memory performance. Furthermore, the connection weights of ConvEEGNN network can reveal prominent channels, which are consistent with the distribution of SME studied previously.


Introduction
The brain is one of the largest and most complex organs in human body. In order to decode specific cognitive states from brain activity, many efforts have been made, for instance, detecting concealed true thoughts when answering questions [1], decoding features of motor behavior [2], distinguishing specific perceived stimulus from several candidate stimuli [3,4], inferring visual imagery in dreams [5] and identifying traces of individual episodic memories [6].
Memory formation is an important cognition process. It enables us to store information, accumulate experiences and learn from experiences to guide our behaviors. Understanding cognitive states related to memory formation is essential to investigate underlying brain mechanisms and even improve our memory performance. As a result, decoding neural activities during memory process has aroused much interest in the cognitive neuroscience community.
Neural activities relevant to memory formation can be observed using different physical measurements, e.g. fMRI (functional magnetic resonance imaging) for BOLD activity and EEG (electroencephalography) for electrophysiological activity. These measurements help us analyze the process of memory. Among different measurements, EEG is widely used in disease diagnosis [7], neuroscience and psychological research [8,9] for its practical advantages, such as noninvasion, mobility and relatively inexpensive devices. Specifically, EEG can be used to reveal the correlation between memory cognition process and subsequent memory performance. Many studies have found statistical differences in EEG before or during learning between subsequently remembered and forgotten events, which are defined as subsequent memory effects [10][11][12][13][14][15]. These differences have shown that brain signals relevant to an event can contribute to successful memory encoding and later recollection.
In this paper, we addressed prediction of subsequent memory performance using EEG recorded during memory process. While the findings mentioned above used multi-instance EEG signals to reveal the SME, we try to make single-instance analysis of SME for predicting memory performance. By predicting whether an event will be remembered or forgotten later, effective actions could be taken to help us remember new knowledge and improve the efficiency of learning. It could also help people with memory disorder and even cognitive impairment with new prevention, diagnosis and rehabilitation methods.
From the computational perspective, prediction of subsequent memory performance is a typical binary pattern recognition problem with the two classes of subsequently remembered and forgotten events. For a general pattern recognition problem, meaningful features need to be extracted to maximize the differences between different classes and then a classifier uses hand-crafted features to predict which class they belong to. However, for most problems, it's difficult to design features exactly useful for classification. And feature extraction and classification are two separate phases, which make it complex for realization and optimization. This paper proposed a convolution neural network for EEG, named ConvEEGNN, to predict whether an event will be remembered or forgotten later. It can combine feature extraction and classification as a whole. We conducted an auditory memory task, consisting of a study phase and a memory test, to verify ConvEEGNN. EEG signals before and during an auditory event were recorded in the study phase. According to information before and during an event, the average prediction accuracy of 72.07% was achieved.

Related Work
Many studies have been carried out to investigate the correlation between memory performance and episodic memory process, which is a form of long-term memory. It has been recently shown that item-related, state-related and task-related neural activity all can affect whether an event will be remembered or forgotten later [11,[14][15][16], updating the theoretical explanation of memory encoding. Meanwhile, techniques and methods from pattern recognition have been embraced to help analyze and interpret memory process, such as multi-voxel pattern analysis (MVPA), common spatial pattern (CSP), support vector machine (SVM) and linear discriminant analysis (LDA).
With the improvement of neural measurements, increasing interest has been aroused in researches on the neural systems responsible for episodic memory encoding. Episodic memory is the memory of events in our own personal past. It is defined as the conscious knowledge of temporally dated, spatially located, and personally experienced events or episodes [17]. The components of an event such as words or pictures requiring discriminative responses are items or stimuli [18]. Item-related activity, state-related activity and task-related activity influence episodic memory encoding. Sanquist et al. found that item-related activity affects the efficacy of episodic memory encoding of a stimulus by means of segregating item-related neural responses with different later memory performances and identifying the features of the responses correlated with successful encoding of visual stimuli [11]. In addition to item-related activity, Otten et al. showed that state-related activity, that is, neural activity sustained across a succession of stimulus events, influences memory encoding by finding the relation between the mean level of activity across a task block and the number of visually presented words subsequently remembered from that block [16]. After that, the investigation by Otten et al. on task-related neural activity preceding a stimulus event suggests that task-related activity is also predictive of successful encoding for visual and auditory events [14,15].
Task-related activity and item-related activity constitute SMEs and both showed statistical differences in EEG response to a stimulus between the subsequently remembered and forgotten items for respective pre-and during-stimulus period. An fMRI study also presented SMEs in the level of hippocampal BOLD activity before item presentation [18]. Therefore, EEG and fMRI signals have shown correlation with subsequent memory performance in group analysis.
Currently, little study has been carried out to predict subsequent memory performance for single stimulus in every participant. Methods from pattern recognition have been used for this prediction problem [19][20][21]. In a recent fMRI study, MVPA has been used to predict subsequent memory performance for 19 participants according to the period of encoding phonogram stimuli [19]. The analysis consisted of 3 stages. First, MVPA-based voxel-wise search for the clusters in the medial temporal lobe was conducted to find the signals contained the most information about subsequent memory performance. Then, a classifier function in MVPA was trained using the extracted pattern vectors from the selected clusters. Finally, the trained classifier predicted subsequent memory performance with approximately 66% accuracy. However, the slowness of the vascular response may influence the precise selection for the encoding period and lead to impure signals for analysis. Using EEG, a more mobile and affordable noninvasive method for monitoring brain activity, Noh et al has identified subsequent episodic memory performance on single-trial neocortical dynamic activity recorded before and during item presentation from 18 participants [20]. CSPs were used to learn the spectral features of pre-and during-stimulus SME, which were classified respectively by two soft margin support vector machines (v-SVM). Another classifier using LDA was trained to learn the temporal features of during-stimulus SME. By combining the results from the three separate classifiers and also combining information from the pre-and during-stimulus periods, the overall prediction accuracy achieved 59.6%. The accuracy using EEG signals might be relative lower than that using fMRI signals because of the higher spatial resolution of fMRI. In a recent EEG study with Sternberg Working Memory Task (SWMT), SVM has been used to identify signal features associated with working memory performance for 40 schizophrenia adults and 12 healthy adults [21]. Using continuous wavelet transform (CWT), EEG of each trial was analyzed to extract time-frequency and spatial features, including 5 frequency bands at 4 processing stages and 3 scalp sites. Then, 1-norm SVM was used as a classification approach to predict working memory performance according to the extracted 60 features. This approach predicted SWMT trial performance with 84% accuracy in healthy adults and 74% accuracy in schizophrenia adults. Overall, the related work mentioned in [19][20][21] all used methods from pattern recognition to extract information from data and predict memory performance for each stimulus.

Materials
To evaluate the proposed prediction approach, we adopted an auditory memory task [15], during which EEG responses to auditory stimuli were recorded for predicting memory performance at a later time. Participants were paid to take part in the auditory memory task. The experimental procedure consists of a study phase and a memory test. In the study phase, participants listened to a word after a cue and made semantic (animate or not) judgments about the word. In the memory test, words in the study phase had to be discriminated from new words. Participants were asked to make a judgment from five candidates (1.definitely familiar, 2.possibly familiar, 3.uncertain, 4.possibly unfamiliar, and 5.definitely unfamiliar), and press a key from 1 to 5 accordingly.
This experiment was approved by the Ethical Committee of the First Affiliated Hospital, Zhejiang University School of Medicine. All of the healthy participants obtained written informed consent before the experiment. The experiment was performed in accordance with the guidelines issued by the Ethical Committee of the First Affiliated Hospital, Zhejiang University School of Medicine.

Participants
Twenty-two right-handed healthy participants (16 females and 6 males, 21-32 years old) were enrolled, who are native Chinese speakers without neurological or psychiatric history. Out of the 22 participants, 13 were excluded based on the two criteria below: 1. participants who remembered or forgot less than 15 words were excluded to ensure the number of samples for algorithm training [15]; 2. participants with high response bias (response bias>0.2) were excluded for the high possibility of recognizing a new word as a studied one. Response bias, which was proposed in [22], is a criterion to exclude the participants with high possibility of choosing "definitely familiar" or "possibly familiar" when facing "uncertain" words in a memory test.
In our study, we excluded 6 participants for their high response bias and 7 participants who forgot less than 15 words. As a result, 9 participants were for our evaluation (4 females and 5 males).

Stimuli
Study and test list were drawn from a pool of 200 concrete nouns with a length of two Chinese characters and a frequency of 0-500 occurrences per million from [23]. Each word was recorded in spoken form (male voice, 44.1 kHz, mean duration 650 millisecond (msec), range 600-700 msec). A study list consisted of 100 words with random order. A test list contained 200 words, made up of a random sequence of 100 studied and 100 new words. Auditory cue is a 44.1 kHz pure tone (200 msec duration).

Task and Procedure
digitized at 500 Hz. Impedances of recorded electrodes were kept below 5k Ohm. In order to suppress the influence to EEG recording brought by muscular movements, the participants were instructed to reduce their facial and head movements during signal recording. In addition, any facial or head movement was inspected and marked during the experiment.
During the study phase, the participants were instructed to create a mental image denoted by each word heard via headphones and make a semantic judgment about the word. An auditory cue presented 1.5 seconds (sec) before each word, indicating the upcoming of a stimulus. Judgments were saved by depressing a corresponding button with the index finger of left or right hand. A practice block helped the participants get used to the task at the beginning of the study phase. The study list was presented across four blocks. Each block consisted of 25 words and was separated by a break of five seconds.
The participants were given a memory test approximately 45 minutes after the end of the study phase. In the memory test, an auditory cue presented 1 sec before word onset, and then all the words in the study list were re-presented as well as new words not encountered previously. The participants were required to decide whether they had experienced the word in the study phase and to indicate the confidence in their decision by pressing a number key from 1 to 5 for a rating scale (1.definitely familiar, 2.possibly familiar, 3.uncertain, 4.possibly unfamiliar, and 5.definitely unfamiliar). Fig 1 shows timings of the auditory memory task in the study phase and the memory test with a fixed inter-stimulus interval (ISI) for 0.6 sec.

EEG Pre-processing
The recorded EEG data from the study phase of the experiment were pre-processed by the following six steps: 1. re-reference: EEG were algebraically re-referenced to linked mastoids;

Fig 1. Timings of the auditory memory task in the study phase (A) and memory test (B).
The two shaded areas in the study phase are the lasting time for an auditory cue and an auditory word respectively. The participants were instructed to make a semantic judgment about the word with the "animate or not" question showing on a screen. In the memory test, the two shade areas have the same meaning as those in the study phase. The participants made a judgment about the scale of familiarity by pressing a key from 1 to 5. The ConvEEGNN approach is designed to predict whether the participant remembered the word in the study phase by analyzing the EEG recorded from the study phase. 2. filtering: the data were band-pass filtered between 0.05 and 15 Hz (48 dB roll-off, zero phase shift IIR filter) to remove low-frequency noise [15]; 3. blink detection and correction: in order to remove eye movement artifacts, a standard regression technique [24] was used to estimate and correct the contribution of artifacts to the waveforms; 4. segmentation: data from -0.1-2.9s duration around events of interest were further segmented into several trials. The start point of a segment is 100msec before cue onset (0s), namely -0.1s. The end point of a segment is 2.9s, right before making a judgment about a word; 5. baseline correction: each segment was referred to a 100-msec period before cue onset.
After the forth step, EEG data were segmented to ERP (event-related potential). ERP is the measured brain response that is the direct result of a specific sensory, cognitive, or motor event [25]. It is an EEG response to a stimulus. ERPs provide a continuous measure of processing between a stimulus and an EEG response, making it possible to determine which period is being affected by a specific stimulus [25]. The data after pre-processing are available publicly (https://github.com/panlab/ConvEEG).

SME of Our Auditory Memory Task
To verify SME in our data, ERP waveforms after artifacts rejection for each participant were averaged into individual-averaged ERP according to whether the word was remembered or forgotten in the subsequent memory test. Trials were labeled as remembered for the words in the study list given definitely familiar or possibly familiar judgments in the memory test. And trials were labeled as forgotten for the words in the study list given uncertain, possibly unfamiliar or definitely unfamiliar judgments in the memory test. Then, individual-averaged ERPs of all the participants were further averaged into grand-averaged ERP. Finally, ERP waveforms were qualified by measuring mean amplitudes of grand-averaged ERP (Fig 2). Fig 2 shows a significant subsequent memory effect for both pre-stimulus period (t-score < 0.01) and during-stimulus period (t-score < 0.01). For pre-stimulus period, more negative-going ERPs are elicited for subsequently remembered words than forgotten ones while for during-stimulus period, more positive-going ERPs are elicited for subsequently remembered words than forgotten ones, which are in accordance with [14,15].

Problem Definition
SMEs have shown that there exist differences in EEG data between the subsequently remembered and forgotten events, which may be used to predict subsequent memory performance.
Here we attempt to predict remembered or not from the recorded EEG signals. We formalize it as a pattern recognition problem of two-category classification. Suppose that we have a set of n samples with their labels {(I i , Z i ), i = 1, 2, . . ., n}, where I i is a piece of EEG signals during memory process for an event (in our experiment, each word presenting is an event), and Z i is the label of the sample I i indicating remembered or forgotten. We want to use these samples to learn a model H: I ! Z to establish the connection between the neural activities and memory performances. Therefore, for any EEG input I 0 of an event, its memory performanceZ will be predicted by H,Z A sample I i usually consists of EEG data from N channels of EEG electrodes, with the temporal sampling length T of each channel.

ConvEEGNN: Convolutional EEG Neural Network for Prediction
To predict whether an event will be remembered or forgotten, we design a convolutional neural network (CNN) for EEG, called ConvEEGNN. In general, CNN is a variant of multilayer perceptron with local connectivity and shared weights, which were inspired by biological processes [26,27]. It can be efficient to extract underlying features and tolerate variations over space and time. It has been widely used in various applications, for example, handwriting character recognition [28], object categorization [29][30][31][32], multimedia retrieval [33], face recognition [34], and speech recognition [35,36]. Our proposed ConvEEGNN is a CNN specified for EEG understanding. Network topology of ConvEEGNN is a key feature, which may eventually affect its prediction performance. A reasonable topology can translate successive signal processing or feature extracting steps. Consequently, we design the topology for ConvEEGNN depicted in Fig 3. It contains five layers: an input layer L in , a spatial convolutional layer L c , a temporal convolutional and subsampling layer L cs , and two fully connected layers L h , L out . Neurons in a layer are organized in planes and the output of neurons in a plane is called a feature map. Each layer comprises one or several feature maps. Generally, for the convolution transform in Layer L c and L cs , each neuron of a map is connected locally from the previous layer and shares the same set of weights. Layer L cs , L h and L out can be regarded as a multilayer perceptron. The architecture of Con-vEEGNN is described in more detail as follows.
The input of ConvEEGNN is a matrix I, consisting of N ch channels. Each channel is a time series of voltage measures with the length of T, namely, d th channel is a d = [x 1 , x 2 , . . ., x T ]. Therefore, the input I can be denoted as The size of I is N ch × T, where T corresponds to the temporal sampling length. T depends on sampling frequency and time interval for analysis.
• Layer L in : the input layer receives input EEG data I i . The input data are real values for N ch channels and temporal sampling length T. An EEG-Based Computational Memory Prediction Approach • Layer L c : the first hidden layer is a convolutional layer, which convolves data in the spatial domain. Neurons in the convolutional layer are organized in N c feature maps, each of which has S neurons. A neuron in a feature map has M inputs connected to a M by 1 area in the input, which is the receptive field of the neuron. Accordingly, each neuron has M trainable weights and a trainable bias. To detect the same feature at all possible location on the input, all the neurons in a feature map share the same set of weights, which is called the kernel of the map, and the same bias. Therefore, L c contains N c × (M + 1) trainable parameters and S × N c × (M + 1) connections. In this study, M is set to be N ch and S is set to be T.
• Layer L cs : the second hidden layer is a convolutional and subsampling layer, which subsamples and transforms the data in the temporal domain. Neurons in this layer are organized in N cs feature maps. The map m of L cs has P m neurons (m = 1, 2, . . . N cs ). Each neuron in a feature map m is connected to 1 × K m neighborhood in the corresponding feature map in L c . The 1 × K m receptive fields are non-overlapping in order to down-sample the input from L c . L cs contains P m N cs m¼1 ðK m þ 1Þ trainable parameters and P m N cs m¼1 P m Â ðK m þ 1Þ connections. In this study, the number of maps in L cs is set to be the same as that in L c , that is N cs = N c .
• Layer L h : the third hidden layer is composed of one map of Q neurons and is fully connected to L cs . Each neuron has P m N cs m¼1 ðP m þ 1Þ input parameters and connections of the same size. Q is set to be 10 in this study.
• Layer L out : the output layer has one map of two neurons fully connected to L h . The two neurons, Z 0 and Z 1 , represent the two classes of remembered and forgotten events. This layer has 2 × (Q + 1) parameters.
In ConvEEGNN, layer L c and L cs play the important role in prediction. Neurons in the spatial convolutional layer L c are organized in maps and each neuron has M inputs connected to a M × 1 area in the input layer, that is, the receptive field of the neuron. The weight vector connecting the receptive field and each neuron in layer L c is the kernel for this layer. The stride of the kernels for this layer is set to one. All the neurons in the same map share the same set of weights. Thus, all the neurons in one map of L c perform the spatial filtering on different channels of the data and result in a channel combination weighing the importance of different channels. Another map in L c uses a different set of weights to extract different channel combinations. The convolution operation is achieved by a single neuron scanning the input EEG data across the spatial domain with a local receptive field. The robustness of convolution operation to shifts and distortions of input is based on the property that if the input data shifted, the feature map output will be shifted accordingly.
After the convolution in the spatial domain, the spatial filters are detected. Then, the filtered data are convolved and subsampled in the temporal domain in L cs . The convolution operation is achieved by a single neuron scanning the input from the previous layer across the temporal domain with a 1 × K m local receptive field. The subsampling operation can be achieved at the same time since the receptive fields of the contiguous neurons are non-overlapping. The convolution and subsampling combination tolerates the variance of the input to some degree because the reduction of spatial resolution can be compensated by the increase of the number of maps.
For layer L c and layer L cs in ConvEEGNN, each neuron in a layer receives inputs from a set of neurons located in a small neighborhood of the previous layer. By connecting neurons to local receptive fields on the previous map, neurons can extract elementary features like channel importance. The elementary feature detectors are useful on part of the previous map as well as across the entire map. Therefore, neurons in a map share the same set of weight vectors and perform the same operation even though the corresponding receptive fields are located at different places on the map. After a feature has been extracted, its approximate position relative to other features is more important to its exact location. To reduce the precision about the positions of features and obtain some degree of spatial or temporal invariance, convolutional layers are interspersed with subsampling layers. And layer L cs combine the two operations in one layer.

Input Normalization
The input I i of ConvEEGNN is a matrix (similar to an image in ConvNet [29]), where each row of the matrix is a numeric time series of voltage measures for a channel. The size of I i is N ch × T, where T corresponds to the temporal sampling length. In our experiment, for the entire period, T is set to be 75 (25Hz × 3s, representing -0.1-2.9s). For pre-stimulus period and during-stimulus period, T pre and T dur are both set to be 30, representing 0.3-1.5s with 25Hz and 1.5-2.7s with 25Hz respectively.
First, the data are subsampled to the sampling frequency of 25 Hz in order to reduce the size of input. Many studies showed that memory performance is related to oscillatory activity in the theta (4-8 Hz) frequency band [13,[37][38][39]. Therefore, the subsampling operation provides most of the information relevant to memory performance. Then, the data are normalized with mean 0 and variance 1 to improve convergence during the learning of ConvEEGNN [40].
In this experiment, in total 30 channels are used. We exclude the horizontal electrooculogram and vertical electrooculogram since the two channels provide information to measure eye movement and are irrelevant to brain activity.

Learning in ConvEEGNN
After the network topology of ConvEEGNN has been structured, we need to learn the weights of the network from training data. A typical process of learning consists of two main steps: feedforward and back-propagation [41]. For feedforward pass, the network processes the inputs according to the initial weights and provides resulting outputs. For back-propagation pass, the errors between the resulting outputs and the desired outputs corresponding to the input data are used to update the weights in order to gradually reduce the errors.
In this study, we extended the derivation and implementation of feedforward pass to Con-vEEGNN. Let k l m denote the kernel for the map m in layer l and b l m denote the bias for the map m in layer l. Define output of the map m in layer L c and L cs to be: where conv is the convolution operation. For L c , input I i is convolved with the kernel of k L c m and the convolution stride of one. For L cs , the output from layer L c is convolved with the kernel of k L cs m and the convolution stride of K m (the size of k L cs m ). Notice that a kernel is shared by each neuron of one map and layer L cs has S/K m neurons for each map. Then, the data are put through the activation function f(Á) to form the output feature map. In L c , a kernel allows filtering in the spatial domain. In L cs , a kernel represents temporal filters and down-sampling, and this size of the data to analyze is reduced in this layer by performing convolution and subsampling at the same time.
The output of layer L h and L out can be achieved according to the typical feedforward pass of fully connected neural network [41]. Two neurons of the output layer, Z 0 and Z 1 , represent the two classes. The input is predicted to be a forgotten event if the output of Z 0 is larger than that of Z 1 , otherwise the input is recognized as a remembered one.
For each layer, the weights/kernels are initialized with a standard distribution around AE1=n l m ðjÞ input , where n l m ðjÞ input is the number of inputs of the neuron j in the map m of layer l. The activation function f(Á) for L c and L cs is hyperbolic tangent function. The constants are set with a = 1.7159 and b = 2/3, according to the recommendations described in [40]. The activation function for the last two layers is the logistic (sigmoid) function.
For back-propagation pass, we applied backpropagation algorithm [41] by minimizing the least mean square error. Like typical backpropagation pass, the resulting outputs are compared against the desired outputs corresponding to the input. And then, the errors are propagated back through the network to adjust the weights while the network is gradually converging on the ability to provide the desired outputs.
We use cross-validation for training and testing [42]. This method divides data into training data and testing data. For each division, the testing set is composed of one sample from each class (remembered or forgotten) and the remaining samples are used for training. The cross-validation process was repeated k times until each sample was used exactly once for testing. The k results from the k divisions can then be averaged to produce a single prediction accuracy. For the training procedure, the training samples are divided into a training set and a validation set, accounting for 70% and 30% respectively. To balance the number of samples for each class in the training set, we copied the samples of the smaller class [43]. The training stopped when the least mean square error was minimized on the validation set.
The average training time was around 4 minutes on a computer with an Intel Core i5-3470 CPU (3.20GHz) and 4GB RAM. The time depends on the number of training samples. The model was implemented in MATLAB without any special hardware optimization (multicore or GPU). The source codes will be available publicly if this paper is accepted. The average testing time was around 1 sec on the same computer.

Results
In this section, we carried out four experiments to evaluate the performance of ConvEEGNN:

Prediction Accuracy
We experimented with different network structures of ConvEEGNN. The structure is determined by the number of maps in the convolutional layer (N c ) and the size of map m in the convolutional and subsampling layer (P m , m = 1, 2, . . . N cs ). Table 1 shows the prediction results of different ConvEEGNN network structures for all 9 participants. In addition to the average prediction accuracy, significance based on total samples of each participant is also included. Significance means the number of significantly over chance results (significantly over 50% with p < 0.05) in all 9 participants [44,45]. Higher significance suggests that an approach has higher accuracy. The average prediction accuracy varies nearly from 65% to 72%. The best accuracy of 72.07% was achieved with the network structure of N c = 1, P 1 = 3.
For this network structure (N c = 1, P 1 = 3), all of the 9 participants showed prediction accuracies significantly over chance. The ConvEEGNN with different P m in one network structure (N c = 2, P 1 = 3 & P 2 = 5) achieved the accuracy of 70.15%, which is approximate to the best accuracy (72.07%) with high significance. The reason behind this may be that P m is directly related to the size of kernel for the convolutional and subsampling layer. For this layer, the kernel convolved data in the temporal domain to extract sequential temporal features. The kernel size affects the range of time used for higher feature extraction. In view of the neural activity during memory process, kernel size may indicate the complexity for neurons related to memory formation to process EEG signals. Kernels with smaller size may indicate a relative simple signal process to extract short-time features about memory formation. Kernels with bigger size may indicate a relative complex signal process to extract long-time features about memory formation. The network structure of N c = 2, P 1 = 3 and P 2 = 5 may take the advantage of combining these two kinds of features or signal processes and resulted in a high prediction accuracy (cf. Table 1).

Comparison with Other Approaches
For comparison purposes, six other approaches were implemented and optimized, then tested on the same data with the same experimental protocol.
2. ANN-1: one-hidden layer fully-connected artificial neural network. For ANN-1, the hidden layer has 10 neurons. Hyperbolic tangent function is used as the activation function of the first hidden layer and logistic (sigmoid) function is used for the other layer.
3. ANN-2: two-hidden layer fully-connected artificial neural network. For ANN-2, the two hidden layers have 20 and 10 neurons respectively. Hyperbolic tangent function is used as the activation function of the first hidden layer and logistic (sigmoid) function is used for the other layers. The significance column gives the proportion of the number of participants with significantly over chance results in all 9 participants. doi:10.1371/journal.pone.0167497.t001 An EEG-Based Computational Memory Prediction Approach 4. SVM: support vector machine. After testing different kernels (linear, polynomial and radial basis function), we optimized the approach by using cubic polynomial as the kernel function.
5. SVM + LDA [20]: this classifier-fusion approach combined the results from two SVM classifiers for the spectral features of pre-and during-stimulus SME and an LDA classifier for the temporal features of during-stimulus SME. The kernel function for SVM is cubic polynomial.
6. CWT + SVM [21]: this approach used continuous wavelet transform to extract time-frequency features and then used 1-norm SVM to predict memory performance. Since the data were band-pass filtered between 0.05 and 15 Hz according to the structure of Con-vEEGNN, the frequency bands extracted for 1-norm SVM are Theta 1 (centered at 4.00 Hz), Theta 2 (centered at 6.42 Hz) and Alpha (centered at 11.26 Hz).
As it can be seen in Table 2, ConvEEGNN outperformed all the other six approaches, suggesting that convolutional neural network may have some advantages over EEG analysis. By convolving across spatial and temporal domain, ConvEEGNN may be more robust to shifts or distortions of EEG signals. By subsampling in the temporal domain, the relative positions of features may be extracted to obtain some degree of temporal invariance [28,46]. Since each neuron in a layer receives inputs from a set of neurons located in a small neighborhood of the previous layer, neurons may extract local fine grained features which benefits the signal analysis [46]. LDA was the worst model with the lowest accuracy and significance. Since LDA is good at classifying features with linear separability, the performance of LDA in Table 2 may suggest that the data have less linear separability. Among all the other approaches, SVM achieved relatively higher accuracy but the significance was similar to the other approaches, which was lower than the half number of the participants. From Table 2, SVM + LDA [20] showed average accuracy around 60% with low significance, which was outperformed by Con-vEEGNN. This may suggest that the features exploited by ConvEEGNN might be more informative than those extracted by SVM + LDA [20].  The significance column gives the proportion of the number of participants with significantly over chance results in all 9 participants. doi:10.1371/journal.pone.0167497.t002 An EEG-Based Computational Memory Prediction Approach accuracy without sharp increase in the variance of accuracy, which is similar to the variance of accuracy for SVM. This suggests that ConvEEGNN is relatively stable and accurate. The variance of accuracy for ANN-1 or ANN-2 was relatively low among all the approaches, revealing that the performance was stable, while the significance was low.

Prediction Performance of Pre-and During-stimulus Periods
Since both EEG signals before an event [14,15] and that during an event [11] have been found to reveal clues to distinguish remembered events from forgotten ones, in this subsection, we hope to investigate which kind of EEG signals contributes more for memory performance prediction. For that, EEG signals from the pre-stimulus period and during-stimulus period are taken separately as an input for the ConvEEGNN. In our experiment, the parameters standing for temporal sampling length for the pre-stimulus period and during-stimulus period, T pre and T dur , were both set to 30, so as to be adapted to the structure of ConvEEGNN, representing 0.3-1.5s and 1.5-2.7s respectively. Therefore, the two inputs of the ConvEEGNN for the two periods were of the same size. The number of maps in the second layer and the size of maps in the third layer were in accordance with the network structure with the best performance mentioned before, which was N c = 1, P 1 = 3. The performances for the two periods are compared with the performance for the entire period (-0.1-2.9s) in Table 3, which shows the prediction accuracy as well as significance for all 9 participants. From Table 3, we can see that the prediction performance with pre-stimulus signals and that with during-stimulus signals is very close and both of them are nearly 67%, significantly over chance for at least eight participants. This result may indicate that pre-stimulus period and during-stimulus period have very similar contribution for subsequent memory performance prediction. The information from pre-stimulus period and during-stimulus period may have similar relation with memory process. Compared to average accuracy using single period of EEG data, the average accuracy using the entire period increases approximately 5%, which is significantly better than either the pre-stimulus period (t-score < 0.01) or the during-stimulus period (t-score < 0.05). And each participant's accuracy is significantly over chance for the entire period.

EEG Channel Analysis with ConvEEGNN
To infer the influence of different channels for prediction, the weights from the input layer to the second layer of the ConvEEGNN were examined. The absolute value of a weight provides a channel's discriminant capability for telling remembered events from forgotten ones. Higher absolute value means higher discriminant capability. Fig 5 shows the weight maps of EEG channels averaged over all the participants for pre-stimulus, during-stimulus and entire period respectively. The red means a high absolute value while the blue represents a low weight. Table 4 shows the top three channels for all the three periods. For the pre-stimulus period, the largest weight was from the channels over prefrontal cortex. In other words, the discriminant capability over prefrontal cortex was the highest. For the during-stimulus and entire period, the greatest weights were from the signal sources above the prefrontal and temporal cortex (Fp1 and T7) (cf . Fig 5b and 5c). The discriminant channels are inferred by the weights in the data-driven ConvEEGNN. We find that these results are in accordance with the distribution of SME, that is, the magnitude over prefrontal cortex is the largest [14,15]. This indicates the rationality of our approach somewhat.

Conclusion & Discussion
In this paper, we proposed a computational approach called ConvEEGNN to predict memory performance using EEG signals. The ConvEEGNN can automatically extract features and integrate them with classification. The effectiveness of the proposed approach was validated by the recorded EEG signals in an auditory memory task. The results demonstrated that Con-vEEGNN is effective to estimate earlier than his/her actual memory performance, outperforming other typical approaches. It was also found that EEG signals from pre-stimulus period and those from during-stimulus period have the very similar prediction accuracy.
ConvEEGNN has some underlying advantages for EEG-based memory prediction. Con-vEEGNN allows automatic feature extraction via end-to-end training within the convolutional layers and the subsampling layers. This is helpful for EEG signal analysis since the signal contains many variations over time. By convolving, ConvEEGNN can be more robust to shifts or distortions of the input data. By subsampling in the temporal domain, the relative positions of features can be extracted to obtain some degree of temporal invariance [28,46]. In addition, since each neuron in a layer receives inputs from a set of neurons located in a small neighborhood of the previous layer, neurons may extract local fine grained features or some kinds of underlying features which benefits the signal analysis [46]. For example, in ConvEEGNN, the kernel size, that is the number of neurons in the network used for convolving or subsampling, may indicate the complexity for neurons about memory formation to process EEG signals. In this way, kernels with smaller size may indicate a relative simple signal process to extract short-time features about memory. Kernels with bigger size may indicate a relative complex signal process to extract long-time features. By combining these two kinds of features, the network structure with different kernel size could achieve a relatively high prediction accuracy.
Compared to standard convolutional neural network for image recognition [29], kernels used in ConvEEGNN are vectors but not matrices, in order to separately extract spatial An EEG-Based Computational Memory Prediction Approach features crossing channels and temporal features in a single channel. We notice that the input of ConvEEGNN is a matrix which includes both spatial and temporal dimension. A vector kernel can convolve over only one dimension (i.e. spatial or temporal), thus only one kind of features (spatial or temporal) could be extracted. However, if a matrix kernel is used, it will convolve over not only spatial dimension but also temporal dimension, which will result in the features combining spatial and temporal domain. Separation of spatial and temporal domain has a distinctive advantage that it is easy to explain and understandable to optimize. For example, with spatial features, we can easily find which channel is more significant for remembering performance, and which is less. This study still is limited by its participant number. Since only 9 participants were used for prediction, the data for training an optimal structure of ConvEEGNN were limited in number. Therefore, the ConvEEGNN that we optimized in this study is relatively limited in its prediction performance. As a matter of fact, CNN has advantages in modeling in various fields, such as speech recognition [36] and image classification [29]. Its advantages would be strengthened given more data. If more participants were provided, we might take full advantages of CNN to achieve a better prediction performance with a more complex structure of ConvEEGNN.
The weights from the input layer to the second layer of ConvEEGNN showed the influence of different channels for prediction. The results are consistent with the distribution of SME [14,15]. In addition, the results for the during-stimulus and entire period may reveal a potential relation between the channels over temporal cortex and memory process.
Memory performance prediction has various applications. For instance, it could help us remember new knowledge better, and then improve learning efficiency. It may help diagnosis and treatment of those diseases regarding memory symptom, such as mild cognitive impairment and Alzheimer's disease. It may also be very helpful to build a brain-in-loop system for cyborg intelligence [47][48][49].