Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses

Story understanding involves many perceptual and cognitive subprocesses, from perceiving individual words, to parsing sentences, to understanding the relationships among the story characters. We present an integrated computational model of reading that incorporates these and additional subprocesses, simultaneously discovering their fMRI signatures. Our model predicts the fMRI activity associated with reading arbitrary text passages, well enough to distinguish which of two story segments is being read with 74% accuracy. This approach is the first to simultaneously track diverse reading subprocesses during complex story processing and predict the detailed neural representation of diverse story features, ranging from visual word properties to the mention of different story characters and different actions they perform. We construct brain representation maps that replicate many results from a wide range of classical studies that focus each on one aspect of language processing and offer new insights on which type of information is processed by different areas involved in language processing. Additionally, this approach is promising for studying individual differences: it can be used to create single subject maps that may potentially be used to measure reading comprehension and diagnose reading disorders.


Material
Participants read chapter 9 of Harry Potter and the Sorcerer's Stone [Rowling, 2012]. We chose this chapter because it involves many characters and spans multiple locations and scenes. We chose a famous book series because we hypothesized all subjects already had characteristic mental representations of the different characters and locations, and that at least a part of this representation would remain constant throughout the reading of chapter 9. This assumption allows us to use data from the entire chapter to look for the representation of the different characters, e.g. the protagonist Harry Potter. In contrast, had we chosen an unfamiliar story in which we learn about the protagonist's personality throughout the text, the mental representation of this protagonist will arguably change more than Harry's would.
Participants fMRI data was collected from 9 subjects (5 females and 4 males) recruited through Carnegie Mellon University, aged 18 to 40 years. The participants were all native English speakers and right handed. They were chosen to be familiar with the material: we made sure they had read the Harry Potter books or seen the movie series and were familiar with the characters and the story. All the participants were screened for safety, signed the consent form and were compensated for their participation. Data from one of the subjects was excluded from the analysis because of an artifact that was not removed by our preprocessing procedure.

Design
The words of the story were presented in rapid serial visual format [Buchweitz et al., 2009]. Words were presented one by one at the center of the screen for 0.5 seconds each (see Fig. 5). The background was gray and the font was black. We used MATLAB and the Psychophysics Toolbox extensions [Brainard, 1997, Pelli, 1997, Kleiner et al., 2007.
The chapter was divided into four runs, of approximately 11 minutes each. Subjects had short breaks between runs. Each run started with a fixation period of 20 seconds in which the subjects stared at a cross in the middle of the screen. The words presentation started after the fixation period. The total length of the runs was 45 minutes, during which about 5200 words were presented. Chapter 9 was presented in its entirety without modifications and each subject read the chapter only once.
Before the experiment, we supplied the subjects with a summary of the events preceding chapter 9 and a summary of the main characters and concepts in Harry Potter and the Sorcerer's Stone to refresh their memory. We also instructed them to practice rapid serial presentation by viewing a video that replicated the parameters of our design, but with another story (The Tale of Peter Rabbit [Potter, 2006]). On the day of the experiment, the subjects were instructed to lay in the scanner and read the chapter as naturally as possible while remaining alert.

fMRI procedure
Functional images were acquired on a Siemens Verio 3.0T scanner (Siemens, Erlangen, Germany) at the Scientific Imaging & Brain Imaging Center at Carnegie Mellon University, using a T2* sensitive echo planar imaging pulse sequence with repetition time (TR)=2s, echo time=29 ms, flip angle=79 • , 36 slices and 3 × 3 × 3mm voxels. Anatomical volumes were acquired with a T1-weighted 3D-MPRAGE pulse sequence.   Figure 5. Illustration of our fMRI experimental protocol. Words from a story are presented serially for 0.5 seconds each while recording brain activity with fMRI at a rate of one entire brain image each 2 seconds. Our goal is to model how fMRI neural activity during reading reflects the perceptual and conceptual features of the story. Each fMRI activity volume is shown here in 36 horizontal slices. Going right to left through the slices, then bottom-up, corresponds to looking at slices from the bottom of the brain up. Within each slice, the top of the slice corresponds to the posterior of the brain, and the right side of the slice corresponds to the left side of the brain. The images are on a scale from blue to red where blue indicates negative deviation from baseline and red indicates positive deviations. A TR is the time needed to record one brain volume, and is 2 seconds in our experiment.

Data preprocessing
We used the MATLAB suite SPM8 [Ashburner et al., 2008] to preprocess the data. Each subject's functional data underwent realignment, slice timing correction and co-registration with the subject's anatomical scan, which was segmented into grey and white matter and cerebro-spinal fluid. The subject's scans were normalized to the Montreal Neurological Institute (MNI) space and smoothed with a 6 × 6 × 6mm Gaussian kernel smoother.
Using the Python toolbox PyMVPA [Hanke et al., 2009], we masked the functional data using the segmented anatomical mask, discarding cerebrospinal-fluid voxels. The data was then detrended in MATLAB by running a high-pass filter with a cut-off frequency of 0.005Hz. Visual inspection of the time course of a large number of voxels showed that this threshold was enough to get rid of large block effects and slow trends in the data.
Finally, we selected voxels from each subject, keeping only voxels in 78 cortical Regions Of Interest (ROIs), defined using the AAL brain atlas [Tzourio-Mazoyer et al., 2002], excluding the cerebellum and white matter. We ended up with an average of 29227 voxels per subject. The anatomical union (number of MNI voxel locations for which at least one subject had a voxel) of these 6 subject's brains was a set of 41073 voxel locations.

Appendix B. Representing stories in a feature space
We represented our story features as a multivariate discrete time series. We used one TR as a unit of time. This enables us to have the same time scale for the features and the data time series. We compute the value of a feature at any TR by aggregating the features of the four words that were read during that TR (see table 2).
We extracted the story features at multiple levels of representation. Specifically, we obtained simple perceptual features such as the average word-length in a TR, as well as semantic features of individual words and sentence level features such as syntactic dependency relationship. We also included discourse level features such as the presence of different story characters. The list of all the features we used is provided in tables 1. This table includes the features that were finally used in the model: a few of our features had too few occurrences and we ended up disregarding them. We also include in table 2 as illustration a subset of the feature values for the two segments of the story included in Fig. 2(B).

Visual features
• Average Word Length: We compute the average word length in every TR.
• Word Length Variance: We compute the variance of word length in every TR.

Semantic features
An approximation of the meaning of a word can be obtained by the pattern of its occurrence with other words over a large text corpus. For example, "apple" is likely to occur with other food items or the verb "eat", but not so likely to occur with building materials or power tools. These statistics are very large in dimension and therefore we need to resort to some form of dimensionality reduction.
We used NNSE (Non-Negative Sparse Embedding) [Murphy et al., 2012] a which produces low dimensional representations for word meanings that are interpretable and cognitively plausible. The intuition is that, when asked to name the semantic properties of an object, one would list the few salient positive properties (e.g. an apple is a round, usually red, edible object) instead of naming negative properties (e.g. an apple is not a tool), see [Murphy et al., 2012] for more detail. These are learned from massive web corpora from which dependency co-occurrences and document co-occurrence counts are computed. These statistics are then factorized using NNSE.
For every word in our story, we therefore obtain 1000 NNSE features of which we keep the top 100 (these 100 features are picked from the 1000 based on the set of words in the story by choosing the dimensions with the highest average magnitude for these words, whereas the original 1000 were picked by the NNSE model based on the set of all words in the corpora). We sum the features of the four words within each TR.

Syntactic features
Using an automated parser [Nivre et al., 2007] we determined the part of speech of every word in the story and obtained the dependency role of every word from the parse tree of the sentences.
We obtained a set of 28 unique parts of speech and 17 unique dependency relationships, for a total of 45 syntactic binary features that indicate if a given part of speech or a dependency relationship occurred within a TR. We also included an additional feature that records the position of a word in the sentence, i.e. its number starting from the beginning of the sentence. This value is averaged for the four words in a TR.

Discourse features
We made the following annotations manually by going through the story text: • Characters: We resolve all pronouns to the character to whom they refer, and make binary features to signal which of the 10 characters are mentioned.
• Motions: We identified a set of motions that occurred frequently in the chapter (e.g. fly, manipulate, collide physically, etc.). Because the actions happen in the course of a sentence, we created two story features for: a punctual feature and a "sticky" feature. The punctual feature represented when the verb of the motion was mentioned, and the sticky feature is on for the duration of the motion (i.e. the sentence). Because we disregarded some of the story features which had few occurrences, we ended up with some motion features that consist only of the sticky feature.
• Speech: We indicated the parts of the story that corresponded to direct speech between the characters. We have a punctual feature that indicates the verb that announces which character is speaking (e.g. "said Harry"), and a sticky feature that indicates ongoing direct speech.
• Emotions: We identified a set of emotions that were felt by the characters in the chapter (e.g. annoyance, nervousness, pride, etc.). We had punctual features for when the emotion was explicitly mentioned, and sticky features when it was being felt by the characters.
• Verbs: (non-motion) We identified a set of actions that occurred frequently in the chapter that were distinct from motion (e.g. hear, know, see, etc.). These typically spanned a shorter time than motions and we only used punctual features to represent them.   We aim to find the mapping between the different types of features we presented above and the neural activity y v of a voxel v. We want to learn the response of this voxel v to every feature j.
We first assume that each feature j has a signature activity in voxel i that is consistently repeated every time the brain encounters this feature (for the regions that do not encode this feature, we will ideally learn a signature activity equal to 0). Fig. 1(a) shows a hypothetical pattern of activation elicited by the semantic feature j in a given voxel. Due to the TR = 2 seconds we use in our experiment, and the typical latency of the hemodynamic response, we are only interested in the points of the response signature that are sampled 2, 4, 6 and 8 seconds after the onset of feature j (w vj 1 , w vj 2 , w vj 3 and w vj 4 ). It is important to note that we do not constrain the shape of the learned response signature. We also tried estimating the response with 5 time points (2 to 10 seconds after onset) and 6 time points (2 to 12 seconds). However this manipulation did not significantly change the performance and therefore we use 4 time points for computational and statistical reasons (see Appendix F).
The second assumption is that the signature activity is scaled by the value of feature j at the time the feature is presented. See Fig. 1(b).
Therefore, if we assume that the responses created by successive occurrences of a feature are additive then the activity at time t in voxel v is: Another way to think about this is that the activity created by the feature is the convolution of the response signature with the time course of the feature. Above we considered the brain activity to be created by one story feature. Now we include the activities created by all of the features we have defined above, again assuming they are additive. This gives the model: We therefore model the voxel's activity y v (t) as a linear combination of the values of all the features at times t − 4 to t − 1. We know the time courses of the feature values and the voxel's activity, and we need to predict the set of response signatures.
Our approach is similar to Hidden Process Models [Hutchinson et al., 2009] that also use a multiple regression setup. The neural activity there is also assumed to be generated by linearly additive processes and all instantiations of the same process share the same response, but unlike the case of our model, the delay in the onset of the response is variable.

Appendix D. Learning the Response Signatures
In equation 2, we did not consider different subjects, and only considered a hypothetical voxel i. However, in reality, we have S subjects, and V (s) T voxels for each subject. The regression in equation 2 can therefore be rewritten as: where: • s is the index of a given subject (1 ≤ s ≤ S) • n is the number of TRs (or time points) • y is the n × 1 vector of errors (n is the number of TRs) caused by noise in voxel v of subject s (σ 2 v is the noise variance at voxel v and I n is the n × n identity matrix).
To learn the responses w (s) v , we solve the following 2 regularized regression: independently, for each voxel v and each subject s. This equation has a closed form solution where I K is the K × K identity matrix, and we choose the λ (s) v parameter using generalized cross validation [Golub et al., 1979] to estimate the average leave one out cross validation error for each value of λ (s) v . Note that at each voxel we are estimating a value for the best regularization parameter independently of the other voxels.
One additional detail is that, for each experimental block, we throw out the first 10 TRs corresponding to the fixation period and the following 1 TR. This 1 TR correspond to the start of the text display and because of the time shift of the features, the feature matrix at that TR has no content (it corresponds to the story features from the 4 following TRs, i.e. the fixation period).

Appendix E. Learned Waveforms
After learning the set of parameters, we look at the four points we learned for a feature j at a voxel v and examine their relative shape. We find that the responses learned are very noisy. However when only looking at the average response for a given feature type at the regions that represent this feature type (we obtain these regions via the classification task explained in detail in the next section), we end up with 4 points that can usually be fitted on a concave waveform that resemble the characteristic shape of the hemodynamic response. We present the average waveforms we learned in Fig. 6. It should be noted that these plots are the averages by feature set, for one of the subjects, of parameters learned across the voxels whose accuracy is in the top 95% percentile, and therefore they are only provided as an illustration.

Cross-Validation Procedure
To learn the response signatures we time-shift the story feature matrix: we make matrix F in which every row t contains the values of all the features at times t − 4, t − 3, t − 2 and t − 1 .We also create an fMRI data matrix containing in each row t the concatenation of the entire brain images for all subjects, at TR t.
We introduce here the matrix W, which is the concatenation of all the vectors w where To test the validity of the learned response signatures, we constructed a binary classifier that decodes which passage of the story is being read from a given fMRI data frame. We start by partitioning the timeline into 10 cross-validation folds. Then for every fold i: 1. Decorrelate the test (fold i) and training data (other 9 folds) by discarding the data corresponding to the 5 TRs before and after fold i.
2. Use the training data to estimate the response signatures of all features in all voxels and all subjects (W), using the method in Appendix D. It is important to note that the responses are learned independently for each voxel and each subject. Also note that the penalty parameter for each voxel that is described in Appendix D is chosen using only the training data.
3. Divide the timeline of fold i into non-overlapping time windows, each of length B TRs. Then, for every pair of B TRs segments: (a) Take the two test story-frames (S 1 and S 2 ) and predict the corresponding brain activity using the learned responses W, as shown in Fig. 7. (b) Use the two predictions P 1 and P 2 to classify each of the two test data-frames T 1 and T 2 independently: i.e. assign to each data-frame the story-frame with the closest prediction, using a distance function explained in the following subsection.
ry Reading in the Brain using fMRI

Predict) the) brain) data) for) the) held) out) segments) and) use) the) predic&on) to) classify) the)held)out)fMRI)data.)
Compute)the)average)binary)accuracy) (a)) (b)) story) features) fMRI) data) held)out)) features) predicted)fMRI)) response) real)fMRI)) response) &me) Figure 7. Diagram of the classification task. The task is to assign to each held-out K TRs fMRI segment (T 1 and T 2 ) the K × 2 seconds portion of the story to which it corresponds (one of the the two dark blue segments). This is done by predicting the activity using the learned weights, then computing the distance between the two predicted responses (P 1 and P 2 ) and the real segment. The classification of T 1 and T 2 is done independently, i.e. for T 1 , the story passage S 1 or S 2 is chosen, and then, in a different test, for T 2 , the story passage S 1 or S 2 is chosen.
We average the results of all the cross-validation folds and obtain an overall classification accuracy.

Classification Procedure
Here we describe how the distances between a test segment T and the two predicted segments P 1 and P 2 that we compare it to are computed (see Fig. 7). We use two methods: • Whole-Brain classification: The simplest way to perform classification is to use all the voxels from all the subjects in order to determine the distance between the predicted segments and the true segment. Because we are working with single trial data, concatenating the voxels from different subjects in a row acts as a substitute for multiple repetitions. We compute the Euclidean distance between the two images: ||T − P 1 || 2 and ||T − P 2 || 2 .
Importantly, this test method combine data from multiple subjects without averaging data over subjects in either the learning step (as we saw above) or the classification step. The multi-TR segment that we are classify is actually a multi-TR concatenation of brain images from all subjects, instead of a multi-TR segment of one brain's images. Since every voxel in that data is trained independently, and contributes to the Euclidean distance independently, then this concatenation does not make any assumptions on the subject's alignment.
As discussed in Appendix G and the main text, when using the data that has been smoothed in preprocessing, the classification accuracy is lower than when using un-smoothed data. Therefore we boost the accuracy when using smoothed data by voxel selection. This is done in the following way: at every cross-validation fold, we use the training data in order to find the best subset of voxels to use. This is done via a nested cross-validation step on the training data what determines which voxels have the best accuracy and how many of the top voxels to use to obtain the best combined accuracy. These voxels are then used to classify the untouched test data: we compute the Euclidean distance using only the columns that correspond to these voxels.
• Concatenated Searchlight classification: Whole-Brain accuracies do not tell us about which parts of the brain are contributing to the classification accuracy. In order to assess this, we perform the classification "locally", looking in one region of the brain at a time. Regions are defined as k × k × k-voxel cubes centered around one MNI voxel location, k being an odd integer. This method is similar to the Searchlight approach commonly used in neuroimaging [Kriegeskorte et al., 2006], however we expand it to include data from multiple subjects: -We pick a cube size k: for example, k = 5 gives a 5 × 5 × 5 voxels cube (to look at one voxel at a time we take a 1 × 1 × 1 voxel cube) -For every voxel location (x i , y i , z i ), we select the set of voxels whose coordinates fall in the k × k × k voxels cube centered around that location. This can be done for each subject independently, in the case where we are interested to look for regions with high accuracy on a single subject basis. It can also be done by selecting the union of voxels from all subjects that fall in this cube. We call the set of voxels selected at this step V i . Because we are working with single trial data, concatenating the corresponding voxels from different subjects in a row acts as a substitute for multiple repetitions. Additionally, since the alignment of the subjects to the same anatomical space is not perfect, taking a k × k × k voxel cube with k > 1, allows us to circumvent small variations in the anatomical configuration of the subjects brains.
-For each of these sets V i of voxels, we compute the Euclidean distances: ||T(all rows, voxels in V i ) − P 1 (all rows, voxels in V i )|| 2 and ||T(all rows, voxels in V i ) − P 2 (all rows, voxels in V i )|| 2 Note: we are performing this computation at every voxel, so we are actually performing N v classifications, where N v is the total number of voxels.

Significance Testing
• Whole-Brain Classification Accuracy To show that Whole-Brain classification accuracy is significantly higher than chance accuracy, which is 50% in this balanced binary classification task, we compute an empirical null distribution. The null distribution that story features cannot predict neural activity is approximated empirically. A common approach to estimate the null distribution is by running a permutation test: the order of the features is permuted before classification and the procedure is repeated a large number of time. However, the different samples (different TRs) of our experiment are not identically and independently distributed (IID) given that the data is from a time series. The time series of data and of features varies smoothly and therefore the classifier might detect dependencies between them when there is none, because they happen to vary similarly in this finite sample. The commonly used permutation test will not contain such dependencies and therefore will not correct for them, therefore leading to an optimistically biased answer. To solve this problem we use a solution inspired by [Chwialkowski and Gretton, 2014]: we shift the feature time series by N TRs such that a < N < b and compute the classification accuracy. For a and b large enough (we use a = 500 and b = 750), there will be no real relationship between the time series of data and the time series of features, however the time smoothness will be conserved, leading to better estimates of the variance of chance classification accuracy, which guarantees less false positives.
• Identifying Brain Regions Correlated with Different Feature Types: To find out where in the brain each type of story feature is useful, we followed a similar training approach as previously, except that (1) only one type of feature (Word Length, Story Characters etc...) was used at a time and (2) we used a concatenated Searchlight procedure at test time with k = 5 and using data from all subjects. Precisely, for every voxel location i, we took the cube of 5 × 5 × 5 voxel coordinates centered around that location. The union of voxels from all subjects that have coordinates included in this cube were selected. Therefore, for every location, we performed the classification of 2 segments of size 20 × |V i |.
For every one of these combination of type of story feature/subset of data, we obtain a local classification accuracy. We measure significance by computing an empirical null distribution in the same way as for the whole-brain accuracy, then correcting for multiple comparisons using the Benjamini-Hochberg-Yekutieli False Discovery Rate (FDR) [Benjamini and Yekutieli, 2001]. This procedure controls the FDR at level q under arbitrary dependence and therefore we did not need to make independence assumptions about the accuracies of different voxels. The procedure is, for N comparisons: -Sort the N p-values.
-Find the largest j such that .
-Reject the null hypothesis for the j comparisons with the smallest p-values. Figure 8. Results obtained by our generative model for different syntax features, showing where sentence length, part of speech, and dependency roles are encoded by neural activity. Each voxel location represents the classification when using a cube of 5 × 5 × 5 voxel coordinates, centered at that location, such that the union of voxels from all subjects whose coordinates are in that cube are used. Voxel locations are colored according to the feature set that can be used to yield significantly higher than chance accuracy.
We have also ran the entire experiment with the same setup, using however the data without spatial smoothing. The results vary to a considerable degree in the boundaries of each region, while the main location of each feature representation stays the same. Figures 9 and 10 show the resulting maps. nonword-list a memory probe was presented (a word in the sentences and word-list conditions, and a nonword in the jabberwocky and nonword-list conditions), and participants had to decide whether the probe was present in the preceding stimulus. As discussed in Fedorenko et al. (2010), the two versions of the task (passive reading vs. reading with a memory probe at the end) produced similar activation patterns; we therefore collapsed across the two subsets of the subjects in our analyses in that paper and we do the same here. Each participant completed between 6 and 8 runs (i.e., between 24 and 32 blocks per condition; see Fedorenko et al., 2010, for details of the timing).

Author's personal copy
In Section 3, we report the results of: (a) region-of-interest-based (ROI-based) MVPA analyses on a set of key language-sensitive regions and (b) whole-brain searchlight-style analyses (Kriegeskorte et al., 2006).

ROI-based analyses
We chose to use as ROIs for our MVPA analyses the thirteen group-level functional parcels 6 (Fig. 5, bottom) that were derived from the probabilistic overlap map for the Sentences > Nonword-lists activations 7 (Fig. 5, top), as described in Fedorenko et al. (2010). These group-based ROIs represent the locations where individual activations tend to be found most consistently across subjects. So, for any given subject, a parcel will include some voxels that respond reliably more strongly to Sentences than Nonwords, and some voxels that do not show this property. We chose to use these group-level parcels instead of subject-specific functional ROIs in these analyses for two reasons. First, it has been previously demonstrated 6 These parcels were created in order to systematize and automate the procedure for defining subject-specific functional ROIs (fROIs): in particular, for any given region, an individual subject's fROI is defined by intersecting the relevant parcel with the subject's thresholded activation map. In other words, these functional parcels serve as spatial constraints on the selection of subject-specific voxels, akin to using borders of anatomical regions (see Julian, Fedorenko, & Kanwisher, submitted, for an extension of this method to ventral visual regions). 7 Although these group-level functional parcels were created from the 25 subjects whose data we examine here, non-independence issues (Vul & Kanwisher, 2009) do not arise in examining the discriminability between word lists and jabberwocky sentences because the data from those conditions were not used in creating the parcels. Some non-independence is present when we examine the discriminability among all four conditions (Section 3.1). This non-independence should be taken into consideration when interpreting the results from the ROI-based analyses. However, the fact that the results of the whole-brain searchlight analyses, which do not suffer from such non-independence problems, look similar to those of the ROI-based analyses largely alleviates the concerns. (Haxby et al., 2001;Kriegeskorte et al., 2006) that even voxels that do not show a particular functional signature relevant to the to-be-discriminated conditions can contribute to classification accuracy. For example, Haxby et al. (2001) showed that removing voxels from the ventral visual regions that respond most strongly to some visual category does not strongly affect the ability to discriminate that category from other categories. Consequently, voxels in the vicinity of language-sensitive regions in each individual subject may contain information about various aspects of linguistic knowledge even though they do not show the functional signature of language-sensitive voxels. And second, because we wanted to examine neural activity patterns across all four conditions, we could not use any of the conditions for defining subject-specific fROIs. (However, in addition to these whole-parcel-based analyses, we did conduct one analysis where we looked at the ability of subjectspecific functional ROIs (fROIs), defined by the Sentences > Nonword-lists contrast, to discriminate between word lists and jabberwocky sentences. The results of this analysis are reported in Appendix A.) For each condition we divided the data into odd-numbered and even-numbered runs (each subject performed between 6 and 8 runs total). Then, for each subject and for each ROI, and across the two independent halves of the data, we computed the within-vs. between-condition spatial correlations for each pair of conditions (as schematically shown in Fig. 4 above), considering all the voxels within the parcel. For example, to see how well the pattern of activity for the Sentences condition is discriminated from the pattern of activity for the Word-lists condition, we computed (i) a within-condition correlation value for the Sentences condition by comparing the pattern of activity for the Sentences condition in the odd vs. even runs (all the r values are Fisher-transformed); (ii) a within-condition correlation value for the Word-lists condition by comparing the pattern of activity for the Word-lists condition in the odd vs. even runs; and (iii) a between-condition correlation value by comparing the pattern of activation for the Sentences condition in the odd/even runs and for the Word-lists condition in the even/odd runs (these two values are averaged to create one between-condition value). Finally, for each ROI we performed an F-test on the within vs. between-condition correlation values across subjects to see whether the within-condition values are reliably higher than the between-condition values. If so, this would suggest that the distinction between the two conditions in question is represented in the relevant ROI.
We deviated from Haxby's analysis strategies in one way. In particular, Haxby applied centering to his data by subtracting the mean level of activation of a voxel from the activation level for each of the conditions. This is equivalent to considering the activation from each condition with respect to a baseline activation level computed as the average activation across all conditions, instead of using an independent fixation baseline as we used in our analyses. The centering procedure potentially increases sensitivity of the MVPAs by removing one source of variance from across the voxels and leaving only between-condition differences in play. However, centering also introduces between-condition dependencies in the estimation of the within-condition similarity measures, which complicates their interpretation. Our results do not only depend on processing methods, but they also require the significance thresholding of different classification tasks which might not be of equal difficulty. For instance, different features might lead to high or low classification because of the statistical properties of the features and not the Dependency" role" Part"of" Speech" Sentence" Length" Figure 10. Same as figure 8 with non-smoothed data.
way they are represented in the brain. We present below the comparison of the whole brain classification when different types of features are used. We compare these accuracies with the entropy of each feature set. We want to see if the difference in classification accuracy is due to differences in the entropy of each feature: it is harder to learn a model with features that change rarely in a story (low entropy), than it is to learn a model with features that occur very frequently. In our feature creation phase, we did explicitly exclude features with low entropy (for example, the location of scenes didn't vary much and we didn't include it). However, the features we did keep still vary in their frequency and we wanted to compare their entropies to their accuracies. For each feature set we compute the entropy of each feature, and then use the maximum entropy. The results are shown in the first row of tables 3 and 4. In the following rows, we show classification accuracy by feature set. For the smoothed data, the accuracy was initially low and was boosted by voxel selection as explained in Appendix F. There seems to be a modest relationship between the entropy of the features and how accurate classification is, in which feature sets with higher entropy lead to a higher accuracy. There might be other factors also affecting how easy the classification with different feature sets are. To avoid comparing the results of classification tasks that vary in difficulty, and as a way of leveling the playing field, we plot

Appendix H. Combining Subjects Spatially
Our concatenated Searchlight is not equivalent to spatial or cross-participant smoothing because, again, the voxels associated with each subject are treated independently. The only requirement is that the subjects are all normalized to the MNI space; we do not co-register the subjects and we learn the response of every voxel independently.
Assume we are interested in an area A that is distributed around a certain mean location (x, y, z) in all subjects. Then despite the subjects' anatomical variability and given an adequate model and an appropriate cube-size, the cube centered at (x, y, z) will contain in it the voxels from area A of all subjects. Running the classification at this cube should then hypothetically yield the best accuracy. This would be possible because, inside the cube, the voxels from all subjects are concatenated and they contribute independently to the Euclidean distance we compute in classification. The voxels' precise alignment is irrelevant at this step, it only matters that they are all taken into consideration. Therefore, this method identifies regions of a given size (in this case 15mm × 15mm × 15mm) in which the subjects are processing the same information. It avoids the problem usually encountered in averaging multiple subjects, which is that the only regions that are identified are the regions in which the subjects highly overlap. This problem is widely debated in the literature [Fedorenko et al., 2010].
Furthermore, despite the linearity of the model, this approach does not yield the same results as spatially smoothing the data in the cubes, because we have a multivariate input (the different story features) and while nearby voxels might be processing the same type of information (e.g. story characters), they are hypothetically coding different instances of this information (e.g. different story characters) with different patterns of activity for each instance.