Smoothing Spline ANOVA Decomposition of Arbitrary Splines: An Application to Eye Movements in Reading

The Smoothing Spline ANOVA (SS-ANOVA) requires a specialized construction of basis and penalty terms in order to incorporate prior knowledge about the data to be fitted. Typically, one resorts to the most general approach using tensor product splines. This implies severe constraints on the correlation structure, i.e. the assumption of isotropy of smoothness can not be incorporated in general. This may increase the variance of the spline fit, especially if only a relatively small set of observations are given. In this article, we propose an alternative method that allows to incorporate prior knowledge without the need to construct specialized bases and penalties, allowing the researcher to choose the spline basis and penalty according to the prior knowledge of the observations rather than choosing them according to the analysis to be done. The two approaches are compared with an artificial example and with analyses of fixation durations during reading.


Introduction
Two lines of statistical research, in combination, provide a very flexible framework for the analysis of data in psychology, linguistics, and many other fields [1][2][3]. First, smoothing splines offer a flexible framework for modeling of observations given a set of covariates. Second, mixed models are an appropriate tool for modeling clustered/grouped data. They allow for an explicit account of random effects, which model deviations of individual behavior from the overall mean. The close relationship between spline estimation and mixed models in a Bayesian context [4] led to their combination in the unified framework of generalized additive mixed models [5,6]. A generalized additive mixed model (GAMM) can be seen as an extension of generalized linear mixed models (GLMMs), i.e. [7], by allowing smooth functions as fixed and random effects or as an extension of the generalized additive models (GAMs), [8], by explicitly including random effects.
As generalized linear mixed models (GLMM) are extensions of the linear one (LMM) that allow for an analysis of non-Gaussian distributed responses, the same generalization is possible in the context of the (linear) additive mixed models (AMM) towards the generalized additive mixed models (GAMM). Throughout this article, we restrict ourselves to AMMs while all methodological results are also valid for GAMMs.
In the context of LMMs, an observed variable y is modeled as a linear function of one or more fixed effects and a set of variance components to incorporate individual deviations from the fixed effects. AMMs extend LMMs by the means of replacing the linear fixed effects by arbitrary functions under the assumption that these functions are smooth or at least continuous (these functions are then called splines, i.e. [8]). Note that LMMs are therefore a special case of AMMs where the inferred spline is a linear function of the covariates. This introduces a great flexibility for the modeling of experimental data. In correspondence to the ANOVA decomposition of fixed effects in the context of LMMs, it is possible to perform similar decompositions in the context of AMMs. For example, the model z i = f(x i , y i ) + i can be decomposed into z i = c + f x (x i ) + f y (y i ) + f xy (x i , y i ) + i , where c can be interpreted as the model offset (intercept), f x (x) and f y (y) as the main effects and f xy (x, y) as the interaction effect. This decomposition allows to determine if the modeled response variable y is sufficiently described by the simple sum of the main effects f x and f y or if, in addition, the interaction effect f xy is needed as well. In the presence of further covariates, these main and interaction effects are called partial effects. In contrast to LMMs, AMMs formed solely by the sum of main effects, c + f x (x) + f y (y) are able to produce a rich structured surface in the x, y plane as the main effects by themselves are already arbitrary smooth functions. Therefore it is not easy to tell by visual inspection, if a given surface f(x, y) is expressible in terms of a sum of main effect splines or if an interaction effect is required as well.
Furthermore, in contrast to the ANOVA decomposition of LMMs, the decomposition of some given spline f(x, y) into main and interaction effect splines is not unique, without the specification in which sense the spaces of the main and interaction effect functions are separated. A widely accepted approach for a unique decomposition, is the so called smoothing spline ANOVA (SS-ANOVA) introduced in [9]. This decomposition constrains the main and interaction effects to have a 0-mean and further that the interaction effect has 0-marginals. This decomposition has the advantage that the interpretation in terms of main and interaction effects is closely related to the ANOVA decomposition of LMMs.
The SS-ANOVA decomposition is usually performed by fitting an AMM, that expresses the main and interaction effects explicitly as separate model terms, where the spline bases of the interaction effect terms and their associated penalties are systematically derived as so called tensor product spline bases. There are R packages fitting (G)AMMs (i.e. mgcv, gamm4, gss [10][11][12][13]) that implement this decomposition, hence providing a convenient access to this type of decomposition. The broad availability of this method made it the standard method of spline decomposition. However, the restriction of the interaction effects on the basis construction by tensor product splines leads to a choice of basis according to the analysis of the statistical model rather than being guided by what may be known about the nature of the observed data. In our opinion, this is problematic as it may reduce the statistical power of the AMM, especially in cases where only a relatively small amount of data is available (compare section Comparison of methods with an artificial example below).
In this article we introduce an alternative approach to the SS-ANOVA decomposition of splines in the context of AMMs. This approach maintains the freedom to choose any basis for the description of the data while providing the same interpretation of the decomposition. This is achieved by decomposing a fitted AMM post-hoc, which has been constructed using arbitrary spline bases and penalties, chosen for an optimal description of the observed data.
In our opinion, the choice of the interaction effect basis is crucial as the conventional restriction to tensor-product spline bases may ignore prior knowledge about the observed data or about the underlying process that generated that data. Incorporating as much prior knowledge about the observations as possible into the AMM fit ensures an optimal description of the observed data by the resulting model. Not taking into account the nature of the data, may decrease the predictive power of the fitted model. The novel method presented in this article, allows for a choice of the interaction effect basis solely by prior knowledge. Obviously, if the optimal basis for the description of the observed data is the tensor product basis, i.e for analyses as those carried out in [14,15], the two methods are equivalent.
In the next section we briefly introduce the SS-ANOVA decomposition as described in [9] and the post-hoc decomposition of AMMs. In section Comparison of methods with an artificial example we showcase the effects of neglecting prior knowledge about the data by an artificial example; in section Application to fixation durations during reading the new method is applied to the analysis of fixation durations during reading.

SS-ANOVA Decomposition
As mentioned above, we want to explain N observations z i where i = 1..N, usually referred as the dependent variable in terms of two other measured quantities x i and y i , usually referred as the covariates, as where f is a smooth, at least continuous function. Typically we assume that f is an element of a reproducing kernel Hilbert space (RKHS) V with 1 2 V and that i * N(0, σ 2 ) represent the residuals, that is the part of the observations that cannot be explained by the function f. For the sake of definiteness we will work through the section with data in the unit square x i , y i 2 [0, 1]. All methods discussed here also generalize to splines of more than two variables and arbitrary intervals. If there is a non-degenerated quadratic functional J defined on V, with J(f) ! 08f 2 V, J(f) = 0 , f = 0 and λ > 0, the minimization problem For the sake of simplicity we only consider non-degenerated J. If J is degenerated such that 9f 2 V, f 6 ¼ 0: J(f) = 0, the RKHS V, has to be restricted on the orthogonal complement null space of J, i.e. [16].
Although the optimization is taken over an infinite dimensional space, the minimizer is located in a finite dimensional subspace. If R(Á, Á; x, y) is the reproducing kernel (RK) associated with V such that hR, fi V = f, then f 2 V can be written as and the quadratic functional J(f) can be expressed as where J ij = R(x i , y i ;x j , y j ). This makes the spline actually computable viâ N ðâ; P a Þis then the posterior distribution ofã given some observations z i and a prior distribution ofã as N 0; s 2 l J À1 ij .
An explicit example of a penalty term J(f) which is used frequently is This penalty implies the assumption of an isotropic smoothness. This means that the wigglyness of the function in x, y and all diagonals in the x, y-plane is penalized equally. Please note, that this particular penalty is degenerated, as all constant and linear functions are unpenalized.
In general, splines in RKHS are a very versatile tool, they allow for a description of data incorporating a-priori knowledge about it, like the assumption of smoothness above, by choosing an appropriate quadratic penalty J(f) on the spline. On the other hand one may be interested in a decomposition in terms of main and interaction effects, like Here c is a global offset (usually refereed as the model intercept), the functions f x and f y describe the part that can be explained by x and y individually, whereas f xy is called an interaction term that describes the part of z that needs x and y in a coupled way for the explanation.
The problem however is that (2) is highly non-unique as c can be absorbed into, e.g., f x , also f x and f y into f xy by redefining the latter ones. A possible way to define an unique decomposition was proposed by [9]. It requires that f x , f y and f xy have zero means and that further f xy has zero marginals where the Lebesgue measures dx and dy may be generalized to some probability measures that enable to incorporate the distribution of the observations x i , y i . This decomposes the space V into L 2 -orthogonal subspaces V 0 , V x , V y and V xy , such that and provides a unique definition of the functions c, f x , f y and f xy , which can be associated trivially with their corresponding member in the spaces V 0 , V x , V y and V xy respectively. Furthermore, these properties allow for a direct interpretation of the single terms as model intercept, main and interaction effects.
The orthogonal projectors onto these spaces can be defined using the following averaging operators ðA y f ÞðxÞ ¼ With these averaging operators, the model intercept, main and interaction effects are uniquely obtained as Therefore, there are two ways to obtain a decomposition like (2). The first approach starts from the one-way decomposition of marginal splines and construct the bases and penalties for V 0 , V x , V y and V xy . This approach is generally known as the SS-ANOVA decomposition as described by Gu [9]. An alternative approach, presented here, fits a bivariate spline f(x, y) to the observations and decomposes the resulting spline post-hoc using the averaging operators defined above. This can be performed numerically for any number of covariates and even analytically for some RKHSs (i.e. the bivariate thin plate spline, see supplement). This approach however, is more general since it contains the classic approach as a special case. Further it allows to choose the RKHS freely to describe the observations, in contrast to the classic approach which resorts to tensor product splines.
As mentioned before, the classic SS-ANOVA approach restricts the construction of the RKHS V of the spline f to be a tensor product of two RKHSsṼ x andṼ y for the marginals in x and y respectively as V =Ṽ x Ṽ y . Given the RKR x (Á; x) andR y (Á; y) for these spaces, R(Á, Á; x, y) =R x (Á; x)R y (Á; y) is the RK for the product space V. Further, if the marginal spacesṼ x and V y can be decomposed using the averaging operators defined above into i.e.
The decomposition ofṼ y can be obtained analogously.
With this decomposition of the marginal spacesṼ x andṼ y , the product space V decomposes naturally into spaces for the intercept, main and interaction terms as with the RK R 0 , R x , R y and R x ÁR y respectively. This allows to describe each term of the decomposition of f = c + f x + f y + f xy independently by its own RKHS. The joint penalty is then given by J(f) = J 0 (c) + J x (f x ) + J y (f y ) + J xy (f xy ). Usually one introduces a weighting for each penalty term, i.e.J ðf Þ ¼ y À1 xy J xy ðf xy Þ. This allows to treat some terms as unpenalized by setting the corresponding θ to 1, i.e. by setting θ 0 = 1, the intercept term c gets unpenalized. Please note, that in this case the joint penaltyJ gets degenerated, hence it only defines a semi-norm on its associated RKHSṼ . This construction also generalizes to more than two covariates. A more general description and examples are given in [9].
This systematic construction of the RK R x , R y and R xy from the RKR x andR y of the marginal spaces allows for an implementation of the SS-ANOVA decomposition into general purpose software packages, for example the R [13] packages gss [12] and mgcv [17]. Unfortunately this also requires that the observations are described by a tensor product spline, possibly neglecting a priori knowledge about the observations. For example, unless the marginal RK functions are Gaussians, it is not possible to integrate the prior assumption of a radial symmetry (isotropy) of smoothness. However, this assumption can be incorporated into the fit of the bivariate spline f(x, y), i.e. by a thin plate spline.
In the following we outline a method that allows for an SS-ANOVA decomposition of arbitrary, multivariate splines without any conditions to the selected RK. In general, this method can not be carried out analytically, except for some special cases like the bivariate thin plate spline. It relies on the fact, that if it is possible to describe the data well with a single multivariate spline, the SS-ANOVA decomposition can be carried out post-hoc using the averaging operators defined above, once the multivariate spline is determined. This allows for the choice of the RKHS according to the prior knowledge about data or the underlying process instead of resorting to a certain class of RKHS that is required by the decomposition to be performed.
Again, let the observations z i be well described by a single bivariate spline f(x, y) such that z i = f(x i , y i ) + i and the spline be an element of the RKHS defined by the RK R(x, y;x 0 , y 0 ), hence the spline can be parametrized as f(x, y) = ∑ i α i R(x, y;x i , y i ) for a given set of observations. The function f can then always be decomposed uniquely using the averaging operators above into a constant component c = A x A y f(x, y), components depending on a single variable only f y (x) = A y (1 − A x ) and f y (y) = A x (1 − A y )f(x, y) and a component capturing the part of f that can not be explained in terms of a sum of the offset and marginals, f xy (x, y) In the most general case these averaging operators are weighted integrals over the spline f and therefore these projections can be carried out at least numerically. Alternatively, instead of integrating directly over the spline, it is possible to integrate over the reproducing kernel, such that the main and interaction effects are expressed in terms of weighted sums of the averaged RKs, where the coefficients α i are the spline coefficients which can be estimated from the given data with eq. (1) and the covariance of i.e. the model intercept c is then given by Please note, that if the reproducing kernel R(Á, Á;Á, Á) is formed as a tensor product of univariate marginal splines, this approach is identical to the classic SS-ANOVA approach presented above.
For some special cases, the application of the averaging operators on the reproducing kernel can be carried out analytically. In this case a numerical integration is not necessary and the decomposition can be evaluated directly using the analytical expressions for the averaged reproducing kernels. In S1 Text, the integrals over the RK of a bivariate thin plate spline are given.

Comparison of methods with an artificial example
To demonstrate the effects of neglecting prior information about the data on the spline estimator, we performed SS-ANOVA decompositions by using tensor product splines and the posthoc decomposition method on a set of 100, relatively small samples from a known function f ðx; yÞ ¼ 2ðx À 1 2 Þ þ sinð2pyÞ þ sinð2pxÞ Á cosð2pyÞ (see Fig. 1 a). Each sample consists of 30 (small sample set) and 300 (big sample set) values of f(x, y), sampled uniformly and independently from the [0, 1] 2 plane with additional noise $ N ð0; 1 4 Þ to represent observational noise. For the tensor product spline approach, a tensor product of two cubic regression splines is chosen while for the post-hoc decomposition, a single bivariate thin plate spline is fitted to the data.
In Fig. 1, mean and standard deviation of the predictors for the decomposition (f x ,f y and f xy ) are shown. The post-hoc estimators of the main effects, show the much smaller variance compared to the estimators by the tensor product spline approach (for N = 30, compare Fig. 1  b). The trace of the estimated covariance matrix (see Table 1) allows for a quantitative comparison of the variability of the spline estimators).
The variability of the estimators (f x ,f y andf xy ) from the small data sets (N = 30), using the tensor product approach is much higher compared to the variability of the post-hoc estimators. This is explained by the additional a priori information used by the post-hoc approach, which assumes isotropic smoothness of the spline in contrast to the SS-ANOVA decomposition using tensor product spline. This prior information gets less important as more observations are added, which results in almost identical SS-ANOVA decompositions for a larger data set (N = 300, compare Fig. 1 c).
Please note that for the particular example above, the true function f(x, y) has only almost isotropic smoothness which implies a small additional bias on the estimate of main and interaction effects (see Table 1).
In order to verify our method, we conducted a second simulation, where the underlying function has isotropic smoothness by construction. Here we sampled from the function f ðx; yÞ ¼ 2ðx À 1 2 Þ þ 2ð 1 2 À yÞ þ exp À ððxÀ0:5ÞþðyÀ0:5ÞÞ 2 0:08 . Like in the first example, the variability and the bias of the spline estimates are obtained. While the biases of the tensor product and post-hoc decomposition approaches are comparable for this example, the variability of the post-hoc decomposition approach is generally smaller, especially in cases of small samplessizes (compare Table 2). sample size, here N = 300. As more statistical evidence is provided by the data, the a priori knowledge used for the post-hoc decomposition (isotropic smoothness) has a smaller influence on the outcome. Therefore the results are almost identical.
doi:10.1371/journal.pone.0119165.g001 Table 1. Comparison of the variability and mean squared bias (MSB) of the spline estimators from small and large data sets of example 1. The variability is given as the trace over the covariance matrix of the spline evaluated on a regular grid, while the MSB is the squared bias of these spline estimates averaged over that grid. Each spline was fitted to one of 100 independent subsets of the complete dataset.
tensor prod. post-hoc trð P^f In general, if no additional assumption about the underlying function can be made, the tensor product spline will be the most general approach to describe the data by the means of spline functions. In these cases the post-hoc decomposition of a tensor product spline will have no advantage over the classic SS-ANOVA decomposition and any additional (unjustified) assumption implied by the chosen spline penalty will result in a biased estimate (compared to the tensor product spline). If, however, the underlying function satisfies the a-priori assumptions, the post-hoc approach allows for an ANOVA decomposition of smoothing splines that incorporate these assumptions, reducing the variability of the spline estimates without increasing the bias compared to the tensor product approach.
We therefore suggest that in cases where no a-priori knowledge about the underlying functions is present, a two step method should be used. In the first step, a tensor product spline is fitted to the data in order to get some information about the generating function. If the result of the first step suggests, for example that the generating function can be described well by a thin-plate spline, the post-hoc decomposition should be used to refine the first estimates.

Application to fixation durations during reading
During reading the eyes move in alternations of pauses (i.e., fixations lasting between 150 and 300 ms) and quick movements (i.e., saccades of 10 to 30 ms) which carry the eyes on average five to ten letters forward. Visual information is processed only during fixations; we are practically blind during saccades. Fixation durations are sensitive to processing difficulty. For example, they are short for frequent words (such as prepositions and conjunctions) and long for rare words. A word's frequency is measured as the logarithm of its occurrence in 1 million printed words. Fixation durations are also sensitive to word length (i.e., fixations are longer for long words, see Fig. 2 b-d) and increase with the amplitude of the last saccade (see Fig. 2 a). Therefore, these variables were also included as covariates in the following AMMs, but the focus here was on frequency effects. We analyzed around 68000 fixations that were the first and only fixation on a word; the fixations were bordered by the eyes entering the word from the left and leaving the eyes to the right (i.e., they were first-pass single fixation durations). Further, in order to reduce model complexity, only those fixations were considered where the neighboring words (N − 1 and N + 1) were fixated too. These fixations form the majority (% 68000 out of 118000) of all first-pass single fixations.
Fixations were measured on 144 sentences, read by 275 German readers; for details see [18,19]. Readers differ reliably in their average fixation duration. Therefore, we fitted an additive mixed model, estimating also a variance component for random effects of readers in fixation durations. The following models where fitted to the data: t N ¼ c 0 þ s A ðA N Þ þ s l;NÀ1 ðl NÀ1 Þ þ s l;N ðl N Þ þ s l;Nþ1 ðl Nþ1 Þ þ s n;NÀ1 ðn NÀ1 Þ þ s n;N ðn N Þ þ s n;Nþ1 ðn Nþ1 Þ þ t n;NÀ1;N ðn NÀ1 ; n N Þ þ t n;N;Nþ1 ðn N ; n Nþ1 Þ þ r id þ r w þ : The two additive mixed models (4) and (5) describe the log fixation duration τ N on a word in terms of the same covariates: l N−1 , l N and l N+1 are the word lengths (measured in logarithmic units, Fig. 2 c) of the previous, the fixated and next word, respectively, and ν N−1 , ν N and ν N+1 are the word frequencies (also measured in logarithmic units, Fig. 3 a) of them; the term c 0 represents the model intercept, A N the amplitude of the incoming saccade (measured in letters, Fig. 2 a), r id the random effect intercept for each participant, r w the random effect intercept for each fixated word and the model residuals. The first model (4) was fitted to the data and the terms s ν, N−1, N (ν N−1 , ν N ) and s ν, N, N+1 (ν N , ν N+1 ) were then decomposed post-hoc into main and interaction frequency effects. These two splines were chosen to be thin plate splines, implying the a priori assumption of isotropic smoothness. The second model (5) was constructed according to [9] to perform the SS-ANOVA decomposition using tensor product splines without the isotropy assumption. Therefore the frequency main effects s ν, N−1 (ν N−1 ), s ν, N (ν N ) and s ν, N +1 (ν N+1 ) are expressed explicitly as terms of the AMM and the interaction effects t ν, N−1, N (ν N−1 , ν N ) and t ν, N, N+1 (ν N , ν N+1 ) are constructed as tensor product splines, which incorporates no additional assumption about the data.
The novel aspect of the present AMM is a spline-based re-evaluation of distributed processing during reading, that is of simultaneous processing of several words during fixations. Fixation durations depend not only on the frequency of the fixated word N, but also on the frequencies of the words to the left (N − 1) and to the right (N + 1) of the current fixation   [18][19][20]. Thus, during a fixation we may simultaneously observe effects of the frequencies of at least three words. Most striking, however, is the difference between these three duration-frequency relations. As shown in Fig. 3 a, they are (a) monotonic for word N − 1 (left), (b) clearly non-monotonic for word N (middle) and (c) also for word N + 1 (right). Thus, the difficulty of word N − 1 is strongly expressed in fixations on word N, but the frequency of the upcoming word N + 1 has only a weak effect on this fixation. The non-monotonic profile for the N-frequency effect is consistent with other evidence for distributed-processing constraints [19]. The reliability of specific shapes associated with frequency effects have been established with third-order polynomial trends for different groups of readers, for example young and old adults, and for reading sentences in the expectation of easy or difficult questions [21].
As mentioned above, we are in the comfortable situation of having a relatively large dataset (% 68000 fixations). Following from the results of the previous we may expect the results of the decompositions to be almost independent of the chosen method. As shown in the Fig. 3 a, this is indeed the case and the decompositions using the tensor product spline and the post-hoc approach reveal comparable results.
To compare these two methods and the effect of neglecting prior information about the data, we divided the complete dataset into smaller sets of 200 samples taken randomly from the complete set. The decomposition into the frequency main and interaction effects is then performed for each subset independently. The mean and standard deviation of the resulting main effects are shown in Fig. 3 b. Although the advantage of the post-hoc decomposition is barely visible in Fig. 3b, the variability of the interaction effect splines obtained by the means of a post-hoc decomposition is much smaller compared to the variability of those obtained by the tensor product spline approach (compare Table 3).
A novel question in this line of research is whether it is sufficient to model the three frequency effects relating to words N − 1, N, and N + 1 as three main effects or whether, in addition to these main effects, we also need two bivariate interaction terms capturing, for example, (a) the joint effect of frequencies of word N − 1 and word N (i.e., the left two words) and (b) the joint effect of frequencies of word N and word N + 1 (i.e., the right two words). Fig. 4, middle row, displays the two corresponding surfaces for the bivariate-TPS based post-hoc decomposition. Fig. 4, bottom row, shows the parts of the partial interaction effects which are point-wise significant, means all points in the frequency plane where the spline estimate, i.ê s n;NÀ1;N ðn NÀ1 ; n N Þ, is larger than twice its standard deviation. Obviously, there are significant interaction effects. Particularly in the cases of a low-frequent word N − 1 and a medium-frequent word N as well as in the case of high-frequent words N and N + 1.
using the tensor product approach (top row) and the post-hoc decomposition (bottom row) repeatably performed on a small subset (400 samples) of the complete data set. The confidence intervals show the standard deviations of the mean estimators over all repetitions.
doi:10.1371/journal.pone.0119165.g003  Although, these interaction effects are relatively well localized in the word-frequency planes and therefore explain only a small portion of the response surfaces (Fig. 4, top row), their contribution to the fixation duration must be considered as theoretically relevant.
In summary, AMMs are very useful for the description of non-monotonic main effects (and their interactions) on fixation durations in reading research.
Supporting Information S1 Text. Explicit post-hoc decomposition of bivariate TPS. (PDF)