Multivariate functional group sparse regression: Functional predictor selection

In this paper, we propose methods for functional predictor selection and the simultaneous estimation of smooth functional coefficients in a scalar-on-function regression problem under a high-dimensional multivariate functional data setting. In particular, we develop two methods for functional group-sparse regression under a generic Hilbert space of infinite dimension. We show the convergence of the algorithms and the consistency of the estimation and the selection (the oracle property) in infinite-dimensional Hilbert spaces. Simulation studies show the effectiveness of the methods in both the selection and the estimation of functional coefficients. The applications to functional magnetic resonance imaging (fMRI) reveal regions of the human brain related to ADHD and IQ.


Introduction
In the past decades, functional data analysis (FDA), in which an entire function is treated as a single observation, has received great attention. Ramsay and Silverman (2005) introduced a general framework for FDA, and many other researchers have investigated estimation and inference methods for functional data; see Yao et al. (2005a), Yao et al. (2005b), Horváth and Kokoszka (2012), and Wang et al. (2016). More recently, FDA has been extended to multivariate functional data, which treat multiple functions as a single observation; see Chiou et al. (2016) and Happ and Greven (2018). However, sparseness of the functional predictors in the multivariate model has not been studied as well as in the univariate case. Hence, we aim to develop theory and algorithms for sparse functional regression methods with functional predictor selection when the responses are scalars and the predictors are high-dimensional multivariate functional data.
Under the multivariate setting, numerous sparse models have been studied with the introduction of the L1-penalty. The least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996) adds a penalty term to the least-squares cost function and performs both variable selection and shrinkage. LASSO-type penalties, such as the Elastic Net (Zou and Hastie, 2005), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), and their modifications (the adaptive LASSO, Zou, 2006; the adaptive Elastic Net, Zou and Zhang, 2009), were developed to overcome the theoretical and practical limitations of the LASSO, such as saturation. These methods enjoy asymptotic properties as the sample size increases, namely the estimation consistency and the selection consistency, together known as the oracle property.
Recently, sparse models have been extended to functional data. Initially, a majority of the literature sought sparseness in the time domain; examples include James et al. (2009) and related articles for univariate functional data, and Blanquero et al. (2019) for multivariate functional data. On the other hand, Pannu and Billor (2017) proposed a model considering sparseness in the functional predictors under the multivariate functional data setting. In particular, they introduced a model based on the least absolute deviation (LAD) and the group LASSO in the presence of outliers in the functional predictors and responses. Their numerical examples and data application show effectiveness in practice, but theoretical properties and a detailed algorithm have not been explored. To this end, we develop methods for a scalar-on-function regression model that allow sparseness in the functional predictors and the simultaneous estimation of smooth functional coefficients. To implement them with actual data, we derive two algorithms, one for each of the optimization problems. Finally, we show both the functional predictor selection consistency and the estimation consistency.
One motivating example for our methods is the application to functional magnetic resonance imaging (fMRI). The dataset consists of functional signals of brain activity measured by the blood-oxygen-level-dependent (BOLD) contrast, which detects hemodynamic changes driven by the metabolic demands that follow neural activity. There are prespecified regions of the brain, and the BOLD signals associated with the multiple voxels in each region are integrated into one signal for that region. Thus, the fMRI data are considered multivariate functional data in which each functional predictor represents the signals from one region of the brain. In Section 8, we regress the ADHD index on the regional BOLD activities of the fMRI scans of the human subjects. There are 116 regions of the brain in the data, and our methods reduce them to 41 regions with significantly lower errors than the linear functional regression. Figure 1 displays the regions of the brain atlas identified by our method. It shows that the methods simplify the data analysis and provide a clear representation while keeping the crucial information. The analysis points to a pressing need for such methods in the medical and life sciences as well as other related areas. The following quote from Bandettini (2020) further motivates the study of sparse multivariate functional regression in the field of fMRI.
"Think of the challenge of the fMRI with the analogous situation one would have if, when flying over a city at night, an attempt is made to determine the city activities in detail by simply observing where the lights are on. The information is extremely sparse, but with time, specific inferences can be drawn." (Peter A. Bandettini, fMRI, 2020)

The rest of the paper is organized as follows. In Section 2, we introduce the general framework of our methods along with the notation used in this paper. In Section 3, we describe the model and the optimization problems that we consider. We then develop an explicit solution to the optimization problem and illustrate a detailed procedure using the alternating direction method of multipliers (ADMM) in Section 4. We also derive another algorithm, the groupwise-majorization-descent (GMD), along with the strong rule for faster computation, in Section 5. In Section 6, we develop the asymptotic results, including the consistency of our methods and the oracle property. In Section 7, we show the effectiveness of the methods through simulation studies. In Section 8, we apply the methods to a resting-state fMRI dataset. Concluding discussions are given in Section 9. Finally, the appendix includes all of the proofs and the list of the regions of the brain associated with the ADHD and IQ scores. We created an R package, MFSGrp, for the computation; it is available at https://github.com/Ali-Mahzarnia/MFSGrp.

Preliminary and notation
Let (Ω, F, P) be a probability space. For j = 1, . . ., p, let T_j be a compact set in R^{d_j}, and let H_1, . . ., H_p be separable Hilbert spaces of functions from T_j to R. Let H = H_1 × · · · × H_p with the inner product ⟨f, g⟩_H = Σ_{j=1}^p ⟨f_j, g_j⟩_{H_j} for any f = (f_1, . . ., f_p)^T ∈ H and g = (g_1, . . ., g_p)^T ∈ H. Then H is also a separable Hilbert space. Let X : Ω → H be a measurable function with respect to F_X/B_X, where B_X is the Borel σ-field generated by the open sets in H.
Let X be a random element in H. If E‖X‖_H < ∞, then the linear functional f ↦ E⟨f, X⟩_H is bounded. By the Riesz representation theorem, there is a unique element of H, say μ_X, such that ⟨μ_X, f⟩_H = E⟨f, X⟩_H for any f ∈ H (Conway, 1990). We call μ_X the mean element, or expectation, of X. If we can further assume E‖X‖²_H < ∞, the covariance operator Γ_XX = E(X ⊗ X) : H → H exists and is a Hilbert–Schmidt operator, where ⊗ indicates the tensor product defined so that for x, y, z ∈ H, (x ⊗ y)(z) = ⟨y, z⟩_H x (Hsing and Eubank, 2015).
Let Y be a random element in H_Y. Subsequently, we can define the covariance operator between X and Y as Γ_YX = E(Y ⊗ X), which maps from H to H_Y; Γ_XY can be similarly defined. For convenience, throughout this paper we assume that E(X) = 0 and E(Y) = 0 without loss of generality. Hence, the regression model is Y = ⟨β, X⟩_H + ε, where β ∈ H is the unknown coefficient function and ε is an error term, a mean-zero random variable independent of X. Consider Y as a scalar random variable. We can write β(·) = (β_1(·), . . ., β_p(·))^T and ⟨β, X⟩_H = Σ_{j=1}^p ⟨β_j, X_j⟩_{H_j}.

Model description
We are interested in situations where the predictors are multivariate functions but only a few functional predictors affect the response. That is, a scalar random variable Y and random functions X_j ∈ H_j satisfy

Y = Σ_{j ∈ A} ⟨β_j, X_j⟩_{H_j} + ε,    (2)

where A ⊆ {1, . . ., p} is the unknown active set of indices involved in this regression model, and ε is a mean-zero error term independent of X.
Assume that we have a random sample of size n from the model (2). To estimate β and the active set A, we propose the following objective function:

argmin_{β ∈ H} E_n(Y − ⟨β, X⟩_H)² + λ_{1n} Σ_{j=1}^p ‖β_j‖_{H_j},    (3)
where E_n denotes the expectation with respect to the empirical distribution. We add the group-LASSO-type penalty so that each group consists of one functional component in the infinite-dimensional Hilbert space H_j, j = 1, . . ., p. Note that the norm in the penalty term is the L2-norm, which makes the objective function convex. In addition, we propose an alternative objective function to obtain a more stable solution path:

argmin_{β ∈ H} E_n(Y − ⟨β, X⟩_H)² + λ_{1n} Σ_{j=1}^p ‖β_j‖_{H_j} + λ_{2n} Σ_{j=1}^p ‖β_j‖²_{H_j}.    (4)
The quadratic term gives a stable solution path and encourages further grouping effects. It is similar to the Elastic Net proposed by Zou and Hastie (2005), but differs in that the norm in the first penalty term is the L2-norm and both penalties are applied group-wise. The group-wise second penalty also gives a large computational advantage. Furthermore, we consider a smoothing penalty on the functional coefficients β ∈ H by adding the term λ_{3n}‖β″‖²_H to the objective functions (3) and (4). It allows us to estimate smooth functional coefficients and to select the functional predictors simultaneously. In addition, it provides a better interpretation of the functional coefficients in this linear functional regression model.

Estimation: ADMM
In this section, we develop an algorithm for solving the optimization problems introduced in Section 3 via the alternating direction method of multipliers (ADMM), which is popular for general convex optimization problems; see Boyd et al. (2011). Consider the following optimization problem:

argmin_{β, γ ∈ H} f(β) + g(γ) subject to β − γ = 0,    (5)

where f(β) = E_n(Y − ⟨β, X⟩_H)² and g(γ) = λ Σ_{j=1}^p ‖γ_j‖_{H_j}. The blocks γ_j are associated with their counterpart blocks β_j. The augmented Lagrangian with parameter ρ > 0 is

L_ρ(β, γ, η) = f(β) + g(γ) + ⟨η, β − γ⟩_H + (ρ/2)‖β − γ‖²_H,    (6)

where η ∈ H is the Lagrange multiplier. The ADMM update rules are

β^{k+1} = argmin_β L_ρ(β, γ^k, η^k),  γ^{k+1} = argmin_γ L_ρ(β^{k+1}, γ, η^k),  η^{k+1} = η^k + ρ(β^{k+1} − γ^{k+1}).    (7)

For computational convenience, it is usual practice to consider the scaled dual parameter of the ADMM. Let u = (1/ρ)η. It is straightforward to verify that the update rules (7) with the scaled dual parameter are equivalent to

β^{k+1} = argmin_β {f(β) + (ρ/2)‖β − γ^k + u^k‖²_H},  γ^{k+1} = argmin_γ {g(γ) + (ρ/2)‖β^{k+1} − γ + u^k‖²_H},  u^{k+1} = u^k + β^{k+1} − γ^{k+1}.    (8)

Coordinate representation of functional data
Our method is based on the basis-expansion approach to functional data. Suppose that we have n random copies from the model (2), denoted (X_1, Y_1), . . ., (X_n, Y_n), and that we observe X_i^j at the time points {t_{i1}^j, . . ., t_{i a_{ij}}^j} for each i = 1, . . ., n and j = 1, . . ., p.
At the sample level, we assume that H_j is spanned by a given set of basis functions B_j = {b_1^j, . . ., b_{m_j}^j}. Thus, for any f ∈ H_j, there exists a unique vector a ∈ R^{m_j} such that f(·) = Σ_{k=1}^{m_j} a_k b_k^j(·). We call the vector a the coordinate of f and denote it [f]_{B_j}. We also assume that H_j is equipped with the L2-inner product with respect to the Lebesgue measure. Let G_j be the m_j × m_j Gram matrix with (k, l)-entry ⟨b_k^j, b_l^j⟩_{H_j}, and let G be the M × M block-diagonal matrix whose j-th block is G_j, where M = Σ_{j=1}^p m_j. Consequently, for any f, g ∈ H, ⟨f, g⟩_H = [f]^T G [g], where [f] and [g] are the vectors obtained by stacking [f_j]_{B_j} and [g_j]_{B_j}, respectively. We use the basis-expansion approach for each functional covariate X_i^j for i = 1, . . ., n and j = 1, . . ., p, as is also done in Song and Li (2021) and Li and Song (2018). Without loss of generality, we assume m = m_1 = · · · = m_p and M = pm.
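As a concrete illustration, the coordinate identity ⟨f, g⟩_H = [f]^T G [g] can be checked numerically. The sketch below uses a monomial basis on [0, 1] and a Riemann-sum quadrature; the basis choice, grid size, and coordinates are illustrative, not part of the paper's implementation.

```python
import numpy as np

# Sketch: with basis B = {b_1, ..., b_m} and Gram matrix G_kl = <b_k, b_l>,
# the L2 inner product of f = sum_k a_k b_k and g = sum_l c_l b_l equals
# [f]^T G [g]. The monomial basis below is an illustrative choice.

m = 4
t = np.linspace(0.0, 1.0, 20001)
dt = t[1] - t[0]
B = np.vander(t, N=m, increasing=True)       # columns: 1, t, t^2, t^3

def integrate(vals):
    """Riemann-sum approximation of the integral over [0, 1]."""
    return float(np.sum(vals) * dt)

# Gram matrix of the basis, computed entrywise.
G = np.array([[integrate(B[:, k] * B[:, l]) for l in range(m)]
              for k in range(m)])

a = np.array([1.0, -2.0, 0.5, 3.0])          # coordinates [f]
c = np.array([0.0, 1.0, 1.0, -1.0])          # coordinates [g]

lhs = integrate((B @ a) * (B @ c))           # <f, g> computed directly
rhs = float(a @ G @ c)                       # coordinate formula [f]^T G [g]
```

Because both sides are computed with the same quadrature, they agree up to floating-point rounding, which is exactly the point of working in coordinates.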
Suppose that A is a linear operator from H_1 to H_2, where H_1 has the basis B = {b_1, . . ., b_m} and H_2 has the basis C = {c_1, . . ., c_k}. Then we define the coordinate representation of the operator A to be the k × m matrix, say [A]_{B,C}, satisfying [Af]_C = [A]_{B,C}[f]_B for all f ∈ H_1. For notational convenience, when the basis system is obvious from the context, we drop the subscripts of the coordinate representation throughout this paper. The following lemma provides a further simplification for easy computation.
In addition, let Y be the n-dimensional vector whose elements are the observations Y_1, . . ., Y_n. Then the sample objective function can be written entirely in terms of the coordinates.

Orthogonalization
To achieve computational efficiency, we orthonormalize the basis system via the Karhunen–Loève expansion of the covariance operator of each functional predictor. For each j = 1, . . ., p, define Γ_jj to be the covariance operator of X_j. We then have the following lemma.
Lemma 2. Let (λ_1^j, v_1^j), . . ., (λ_m^j, v_m^j) be the eigenvalue–eigenvector pairs of the coordinate matrix of Γ_jj, and define the m × m matrix whose columns are these eigenvectors. Since the corresponding eigenfunctions φ_1^j, . . ., φ_m^j of the self-adjoint operator Γ_jj are orthonormal, for any x ∈ H_j we have x = Σ_{k=1}^m ⟨x, φ_k^j⟩_{H_j} φ_k^j. Define C_j = {φ_1^j, . . ., φ_m^j} to be the new basis system for H_j. Then the coordinates [X_i^j]_{C_j} are obtained from [X_i^j]_{B_j} by this linear transformation for i = 1, . . ., n and j = 1, . . ., p. We assume that the coordinates on H are based on this orthonormal basis system throughout this section.
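The orthonormalization step can be mimicked in coordinates: the eigendecomposition of a symmetric positive-definite coordinate matrix yields a linear map under which the transformed basis has the identity Gram matrix. The matrix and the names below are illustrative stand-ins, not the paper's data.

```python
import numpy as np

# Sketch of orthonormalization via eigendecomposition: given a symmetric
# positive-definite Gram (or coordinate covariance) matrix G, the map
# T = V diag(eigvals)^(-1/2), with G = V diag(eigvals) V^T, transforms the
# coordinates so that the new basis is orthonormal (identity Gram matrix).

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
G = A.T @ A + 0.1 * np.eye(4)             # symmetric positive definite

eigvals, V = np.linalg.eigh(G)            # G = V diag(eigvals) V^T
T = V @ np.diag(eigvals ** -0.5)          # change-of-coordinates map

# In the new coordinates the Gram matrix is the identity: T^T G T = I.
I_new = T.T @ G @ T
```

The identity T^T G T = I is exactly what makes the Gram matrices disappear from the ADMM updates after orthogonalization.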

Estimation
Using this representation, we can express the optimization (8) in terms of the coordinates; we denote the resulting finite-dimensional problem by (9). Under the finite-dimensional representation of the functional elements of H, one can see that the optimization in (9) is a convex optimization problem.
Theorem 1. The solution to the optimization problem (3) can be achieved by iterating over the following update rules:

[β]^{k+1} = ([X]^T[X]/n + ρI)^{-1}([X]^T Y/n + ρ([γ]^k − [U]^k)),    (10)
[γ_j]^{k+1} = (1 − λ/(ρ‖[β_j]^{k+1} + [U_j]^k‖))_+ ([β_j]^{k+1} + [U_j]^k),  j = 1, . . ., p,
[U]^{k+1} = [U]^k + [β]^{k+1} − [γ]^{k+1},

where [X] is the n × M matrix whose i-th row is [X_i]^T, [γ_j] and [U_j] are the blocks corresponding to [β_j], and (z)_+ = max(z, 0). If we did not consider orthogonalization, Theorem 1 would contain the Gram matrices G_j in the updates, and the proof of numerical convergence of the update rules would differ slightly from that of Boyd et al. (2011). However, due to the orthogonalization, the proof that the updates in Theorem 1 converge numerically to the solution of the optimization problem (3) is identical to that of the ADMM in Boyd et al. (2011), and hence is omitted.

Different penalty terms
In this section, we investigate the different penalty terms in two directions: one for the functional predictor selection, and the other one for the smooth coefficient functions β.

Multivariate Functional Group Elastic Net
The LASSO does not always provide a unique solution. To achieve uniqueness and overcome the saturation property, the Elastic Net penalty was introduced by Zou and Hastie (2005) for multivariate data by combining the ℓ1-norm and the ℓ2-norm. Functional data are intrinsically infinite-dimensional objects. Thus, we propose a multivariate functional version of the Elastic Net optimization problem by grouping each functional predictor as follows.
where α ∈ [0, 1] and λ > 0 are the tuning parameters. This optimization problem still follows the structure of the ADMM algorithm in (5) with g(γ) = λ(1 − α) Σ_{j=1}^p ‖γ_j‖_{H_j} + αλ Σ_{j=1}^p ‖γ_j‖²_{H_j}. It can easily be shown that the only difference from the original version is the γ-update in Theorem 1. Hence, we have the following update rules.
Theorem 2. The solution to the optimization problem (11) can be achieved by iterating over the update rules of Theorem 1 with the γ-update replaced by

[γ_j]^{k+1} = (1 + 2αλ/ρ)^{-1} (1 − λ(1 − α)/(ρ‖[β_j]^{k+1} + [U_j]^k‖))_+ ([β_j]^{k+1} + [U_j]^k),  j = 1, . . ., p.
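The γ-update under a group Elastic Net penalty is the proximal operator of λ(1 − α)‖z‖ + αλ‖z‖², which factors into a block soft-threshold followed by a uniform rescaling. The sketch below derives this closed form for a single block; the constants are illustrative.

```python
import numpy as np

# Sketch of the prox of g_j(z) = lam*(1-alpha)*||z|| + lam*alpha*||z||^2 with
# quadratic term (rho/2)*||gamma - z||^2: block soft-threshold, then rescale
# by 1/(1 + 2*lam*alpha/rho). Variable names follow the text; the numbers
# below are illustrative.

def group_en_prox(z, lam, alpha, rho):
    nz = np.linalg.norm(z)
    shrink = max(0.0, 1.0 - lam * (1.0 - alpha) / (rho * nz)) if nz > 0 else 0.0
    return (shrink / (1.0 + 2.0 * lam * alpha / rho)) * z

z = np.array([3.0, 4.0])                  # ||z|| = 5
out = group_en_prox(z, lam=1.0, alpha=0.5, rho=1.0)
# shrink factor: 1 - 0.5/5 = 0.9 ; rescale: 1/(1 + 1) = 0.5 ; so out = 0.45*z
```

Setting α = 0 recovers the plain group soft-threshold, consistent with the MFG-LASSO update.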
The regularization parameters can be adjusted through a net-search cross-validation.

Smoothness of functional coefficients β
According to our simulations, the previous algorithm provides wiggly estimates of the functional coefficients β most of the time. This might be acceptable if we were interested only in prediction, but it is not acceptable here, because we consider the linear functional regression. We propose an algorithm that simultaneously controls the roughness of β to avoid over-fitting and to obtain smooth functional coefficients. In particular, we impose a penalty on the curvature of the coefficients by adding (λ_der/2)‖β″‖²_H to the objective function (8). We include this term in the f(·) function of the ADMM structure. As a result, the first update rule (10) in Theorem 1 becomes

[β]^{k+1} = ([X]^T[X]/n + λ_der G″ + ρI)^{-1}([X]^T Y/n + ρ([γ]^k − [U]^k)),

where G″ is the block-diagonal matrix whose j-th block has (k, l)-entry ⟨(φ_k^j)″, (φ_l^j)″⟩, k, l = 1, . . ., m, j = 1, . . ., p. For each j, this block can be derived from the second-derivative Gram matrix of the original basis, ((B_j)″)_{kl} = ⟨(b_k^j)″, (b_l^j)″⟩, through the same change of basis used in the orthogonalization.

Tuning
The initial values for γ and U are zero, and the initial β is the ridge regression estimate from the first update rule (10). We set the augmented parameter, or step size, ρ to 1 and keep it fixed throughout the algorithm. Different values of ρ only change the value of the optimal λ on the grid, or of the optimal (1 − α)λ on the net: the larger the ρ, the smaller the optimized regularization parameter of the soft-threshold operator. In some applications of the augmented Lagrangian, it is possible to choose a small step size and increase it gradually toward 1 at each iteration. Boyd et al. (2011) also explain why ρ = 1 is a suitable choice in the ADMM algorithm. We use k-fold cross-validation to choose the mixing parameter α, the regularization parameter of the second-derivative penalty λ_der, and the main regularization parameter λ. In particular, for each α and each λ_der on the net, we search for the optimal λ. To pick the initial λ, we first find the ridge estimate of β with ρ = 1. We then compute the norm of each group of functional coefficients, ‖β_k‖. Note that in the second update of Theorem 1, the soft-threshold operator would eliminate all blocks if λ were slightly higher than the maximum of these norms; on the other hand, it would keep all the coefficients if λ were slightly lower than the smallest norm. Therefore, a reasonable procedure is to design a grid of λ's between a number slightly lower than the minimum of the block norms and a number slightly higher than their maximum.
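The grid design described above can be sketched as follows; the grid size, the margins around the extreme block norms, and the block norms themselves are illustrative choices.

```python
import numpy as np

# Sketch of the lambda-grid design: compute the norm of each coefficient
# block of a ridge estimate, then place a log-spaced grid between just below
# the smallest block norm and just above the largest, searched largest-first.
# The 0.9/1.1 margins and grid size are illustrative.

def lambda_grid(block_norms, n_grid=20):
    lo = 0.9 * min(block_norms)
    hi = 1.1 * max(block_norms)
    return np.geomspace(lo, hi, n_grid)[::-1]   # largest lambda first

norms = [0.4, 1.3, 2.6, 0.05]                   # hypothetical ridge block norms
grid = lambda_grid(norms)
```

With this bracketing, the largest grid value zeroes out every block and the smallest keeps every block, so the cross-validation path sweeps the full range of sparsity levels.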

Estimation: GMD
In this section, we derive the groupwise-majorization-descent (GMD) algorithm for solving the objective functions in Section 3. Unlike the ADMM, this algorithm is geared toward objective functions with group-wise penalty terms. Motivated by Yang and Zou (2015), we derive the GMD algorithm in our setting. In addition, we do not force the basis functions to be orthogonal, which gives us more flexibility. Thus, throughout this section, we use the coordinate system based on the original basis B without orthogonalization.

Algorithm
The MFG-Elastic Net problem without the orthogonalization is the coordinate version of (11) written with respect to the original basis B; we refer to it as (14). This optimization problem and the algorithm derived below also cover the MFG-LASSO (α = 0) and the ridge regression (α = 1) as special cases. In (14), we drop the factor n for computational convenience; it is adjusted when we construct the grid for λ_der and λ. We define the loss function L(β) as the quadratic part of (14).
Consequently, the objective function (14) is L(β) + g(β), where g(β) = λ(1 − α) Σ_{j=1}^p ‖β_j‖ + αλ Σ_{j=1}^p ‖β_j‖². In addition to Lemma 3, it is straightforward to see that if β ≠ β*, the majorization inequality is strict, which leads to the strict descent property of the updating algorithm. Let β* be the current solution to the optimization problem and β the next update. We update β one functional predictor at a time, j = 1, . . ., p; in other words, [β] − [β*] has the form (0, . . ., 0, [β_j] − [(β*)_j], 0, . . ., 0)^T, which simplifies the objective function in the new optimization problem. Let U = −∇L(β*) and let U_j be the sub-vector of U with the indices m(j − 1) + 1, . . ., mj. Let H_j be the j-th diagonal block of H. Then the quadratic term in (16) can be bounded using γ_j, a value slightly larger than the largest eigenvalue of H_j, which further relaxes the upper bound. In practice, we take γ_j = (1 + ε*)η_j with ε* = 10^{-6}, where η_j is the largest eigenvalue of H_j. Finally, the update rule for β_j is the solution to the following optimization problem.
argmin_{[β_j]} (γ_j/2)‖[β_j] − [(β*)_j]‖² − U_j^T([β_j] − [(β*)_j]) + g_j(β_j),

where g_j is the j-th term of g(·). Using a trick similar to that of Lemma 6, this problem has the closed-form solution

[β_j] = (γ_j + 2αλ)^{-1} (1 − λ(1 − α)/‖z_j‖)_+ z_j,  where z_j = γ_j [(β*)_j] + U_j.    (20)
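One GMD-style block update, specialized to the group-lasso case (α = 0) with a quadratic loss, can be sketched as below; the data, penalty level, and loss scaling are illustrative, and the loop simply applies the block update repeatedly to a single predictor to show the descent behavior.

```python
import numpy as np

# Sketch of a majorized block update for 0.5*||Y - Xj b||^2 + lam*||b||:
# with U_j = -grad L(beta*) and gamma_j slightly above the largest eigenvalue
# of H_j = Xj^T Xj, the quadratic majorizer has a block-soft-threshold
# minimizer. All names and numbers here are illustrative.

def gmd_block_update(beta_j, U_j, gamma_j, lam):
    z = beta_j + U_j / gamma_j
    nz = np.linalg.norm(z)
    return max(0.0, 1.0 - lam / (gamma_j * nz)) * z if nz > 0 else 0.0 * z

rng = np.random.default_rng(2)
n, m = 100, 3
Xj = rng.standard_normal((n, m))
Y = Xj @ np.array([1.0, 0.0, -1.0]) + 0.01 * rng.standard_normal(n)

H_j = Xj.T @ Xj                        # block of the quadratic term
gamma_j = (1 + 1e-6) * np.linalg.eigvalsh(H_j)[-1]

beta_j = np.zeros(m)
for _ in range(200):                   # descent: the loss never increases
    U_j = Xj.T @ (Y - Xj @ beta_j)     # -grad of 0.5*||Y - Xj b||^2
    beta_j = gmd_block_update(beta_j, U_j, gamma_j, lam=0.5)
```

Bounding the curvature by gamma_j is what makes the block update closed-form; a tighter bound would converge in fewer iterations but would require solving a harder subproblem.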

Tuning parameter selection
While iterating over this GMD update rule, we can further reduce the computational burden during the tuning parameter selection with the strong rule technique; see Tibshirani et al. (2012).
Step 1. (Initialization) Given α ∈ (0, 1), the largest λ in the grid is the smallest value of λ for which all the coefficients are zero. In particular, using the KKT condition (see Lemma 6), the largest λ in the grid is max_{1≤j≤p} ‖U_j(0)‖/(1 − α), where U(0) = −∇L(0). Therefore, the initial β is zero. The smallest λ of the grid is then set to a small number, usually a fraction of the largest λ, so as to include all the functional predictors. The search for the optimal λ starts with the largest grid value and moves backward to the smallest.
Step 2. (Iteration) Moving from λ^{(k)} to λ^{(k+1)}, we add the j-th functional predictor to the active set if it satisfies the strong rule condition ‖U_j(λ^{(k)})‖ ≥ (1 − α)(2λ^{(k+1)} − λ^{(k)}), for j = 1, . . ., p. Subsequently, we update β with these reduced predictors by iterating the update rule (20) until numerical convergence; the stopping criterion for this iterative process can be absolute or relative. Next, to make sure the strong rule does not leave out some worthy coefficients, we check the KKT condition on the remaining blocks of the current solution, where β_j^{update}(λ^{(k+1)}) is the updated β_j when the iterative GMD algorithm meets the stopping criterion after the strong-rule screening. If the j-th functional coefficient violates the KKT condition, we add it to the active set and update β using (20). This process of checking the KKT condition and updating continues until no functional coefficient violates the KKT condition. We store the final updated value in β_j(λ^{(k+1)}) and use it to repeat Step 2 for the next value of λ (warm start).
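The screening step can be sketched as follows, assuming the group-wise sequential strong rule with gradient blocks X_j^T r / n and the group-lasso case (the (1 − α) factor omitted); the data and the two λ values are illustrative, and in practice the discarded groups would still be re-checked against the KKT condition as described above.

```python
import numpy as np

# Sketch of sequential strong-rule screening: moving from lam_curr to
# lam_next < lam_curr, keep group j only if the gradient-block norm at the
# current solution exceeds 2*lam_next - lam_curr (Tibshirani et al., 2012).
# Data and constants are illustrative.

def strong_rule_screen(X, resid, p, m, lam_next, lam_curr, n):
    keep = []
    for j in range(p):
        g = X[:, j * m:(j + 1) * m].T @ resid / n
        if np.linalg.norm(g) >= 2.0 * lam_next - lam_curr:
            keep.append(j)
    return keep

rng = np.random.default_rng(3)
n, p, m = 150, 6, 3
X = rng.standard_normal((n, p * m))
b = np.zeros(p * m); b[:m] = [1.0, -0.8, 0.6]    # only group 0 is active
resid = X @ b + 0.1 * rng.standard_normal(n)     # residual at beta = 0

kept = strong_rule_screen(X, resid, p, m, lam_next=1.0, lam_curr=1.5, n=n)
```

Only the groups whose gradient blocks are strongly correlated with the residual survive the screen, so the inner GMD iterations run on much smaller vectors.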
It is worth mentioning that the strong rule does not allow the main regularization path over λ to be computed in parallel, because of the warm start; that is, we search over λ sequentially. However, the main computational cost is paid in this regularization, and the strong rule lets the algorithm screen predictors, leading to cost-effective computation by storing and operating on smaller vectors. On the other hand, the strong rule does not appear to be valid for the main regularization of the ADMM algorithm, because two objective functions are involved; hence, it is possible to tune the regularization parameters in parallel via the ADMM.

Asymptotic results
In this section, we derive the consistency of the multivariate functional group LASSO (MFG-LASSO) when the functions are fully observed. In particular, the consistency breaks down into the selection consistency and the estimation consistency, together known as the oracle property.
We first illustrate the consistency of the operators used in the estimation procedure. Since the implementation in Section 4.1 is based on method-of-moments estimates, the following lemma (Lemma 4) is an immediate consequence of the functional version of the central limit theorem in a separable Hilbert space; see Hsing and Eubank (2015). Now, we restrict the index set to A, the true active set of the population functional coefficient β. For convenience, we denote truncated versions by the superscript J, so that β^J = (β_j : j ∈ A) ∈ H^J, and use J and A interchangeably.
Lemma 5. In addition to the assumptions in Lemma 4, assume that for any j ∈ A there exists g_j ∈ H_j such that β_j = Γ_{X_j X_j}(g_j). Define β̂_n^J as a minimizer of the truncated objective function (21). If λ_n approaches zero more slowly than √n approaches infinity, then ‖β̂_n^J − β^J‖_H converges to zero in probability, at a rate slightly slower than √n.

The above lemma shows that if we know the true functional predictors, the solution to the optimization problem (3) achieves consistency. Let M_n(·) be the objective function in (21). Then its population counterpart (22) is asymptotically strictly convex as long as we can assume that Γ_{X^J X^J} is a positive-definite operator. Similarly, the original objective function (3) also has a unique solution if we can assume that Γ_XX exists and is positive definite. Using Lemma 5 as a bridge, we prove the consistency of our estimate in the following theorem.
Theorem 3. Assume that:
1. The fourth moments of X and Y are bounded.
2. For any j, there exists g_j ∈ H_j such that β_j = Γ_{X_j X_j}(g_j).
3. In the population, a regularity condition holds on the correlation operators C_{X^i X^J} and C_{X^J X^J} defined in Baker (1973).
Then the multivariate functional group LASSO estimate is both selection consistent and estimation consistent (the oracle property).
Assumption 1 is a common condition for the functional central limit theorem. Assumption 2 states that the functional coefficient β lies in the support of the functional predictor X, meaning that we restrict the potential range of β to the range of Γ_XX. Assumption 3 is a modified version of the necessary condition, derived in Zou (2006), for the LASSO to be consistent.
Simulation studies

To investigate the methods thoroughly, we consider different numbers of observations (100, 200, 500) and different standard deviations of the residual term, σ = 0.01, 0.1, 1. In each sample, we divide the observations into a training set and a test set (80% for training, 20% for testing). We measure the root mean squared error (RMSE) of the predictions of the test-set response values. In addition, we count the number of functional predictors that are chosen correctly; more specifically, we count the correctly identified functional predictors in the population active set, whose size is 3, and in the population inactive set, whose size is 16. We use m = 21 B-spline basis functions to convert the observed values into functional objects and coordinate representations, and 5-fold cross-validation to tune the regularization parameters on a net.
In each scenario, we generate 100 samples and compute the percentages of correctly selected functional predictors, tabulated in Table 1, and the mean and standard deviation of the test RMSE, reported in Table 2. Furthermore, we compare the sparse methods with the scalar-on-function ordinary least squares (OLS), the ridge regression, and the oracle procedure, in which only the functional predictors in the population active set are used in the OLS. For the sparse models, we apply the multivariate functional group LASSO (MFG-LASSO) and the MFG-Elastic Net (MFG-EN). The two algorithms, GMD with the strong rule and ADMM, provide similar results, while the GMD algorithm is much faster on serial systems and the ADMM is faster on parallel systems. Thus, we report the results from the GMD with the strong rule in this paper.
From Table 1, we can see that the MFG-sparse methods effectively select the correct functional predictors, which also supports the consistency empirically. In particular, they always select the active set correctly, even with large noise, but their performance in eliminating the inactive predictors is poor with small samples or large noise. The MFG-EN tends to choose more functional predictors than the others. This is an expected result, since the MFG-EN penalty includes the quadratic term, which adds stability but tends to select more predictors.
Table 2 illustrates the estimation performance using the test RMSE. The overall behavior of the methods in terms of prediction error is similar to that of the selection performance. As the sample size grows, the RMSEs approach that of the oracle estimator and their standard deviations decrease. Compared with the OLS, the sparse methods outperform when there are not enough observations or the functions are noisy. The OLS performs slightly better than the sparse methods when n is large enough and the noise is small; however, the standard errors of the OLS RMSE are larger than those of the MFG methods. The ridge method is worse than the OLS with small noise, but better with large noise. Overall, the sparse methods, MFG-LASSO and MFG-EN, perform the best in general, because their results are very close to the oracle estimates. Considering that the sparse methods use far fewer functional predictors, the simulation results illustrate the great effectiveness of our methods in reducing both the model complexity and the prediction error.
Figure 2 shows the estimated functional coefficients β̂_1(·), . . ., β̂_6(·) from the MFG-LASSO in a hundred simulation samples when n = 100 and σ = 1, the worst-performing case. The green curves are the true functions, and the remaining curves are the estimates. Figure 3 shows the results when n = 500 and σ = 0.01, the best-performing case.

Application to fMRI
We apply our methods to a human brain fMRI data set collected by New York University. This data set is part of the ADHD-200 resting-state fMRI and anatomical datasets; the parent project is the 1000 Functional Connectomes Project. The BOLD-contrast activities of the brain are measured by the fMRI machine during a 430-second period. To extract the time courses, 172 equally spaced signal values were recorded as the observed points within that period. Prior to the analysis, the automated anatomical labeling (AAL) of Tzourio-Mazoyer et al. (2002) was applied to the raw fMRI data by averaging the BOLD activities of the clusters of voxels in p = 116 regions of the brain, the regions of interest (ROIs). This procedure, called masking, clusters the voxels by region and averages the time-series signals within each region. The data consist of between five and seven resting-state brain fMRI records from each of 290 human subjects. We randomly choose two brain images from each human subject and clean the data by removing missing response values. We choose a different response value in each regression analysis: the subjects' intelligence quotient (IQ) scores, verbal IQ, performance IQ, attention deficit hyperactivity disorder (ADHD) index, ADHD Inattentive, and ADHD Hyper/Impulsive. Then, we split the data into 80% for the training set and 20% for the test set. We use m = 31 B-spline basis functions in the function approximation procedure.

Figure 3: The estimated functional coefficients by the MFG-LASSO from a hundred simulated data sets when n = 500, σ = 0.01. The green curves are the true coefficient curves and the grey curves are the estimated coefficients. The estimated curves for the remaining coefficients, from the seventh to the nineteenth, are very similar to the fourth, fifth, and sixth (inactive) coefficients displayed in the figure.
Table 3 reports the test RMSE and the sparsity of the regression models. The results show that the scalar-on-function OLS does not work, in that its RMSE is higher than the standard deviation of the response values in the test set. The ridge regression has a significantly lower RMSE, but it does not select functional covariates. The MFG-LASSO eliminates more than half of the brain regions, except for the performance IQ, while its RMSE is slightly higher than that of the MFG-EN in most cases. In terms of the RMSE, the MFG-EN performs best, though it selects more functional predictors than the MFG-LASSO. It is worth mentioning that when we change the train/test proportions to 90% and 10%, the ratio RMSE/σ_{Y_test} decreases significantly for the sparse regressions; however, to be consistent with the simulations, we keep the 80%/20% proportions for the training and test sets.
At the time of writing, no research study uses exactly the same data. However, there are articles that predict the IQ score from human brain measurements. Hilger et al. (2020) predicted the IQ score from structural magnetic resonance imaging (MRI) using two methods: principal component analysis on the gray-matter volume of each voxel, and atlas-based gray-matter volume, adjusting for brain size in both. The RMSE reported in that study, with 90%/10% train/test proportions, is 13.07 at best, while the standard deviation of the IQ scores in the whole sample, including the test set, is σ_Y = 12.94. By contrast, the MFG-LASSO provides an RMSE of 6.32 and the MFG-EN provides 5.91. In addition to the higher accuracy, our methods have much lower model complexity: Hilger et al. (2020) select more than 20,000 principal features among the features associated with the 556,694 voxels in the data, whereas our methods use 53 functional predictors for the MFG-LASSO and 106 for the MFG-EN. For each functional predictor, we use 172 time points from the raw data and m = 31 basis functions in the function approximation procedure. Therefore, the proposed methods have obvious advantages in reducing model complexity as well as achieving higher accuracy. Running one regression analysis with the proposed methods using the GMD with the strong rule takes on average around two to three minutes on a dual-core i7 CPU with 16 GB of memory, while the article above reports an equivalent computation of 36,000 hours using two CPU kernels and 5 GB of RAM. In addition, in another study, Xiao et al. (2020), the RMSE does not improve beyond about 14, even though the data combine resting-state and task fMRI and the sparse method uses the voxels' functional connectivities (Pearson correlations between BOLD time-series signals) as input features.
In Figure 4 and Figure 5, we display the regions associated with the estimated active sets for IQ and ADHD obtained by the MFG-LASSO, respectively. The final active sets of the algorithms were extracted and matched with the AAL atlas, where each region has a label. The regions were manually entered into the WFU PickAtlas tool (Maldjian et al., 2003) of SPM-12, run on MATLAB 2020b, to produce mask .nii files. The mask files were imported into the MRIcron software to produce the multi-slice images. The active sets cover the regions associated with IQ in Yoon et al. (2017), such as the cerebello-parietal component and the frontal component. That paper notes that the parietal and frontal regions are strongly associated with intelligence through their connections with the cerebellum and the temporal regions. The shaded areas also cover the regions mentioned in Goriounova and Mansvelder (2019). We provide the names of the regions associated with these active sets in the appendix.
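The matching of an estimated active set to atlas labels amounts to a simple index lookup; a minimal sketch follows. The label excerpt below is hypothetical (the actual analysis uses the full AAL label list).

```python
# Hypothetical excerpt of an index-to-label mapping; the real analysis
# uses the complete AAL atlas label list.
AAL_LABELS = {1: "Precentral_L", 2: "Precentral_R", 3: "Frontal_Sup_L"}

def active_set_regions(active_indices, labels=AAL_LABELS):
    """Return the atlas labels for the selected functional predictors."""
    return [labels[i] for i in sorted(active_indices) if i in labels]
```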
It is interesting that ADHD and IQ share a large proportion of their active sets. For instance, when the MFG-LASSO is applied, they overlap in 35 ROIs, where the sizes of the active sets are 53 and 41 for IQ and ADHD, respectively. On the other hand, the ROIs associated with ADHD but not with IQ are the middle and superior frontal, parahippocampal, inferior parietal, and superior temporal pole gyri. The ratio of right-hemisphere to left-hemisphere regions associated with IQ is significantly greater than that for ADHD.
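The overlap figures reported here (35 common ROIs out of active sets of sizes 53 and 41) reduce to elementary set arithmetic on the selected indices; a sketch with illustrative indices is below.

```python
def active_set_overlap(active_a, active_b):
    """Summarize how two estimated active sets overlap.

    Returns the size of each set, the size of their intersection, and
    the indices selected for one response but not the other.
    """
    a, b = set(active_a), set(active_b)
    return {
        "size_a": len(a),
        "size_b": len(b),
        "common": len(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }
```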

Conclusion
We propose new methods for scalar-on-function regression that perform functional predictor selection together with the estimation of smooth coefficient functions when the predictors are multivariate functional data. We derive the algorithms for the implementation and establish the consistency of the methods by showing their oracle property. The simulations and the real data application demonstrate the effectiveness of the methods, with the proposed penalized methods clearly outperforming the functional regression model fitted by OLS. Furthermore, in the fMRI study, the proposed methods provide higher accuracy together with lower model complexity, addressing a pressing need in the medical sciences and related fields.
The manuscript also has a potential impact on statistical research. Given that sparse modeling of multivariate functional data has not been investigated extensively, the computational algorithm derived in this paper paves the way for other novel sparse methods. In addition, the methods can be extended to the nonlinear regression model via a reproducing kernel Hilbert space (RKHS); since the theoretical justification is constructed in an infinite-dimensional setting, an RKHS extension can readily adopt the results of this paper. Furthermore, the proposed methods are group-based, with a single functional predictor forming a group, so they can easily be extended to sparse models in which multiple functional predictors form a group. For example, instead of averaging the fMRI signals of voxels over the regions of the brain, we could keep the original data and apply the MFG methods with groups formed by the voxel activities of each region. We might then recover information that is lost in the masking procedure.
Consider the objective function for β in (9). After removing the terms that are constant with respect to β and applying Lemma 1, we differentiate with respect to β and set the derivative equal to zero to satisfy the KKT conditions. Solving the resulting equation for β completes the derivation. Note that the result is similar to that of functional ridge regression.
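Although the display equations are omitted in this extract, the closing remark can be illustrated in finite dimensions: once the functional objects are represented through basis scores, the β-update is a generic ridge-type solve. The sketch below is ours and is not the paper's exact operator expression.

```python
import numpy as np

def ridge_update(Z, y, lam):
    """Generic ridge-type solve: beta = (Z'Z + lam * I)^{-1} Z'y.

    Z   : (n, d) matrix of basis scores of the functional predictors,
    y   : (n,) response vector,
    lam : positive regularization parameter.
    This mirrors the structure of the beta-update, which the paper notes
    is similar to a functional ridge regression.
    """
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```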
2) γ-update. Similarly, removing the terms that are constant with respect to γ and expanding the objective function for γ, we see that the objective is additive in the components γ_j, which allows us to optimize over each γ_j separately. Thus, the optimization is equivalent to solving [(γ_j)_new] := arg min of the j-th component problem for j = 1, ..., p. Applying Lemma 6 completes the proof.
Lemma 7. For x, y ∈ R^m, where y is known and a, b > 0 are constants, the minimizer of (1/2)‖x − y‖² + a‖x‖ + (b/2)‖x‖² is x = (1/(1 + b))(1 − a/‖y‖)y if ‖y‖ > a, and x = 0 otherwise.
Proof of Lemma 7. The proof is similar to that of Lemma 6. The only difference is the derivative of the objective function, which is x − y + as + bx, where s is the subdifferential of ‖x‖. If x ≠ 0, setting the derivative equal to zero gives x(1 + b + a/‖x‖) = y. Taking the norm ‖·‖ of both sides, solving for ‖x‖, and plugging it back in, we obtain x = (1/(1 + b))(1 − a/‖y‖)y. Note that this is only possible when ‖x‖ > 0, which requires ‖y‖ > a. If x = 0, the optimality condition becomes 0 ∈ −y + as, i.e., y ∈ as; since in this case ‖s‖ ≤ 1, this requires ‖y‖ ≤ a. The rest of the proof is straightforward.
Proof of Theorem 2. The proof is a straightforward consequence of combining Theorem 1 and Lemma 7.
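The closed form derived above is a group soft-thresholding combined with ridge-type shrinkage; a direct finite-dimensional implementation (our naming) is:

```python
import numpy as np

def group_soft_threshold_ridge(y, a, b):
    """Minimizer of (1/2)||x - y||^2 + a ||x|| + (b/2) ||x||^2 over x in R^m.

    Following the closed form of Lemma 7:
      x = 0                                    if ||y|| <= a,
      x = (1 / (1 + b)) * (1 - a / ||y||) * y  otherwise.
    """
    norm_y = np.linalg.norm(y)
    if norm_y <= a:
        return np.zeros_like(y)
    return (1.0 / (1.0 + b)) * (1.0 - a / norm_y) * y
```

Applied group-wise, this is the kind of per-component update used in the γ-step, shrinking small groups exactly to zero.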
Lemma 8. Assume that Γ_XX is a positive definite operator and that, as n → ∞, λ_n approaches zero slowly enough that √n λ_n → ∞. Then (Γ̂_XX + λ_n I)^{-1} converges to (Γ_XX + λ_n I)^{-1} in operator norm. To be specific, if we add and subtract λ_n(Γ̂_XX + λ_n I)^{-1}(Γ_XX + λ_n I)^{-1} on the left-hand side of the stated equation, we can easily derive its right-hand side. In addition, the norm of the product of the last two parentheses is bounded by 1, which yields the first convergence rate. The second convergence rate follows by an analogous argument.
Proof of Lemma 5. The following proof is similar to the one in Bach (2008), which considers a different penalty term, namely the square of the group LASSO penalty; consistency is proved there by noting that the solution path of the group LASSO is the same. Instead, we consider a different optimization problem, M̃_n(·), proposed below, which leads directly to the consistency of the multivariate functional group LASSO.
Denote by β̃_{J,n} the unique minimizer of the following objective function, where β_j is the j-th functional component of β_J in the population model. β̃_{J,n} has a closed-form solution similar to that of a functional ridge regression, where D is the diagonal operator diag((·)/‖β_j‖). We can replace Γ̂_{X_J Y} by the following expression after adding and subtracting Γ̂_{X_J X_J}(β_J).
where Γ̂_{Xε} is the empirical covariance operator between the observed functional data X and the population error ε = Y − ⟨X, β⟩ = Y − ⟨X_J, β_J⟩. D is a self-adjoint operator, and ‖β_j‖_H ≠ 0 for all j ∈ J by the definition of the population active set J. This means there are positive constants D_min = 1/max_{j∈J}‖β_j‖_H and D_max = 1/min_{j∈J}‖β_j‖_H such that D_max I ⪰ D ⪰ D_min I. The closed-form solution (25) can be broken down into multiple terms. Applying the same technique as in the proof of Lemma 8 and using the result of Lemma 4, we can bound the term involving Γ̂_{Xε}. Hence, the first two terms of the last equation in (27) are O_p(n^{-1/2} λ_n^{-1}) by Lemma 8. Using (Γ_{X_J X_J} + λ_n D)^{-1} Γ_{X_J X_J} = I − λ_n(Γ_{X_J X_J} + λ_n D)^{-1} D, we can simplify the third and fourth terms of (27). Next, we bound the norm of λ_n(Γ_{X_J X_J} + λ_n D)^{-1} D; the third line of the corresponding display is valid because Γ_{X_J X_J} + λ_n D ⪰ λ_n D_min I, so the norm of the inverse is at most (λ_n D_min)^{-1}. Combining the results above completes the bound. Now, let us compare β̃_{J,n} and β̂_{J,n}, where β̂_{J,n} is the solution to the optimization problem of M_n(α). Consider the following equation.
M_n(α) − M̃_n(α) = λ_n Σ_{j∈J} (·). The partial Fréchet derivative of equation (30) with respect to α_i, for i ∈ J, is given in (31). Since the β_j, j ∈ J, are nonzero, (31) is continuously differentiable around β_J, and D_{α_i} M̃_n(β̃_{J,n}) = 0, we obtain the stated bound, where ‖·‖ is the operator norm. In addition, since β_i = 0 for i ∉ J, it can easily be shown that the corresponding bound holds for some constant C > 0. Thus, we have the desired inequality. Now, since M_n is strictly convex near the true β_J, its second-order Fréchet derivative has a lower bound. Consequently, for some C > 0, we obtain a quadratic lower bound. Suppose that α_J is near β̃_{J,n} and let η_n = ‖α_J − β̃_{J,n}‖²_{H_J}, which tends to zero. We can then rewrite the lower bound accordingly. If the last term tends to zero more slowly than the second term, we can conclude that all minima of M_n(·) lie inside the ball {α_J : ‖α_J − β̃_{J,n}‖²_{H_J} < η_n} with probability tending to one. This is because, on the boundary of the ball, M_n(·) takes values greater than those inside; i.e., the global minimum of M_n(·) is at most η_n away from β̃_{J,n}. Thus, the necessary conditions for the proof are η_n λ_n^{3/2} = o(λ_n η_n²) and n^{-1/2} η_n = o(λ_n η_n²). Altogether, we have the consistency results if η_n converges to zero more slowly than λ_n^{1/2} + n^{-1/2} λ_n^{-1}.
Proof of Theorem 3. We rewrite the multivariate functional group LASSO objective function (3) in terms of the group norms ‖α_j‖_{H_j},
using the fact that Γ̂_{X_i Y} − Γ̂_{X_i X_J}(β_J) = Γ̂_{X_i ε}. At this point, the formulation has a form similar to the one derived in Theorem 11 of Bach (2008). Furthermore, Lemma 5 satisfies the condition necessary for the rest of the proof, so the remaining steps can be derived in a similar way.
List of regions of interest: The following are the lists of the regions of interest of the human brain used in the application of Section 8. The atlas labels of the human brain and their full names can be found at Atlas Label.
The list of the regions of interest associated with the active set of the MFG-LASSO when the response value is the performance IQ:

Figure 1 :
Figure 1: The regions of interest whose BOLD activities correlate the most with the ADHD score variability in a sample of subjects and achieve the lowest estimation error. The regions associated with the ADHD score are colored red, those associated with ADHD Hyper/Impulsive are blue, and those associated with ADHD Inattentive are green.

Figure 2 :
Figure 2: The estimated functional coefficients obtained by the MFG-LASSO from one hundred simulated data sets with n = 100, σ = 1. The green curves are the true coefficient curves and the grey curves are the estimated coefficients. The estimated curves for the remaining coefficients, from the seventh to the nineteenth, are very similar to the fourth, fifth, and sixth (inactive) coefficients displayed in this figure.

Figure 4 :
Figure 4: The multi-slice display (axial, coronal, sagittal) of the regions of interest whose BOLD activities achieve the lowest prediction error and correlate the most with the IQ score variability in the sample when the MFG-LASSO is used. The regions associated with the IQ score are colored red, those associated with the performance IQ are blue, and those associated with the verbal IQ are green.

Figure 5 :
Figure 5: The multi-slice display (axial, coronal, sagittal) of the regions of interest whose BOLD activities achieve the lowest prediction error and correlate the most with the ADHD score variability in the sample when the MFG-LASSO is used. The regions associated with the ADHD score are colored red, those associated with ADHD Hyper/Impulsive are blue, and those associated with the ADHD Inattentive score are green.

Table 1 :
Percentages of correct selection in the test set under various simulation scenarios. In each case, 100 random samples are used. In each sample, we count the correctly identified functional predictors for the active set of size 3 and the inactive set of size 16, then compute the overall percentage over the 100 samples.

Table 2 :
Average test RMSE of the different methods under different simulation scenarios. In each case, 100 random samples are used to compute the mean and the standard deviation (in parentheses).

Table 3 :
The results of applying the proposed methods to the fMRI data when predicting the IQ and ADHD scores.