Identification of functionally connected multi-omic biomarkers for Alzheimer’s disease using modularity-constrained Lasso

Large-scale genome wide association studies (GWASs) have led to discovery of many genetic risk factors in Alzheimer’s disease (AD), such as APOE, TOMM40 and CLU. Despite the significant progress, it remains a major challenge to functionally validate these genetic findings and translate them into targetable mechanisms. Integration of multiple types of molecular data is increasingly used to address this problem. In this paper, we proposed a modularity-constrained Lasso model to jointly analyze the genotype, gene expression and protein expression data for discovery of functionally connected multi-omic biomarkers in AD. With a prior network capturing the functional relationship between SNPs, genes and proteins, the newly introduced penalty term maximizes the global modularity of the subnetwork involving selected markers and encourages the selection of multi-omic markers with dense functional connectivity, instead of individual markers. We applied this new model to the real data collected in the ROS/MAP cohort where the cognitive performance was used as disease quantitative trait. A functionally connected subnetwork involving 276 multi-omic biomarkers, including SNPs, genes and proteins, were identified to bear predictive power. Within this subnetwork, multiple trans-omic paths from SNPs to genes and then proteins were observed. This suggests that cognitive performance deterioration in AD patients can be potentially a result of genetic variations due to their cascade effect on the downstream transcriptome and proteome level.


Introduction
Alzheimer's disease (AD) is the most common form of brain dementia characterized by the gradual loss of memory and other cognitive function. With rapidly increasing aging population, AD is drawing more and more attention in the United States and around the world [1]. Unfortunately, the underlying mechanism of AD remains largely unknown and no clinically validated drug is available for disease treatment and prevention. Recent large-scale genome wide association studies (GWASs) have led to discovery of many genetic risk factors associated with AD, such as APOE, TOMM40 and CLU. However, they are mostly individual markers, possibly without functional interactions, which presents difficulties to validate these findings and to relate them to downstream biology [2,3]. Therefore, it remains a challenge to translate them into targetable mechanisms related to disease pathogenesis. Novel biomarkers are in need that can potentially serve as targets in the future therapeutic interventions.
Recently, there is a substantial increase of multi-omic data in AD. Example projects include the Alzheimer's Disease Neuroimaging Initiative (ADNI) [4] and the Religious Orders Study and Memory and Aging Project (ROSMAP) [5]. Instead of limiting their perspective to a single data type, these data collections create a molecular landscape spanning the genome, transcriptome, proteome and metabolome. Coupling with various biological networks (e.g., proteinprotein interaction (PPI) network), these data provides a valuable resource with rich content and opens numerous opportunities for more comprehensive analyses of AD. These multiomic data has been increasingly recognized to be a potential key enabler of novel biomarker discovery [6,7]. It not only allows us to examine the disease from different molecular scales, but also provides insights into their interactions which is critical for translation of genetic findings into targetable mechanism.
In this paper, we proposed a modularity-constrained Lasso model to jointly analyze the genotype, gene expression and protein expression data for discovery of functionally connected multi-omic biomarkers in AD. We constructed a network between SNPs, genes and proteins to capture their functional relationship and used it as a priori in the proposed model. Based on this, the newly introduced penalty term maximizes the global modularity of the subnetwork involving selected markers and encourages the selection of multi-omic markers with not only predictive power but also dense functional connectivity evidenced in the prior knowledge. We applied this new model to the real data collected in the ROS/MAP cohort and used the cognitive performance as disease quantitative trait. As a result, we identified a functionally connected subnetwork involving 276 multi-omic biomarkers, including SNPs, genes and proteins. Within this subnetwork, we observed a plenty number of paths cutting across multiple molecular types, from SNPs to genes and then proteins. This suggests that cognitive performance deterioration in AD patients can be potentially a result of genetic variations due to their cascade effect on the downstream transcriptome and proteome level. Such connected pattern can help improve not only the reliability of identified biomarkers, but also their replicability and interpretability.

Study sample
All the data analyzed were obtained from the Religious Orders Study (ROS) and Memory and Aging Project (MAP). It was launched by Rush University to build a cohort from religious communities to measure the progression of amnestic mild cognitive impairment (MCI, a prodromal stage of AD) and early probable AD. The combined ROS/MAP cohort includes around 600 participants under age 90, which constitute a very rich repository of multi-modal data including GWAS data, whole genome sequencing (WGS) data, cognitive, behavioral and clinical data. The more detailed description could be found in [5]. In this paper, GWAS genotype data and quality controlled RNA-Seq gene expression and protein expression data collected from prefrontal cortex tissue in the brain were downloaded. To perform the proposed joint analysis, only subjects with all three types of data were included. In total, we have 262 subjects with full set of data, including 115 cognitive normals (CN), 67 MCIs and 80 AD patients. The detailed demographic information can be found in Table 1.

GWAS genotype data preparation
ROS/MAP samples were genotyped on the Affymetrix GeneChip 6.0 platform [8]. We performed sample and SNP quality control procedures on GWAS data (SNP call rate<95%, Hardy-Weinberg equilibrium test p < 10 −6 in controls, and frequency filtering MAF<1%) were performed. After the standard quality control procedures for genetic markers and subjects, only non-Hispanic Caucasian participants were selected by clustering with CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) + TSI (Toscani in Italia) populations using HapMap 3 genotype data and the multidimensional scaling (MDS) analysis [9]. Un-genotyped SNPs were imputed using MaCH and the 1000 Genomes Project as a reference panel [10]. Final SNP data used for the analysis is coded as the number of minor alleles.

RNA-Seq gene expression preparation
RNA-Seq gene expression data in the ROS/MAP cohort were collected from the prefrontal cortex tissue in the brain. The RNA-Seq data were recently reprocessed in parallel with other AMP-AD RNAseq datasets, and this second version of the data were downloaded for our subsequent analysis. The input data for the RNA-Seq reprocessing effort was aligned reads in bam files that were converted to fastq using the Picard SamToFastq function. Fastq files were realigned to the reference genome using STAR with twopassMode set as Basic. Gene counts were computed for each sample by STAR by setting quantMode as GeneCounts. These gene level counts further went through normalization and adjustment to remove the effects of relevant factors such as age, gender, education, batch, RNA integrity number (RIN) and postmortem interval (PMI). Detailed reprocessing and normalization steps can be found in the AMP-AD knowledge portal (https://www.synapse.org/#!Synapse:syn9702085/).

Protein expression data preparation
Selected reaction monitoring (SRM) proteomics was performed using frozen tissue from dorsolateral prefrontal cortex (DLPFC). The samples were prepared for LC-SRM analysis using standard protocol as described in [11,12]. All the data were manually inspected to ensure correct peak assignment and peak boundaries. The abundance of endogenous peptides was quantified as a ratio to spiked-in synthetic peptides containing stable heavy isotopes. The "light/ heavy" ratios were log2 transformed and shifted such that median log2-ratio is zero. Normalization adjusted for differences in protein amounts between the samples. During that normalization, the log2-ratios were shifted for each sample to make sure the median is set at zero. Detailed processing steps can be found in the AMP-AD knowledge portal (https://www. synapse.org/#!Synapse:syn8456629). Using the regression weights derived from the cognitive normal participants, peptide abundance data were further adjusted to remove the effects of age, gender, education, PMI and batch.

Selection of SNPs, genes and proteins
We focused our analysis on a set of SNPs, genes and proteins with known functional connections. Though we have genome-wide genotype and transcriptome-wide gene expression data available in the ROS/MAP cohort, only a limited number of proteins are measured, which forms a bottleneck for the joint data analysis. To address this problem, we took a bottom-up approach where proteins measured in the prefrontal cortex were used as seeds to select a subset of relevant SNPs and genes for subsequent analysis. As shown in Fig 1, in the proteomic level, abundance level of 186 peptides, corresponding to 126 unique genes, were measured in the ROS/MAP cohort. In the functional interaction network obtained from the REACTOME database, these 126 genes were found to interact with 954 genes and these interactions were all manually curated from known pathways [13]. Among these 1080 (=126+954) candidate genes, 743 of them without missing RNA-seq data were included to represent the transcriptomic level. Of note, we did not further filter these genes based on their differential expression in AD. In the genomic level, we identified SNPs located on the upstream of these 743 genes within the boundary of 5K. To ensure the functional connection of selected SNPs and the downstream genes, we included only SNPs significantly affecting the transcription factor binding activity, based on the SNP2TFBS database [14]. These relationships between SNPs, genes and proteins/peptides are used to build a trans-omic functional interaction network to guide the search of functionally connected biomarkers in the subsequent analysis.

Memory outcomes
Cognitive performance of participants in the ROS/MAP cohort was estimated through the mini mental state examination, a standardized screening measure for collecting 30 items in related with dementia [15,16]. This score ranges from 0 to 30, and is scaled to quantify the severity of dementia. In this study, we used this memory test score as the AD quantitative trait for biomarker discovery. Using the regression weights derived from the cognitive normal participants, the memory score is adjusted to remove the effect of age, sex and education.

Modularity-constrained Lasso
Throughout this section, we write matrices as boldface uppercase letters and vectors as boldface lowercase letters. Given a matrix M = (m ij ), its i-th row and j-th column are denoted as m i and m j respectively. Let X = [x 1 , x 2 , . . ., x n ] T be the multi-omic features as predictors and y = [y 1 , y 2 , . . ., y n ] T be the disease quantitative trait as outcome (i.e., cognitive performance).
Here, x j � R p is a concatenated vector of genotype, gene expression and protein expression data for j-th subject. The least absolute shrinkage and selection operator (Lasso) is a shrinkage and selection method for linear regression [17]. It minimizes the usual sum of squared errors with a bound on the sum of the absolute values of the coefficients, which is also known as L1 norm (Eq 1). min w ky À Xwk 2 s:t: With this constraint, Lasso aims to minimize the number of selected features, which significantly improved the interpretability of results compared to traditional linear regression, where almost all features are considered to be outcome-relevant with non-zero weight. However, when dealing with a group of highly correlated features, L1 norm penalty will result in a random selection. In this case, multiple runs of Lasso on the same set of data will possibly generate different set of selected features, which presents challenges for replicating and interpreting the results. To address this problem, several groups proposed to explicitly incorporate the correlation structure into the sparse prediction model and encourage the selection/exclusion of all highly correlated features together [18][19][20][21]. Among those is GraphNet, where a graph G � R p�p indicating the correlation structure between predictors is used as a priori to guide the feature selection (Eq 2) [21].
Here, L is the corresponding Laplacian matrix of graph G. However, GraphNet only accounts for the local topology information with a focus on pairwise similarity. For multi-omic biomarker discovery, using this penalty still can not guarantee the connectivity of selected features in the prior network.
In this paper, we propose a new modularity-constrained Lasso which leverages the global network property to encourage the selection of a sub-network rather than individual markers scattered in the prior network. Given the trans-omic network capturing the functional relationships between SNPs, genes and proteins, we formulate it as a graph and its corresponding adjacency matrix is denoted as G � R p�p . B is the modularity matrix [22] It evaluates whether the number of links between nodes i and j is significantly more than expected. h i and h j are the degrees of the i-th and j-th node in the network, and m is the total number of links in the network. Inspired by the module identification problem [23,24], we propose a new penalty term as P M (w, B) = < w T w, B> to impose a modular structure in the identified biomarkers. Here, <> is the Frobenius inner product defined by <A, B > = tr(A T B). Maximizing the Frobenius inner product between w T w and the modularity matrix B encourages the selection of features with dense functional connections in the prior multi-omic network. Taken together, our new modularity-constrained Lasso objective is formulated as in Eq 3. min w X q i¼1 ky À Xwk 2 À P M ðw; BÞ s:t: Here, λ and t are the parameters that control and balance the contribution from two regularization terms. Note that the objective function in Eq 3 is not convex because the modularity matrix B used in P M (w, B) = < w T w, B> is indefinite. To make B negative-definite, we introduced an auxiliary function where B is replaced by B − λ B I and λ B is the absolute maximum eigenvalue of B. Eq 3 can be easily solved by obtaining a closed form solution without L1 constraint, followed by soft-thresholding method [17].

Performance comparison with competing methods
In this section, we denote our modularity-constrained Lasso as M-Lasso and GraphNet-constrained Lasso as G-Lasso. We compared M-Lasso with three state-of-the-art sparse regression models: G-Lasso, elastic net and traditional Lasso. For M-Lasso and G-Lasso, nested 5-fold cross validation (CV) procedure was applied to tune the parameters. Elastic net and traditional Lasso were both implemented using glmnet Matlab package with 5-fold cross validation. To provide an unbiased estimate of the prediction performance of each method tested in the experiments, all methods were evaluated using the same partition of subjects during the cross validation procedure. The portion of AD, MCI and CN participants was kept the same for each fold. Root Mean Square Error (RMSE) and mean absolute error (MAE) between the predicted values and actual values of all the test subjects were used to compare the prediction performances across different methods. Prediction performance, measured by RMSE and MAE, of four different regression models on test data set is shown in Table 2. It is observed that, across all 5 folds, M-Lasso largely outperforms G-Lasso, elastic net and traditional Lasso. Traditional Lasso occasionally shows the best performance with smallest prediction error. However, its selected markers are much less connected than those of M-Lasso and G-Lasso due to lack of prior structure constraints and thus will present challenges for further interpretation.

PLOS ONE
Functionally connected multi-omic biomarkers for Alzheimer's disease For feature selection, M-Lasso identified around 600 features including SNPs, genes and proteins, to be predictive of cognitive performance, while G-Lasso only identified a handful of them (i.e., less than 20 for all 5 folds). When mapped to the prior functional connectivity network, markers identified by G-Lasso scatters across the network with much fewer connections, which suggests that the local topology information used in GraphNet penalty is not strong enough to form subnetwork structure among identified biomarkers. Although the total number of markers identified in G-lasso is much fewer, lack of connections makes them hard to interpret and presents challenges for further functional validation. For M-Lasso, multi-omic biomarkers identified are largely connected to each other in the prior network. Taken together, the sparsity constraints in M-Lasso is still encouraged, but not as strict as in traditional Lasso, elastic net and G-Lasso. Compared to the ridge penalty in elastic net and the GraphNet penalty in G-Lasso, the modularity penalty in M-Lasso further relaxed the sparsity constraint such that subnetworks, instead of individual features, can be identified as biomarkers.
Take the result from one fold as example, 650 features were selected in M-Lasso, including 255 SNPs, 339 genes and 56 proteins. In particular, there are 7 subnetworks with more than 5 nodes and the largest connected network component involves 276 multi-omic features with 366 edges (Fig 2). The rest of the multi-omic markers identified in M-Lasso mostly form small connected components, ranging in size from 2 to 4. These features are found predictive yet not well functionally connected, possibly due to the fact that they are false positives or their functional connections have not been previously studied yet. In the subsequent part, we focus on the multi-omic biomarkers in the largest connected component, which are not only predictive of cognitive performance but also functionally connected with evidence from prior knowledge.

Functionally connected multi-omic biomarkers
Shown in Fig 2 are the 7 subnetworks with more than 5 nodes, which were obtained after mapping 650 features back to the prior network. Size of each node is made proportional to their degree. APOE gene and its corresponding protein, a top risk factor of AD, is found in one of the subnetworks involving 1 SNP and 5 genes, including LRP2, PPARG, LPL, LPLRAP1 and GHIHBP1. Across all the subnetwoks, we observed that there are multiple trans-omic paths from SNPs to genes and then proteins. Note that these SNPs are located upstream of their connected genes and has significant effect on the transcription factor binding activity. Thus, these SNPs are very likely to have an influence on the expression of their connected genes. Also, the functional interaction between genes and proteins are curated from the REACTOME pathways with direction information. Therefore, genes have a regulatory role toward the expression of their connected proteins in the prior network. Taken together, these trans-omic paths suggest that cognitive performance can be potentially affected by the genetic variations (i.e., SNPs) due to their cascade effect on the expression of downstream genes, which further regulate the protein expression. In addition, we examined 63 SNPs involved in the largest connected component in the BRAINEAC database. This database provides the association between SNPs and gene expression tested on 134 neuropathologically confirmed control individuals of European descent. SNPs significantly associated with gene expression are named as expression quantitative trait loci (eQTLs). EQTL mapping is a widely used tool for identifying genetic variants that affect gene regulation [25]. Details of the eQTL analysis can be found in [26]. For 63 SNPs in the largest subnetwork, 58 of them were found to be eQTLs in the frontal cortex tissue (FDR corrected p < 0.05).
For the largest connected subnetwork, we further performed network analysis using Net-workAnalyzer in Cytoscape [27] and identified the biomarkers with top centrality values, such as degree, betweenness and closeness (Table 3). Top nodes by degree in this subnetwork included proteins PIK3R1, GRB2, FYN, CD44, RPS2, BCL2L1, BCL2L1 and PTPN11, and genes EP300 and SPCS3. Most hub nodes are also found to have the top centrality value in betweenness and closeness, such as PIK3R1, FYN, and EP300. Majority of these genes and proteins have been previously reported in association with AD. For example, PIK3R1 encodes the regulatory subunit of the phosphoinositide-3-kinase protein complex PI3Ks, which are known to play a key role in insulin signaling. Results from recent studies start to show evidence of intrinsic insulin resistance inside AD brains [28]. The hub gene EP300 encodes the enzyme histone

PLOS ONE
Functionally connected multi-omic biomarkers for Alzheimer's disease acetyltransferase P300 or E1A-associated protein P300. This enzyme functions as histone acetyltransferase that regulates transcription of genes via chromatin remodeling. Findings from multiple studies have suggested the potential of P300 to act as a biomarker for dementia assessment and monitoring AD [29,30]. In addition, GRB2 was found to interact with APP, a wellknown gene related to AD. GRB2 interacts with APP requiring phosphorylation of APP at Tyr-682 [31]. This could lead to the activation of the MAPK pathway, since GRB2 are known to link growth factor receptors to signaling pathways, such as MAPK and PI3K, and participate in oncogenic proliferation, neuronal development, cell differentiation, and apoptosis [32-37]. In addition to rank, a subsampling procedure can help determine a hard cut-off threshold for identification of a smaller set of significant features [38]. For example, we can repeatedly select the same number of random features, map them to the prior network and use the average (or 25% percentile) of the derived network centrality values as threshold to select significant features.
Finally, hub genes are known to be likely disease-associated genes. Features with higher degree in the identified subnetworks are expected to be more important with higher absolute regression weights. Therefore, we further tested the association between degree of features and their absolute regression weights derived from M-Lasso. Across 5 folds, the average Pearson's correlation value between node degree and absolute weight is only 0.18. However, when we excluded the features with low degree (degree <3), the average correlation increases significantly to 0.4. This indicates that the importance of identified multi-omic features is more proportional to their degrees only when they already have many interactions in the prior network. Upon further examination, features with high degree were found to have medium to high regression weights. Features with low degree can have very small or very high weights. This is possibly due to the fact that their low degree may be a result of no interaction or no known interaction. Therefore, their importance is less determined than that of hub ones.

Pathway enrichment analysis
For 166 genes, 47 proteins and 63 SNPs in the largest connected subnetwork, we performed pathway enrichment analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [39]. The enrichment analysis was performed using ClueGO as left-sided tests based on the hypergeometric distribution [40]. In total, 77 pathways were found to be significantly enriched by our gene/protein set, with Bonferroni corrected term p-value smaller than 5% (corrected p � 0.05). Shown in Table 4 was the top 20 enriched KEGG pathways with smallest p values after correction. The top hit is PI3K-Akt signaling pathway, a major mediator of effects of insulin. Two recent studies have found a significant correlation between peripheral insulin resistance and brain Aβ levels as measured by Pittsburgh compound B-positron emission tomography (PiB-PET) [41,42]. The impaired insulin-PI3K-Akt signaling observed in the AD brain has led to clinical trials studying whether the enhancement of this pathway using intranasal insulin (IN) treatment is beneficial [43]. Other enriched pathways that are  [46], MAPK signaling pathway [47], Rap1 signaling pathway [48], etc. Many of the top enriched pathways are related to cancer, such as PI3K-Akt signaling pathway, prostate cancer and small lung cancer. In addition, we performed pathway enrichment analysis for the subnetwork involving APOE gene and protein. Interestingly, the cholesterol metabolism pathway was significantly enriched by these genes and proteins. This provides support to recent findings in the association between cholesterol metabolism and memory performance [49][50][51].

Conclusion
In this study, we proposed a new modularity-constrained Lasso model to jointly analyze the genotype, RNA-Seq gene expression and protein expression data. The newly introduced penalty term maximizes the global modularity of selected biomarkers in the prior network and encourages the selection of multi-omic biomarkers forming network modules. With this new penalty term, M-Lasso is advantageous in that features can be selected either because they are predictive or because they are closely connected with many predictive ones in the prior network. Thus, the sparsity constraint in M-Lasso is much more relaxed than in G-Lasso, elastic net and traditional Lasso. Compared to the GraphNet penalty that enforces local pairwise similarity, modularity-based penalty helps identify more biomarkers with significantly improved functional connectivity. In particular, we found that some biomarkers form trans-omic paths from SNP to gene and then protein, suggesting the potential cascade effect of genetic variations on the downstream transcriptome and proteome level. To the best of our knowledge, this is the first study that explored the potential of functional multi-omic subnetworks as biomarkers in AD. Despite the promising findings, the proposed M-Lasso still has multiple limitations. First, only one disease quantitative trait is used as outcome in the prediction model. Considering the potential bias introduced in data collection procedure, the biomarkers and their functional connectivity network identified here may not reflect the optimal pattern. Incorporating multiple correlated outcomes and performing a multitask prediction will possibly help improve the performance. Second, like many multi-view prediction models, the proposed M-Lasso is not very capable in handling the missing data problem. Each subject has to have all types of data to be included in the analysis. Therefore, subjects with missing data in one or more data types are inevitably excluded. This missing data problem can be partly addressed using imputation methods such as singular value decomposition [52] and matrix completion [53]. In case of subjects with large chunk of missing data, one possible solution is to examine two types of data at a time to maximize the number of available subjects. Future efforts are in need to further improve this model to enable the integrative analysis of multiview data from decoupled subjects.