Investigating the Control of Chlorophyll Degradation by Genomic Correlation Mining

Chlorophyll degradation is an intricate process that is critical in a variety of plant tissues at different times during the plant life cycle. Many of the photoactive chlorophyll degradation intermediates are exceptionally cytotoxic necessitating that the pathway be carefully coordinated and regulated. The primary regulatory step in the chlorophyll degradation pathway involves the enzyme pheophorbide a oxygenase (PAO), which oxidizes the chlorophyll intermediate pheophorbide a, that is eventually converted to non-fluorescent chlorophyll catabolites. There is evidence that PAO is differentially regulated across different environmental and developmental conditions with both transcriptional and post-transcriptional components, but the involved regulatory elements are uncertain or unknown. We hypothesized that transcription factors modulate PAO expression across different environmental conditions, such as cold and drought, as well as during developmental transitions to leaf senescence and maturation of green seeds. To test these hypotheses, several sets of Arabidopsis genomic and bioinformatic experiments were investigated and re-analyzed using computational approaches. PAO expression was compared across varied environmental conditions in the three separate datasets using regression modeling and correlation mining to identify gene elements co-expressed with PAO. Their functions were investigated as candidate upstream transcription factors or other regulatory elements that may regulate PAO expression. PAO transcript expression was found to be significantly up-regulated in warm conditions, during leaf senescence, and in drought conditions, and in all three conditions significantly positively correlated with expression of transcription factor Arabidopsis thaliana activating factor 1 (ATAF1), suggesting that ATAF1 is triggered in the plant response to these processes or abiotic stresses and in result up-regulates PAO expression. The proposed regulatory network includes the freezing, senescence, and drought stresses modulating factor ATAF1 and various other transcription factors and pathways, which in turn act to regulate chlorophyll degradation by up-regulating PAO expression.


Introduction
Chlorophyll is a central molecule in plants that is essential to photosynthesis in absorbing light and transferring excitation energy, a portion of which is ultimately captured in plant biomass. Chlorophyll synthesis and breakdown are two metabolically significant processes of higher plants that can have significant economic consequences for crop agriculture. While the intricate chlorophyll biosynthetic pathway is well understood, basic knowledge about the chlorophyll degradation machinery and its regulation is uncertain. The chlorophyll degradation pathway consists of several enzymes that convert chlorophyll to non-fluorescent chlorophyll catabolites (NCCs), leading to loss of green color and absorption of visible light [1,2].
The key controlling enzyme involved in the chlorophyll degradation pathway is pheophorbide a oxygenase (PAO), a Rieske-type iron-sulfur protein and monooxygenase that activates molecular oxygen with its mononuclear iron center, functioning as an important cofactor in the reaction [3]. PAO oxidizes the degradation pathway intermediate pheophorbide a to red chlorophyll catabolite (RCC) by opening and linearizing the porphyrin ring leading eventually to the production of NCCs. As a rate controlling step of the pathway, PAO is expected to be under tight regulation [4].
A role for transcriptional regulation of the chlorophyll degradation pathway is suggested by several findings [5]. For example, evidence for the involvement of coordinated transcriptional regulation of plant chlorophyll levels include the clustering and co-regulation of gene expression patterns of stay-green (SGR), chlorophyllase b, non-yellow coloring (NYC), PAO and other enzymes or related proteins involved in the chlorophyll degradation pathway [4,6,7]. Environmental factors also modulate the chlorophyll degradation pathway, such as exposure to freezing temperatures in seeds inhibiting chlorophyll breakdown, which was also shown to be connected to post-translational dephosphorylation and activation of PAO [8].
The chlorophyll degradation pathway is relevant to the agricultural sector by modulating postharvest fruit and vegetable color [9][10][11]. Disruption of the pathway results in the "green seed problem" of oilseed crops such as canola and soybean, leading to photoactive chlorophyll contaminated oils, which in turn cause rancidness and financial losses [12,13]. In addition, optimizing chlorophyll levels in crop canopies could improve canopy photosynthetic efficiency, leading to higher crop yields [14]. Finally, delaying the induction of pathway in leaves by installing a "stay-green" trait could improve crop productivity, by taking advantage of the lengthening growing season brought on by global warming [15][16][17]. This study identifies upstream regulatory elements of PAO, which can be used to modulate plant response to biotic and abiotic factors and thus address wider issues in agribusiness.

Microarray Datasets
Three experiments on the analysis of global gene expression under conditions that should influence PAO gene expression in Arabidopsis thaliana were chosen as the expression datasets to mine for genes that co-expressed with PAO. These transcriptomic experiments were conducted using the Affymetrix ATH1 microarray chip technology containing gene expression data for 22,810 predicted genes and downloaded from the NIH GEO database [18]. The GSE55907 dataset originated from freezing tolerance induction experiments (4°C for 24 hrs) or warm (22°C) conditions on wild-type Columbia (Col-0) and several transgenics that were over-expressing genes associated with cold responses: a C-repeat binding factor (CBF), heat shock transcription factor C1 (HSFC), and a DREB and EAR Motif protein (DEAR) [19]. The GSE5727 dataset originated from leaves sampled during and prior to senescence, and were performed on wild-type Col-0, a jasmonic acid (JA) coi1 mutant, an ethylene signaling pathway ein2 mutant, and a NahG transgenic that would dampen salicylic acid (SA) signaling pathways [20]. The GSE72050 dataset originated from a study on the effect of drought on global gene expression in leaf tissue from wild-type Col-0 and NAC26-transcription factor overexpressing lines [21]. Microarray data were formatted for statistical analysis using the Python scripting language such that all data were normalized and log2 transformed in a similar manner.

Statistical Analysis
Statistical and correlation analysis were performed on each microarray dataset using statistical programming language scripts of the statistical analysis system (SAS), written and tailored specifically for this project (S1 Appendix). PAO expression levels were compared within each microarray dataset among differing plant lines and environmental conditions using analysis of variance (ANOVA). Following this, PAO expression was modeled by other genetic elements using multiple linear regression (MLR) performed by automation in either forward method to add gene variables at a P = 0.05 cutoff, or in a stepwise method to both add and remove gene variables at a P = 0.05 cutoff. This was done until a significant model predicted PAO levels and R 2 = 1, meaning that the model explained the entire variation in the PAO expression. Based on functional analysis results, particular upstream regulatory candidates were analyzed for coexpression with PAO using correlation analysis and general linear model (GLM). Upstream regulatory candidates and PAO were compared side-by-side among plant lines and environmental conditions using frequency analysis.

Functional Classification
Functional determination of upstream regulatory candidates was accomplished using two unique, complementary methods. The first method involved the relational database structured query language (SQL) native functionality within the SAS statistical programming language. Pre-stored gene ontology data were crossed with the shortened co-expressed gene list using a custom-made SQL query to determine the functions of these genes and whether they belonged to any transcriptional factor families of interest. The second method relied on the Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis tool [22], which utilized a unique algorithm to pool together information about large inputted gene lists based on different functional annotation tools including gene ontology (GO) terms, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database, Protein Information Resource (PIR) keywords, and InterPro protein database terms. Fellow gene family members of the upstream regulatory candidates were also investigated by review of recent scientific literature.

Cross-comparison and regression modeling of PAO levels
First, the three microarray datasets were analyzed using ANOVA for comparison of PAO expression levels across all plant lines, environmental conditions, and biological processes, finding in each case that PAO expression levels significantly differed at P<0.05 among the various Arabidopsis lines and conditions (Fig 1, S1-S3 Figs) supporting that these datasets would be good sources for finding other genes that were co-regulated with PAO. Next, the microarray datasets were utilized to create multiple linear regression models by automation using both forward and stepwise selection methods at a P = 0.05 cutoff. In particular, there were 24 gene candidates found to predict PAO expression in the GSE55907 dataset as they formed a multiple linear regression model at P<0.05 and R 2 = 1, receiving the same results with forward and stepwise selection (S1 and S2 Tables). These candidates were then investigated for their functions.

Functional prediction of gene candidates
The 24 candidates of interest were typified by their function using the SQL query merging the candidate list and pre-loaded ontology data. One candidate, probe name_261564_at, was shown to be a NAC family transcription factor based on SQL query findings ( Table 1). The independently conducted DAVID analysis found that TAIR ID AT1G01720 was the only candidate gene with a high probability of being a transcription factor. Both the probe name_261564_at and the TAIR ID AT1G01720 were identified as Arabidopsis thaliana Activating Factor 1 (ATAF1). This gene was relevant due to being a transcription factor that was also a member of the NAC domain An ANOVA was performed to cross-compare PAO expression levels and found significant differences among plant lines in the GSE55907, GSE5727, and GSE72050 microarray datasets at the P<0.05 level in normal vs. pre-freezing treatment (A), pre-senescence vs. during senescence (B), and drought vs. well-water conditions (C), respectively. Box plots displayed lower to upper quartiles with central horizontal lines representing median, diamonds representing mean, end lines representing maximum and minimum values, and circles representing outliers. Further details on the degrees of freedom, mean squares, correlation coefficients, and ANOVA details are provided in S1-S3 Figs. family, as several other NAC members have also been implicated in the regulation of chlorophyll degradation. thus adding weight to selecting ATAF1 as the most likely candidate for involvement in the PAO regulation (S3 Table).

Correlation and frequency analysis of PAO and gene candidates
To investigate ATAF1 further, correlation analysis was performed on ATAF1 compared to PAO, with positive Pearson's correlation coefficients of greater than 0.5 and P<0.05 being determined between the two variables of ATAF1 and PAO transcript expression levels in all three microarray datasets ( Table 2). A GLM was performed using ATAF1 as the sole predictor Table 1. Functional descriptions of 24 candidates predicting PAO expression. The 24 candidates predicting PAO levels in the forward and stepwise method multiple linear regression models of the GSE55907 dataset were identified by their functions through performing SQL joins between tables containing gene names and functions. In particular, probe name_261564_at was found to be relevant due to its transcription factor capacity. variable and PAO as the response variable, finding that ATAF1 significantly predicted PAO levels at P<0.05 in all three datasets. These results were overlaid onto a fit plot that contained all observations within 95% prediction limits for all three datasets (Fig 2). Finally, frequency analysis repeated for ATAF1 side-by-side with PAO showed that in all three microarray datasets, the two genes correlated positively among plant lines, with ATAF1 levels being lower in plant lines where PAO expression was also low, while ATAF1 levels were higher in plant lines where PAO expression was also high (Table 3). PAO and ATAF1 levels were found to be higher under warmer conditions, during senescence stages, and under drought conditions, as well as in HSFC overexpressing lines, salicylic acid pathway nullifying transgenic line, jasmonic acid/ethylene pathway mutant lines, but were found to be lower in the DEAR overexpressing line. PAO levels were higher but ATAF1 levels slightly lower in NAC26-overexpressing line compared to wild type.

Discussion
The analysis of the GSE55907, GSE5727, and GSE72050 microarray datasets revealed that PAO expression differed among the different plant lines, which were either wild type, overexpressing, transgenic, or mutant lines, as well as among the differing experimental conditions such as pre-freezing treatment vs. normal conditions, before vs. during senescence, or drought vs. well-water conditions (Fig 1, S1-S3 Figs). These analyses indicated that both the genetic Table 3. Frequency analysis of PAO and ATAF1 by plant line. Both PAO (left-side) and ATAF1 (right-side) expression differed by plant line in a positively correlated manner in the GSE55907, GSE5727, and GSE72050 microarray datasets, where PAO and ATAF1 levels were either both high or both low in each plant line. A. PAO and ATAF1 levels were higher in warm conditions and certain transcription factor overexpressing lines such as HSFC overexpressing line, and were lower in the DEAR overexpressing line. B. PAO and ATAF1 levels were higher in wild-type plants during senescent conditions, in mutant jasmonic acid/ethylene pathway lines, and transgenic salicylic acid pathway line. C. PAO and ATAF1 levels were higher in drought conditions in both wild-type and NAC26-overexpressing lines and PAO levels somewhat higher and ATAF1 levels slightly lower in NAC26-overexpressing line compared to wild-type line. background as well as environmental conditions needed to be taken into account when analyzing PAO and its regulators.
To determine the potential indirect genetic factors regulating PAO expression in Arabidopsis in particular, the list of total genes in Arabidopsis was narrowed down by whether there was a significant correlation with P<0.05 and then narrowed down further by whether the genes would predict PAO, resulting in 24 gene candidates, one of which was the transcription factor ATAF1 ( Table 1, S1-S3 Tables). This is relevant because PAO is known to be transcriptionally regulated [4], and ATAF1, a NAC domain family transcription factor, has been found in the analyses presented to be involved with PAO expression across wild type and transgenic plant lines, in freezing, senescence, and drought experimental microarray datasets.
Further correlation analysis and regression modeling pointed to the same conclusion that ATAF1 significantly predicts PAO and that PAO is positively correlated with ATAF1 ( Table 2, Fig 2). Frequency analysis showed that PAO and ATAF1 expression levels were either both high or both low in the various plant lines, again reinforcing the positive correlation between ATAF1 and PAO (Table 3). In particular, PAO and ATAF1 were both overexpressed under warmer conditions, during senescence, and under drought conditions as well as in certain plant lines like HSFC overexpressing lines. PAO and ATAF1 were also overexpressed in jasmonic acid/ethylene pathway mutant lines and salicylic acid pathway transgenic line despite pathway disruption in these lines, suggesting that jasmonic acid/ethylene pathway and salicylic acid pathway, which are involved in senescence-induction, work independently of ATAF1 to induce chlorophyll degradation. PAO levels were also higher despite slightly lower ATAF1 levels in the NAC26-overexpressing line, suggesting that NAC26 up-regulates PAO in a manner similar to ATAF1. Meanwhile both PAO and ATAF1 were underexpressed in the DEAR overexpressing line. These findings are compiled into a working model showing hypothesized relationships among the different elements that interact to regulate PAO expression (Fig 3).
Our model is further supported by several findings in recent literature showing a strong connection between NAC family domain transcription factors and various abiotic factors and plant processes, including temperature and leaf senescence, serving as important regulators. For example, NAC domain family member NAM-B1 was found to accelerate senescence and increase zinc and iron content in wheat grain [23], while another member JUNGBRUNNEN1 was found to regulate longevity and tolerance to abiotic stress including salt, cold, and heat in Arabidopsis plants [24] and NAC family member AtNAP was found to trigger senescence when overexpressed [25]. These findings suggest that ATAF1 plays similar roles in pre-freezing treatment, senescence, and drought conditions, likely due to their similar protein composition containing the same NAC domain. Recent literature suggests a similar scheme for ethylene signaling and leaf senescence-induced chlorophyll degradation through ethylene insensitive 3 (EIN3), similar to ein2 gene investigated here, acting through another NAC domain gene ORE1, similar to ATAF1 gene, to modulate PAO expression [26]. ATAF1 overexpressing lines were also found to have stunted growth and delayed flowering [27], affirming connections with chlorophyll degradation.

Conclusion
PAO transcript expression was found to be significantly up-regulated in warm conditions, during leaf senescence, and in drought conditions, and in all three conditions significantly positively correlated with expression of ATAF1, a NAC transcription factor implicated in the literature as being related to all three of these types of conditions. This analysis posits a regulatory network in which ATAF1 is triggered in response to these abiotic stresses and acts to regulate chlorophyll degradation by up-regulating PAO expression. PAO is represented by probe name _246335_at, while Plant class variable represents the 4 plant lines of the study representing wild-type or NAC26 line in well-water or drought conditions. Note significant factor of Plant (p = 0.0092). Box plots displayed lower to upper quartiles with central horizontal line representing median, and diamond representing mean. (TIF) S1 Table. Stepwise multiple linear regression to predict PAO through 24 correlated genes. After performing stepwise selection with a multiple linear regression model containing all PAO-correlated genes, these 24 candidates were identified as together forming a significant model (p<0.05) predicting PAO expression and explaining all variation in PAO expression (R 2 = 1). (DOCX) S2 Table. Forward multiple linear regression to predict PAO through 24 correlated genes. After performing forward selection with a multiple linear regression model containing all PAO-correlated genes, these 24 candidates were identified as together forming a significant model (p<0.05) predicting PAO expression and explaining all variation in PAO expression (R 2 = 1). (DOCX) S3 Table. Further functional descriptions of 24 candidates predicting PAO expression. The 24 candidates predicting PAO in the multiple linear regression models were annotated using the DAVID tool. In particular, probe name_261564_at, here aliased by TAIR ID AT1G01720, was found to be relevant due to its transcription factor capacity. (DOCX) S1 Appendix. SAS code for statistical analysis of microarray datasets. The written SAS code is presented below for the statistical analysis of the GSE55907, GSE5727, and GSE72050 microarray datasets.