Robust learning algorithms for capturing oceanic dynamics and transport of Noctiluca blooms using linear dynamical models

The blooms of Noctiluca in the Gulf of Oman and the Arabian Sea have been intensifying in recent years, posing now a threat to regional fisheries and the long-term health of an ecosystem supporting a coastal population of nearly 120 million people. We present the results of a local-scale data analysis to investigate the onset and patterns of the Noctiluca blooms, which form annually during the winter monsoon in the Gulf of Oman and in the Arabian Sea. Our approach combines methods in physical and biological oceanography with machine learning techniques. In particular, we present a robust algorithm, the variable-length Linear Dynamic Systems (vLDS) model, that extracts the causal factors and latent dynamics at the local-scale along each individual drifter trajectory, and demonstrate its effectiveness by using it to generate predictive plots for all variables and test macroscopic scientific hypotheses. The vLDS model is a new algorithm specifically designed to analyze the irregular dataset from surface velocity drifters, in which the multivariate time series trajectories are having variable or unequal lengths. The test results provide local-scale statistical evidence to support and check the macroscopic physical and biological Oceanography hypotheses on the Noctiluca blooms; it also helps identify complementary local trajectory-scale dynamics that might not be visible or discoverable at the macroscopic scale. The vLDS model also exhibits a generalization capability (as a machine learning methodology) to investigate important causal factors and hidden dynamics associated with ocean biogeochemical processes and phenomena at the population-level and local trajectory-scale.


Background
the behavior and movement of the Noctiluca blooms along the trajectory of the particular drifter. This behavior and movement is statistically learned or transformed by the vLDS model. Since the drifters have different launch time and longevity, the time series representing individual drifter trajectories have unequal (variable) lengths. To our knowledge, no existing model can simultaneously process regularly-sampled multiple multivariate time series with variable lengths collected by velocity drifters. The goal in this paper is to describe a variablelength Linear Dynamic Systems (vLDS) model that is tailored to this particular data structure, to learn, summarize, and recover the latent dynamics for all drifter trajectories, and to generate predictive dynamics that match closely with the observed data along drifter trajectories. The previous analysis of the phytoplankton blooms in [25][26][41][42] was based on the macroscopic scale of space and time, namely, the data is aggregated or pooled across spatiotemporal dimensions. This research hypothesized (1) that nutrient-enrichment of the surface waters are increasing productivity. The potential sources of nutrients are multiple [25][26]. Moreover, these previous research hypothesized (2) that Noctiluca grew faster in light than in dark on the sea surface and in the sea water, thanks to its sun-loving endosymbiotic algae, which are thought to have survived 1.3 billion years on an oxygen-scarce Earth. However, in our study, the surface velocity drifter dataset [43] and satellite image dataset [44][45][46][47] are not aggregated or pooled across spatio-temporal dimensions. With the data structure of individual drifter trajectories kept intact, the vLDS model tests the previous hypotheses on the dynamics of the Noctiluca blooms from the spatio-temporal trajectory-scale. By comparing the vLDS model predictions directly with the observed ocean profiles along the drifter trajectories, we can easily visualize its predictive performance and interpret the underlying latent dynamics of the Noctiluca blooms at the trajectory scale, as discussed in the "Discussion & conclusion" Section.
The benefits of vLDS are threefold. First, it provides statistical evidence in a direct and zoomed-in manner to compare the macroscopic physical and biological oceanographic observations and the inherent physiological behavior of Noctiluca blooms, by recovering the latent dynamics that governs the probabilistic distribution of the Noctiluca concentration in space and time and by comparing the model predictions with the observed ocean profiles. Second and more importantly, it helps identify complementary local drifter-scale dynamics that might not be visible or discoverable at the macroscopic scale. The vLDS predictive plots in "Discussion & Conclusion" Section provide statistical evidence that the atmospheric deposition measured by the quantities T865 aerosol optical thickness at wavelength 865 nm does not have much impact on the underlying dynamics that are driving the Noctiluca growth, as measured by the chlorophyll a concentration (Chl a) at the time scale of two days. Third, it provides a generalizable machine learning methodology to probe important causal factors and hidden dynamics for the ocean biogeochemical processes at the local population-level along individual drifter trajectories. These scientific findings in the local trajectory-scale of the population data can lead to critical hypothesis and even conclusions at the macroscopic scale of the pooled data.

Data collection
The Arabian Sea (coordinate range~5 to 28˚N, 45 to 75˚E) is predominantly located in the tropics (Fig 1A), and it has one of the most energetic current systems driven by the seasonally reversing monsoons. Dataset on Chlorophyll a (Chl a), from the GlobColour Project [45][46][47], which provides merged products based on measurements from the ocean color satellites Sea-WiFS, MODIS-Aqua (NASA), VIIRS (NOAA) and MERIS and OLCI-A (ESA), was used for studying the distribution of Noctiluca blooms during winter. In practice, the Ocean color datasets have missing values at certain locations due to the limitations of the satellite coverage or the presence of clouds. The dark spots are regions with missing values (Fig 2). For the purpose of our study, we used merged products from both NASA and the GlobColour Project [44][45][46][47]. Ocean color satellites can provide remote sensing reflectance values for different wavebands. These wavebands are used in empirical and semi-analytical algorithms to convert remote sensing reflectance to chlorophyll a concentration. Pre-processed Chl a data products were used to explore a time series of snapshots of chlorophyll a concentration on a lattice of latitude and longitude coordinates. In the next sections, we provide detailed descriptions of the data aggregation and preprocessing steps.
The temporal evolution of the satellite images reflects both physical and biological dynamics. To impose the structures of physical drivers (advection) onto the data sample, we utilized the drifter array data from the NOAA's Global Drifter Program (GDP). These freely drifting buoys provide information about the upper ocean currents that are responsible for the advection of the planktonic particles [48]. The Lagrangian trajectory of each float is retrieved from the database as a time series of variables representing the location, velocity field, and sea surface temperature. We note that each float has a typical lifetime of a couple years and has different launch times. Therefore, the drifter dataset is highly heterogeneous in both time and space. Fig 3 displays the temperature measurements of all the drifters in the Arabian Sea. It is known that the Lagrangian drifter trajectories are highly chaotic [49][50][51], and the prediction of particle trajectories has been a challenging research task [52]. There have been recent research results on using Kalman filter and data assimilation with various physical models, namely, the Gauss-Markov Lagrangian particle model [8], the Eulerian velocity field [9,11], and the upper ocean horizontal momentum balance model from Ekman dynamics [10], to track the position and velocity of the floats. For comparison in our study, we are introducing and imposing statistical structure on the latent state variables to capture the joint dynamics among the Chl a, the spatio-temporal information of the floats, namely, the latitude, longitude, velocity, speed and distance to the coast, and the physico-chemical predictors, such as CDOM, KD490, T865, PAR, and SST4, where KD490 is the diffuse attenuation coefficient at 490 nm using the Lee algorithm indicating light under the sea surface, and PAR is photosynthetically available radiation indicating light on the sea surface. The predictive plots generated by the vLDS model in our study (as displayed in Section "Discussion and Conclusions") reveal the relationships among the physical and physico-chemical ocean profiles and the Noctiluca blooms inside a collection of chaotic drifter trajectories in the Arabian Sea region from 2002 to 2017.

Combining multiple dataset
We merged the satellite data with the buoy data to generate a Lagrangian dataset. It is a collection of multivariate time series for each drifter with a unique id to combine the information from the satellites, namely, Chl a, CDOM, KD490, T865, PAR, and SST4, and the data associated with the drifters including id, time, latitude, longitude, velocity components, speed, and distance to the coast. Fourteen features were selected in our experiments; the variance inflation factors (VIF) on multicollinearity of all the predictors are listed in Table 1. The drifter id and time are mainly used for ordering and grouping data in the vLDS model. The twelve (12) remaining factors represent the physical and physico-chemical variables that is related to the evolution of Noctiluca blooms [25][26]. Since the timescale of the phytoplankton reproduction is on the order of a few days [53], we carried out a resampling process to match the frequencies of both datasets to the same level. At the same frequency, we interpolated the variables from the satellite dataset onto the specific spatial and temporal points of the Lagrangian drifter dataset, in order to make each observation in the drifter dataset more informative. This process builds up a multivariate time series for each drifter id with all the physico-chemical and physical information embedded on the drifter trajectory. Furthermore, this interpolation process is repeated for all other features from the satellites.
We note that each float record has information on its coordinates flat; lon; timeg. For convenience, we denote flat; lon; timeg as fx; y; tg, which is almost always not precisely on the grid of the Ocean Color dataset. To resolve this issue, we now describe the multidimensional interpolation operator to map the chlorophyll a concentration onto the GDP float dataset. For any float data point with coordinates fx 0 ; y 0 ; t 0 g, we identify the coordinate cube or the grid cell in the Ocean Color dataset that contains this point. In particular, this cube has 8 vertices with coordinates generated by the outer product x nearest ; x furthest  Robust learning algorithms for capturing oceanic dynamics of Noctiluca blooms using linear dynamical models each coordinate, respectively. We further denote the interpolation weight w x for x 0 by Similarly, we define the weights w y and w t for y and t. Using the function f ðx; y; tÞ to represent the chlorophyll a concentration at the coordinate fx; y; tg, we write the interpolated chlorophyll a concentration at fx 0 ; y 0 ; t 0 g as In our implementation, we have applied the interpolation process to the chlorophyll a concentration, distance to the nearest coast, and all other predictors. As an illustration of the interpolation process, we display the distance to the nearest coast [40] with a resolution of 4km in Fig 4A and the interpolated distance to the nearest coast for all the floats in Fig 4B. Using the multidimensional interpolation procedure described above, we map the satellite observations for each of the variables {'chlor_a', 'dist', 'cdm', 'kd490', 't865', 'par', 'sst4'} onto the GDP floats. Along with the information on the floats, namely {'time', 'id', 'lat', 'lon', 've', 'vn', 'spd'}, the interpolated float dataset becomes high dimensional (Table 1). Here 'chlor_a' denotes the chlorophyll a concentration, 'dist' the distance from nearest coast, 'cdm' the  The 'cdm' measures CDOM the amount of dissolved organic materials in the sea water, which supports the growth of the Noctiluca [26]. The 'dist' measures the distance from the particle to the nearest coast, which is the source of the nutrient rich water. Moreover, in the winter season, the northwestern Arabian Sea off the coast of Oman experiences winter convective mixing from November to January, during which nutrient rich, low-oxygen, cold water is brought to the surface both by convective mixing and by cyclonic eddy activity and benefit the growth of the Noctiluca in a complex and nonlinear fashion as described in the "Discussion and Conclusions" Section. The factor 'par' measures PAR the amount of light that is available on the sea surface for the photosynthesis of the symbiotic green algae ( Fig 1B) living within Noctiluca. The 'kd490' provides an indication of the transparency of the water column and amount of light that penetrates into the sea water. The 't865' represents the amount of particles in the atmosphere over the water, an indicator for the atmospheric deposition. The rest of the factors {'time', 'id', 'lat', 'lon', 've', 'vn', 'spd'} represent the spatio-temporal information describing the physical transport and dispersal of the Noctiluca blooms.

Data preprocessing for vLDS
Due to the limitation of the satellite coverage mentioned in the "Data Collection" Section, there are missing values in each of the variables in the interpolated dataset. Our focus here is on the chlorophyll a concentration 'chlor_a', since it is the key variable for Noctiluca blooms.
The data structure of the post-processed float dataset is determined by the pre-processing steps for the variable 'chlor_a'. The overall objective of data preprocessing is to keep the microscopic trajectory-based data structure intact, to split the drifter trajectories that are spanning over multiple years, and to remove the drifter trajectories that are too short to be meaningful for the learning algorithm, with due consideration of the physical and physico-chemical meaning. We consider that each period from November 1 to March 31 (during winter monsoon) represents one growth cycle of the Noctiluca bloom for a particular float or drifter. For any float with a unique id that has chlorophyll a data over two or more cycles, we split the data and assign a new derived float id to the data within each cycle, by adding a small increment 0.05 to the original float id. The resulting dataset allows the vLDS model to learn the trajectory-scale dynamics of each cycle independently.
To enhance the quality of the raw input dataset, for each float id, we calculate the percentage of 'NaN' values in the dataset in the column 'chlor_a' and choose a threshold of 40%. There are two cases. First, if this percentage is smaller than the threshold, the data quality for this particular float is considered to be good, and we interpolate all the missing values for each of the vari- Fig 5 for an example, in which the time series is interpolated for 'chlor_a'. Also, after the interpolation process, there might still be gaps in the time series. For instance, the float might just not have any record, including 'NaN', in a certain short period. In this case, the float will be further split into continuous subseries. Therefore, every interpolated float time series will go through the second step for checking and splitting, which we now describe.
In the other case, if this percentage of 'NaN' values is larger than the threshold, the data quality for this particular float is considered unsuitable for interpolation. We split the discontinuous series into smaller continuous series for 'chlor_a'. We loop through the time series on 'chlor_a' and split it into smaller continuous series for 'chlor_a'. Moreover, we assign a new derived float id to each newly generated shorter series, by adding a small increment 0.03 to the original float id. Also, we drop any series for 'chlor_a' of length 1. See Fig 6 for an example, in which we split the time series for 'chlor_a' into 5 different shorter series, and we drop three series of length 1. For a threshold of 40%, the irregularity of the drifter dataset is evident from the distribution of the trajectory lengths characterized by [10,16,26,53] at the [20%, 40%, 60%, 80%] quantiles, respectively. For a threshold of 20%, the quantiles are [7,10,16,30]. For a threshold of 10%, the quantiles are [7,9,11,19]. It is evident that with a tighter threshold, such as 20% or 10%, many of the long time series will be split into too many shorter time series. Many of the resulting shortened series are too short, comparing to the time series in the heldout dataset. Therefore, we have fixed a threshold of 40% in the following experiment.

Linear dynamical systems (LDS)
After preprocessing, the merged GDP floats dataset consisted of 186 float records. For each float, the measurement is a multivariate time series, given by a vector y:¼ flat; lon; ve; vn; spd; dist; chlor a; par; cdm; t865; kd490; sst4g: . . . x n T n Þ to denote the latent variables, y n ¼ ðy n 1 ; y n 2 ; . . . y n T n Þ the observations from the float n, and T n the length of the time series of the float n. The plain version Linear Dynamical System (LDS) is an adaptive procedure that can learn from data to recover the latent relationship between y and x, using assumptions of linear relationships at time i between x i and x iÀ 1 ; y i and x i , with Gaussian Noise. For the moment, we omit the superscript n and focus on one float. More specifically, we assume that for a specific float, the time series of the latent variables and the observations hold the following relationships: where w i ; v i ; u are noise terms. The LDS model fits the model parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g by taking the expectation over the latent variables x i jθ old , and maximizing the log-likelihood of the complete data fx; yjθg, where θ old is the model parameter from the previous iteration and θ is the parameter that we are seeking at the current iteration. In the expectation step, with the parameter θ old , the mean and the variance of the posterior marginal latent variables x i jθ old ; y 1 ; y 2 ; . . . y i at time i (see float 1 in Fig 7A) and the mean and the variance of the posterior marginal latent variable x i jθ old ; y 1 ; y 2 ; . . . y T based on the information at all time (see float 1 in Fig 7B), are calculated using the forward and backward iterations. Here, T is the total length of a particular time series in a float. These marginal variables lead to the sufficient statistics of the complete data fx; yjθg, namely the E xjθ old ;y 1 ;y 2 ;...y T ½x i �, E xjθ old ;y 1 ;y 2 ;...y T ½x i x T iÀ 1 �, E xjθ old ;y 1 ;y 2 ;...y T ½x i x T i �. Using these sufficient statistics [54,55], we obtain the updated LDS model parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g. The graphical model [56] of the plain LDS is schematically plotted in Fig 7A and 7B as one branch, for instance, the branch of float 1. The workflow of the plain LDS model is described in Algorithm 1.

Variable-length linear dynamical systems (vLDS)
In this study, each float generates one or more statistically independent time series of the Chl a concentration, due to the interpolation or splitting process discussed in the "Data Preprocessing" Section. For the preprocessed dataset with 186 floats, we treat it as multiple multivariate time series, each with a unique id. Also, we note that the lengths of the time series in the dataset are mostly different, due to the irregularity of the longevity of the floats. The variable-length Linear Dynamical Systems model is specifically designed to address this situation, as it summarizes and recovers the latent dynamics from multiple multivariable time series with a different time span.
To fit the vLDS model, we start with some initial parameter Θ 0 , which is shared across all floats in the dataset. We keep the superscript n here. The Expectation step is carried out on each float id, using a two-loop forward and backward smoothing step to compute the conditional expectations of the sufficient statistics of the complete data fx; yjθg, namely the E xjθ old ;y 1 ;y 2 ;...y T n ½x i �, E xjθ old ;y 1 ;y 2 ;...y T n ½x i x T iÀ 1 �, E xjθ old ;y 1 ;y 2 ;...y T n ½x i x T i �. In the maximization step, we use the averaging formula derived in Eq (5) across all floats to update the model parameter Θ :¼ A; C; G; S; μ 0 ; V 0 f g. We emphasize that for a particular drifter n, the multivariate observation y n i ; i ¼ 1; 2; 3; . . . T n , contains all the physical and physico-chemical information along a drifter trajectory at time i, and it is the multivariate latent random variable x n i that we are solving for from the vLDS model to represent the hidden dynamics between different Robust learning algorithms for capturing oceanic dynamics of Noctiluca blooms using linear dynamical models components of y n i , namely, flat; lon; ve; vn; spd; dist; chlor a; par; cdm; t865; kd490; sst4g: It is possible that some of the components of y n i , for instance, the t865 in our study, as demonstrated in the "Discussion and Conclusions" Section, are not much involved in the latent dynamics. Therefore, the dimension of the latent space recovered by the latent variable x n i might be smaller than the dimension of the observations y n i . In this study, the latent dimension in x n i , as determined by the cross-validation procedure, turns out to be 11, and the dimension of the observations y n i is 12. To test the hypotheses in the "Background" Section, we first generate the predicted values of y n i e by using Eq (1) with the recovered latent variables x n i from the vLDS model. Since the vLDS model automatically maximizes the log-likelihood of the complete data, including the latent variable x and the observational variable y, it is beneficial to visualize the predictive plots of all predictors along the drifter trajectories, as shown in the "Discussion and Conclusions" Section, and study the performance metric R-squared, as described in the "Results" Section.

Probabilistic computation of vLDS
We assume that the observed multivariate variable y n i , including the chlorophyll a concentration, of each float is evolving independently with other floats, except that they are driven by the same underlying physico-chemical and physical forces. It is this assumption that allows the vLDS model to share the same model parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g in all branches in Fig 7A and 7B and makes the vLDS model a powerful algorithm to summarize and capture the population-level structures along drifter trajectories. Multiple variants of the LDS model have been applied successfully at the microscopic scale in several research areas such as computational neuroscience [19][20][57][58] and sound tracking [22], in which the authors considered various extensions of LDS to data sequences with fixed length. Our vLDS method is different, and particularly designed for trajectory-based data sequences with variable lengths. The irregularity of the drifter dataset is evident from the distribution of the trajectory lengths characterized by [10,16,26,53]  Although the distribution of the complete data fx; yjθg depends on the model parameter θ, we omit the dependence on θ for notational ease in the following derivation. For a particular float n, letting i be the time step and T n be the total length of the time series, the distribution of the complete data (namely the observations y and the latent variables x) can be written as Pðy n i jx n i Þ logPðfx n ; y n gÞ ¼ logPðx n 1 Þ þ By the assumption of independence, the joint probability of the observations and state variables across all floats expands into the product of the joint probability of the observations and state variables of all the time series generated by each float n ¼ 1; 2 . . . We note that, for each float, the preprocessed data of this float might generate multiple multivariate time series (see the "Data Preprocessing" Section for more details.) Under the traditional i.i.d. assumptions, the objective function of the vLDS model is simply the addition of the objective functions for each individual time series' log-likelihood with all the parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g that define the plain version LDS for each float (or each box-branch in Fig 7) being shared across all the floats. Using the Expectation-Maximizing algorithm, the Expectation step can be carried out using a two loop backward and forward iteration for each time series independently, due to the conditional independence of the state variables across different time series of different floats. However, in the maximization step we need to average the vLDS model parameter across all the time series from all the floats.
The derivation of the update formula for the vLDS model parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g follows directly by taking derivatives of the complete data log-likelihood with respect to each component in Θ and by using the standard results from the Maximum Likelihood Estimators of the mean and variance for the Gaussian Distribution. For a dataset of many floats, the complete data log-likelihood has an addition summation sign running through n ¼ 1; 2 . . . N in Eq (4). We use T n , instead of T, to denote the length of the time series of the float n: Given the fact that the derivative of a linear combination of functions is a linear combination of derivatives of each function, we can write the updating formula for the maximization step as: The workflow of the vLDS model is described in Algorithm 2, and the averaging of the vLDS model parameters Θ across all the floats in Eq (4) is carried out in step 12. During this procedure, the vLDS model adaptively learns the latent dynamics of the underlying process. We have fitted the chlorophyll a concentration for the particular float in Fig 5 with id 64113560. The result is displayed in the "Discussion and Conclusions" Section.
Each Expectation-Maximization cycle of the LDS model for Gaussian random variables is guaranteed to increase the value of the complete data log-likelihood. Therefore, a standard stopping criterion for the Expectation-Maximization algorithm is based on the complete data log-likelihood in Eqs (2) and (4) with a relative tolerance rtol ¼ 10 À 4 and maximum iteration 100. (The source code of the vLDS implementation is available at: https://bitbucket.org/ yy2250cu/vlds-oceancolormodeling/src/).
One of the key model parameters in the LDS modeling is the dimension k of the latent space, namely, the number of components in the latent variable x. It is the dimension of the subspace generated by the projection of the full feature space onto the latent subspace, whose projection back onto the full feature space in Eq (1) under the vLDS linear transformation matrix C maximizes the complete data log-likelihood. A larger k indicates that there are more independent factors in the latent space of x driving the underlying dynamical system of fx, y}. Moreover, varying the values of the dimensionality k induces a family of different vLDS models (1)-(3) indexed by k. To select the model with the most appropriate parameter k, we carry out a 10-fold cross-validation [59][60][61] on the parameter k and choose the optimal k that achieves the maximum complete data log-likehood on the test dataset. More specifically, we group the dataset by float ids. We hold a portion of the floats ids and consider them as the heldout testing dataset. We take the rest of the float ids as the cross-validation dataset. In the cross-validation step, we split the cross-validation set evenly into 10 folds. Each time we take one fold as the testing dataset, we take the rest as the training dataset. We fit the vLDS parameter Θ :¼ A; C; G; S; μ 0 ; V 0 f g on the training dataset and compute the complete data log-likelihood on the testing dataset using this newly fitted parameter Θ. The complete data log-likelihood is averaged for different testing fold for a fixed k. Then, we repeat the entire process for different values of k. See Table 2 for the complete data log-likelihood generated by different cross-validation trails. The averaged complete-data loglikelihood across different testing sets is maximized at k ¼ 11.
With the optimal value k ¼ 11 of the latent space dimension identified, we fit the vLDS model one more time with the full cross-validation dataset to generate the vLDS model parameter. In Fig 8, we display the log-likelihood convergence of the Expectation-Maximization algorithm for the complete cross-validation dataset and five individual floats in the cross-validation dataset. We note that from Eqs (2)-(4), the log-likelihood of the complete cross-validation dataset is the sum of the log-likelihood of each individual floats in the cross-validation dataset (step 8 in Algorithm 2.)

Results
With the optimally selected latent space dimension k ¼ 11, the vLDS algorithm obtains a set of model parameters Θ :¼ A; C; G; S; μ 0 ; V 0 f g when the stopping criterion inside the Expectation-Maximization algorithm is reached. The spatial distribution of the vLDS prediction error for the chlorophyll a concentration is shown in Fig 9. Fig 10 shows the prediction results for some drifter ids in the cross-validation dataset, using Eq (1) and the expected conditional mean of the latent variables at the last iteration of the Expectation steps 4, 5, and 6 in Algorithm 2. The dark lines are the observations, and the cyan lines are the predictions. Most of the hidden dynamics of the float profiles inside the cross-validation dataset are well captured by the vLDS model. The R 2 values of the drifters in Fig 10 are 0.95, 0.98, 0.98, 0.98, 0.99, respectively. We note the positive correlations among 'chlor_a', 'cdm', and 'kd490' in the recovered vLDS latent dynamics (cyan lines in Fig 10) at the local drifter-scale and population-level in the cross-validation dataset. The model captures this correlation with some overshooting or undershooting in certain regions. Also, 't865', the aerosol optical thickness over water, turns out to be independent of the chlorophyll a concentration and other ocean profiles. Moreover, the spatial information, namely, the longitude, latitude, velocity, speed of the float, and Table 2. 10-fold cross-validation on the parameter k. The optimal k that achieves the maximum complete data log-likelihood on the test set is k ¼ 11. The unit of the test-dataset log-likelihood in the table is 10 4 .  distance to the nearest coast, is all well recovered by the vLDS model (lat and spd are not shown in Figs 10 and 11 due to space limitations.) We next examine on the robustness of the vLDS model. The floats in the heldout testing dataset are not used in the cross-validation process or the model's parameter estimation process. Therefore, the heldout testing dataset is totally unknown to the vLDS learning algorithm. We use the cross-validated latent dimension k ¼ 11, and the model parameter Θ :¼ A; C; G; S; μ 0 ; V 0 f g generated by training the vLDS model on the cross-validation dataset. Applying one iteration of the forward-backward smoothing process, namely, i.e., one iteration of the Expectation steps 4, 5, and 6 in Algorithm 2, to each float in the heldout testing dataset, we obtain the predictions of their profiles (Fig 11). Most of the hidden dynamics along drifter trajectories for the floats in the heldout testing dataset, which is totally unknown to the learning algorithm, is well captured by the vLDS model. The  Robust learning algorithms for capturing oceanic dynamics of Noctiluca blooms using linear dynamical models  Robust learning algorithms for capturing oceanic dynamics of Noctiluca blooms using linear dynamical models aerosol optical thickness over water, seems to be independent of the Chl a concentration and other ocean profiles in the heldout dataset. The vLDS model simply estimates the mean value for the variable 't865' in both the cross-validation and heldout datasets. However, by comparing the predictions generated by the vLDS model for 't865' and for other variables, we conclude that the variable 't865' is not much involved in the latent dynamics of the Noctiluca's growth. Otherwise, the predicted values of 't865' should match its observations in Figs 10 and 11, and the R 2 of 't865' in Table 3 should not be too small. Evidently, the vLDS model indicates that there is no strong relationship between 't865' and the latent dynamics of the Noctiluca's growth at the population-level along the drifter trajectories, a feature lacking in models that do not take the local trajectory-based population-level structure into consideration. Moreover, the spatial information of the heldout floats, namely, longitude, latitude, velocity, speed of the float, and distance to the nearest coast, is all well recovered by the vLDS model. In addition to the above-mentioned correlations that are recovered correctly from the vLDS predictive data stream fỹ i g, it is evident that both spatio-temporal {time, lon, lat, dist}, physical {ve, vn, spd}, and physico-chemical factors {sst4, cdm, kd490, par} are involved in this 11-dimensional ðk ¼ 11Þ latent dynamics for the Noctiluca blooms. It is important to note that vLDS serves as a mechanism to exclude irrelevant variables, such as 't865', for the underlying microscopic latent dynamics at the local drifter-scale, which is the Noctiluca's growth in this study.
To quantify the performance of the vLDS model, we use the R-squared metric ðR 2 Þ. In Table 3, the total sum of squares (SSTotal), sum of squared errors of predictions (SSE), and R 2 , which is the portion of the variance captured by the predictive model, are computed for both the cross-validation and heldout testing datasets. The standard deviations (stdev) of SSE and R 2 are also listed in Table 3. Although the vLDS model has the log-likelihood of the complete data in Eqs (3) and (4) as its own performance metric, we use R 2 here for an intuitive interpretation. The quantitative results reflect the visualization in Figs 10 and 11. The feature 't865' has a very small value in its R 2 metric and does not exhibit any predictive power. The vLDS recovers the spatio-temporal information well, and explains most of the variance in the physicochemical factors {'cdm', 'kd490', 'par', 'sst4', 'chlor_a'}.

Discussion and conclusions
We have introduced a new model vLDS and showed that it offers a new local-scale trajectorybased data analysis tool to recover biogeochemical mechanisms underlying chaotic drifter trajectories that might be unobservable at the macroscopic scale or accessible only in controlled laboratory experiments. The vLDS model generates predictions that recover the causal relationship among the Noctiluca blooms, physical dispersal, and physico-chemical environments (Figs 10 and 11 and Table 3.) The model's generalization capability also summarizes, recovers, and predicts the latent dynamics from unknown heldout testing datasets, thus inspiring confidence in our local-scale findings along drifter trajectories and macroscopic findings of pooled data. The highly correlated relationships between the 'chlor_a' and 'cdm' (colored dissolved organic matter CDOM), and between the 'chlor_a' and 'kd490' (light under the sea surface) are close to linear. The tightly correlated relationships between the 'chlor_a' and 'par' (light on the sea surface PAR), and between the 'chlor_a' and 'sst4' (sea surface temperature SST4) are nonlinear. The vLDS model does not provide evidence of a strong relationship between 't865' and the latent dynamics of the Noctiluca's growth.
Furthermore, in the vLDS model, individual components are not assumed to be mutually independent in the multivariate random variable. After the prediction step, the linear correlations are only one aspect of insights that can be obtained from vLDS. In fact, correlations are linear relationships. The latent dynamics recovered by vLDS predictions, on the other hand, is not simply a linear correlation. Although the vLDS is named Linear, it is evident from the updating formula Eq (1) that recursively applying linear transformations on the latent variable x iÀ 1 makes both the latent data stream fx i g and the predictive data stream fỹ i g generated by the vLDS highly nonlinear. So the data stream fỹ i g recovered or generated by vLDS is not simply linear correlated. Instead, it is on a low-dimensional nonlinear manifold generated by vLDS. It is evident that both spatio-temporal {time, lon, lat, dist}, physical {ve, vn, spd}, and physico-chemical factors {sst4, cdm, kd490, par} are involved and correctly recovered in this 11-dimensional ðk ¼ 11Þ latent dynamics for the Noctiluca blooms. It is important to observe that vLDS also serves as a mechanism to exclude irrelevant variables, such as 't865', for the underlying local trajectoryscale latent dynamics in general, beyond the results in this study for Noctiluca's growth.
These results confirm the macroscopic hypotheses in the "Background" Section from the local trajectory-scale perspective, and confirms the impact of both the physical transport and physico-chemical factors of light and nutrients, proxies for the latter being CDOM, on the distribution of the Noctiluca blooms. Also, the test results imply that the nutrient and light (light on and under the sea surface) are important positive factors for the Noctiluca's growth. Regarding the atmospheric deposition 't865', the vLDS model does not provide evidence of a strong relationship between 't865' and the latent dynamics of the Noctiluca's growth (Figs 10 and 11 and Table 3). Due to the fact that the nutrient dynamics involving the atmospheric deposition may have lagging and cumulating effects, further research regarding the role of the atmospheric deposition in the Noctiluca's growth is needed. It has also been confirmed from the drifter dynamics recovered by the vLDS Model that Noctiluca grow faster in lighted than in dark areas on the sea surface and in the sea water.
We have demonstrated the effectiveness of the vLDS model as a local-scale trajectory-based statistical modeling tool for detecting important causal relationships in biogeochemical processes. Although the trajectories of the oceanographic probing devices are chaotic and the dataset is high dimensional, the vLDS model is very parsimonious on model parameters. The model only requires Θ :¼ A; C; G; S; μ 0 ; V 0 f g and the latent-space dimension k to be able to summarize all the drifters in the Arabian Sea region from 2002 to 2017. The predictive dynamics matches the local-scale observations along drifter trajectories well, and affords tremendous confidence in support of the macroscopic hypotheses.
Furthermore, the intertwined relationships recovered by the vLDS model between the physical and physico-chemical dynamics of the Noctiluca blooms and the intertwined relationships among the physico-chemical factors such as 'cdm' and 'kd490' have inspired us to use inference tools to quantify the isolated impact of the physico-chemical factors that are responsible for the Noctiluca blooms as 'chlor_a' in the Arabian Sea region. The vLDS model presented here is fully generalizable to other datasets for other applications, such as larval transport in marine ecology.

Code availability
The source code for the variable-length Linear Dynamical System (vLDS) method is available at: https://bitbucket.org/yy2250cu/vlds-oceancolormodeling/src/ Supporting information S1 Software. Source code of the vLDS implementation.