Boosting performance in machine learning of turbulent and geophysical flows via scale separation

Recent advances in statistical learning have opened the possibility to forecast the behavior of chaotic systems using recurrent neural networks. In this letter we investigate the applicability of this framework to geophysical flows, known to be intermittent and turbulent. We show that both turbulence and intermittency introduce severe limitations on the applicability of recurrent neural networks, both for short-term forecasts and for the reconstruction of the underlying attractor. We suggest that possible strategies to overcome such limitations should be based on separating the smooth large-scale dynamics from the intermittent/turbulent features. We test these ideas on global sea-level pressure data for the past 40 years, a proxy of the atmospheric circulation dynamics.

The advent of high-performance computing has paved the way for advanced analyses of high-dimensional datasets [1,2]. These successes have naturally raised the question of whether it is possible to learn the dynamical behavior of a system without simulating the underlying evolution equations. Such an interest is motivated, on one side, by the fact that many complex systems still lack a universally accepted state equation (e.g. brain dynamics [3], macro-economic and financial systems [4]) and, on the other, by the need to reduce the complexity of the dynamical evolution for systems where the underlying equations are known, e.g. geophysical and turbulent flows [5]. From a mathematical point of view, the Navier-Stokes equations, which are the cornerstone of turbulent flow dynamics, still lack a complete proof of the existence and uniqueness of their solutions, one of the Clay Millennium Prize problems [6]. They are difficult to simulate for large systems such as geophysical flows, so that approximations and parameterizations are needed for meteorological and climatological applications [7]. These difficulties are compounded by those encountered in modelling the phase transitions that lead to clouds and convection, which are a major source of uncertainty in climate modelling [8]. Machine learning techniques capable of learning the dynamics of geophysical flows could help improve those approximations and avoid running costly simulations representing all spatial/temporal scales. Several efforts have recently been made to apply machine learning to the prediction of geophysical data [9], to learn parameterizations of subgrid processes in climate models [10-12], for nowcasting [13-15] and forecasting [16-18] of weather variables, and to quantify the uncertainty of deterministic weather prediction [19].

* Correspondence to davide.faranda@lsce.ipsl.fr
However, these attempts mainly consist of complementing deterministic models or of short-term prediction of time-series data: while their utility may be substantial, they are still far from learning the dynamics of a complex system such as the Earth's climate. A first great step in this direction was the use of Echo State Networks (ESN, [20]) to forecast the behavior of chaotic systems, such as the Lorenz 1963 [21] and the Kuramoto-Sivashinsky [22] dynamics. It was shown that ESN predictions of both systems attain performances comparable to those obtained with the real equations [23,24]. Good performance of regularized ESN in the prediction of multidimensional chaotic time series has also been obtained, both from simulated and real data [25]. This success motivated several follow-up studies with a focus on meteorological and climate data. These are based on the idea of feeding different statistical learning algorithms with data issued from dynamical systems of different complexity, in order to study short- and long-term predictability. Recent examples include equation-informed moment-matching for the Lorenz 96 model [26,27], applying multi-layer perceptrons to reanalysis data [28], or convolutional neural networks to simplified climate simulation models [29,30]. All those learning models were capable of recovering some short-term predictability but failed to obtain a long-term behavior coherent with the input data.
In this letter we investigate the possible origins of these limitations by analyzing two of the most striking features of geophysical flows, namely turbulence and intermittency. Turbulence affects the dynamics by injecting energy at several spatial and temporal scales. This results in non-smooth trajectories within the flow, leading to local unpredictability and increasing the number of degrees of freedom needed to describe the dynamics [31]. Intermittency triggers large fluctuations of observables of the motion in time and space [32]. Both effects make geophysical flows challenging to describe, both for short-term forecasting and for reconstructing the properties of the underlying climate attractor.
By applying ESN to turbulent and intermittent systems, we investigate the limitations that these effects introduce when applying machine learning techniques to geophysical flows. We begin with the simplest possible setup: we simulate the effects of turbulence by artificially introducing small-scale dynamics into the Lorenz 1963 equations [21] via additive noise. As an example of intermittent behavior, we investigate the Pomeau-Manneville map [33] stochastically perturbed with additive noise. We show how the performance of ESN in predicting the behavior of the system deteriorates rapidly with increasing noise, and investigate how scale separation, introduced with a moving average filter, recovers short-term predictability. The idea of using a moving average for scale separation is already established for meteorological variables [34]. We choose the ESN framework following the results of [23,24] and an established literature on its ability to forecast chaotic time series and its stability to noise. For example, [35,36] analyse and compare the predictive performance of simple and improved ESN on simulated and observed one-dimensional chaotic time series. We aim at understanding this sensitivity in a deeper way, while assessing the possibility of reducing its impact on prediction with simple noise-reduction methods. This step is of crucial importance, as noise is inevitably present in every dataset issued from real-world systems. The remainder of this letter is organised as follows: first we give an overview of the ESN method and describe the systems used. Then we show the results for the perturbed Lorenz equations and the Pomeau-Manneville intermittent maps. We discuss the improvement obtained on the short-term prediction and on the long-term attractor reconstruction with the moving average filter (see [37]-B). We conclude by testing these ideas on atmospheric circulation data.
Echo State Networks (ESN) are implemented as follows; the code is given in [37]. Let x(t) be the K-dimensional observable originating from the dynamical system and r(t) the N-dimensional reservoir state; then

r(t + dt) = tanh(W r(t) + W_in x(t)),

where W is the adjacency matrix of the reservoir: its dimensions are N × N, with N the number of neurons of the reservoir, and its coefficients are randomly sampled from a uniform distribution in [−0.5, 0.5]. W_in, with dimensions N × K, is the weight matrix of the connections between the input layer and the reservoir; its coefficients are randomly sampled as for W. The output of the network at time step t + dt is x̂(t + dt) = W_out r(t + dt), where x̂(t + dt) is the ESN prediction and W_out, with dimensions K × N, is the weight matrix of the connections between the reservoir neurons and the output layer. We estimate W_out via a ridge regression [38]. In the prediction phase we have the recurrent relationship

r(t + dt) = tanh(W r(t) + W_in x̂(t)),   x̂(t + dt) = W_out r(t + dt).

The systems we analyze are the Lorenz 1963 attractor [21] with the classical parameters, discretized with an Euler scheme and dt = 0.001, and the Pomeau-Manneville intermittent map with β = 0.91 [33] (see also [37]-C). The systems are perturbed via additive noise drawn from a Gaussian distribution with zero mean and standard deviation ε. The initial conditions are randomly selected within a long trajectory of 5 · 10^6 iterations. The training phase is fixed to 10^5 iterations, starting from the initial conditions, and the prediction phase to 10^4 iterations. These values have been derived after studying the performance of the ESN as a function of different learning windows (see [37]-C). For each dataset we run the ESN only once, but for 100 different initial conditions.
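The training and prediction steps above can be sketched as follows. This is a minimal illustration, not the code of [37]: the spectral-radius rescaling (rho = 0.9) and the ridge parameter (beta = 1e-6) are illustrative choices that the text does not specify.

```python
import numpy as np

def make_esn(K, N, rho=0.9, seed=0):
    """Random reservoir W (N x N) and input matrix W_in (N x K),
    entries uniform in [-0.5, 0.5]; rescaling the spectral radius of W
    to rho is a common practice, assumed here."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (N, N))
    W *= rho / np.abs(np.linalg.eigvals(W)).max()
    W_in = rng.uniform(-0.5, 0.5, (N, K))
    return W, W_in

def train_esn(X, W, W_in, beta=1e-6):
    """X: (T, K) training trajectory. Collect reservoir states and fit the
    readout W_out by ridge regression. Returns W_out and the final state."""
    N = W.shape[0]
    R = np.zeros((len(X), N))
    r = np.zeros(N)
    for t in range(len(X) - 1):
        r = np.tanh(W @ r + W_in @ X[t])   # r(t+dt) = tanh(W r(t) + W_in x(t))
        R[t + 1] = r
    W_out = X.T @ R @ np.linalg.inv(R.T @ R + beta * np.eye(N))
    return W_out, r

def predict_esn(x0, r, W, W_in, W_out, steps):
    """Autonomous prediction: the ESN output is fed back as the next input."""
    xs = [x0]
    for _ in range(steps):
        r = np.tanh(W @ r + W_in @ xs[-1])
        xs.append(W_out @ r)
    return np.array(xs[1:])
```

In the training phase the true observable x(t) drives the reservoir; in the prediction phase the loop is closed on the network's own output x̂(t).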
In this paper, we use three different indicators of the performance of the ESN: i) an indicator based on the χ² distance to assess whether the ESN forecast has reproduced a distribution of the variable compatible with the invariant density of the system (see [37]-B). Here, by the term invariant density we refer to the physical measure of the discretized attractors, often referred to as the Sinai-Bowen-Ruelle measure [39]. The null hypothesis to be rejected is that the ESN-generated data have the same distribution as those generated using the equations. We define φ as the rate of failure of this test averaged over 100 initial conditions. ii) a measure of the predictability horizon of the ESN forecast compared to the equations. For this purpose we use the root mean square error (RMSE) and define the saturation time τ_s as the first time that the RMSE exceeds a certain threshold s, defined as 1/10 of the average distance on the attractor (results are independent of s provided that it is smaller than the average distance on the attractor and a few orders of magnitude larger than the numerical precision). iii) the initial error η = RMSE(t = 1). Equipped with these indicators we analyze two sets of simulations, performed with and without moving average. The moving average operation is the integral of x(t) between t − w and t, where w is the window size of the moving average.
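The moving-average filter and the saturation-time indicator can be sketched as follows; the pointwise error used for τ_s here is a simplification of the RMSE averaged over components and ensemble members used in the letter.

```python
import numpy as np

def moving_average(x, w):
    """Moving average of x over a window of w samples: the
    scale-separation filter applied before training."""
    return np.convolve(x, np.ones(w) / w, mode="valid")

def saturation_time(x_true, x_pred, s):
    """First time step at which the forecast error exceeds the threshold s
    (a sketch of tau_s for a scalar series); returns the series length if
    the error never saturates."""
    err = np.abs(np.asarray(x_true) - np.asarray(x_pred))
    above = np.nonzero(err > s)[0]
    return int(above[0]) if above.size else len(err)

def initial_error(x_true, x_pred):
    """eta = error at the first prediction step."""
    return float(np.abs(np.asarray(x_true)[0] - np.asarray(x_pred)[0]))
```

Note that `mode="valid"` shortens the series by w − 1 samples, so filtered forecasts must be compared against the filtered target.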
For the Lorenz attractor, we perform an ensemble of 100 ESN forecasts, varying both the noise intensity ε and the network size N. In Fig. 1(a,c,e) ESN forecasts are run without moving average, while in Fig. 1(b,d,f) we apply a moving average with w = 10 dt (dt = 0.001). Panels (a,b) show φ, the rate of failure of the χ² test averaged over the 100 initial conditions and computed considering only the x variable of the Lorenz 1963 system (results do not depend on the chosen variable). For increasing noise intensity and for small reservoir sizes, the performances without moving average rapidly deteriorate. The moving average operation with w = 10 dt greatly improves all three indicators, except when the noise is too large (e.g. ε = 1). In this case, for any number of neurons, the moving average operation smooths the data too much. Furthermore, the results exhibit only a weak dependence on the reservoir size. In [37] (section C and Figs. S1-S4) we give additional details on these computations.
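The noisy Lorenz 1963 data used in these experiments can be generated along these lines; adding the Gaussian perturbation at every Euler step is our reading of the setup, and the initial condition is an arbitrary choice.

```python
import numpy as np

def lorenz63_noisy(n, dt=1e-3, eps=0.0, seed=0):
    """Euler-discretized Lorenz 1963 with the classical parameters
    (sigma = 10, rho = 28, beta = 8/3), perturbed at each step by additive
    Gaussian noise of standard deviation eps."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, 3))
    X[0] = (1.0, 1.0, 1.0)   # arbitrary initial condition
    for t in range(1, n):
        x, y, z = X[t - 1]
        drift = np.array([10.0 * (y - x),
                          x * (28.0 - z) - y,
                          x * y - (8.0 / 3.0) * z])
        X[t] = X[t - 1] + dt * drift + eps * rng.normal(size=3)
    return X
```

The filtered training set is then obtained by applying the moving average with window w = 10 dt to each component before feeding it to the ESN.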
The analysis for the Pomeau-Manneville map (its definition is provided in [37]-C) is shown in Fig. 2. We first observe that forecasting the dynamics of the Pomeau-Manneville map is much harder than for the Lorenz system, a consequence of its intermittent behavior. Without filtering, the ESN never simulates an intermittent behavior, for any combination of noise intensity and reservoir size. This is reflected in the behavior of the indicators. Independently of the network size considered, in the deterministic limit the ESN fails to reproduce the invariant density in 80% of the cases (φ ≃ 0.8). For intermediate noise intensities, φ > 0.9 (Fig. 2-a). The saturation time τ_s for the short-term forecast is small (Fig. 2-c), as is the initial error (Fig. 2-e). The moving average procedure with w = 3 (see [37]-C for a justification of this value) partially improves the performances (Fig. 2-b,d,f) and enables the ESN to simulate an intermittent behavior (Fig. 3). Panel (a) shows the intermittent behavior of the data generated with the ESN trained on moving-averaged data of the Pomeau-Manneville system (red), compared to the target time series (blue). Panel (b) displays the spectra, which show in both cases a power-law decay, typical of turbulent phenomena. Although the intermittent behavior is captured, this realization of the ESN shows a different invariant density (not shown), concentrated around x = 0.5 for the ESN prediction and around x = 0 for the original data.
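A stochastically perturbed Pomeau-Manneville iteration can be sketched as follows, assuming the standard form x → x + x^(1+β) mod 1; the exact variant used in the letter is defined in [37]-C.

```python
import numpy as np

def pomeau_manneville(x0, n, beta=0.91, eps=0.0, seed=0):
    """Iterate a Pomeau-Manneville map of the standard form
    x -> x + x**(1 + beta) mod 1, with additive Gaussian noise of
    standard deviation eps; beta = 0.91 as in the letter."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = (x[t - 1] + x[t - 1] ** (1.0 + beta)
                + eps * rng.normal()) % 1.0
    return x
```

The map spends long stretches near the marginal fixed point at x = 0 separated by chaotic bursts, which is the intermittent behavior the ESN must reproduce.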
We now test the effectiveness of the moving average procedure in learning the behavior of turbulent and intermittent systems on climate data produced by reanalysis projects. We use data from the National Centers for Environmental Prediction (NCEP) reanalysis version 2 [reference to be added, e.g. Kisler et al., BAMS]. We consider the global sea-level pressure (SLP) fields every dt = 6 h from 1979 to 31/08/2019 ([37]-C). Since these data are spatially extended, we also need to investigate the effect of spatial resolution on ESN performance. We find that higher predictability and a better correspondence of the long-term behavior are achieved for a spatial coarse-graining factor cg = 4 (see [37]-C and Figs. S6-S7). Focusing on cg = 4, we randomly extract from the data a day within the first 25 years, then train the ESN with 10 years of SLP data and start a long prediction of 5 years just after the training. We consider the same performance indicators introduced before. However, rather than the rate of failure of the χ² test, we show the χ² distance averaged over 100 long forecasts (Fig. 4-a) for the global mean SLP. Results are shown for different w (colors) and network sizes N. Fig. 4-b,c show, respectively, τ_s in hours, as a measure of the predictability horizon of the ESN forecasts, and log(η). The results show a strong dependence on N and w. Indeed, the behavior of the three indicators is not monotonic in N: depending on w, there is a network size beyond which prediction power is lost to overfitting. The moving average procedure improves the performance, but only for small network sizes. Visual inspection shows that the best combination is w = 12 h and N = 200. We compare in detail the results obtained for two 10-year predictions with w = 0 h and w = 12 h, at N = 200 and cg = 4 fixed. At the beginning of the forecast time (Video S1), the target field (Video S1-a) is close to both what is obtained with w = 0 h (Video S1-b) and with w = 12 h (Video S1-c).
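The spatial coarse-graining by a factor cg can be sketched as block-averaging the SLP field; the letter does not detail the procedure, so the block-mean form and the trimming of grid edges not divisible by cg are assumptions.

```python
import numpy as np

def coarse_grain(field, cg):
    """Block-average a 2D field by a factor cg in each direction,
    trimming any rows/columns that do not fill a complete block."""
    ny = field.shape[0] - field.shape[0] % cg
    nx = field.shape[1] - field.shape[1] % cg
    f = field[:ny, :nx]
    return f.reshape(ny // cg, cg, nx // cg, cg).mean(axis=(1, 3))
```

With cg = 4 each output cell is the mean of a 4 × 4 block of grid points, reducing the input dimension K of the ESN by a factor of 16.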
However, when looking at the fields after 24000 h (Video S2), we do not, of course, expect agreement among the three datasets: we are well beyond the predictability horizon. We remark, however, that the dynamics of the run with w = 0 h is steady: the positions of cyclones and anticyclones barely evolve with time. Instead, the run with w = 12 h shows a richer dynamical evolution, with generation and annihilation of cyclones. This visual impression is confirmed by the distribution and spectral analysis presented in [37] (Fig. S8).
We have analysed the performance of ESN in reproducing the behavior of noisy and intermittent data, such as those measured as observables of geophysical flows. The performance of ESN in predicting both short- and long-term behaviors rapidly drops when the systems are perturbed with noise. However, we found that good predictability is partially recovered when scale separation is performed. Here we have used a simple moving-average filter and shown that a careful choice of the moving-average window can enhance predictability. The results also apply to geophysical datasets: here we analysed the atmospheric circulation, represented by an SLP field. Heuristically, the moving-average operation helps both by smoothing the signal and by providing the ESN with wider temporal information. In some sense, this is reminiscent of the embedding procedure [40], where the signal behavior is reconstructed by providing information not only on the previous time step but also on earlier times, with a depth depending on the complexity of the signal. In future works, it will be interesting to use other learning architectures and other methods of separating large- from small-scale components [41-43]. Our results give a more formal framework for applications of machine learning techniques to geophysical data. For example, [12,44] have shown the capabilities of spatial separation of scales and the interest of learning subgrid dynamics. Indeed, several difficulties encountered in the application of machine learning to climate data could be overcome if the appropriate framework is used [19,27,29].