In-Situ Inference: Bringing Advanced Data Science Into Exascale Simulations [report]

Nathan Mark Urban, Earl Christopher Lawrence, Ayan Biswas
2020 unpublished
"What new science would be possible if you had ALL the data?" As simulations generate ever-increasing amounts of data, there are correspondingly richer opportunities for analysis and scientific discovery, discoveries that will be missed if most of the data must be discarded before it is analyzed. Because future exascale architectures will be increasingly storage-limited, it will not be possible to save the vast majority of simulation data for later analysis, requiring analysis to occur "in-situ" within the simulation. However, existing in-situ data analysis frameworks provide little or no support for one of the most sophisticated forms of data science: probabilistic statistical modeling or uncertainty quantification (UQ), and the accompanying challenge of inference, that is, fitting those statistical models to massive simulation output. Our goal is to develop the fundamental statistical algorithms and computer science needed to perform statistical inference in-situ (in HPC simulations) on the full stream of data those simulations generate.

Consider the mission science challenge of quantifying the probability of events in predictive HPC simulations, and understanding the underlying factors influencing the likelihood of these events. Examples include the future risk of extreme weather events damaging population centers, or of extreme electron flux events in solar storms damaging satellites. To understand why this is a statistical inference challenge, turn to questions that we cannot yet adequately answer: How will the frequency of blizzards change as the climate warms (Fig. 1)? How much of this change is attributable to sea ice retreat, vs. surface warming, vs. enhanced moisture transport? How do the statistics of turbulent plasma flows change as a function of solar cycle or prior history of the magnetospheric state (Fig. 2)? The tools needed to answer these questions are statistical models: probability density estimation, extreme value analysis, nonstationary spatial and time series modeling, regression and covariance analysis to quantify the sensitivity of effects to causes, etc. The inference algorithms used to fit these statistical models to data include Bayesian inference and Monte Carlo sampling, but they currently only work offline, on highly reduced data. The above grand challenge questions, by contrast, require all the data to answer.
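To make concrete what fitting a statistical model in-situ, on the full data stream, can mean in practice: a one-pass estimator updates its sufficient statistics as each value arrives and never retains the raw data. A minimal sketch (the class and usage are illustrative, not from the report), using Welford's online algorithm for a running mean and variance:

```python
# One-pass (streaming) estimate of mean and variance, updated as each value
# arrives from a running simulation; only O(1) state is kept, no raw data.
class StreamingMoments:
    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: numerically stable, exact in one pass.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

# Example: feed values as if they streamed from simulation timesteps.
est = StreamingMoments()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    est.update(x)
print(est.mean, est.variance)
```

The same pattern extends to streaming covariances or histogram counts, one small state object per grid cell, which is the kind of building block in-situ inference requires.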
We are asking statistical questions down to the individual grid cell and near-timestep level in exascale simulations, looking for subtle statistical differences in probability distributions at different locations and times, and the dependence of event frequencies on vaguely defined phenomena that extend throughout a vast 3D domain (such as mesoscale weather formations or geomagnetic substorm injections). In such settings, and when we are looking to quantify potential dependencies of any data point with any other, we cannot simply identify all the relevant features of interest ahead of time. Without new algorithms and computer science to infer or fit sophisticated statistical models in-situ, to all of the simulation data as it is being generated, modern data science will be left behind in the exascale revolution.

[Figure 1, panels (a)-(f), reproduced from Berckmans et al.: atmospheric blocking frequency (% of total days) for the HadGAM model; differences between the NUGAM and HadGAM resolutions; the effect of using HadGAM orography in NUGAM; and differences of each configuration from ERA-40 reanalysis. The source text notes that blocking occurs when the blocking index (BI) exceeds 0, so blocking frequency is the area under that part of the BI probability density function, and that the two model resolutions differ in both the mean and the shape of that distribution.]
Figure 1. Phenomena in simulation data, such as atmospheric blocking events leading to blizzards and cold snaps, exhibit complex patterns of spatiotemporal variability. Sophisticated statistical inference is required to identify relationships between such events and changing environmental conditions. But it is currently impossible either to store the data needed to fit such a statistical model offline, or to fit it online in the simulation.
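The blocking frequency in Figure 1 is an event probability: the mass of a probability density above a threshold. Statistics of this kind can be accumulated in-situ with a streaming histogram, without storing the underlying time series. A minimal sketch with synthetic data (the blocking-index stream, bins, and threshold are illustrative, not from the report):

```python
import numpy as np

# Streaming histogram of a blocking-index-like quantity (BI): the PDF and the
# event frequency P(BI > 0) are accumulated per timestep, so the raw values
# never need to be stored. Bins must be fixed in advance; these are illustrative.
edges = np.linspace(-8.0, 8.0, 81)   # bin width 0.2; 0.0 falls on a bin edge
counts = np.zeros(len(edges) - 1)

rng = np.random.default_rng(0)       # synthetic stand-in for simulation output
for _ in range(1000):                # "timesteps"
    bi = rng.normal(loc=-0.5, scale=1.0, size=64)  # one batch of BI values
    c, _ = np.histogram(bi, bins=edges)
    counts += c                      # only the counts persist between steps

# Blocking frequency = fraction of mass in bins with left edge >= 0.
blocking_freq = counts[edges[:-1] >= 0.0].sum() / counts.sum()
print(f"estimated blocking frequency: {blocking_freq:.3f}")
```

For this synthetic stream the estimate converges to the true tail probability, about 0.31 for a normal distribution with mean -0.5 and unit variance; in a simulation, one such accumulator per grid cell would yield the spatial frequency maps of Figure 1 without any offline storage.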
doi:10.2172/1595630