datafold: data-driven models for point clouds and time series on manifolds
Journal of Open Source Software
Ever-increasing data availability has changed how data is analyzed and interpreted in many scientific fields. While the underlying complex systems remain the same, data measurements increase in both quantity and dimension. The main drivers are larger computer simulation capabilities and increasingly versatile sensors. In contrast to an equation-driven workflow, a scientist can use data-driven models to analyze a wider range of systems, including those with unknown or intractable
... s. The models can be applied to a variety of data-driven scenarios, such as enriching the analysis of unknown systems or merely serving as an equation-free surrogate that provides fast, albeit approximate, responses to unseen data.

However, expanding datasets create challenges throughout the analysis workflow, from extracting and processing to interpreting the data. One such challenge is that new data does not always provide information that is completely new and uncorrelated with existing data. One way to extract the essential information is to understand and parametrize the intrinsic data geometry. Most data-driven models assume such an intrinsic geometry in the available data, implicitly or explicitly, and successful machine learning algorithms adapt to this underlying structure for tasks like regression or classification (e.g., Bishop, 2006). This geometry is often of much lower dimension than the ambient data space, and finding a suitable set of coordinates can reduce the complexity of the dataset. We refer to this geometric structure encoded in the data as a "manifold". In mathematical terms, a manifold is a topological space that is locally homeomorphic to Euclidean space. Typically, manifold learning attempts to construct a global parametrization (embedding) of this manifold in a space of much lower dimension than the original ambient space. The well-known manifold hypothesis states that such manifolds underlie many observations and processes, including time-dependent systems.

datafold is a Python package that provides data-driven models for point clouds to find an explicit manifold parametrization and to identify non-linear dynamical systems on these manifolds. The explicit data manifold treatment allows prior knowledge of a system and its problem-specific domain to be included.
This can be the proximity between points in the dataset (Coifman & Lafon, 2006a) or functions defined on the phase space manifold of a dynamical system, such as (partially) known governing equation terms (Williams, Kevrekidis, & Rowley, 2015).

datafold is open-source software with a design that reflects a workflow hierarchy: from low-level data structures and algorithms to high-level meta-models intended to solve complex machine learning tasks. The key benefit of datafold is that it accommodates and integrates models on the different workflow levels. Each model has been investigated and tested individually and found to be useful by the scientific community. In datafold these models can be used in a single processing pipeline. Our integrated workflow facilitates the application of data-driven analysis and thus has the potential to boost widespread utilization.

The implemented models are integrated into a software architecture with a clear modularization and an API that is templated on the scikit-learn project, so that the models can be used as part of a scikit-learn processing pipeline (Pedregosa et al., 2011). The data structures are subclasses of common objects of the Python scientific computing stack, allowing models to generalize to both static point clouds and collections of temporally ordered time series.

The software design and modularity in datafold reflect two requirements: high flexibility to test model configurations, and openness to new model implementations with clear and well-defined scope. We want to support active research in data-driven analysis with manifold context and thus target students, researchers and experienced practitioners from different fields of dataset analysis.

Lehmberg et al., (2020). datafold: data-driven models for point clouds and time series on manifolds. Journal of Open Source Software, 5(51), 2283. https://doi.org/10.21105/joss.02283

Figure 1: (Left) Point cloud of handwritten digits between 0 and 5, embedded with the "Diffusion Map" model.
Each point originally has 64 dimensions where each dimension represents a pixel of an 8 x 8 image. (Right) Conceptual illustration of a three-dimensional time series forming a phase space with geometrical structure. The time series start on the (x,y) plane and end on the z-axis.
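
To make the idea of a manifold parametrization from point proximity concrete, the following is a minimal NumPy sketch of a basic diffusion map construction in the spirit of Coifman & Lafon (2006a). It is not datafold's implementation and does not use the datafold API. The helper name `diffusion_map`, the kernel bandwidth `epsilon`, and the helix test data are assumptions made purely for illustration.

```python
import numpy as np


def diffusion_map(X, epsilon, n_components=2):
    """Embed the rows of X with a basic diffusion map (hypothetical helper)."""
    # Pairwise squared Euclidean distances between all points.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip round-off negatives

    # Gaussian kernel: encodes proximity between points in the ambient space.
    kernel = np.exp(-sq_dists / epsilon)

    # P = D^{-1} K is a Markov matrix. Its symmetric conjugate
    # S = D^{-1/2} K D^{-1/2} has the same eigenvalues and is easier to solve.
    degree = kernel.sum(axis=1)
    inv_sqrt_degree = 1.0 / np.sqrt(degree)
    sym = kernel * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]

    eigvals, eigvecs = np.linalg.eigh(sym)  # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Right eigenvectors of P, recovered from the eigenvectors of S.
    psi = eigvecs * inv_sqrt_degree[:, None]

    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    keep = slice(1, n_components + 1)
    return eigvals[keep], psi[:, keep] * eigvals[keep]


# Illustrative data: one turn of a helix, a 1-D manifold in 3-D ambient space.
t = np.linspace(0.0, 2.0 * np.pi, 200)
X = np.column_stack([np.cos(t), np.sin(t), t])

eigvals, coords = diffusion_map(X, epsilon=0.1, n_components=2)

# The leading coordinate is expected to vary monotonically along the curve,
# so it should correlate strongly with the curve parameter t (up to sign).
corr = np.corrcoef(coords[:, 0], t)[0, 1]
print("eigenvalues:", eigvals)
print("|correlation with curve parameter|:", abs(corr))
```

In this sketch the three-dimensional points are reduced to coordinates that follow the one-dimensional curve. The bandwidth `epsilon` controls how local the proximity measure is, and the value used here is hand-picked for this toy dataset. Real data generally needs a data-dependent choice.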