A User-centric Approach to Data Cubes
A version of this article originally appeared in the July/August 2021 edition of GeoConnexion International Magazine.
Geospatial data cubes are now widely used because they enable performant, cloud-compatible access to and analysis of geospatial data. But differences in their design, interfaces, and handling of temporal characteristics create interoperability challenges for anyone working with more than one solution. These challenges waste time and money and, from a science perspective, undermine reproducibility.
To address these challenges, the Open Geospatial Consortium (OGC) and the Group on Earth Observations (GEO) invited global data cube experts to discuss the “state of the art” and find a way forward at the Towards Data Cube Interoperability workshop. The two-day workshop, held in late April 2021, opened with a series of pre-recorded position statements by data cube providers and data cube users. These videos served as the entry point for intense discussions that not only produced a new definition of the term ‘data cube’, but also underscored the need for a ‘user-centric’, API-based approach that exposes both the data available to the user and the processing algorithms that can be run on it - and that allows users to add their own. The outcomes of the workshop have been published on the OGC & GEO Towards Data Cube Interoperability Workshop webpage.
Data cubes are ideally suited to cloud-based workflows, but a lack of standards makes integration of different data cubes a challenge.
Data cubes from the users’ perspective
Existing definitions of data cubes often focus on the data structure, as the term is used in computer science. In contrast, the Towards Data Cube Interoperability workshop emphasized the need to leave these definitions behind and focus on the user’s perspective. Users don’t care whether the data is stored in a relational database, in a cloud-based object store, or on a file server. What interests them is how they can access the data and which processing algorithms they can apply to it. Any standard for access should reflect this.
This led to an interesting rethinking of just what a data cube is and can be. Although no formal consensus was reached, the workshop participants generally took a user-centric definition of a geo data cube to be:
“A geo data cube is a discretized model of the earth that offers the estimated values of certain variables for each cell. Ideally, a data cube is dense (i.e., does not include empty cells) with constant cell distance for its spatial and temporal dimensions. A data cube describes its basic structure, i.e., its spatial and temporal characteristics and its supported variables (aka properties), as metadata. It is further defined by a set of functions. These functions describe the available discovery, access, view, analysis, and processing methods by which the user can interact with the data cube.”
As we see, the data cube is described from the user’s point of view rather than in terms of the data. It does not matter whether the data cube has one, two, or three spatial dimensions, or whether time is given its own dimension(s), is just part of the metadata of an observation, or isn’t relevant to the data at all. Similarly, it doesn’t matter how the data is stored. What will unify these heterogeneous data cubes is their use of a standardised HTTP-based API as their method of access and interaction.
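To make the structural part of the workshop definition concrete, here is a minimal Python sketch. It is an illustration only, not an OGC data model: the class names `Axis` and `GeoDataCube`, their fields, and the sample values are all assumptions made for this example. It shows a cube with constant cell distances that describes its own structure as metadata and can report whether it is dense.

```python
from dataclasses import asdict, dataclass, field
from itertools import product
from typing import Dict, List, Tuple


@dataclass
class Axis:
    """One dimension with a constant cell distance (hypothetical class)."""
    name: str     # e.g. "lat", "lon", "time"
    start: float  # coordinate of the first cell
    step: float   # constant distance between neighbouring cells
    size: int     # number of cells along this axis

    def coordinate(self, index: int) -> float:
        """Coordinate of the cell at the given index along this axis."""
        if not 0 <= index < self.size:
            raise IndexError(f"{self.name} index {index} out of range")
        return self.start + index * self.step


@dataclass
class GeoDataCube:
    """Illustrative cube: estimated values per variable and cell."""
    axes: List[Axis]
    variables: List[str]
    # cells[variable][(i, j, ...)] -> estimated value for that cell
    cells: Dict[str, Dict[Tuple[int, ...], float]] = field(default_factory=dict)

    def describe(self) -> dict:
        """Return the cube's basic structure as metadata."""
        return {
            "axes": [asdict(axis) for axis in self.axes],
            "variables": list(self.variables),
        }

    def is_dense(self) -> bool:
        """True if every variable has a value for every cell index."""
        all_indices = set(product(*(range(a.size) for a in self.axes)))
        return all(
            set(self.cells.get(v, {})) == all_indices for v in self.variables
        )


# Made-up sample data: a 2 x 2 grid with one variable.
cube = GeoDataCube(
    axes=[Axis("lat", 47.0, 0.5, 2), Axis("lon", 6.0, 0.5, 2)],
    variables=["temperature"],
    cells={"temperature": {(0, 0): 21.5, (0, 1): 22.0,
                           (1, 0): 20.8, (1, 1): 21.1}},
)
print(cube.describe())
print(cube.is_dense())
```

The list of axes is deliberately generic: the same sketch covers cubes with one, two, or three spatial dimensions, with or without a time axis.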
The main concern of the user is what functions the data cube instance offers to apply to the data. These functions are what primarily differentiate the user-centric data cube definition from other definitions. A user needs to understand what questions can be asked to access data that fulfills specific filter criteria, how to visualize specific (sub-) sets of data, and how to execute analytical functions and other processes on the data cube. If supported, the user also needs to understand how to add their own processes to the data cube so that they can be executed directly on the data cube without the need to transfer vast amounts of data out of the cloud.
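The idea of functions that run next to the data, including ones the user adds, can be sketched as a small process registry. This is a hedged illustration, not an API from the workshop or from OGC: the `CubeProcesses` class, its method names, and the sample cells are hypothetical. The point it shows is that the user sends a filter and a process name, and only the small result leaves the cube.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

Cell = Dict[str, float]


class CubeProcesses:
    """Hypothetical registry of processes offered by one data cube."""

    def __init__(self, cells: List[Cell]) -> None:
        self._cells = cells
        self._processes: Dict[str, Callable[[List[Cell]], float]] = {}

    def register(self, name: str, func: Callable[[List[Cell]], float]) -> None:
        """Add a process; users could add their own if the cube allows it."""
        self._processes[name] = func

    def list_processes(self) -> List[str]:
        """Discovery: which processes can be run on this cube?"""
        return sorted(self._processes)

    def execute(
        self, name: str, where: Optional[Callable[[Cell], bool]] = None
    ) -> float:
        """Filter the cells, run the named process, return only the result."""
        selected = [c for c in self._cells if where is None or where(c)]
        return self._processes[name](selected)


# Made-up sample cells for illustration.
cells = [
    {"lat": 52.5, "lon": 13.4, "temperature": 31.0},
    {"lat": 48.1, "lon": 11.6, "temperature": 29.0},
    {"lat": 53.6, "lon": 10.0, "temperature": 33.0},
]
service = CubeProcesses(cells)

# A process offered by the operator.
service.register(
    "mean_temperature", lambda cs: mean(c["temperature"] for c in cs)
)
# A process added by the user.
service.register(
    "count_above_30", lambda cs: sum(1 for c in cs if c["temperature"] > 30.0)
)

northern_mean = service.execute(
    "mean_temperature", where=lambda c: c["lat"] > 50.0
)
hot_cells = service.execute("count_above_30")
print(service.list_processes(), northern_mean, hot_cells)
```

In a real deployment the registry would sit behind the data cube API and user-supplied processes would need sandboxing and access control, which this sketch leaves out.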
This isn’t to say that all other characteristics - such as spatial and temporal details (e.g., being dense or sparse, overlapping or perfectly aligned, constant or inconstant distances), and property details (scales of measurements, incomplete data, interpolation methods, error values, etc.) - are of no concern to the user: they still need to be known. As such, they will be provided via the data cube API as metadata, so that the user can take them into account when assessing how best to process the data.
Integrating different data cubes isn’t an unsolvable puzzle.
Interoperability through a Data Cube API
Where does this leave OGC? We think an API-based, flexible approach to standards will provide end users, software developers, and data cube operators with the best experience.
For end users: a single, simple, standardised HTTP API to learn and/or code for, no matter where the data resides, means that a wider selection of software (including low- or no-code platforms) will support a wider choice of data cube providers and a larger number of processing algorithms. From a scientific perspective, this means that an atmospheric scientist doesn’t also have to be a Python expert: they could use a low- or no-code platform GUI to create an algorithm that processes the data for their heatwave study across Germany. Another atmospheric scientist could then take that same processing algorithm and apply it to the UK with minimal changes - even if the required data is held by a different standards-compliant data provider. This approach greatly increases the transparency and repeatability of scientific studies and other valuable analysis tasks.
For software developers: a single, simple, standardised HTTP API means that they don’t have to design their own vendor-specific methods for providing access to data cubes in their software. Instead, their software interacts with data cubes via HTTP calls, benefiting from simple, standard Web communication rather than interactions at the programmatic level. By coding to an agreed-upon standard, developers can work with any compliant data cube while minimizing cube-specific adaptations. This increases the usability of the software, while decreasing the development and maintenance costs.
For data cube operators: using a single, simple, standardised HTTP API reduces development and maintenance costs while broadening the customer base. Being standards-compliant gives operators access to customers using any compliant software package, rather than just those using a select list of software coded to work with one specific instance. This means that more people will be coding for an operator’s data cube, even if they don’t know the service exists.
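The benefit of a shared HTTP API can be illustrated with a short Python sketch in which the same client code targets two different providers. Everything specific here is an assumption: the host names, the `air-temperature` collection, the bounding boxes, and the URL layout (loosely modelled on the general style of OGC API paths and query parameters) are hypothetical, since the data cube API was still being drafted at the time of writing. The sketch only builds request URLs and does not contact any server.

```python
from typing import Sequence, Tuple
from urllib.parse import urlencode


def build_coverage_url(
    base_url: str,
    collection: str,
    bbox: Tuple[float, float, float, float],
    time_range: Tuple[str, str],
    variables: Sequence[str],
) -> str:
    """Build a request URL for a hypothetical standard data cube endpoint."""
    query = {
        "bbox": ",".join(str(v) for v in bbox),  # min lon, min lat, max lon, max lat
        "datetime": "/".join(time_range),        # start/end of the interval
        "properties": ",".join(variables),       # variables to return
    }
    # Keep commas, slashes and colons readable instead of percent-encoding them.
    encoded = urlencode(query, safe=",/:")
    return f"{base_url.rstrip('/')}/collections/{collection}/coverage?{encoded}"


# The same function serves two made-up providers; only the inputs change.
germany_url = build_coverage_url(
    "https://cube-a.example.org/api",
    "air-temperature",
    (5.9, 47.3, 15.0, 55.1),
    ("2019-06-01", "2019-08-31"),
    ["t2m"],
)
uk_url = build_coverage_url(
    "https://cube-b.example.org/",
    "air-temperature",
    (-8.6, 49.9, 1.8, 60.9),
    ("2019-06-01", "2019-08-31"),
    ["t2m"],
)
print(germany_url)
print(uk_url)
```

Moving the heatwave study from Germany to the UK changes only the base URL and the bounding box, which is the kind of minimal change described for the two atmospheric scientists above.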
Data cubes come in many different shapes and sizes - a standard API would simplify their use.
What’s next for OGC?
It’s early days yet, but you can expect to see a data cube-related API become part of our family of OGC API standards. Work towards such a data cube API builds upon the work of our Earth Observation Exploitation Platform (see An App Store For Big Data, in GeoConnexion International, July/August 2020), and is currently underway as part of OGC Testbed-17.
OGC Members interested in learning about OGC’s approach to standardising access to data cubes can follow its early development as Observers in OGC’s Testbed-17. Alternatively, OGC Members can join the Earth Observation Exploitation Platform Domain Working Group. Detailed outcomes from the Workshop are available on the OGC & GEO Towards Data Cube Interoperability Workshop webpage.