Data integration is a critical step in building a Spatial Data Infrastructure (SDI), and it is often where geospatial projects slow down and a great deal of time is spent. This is largely because many data sources do not follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
One of the most common challenges I have encountered is the lack of knowledge about the data to be integrated. This makes it very difficult to plan timelines for developing an SDI. By “knowledge,” I mean knowing exactly which datasets will be integrated, along with their technical metadata, such as format, size, and frequency of updates.
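To make the idea of an inventory concrete, here is a minimal sketch in Python that assumes the datasets are already available as files under a single folder. The function name `build_inventory` and the field names are my own placeholders, not part of any standard. It collects only what the file system can tell us (format guessed from the extension, size, last modification time). The update frequency cannot be derived this way and still has to come from the data owner.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_inventory(root):
    """List every file under root with its format, size, and last-modified time."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        stat = path.stat()
        entries.append({
            # Path relative to the folder being inventoried.
            "path": str(path.relative_to(root)),
            # Format is only guessed from the extension (an assumption).
            "format": path.suffix.lower().lstrip(".") or "unknown",
            "size_bytes": stat.st_size,
            "last_modified": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
        })
    return entries
```

Calling `build_inventory("incoming_data")` on a hypothetical folder returns a list of dictionaries that can be written to a spreadsheet and reviewed with the data owners.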
In an ideal scenario, this information would be provided in a metadata record, preferably in a standardized format. However, in many cases, data comes without metadata, which brings me to the second challenge: the creation of standards-based metadata. As with building a data inventory, this is often less a technical issue and more a human one, as it requires collaboration with data owners.
Another major challenge I have encountered is the format of the data itself, which in many cases is neither standardized nor structured. One of the most common examples I have seen is CSV (Comma-Separated Values) files, which are text files that may contain anything within their comma-delimited fields. For instance, a single location field may contain unstructured information, such as coordinates expressed in different formats within the same column. These integration challenges arise because CSV has no mechanism to enforce a schema. Despite these limitations, CSV and Excel files remain among the most commonly used formats for data exchange.
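As an illustration of the mixed-format problem, the sketch below normalizes a location column that contains both decimal pairs and degrees-minutes-seconds (DMS) values. It is only a sketch under assumptions of my own: decimal pairs are taken to be in latitude, longitude order, DMS values are assumed to carry a hemisphere letter, and the names `parse_location` and `dms_to_decimal` are hypothetical. Real files usually need more patterns than these two.

```python
import re

# One DMS component such as 52°22'12"N or 4°53'24.5"E (assumed input style).
DMS = re.compile(
    r"""(\d+)\s*°\s*(\d+)\s*['′]\s*(\d+(?:\.\d+)?)\s*["″]\s*([NSEW])""",
    re.IGNORECASE,
)
# A plain decimal pair such as "52.37, 4.89" or "52.37 4.89".
DECIMAL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*[,;\s]\s*(-?\d+(?:\.\d+)?)\s*$")


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert one DMS component to signed decimal degrees."""
    value = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere.upper() in "SW" else value


def parse_location(text):
    """Return (lat, lon) in decimal degrees, or None if the value is not recognized."""
    match = DECIMAL.match(text)
    if match:
        # Assumption: decimal pairs are written as latitude, longitude.
        lat, lon = float(match.group(1)), float(match.group(2))
    else:
        parts = DMS.findall(text)
        if len(parts) != 2:
            return None
        values = {}
        for degrees, minutes, seconds, hemisphere in parts:
            axis = "lat" if hemisphere.upper() in "NS" else "lon"
            values[axis] = dms_to_decimal(degrees, minutes, seconds, hemisphere)
        if set(values) != {"lat", "lon"}:
            return None
        lat, lon = values["lat"], values["lon"]
    # Reject values outside the valid coordinate ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon
```

Values that return `None` are best collected in a separate list and sent back to the data owner rather than guessed at, which keeps the cleaning step transparent.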
The image above depicts the ideal situation: a dataset is provided in a standards-based format, along with an accompanying metadata record. This allows the data to be more easily ingested into an SDI and published through various OGC API standards, such as OGC API - Tiles, OGC API - Features, or OGC API - Records.
In this post, I have highlighted some of the most common challenges. In the coming posts, I will explore each of these challenges in more detail and share some practical strategies that I have developed to address them.
My key takeaway is that while software tools can assist us with these tasks, education remains the most effective way to prevent these challenges in the first place.