In my previous blog post, I introduced some of the common challenges I have encountered in geospatial integration projects. In this post, I will focus on the first of those challenges: knowing your data.

Before we can integrate, publish, or analyze datasets, we need to understand what we are working with. This involves answering questions such as:

  • Does the dataset contain Personally Identifiable Information (PII)?
  • What is its size on disk?
  • Is it spatially enabled?
  • What Coordinate Reference System (CRS) does it use?
  • What is the update frequency?
  • What are the spatial and temporal resolutions?
  • What file format is it stored in?
  • Who is responsible for the dataset?

Ideally, we should be able to get this information from an accompanying metadata record. Unfortunately, metadata is often treated as an afterthought. More often than not, I have been involved in projects where datasets arrived with little or no metadata at all.

I will leave the broader discussion about metadata creation for a future blog post and instead focus on how we kick-start a data integration project when metadata is missing.

If we already have access to the datasets before designing the Spatial Data Infrastructure, some questions can be answered by examining the data itself. File size, file format, and, for some structured formats, the CRS can be determined through direct inspection.
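As a minimal sketch of this kind of direct inspection, the Python snippet below reads the size, guesses the format from the file extension, and, for GeoJSON only, looks for a CRS. The function name `inspect_dataset` and the `KNOWN_FORMATS` table are hypothetical and only for illustration. The CRS logic relies on two facts about GeoJSON: older files may carry a named `crs` member, and RFC 7946 otherwise defines coordinates as WGS84 longitude/latitude.

```python
import json
import os

# Hypothetical lookup table: extend it with the formats used in your project.
KNOWN_FORMATS = {
    ".geojson": "GeoJSON",
    ".csv": "CSV",
    ".gpkg": "GeoPackage",
    ".shp": "ESRI Shapefile",
    ".tif": "GeoTIFF",
}


def inspect_dataset(path):
    """Collect the facts that can be read from the file itself."""
    ext = os.path.splitext(path)[1].lower()
    info = {
        "size_bytes": os.path.getsize(path),
        "format": KNOWN_FORMATS.get(ext, "unknown"),
        "crs": None,  # stays None when the file cannot tell us
    }
    if ext == ".geojson":
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        legacy = doc.get("crs") if isinstance(doc, dict) else None
        if isinstance(legacy, dict):
            # Pre-RFC 7946 files may declare a named CRS explicitly.
            info["crs"] = (legacy.get("properties") or {}).get("name")
        else:
            # RFC 7946: coordinates are WGS84 longitude/latitude.
            info["crs"] = "OGC:CRS84"
    return info
```

For other formats, the CRS is left as `None` here; in practice a geospatial library would be the better tool for reading it, and the extension is only a hint about the real format.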

Things become more complicated, however, when working with formats such as CSV or spreadsheets containing lists of coordinates. In these cases, determining the CRS may require a degree of trial and error. This can be relatively straightforward when dealing with commonly used systems such as WGS84 or Web Mercator, but it can quickly become difficult, or even impossible, when less familiar coordinate systems are involved.
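To illustrate the trial-and-error step, here is a rough Python sketch that checks whether a list of coordinate pairs falls within the valid range of WGS84 (EPSG:4326) or Web Mercator (EPSG:3857). The function name `candidate_crs` is hypothetical, and the approach is only a plausibility check under the assumption that the data uses one of these two systems; it cannot confirm a CRS, and it says nothing about less familiar ones.

```python
# Half the width of the Web Mercator (EPSG:3857) world extent, in metres.
WEB_MERCATOR_LIMIT = 20037508.342789244


def candidate_crs(points):
    """Return the CRS guesses whose valid range contains every (x, y) pair."""
    if not points:
        return []
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    candidates = []
    # Degrees, with x holding longitude and y holding latitude.
    if all(-180 <= x <= 180 for x in xs) and all(-90 <= y <= 90 for y in ys):
        candidates.append("EPSG:4326 (x=lon, y=lat)")
    # Degrees, with the axes swapped.
    if all(-90 <= x <= 90 for x in xs) and all(-180 <= y <= 180 for y in ys):
        candidates.append("EPSG:4326 (x=lat, y=lon)")
    # Metres: inside the Web Mercator extent, and too large to be degrees.
    values = xs + ys
    if all(abs(v) <= WEB_MERCATOR_LIMIT for v in values) and any(
        abs(v) > 180 for v in values
    ):
        candidates.append("EPSG:3857")
    return candidates


# Illustrative values only, not taken from a real dataset.
print(candidate_crs([(-9.14, 38.72)]))
print(candidate_crs([(-1017000.0, 4680000.0)]))
```

The first call returns both WGS84 axis orders, which shows why a range check alone is not enough: the result still has to be plotted on a map or confirmed by someone who knows the data. An empty result suggests a projected system that needs to be identified by other means.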

Other questions may be even harder to answer. For example, identifying the contact point responsible for a dataset can be impossible when no documentation exists. Ironically, this may well be the most important piece of information of all, because the right contact person can often provide answers to many of the other questions.

So, what should we do when the data itself cannot tell us everything we need to know?

In my experience, keeping a human in the loop is often our best bet. In most cases, we do have a contact point for the project, and this person can either act as a contact point for the datasets or connect us with someone who can fill that role.

To support this process, I typically create a “Data Survey” that captures the information needed to understand each dataset, covering areas such as technical metadata (e.g., size and format) as well as information that we would normally see in a metadata record, such as the contact point details.
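As a sketch of what one survey response could look like as a data structure, the Python snippet below mirrors the questions listed at the start of this post. The class name `DataSurveyResponse`, its field names, and the helper `unanswered` are hypothetical and not the actual survey used in any project. Every field defaults to `None`, so the gaps that still need a follow-up with the contact point are easy to list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DataSurveyResponse:
    """One survey response per dataset; None means 'not answered yet'."""

    dataset_name: str
    contains_pii: Optional[bool] = None
    size_on_disk: Optional[str] = None
    spatially_enabled: Optional[bool] = None
    crs: Optional[str] = None
    update_frequency: Optional[str] = None
    spatial_resolution: Optional[str] = None
    temporal_resolution: Optional[str] = None
    file_format: Optional[str] = None
    contact_point: Optional[str] = None


def unanswered(response):
    """Return the names of the fields that are still missing an answer."""
    return [f.name for f in fields(response) if getattr(response, f.name) is None]


# Illustrative values only.
response = DataSurveyResponse("example dataset", file_format="CSV", contains_pii=False)
print(unanswered(response))
```

Note that an answer of `False` (for example, no PII) counts as answered; only `None` is treated as a gap.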

This data survey can take the form of an online questionnaire. Over time, I have realized that I should not assume that participants will interpret the survey fields in the way I have designed them. So, unless semantic annotations are an option, I try to provide clear explanations for each question, or even better, walk participants through the survey during a call.

Figure 1: Results from the eMOTIONAL Cities project survey showing responses to the questions “Who is the target user of your data?” and “What is the spatial resolution?”

Designing a Spatial Data Infrastructure (SDI) requires knowledge about the data that we are going to integrate and publish. When metadata is incomplete or absent, engaging stakeholders becomes essential. The process may require guidance, clarification, and a certain amount of handholding, but the insights gained are often critical to the success of the project.

In the next blog post, I will explore one of the most challenging aspects of building an SDI: engaging stakeholders in the creation of standards-based metadata. Stay tuned for the next installment in this series.
