Training Data Markup Language for AI SWG
Yue, Peng (Wuhan University) - Group Chair,
Hu, Lei (South Digital Technology Co., Ltd) - Co-Chair,
Ziébelin, Danielle (Laboratoire d'Informatique de Grenoble) - Co-Chair
The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) SWG is chartered to develop the UML model and encodings for geospatial Machine Learning training data. In machine learning, training data is the dataset used for training and validation of machine learning models. The geospatial training data categories will include, but are not restricted to, remote sensing imagery, moving features (e.g., vehicle trajectories), and related spatial content. The SWG will define a UML model and encodings consistent with the OGC standards baseline to exchange and retrieve the training data in the Web environment. The SWG will start from the JSON implementation. Once the UML model and JSON encoding are well accepted, the SWG will work on the XML encoding, in line with the current OGC baseline. This standard will provide more detailed metadata for formalizing the information model of training data. This will include, but is not limited to, the following aspects: how the training data is prepared (e.g., provenance or quality), how to specify different metadata used for different ML tasks (e.g., scene/object/pixel levels), how to differentiate the high-level training data information model from extended information models specific to various ML applications, and how to introduce external classification schemes and flexible means for representing ground truth labeling. The SWG will investigate the feasibility and interoperability of OGC standards to exchange geospatial training data in machine learning applications and describe gaps and issues that can lead to a new geospatial standard.
Artificial Intelligence is expected to play a crucial role in many domains and will revolutionize existing technologies. During the last decade, Machine Learning techniques, especially Deep Learning, have improved significantly due to an abundance of data and advancements in high-performance computing. Machine learning is reorienting and transforming geographic information systems (GIS) and Remote Sensing (RS). Machine learning based applications are now being deployed across diverse markets to provide new solutions and increase human efficiency. Increasingly, the science community is also using these techniques to better harness the ever-increasing volume of Earth Observation (EO) data for geospatial analysis in various domains, such as smart cities, environmental management, and disaster management.
In order to increase the adoption of Machine Learning techniques for geospatial analysis by researchers and practitioners, several challenges must be addressed. A key component of Machine Learning techniques and processes is training data: data with known provenance, consistent metadata, and quality measurements that can be used to consistently tune and train machine learning applications. The lack of consistent and known training data is increasingly becoming the main bottleneck in advancing EO science applications. The lack of training data is also causing reproducibility issues and making it difficult to compare results across studies. In recent years, several concerted efforts have started to catalog and publish open-source benchmark training datasets to support EO model development and data science challenges. However, the existing training datasets are usually packaged into a public or personal repository and lack discoverability and accessibility. Moreover, there is no unified method to describe the training data. For example, in the three remote sensing (RS) machine learning scenarios (scene level, object level, and pixel level), the content and format of the training data are generally different. At the scene level (e.g., wildfire scene classification), the training data consists of an image and its corresponding binary label. At the object level (e.g., building detection), the training data consists of an image with several polygons indicating the positions of buildings. At the pixel level (e.g., landcover classification), the training data consists of the EO imagery and the landcover class of each pixel. Remote sensing training data for deep learning may come from different organizations, with different resolutions and ground truth forms but without adequate metadata, which makes it burdensome for users to access and use.
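As an illustration only, the three task levels described above can be sketched as simple Python data structures. The class and field names (`SceneSample`, `ObjectSample`, `PixelSample`, `image_uri`, and so on) and the file names are hypothetical; they are not taken from any TrainingDML-AI schema and only show how the label content differs by level.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical structures for illustration; not part of any standard.
Polygon = List[Tuple[float, float]]  # closed ring of (x, y) vertices


@dataclass
class SceneSample:
    """Scene level: one image and one label for the whole image."""
    image_uri: str
    label: bool  # e.g. wildfire present or not


@dataclass
class ObjectSample:
    """Object level: one image and several polygons marking objects."""
    image_uri: str
    objects: List[Polygon] = field(default_factory=list)


@dataclass
class PixelSample:
    """Pixel level: one image plus a class value for every pixel."""
    image_uri: str
    label_image_uri: str  # raster assumed to share the image grid
    classes: List[str] = field(default_factory=list)


def task_level(sample) -> str:
    """Return the task level implied by the sample type."""
    if isinstance(sample, SceneSample):
        return "scene"
    if isinstance(sample, ObjectSample):
        return "object"
    if isinstance(sample, PixelSample):
        return "pixel"
    raise TypeError(f"unknown sample type: {type(sample).__name__}")


square = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0), (0.0, 0.0)]
samples = [
    SceneSample("scene_001.tif", label=True),
    ObjectSample("tile_017.tif", objects=[square]),
    PixelSample("tile_042.tif", "tile_042_labels.tif", classes=["water", "forest"]),
]
levels = [task_level(s) for s in samples]
```

A common information model would need to cover all three shapes, which is one reason the charter distinguishes a high-level model from task-specific extensions.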
Therefore, the geospatial community is expected to define stricter specifications and policies to enhance the discovery and sharing of training data, and in particular to develop a Training Data Markup Language for AI/ML to document, store, and share geospatial training data following the FAIR (findability, accessibility, interoperability, and reusability) data management principles.
Training data should have sufficient metadata in a machine-readable standard format, including general spatiotemporal information and training data-specific attributes, to facilitate data discovery and query. Given the popularity of JSON and XML, the SWG will support and be consistent with the JSON and XML encodings that are ubiquitous on the Web. Where available, the proposed standard will use existing industry standards commonly used by developers.
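The following minimal sketch suggests how machine-readable spatiotemporal metadata could support discovery. It assumes a hypothetical `DatasetRecord` with a bounding box, a time range, and a task attribute; the record layout, the `search` function, and the catalog entries are placeholders, not a defined TrainingDML-AI or OGC API interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class DatasetRecord:
    # Hypothetical metadata record; field names are illustrative only.
    identifier: str
    bbox: BBox
    start: date
    end: date
    task: str  # e.g. "scene", "object", or "pixel"


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records: List[DatasetRecord], bbox: BBox, start: date, end: date,
           task: Optional[str] = None) -> List[DatasetRecord]:
    """Filter records by spatial extent, time range, and optional task."""
    hits = []
    for r in records:
        if not bbox_intersects(r.bbox, bbox):
            continue
        if r.end < start or r.start > end:
            continue  # no temporal overlap
        if task is not None and r.task != task:
            continue
        hits.append(r)
    return hits


# Placeholder catalog entries, invented for this example.
catalog = [
    DatasetRecord("buildings-a", (114.0, 30.0, 115.0, 31.0),
                  date(2019, 1, 1), date(2019, 12, 31), "object"),
    DatasetRecord("landcover-b", (5.0, 45.0, 6.0, 46.0),
                  date(2020, 1, 1), date(2020, 6, 30), "pixel"),
]
query_bbox = (114.5, 30.5, 116.0, 32.0)
found = search(catalog, query_bbox, date(2019, 6, 1), date(2019, 6, 30))
found_ids = [r.identifier for r in found]
```

Here only `buildings-a` overlaps the query in both space and time. Without consistent extent and task attributes in the metadata, a filter of this kind is not possible.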
Training data is the building block of Machine Learning models, which constitute the majority of Machine Learning applications in Earth science. The training data is used to train AI/ML models and to validate model results. It is essential to formalize and document the training data by characterizing its content, metadata, data quality, and provenance. This SWG will focus on the TrainingDML for AI standard submission, coordinating a public comment period, and processing any comments received during this period. The final deliverable of the SWG will be a version of the candidate standard for consideration by the OGC membership for approval as an OGC standard.
The SWG will take on the following work actions around training data:
- Discuss the cutting-edge issues of machine learning training data for the geospatial community;
- Design the UML model and encoding of TrainingDML for AI, maximizing the interoperability and usability of geospatial training data for various Machine Learning tasks;
- Merge the related efforts from the O&M “Sample” (e.g. UML models) and STAC ML Label JSON extensions;
- Define the description of spatial and temporal representativeness (e.g., spatial coordinates, spatial resolution, temporal resolution, and distribution of phenomenon of interest);
- Define different AI/ML tasks in remote sensing (e.g. scene/object/pixel levels), type of applicable machine learning model/algorithm, preferred accuracy level, labeling procedure used to generate the training data, original data used to generate labels, external classification schemes for label semantics, e.g. ontologies or vocabularies;
- Define the description of the permanent identifier, version, license, training data size, dates of measurement or imagery used for annotation, uncertainty of the measurement, counter-examples, data privacy, etc.
- Define the description of quality evaluation (e.g., training data errors, training data representativity) and the provenance (all intermediate data and training process);
- Define the support of training data expression, in which both physical data and scripts should be documented for reproducibility;
- Best practices for documenting, storing, evaluating, publishing, and sharing the training data, based on TrainingDML as the standard format; and
- Best practices for deploying, training, and executing the machine learning model by using TrainingDML as the input/output format.
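To make the metadata items in the list above more concrete, the sketch below assembles a hypothetical dataset-level record and round-trips it through JSON. Every key name and value here (`id`, `quality`, `label_error_rate`, and so on) is an assumption made for illustration; the actual TrainingDML-AI model and encoding are what the SWG is chartered to define.

```python
import json

# Hypothetical required keys, loosely following the items listed above.
REQUIRED_KEYS = ("id", "version", "license", "tasks", "quality", "provenance")

# Placeholder record; all values are invented for this example.
record = {
    "id": "example-dataset-001",
    "version": "1.0",
    "license": "CC-BY-4.0",
    "size": 1000,                      # number of training samples
    "tasks": ["object"],               # scene / object / pixel
    "classes": ["building"],
    "extent": {
        "bbox": [114.0, 30.0, 115.0, 31.0],
        "time": ["2019-01-01", "2019-12-31"],
    },
    "quality": {"label_error_rate": 0.02},
    "provenance": {
        "labeling_procedure": "manual digitizing",
        "source_imagery": ["tile_017.tif"],
    },
}


def missing_keys(rec: dict) -> list:
    """Return the required keys that are absent from a record."""
    return [k for k in REQUIRED_KEYS if k not in rec]


# Encode to JSON text and decode again to confirm a lossless round trip.
encoded = json.dumps(record, indent=2, sort_keys=True)
decoded = json.loads(encoded)
```

A check such as `missing_keys` is only a stand-in for schema validation; a real encoding would typically be validated against a published schema.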
3.1 Statement of relationship of planned work to the current OGC standards baseline
There is no existing OGC standard that directly addresses the above requirements.
The TrainingDML for AI is intended to be aligned with some existing OGC Standards and leverage capabilities fulfilled in part (or in total) by other standards.
- OGC API Features: OGC API Features provides API building blocks to create, modify and query features on the Web. It specifies discovery and query operations.
- Simple Features: Simple Features provides a distinct set of geometric objects for describing geospatial feature data. Simple Features also defines the geometry model for most other OGC feature encodings.
- GML: GML is a comprehensive encoding of features, geometry, and topology in XML. GML is quite heavy for simple feature data exchange and specifies an XML encoding for many other geospatial resources that are out-of-scope of this SWG, such as coverage data or coordinate reference systems.
- Observations and measurements (O&M): O&M defines a conceptual schema for observations, for features involved in the observation process, and for features involved in sampling when making observations. Training data in machine learning, such as wildfire scene labels, mostly focuses on high-level semantics, which are used for information extraction or pattern recognition in the data analysis stages. O&M commonly involves sampling of an ultimate feature-of-interest, and it defines a common set of sample types according to their spatial, material (for ex-situ observations), or statistical nature. The concept of “statistical sample” in O&M can be merged into this standard.
- SpatioTemporal Asset Catalog (STAC): STAC provides an API for accessing EO data and also develops a Label Extension that defines a metadata specification for EO training data, linking labels to remotely sensed source imagery. This standard will merge the efforts of STAC to provide more detailed metadata for formalizing the information model of training data. These will include: (1) how the training data is prepared (e.g., provenance or quality), such as the sampling procedure for preparing the data in O&M; (2) how to specify different metadata used for different ML tasks (e.g., scene/object/pixel levels); (3) how to differentiate the high-level training data information model from extended information models specific to various ML applications; (4) how to introduce external classification schemes in ground truth labeling; (5) how to develop UML models and JSON/XML encodings for the training data; and (6) how to support more flexible ways of representing training data, such as a pair of images (a source image and a ground truth label image), pairs of images (many small image tiles and their labels), or a source image and “all different labels together as geometries (e.g., OSM geometries)”.
- WKT CRS: The Well Known Text representation of Coordinate Reference Systems offers a standardized way to fully describe CRSs so that any spatial data set can reference them.
Once the SWG is established, a candidate standard is intended to be developed within one year.
3.2 What is Out of Scope?
Initially, the SWG will only standardize the UML model and encoding of remote sensing machine learning training data; other common geospatial data, such as vehicle trajectories, may be considered later.
The SWG will not focus on producing benchmark training datasets, nor on producing machine learning models and algorithms.
As the standard will be modular and multi-part, using the concept of core and extensions will allow a customized approach by implementors and data providers. If a community needs to develop a profile, it should be specified and governed by that community.
3.3 Specific Contribution of Existing Work as a Starting Point
The SWG work is based on:
- OGC Testbed-16: Data Access and Processing API Engineering Report (OGC 20-025)
- OGC Testbed-16: Data Access and Processing Engineering Report (OGC 20-016)
- OGC Testbed-16: Machine Learning Engineering Report (OGC 20-015)
- OGC Testbed-16: Machine Learning Training Data Engineering Report (OGC 20-018)
- OGC Testbed-15: Machine Learning Engineering Report (OGC 19-027r2)
- OGC Testbed-14: Machine Learning Engineering Report (OGC 18-038r2)
- OGC Abstract Specification: Geographic information — Observations and measurements (OGC 20-082r2)
- OGC Earth Observation Applications Pilot (several detailed ERs with Summary ER)
- OGC Web APIs
- ML Label Extension for STAC
- ISO 19107:2019 Geographic information — Spatial schema
- ISO 19115-1:2014 Geographic information — Metadata — Part 1: Fundamentals
- ISO 19115-2:2019 Geographic information — Metadata — Part 2: Extensions for acquisition and processing
- ISO/TS 19158:2012 Geographic information — Quality assurance of data supply
- Data Catalog Vocabulary (DCAT)
4.1 Initial Deliverables
The following deliverables will be the initial results of the SWG:
· OGC TrainingDML for AI Standard (Part 1 - Remote Sensing);
· Associated implementation guidance for OGC TrainingDML for AI;
· Any training data code, evidence of implementation, annotated list of public comments, or compliance tests that might be developed in parallel to the Standard.
The targeted start date for this SWG is the first quarter of 2021, once the charter is approved. The SWG will aim to deliver an initial release of the candidate Standard for review by the end of 2021.
This SWG will develop the TrainingDML standard for generating rich metadata interpretable by humans and machines suitable for data exchange beyond the geospatial community.
Geospatial data providers, geoscientists, computer scientists, and software engineers from academia, industry, and government will be interested in assisting with the development of this Standard and the output of the SWG.
Yue, Peng (Wuhan University)
Hu, Lei (South Digital Technology Co., Ltd.)
Ziébelin, Danielle (Laboratoire d'Informatique de Grenoble)
The work of the SWG is intended to be largely public: the SWG will solicit contributions and feedback from OGC members and non-OGC members to the extent that is supported by the OGC Technical Committee Policies and Procedures.
Other collaborators are expected to include the GeoAI DWG, MetaCat DWG, Feature and Geometry JSON SWG, O&M SWG, and the STAC community. As the work is relevant to W3C Spatial Data on the Web, there may be collaboration with the joint OGC/W3C Spatial Data on the Web Interest Group.
b. Similar or Applicable Standards Work (OGC and Elsewhere).
The following standards and projects may be relevant to the SWG's planned work, although none currently provide the functionality anticipated by this committee's deliverables:
· Observations and Measurements (O&M)
· SpatioTemporal Asset Catalog (STAC)
· AI ready EO training datasets (AIREO)