Ontologies do more than just organize concepts. They can also be used to validate datasets. A research paper published on arXiv proposes using two types of ontologies to check that training datasets for safety-critical applications, such as autonomous driving, are complete and robust.
Domain Ontologies
A domain ontology formally defines the concepts that matter in a specific domain. For emergency vehicle detection, the ontology might include ambulances, fire trucks, and police cars. Each concept is clearly defined, and the relationships between them are formalized.
The ontology acts as a checklist. Does the dataset have examples of all the vehicle types? Are all the subtypes represented? If the ontology includes “ambulance with lights on” and “ambulance with lights off” as separate concepts, the dataset needs to have both.
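A minimal sketch of this checklist idea, assuming the ontology is represented as a plain mapping from concepts to subtypes and the dataset's labels as (concept, subtype) pairs. All names here are illustrative; a real ontology would typically live in OWL/RDF tooling rather than a Python dict:

```python
from collections import Counter

# Hypothetical domain ontology: each concept maps to its subtypes.
DOMAIN_ONTOLOGY = {
    "ambulance": ["lights_on", "lights_off"],
    "fire_truck": ["lights_on", "lights_off"],
    "police_car": ["lights_on", "lights_off"],
}

def check_domain_coverage(labels: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return every (concept, subtype) pair the dataset is missing.

    `labels` is assumed to be one (concept, subtype) annotation per
    training image.
    """
    counts = Counter(labels)
    return [
        (concept, subtype)
        for concept, subtypes in DOMAIN_ONTOLOGY.items()
        for subtype in subtypes
        if counts[(concept, subtype)] == 0
    ]

# Example: a dataset with no "ambulance with lights off" examples.
dataset = [("ambulance", "lights_on"), ("fire_truck", "lights_off"),
           ("police_car", "lights_on")]
print(check_domain_coverage(dataset))
# [('ambulance', 'lights_off'), ('fire_truck', 'lights_on'), ('police_car', 'lights_off')]
```

Every pair the function returns marks a concept the dataset fails to cover, and therefore a data collection task to finish before training.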
This approach can be used in any domain. A medical imaging ontology could define anatomical structures and pathologies. An autonomous driving ontology could define road users and infrastructure. Each ontology provides a formal way to specify what the dataset needs to cover.
Image Quality Ontologies
Covering the right concepts is not enough. A dataset might contain every object class yet fail to capture the quality variations a model will encounter in the real world. An image quality ontology addresses this.
An image quality ontology defines different quality dimensions, like lighting conditions (day, night), weather (clear, rain, snow), and occlusions (partial, full). For each concept in the domain ontology, the dataset should have examples across all these quality dimensions.
A model trained only on daytime images will perform poorly at night; a model trained only in clear weather will fail in fog. By formalizing these quality dimensions, we can systematically check the dataset for gaps.
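A sketch of that gap check, treating the quality ontology as a set of named dimensions and requiring, for a given concept, at least one example per combination of dimension values. The dimension names, values, and annotation format are assumptions for illustration:

```python
from itertools import product

# Hypothetical quality dimensions drawn from an image quality ontology.
QUALITY_DIMENSIONS = {
    "lighting": ["day", "night"],
    "weather": ["clear", "rain", "snow"],
    "occlusion": ["none", "partial", "full"],
}

def quality_gaps(concept: str, annotations: list[dict]) -> list[dict]:
    """Return every combination of quality values with no example of `concept`.

    `annotations` is assumed to be a list of per-image dicts such as
    {"concept": "ambulance", "lighting": "day", "weather": "rain",
     "occlusion": "none"}.
    """
    dims = list(QUALITY_DIMENSIONS)
    seen = {
        tuple(a[d] for d in dims)
        for a in annotations
        if a["concept"] == concept
    }
    return [
        dict(zip(dims, combo))
        for combo in product(*QUALITY_DIMENSIONS.values())
        if combo not in seen
    ]

# Example: only one daytime, clear-weather ambulance image exists,
# so 17 of the 18 quality combinations are gaps.
data = [{"concept": "ambulance", "lighting": "day",
         "weather": "clear", "occlusion": "none"}]
print(len(quality_gaps("ambulance", data)))  # 17
```

The cross-product grows quickly, and that is the point: formalizing the dimensions makes visible how many situations a seemingly large dataset never covers.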
The two-ontology approach increases trust in ML models used in safety-critical domains. Because a model's behavior reflects the data it was trained on, demonstrating that the training dataset covers both the domain concepts and the quality dimensions strengthens confidence in the trained model.