Using Schema Design for Multi-Domain AI Readiness

Key Takeaways

Schema design is the starting point for any AI that needs to work in multiple domains. It’s about creating a clear, structured way to label data.

Using broad, vague categories (weak semantics) leads to confusion. A formal, structured system (strong semantics) based on ontologies allows for better reasoning and validation.

Domain ontologies help you check if your dataset is complete. Image quality ontologies help you check if your model will be robust in different real-world conditions.

A multi-layered annotation pipeline helps you apply your schema consistently, from preparing the data to monitoring the model’s performance.

Machine learning models don’t learn from raw data.

They learn from data that has been structured according to a blueprint we provide. We call this blueprint a “schema.” Think of it as the set of instructions that tells the AI what to label, how to label it, and how the labels relate to each other.
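To make that concrete, here is a minimal, hypothetical schema sketched as plain Python. The label names, attributes, and relations are invented for illustration; real schemas are far richer.

```python
# A minimal, hypothetical schema for a single-domain detector.
# It spells out what to label, how to label it, and how labels relate.
SCHEMA = {
    "version": "1.0",
    "labels": {
        "vehicle": {"attributes": ["moving", "parked"]},
        "pedestrian": {"attributes": ["crossing", "waiting"]},
    },
    # Explicit relations make the implied structure machine-checkable.
    "relations": [
        ("vehicle", "is_a", "road_user"),
        ("pedestrian", "is_a", "road_user"),
    ],
}
```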

For an AI that only has to do one thing, designing a schema is simple. The task is specific, the labels are clear. But when an AI system needs to work across different domains, schema design becomes a much bigger challenge.

Think about a computer vision system that needs to recognize objects in both medical images and self-driving cars.

  • The schema for medical imaging would include things like anatomical structures and pathologies.
  • The schema for autonomous driving would include vehicles, pedestrians, and road signs.

These two domains don’t have much in common. But the system that handles the annotation (the tools, the workflows, the quality control) has to support both. 

This is the challenge of multi-domain AI readiness.

Weak Semantics vs. Strong Semantics

Not all schemas are the same. Some use weak semantics, which are broad categories with implied meanings. Others use strong semantics, which are explicit hierarchies with formal relationships. The difference is important.

Research from SciBite shows this distinction through an experiment in extracting life sciences articles from Wikipedia. Wikipedia organizes articles using categories, a form of weak semantics similar to SKOS (Simple Knowledge Organization System). Categories point upwards and downwards to indicate broader or narrower groupings. But the specific meaning is implicit.

This works for people browsing the site, but it causes problems for machine learning. When researchers tried to pull all the life sciences articles by following Wikipedia’s category tree, they got some strange results. The “Hearing” category led to “Sound,” which led to “Music by country,” and finally to Peruvian folk music. The “Cocaine” category led to “Fictional cocaine users,” and then to Sherlock Holmes. Neither of these belongs in a life sciences dataset.


Weak semantics allow for these kinds of loose connections. The problem is over-commitment to meaning: weak semantics let a person be categorized as narrower than a village, and a village as narrower than a country. There is no inheritance here. What is true of the country does not have to be true of the person. But when a machine learning system treats these categories as if they were a formal class hierarchy, it propagates that noise into the dataset.
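You can reproduce the failure mode in a few lines of Python. The toy category graph below mirrors the chains described above; a naive transitive walk that treats every category link as formal subclassing happily pulls both Peruvian folk music and Sherlock Holmes into the “life sciences” result.

```python
# Toy category graph with weak semantics: an edge means "narrower than",
# but carries no inheritance guarantee. Chains follow the SciBite example.
CATEGORIES = {
    "Life sciences": ["Hearing", "Cocaine"],
    "Hearing": ["Sound"],
    "Sound": ["Music by country"],
    "Music by country": ["Peruvian folk music"],
    "Cocaine": ["Fictional cocaine users"],
    "Fictional cocaine users": ["Sherlock Holmes"],
}

def transitive_members(root: str) -> set[str]:
    """Naively treat every category link as formal subclassing."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in CATEGORIES.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# "Peruvian folk music" and "Sherlock Holmes" both end up in the result:
print(transitive_members("Life sciences"))
```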

Strong semantics solve this problem. In an ontology, the relationships between classes are explicit. The World Wide Web Consortium (W3C) has established the Web Ontology Language (OWL) as a standard for this. What is true for a given class is also true for all its subclasses. This allows for better reasoning and validation. If a dataset is supposed to cover all central nervous system diseases but is missing examples of Alzheimer’s, the ontology can flag that gap.
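Here is a minimal sketch of that completeness check using the open-source rdflib library. The ontology IRI, disease classes, and dataset labels are all invented for illustration; a production ontology would use an established vocabulary rather than hand-built axioms.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/medical#")  # hypothetical ontology IRI
g = Graph()

# Explicit subclass axioms: whatever holds for CNSDisease
# holds for every one of its subclasses.
for disease in ("Alzheimers", "Parkinsons", "MultipleSclerosis"):
    g.add((EX[disease], RDF.type, OWL.Class))
    g.add((EX[disease], RDFS.subClassOf, EX.CNSDisease))

# Labels actually present in the training dataset (hypothetical).
dataset_labels = {EX.Parkinsons, EX.MultipleSclerosis}

# The ontology acts as a completeness oracle: every subclass must be covered.
required = set(g.transitive_subjects(RDFS.subClassOf, EX.CNSDisease)) - {EX.CNSDisease}
missing = required - dataset_labels
print("Missing coverage:", [str(m) for m in missing])  # flags Alzheimers
```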

For multi-domain AI, strong semantics are essential. They provide a solid foundation for checking dataset completeness and aligning schemas across different domains.

Ontologies for Dataset Validation

Ontologies do more than just organize concepts. They can also be used to validate datasets. A research paper published on arXiv suggests using two types of ontologies to make sure training datasets for safety-critical applications like autonomous driving are complete and robust.

Domain Ontologies

A domain ontology defines all the important concepts in a specific area. For emergency vehicle detection, the ontology might include ambulances, fire trucks, and police cars. Each concept is clearly defined, and the relationships between them are formalized.

The ontology acts as a checklist. Does the dataset have examples of all the vehicle types? Are all the subtypes represented? If the ontology includes “ambulance with lights on” and “ambulance with lights off” as separate concepts, the dataset needs to have both.
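A rough sketch of that checklist in Python, with hypothetical concept names and counts. The same pattern flags concepts that are present but too rare to train on.

```python
from collections import Counter

# Domain ontology flattened into a checklist of required concepts (hypothetical).
REQUIRED_CONCEPTS = {
    "ambulance/lights_on", "ambulance/lights_off",
    "fire_truck", "police_car",
}

# Annotation labels observed in the dataset (hypothetical sample).
observed = Counter({
    "ambulance/lights_on": 412,
    "fire_truck": 230,
    "police_car": 42,
})

missing = REQUIRED_CONCEPTS - set(observed)
sparse = {c: n for c, n in observed.items() if n < 100}
print("Missing entirely:", missing)    # {'ambulance/lights_off'}
print("Under-represented:", sparse)    # {'police_car': 42}
```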

This approach can be used in any domain. A medical imaging ontology could define anatomical structures and pathologies. An autonomous driving ontology could define road users and infrastructure. Each ontology provides a formal way to specify what the dataset needs to cover.

Image Quality Ontologies

Covering the right concepts is not enough. A dataset might contain every required object type yet fail to capture the quality variations a model will see in the real world. An image quality ontology can help with this.

An image quality ontology defines different quality dimensions, like lighting conditions (day, night), weather (clear, rain, snow), and occlusions (partial, full). For each concept in the domain ontology, the dataset should have examples across all these quality dimensions.

A model trained only on daytime images will not work well at night. A model trained only in clear weather will fail in the fog. By formalizing these quality dimensions, we can systematically check for gaps in the dataset.
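One way to make the check systematic is to enumerate the cross-product of quality dimensions and compare it against the combinations the dataset actually covers. A sketch, with the dimension values from above and hypothetical coverage data:

```python
from itertools import product

# Quality dimensions from the image quality ontology.
LIGHTING = ("day", "night")
WEATHER = ("clear", "rain", "snow")
OCCLUSION = ("none", "partial", "full")

# (concept, lighting, weather, occlusion) tuples seen in the dataset (hypothetical).
covered = {
    ("ambulance", "day", "clear", "none"),
    ("ambulance", "day", "rain", "partial"),
}

concept = "ambulance"
gaps = [
    combo
    for combo in product([concept], LIGHTING, WEATHER, OCCLUSION)
    if combo not in covered
]
print(f"{len(gaps)} of {2 * 3 * 3} quality combinations lack examples")
```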

The two-ontology approach increases trust in ML models used in safety-critical domains. Because a model embodies whatever it was trained on, demonstrating that the training dataset is complete and varied is direct evidence that the model itself can be trusted.

Multi-Layered Annotation Pipelines

Schema design has to be put into practice through an annotation pipeline. For complex AI tasks, a multi-layered approach is recommended, with five distinct layers.

  1. Pre-Annotation and Data Preparation: This layer handles cleaning the data, removing duplicates, and ensuring a balanced representation of different groups. It can also apply some initial, automated labels to speed up the process.
  • For multi-domain AI, this layer must handle heterogeneous data sources. Medical images arrive in DICOM format. Autonomous driving images arrive in JPEG. Text data arrives in multiple languages. The pre-annotation layer standardizes these inputs so that downstream layers can apply schemas consistently.
  2. Human Annotation: This is where the schema is applied. Hierarchical labels and nested attributes capture the full meaning of the data. For example, a medical image might be labeled with “lung” (anatomical structure), “nodule” (pathology), and “malignant” (diagnosis).
  • Inter-annotator agreement serves as a pulse check. If annotators disagree frequently, the schema may be ambiguous; if agreement is high, the schema is clear. This feedback loop informs schema refinement (see the sketch after this list).
  3. Quality Control and Validation: This layer includes multi-pass reviews and automated checks to ensure quality. One pass might check for logical consistency (e.g., no “day” label in a nighttime image). Another might flag unusual patterns in the annotations.
  • For multi-domain AI, this layer must validate schema application across domains. Are medical image annotations held to the same quality standards as autonomous driving annotations? If not, the pipeline must adapt.
  4. Model-Assisted and Active Learning: This layer creates a feedback loop between people and algorithms. A model trained on earlier data can suggest labels, and humans validate or correct them. Active learning techniques identify the areas where the model is most uncertain and prioritize those for human review.
  5. Governance and Monitoring: Version control, schema tracking, and audit logs ensure traceability. As schemas change, governance tracks when and why changes occurred. It also monitors for bias and data drift. The U.S. government’s AI readiness framework underscores the importance of data management and governance in building reliable AI systems.
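To illustrate layers two and three, here is a small, self-contained sketch: a Cohen’s kappa pulse check for inter-annotator agreement, plus a rule-based consistency check of the “no day label in a nighttime image” variety. All labels and values are hypothetical.

```python
from collections import Counter

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Agreement between two annotators labeling the same items,
    corrected for the agreement expected by chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def qc_consistency(labels: set[str]) -> list[str]:
    """Layer-3-style rule check: flag logically impossible combinations."""
    issues = []
    if {"day", "night"} <= labels:
        issues.append("image labeled both 'day' and 'night'")
    return issues

ann1 = ["nodule", "nodule", "normal", "nodule"]
ann2 = ["nodule", "normal", "normal", "nodule"]
print(f"kappa = {cohen_kappa(ann1, ann2):.2f}")  # low kappa => ambiguous schema?
print(qc_consistency({"day", "night", "vehicle"}))
```

A low kappa on a specific label is often the earliest signal that the schema, not the annotators, needs fixing.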


Schema Evolution Strategies

Schema requirements change, and new domains are added over time. The challenge is to manage this evolution without breaking the existing annotations.

  • Version Control: Every change to the schema should be versioned, so that any annotation can be traced back to the schema it was created under. If a model was trained on an older version of the schema, it can still be evaluated on data annotated with a newer version.
  • Backward Compatibility: Adding a new label is fine, but deleting or renaming a label can cause problems. When incompatible changes are needed, you have to define a migration path for the existing annotations. How will existing annotations be updated? Will they be re-annotated? Will they be automatically migrated using a mapping? Will they be deprecated?
  • Cross-Domain Alignment: Multi-domain AI requires aligning schemas across different domains. This doesn’t mean using the exact same schema for everything. 

For example, medical imaging and autonomous driving might both use a three-level hierarchy: object type, object state, and context. Medical imaging might label "lung" (type), "nodule" (state), "malignant" (context). Autonomous driving might label "vehicle" (type), "moving" (state), "intersection" (context). The structure is the same, even though the labels differ.

This structural alignment allows for transfer learning, where a model trained on one domain can transfer its knowledge to another.
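A sketch of what that structural alignment, plus a simple version migration, might look like in code. The three-level structure and the domain labels come from the example above; the v1-to-v2 rename is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """Shared three-level structure: object type, object state, context."""
    obj_type: str
    obj_state: str
    context: str

# Same structure, domain-specific vocabularies.
medical = Annotation("lung", "nodule", "malignant")
driving = Annotation("vehicle", "moving", "intersection")

# Versioned rename map: migrate v1 annotations to v2 without re-labeling.
RENAMES_V1_TO_V2 = {"vehicle": "motor_vehicle"}  # hypothetical schema change

def migrate(ann: Annotation) -> Annotation:
    return Annotation(
        RENAMES_V1_TO_V2.get(ann.obj_type, ann.obj_type),
        ann.obj_state,
        ann.context,
    )

print(migrate(driving))  # obj_type becomes 'motor_vehicle'; medical is untouched
```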

