Date
October 17, 2025
Time
5 min

Today, we’re going to pull back the curtain on the one thing that determines whether your AI venture soars to the moon or crashes and burns: your data.
About Sibghat:
Sibghat Ullah leads CNTXT AI’s data practice. He serves as Technical Product & Program Manager, driving the development of tools, teams, and datasets that power Arabic-first AI.
The model is important. But it’s not where most projects succeed or fail. In practice, the limiting factor is almost always the data. A model, no matter how advanced, can only reflect the patterns in the data it’s trained on. If the inputs are incomplete, inconsistent, or biased, the outputs will be unreliable. And that’s not a theoretical risk; it’s the main reason 70–80% of enterprise AI projects never reach their expected return.
From a business perspective, this is critical. AI decisions feed directly into operations. If the training data doesn’t match real-world conditions, the model will underperform. That leads to financial loss, compliance risks, and reputational damage. Companies often underestimate this. They assume accuracy is a function of the algorithm, when in reality it’s a function of data quality.
So when you ask if the model is the main thing, the answer is no. Models are becoming commoditized. You can access great architectures from open-source or cloud providers. What isn’t commoditized is your data. The companies that succeed treat their data as a core business asset. They build processes and teams around maintaining that asset, because they know it’s the foundation that every AI initiative rests on.
The first mistake is assuming that volume equals value. Classic. The belief that if you just throw enough data at the problem, the AI will magically figure it out. So, companies go on a data hoarding spree. They collect everything, from every possible source, and store it in massive data lakes. Unfortunately, what they end up with is a data swamp. Noisy, irrelevant, and full of errors.
The smart companies take a different path. They are curators. They’re ruthless about what they keep and what they discard. They understand that their data is a reflection of their business. So they treat it with the same care as their products or their customers.
The second mistake is treating data preparation as a one-time exercise. But nothing stays still. Business conditions change. Customer behavior shifts. Products evolve. Regulations get updated. If the data pipeline isn’t continuously monitored and refreshed, the model becomes outdated. This is what causes model drift: predictions keep coming, but accuracy steadily declines because the data no longer matches reality. Without constant quality checks and updates, even the best models degrade.
The third mistake is “Human vs. Machine.” This one is more nuanced, but it’s just as important. It comes down to mismanaging the balance between automation and human oversight.
Automation scales well for tasks like deduplication, schema alignment, or detecting anomalies in large datasets. But full automation without human input creates blind spots. It misses contextual issues like subtle business rules or cultural nuances hidden in the data.
That’s where people come in. They apply their judgment, their intuition, and their deep understanding of the business to guide the process. The best practice is to combine automated checks with expert review for both efficiency and contextual accuracy.
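As a minimal sketch of that division of labor, automation can deduplicate records and flag statistical outliers, while a human reviews whatever gets flagged. The field names and threshold below are illustrative, not from any specific pipeline:

```python
from statistics import median

def automated_checks(rows, value_field, z_thresh=3.5):
    """Deduplicate records and flag outliers for human review."""
    # Exact-duplicate removal: cheap and safe to fully automate.
    seen, deduped = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(dict(row))
    # Robust outlier detection via median absolute deviation (MAD).
    # Automation only *surfaces* candidates; a domain expert decides
    # whether a flagged value is an error or a real business event.
    values = [row[value_field] for row in deduped]
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9
    for row in deduped:
        row["needs_review"] = 0.6745 * abs(row[value_field] - med) / mad > z_thresh
    return deduped

orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 1, "amount": 120.0},    # exact duplicate
    {"order_id": 2, "amount": 95.0},
    {"order_id": 3, "amount": 110.0},
    {"order_id": 4, "amount": 9_999.0},  # suspicious: surfaced, not deleted
]
checked = automated_checks(orders, "amount")
```

Note that the outlier is flagged, not dropped: the machine narrows the search, the human makes the call.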
And probably the last one for me is lack of ownership. Data prep usually sits between IT, analytics, and business teams, which means accountability gets blurred. No one takes full responsibility for defining standards, governance, or validating ongoing quality.
It's important to assign clear ownership, whether through data product managers, domain leads, or dedicated stewardship teams, so that the data feeding AI is continuously managed as a strategic asset.
These are exactly the companies that moved from seeing data as a byproduct of their business to seeing it as the product itself. As mentioned earlier, it starts with ownership. They assign ownership to people who are responsible for the entire lifecycle of a dataset. They own the quality, the usability, and the value of the data. They’re obsessed with their “customers” and constantly gathering feedback and improving.
Another key is building a Data QA culture. The best companies test their data at every stage of the pipeline. From the moment it’s collected, to how it’s cleaned, to how it’s used. They treat data with the same discipline as code. They have service level agreements (SLAs) for their data products, and they hold themselves accountable for meeting them.
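In practice, “treating data with the same discipline as code” can start as simply as unit-test-style checks on every batch that enters the pipeline. A minimal sketch; the field names and rules (customer_id, amount, country) are hypothetical:

```python
def run_data_qa(records):
    """Return a list of failures; an empty list means the batch passes."""
    failures = []
    # Uniqueness check, like a primary-key constraint.
    ids = [r.get("customer_id") for r in records]
    if len(ids) != len(set(ids)):
        failures.append("duplicate customer_id values")
    for i, r in enumerate(records):
        # Validity checks: the business rules live next to the data.
        if r.get("amount") is None or r["amount"] < 0:
            failures.append(f"record {i}: missing or negative amount")
        if not r.get("country"):
            failures.append(f"record {i}: missing country")
    return failures

good_batch = [{"customer_id": 1, "amount": 10.0, "country": "AE"},
              {"customer_id": 2, "amount": 25.5, "country": "SA"}]
bad_batch  = [{"customer_id": 1, "amount": -5.0, "country": ""}]
```

A batch that fails these checks gets quarantined rather than silently loaded, which is exactly where an SLA for a data product starts.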
This is a huge departure from the “garbage in, garbage out” mentality that’s so prevalent in many organizations. In a Data QA culture, data is guilty until proven innocent. It’s a culture of skepticism and of relentless attention to detail. And it’s the only way to build the kind of high-quality data foundation that’s required for successful AI.
I think there are plenty of playbooks out there already. You can Google ‘data preparation steps’ and find a checklist. But checklists only get you so far. What really matters, if you’re serious about getting a system into production, are the questions you ask yourself along the way.
Don’t just ask “where is my data?” Ask: who owns it, who uses it, and do they even trust it? Half the time, the problem isn’t access. It’s that people don’t believe the numbers. If you can’t answer those questions, you don’t have a foundation.
On quality, don’t settle for “is it clean?” You have to ask: what decisions could this data break if it’s wrong? That’s how you prioritize. No company has the resources to fix every field in every system. You focus where errors cost money, compliance, or customer trust.
When transforming data, ask: does this transformation still reflect the business reality? I’ve seen teams normalize data in ways that strip out the exact nuance that mattered. The result is a model that’s technically impressive and commercially useless.
I think in general, the real tactic is to focus on the blind spots: ownership, trust, business context, accountability, and response. That’s the work no one puts in a slide, but it’s what keeps deployments alive in the real world.
Ah, pilot purgatory. This is the graveyard where AI dreams go to die. It's where companies get stuck running endless proof-of-concepts and pilots that never make it to production. And it's more common than you might think.
I'll give you a typical pattern. A company decides to "experiment" with AI. They pick a use case, usually something that seems technically interesting but isn't necessarily business-critical. They assign a small team. Give them a limited budget, and tell them to "see what they can do." The team builds a model. It shows promise in testing, and everyone gets excited. But then, when it comes time to deploy it in the real world, everything falls apart.
Why? Because they never thought about the infrastructure. They never considered the data pipeline. They never involved the end-users. They never thought about governance, compliance, or scalability. They built a beautiful prototype, but they didn't build a product.
By the time you need an escape plan, it’s too late, which is why it’s important to start with a business problem where you can create business value, not just with what looks technically interesting. Think about scalability, reliability, and maintainability from the beginning. Include everyone in the design process (operations, IT, and end users). You have to have an infrastructure reality check from day one so your investment makes sense.
The first thing to do is to stop pitching data prep as “cost.” Frame it as risk management and ROI protection. Boards understand risk. If you tell them: “We can spend $x million on this AI initiative, but without proper data prep we have a 70–80% chance of failure,” that’s a losing pitch. But if you show that an additional $500k in data preparation reduces failure risk dramatically and raises expected returns, the conversation changes. You’re not asking for more money. You’re protecting the money already committed.
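That reframing is easy to make concrete with a back-of-the-envelope expected-value calculation. All dollar figures and probabilities below are illustrative assumptions, not figures from this article:

```python
def expected_return(investment, success_prob, payoff_if_success):
    """Expected net return of an initiative, treating failure as total loss."""
    return success_prob * payoff_if_success - investment

# Assumption: a $5M AI initiative that pays back $15M if it succeeds.
# Taking a 75% failure rate (within the 70-80% range), success_prob = 0.25.
without_prep = expected_return(5_000_000, 0.25, 15_000_000)
# Assumption: an extra $500k of data preparation lifts the success
# probability to 50%. The size of the lift is the hypothetical here.
with_prep = expected_return(5_500_000, 0.50, 15_000_000)
```

On those assumptions, the expected return flips from negative $1.25M to positive $2M. The $500k isn’t a new cost line; it’s what makes the $5M already committed rational.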
You also need to highlight the hidden benefits. When you improve your data infrastructure for AI, you also improve reporting, analytics, compliance, and decision-making across the organization. Faster access to reliable data is good for finance, operations, and customer teams.
Data drift and model decay are the reasons why a model that works perfectly in testing can fail spectacularly in production. And they're the reasons why AI systems need constant monitoring and maintenance.
Drift means the data your model is seeing today is no longer the same as the data it was trained on. It comes in different forms: changes in the input features, in the meaning of the outputs, or in how labels are applied. Decay is the consequence: as drift accumulates, the model’s accuracy erodes over time.
This happens in every business. As I've said in the beginning, customer behavior changes. Market conditions shift. New regulations alter how data is recorded. Competitors launch new products that change demand. Your model doesn’t adapt on its own, it keeps making predictions based on old patterns. And because the decline is gradual, most companies don’t notice until the damage is done. You get wrong forecasts, poor recommendations, misallocated resources.
You can catch this early with statistical tests (among other techniques) that detect changes in the distribution of your data. But performance monitoring is the most direct approach. And you have to set up a response strategy for when it fires.
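One such statistical test is the two-sample Kolmogorov–Smirnov test, which compares the distribution of a feature at training time against what the model sees in production. A minimal sketch; the 0.1 alert threshold is illustrative, and in practice you would calibrate it against historical baselines:

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

random.seed(42)
training = [random.gauss(0.0, 1.0) for _ in range(2000)]
stable   = [random.gauss(0.0, 1.0) for _ in range(2000)]  # same distribution
drifted  = [random.gauss(1.5, 1.0) for _ in range(2000)]  # inputs have shifted

DRIFT_THRESHOLD = 0.1  # illustrative; calibrate on your own baselines
```

When the statistic crosses the threshold, the response strategy kicks in: investigate the feature, retrain, or roll back.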