Briefing: The Future of AI - Data Curation Economy

This briefing summarizes key themes and important ideas from “Andrew Ahn, Beyond the Model: Why the Future of AI Is a Data Curation Economy,” an episode of the Praxi Pod featuring Andrew Ahn, CEO of Praxi. The discussion highlights a crucial shift in the AI landscape, moving beyond a sole focus on model size to an emphasis on high-quality, curated data, particularly within regulated industries.

1. The Shifting Focus: Beyond Model Size to Data Quality

The central theme of the discussion is the growing realization that the size and complexity of AI models are secondary to the quality and reliability of the data used to train them. Andrew Ahn argues that a significant investment by Meta in Scale AI underscores this shift. He states,

“I think really it underlies the importance of kind of a shift in focus… really focusing on which you know arguably we we believe in as well which is really the foundational components of it which is the source of the data kind of the uh the the foundation for all of AI going forward which is really high quality data is it trustable is it reliable.”

The analogy of a high-performance sports car requiring the right fuel illustrates this point:

“I mean getting away from the model fixation I mean really it doesn’t matter if your model is good if you know you’re just scraping data from the internet without really taking a hard look at the quality of the data that you’re feeding the models.”

This is the familiar “garbage in, garbage out” problem, which is even more pronounced in AI.

2. Limitations of RLHF and the Need for Expert Curation

The source identifies a fundamental flaw in traditional reinforcement learning from human feedback (RLHF) when applied to regulated industries. Both supervised learning and RLHF assume correctly labeled, trustworthy data, yet the human element in the process is inefficient and resource-intensive.

“human beings are tasked with the idea of making that that qualitative decision uh when training the models But really you taken this automatic process or automated process and only automated part of it right so it still requires an inordinate amount of time and resources and specialized skills from humans people to to make that outcome you know successful.”

Praxi’s hypothesis, as explained by Andrew Ahn, is to use “expert-trained curation.” This involves subject matter experts (SMEs) providing “much higher fidelity and resolution to that learning,” leading to significantly better outputs from AI models, including generative AI and RAG AI.

This approach necessitates a reduced scope, focusing on specific verticals or use cases (e.g., insurance and sensitive data for Praxi) rather than attempting to create a generalized tool.

3. Compliance as a Competitive Advantage

A crucial and counter-intuitive idea presented is that compliance, often viewed as a constraint, can actually be a significant competitive advantage in AI development, particularly for regulated industries. Instead of fighting against compliance, Praxi’s philosophy is to “work with” it, believing in its motivations of “trustworthy transparent responsible you know outputs.”

Integrating compliance from the outset, rather than as an afterthought, accelerates the development process. Ahn explains,

“if you start off with properly classified or curated data the building blocks that you’re using to build your project and to build your models from in your analytics you know that they’re trustworthy… And because of that it actually makes the process much faster.”

Adding compliance at the end leads to “rework cycles” and can invalidate previous efforts. Building it in from the start avoids that rework, making the process “much more efficient and cost effective as well.” This is articulated as “compliance design first thinking.”
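As a rough illustration of the compliance-first idea, the sketch below gates records on their classification before any of them reach model training. It is a minimal sketch under stated assumptions, not Praxi’s implementation: the field names (`id`, `classification`), the label values, and the `gate_for_training` helper are all hypothetical.

```python
# Hypothetical labels that a compliance policy has cleared for model training.
CLEARED_FOR_TRAINING = {"public", "internal"}


def gate_for_training(records):
    """Split records into (approved, held_back) based on their classification.

    Records with no classification are held back rather than assumed safe,
    so nothing unreviewed reaches training.
    """
    approved, held_back = [], []
    for record in records:
        if record.get("classification") in CLEARED_FOR_TRAINING:
            approved.append(record)
        else:
            held_back.append(record)
    return approved, held_back


# Illustrative records only; the values are made up for this sketch.
records = [
    {"id": "doc-1", "classification": "public"},
    {"id": "doc-2", "classification": "restricted"},
    {"id": "doc-3"},  # never classified
]
approved, held_back = gate_for_training(records)
print([r["id"] for r in approved])   # ['doc-1']
print([r["id"] for r in held_back])  # ['doc-2', 'doc-3']
```

Because the check runs before training, a policy change would mean re-running the gate over the data rather than reworking a finished model.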

4. Specialization vs. Generalization in AI Tools

The discussion highlights the limitations of generalized AI tools, exemplified by Scale AI, when operating in regulated and specialized environments. Andrew Ahn argues that

“If you don’t specialize in a certain vertical or set of use cases, you know you’re really creating an artificial ceiling in terms of its effectiveness and productivity.”

Praxi’s approach of pre-training models for specific domains (e.g., property and casualty insurance, and life and health insurance) offers a significant advantage. This pre-training incorporates specific terminology and compliance ramifications that a general-purpose tool would lack, allowing clients to

“hit the ground running right out of the gate rather than having to train.”

Generalized tools, without this specialization, incur “a lot of anchors,” including extensive QA and compliance overheads.

This domain-specific focus also builds a stronger foundation of trust. When a model already understands the nuances and regulatory landscape of a client’s industry, the client can have greater confidence in its outputs. It minimizes the risk of generating non-compliant or contextually inappropriate content, which is a major concern when using generic AI. By embedding this expertise directly into the tool, Praxi not only accelerates deployment but also enhances the reliability and safety of its AI solutions from day one.

5. Future-Proofing AI Strategy: The Data Curation Imperative

For organizations outside the “technology-centered” giants (like Google or Facebook), effective AI adoption and data-driven decision-making hinge on high-quality data curation. Andrew Ahn advises,

“In order to make really good data-driven decisions you really need to get to that first part that that curation and labeling part first. And to do that best is to have pre-trained specific models.”

This is presented as “the only way you can compete and have those high quality decisions that is comparable to maybe a technology focused or led you know company.”

Praxi offers three key areas of support:

Data Curation: Leveraging patented processes to create complex, synthetic labels and specialized libraries that general-purpose tools lack.
System Integration: Enhancing the entire data stack by synchronizing labels and metadata across existing tools, breaking down data silos, and embedding compliance “throughout the whole step.” This can significantly reduce the time data workers spend “dumpster diving for good data to make sure that it’s trustworthy. Maybe not eliminate that but we can cut that down by half.”
Automatic Actions: Enabling immediate responses to identified data patterns by triggering alerts, emails, or system integrations, thereby cutting down response times.
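
The “Automatic Actions” idea can be pictured as rules that map curated labels to responses. The following is a minimal sketch under stated assumptions, not a description of Praxi’s product: the `Rule` type, the label names, and the alert and email functions are hypothetical stand-ins that only return strings.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set


@dataclass(frozen=True)
class Rule:
    """Fire `action` when a record carries every label in `required`."""
    required: FrozenSet[str]
    action: Callable[[str], str]


def alert_security(record_id: str) -> str:
    # Stand-in for a real alerting integration (hypothetical).
    return f"alert: {record_id} needs security review"


def email_owner(record_id: str) -> str:
    # Stand-in for a real email integration (hypothetical).
    return f"email: owner notified about {record_id}"


def dispatch(record_id: str, labels: Set[str], rules: List[Rule]) -> List[str]:
    """Run every rule whose required labels are all present on the record."""
    return [rule.action(record_id) for rule in rules if rule.required <= labels]


# Illustrative rules; the label names are assumptions made for this sketch.
rules = [
    Rule(frozenset({"pii", "unencrypted"}), alert_security),
    Rule(frozenset({"claims"}), email_owner),
]

# Both rules match the first record; neither matches the second.
print(dispatch("rec-42", {"pii", "unencrypted", "claims"}, rules))
print(dispatch("rec-43", {"pii"}, rules))  # []
```

Keeping the rules as data, separate from the matching logic, means the choice of which label combinations warrant which response can be changed without touching the code that evaluates them.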

The ultimate aim is to move from a purely technical discussion about data to a business discussion about policy and action.

As Ahn notes,

“the technologist can’t make the decision around what to do with the data. It’s usually got to be a business person or a business owner.”

The increasing ubiquity of data means that differentiation will come from an organization’s ability to “react to new data sources” and integrate them into efficient workflows to achieve high productivity and fast ROI.

The core message concludes: “beyond the model the future of AI is the you know the future of the AI economy is data curation.”