Data discipline becomes media’s real competitive edge in the AI era

By Dawn McMullan

Assisted by ChatGPT

Dallas, Texas, United States


While generative AI dominated much of the conversation during Media Tech & AI Week, the sessions that resonated most strongly with executives came back to something older and quieter: data discipline. 

Speaker after speaker — from start-ups in Palo Alto to established publishers like Hearst — described a landscape where the next breakthroughs depend less on model access and more on mastering the inputs that power them.

The week’s three-day Silicon Valley study tour revealed how data is becoming both the language and the currency of media transformation. The subsequent two-day conference showed how those lessons translate into newsroom, product, and commercial strategy.

Together, they traced a through-line from infrastructure to innovation: that owning, structuring, and governing first-party data is what enables real AI progress.

Building resilience with first-party data

At Cloudflare, participants saw the front line of the data economy. The company’s engineers detailed how most publisher sites are still scraped daily by unknown bots — often without consent or compensation.

To counter this, Cloudflare has built tools that let publishers identify and block unauthorised crawlers in real time, as well as a “pay-per-crawl” model that introduces the idea of value exchange between data owners and AI developers.

Study tour attendees ask questions at Cloudflare.
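
In practice, this kind of gating starts with identifying which agent is making a request and deciding whether to serve it, bill for it, or refuse it. The sketch below is a generic Python illustration of that logic, not Cloudflare's implementation; the crawler names and the licensed-crawler list are assumptions included only to show the idea.

# Generic illustration of user-agent-based crawler gating.
# Not Cloudflare's implementation; the bot names below are examples only.

AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}   # assumed examples of AI crawler agents
LICENSED_CRAWLERS = {"CCBot"}                    # hypothetical: bots covered by a licensing deal

def crawl_decision(user_agent: str) -> str:
    """Return 'allow', 'charge', or 'block' for an incoming request."""
    bot = next((name for name in AI_CRAWLERS if name.lower() in user_agent.lower()), None)
    if bot is None:
        return "allow"    # ordinary traffic passes through
    if bot in LICENSED_CRAWLERS:
        return "charge"   # pay-per-crawl: serve the page and log a billable event
    return "block"        # unauthorised AI crawler

if __name__ == "__main__":
    print(crawl_decision("Mozilla/5.0 (compatible; GPTBot/1.0)"))       # -> block
    print(crawl_decision("CCBot/2.0 (https://commoncrawl.org/faq/)"))   # -> charge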

The visit crystallised a new truth: first-party data strategy now extends beyond audience information to include content access itself. Protecting that access is becoming as critical as collecting user data. Several executives noted that unless publishers can see who is using their data and under what conditions, no downstream AI initiative can be trusted.

That theme continued across the tour. At Vermillio, attendees learned how its TraceID system can detect exactly where and how a publisher’s text, audio, or images appear in AI models. News publishers can now audit the invisible layer of model training, quantify unlicensed use, and prove compliance with licensing deals.
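
Vermillio did not walk through its internals, but a common baseline for quantifying verbatim text reuse is shingle overlap: counting how many fixed-length word windows from an article reappear in a model's output. The sketch below illustrates only that generic approach; it is not how TraceID works.

# Generic shingle-overlap check for verbatim text reuse.
# Illustrative only; not Vermillio's TraceID implementation.

def shingles(text: str, n: int = 8) -> set:
    """Return the set of n-word windows in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def reuse_score(source_article: str, model_output: str, n: int = 8) -> float:
    """Fraction of the article's n-word shingles that reappear verbatim in the output."""
    src = shingles(source_article, n)
    if not src:
        return 0.0
    return len(src & shingles(model_output, n)) / len(src)

if __name__ == "__main__":
    article = "The council voted on Tuesday to approve the downtown redevelopment plan after months of debate."
    output = "According to reports, the council voted on Tuesday to approve the downtown redevelopment plan."
    print(f"Verbatim reuse: {reuse_score(article, output, n=6):.0%}")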

For news executives, these tools shift the conversation from fear of data leakage to data assertiveness: understanding one’s own assets well enough to bargain, license, or restrict them intelligently.

Consent as a new product dimension

Consent management also emerged as a commercial differentiator. At Microsoft, MSN’s product leaders described how privacy-preserving personalisation has become a design feature rather than a compliance afterthought. Their ranking systems use aggregated first-party signals to guide discovery and relevance, but avoid identifying individual users directly. The goal, they said, is to build trust through transparency, giving users and partners confidence that data improves the experience without overstepping.
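
A minimal sketch of what aggregated first-party signals can look like in code: engagement is counted at the topic level, small cohorts are suppressed, and ranking never reads a user identifier. The event schema and threshold here are assumptions for illustration, not MSN's actual system.

# Illustrative sketch: rank stories by aggregated topic engagement, with no per-user profile.
# The schema and the minimum-count threshold are assumed for illustration only.
from collections import Counter

def topic_scores(events, min_count: int = 50):
    """Aggregate click events into topic-level counts; no user ID is read.
    Topics with too few interactions are dropped so no small group can be singled out."""
    counts = Counter(e["topic"] for e in events)
    return {topic: n for topic, n in counts.items() if n >= min_count}

def rank_stories(stories, scores):
    """Order candidate stories by the aggregated popularity of their topic."""
    return sorted(stories, key=lambda s: scores.get(s["topic"], 0), reverse=True)

if __name__ == "__main__":
    events = [{"topic": "climate"}] * 120 + [{"topic": "markets"}] * 80 + [{"topic": "niche"}] * 3
    stories = [{"id": 1, "topic": "markets"}, {"id": 2, "topic": "climate"}, {"id": 3, "topic": "niche"}]
    print(rank_stories(stories, topic_scores(events)))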

Similarly, at CTGT — a start-up founded by a Stanford-trained AI researcher — the conversation turned to traceability. Founder Cyril Gorlla argued that auditability should be a baseline expectation for any AI used in newsrooms. His team has developed a system that measures how much copyrighted material is embedded in a model and where the information originates. 

For media companies, this kind of testing not only reduces legal risk but also supports more ethical use of retrieval-augmented generation (RAG) tools.

Executives noted that in an environment of eroding third-party data, consent isn’t only legal hygiene — it’s a chance to differentiate. Publishers that can show readers how their data fuels relevance, not surveillance, will be better positioned to maintain trust and first-party engagement.

Retrieval-augmented generation and the knowledge layer

The study tour’s first stop at Otter.ai demonstrated how RAG (retrieval-augmented generation) principles are already shaping workplace productivity. The company’s leadership described an evolving system that turns recorded conversations into a searchable knowledge base, drawing connections across meetings, projects, and departments.

Elliot Rogers, enterprise account executive at Otter, discussed RAG (retrieval-augmented generation) with study tour attendees.

For media organisations, this concept mirrors what many hope to build for their own archives: dynamic, queryable datasets that combine structured and unstructured information. Instead of static dashboards, the goal is living data — systems that not only store but also explain, contextualise, and retrieve insights on demand.

By the end of the week, several conference sessions tied this directly to newsroom practice. Retrieval-augmented generation is increasingly viewed as the bridge between proprietary archives and public models. It allows publishers to keep sensitive or licensed content behind the firewall while still enabling AI systems to answer questions or generate context. 

This balance of openness and protection reflects the new ethos of data governance in media: transparency without exposure.
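
In schematic terms, the pattern works like this: the archive stays in a private index, only the few passages retrieved for a given question are sent to the model, and the model is asked to answer from those excerpts alone. The sketch below assumes hypothetical embed() and generate() functions standing in for whatever embedding and language-model services a publisher has licensed.

# Schematic RAG loop: the archive never leaves the publisher's index; only the
# passages retrieved for a given question are passed to the model.
# embed() and generate() are hypothetical stand-ins for licensed services.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-9)

def retrieve(question, index, embed, k: int = 3):
    """Return the k archive passages most similar to the question."""
    q_vec = embed(question)
    scored = sorted(index, key=lambda item: cosine(q_vec, item["vector"]), reverse=True)
    return [item["text"] for item in scored[:k]]

def answer(question, index, embed, generate):
    """Ground the model's answer in retrieved passages instead of exposing the full archive."""
    passages = retrieve(question, index, embed)
    prompt = "Answer using only these excerpts:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"
    return generate(prompt)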

Data governance as competitive advantage

At Hearst, the conversation became concrete. Executives outlined how they built a central editorial innovation team to help dozens of local titles adopt AI safely and consistently. Each tool rollout begins with a governance checklist: identifying what data is used, where it comes from, and how accuracy is verified. This framework prevents fragmentation while allowing experimentation.
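
A minimal sketch of how such a checklist could be made machine-readable so it gates each rollout automatically; the field names below are assumptions for illustration, not Hearst's actual framework.

# Minimal sketch of a machine-readable governance checklist that gates AI tool rollouts.
# Field names are illustrative assumptions, not Hearst's framework.
from dataclasses import dataclass, field

@dataclass
class GovernanceChecklist:
    tool_name: str
    data_sources: list = field(default_factory=list)   # what data is used
    data_origin_documented: bool = False               # where it comes from
    accuracy_check: str = ""                           # how accuracy is verified
    consent_basis: str = ""                            # e.g. "first-party, opted-in"

    def approved(self) -> bool:
        """A rollout proceeds only when every item on the checklist is answered."""
        return bool(self.data_sources and self.data_origin_documented
                    and self.accuracy_check and self.consent_basis)

if __name__ == "__main__":
    checklist = GovernanceChecklist(
        tool_name="headline-suggester",
        data_sources=["article archive"],
        data_origin_documented=True,
        accuracy_check="editor review before publish",
        consent_basis="first-party, opted-in",
    )
    print(checklist.approved())  # True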

Participants heard that governance is not bureaucracy — it’s infrastructure for trust. Hearst’s leaders argued that by codifying standards for accuracy, consent, and transparency early, publishers can move faster later. Several attendees drew parallels to financial auditing: once systems are clear, innovation accelerates because everyone knows the boundaries.

That approach was echoed at the conference itself, where multiple speakers stressed that data governance is now as strategic as product design. Media leaders who treat data as a managed asset — complete with ownership, lineage, and security — will outpace those who treat it as a byproduct of operations.

Personalisation redefined by trust and context

Across the week, “personalisation” took on a new meaning. Microsoft’s MSN and LinkedIn teams described how they now focus on contextual personalisation rather than behavioural profiling — shaping experiences around content relevance and professional identity instead of raw click patterns.

Nikhil Kolar, vice president/product and engineering at Microsoft, presenting at the Media Tech & AI Conference.

At the conference, several case studies illustrated how this thinking is spreading. Publishers are designing data systems that understand relationships between topics, entities, and authors rather than simply tracking individuals. One executive called it “a shift from predicting people to understanding content.”
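
One lightweight way to picture that shift is a content graph that links stories through shared topics, entities, and authors, with no individual profile involved; related items are found by walking shared nodes. The sketch below is a hypothetical illustration, not any publisher's production system.

# Hypothetical content graph: stories are related through shared topics, entities, and
# authors rather than through user tracking. Illustrative only.
from collections import defaultdict

def build_index(stories):
    """Map each topic/entity/author node to the set of story IDs that mention it."""
    index = defaultdict(set)
    for story in stories:
        for node in story["topics"] + story["entities"] + [story["author"]]:
            index[node].add(story["id"])
    return index

def related(story, index):
    """Stories sharing the most nodes with the given story come first."""
    counts = defaultdict(int)
    for node in story["topics"] + story["entities"] + [story["author"]]:
        for other_id in index[node]:
            if other_id != story["id"]:
                counts[other_id] += 1
    return sorted(counts, key=counts.get, reverse=True)

if __name__ == "__main__":
    stories = [
        {"id": 1, "topics": ["energy"], "entities": ["ERCOT"], "author": "A. Reyes"},
        {"id": 2, "topics": ["energy"], "entities": ["ERCOT", "PUC"], "author": "B. Lin"},
        {"id": 3, "topics": ["schools"], "entities": ["DISD"], "author": "A. Reyes"},
    ]
    print(related(stories[0], build_index(stories)))  # [2, 3]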

This reframing dovetails with audience expectations. In a post-cookie world, users increasingly expect value in exchange for data. The week’s discussions suggested that the most sustainable path forward is reciprocal data value: the more transparent a publisher is about data use, the more willingly audiences share it.

From collection to connection

A recurring phrase throughout the sessions was that “data is only as valuable as the connections you make with it.” Whether those connections link newsroom insight to audience engagement or licensing metadata to AI model governance, the next stage of media data strategy is relational rather than extractive.

Speakers from multiple companies noted that the industry has matured from mere data collection to a phase of data orchestration — where data flows between departments, products, and even external partners under clear consent rules. The companies most advanced in this shift are creating internal data platforms that act as both repositories and engines for innovation.

At the conference’s closing discussion, panellists agreed that success with AI won’t depend on how sophisticated a model a publisher uses, but on how well that publisher manages, annotates, and applies its own data. As one participant summarised, data maturity is now the multiplier of every other innovation.

Innovation through governance

Perhaps the most counterintuitive insight of the week was that governance itself can be a source of creativity. By formalising privacy boundaries, transparency standards, and RAG access protocols, organisations open space for safe experimentation.

The pattern observed from San Francisco to Palo Alto was consistent: companies that view compliance as a creative constraint are finding new ways to personalise content, automate workflows, and build data-driven products without eroding trust.

In the end, the “data advantage” discussed throughout Media Tech & AI Week is not about hoarding information but about using it responsibly, contextually, and collaboratively. The future of AI in media belongs to those who make data both intelligent and ethical — a foundation for decision-making that earns the confidence of audiences, partners, and regulators alike.
