3 priorities for media data teams in 2023
Smart Data Initiative Newsletter Blog | 12 January 2023
Hi everyone.
We meet again, on the other side of the holiday, and I hope you had a great start to your 2023 — whether you celebrated something or whether you smartly decided to not take time off and enjoyed some relaxing times at your empty office (like a data scientist I chatted with on December 27th. What a king).
In any event, we’re back — and I’m deep into planning our events for 2023. Do you have a great case study you’d like to speak about? Reach out because I’m filling our slate for our virtual events and our World Congress in New York in May — so consider this a call for volunteers: e-mail me at ariane.bernard@inma.org (but be quick on this because World Congress in particular is almost finalised).
And with that, off we go with this week’s newsletter.
All my best, Ariane
Year ahead topics
This is the third year of the Smart Data Initiative — and my second year as its lead. Just like last year, I am going to share the three tentpole topics that will shape our editorial programming for the year.
Last year, I built the programming after a series of conversations I had with publishers in the first few weeks of taking on the initiative. And I think I was rather lucky in that these publishers provided useful guidance and feedback.
This year, I was able to add other inputs into the mix:
• I asked our advisory board for their input.
• I went back to the transcripts of the dozens of interviews I carried out, plus some of the member exchanges I’ve had by e-mail, and, using AI (#onbrand), identified the topics that reliably came up as hot. Shoutout to Otter.ai for making this possible in the first place (not sponsored) and to my colleague Jodie Hopperton, who recommended Otter a year ago. NLP has really advanced, and Otter is a great application of it.
• I allowed myself to fall into one hype trap, BUT I’ll be defending it and, hopefully, you’ll allow it too when I explain.
So, without further ado, our winners, in no particular order …
1. Testing: How, what, and the culture of it
Back in October, I asked the advisory board of our initiative (excellent folks from The Economist, NZZ, The New York Times, and Torstar) what they saw as an area of both excitement and torment.
“Testing” was their answer, and the reasons were many:
• Organisations of all sizes do tests. This is one of the earliest applications of data and just scales up and up.
• As the organisation scales, testing doesn’t become easier. Yes, there are more resources, but everything is also more complex: how we prioritise tests, which tests we run, and how they get rolled out and analysed. So, unlike topics we could be looking at that mostly apply to “medium publishers” vs. “large publishers,” this one connects with everyone.
• There are technical angles, but there are also statistical angles. Again, whether small or large, a publisher experiences both. Building your own tech for testing is a large publisher problem, but it’s hard to build tech for testing (tech that works, that is). On the other hand, whether you are working on large or small samples, the analysis of your tests is always full of gotchas (albeit different gotchas).
One of the publishers on the advisory committee commented that their company was considering hiring pure statisticians to work on test design — not data science folks, statisticians. On the other hand, for smaller organisations that use commercial tools for testing, the question of how samples are being built is often shrouded in mystery.
Said the lead data scientist of a large German publisher to me just last month: “I have a PhD in statistics and I don’t want to report confidence intervals vs. p-values for anything because that’s a whole interesting debate. But my point is when you say something like that, you know things will get misinterpreted anyway. And that’s exactly what’s happening with all these automated tools. You can get some insane, just insane results. And when you talk to their tech, they’ll tell you something nuts like, ‘Oh this has a reliability of 99.9%’ … You don’t know. What does that mean? I have no idea what the tool is doing and it’s a black box.”
• What tests are valuable is also a complex question and an area of endless debates. We know about simple UI tests that seem to show simple uplift (or not). But tests that require long observation of well-kept cohorts are tricky for us in publishing, where so many users use multiple screens, often anonymously.
Do we focus our tests in areas where we can get clean results even if our outcomes are less interesting? And what can we learn from each other about areas that are more interesting to test, even if results aren’t as clean?
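To make that debate a bit more concrete, here is a minimal sketch, in Python and with entirely hypothetical traffic numbers, of what a transparent read-out of a simple two-variant conversion test could look like: the uplift, a p-value from a two-proportion z-test, and a confidence interval reported side by side, rather than a single opaque “reliability” figure from a black-box tool.

```python
# A minimal sketch of analysing a two-variant conversion test.
# The visitor and conversion counts below are hypothetical.
from math import sqrt
from scipy.stats import norm

n_a, conv_a = 48_000, 1_210   # control: visitors, conversions
n_b, conv_b = 47_500, 1_320   # treatment: visitors, conversions

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a  # observed uplift in conversion rate

# Two-proportion z-test (pooled standard error) for the p-value
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se_pool
p_value = 2 * (1 - norm.cdf(abs(z)))

# 95% confidence interval for the difference (unpooled standard error)
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"uplift: {diff:.4f}, p-value: {p_value:.4f}, "
      f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

Even this toy version surfaces the choices a commercial tool makes silently: which statistic it reports, how it pools variance, and what it calls “significant.”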
2. The growth of the data team
This second topic is the winner in terms of its prevalence, according to my interviews and e-mails last year. It touches on both hiring and organisation, but also on questions of where the data team’s engineering enablement sits: Is data engineering in data? Does it stay in engineering? Are product analysts in product? In data?
All of these could be their own self-standing topics I suppose, but this is all to say: lots of ground to cover.
Last year, I took a look at some of the zero-to-one journeys of data teams, which also touched on everyone’s favourite topic (because I guess there are a lot of nerds in these parts): org charts (prior newsletters on the topic can be found here, here, and here). Now, I don’t know that I will revisit org charts specifically, but there are at least two really interesting conversations I had recently that offered high potential for debate:
• Where product analysts belong and what background they should bring to the role. They often lean more toward product management in media companies, but elsewhere in tech, they are often deep analytics folks.
• A product manager friend also remarked on something that had never occurred to me: the career path of product analysts. Often, these are folks looking to move into a more traditional product management role, so their incentive is to get deeper in with the PM crowd. That is also different from what you find in tech, where a product analyst’s career will not necessarily affiliate with product but with analytics/data.
I suspect there are many other variations on this question — roles where data is more embedded into a specific craft and what this means in terms of the data practice built into these roles and careers (in the newsroom, you’d see this with analytics folks vs. audience engagement folks).
On the hiring/people side, I want to focus on offshore vs. onshore. This topic came to me last year from folks in medium-sized organisations who were intrigued by the good reviews from larger publishers establishing data operations in certain offshore locations with known pools of good data talent (Eastern Europe being popular for this), but concerned about the downsides of such decisions.
Managing data remotely is tricky because subject matter expertise is useful for our industry. Says a large North American publisher: “Sure you get a lot of data science for your buck, but you have to invest in a lot of onshore talent to manage it.”
We know that testing and optimising a checkout cart is a fairly prevalent problem, but our specific problems around advertising or paywalls, for example, are uniquely large. So the balance of arguments for offshore’s economic efficiency isn’t without its downsides. (If you work at a publisher with an offshore data team, I can promise anonymity — but please come speak to me. Your industry friends want to know.)
3. AI and the products built from data and machine learning
The first two priorities are a bit more like homework, and this one is the recess. And look, yes, I’ve fallen into the hype pot of breathless headlines about all the new fancy AI tools.
But if I can defend this, it is precisely because of the breathless headlines that it is worth taking on: “Will GPT-3 save journalism?” (Back at ONA 2018, the peerless Robert Hernandez made us play “[Blank] will save journalism” — and, of course, AI was already heavily featured for comedic appeal.)
More soberly, yes, the number of software libraries that can be used to build machine-learned products continues to grow. And we already see them at work in tools like comment auto-moderation, smarter paywalls, product and content recommendations, and automated content generation.
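As one illustration of how accessible these building blocks have become, here is a minimal, purely illustrative sketch of a related-content recommender using an off-the-shelf library (scikit-learn): TF-IDF vectors over article text, ranked by cosine similarity. The article snippets are made up, and a production recommender is of course far richer than this; the point is simply how little code the first step takes.

```python
# A toy related-content recommender: TF-IDF vectors + cosine similarity.
# The article snippets below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "City council approves new budget for public transit expansion",
    "Transit expansion plan draws praise and criticism from riders",
    "Local bakery wins national award for sourdough bread",
    "Election board finalises rules for postal voting this autumn",
]

# Represent each article as a TF-IDF vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(articles)

# For the first article, rank the others by cosine similarity
scores = cosine_similarity(tfidf[0], tfidf).ravel()
ranked = scores.argsort()[::-1][1:]  # skip the article itself
for idx in ranked:
    print(f"{scores[idx]:.2f}  {articles[idx]}")
```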
These libraries and frameworks get better, and more applications are built around them so a greater number of organisations can leverage them. But INMA is a business organisation, not a research university, and I want to make sure we don’t lose track of what we need to look at:
What are production-grade applications for these technologies? While I enjoy reading about the latest attempt at creating Thanksgiving recipes using AI (Sorry, fact check, I did not enjoy this — the stuffing recipe in this video should go to jail), this isn’t where our media industry is going to solidify its business, reach new audiences, scale itself up, etc.
Production-grade means two things: Taking on core challenges of our industry, but also doing this in a manner that truly adds value. So while fun experiments make for good headlines, we need to identify the things we can build that are there to stay.
Later this year, GPT-4 is going to be released, and this may be a new turn for generative AI. But as the robots become more human-like, there are ethical and democratic challenges that can only get bigger. Already, some recommendations or outright legislation are pushing for AI transparency and the ability to audit algorithms. And these trends at the intersection of AI technology and legislation have to be on our radar, as much as the new capabilities that new libraries will tout.
Because our business is rooted in the distribution of verified (and therefore verifiable) information, our industry should take a particular interest in how AI-powered technologies that produce content can augment what we do, and in what ways they may also create confusion among our readers and users.
I hope this little peek into our priorities will be useful and be the start of many conversations. Our ability to fulfil the mission of INMA as an organisation is rooted in what we share with one another, so please, do drop me a note and let’s chat.
Happy new year to all.
Further afield on the wide, wide Web
This week’s FAOWWW will go high. I promise this section will return with oddball news, but this being the new year, and us being full of great resolutions to cultivate our brains, well, we get the good green vegetables this week.
- So first, this super interesting chat thread started by Andrew Gelman, a professor of statistics and political science at Columbia University. You don’t have to be a stats PhD (which I most certainly am not) to enjoy this, because while the topic is statistics, the conversation isn’t an exploration of statistics but rather a high-level back and forth on the merits of certain approaches in machine learning: Do simpler machine learning models exist and how can we find them? (Columbia).
- Second, and harder: Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology by Nicholas Larsen, Jon Stallrich, Srijan Sengupta, Alex (Shaojie) Deng, Ronny Kohavi, and Nathaniel Stevens. The audience here is the hard-core data person. I am not a PhD in the matter but could basically follow along (emphasis on “basically,” and certainly I did so slowly). The takeaway is that there are a number of interesting case studies in this one, so pass this along to your data science team. We gotta get going on our tentpole topic on testing, so let’s kick this off seriously.
About this newsletter
Today’s newsletter is written by Ariane Bernard, a Paris- and New York-based consultant who focuses on publishing utilities and data products, and is the CEO of a young incubated company, Helio.cloud.
This newsletter is part of the INMA Smart Data Initiative. You can e-mail me at Ariane.Bernard@inma.org with thoughts, suggestions, and questions. Also, sign up to our Slack channel.