Back in October, I asked the advisory board of our initiative (excellent folks from The Economist, NZZ, The New York Times, and Torstar) what they saw as both an area of excitement and an area of torment.
“Testing” was their answer, and the reasons were many. So as one of the three tentpoles of this year’s INMA Smart Data Initiative, I’ve included testing. Here are four reasons why.
1. Organisations of all sizes do tests
Testing is one of the earliest applications of data in publishing, and it just scales up and up.
2. As the organisation scales, testing doesn’t become easier
Yes, there are more resources, but everything is also more complex: which tests to prioritise, how tests are designed, and how they get rolled out and analysed. So, unlike topics we could be looking at that mostly apply to “medium publishers” vs. “large publishers,” this one connects with everyone.
3. There are technical angles, but there are also statistical angles
Again, whether small or large, a publisher experiences both. Building your own tech for testing is a large-publisher problem, but it’s hard to build tech for testing (tech that works, that is). On the other hand, whether you are working on large or small samples, the analysis of your tests is always full of gotchas (albeit different gotchas).
One of the publishers of the advisory committee commented that their company was considering hiring pure statisticians to work on test design — not data science folks, statisticians. On the other hand, for smaller organisations that use commercial tools for testing, the question of how samples are being built is often shrouded in mystery.
Said the lead data scientist of a large German publisher to me just last month: “I have a PhD in statistics and I don’t want to report confidence intervals vs. p-values for anything because that’s a whole interesting debate. But my point is when you say something like that, you know things will get misinterpreted anyway. And that’s exactly what’s happening with all these automated tools. You can get some insane, just insane results. And when you talk to their tech, they’ll tell you something nuts like, ‘Oh this has a reliability of 99.9%’ … You don’t know. What does that mean? I have no idea what the tool is doing and it’s a black box.”
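To make the gotchas concrete: the maths behind a basic A/B comparison is short enough to write out by hand, which is exactly why black-box “reliability” scores deserve scepticism. Here is a minimal, illustrative sketch of a standard two-proportion z-test (not any particular vendor’s method — the function name and numbers are invented for illustration), showing how the same relative uplift can look meaningless on a small sample and decisive on a large one:

```python
import math

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test plus a 95% confidence interval
    on the difference in conversion rates.

    Illustrative only: commercial tools may build samples and correct
    for repeated peeking in ways this sketch does not.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    # Unpooled standard error for the CI on the observed difference
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return z, p_value, ci

# The same 10% relative uplift (5.0% -> 5.5% conversion), two sample sizes:
print(two_proportion_test(50, 1000, 55, 1000))        # small sample
print(two_proportion_test(5000, 100000, 5500, 100000))  # large sample
```

On the small sample, the p-value is far above any conventional threshold and the confidence interval straddles zero; on the large sample, the identical uplift is highly significant. A tool that reports only a single “reliability” number hides exactly this dependence on sample size.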
4. What tests are valuable is also a complex question and an area of endless debates
We know about simple UI tests that seem to show simple uplift (or not). But tests that require long observation of well-kept cohorts are tricky for us in publishing, where so many users use multiple screens, often anonymously.
Do we focus our tests in areas where we can get clean results even if our outcomes are less interesting? And what can we learn from each other about areas that are more interesting to test, even if results aren’t as clean?
If you’d like to subscribe to my bi-weekly newsletter, INMA members can do so here.