List management: Test and control groups for picking a winning campaign

You would think this topic would be pretty simple. At 50,000 feet it is: Split the e-mail list into two halves and call it a day.

In the weeds of implementation, it is a whole different world. How big should the test group be: 50%, 20%, or something else? Should you test in an A/B mode or multi-variate? How much (or little) should you change between the A and B designs?

Should the design look somewhat similar, or is a radical variation ok? What criteria do you use to decide how to split the list into groups? How do you vary the pieces: Is one a true control, or do you just constantly try variations?

In e-mail campaigns, can you use opens and click-throughs to dynamically pick the winner mid-campaign? How do you measure direct mail? Are there other campaigns happening at the same time to the same groups (telemarketing, for example)?

So much for simple.

Then when the campaign completes, if the test group sells 50 units, and the control group 43 units, the winner is easy to pick, right? Or is it?

To split or not to split

The short answer: Split every campaign related to sales and retention efforts. Every campaign is a chance to test and improve. People are fickle and will respond differently to even the most subtle difference.

For simplicity’s sake, let’s work with a straightforward A/B test, both versions netting to an identical subscription cost:

  • Version A: “Save $14.99.”

  • Version B: “Save 50%.”

Simple? Nope.

The A and B versions will tell you what resonates better with your campaign audience at the time. But to really test things, a control version of your message is also needed.

A control version is a piece that has been in the market for a while without change — a piece that has been around long enough that you know that whenever it is sent, it yields statistically stable results.

For those of you without a control piece, just pick a piece you’ve used in the past and lock it down! Use it without any change at all until you have a challenger that beats it on a regular basis. Also, make sure it is a piece that has an offer that you are willing to continuously honour, as you need to keep sending this for many months while you evaluate it for consistency.

Back to keeping this simple, I’m going to pretend that the Version A piece is my control version. I’m pretending that I’ve used it in several past campaigns, and it works well (averaging a 0.9% response rate).

The test I’m setting up is to see if the words “save 50%” outperform showing the actual dollar amount of savings. To make the test valid, I am very careful that the only difference between the two pieces is the wording: “save $14.99” versus “save 50%.” Absolutely nothing else changes.

This strict restriction on change is critical to understanding impact. If the words change and the graphics are updated to show a cute puppy with a rolled-up newspaper in its mouth – and you get a 1.5% response – you can’t determine why there was a difference. Was it the puppy, the discount, or both? You can’t tell, so you have to test again.

A downside to this test and control process is that it does take time and strict discipline to enact well. Time and discipline are difficult when order volume goes up or down quickly. Euphoria if you get a great response – panic if you get none.

Having a solid anchor point – a known response range in a control piece – is the best way to work through these emotional swings.

For example, the baseline response over a series of campaigns for the control piece will be within a fairly narrow (seasonally adjusted) range. Then, if you run a test piece with a puppy and your control piece responds as usual while the puppy knocks the ball out of the park – you can declare a winner! Puppies rule!

Likewise, if you test 50% off and get nothing and the control piece has a regular response, then you can say that the 50% verbiage didn’t work, at least as it is packaged with everything else on the piece for that particular time of the year.
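Before trusting any test comparison, it helps to first confirm the control landed inside its usual band. A minimal sketch of that stability check – the history values and two-standard-deviation tolerance are illustrative, not from the article:

```python
import statistics

def control_in_band(response_rate, history, tolerance=2.0):
    """Rough stability check: is the control's response within
    `tolerance` standard deviations of its historical mean?"""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(response_rate - mean) <= tolerance * sd

# Hypothetical history of control responses, hovering around 0.9%.
history = [0.0088, 0.0092, 0.0090, 0.0091, 0.0089]
print(control_in_band(0.0090, history))   # True - control behaved as usual
print(control_in_band(0.0150, history))   # False - investigate before judging the test
```

If the control itself drifts outside its band, something other than your test variable moved, and the campaign comparison is suspect.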

Elements of the piece

It’s worth having a side conversation on the elements of a piece. Most of these are common, whether e-mail or direct mail. The list is long but includes:

  • From/to salutation.

  • Dimensions.

  • Carrier envelope design.

  • Window design.

  • Stamp type, position.

  • Return envelope with phone number (1-800 or not) and URL (tiny or full).

  • Paragraph letter format or benefits bullet points.

  • Colours, fonts, pictures, and artwork.

  • FOD offered, rate offered, and term length.

  • Renewal rates, vacation policy, and the fine print of the contract.

  • And so on.

These elements and many more – I’ve seen upwards of 31 elements defined on a single piece. All of them, in one way or another, flash through the recipient’s mind in the first two seconds after the piece lands in front of him. A good part of piece evaluation is the controlled management and understanding of all of the elements so you can clearly define what you are testing.

The data aspect

The data stewards play a vital role in test and control implementation. The surface level view is fairly simple: Pick the list criteria, then divide the list. The details require a bit more work.

As with any campaign design, the basic selections are worked out (in my fictional example here):

  • Presence of an e-mail address.

  • Former subscribers lapsed more than 140 days and less than 180 days, or true nevers.

  • In key metro ZIP codes.

  • Personicx (or PRIZM) mature life stage but not in segment/cluster 61.

  • Did not stop for non-pay.

  • Not in a prior campaign within the past xx days. 

  • Did not stop for the standard list of suppression reasons.

  • Not on do-not-e-mail, etc.
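Rolled together, the selections above amount to a filter over the subscriber file. A minimal Python sketch – the field names are entirely hypothetical, and a 60-day window stands in for the article’s unstated “xx days” rule:

```python
def is_candidate(rec):
    """Apply the campaign selection rules to one subscriber record.

    Field names are illustrative, not from any real schema; the 60-day
    prior-campaign window is a placeholder for the "xx days" rule.
    """
    if not rec.get("email"):                                  # must have an e-mail address
        return False
    lapsed = rec.get("days_since_stop")                       # None for "true nevers"
    if lapsed is not None and not (140 < lapsed < 180):
        return False
    if rec["zip"][:3] not in {"481", "482"}:                  # key metro ZIPs (example prefixes)
        return False
    if rec.get("cluster") == 61:                              # mature life stage, minus cluster 61
        return False
    if rec.get("stop_reason") in {"non-pay", "suppressed"}:   # standard suppression reasons
        return False
    if rec.get("last_campaign_days_ago", 9999) < 60:          # not in a recent campaign
        return False
    if rec.get("do_not_email"):
        return False
    return True

records = [
    {"email": "a@example.com", "days_since_stop": 150, "zip": "48201",
     "cluster": 12, "stop_reason": "expired", "last_campaign_days_ago": 200,
     "do_not_email": False},
    {"email": "", "days_since_stop": 150, "zip": "48201",
     "cluster": 12, "stop_reason": "expired", "last_campaign_days_ago": 200,
     "do_not_email": False},
]
pool = [r for r in records if is_candidate(r)]
print(len(pool))   # 1 - the second record has no e-mail address
```

In practice these rules live in the campaign management tool, but writing them down explicitly is exactly the kind of documentation the test plan needs.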

Then I would run a trial campaign in my campaign management tool to get a count of the candidate pool size. I would also work with the marketing team to get its expectations on response rate, whether it is an A/B or multi-variate test, cost per piece, and postage (or e-mail) cost, then run the numbers through both profitability calculators and statistical significance testers.

You may find that there isn’t a large enough pool to run a statistically valid test, especially in a small to mid-sized market.

Let’s say the pool selected is 10,000 e-mails.

A simple A/B would divide the list into 5,000 and 5,000. But, since the campaign is doing a test, you really don’t want half of your campaign going to a test group.

What if it absolutely bombs? A rule of thumb is to keep the test to a smaller portion of the campaign – say, under 30%, and often closer to 20% of the list.

So your list is then split into 8,000 and 2,000 piece groupings. Size becomes a problem. The math isn’t on your side.
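The 80/20 split itself is mechanical; the one discipline is to randomize before cutting, so neither group is biased by how the list happened to be sorted. A sketch, with an illustrative 20% test fraction and a fixed seed for reproducibility:

```python
import random

def split_pool(pool, test_fraction=0.2, seed=42):
    """Randomly split a candidate pool into (control, test) groups."""
    pool = list(pool)
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    rng.shuffle(pool)                # randomize away any sort-order bias
    cut = int(len(pool) * test_fraction)
    return pool[cut:], pool[:cut]

control, test = split_pool(range(10_000))
print(len(control), len(test))       # 8000 2000
```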


Statistics was probably not your friend in college. And right now, it is working against you in your campaign. Why? Based on the anticipated number of orders, there aren’t enough to mathematically prove you have a winner.

Let’s say you project 75 orders from the A group and 18 from the B group. There just aren’t enough responses to prove one piece over the other is a statistical winner. Things you didn’t like – confidence level, statistical significance, Z-value, standard deviation, and lift – are all playing against you in a small mailing/e-mailing.
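Those projected counts can be fed into a pooled two-proportion z-test – a standard significance check, though not necessarily the exact calculator the author has in mind – to see how far short they fall:

```python
import math

def two_prop_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for an A/B response comparison."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# The projection above: 75 orders from 8,000 pieces (A) vs. 18 from 2,000 (B).
z = two_prop_z(75, 8000, 18, 2000)
print(round(z, 2))   # far below the ~1.96 needed for 95% confidence
```

With a z-value well under 1.96, the observed gap is indistinguishable from noise at a 95% confidence level.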

Without getting too deep, a general rule of thumb is that you need at least 100 responses from each side of the test to be on the path to proving a winner. Therefore, be very careful about assuming you have a winner when the pieces return 43 responses on one side and 36 on the other. Without the rest of the statistics calculated, your face-value judgement could get you into long-term trouble, especially if you did a small test before going big.
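That 100-responses-per-side rule of thumb translates directly into a minimum cell size. Assuming the control’s roughly 0.9% response rate from earlier (the numbers here are just that assumption carried through):

```python
import math

def pieces_per_cell(target_responses, response_rate):
    """Pieces needed in one test cell to expect a given number of responses."""
    return math.ceil(target_responses / response_rate)

print(pieces_per_cell(100, 0.009))   # 11,112 pieces - more than the whole 10,000 pool
```

At a 0.9% response, even the full 10,000-name pool cannot deliver 100 responses in a single cell, which is exactly the small-market problem described above.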

Basic math

Another best practice is to calculate profitability from a campaign. Without layering in the complexities of a full-on lifetime value model computation, the simple test is to look at cost-per-piece (production and postage/ESP charges) and compare it against past retention and profit per week. It will be up to you to determine whether to proceed based on estimated response and when profit is reached.

The chart shows what a typical P/L would compute out to. All things being equal in this mythical campaign, profit isn’t reached until the 41st week. At which point about 75 of the initial 158 orders are still active.
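The week-by-week arithmetic behind that kind of P/L can be sketched with a simple geometric retention curve. Every input below (cost per piece, weekly profit per subscriber, a 98.2% weekly retention rate) is invented to land near the article’s example, not taken from it:

```python
def breakeven(orders, cost_per_piece, pieces_sent,
              profit_per_week, weekly_retention, max_weeks=200):
    """First week when cumulative subscriber profit covers campaign cost.

    Returns (week, subscribers still active that week), or (None, 0)
    if the campaign never pays back within max_weeks.
    """
    cost = cost_per_piece * pieces_sent
    active, cumulative = float(orders), 0.0
    for week in range(1, max_weeks + 1):
        cumulative += active * profit_per_week
        if cumulative >= cost:
            return week, round(active)
        active *= weekly_retention       # simple geometric decay of the starts
    return None, 0

week, still_active = breakeven(orders=158, cost_per_piece=0.46,
                               pieces_sent=10_000, profit_per_week=1.00,
                               weekly_retention=0.982)
print(week, still_active)   # 41 76 - close to the article's "41st week, ~75 active"
```

A real model would layer in seasonality, vacation stops, and a proper lifetime value curve; the point of the sketch is only that break-even week and surviving subscribers fall straight out of the retention assumption.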

Again, it is going to be up to you whether this campaign is worth executing. For me, I’d like to know this before running the campaign as there are occasions where the campaign design and anticipated response will not turn out on the plus side of the ledger.

Finish line

Doing the A/B test is complex but needs to be done — and done with discipline and consistency. It is also critical to keep gut and intuition in check during measurement.

Sure, use your gut and intuition when deciding to try a puppy and a rolled-up newspaper versus a bigger discount on seven-day delivery. Or, how about testing a sad puppy looking at the iPad on the ground with the site showing? Which keywords are worth buying and which are not?

Either way, knowing what the winner is – and being able to prove it – is critical when everyone is fighting for marketing dollars. This is a key to successfully becoming a database-centric marketer.

A quick recap

  • Always set up a test.

  • Establish a control piece (just pick one!).

  • Be patient. Setting up the test parameters is complex, and it is too easy to start changing many things instead of just one. If you want to change more than one element, you can, provided you have enough pieces to send and enough responses to measure each variation.

  • Documenting the test expectation is critical.

  • Measure the response.

  • Predict response, profitability points, and build history for comparisons.

  • Start!

About Greg Bright
