Database design: What to get, what to keep, what to shed
Big Data For News Publishers | 02 February 2016
Last month I talked about the half-life of data – how time not only degrades but can destroy value if not governed carefully. For example, a stop for non-pay followed by a restart then a stop for a move-out-of-area; they all loose relevance when you get a next new next start at the same address.
Assuming the new start is a different name and/or phone number, the value of the previous transaction history for much (if not all) of your standard analysis is destroyed. If not filtered carefully, it can lead you to erroneous conclusions.
Why? A different person is now at the address. You need to track his/her activity separately.
This month, I’ll weave through some of the considerations you need to make when designing your marketing database, or how to apply filters in your queries should you have to do your marketing from a data warehouse or Big Data (Hadoop) database.
What to get
The first thing to ask when designing a marketing database is what to get. The standard answer? Everything.
Not so fast. Everything is fine for a data warehouse, but when switching to a database designed specifically for marketing purposes, having everything is a pathway to trouble (see my last month’s blog posting for additional horror stories).
Marketing databases need to be selective in what is in them. Most are, but some border on the warehouse approach of having everything, so use caution here as well.
If you are marketing from a data warehouse, having everything presents problems that you have to be very mindful of when doing your campaigns and their select/omit queries, or you end up messaging to the wrong person with the wrong message.
To drive home the point one more time, if you are doing a re-acquisition campaign and the most recent stop at an address is a move-out-of-market, unless you filter around that stop reason, the message sent on a “we want you back” is wrong:
- In an e-mail you are asking someone to come back who potentially lives 2,132 miles away.
- On a direct mail piece you are asking someone back who you never had.
- In a telemarketing campaign, if you are lucky enough to be calling a land line, it has possibly been reassigned to a different person. If it is a cellular number (watch all of the laws carefully on dialing a cellular line), you could be talking to the former subscriber now living 2,132 miles away or someone just assigned the number.
So you have to build into your marketing database build process filter logic or into your queries a way to handle transaction codes that have the effect of negating history. The logic is an interesting exercise in time-series handling in Hadoop and a good coding test in SQL Server!
Regardless the technology, you are building what amounts to, for lack of a better term, a data checkpoint reset button. It takes quite a bit of work to design business rules around every stop/start code, but it needs to be done! And, unfortunately, it isn’t a task you do once and move on.
Every new stop reason defined needs a query usage governance step to define how the code is used for marketing and analysis purposes. The same process for every start code, rate code, and service message is needed as well. Don’t be surprised if the rules are different between a marketing and analysis use!
Another fallacy in extending the “get everything” answer too far for your marketing database is you can end up with a literal copy of your original system. Then some genius will say something like, “Why not just let us make a duplicate of the original system and call it a decision-support system instead of making us copy everything one table at a time?”
Well, a straight copy will get you a database designed for processing transactions one at a time (… WHERE TELEPHONE = 7039933939) rather than a database designed (optimised) for massive queries (SELECT * from HISTORY with STOP_DATE between … well you get the idea) — queries returning lots of rows instead of one row.
When you start running campaigns, you also need a way to track what you did to measure success. You will want to summarise and group data differently than in the original system. These two tasks require additional tables in the database, and as soon as they are created, you no longer have an exact copy of the original database.
So now you have to build backup and disaster recovery plans for the “well, it started as an exact copy” database. Solving the back-up and disaster recovery problems still leaves you with a database that isn’t optimised for campaigns and queries.
Eventually, you will want to build out the specialised data structure for marketing or data warehousing – or even Hadoop’ing the whole thing in your quest to capture everything.
So what do you end up with? If you’re not careful, you get a giant database that is difficult to use, is slow, can’t get you what you want, and fails.
You shut it all down and just run your campaigns from your live production system – until customer service and IT complain about the system running slow and they take away your access. Or you find you can’t track anything and start building lots of side tables.
Within a few months, you’ll end up starting all over and build (or buy) a specialty database, be it a data warehouse, data mart, Big Data solution, or a marketing/analytical database.
I suppose most of the media companies have been through this cycle at least once and many may be circling around for another round of database design or vendor selection. If you are, it is important to pause and think the end-purpose of the database(s) through very carefully before proceeding.
Online transaction design is different than data warehouse, and that is different from Big Data and from marketing and analysis design. There are a few vendors that have figured all of this out for you (disclosure: I work for one of them).
If you decide to build versus buy, carefully think through what to get from your database, then design to avoid the two major flaws:
- Having (using) outdated data.
- Having a design that supports the wrong function.
If you focus on the first, you can avoid the second. Having a query design or database design that helps with the data relevance “reset button” is a key factor in the success or failure of analytical and marketing database design.
The companies that offer very specialised databases have built the processes and software tools they use around the special needs of marketing and analytics, and have figured out how to co-exist in both worlds.
What to keep
About now, you might be equally paranoid about keeping too much data as you are about not getting enough. Rightly so.
How do you decide what to keep? Is it okay to keep that move-out-of-market cancellation transaction? For how long?
How about an e-mail address you picked up from the promotion you did at the county fair when giving away umbrellas? What about the transactions associated with the person who moved out of market?
Well, like everything, what is the purpose of the database you are building? In the case of a Big Data database and data warehouse, you keep pretty much everything. But as you move into a specialty analytical or marketing database (or if your Big Data database fits this category), you need to become selective or very careful in your query design and/or temp tables as filter tools.
In my example of the move transaction, I look at layers of data to decide if it is worth keeping and how long it is worth keeping.
- Has there been a new subscription started at that address since the move transaction?
- Have you seen a name change at that address in your demographic data providers file?
- If you ran the name/address combination through NCOA (or a similiar service if you outside the United States), did you get a new address?
- How long ago was the stop?
Each of these questions is a branch in the decision tree for not only keeping the history, but what to do with the messaging to this address (or the telephone number(s) and e-mail address(es) associated with it).
Suppose you do find the name has changed in the demographic providers file you receive and you’ve confirmed it changed with an NCOA process of your own. What do you do with the history? Is a wet paper complaint from the prior customer going to help you?
This is a trick question: Yes, a wet paper will help in retention prediction (it is associated with the physical address, not the person).
How about a vacation stop? Maybe. If you are trying to predict vacation patterns over a large enough customer base, it might.
How about payment history? Maybe not so much.
And so on it goes. Every transaction needs a governance rule – one for how to use the data for analysis and another for how to use the data for campaign design.
What to shed
By now you should be guessing that the answer to the shedding question is complicated. Yep, it is.
Even in the era of US$0.01 per GB of disk, it is still something you need to build policies around. In your marketing database design, you might build around an address level record for the current customer or the most recent former customer if no current customer.
In an analysis database you might want the current and three years of former records, while in a data warehouse or Hadoop system you will want to go back to the Nixon administration.
Build your rules from the top-level transaction: the customer.
Then build back rules for the data captured. Think through how the rule is impacted if you change the top level to an address or an e-mail address.
In my example of the marketing database, if I think of the address as the top level, and there is a current customer associated with an address, do I need payment history from the prior customer? Not for marketing purposes, but yes for analytical work and an analytical database (rate predictions and churn/risk assignment).
So, grab a healthy handful of your favourite comfort food, sit down with the various departments in your organisation, and get the discussion going for each database and its purpose. If you have a database that is serving multiple purposes, I suggest you also bringing along your favourite comfort beverage. You’ll need it.
As Admiral Ackbar said in Return of the Jedi: “It’s a trap.”
Enjoy. Be careful out there.