5 rules that make a “simple” Big Data request complicated

By Greg Bright

Albuquerque Journal

Albuquerque, New Mexico, USA


I just finished a large data analysis project. In the process, I was reminded of several data “rules of thumb” I thought I’d pass along to help marketers understand the daily data quality reality that data scientists and analysts have to wrestle with.

A simple request — how many advertisers ran colour ads and how much revenue did they generate — is as simple to extract from the data as you would think.

Just below the surface is the real question: Can you create a report that shows, by publication (daily, Sunday, special section) and by category (local, national, classified), the total inches and dollars (space charge and colour only) charged (gross and net)? Show me the free colour inches as well as ads that ran this year, compared to the past three years to show a trend … and so on.

Pulling accurate, usable data is not always as simple as it sounds.

Well, you see, the request isn’t really a report that shows 845 ads and US$349,223.23 worth of colour run. There is quite a bit of detail — all expected as the analyst or scientist interprets this simple request.

Then the fun begins in getting the data.

Rule 1: There is no such thing as clean data.

No matter how good the billing software, humans are using it and will put the oddest information into seemingly well-defined places.

Take the standard industrial classification (SIC) code field. It is supposed to hold the SIC code. After many years, strange things end up in this field — not counting the lazy data entry effort where all SIC codes end up as 9999.
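As a rough illustration of how you might size up the mess in a field like this, here is a minimal Python sketch. It assumes a SIC code should be exactly four digits and treats 9999 as the lazy placeholder described above. The function name and the sample values are invented for illustration and do not come from any real billing system.

```python
import re
from collections import Counter

# Assumption: a usable SIC code is exactly four digits.
SIC_PATTERN = re.compile(r"\d{4}")
# The catch-all value the article describes as lazy data entry.
PLACEHOLDER = "9999"


def classify_sic(raw):
    """Label one raw SIC field value as 'valid', 'placeholder', or 'invalid'."""
    value = (raw or "").strip()
    if value == PLACEHOLDER:
        return "placeholder"
    if SIC_PATTERN.fullmatch(value):
        return "valid"
    return "invalid"


# Hypothetical field contents, invented for illustration.
sample_values = ["5311", "9999", "car dealer", "", "5812 ", None, "58-12"]

sic_summary = Counter(classify_sic(v) for v in sample_values)
print(sic_summary)
```

A summary like this will not fix anything, but it tells you how much of the field you can trust before you build an analysis on top of it.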

Another is in how a customer type is defined. Is the customer a local or a national? What happens when a local branch of a national company (typically represented by an agency) runs an ad without the agency’s involvement? Do you end up with two Macy’s accounts?

I once saw a university in a market I worked with that had more than 200 different accounts. With all the sarcasm I can muster, it sure made analysing the university’s total spend easy!
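One hedged way to start untangling duplicate accounts is to group them under a normalised name before adding up spend. The sketch below is only a first pass, with made-up account names and amounts. Real duplicates (departments, abbreviations, agency-placed accounts) usually need a hand-built mapping on top of this.

```python
import re
from collections import defaultdict


def normalise_name(name):
    """Lower-case, drop punctuation, and collapse whitespace in an account name."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


# Hypothetical accounts: (account_id, account_name, revenue).
accounts = [
    (101, "State University", 1200.0),
    (102, "STATE UNIVERSITY", 800.0),
    (103, "State  University.", 450.0),
    (104, "Corner Bakery", 300.0),
]

spend_by_customer = defaultdict(float)
ids_by_customer = defaultdict(list)
for account_id, name, revenue in accounts:
    key = normalise_name(name)
    spend_by_customer[key] += revenue
    ids_by_customer[key].append(account_id)

# Show each customer with the accounts that rolled up into it.
for key in sorted(spend_by_customer):
    print(key, ids_by_customer[key], spend_by_customer[key])
```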

Rule 2: Anything less than a SELECT * query will leave out an important piece of information.

Rule 2 corollary: Your PC (or Mac) isn’t big enough to accept all the data pulled in a SELECT * request.

I cannot tell you how many times this has happened to me without getting a little red in the face, but I will tell you to keep this narrative moving along.

I spend a good deal of time thinking through everything I need to do my analysis projects. What do I need, what might I need, and a few kitchen sink fields, just in case. Then I write out the extract logic — 400,000 rows of data. A Big Data file.
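One common way around the corollary is to stream the extract in chunks instead of loading it all at once. The sketch below uses Python's built-in sqlite3 module with a tiny in-memory table standing in for the billing system. The table name, column names, and figures are assumptions made for illustration.

```python
import sqlite3


def total_by_section(conn, chunk_size=10_000):
    """Sum dollars per section, fetching at most chunk_size rows at a time."""
    totals = {}
    cursor = conn.execute("SELECT section, dollars FROM ads")
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for section, dollars in rows:
            totals[section] = totals.get(section, 0.0) + dollars
    return totals


# Hypothetical stand-in for the real billing database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (ad_id INTEGER, section TEXT, dollars REAL)")
conn.executemany(
    "INSERT INTO ads VALUES (?, ?, ?)",
    [(1, "retail", 100.0), (2, "classified", 40.0), (3, "retail", 60.0)],
)

totals = total_by_section(conn, chunk_size=2)
print(totals)
```

The same loop works with a wider column list. The point is only that the running totals stay small even when the extract does not.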

No sooner do I start working the data than it happens. I missed a vital data element.

For example, to do a particular slice of advertising rate analysis, I have to identify ads that ran in my classified section because there are 10 columns per page in classified and six in retail, so my inches per page are not standardised.
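To make the column problem concrete, here is a small sketch that converts column inches into a share of a page, so classified and retail ads can be compared on the same footing. The column counts come from the example above. The 21-inch page depth is purely an assumption for illustration, and the function names are hypothetical.

```python
# Columns per page, as described in the article.
COLUMNS_PER_PAGE = {"classified": 10, "retail": 6}
# Assumption: both sections share the same page depth, in inches.
PAGE_DEPTH_INCHES = 21.0


def page_share(section, column_inches):
    """Return the fraction of a full page that an ad occupies."""
    full_page_inches = COLUMNS_PER_PAGE[section] * PAGE_DEPTH_INCHES
    return column_inches / full_page_inches


def rate_per_page(section, column_inches, dollars):
    """Return dollars charged per full-page equivalent."""
    return dollars / page_share(section, column_inches)


# 63 retail column inches and 105 classified column inches are both half a page.
print(page_share("retail", 63.0))
print(page_share("classified", 105.0))
```

Without the section flag on each ad, there is no way to pick the right divisor, which is why the missing column matters.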

Back to the data to grab one more column. Then do I re-process it all or do I build an append process? Grrr …
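If you go the append route, the idea is to pull only the key and the missing column, then attach it to the rows you already have. A minimal sketch, assuming every ad carries a unique ad_id (the field names and data are hypothetical):

```python
def append_column(rows, lookup, key, new_field, default=None):
    """Return copies of rows with new_field added, plus the keys that had no match."""
    enriched = []
    missing = []
    for row in rows:
        if row[key] not in lookup:
            missing.append(row[key])
        enriched.append({**row, new_field: lookup.get(row[key], default)})
    return enriched, missing


# Hypothetical original extract and the small follow-up pull.
extract = [
    {"ad_id": 1, "inches": 12.0},
    {"ad_id": 2, "inches": 30.0},
    {"ad_id": 3, "inches": 6.0},
]
section_by_ad = {1: "retail", 2: "classified"}

enriched, missing = append_column(extract, section_by_ad, "ad_id", "section")
print(enriched)
print("no match for ad_ids:", missing)
```

Tracking the unmatched keys matters. An append that silently leaves gaps is Rule 1 all over again.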

Rule 3: Don’t trust the expert. Verify everything yourself.

When the expert tells you there are three, and only three, different ways to categorise the data (say, by customer type), don’t necessarily believe him. Do a count distinct query to confirm. Even if the distinct values return the three, and only the three, it is time to ask around.
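A count distinct check takes only a few lines. This sketch counts each raw value exactly as stored, which is roughly what a GROUP BY with COUNT would return in SQL. The sample values are invented to show how three "official" types can quietly become six.

```python
from collections import Counter


def distinct_counts(values):
    """Count each raw value exactly as stored, with no trimming or case folding."""
    return Counter(values)


# Hypothetical customer-type column, invented for illustration.
customer_types = [
    "Local", "National", "Classified", "local ",
    "Local", "NATL", "National", None,
]

raw_counts = distinct_counts(customer_types)
for value, count in raw_counts.most_common():
    print(repr(value), count)
print("distinct values:", len(raw_counts))
```

Printing with repr makes trailing spaces and empty values visible, which is often where the extra buckets hide.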

Find a user of the data, someone using the system every day. I guarantee there is a “well almost” or “yes but” that goes with the categorisation showing there are really six different customer types.

The fun part is when you take your results to the expert and he tells you the analysis is flawed because there can’t be six different buckets. Thanks for the support. Now you have to spend a day explaining how the impossible is possible and has been that way for years.

A recent example is when I tried to reconcile two different reports generated elsewhere in the organisation. There were slightly different purposes for each report, and they had been in use for years. They wouldn’t cross-balance due to how each bucketed the dollars/inches.

The totals were fine, but the details were off, and the discrepancy significantly changed the story the data was telling. It was not fun for me to uncover it, nor for the people who had to redo the report with the error and explain to the ninth floor how something like that could happen.

At least they had a long winter break in New York to figure it out.
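The reconciliation problem described above, where totals agree but the buckets do not, can be sketched in a few lines. The report figures below are made up, and the half-cent tolerance is an assumption to absorb rounding.

```python
def reconcile(report_a, report_b, tolerance=0.005):
    """Compare two bucketed reports; return the total difference and per-bucket gaps."""
    buckets = sorted(set(report_a) | set(report_b))
    differences = {}
    for bucket in buckets:
        diff = report_a.get(bucket, 0.0) - report_b.get(bucket, 0.0)
        if abs(diff) > tolerance:
            differences[bucket] = round(diff, 2)
    total_diff = round(sum(report_a.values()) - sum(report_b.values()), 2)
    return total_diff, differences


# Hypothetical dollars by category from two long-running reports.
report_a = {"local": 500.0, "national": 300.0, "classified": 200.0}
report_b = {"local": 450.0, "national": 350.0, "classified": 200.0}

total_diff, bucket_diffs = reconcile(report_a, report_b)
print("total difference:", total_diff)
print("bucket differences:", bucket_diffs)
```

A zero total difference with non-empty bucket differences is exactly the situation that lets an error sit unnoticed for years.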

Rule 4: Everything takes twice as long as estimated.

Rule 4 corollary: Even if estimates are adjusted to account for Rule 4, the rule still applies.

This is possibly the most frustrating data rule out there for the folks waiting on analysis. It ranks right up there with Rule 5, coming up next.

Management is busy so requests for analysis are always time-sensitive. Management wants information now: So, Greg, when can you have this to me? A quick read of the request and sweat on the boss’ brow, and I give him a delivery time (multiplying by 1.5; my actual quick math). The boss says sooner if possible!

Well (No. 1), the monthly billing is going to take place tomorrow, so why not wait a day to get the latest figures? (You know that is coming!) Well (No. 2), I can’t believe it, but the billing got messed up and has to be rerun. Say goodbye to another day.

Finally, you start the work. Rules 1-3 apply. Finally, you get it all done. Polished up. PowerPoint summary built. Handouts ready and … Rule 5 (below) hits.

Rule 5: This is exactly what I asked for but not what I need.

Yep. A redo. You get clarification of the request and finally get working on what is needed by the boss.

Probably what was delivered was truly, exactly what was asked. It was only upon looking at the analytics that the next question was formed, and that is what is “needed.”

Data folks, rest assured this is a basic rule of analysis. For every question answered, a new question is formed. It is our job to keep answering questions. We learn over time what the next question is and deliver as many layers of questions and answers as we can.

So, analysts, your road to success isn’t measured in how well you work your way through the first four rules, but in how well you can anticipate how Rule 5 impacts your work.

The best of the best don’t play 10 questions. The best play “n” questions. To be excellent, play “n+1” when you deliver your work.
