This is a continuation of my previous post, which began exploring the difference between Big Data and lots of data.
Another way to approach the Big Data technology decision is to start by looking at your data and categorising it into what is, and what is not, Big Data.
It is not Big Data, in my opinion, if you are looking at customer account history, digital subscription access log summary data (device, OS version, user information), or payment history.
Nor is it predictive analytics; predictive analytics is a process applied to data (big or not), not the data in and of itself.
Likewise, if you are sitting on less than a dozen terabytes of data and it is growing at less than 20% a year and you are just running reports to understand what happened with your product, you don’t have Big Data (nor do you have a Big Data technology need).
You have lots of data, but not the kind of data, or the application need born of analytical discoveries, that makes the jump to Big Data technologies practical.
It may look big, but volume alone isn’t enough. It would be hard to justify adding an entirely new layer of technology to analyse a small, slow-growing data set to generate reports or insights. Traditional analytical databases, reporting, and statistical tools are well suited for this type of work. Big Data drives action – usually an immediate action.
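As a rough illustration, the rule of thumb above can be sketched as a simple check. The thresholds (a dozen terabytes, 20% annual growth) are the ones suggested here, not industry standards, and the function name is mine:

```python
def needs_big_data(volume_tb: float, growth_rate: float, realtime_action: bool) -> bool:
    """Rough heuristic from the rule of thumb above.

    volume_tb       -- total data held, in terabytes
    growth_rate     -- annual growth, e.g. 0.20 for 20% a year
    realtime_action -- True if analytics must drive an immediate action
    """
    # Under roughly a dozen TB, growing under 20% a year, and used only for
    # after-the-fact reporting: that is "lots of data", not Big Data.
    if volume_tb < 12 and growth_rate < 0.20 and not realtime_action:
        return False
    return True

# A 5 TB reporting warehouse growing 10% a year stays on traditional tools.
print(needs_big_data(5, 0.10, False))   # → False
# A fast-growing clickstream driving real-time recommendations crosses the line.
print(needs_big_data(40, 0.50, True))   # → True
```

The point of the sketch is that the decision has three inputs, not one: volume alone never settles it.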
So, what is Big Data? It is the information you (can) collect from Internet searches, pageview/customer journey on your Web sites/mobile apps, and the like.
It is information collected from sensors, software/server log files, search results, site journey log files, the Twitter firehose, and information from RFID readers or GPS/geo-fencing data (ignoring, again, the data collected for IT security and site health reasons, as that is not a marketing decision but a different reason to move toward Big Data technology).
It is important to keep these rules and categories of the data in mind when approached by vendors that specialise in selling you the stuff!
After a while, the sales pitch sounds good if you fall into their spin cycle. Then you end up spending a lot of money on a new technical infrastructure stack, making some expensive hires, only to find that you could have done the same with the tools (databases, analytical tools, and statistical tools) you already had.
Worse yet, your Web site may not even be able to ingest a data-driven decision.
All that said, I repeat, the media industry does have a call for Big Data. It is an absolute necessity for the biggest of players (again, leaving out the across-the-board security reasons for doing Big Data).
I feel quite strongly that the decision of when to “go big” must be well thought through for operations of all sizes, and that you should never lose track of the end deliverable(s) throughout the decision process.
Why? Big Data is the hot thing for both right and wrong reasons. (It is critical to remember that all Big Data solutions are supplemental technology to your existing technology and shouldn’t be seen as a replacement for existing databases and tools.)
In the end, Big Data is a set of database technologies, specialty tools, and infrastructure for driving decisions in real time. These tools can also be used to pull together a “stuck” warehouse project – though they do so by shifting who does the work to unstick the project.
Moving to a Big Data tool set requires a new IT infrastructure, new skills, and new analytical packages. It is, if you will, another tool in the tool kit, but one that demands all-new technology, skills, and highly complex sets of connections to your current applications (Web site and apps) before it can be deployed.
As an industry of content producers, packagers, and distributors, we can easily recognise the need for content delivery to become personalised. Doing so is complex. The majority of the industry uses third-party software for ad placement, but for content recommendation the path thus far seems to be one of invent-it-on-our-own.
Do you have the tools, resources, speed, and necessary content to personalise content on your own? Are you ready to make true recommendations rather than take a generalised next-story approach?
Collecting the data is one thing, but the work to analyse a single customer’s journey — as the journey moves from story to story, to picture then to audio file — is quite another. Not to mention movement from device to device during the day. That is a big, and necessary, leap.
Big Data tools can do the analysis with the speed and deliver the recommendation to the user – that is, deliver it once you have the right connectors (APIs) in place on the Web site/app to accept the recommendations.
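To make the journey analysis above concrete, here is a minimal sketch of grouping one customer’s events into sessions. The event fields, the 30-minute gap, and the rule that a device change starts a new session are all illustrative assumptions, not a prescription:

```python
from datetime import datetime, timedelta

# Hypothetical clickstream events for one customer, across devices and content types.
events = [
    {"ts": datetime(2014, 5, 1, 8, 0),   "device": "phone",   "content": "story/123"},
    {"ts": datetime(2014, 5, 1, 8, 5),   "device": "phone",   "content": "photo/9"},
    {"ts": datetime(2014, 5, 1, 12, 40), "device": "desktop", "content": "audio/7"},
]

def sessionise(events, gap=timedelta(minutes=30)):
    """Split one customer's events into sessions on a long time gap
    or a device change -- the device-to-device hop described above."""
    sessions, current = [], []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if current and (ev["ts"] - current[-1]["ts"] > gap
                        or ev["device"] != current[-1]["device"]):
            sessions.append(current)
            current = []
        current.append(ev)
    if current:
        sessions.append(current)
    return sessions

for s in sessionise(events):
    print(s[0]["device"], "->", [e["content"] for e in s])
```

Even this toy version hints at the leap: doing it per customer, across millions of customers, fast enough to feed a recommendation back through the site’s APIs, is where the Big Data tooling earns its keep.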
Ultimately, are you ready in your technology departments/providers to make this happen?
The need for speed in the delivery of the analytics is a critical element in deciding if Big Data technology is needed. Both traditional and Big Data tools are designed for rapid data collection. The differentiator is how the two deal with the complexity of data collected.
Big Data technologies are designed to deal with the volume, velocity, variability, and veracity of data collected. To a lesser degree, so are traditional database tools – and the builders of the traditional tools are building quickly to close any of the gaps in the “Vs.”
Keeping the definition of what constitutes Big Data in mind, the question then becomes: What is done with the information after it is collected? Reporting? Or real-time reaction?
If you fall within my definition of just having “lots of data,” you can do the same predictive analytics in SQL Server, Oracle, or MySQL as you can in Hadoop, though you might require some improved fast-access analytical tools: a columnar high-speed analytical database, an analytics package (like MaaX), and a better statistics package (think SAS, SPSS, and R).
The reality is that Big Data, whether Hadoop or another Big Data database technology, has applications in the media space, as do its traditional infrastructure counterparts. Scale and speed of need are the dividing line.
Choosing a traditional solution is a straightforward decision process. The world of Hadoop and NoSQL needs to be approached with a lot of questioning before you jump in. Don’t fall for the current hype. Move only if you are absolutely sure of the return.
If you’re an IT type, I get it. It is hard to tell the CEO/CFO to hold off on Big Data. If you’re a “C” type (non-CIO, that is), be patient; this isn’t an easy discussion and decision for the organisation. Sure, it is “cool” to be in Big Data, but the technology decisions are not like picking a word processing or spreadsheet package – they are extremely complex.
Even if you have the “Vs” that define Big Data, do you have the skills to deploy? Do you understand the costs? Can you sustain the infrastructure? Are you sure you have enough volume and velocity in the Vs, or can you do it in SQL?
What else are you going to do with the data and analysis produced? Can you even run a campaign from Hadoop? Do you understand how time series are handled in Hadoop? Can you push the data back to the Web site/app fast enough to be relevant?
If you are ready to jump, both Hadoop and NoSQL require a whole new core infrastructure and skill set. Do you understand the many tools it takes to do the analysis, to move the results into production systems? Which cloud will you use? Will you deploy via cloud or appliance?
Do you really understand what a data scientist is, what one costs, what they produce? Do you have the resources to write the Hive code or to process the MapReduce?
Can you move the results to production fast enough to matter? Does a recommendation need to refresh right away? How much delay is acceptable? Do you understand how to analyse around what some refer to as “the curse of Big Data?” Do you understand the costs to push up and pull down the data?
This doesn’t even cover the decision steps on whether to use Hadoop, NoSQL, Oracle Big Data, MongoDB, Microsoft HDInsight, and so on. That journey is the subject of its own blog post ... coming soon.