It’s hard to believe, but it has been 11 years since the birth of Hadoop

January 2006’s delivery of HDFS and MapReduce marked the birth of an entire industry and a new type of computing. The shared file storage and shared processing approach to analysing huge amounts of data (a Google research question outlined in 2003) made large-scale analysis possible.
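To make the shared-processing idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using a word count as the task. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are assumptions made up for this example, and a real cluster would run each chunk on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text (one node's share of the data)."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run the three steps in sequence over all chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Hypothetical data: each string stands in for a block stored on a different node.
chunks = ["big data big", "data hadoop", "big hadoop hadoop"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, the map step can be spread across as many machines as there are chunks, which is the work-sharing the article describes.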

Big Data was no longer a brute-force, raw-horsepower exercise (or a job for the very clever load-balancing software from companies like BlueVenn) but a work-shared, multi-computer approach.

Hadoop has changed the face of Big Data.

Today, just 11 short years later, we have whole suites of tools and variations of the Hadoop approach solving problems that seemed out of reach a decade ago. Real-time content recommendation is possible. Network intrusion detection, statistical modeling, and manufacturing quality assurance tests never before possible are a normal part of business operations today.

Big Data has altered the role of the analyst, generating a new job called the “data scientist,” a person with the technical skills to write Python and the statistical skills to understand and build a proper model.

That is a special combination of skills: understanding the problems introduced when modeling over massive amounts of data (the problem of a sample size approaching the population), plus enough programming skill to cut code in Python and/or “R.”

Big Data is the driving force behind a new branch of statistics, one that builds algorithms and models that still behave when ingesting staggering amounts of information. That means, for example, the mathematics to build ultra-large-scale cluster models and collaborative filtering tools that can detect the true outlier against a huge sample.

Business analysis no longer has to rely upon a sample. It is possible (though unwise unless trained in the mathematics) to use “all” of the data.
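One reason using “all” of the data is unwise without the mathematics: as the sample size grows, even a trivial difference becomes statistically “significant.” The sketch below uses a textbook two-proportion z statistic with equal group sizes. The conversion rates (10.0% versus 10.1%) and the group sizes are invented for illustration, and `z_for_difference` is a hypothetical helper, not part of any library.

```python
import math


def z_for_difference(p1, p2, n):
    """Two-proportion z statistic, assuming n observations in each group."""
    pooled = (p1 + p2) / 2  # pooled rate when both groups are the same size
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (p2 - p1) / standard_error


# Same tiny difference, very different sample sizes (illustrative numbers).
small = z_for_difference(0.100, 0.101, 10_000)
large = z_for_difference(0.100, 0.101, 100_000_000)

print(f"n = 10,000 per group:      z = {small:.2f}")
print(f"n = 100,000,000 per group: z = {large:.2f}")
```

The z statistic grows with the square root of the sample size, so the same tenth-of-a-point difference that is indistinguishable from noise at ten thousand rows clears the usual 1.96 threshold easily at a hundred million. Whether the difference matters to the business is a separate question that the statistic does not answer.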

Computer hardware, operating systems, and database engines are all responding. Solid-state storage and the mainstreaming of IBM’s iSeries “in memory” all-the-time data have made it possible to process at supercomputer speeds, and have moved supercomputers to hyper-computing speeds.

Has Hadoop killed the data warehouse? Why ETL into a warehouse structure when the Hadoop family tools can do the transformations on the fly?

For us old-timers in the industry, Hadoop was a game changer. Hadoop and Big Data have had the same transformative effect as the introduction of mainstream personal computers or spreadsheet software. It truly is one of those quantum leaps where everything you knew was instantly old-school.

In the content creation industry (aka media), the power of Hadoop and the other Big Data analytical tools is, after years of figuring out how to use it, now filtering into the mainstream. The leading-edge media companies have made the leap from descriptive and predictive to prescriptive analytics.

Leaders see beyond using data to describe or even forecast a result and are now seeing how data can guide choice. Recommendation engines, commonly used to “suggest” stories you might be interested in, are gaining wider acceptance and moving beyond e-commerce into the mainstream.

As with cloud computing and other game-changing technologies, acceptance is slow until a tipping point. We are at that tipping point with Big Data and are just now grasping what Big Data can do and how to use it. It will still take some time before universal adoption is the accepted norm.

I, for one, used to think it was creepy when Amazon recommended music. Now I expect it. I like that those recommendations have led me to musicians that I never would have listened to before.

I also had to spend some quality think-time and tinkering at home before I worked out how to apply the Big Data processes and newly possible analysis to my day job. I owe DSSTNE and the Big Data Manifesto big thank yous (look them up; it is well worth spending time in the discussion threads).

Hadoop, Cassandra, HBase, CouchDB, NuoDB, MongoDB, and so on … what will their second decade bring?

We still need better tools for the two large classes of Big Data technology (operational and analytical). The ability to deal with scale is needed (outliers versus value-additive data). Cluster modeling tools that can build rapidly are critical.

The rise of the data scientist (part statistician, part programmer, part dream-solver) is directly connected to the birth of Big Data and to the lack of software tools that can mine through the volume, variety, and velocity of the data. What is missing are tools that simplify the insight into an easy-to-understand solution that customer experience software can ingest and use.

A dozen or so tools are fighting for dominance in this space. In the next 10 years, the tool vendor battles will narrow to a few winners. Then the data scientist will become a less Python-intensive job and the skill needs will shift to working on the mathematical dark-arts.

The bottom line? In the next 10 years, Big Data will evolve to fulfill its analytical promise and drive solutions with real business impact into the mainstream.

In 10 years, I see recommendation engines driving real-time online and offline decisions with the same natural ease as we use campaign A/B testing today. Personalisation engines are going to drive communication. Feedback loops that respond as business conversations take place will be common.

Today, many companies are working to solve this need. Some of their solutions are amazing.

Here are a few other Hadoop-led predictions:

  • Leveraging the cloud. The cloud removes owning processor power and storage from the equation. The next decade will see businesses embed cloud into their real-time (OLTP) data center infrastructure. The old model of on-premises computers will be replaced by a pure cloud infrastructure. Of course, 10 years of upgrades to networking, security, and reliability are needed.
  • Leveraging solid state. Hadoop was deployed before solid-state storage moved past the megabyte level. The whole design of Hadoop, splitting work across multiple servers and multiple disks, is going through a re-think now that we have gigabyte SSDs. Designs built around computing, storage, and performance limitations will need to be revisited as the next generations of nano-level hardware remove barriers. (Mr. Moore’s law, stated in 1965, continues to hold true, or at least mostly true. The doubling time may get a little stretchy, and we may move from transistors doubling to storage capacity doubling, but Mr. Moore’s law still works well as a good rule of thumb for all things computing.)
  • Mathematics. A whole branch of advanced mathematical theory will be put to work as modeling, clustering, similarity, and comparative algorithms are all revisited and redesigned to produce results when computing against large, fast-moving data sets. Models that can isolate dependence when N is in the billions (N = population?) will exist.
  • And, finally, Hadoop itself. Where will it be in 10 years? I venture to say that whether it is Hadoop or one of the other Big Data engines, Big Data will become as mainstream as SQL Server and Linux are in today’s world. Companies will need a tool with the power to handle the Big Data “V’s,” one as mainstream as today’s Microsoft Excel. Ten years is five cycles of Moore’s law. Petabytes will be the norm. Exabytes will be the private playground of the Fortune 100, not just of Google and Amazon.

    In cars, there was a saying that there is no substitute for cubic inches. True, at least until the era of the turbo and the electric hybrid came along. In computing, the parallel saying was that there was no substitute for CPU speed (petaflops), but that held only until sharing the workload, with Hadoop and the rest of the Big Data family, came along.
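The “five cycles of Moore’s law” arithmetic in the predictions above can be checked in a few lines. This is a rough sketch assuming the common two-year doubling period as the default. The 18-month and three-year variants are included only to show how much a “stretchy” doubling time changes the answer, and `growth_factor` is a name chosen for this example.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Capacity multiplier after `years`, assuming steady doubling every period."""
    return 2 ** (years / doubling_period_years)


# Ten years at a two-year doubling period is five doublings.
baseline = growth_factor(10)

# Illustrative alternatives: a faster and a "stretchier" doubling time.
faster = growth_factor(10, doubling_period_years=1.5)
slower = growth_factor(10, doubling_period_years=3.0)

print(f"2-year doubling:   x{baseline:.0f}")
print(f"1.5-year doubling: x{faster:.0f}")
print(f"3-year doubling:   x{slower:.0f}")
```

Five doublings is a factor of 32, so under that assumption a shop working with tens of terabytes today would be working in the petabyte range a decade on. The spread between the faster and slower cases is why the rule is best treated as a rule of thumb.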

Change is here. Jump in and enjoy the ride.

So, happy 11th birthday, Hadoop. You are a tweener now — still a bit misunderstood, but fighting to make your mark. Cheers.