Oh, this post will be fun.
My hunch is that, from the time I write this to the time you read this, one of the vendors will be no more and three new ones will have sprung up. This is the unfortunate reality of the ecosystem of Big Data.
Hadoop, NoSQL, and all of the others living in the cloud storage/tools world are in an explosive stage of development, with many new companies jumping into the fray with a widget that makes something possible (time series) or the large firms making something easier to do (drag and drop H-SQL).
All of this inventive effort makes selecting the right set of tools to build your technology stack very difficult. The classic cliché of 20/20 hindsight will prove yet again that your great tech decision was wrong. But you can’t stand on the sidelines; you have to pick something.
So, if you are sure you and your company are ready for the Big Data stack (versus lots of data), let’s jump down the rabbit hole.
The simplified stack:
- Needs assessment.
- Cloud storage.
- Database technology.
- Extraction/loading tools.
- Governance and data forensics tools.
- Query tool and language.
- Scripting language.
- Analytical (statistical) tools.
- API and other connectors to enable decisions.
Each layer in the stack is a unique tool/vendor decision point. Come to think of it, the list itself is a decision point: work through the list yourself or hire a firm to manage it for you.
There are vendors out there that provide everything on the list plus the consultants to plug it all in for you. Think shrink-wrapped service provider. It’s a fairly simple approach, with its own risks and rewards.
My preference is to study each of the layers and build a vendor/service/importance matrix. This will let you see at a glance the whole stack, who fills a stack position and who doesn’t, an evaluation of each at each X/Y coordinate of the matrix, and the relative importance of the matrix position. You then have a quick tool to begin narrowing vendors and make decisions based on what is important to your organisation.
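Here is one way such a matrix could be sketched in code. This is only an illustration, assuming a simple weighted-score approach: the layer names come from the list above, but the vendors ("Vendor A", "Vendor B"), the 0-5 scores, and the importance weights are made-up placeholders, and the function names are my own.

```python
# Hypothetical matrix: every vendor name, score, and weight here is a placeholder.

# Relative importance of each stack layer to your organisation (higher = more important).
importance = {
    "cloud storage": 5,
    "database": 4,
    "query tool": 2,
}

# Evaluation of each vendor at each layer on a 0-5 scale.
# None means the vendor does not fill that stack position.
scores = {
    "Vendor A": {"cloud storage": 4, "database": 3, "query tool": None},
    "Vendor B": {"cloud storage": 2, "database": 5, "query tool": 4},
}


def coverage_gaps(vendor_scores):
    """Return the layers this vendor does not cover."""
    return [layer for layer, score in vendor_scores.items() if score is None]


def weighted_total(vendor_scores, importance):
    """Sum of score x importance over the layers the vendor covers."""
    return sum(
        importance[layer] * score
        for layer, score in vendor_scores.items()
        if score is not None
    )


def rank_vendors(scores, importance):
    """Vendor names ordered from highest to lowest weighted total."""
    return sorted(
        scores,
        key=lambda vendor: weighted_total(scores[vendor], importance),
        reverse=True,
    )


for vendor in rank_vendors(scores, importance):
    total = weighted_total(scores[vendor], importance)
    print(vendor, total, "gaps:", coverage_gaps(scores[vendor]))
```

The total is only a narrowing aid. A vendor with a high score and a gap in a layer you care about still needs a partner for that layer, which is why the gaps are reported alongside the number.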
Let’s look at the cloud storage level. I just Googled “Big Data cloud storage companies.” It’s no surprise there are more than 32,000,000 results. So, filtering past the companies that pay to get to the top of the search list, you see the big players: Amazon, Microsoft, IBM, EMC (Dell), Google, and Oracle. And then there is the mass of others – some are re-branding the big guys’ stuff under their own brand with some (maybe) added services.
So how do you choose?
Develop selection criteria. Use the matrix concept from above, not necessarily as fully as you would a formal RFP but close, to narrow the field. It is important to do this because most often in the Big Data space, the software tools and vendor names will be completely new to you.
Even then, a vendor name isn’t everything. In this fast-changing space, a big name might be completely wrong for your company, so the safe choice might be the best choice or the worst choice.
Note I said “might.” You need to pick on the basis of needs and fulfillment of those needs. Have your RFP go-to questions answered to support your choice: service level (uptime), support, storage location, price, company financials, security capabilities, public/private/hybrid cloud, and an exit strategy (move to another vendor) are at the top of my criteria list for a storage vendor.
Storage location? Yep, it is important in the cloud space.
For example, if you have personally identifiable information (PII) or company financial/performance information in the data you load to the cloud, and the storage is outside of the United States, what legal implications does that have? Subpoena and discovery/disclosure/retention laws are different. PII definitions and restrictions are different. So are the right to “forget me” requirements, and so on.
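Some of those go-to questions, storage location included, work better as pass/fail screens than as scores. A minimal sketch of that idea follows. The vendors, the thresholds, and the choice of which criteria are knock-outs are assumptions for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class StorageVendor:
    # All fields and values below are hypothetical examples.
    name: str
    uptime_sla: float        # promised service level, in percent
    storage_regions: set     # where the data can physically live
    has_exit_path: bool      # documented way to move to another vendor


def passes_screen(vendor, min_uptime, allowed_regions):
    """True only if the vendor clears every knock-out criterion."""
    return (
        vendor.uptime_sla >= min_uptime
        and bool(vendor.storage_regions & allowed_regions)
        and vendor.has_exit_path
    )


candidates = [
    StorageVendor("Vendor A", 99.9, {"US"}, True),
    StorageVendor("Vendor B", 99.99, {"EU"}, True),         # no allowed region
    StorageVendor("Vendor C", 99.5, {"US", "EU"}, False),   # low SLA, no exit path
]

# Assumed requirements: at least 99.9% uptime and storage available in the US.
shortlist = [v.name for v in candidates if passes_screen(v, 99.9, {"US"})]
print(shortlist)
```

Anything that survives the screen can then go through the fuller evaluation on price, support, financials, and security.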
So, while the desire to be on the very edge of technology is intriguing, and the thought of open-sourcing everything is too, remember you are putting everything into a cloud of disk storage owned by someone else. You’ve got to get this one perfect the first time!
The next decision, and it is directly linked to the storage choice, is the database technology. Relational or non-relational. NoSQL or NewSQL. Hadoop, Netezza, Vertica, HANA, Informix, HBase, Cassandra, Membrain, Riak, Couchbase, MongoDB, Google Cloud SQL, Microsoft Azure, or PostgreSQL. In-memory. Storage-based. Over lots of small servers and disks. Or big, big gear. Do you have the skills to operate these databases? Which one works with which storage providers?
Then you repeat this discovery (flare) analysis and decision (focus) process for every layer in the needed technology stack. It will take you time, and there isn’t a silver bullet solution out there. The mix of data you are collecting, and how you use it in the decisions of your Web site/mobile app, will move you across the spectrum of technologies until you find the right one for your particular needs.
The skill sets to navigate this space are new and, unfortunately, just making a few key hires could inadvertently steer the tool decisions toward what those hires already know rather than what your company needs.
Ultimately, the Big Data answer is complex: Do you need it, what are you going to do with it, and so on. Once you have made the decision to jump, make the tool stack and vendor decisions with your goals for using the information guiding them.
Remember: It isn’t the tool, it is what you do with it. It is the ability to make better decisions and take meaningful actions at the right time. There is no perfect vendor for all.
Sure, I have a list of vendors, hardware, and software that I could lay out for you right now. But that list is based on my needs, for my clients. Your needs are different. Use the list of the technology stack pieces above to build the specific grid of vendors, software, hardware, and people that match your needs.
Enjoy and let me know how it works out for you.