The data scientist.

Has there ever been so much interest in, and demand for, a job title?

Confusion, too. Who are these data scientists? What do they do? What qualifies them, and how do you find these people?

“A data scientist is a statistician who lives in San Francisco.” – Josh Wills

In the 1980s and ’90s, data mining, database marketing, and other business intelligence skill sets were all the rage. These jobs required people who could manipulate and transform data using database query languages (SQL) and statistical software (SAS, SPSS, etc.).

The data was typically structured and relational in nature. IT “owned” the databases and managed the administration required for the software used to access and query the data.

Times were simpler back then. A database dictionary, a proper understanding of statistical methods, and some degree of certification (or at least advanced training) in statistical software were essentially the extent of the technical expertise required for the job.

Another important factor was an ability to digest and translate business problems into analytical requirements – a set of specifications required to get the job done.

Add it up and you had the qualifications of a very capable data miner. Data miners quickly ascended to the top of the hot-jobs list in most organisations – and the database and software vendors in this space, all of whom had something to offer to make data analytics easy and painless, enjoyed a similar, if not more pronounced, trajectory.

So what changed?

The dramatic increase in both transactional “events” and behavioural “signals” being instrumented resulted in bigger volumes of data being created at increasingly faster rates.

Traditional database hardware and software simply were not designed for this sort of volume, variety, and velocity – and neither were the data miners.

Everything and everyone evolved simultaneously ... and quickly. It’s the essence of Big Data.

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” – Josh Wills

This is my favourite definition of the data scientist. Data scientists design and co-own their data environment. They code in multiple languages and are far more integrated into the technology that acquires and houses data – and when it comes to the combination of software, code, and data architecture, one size most definitely does not fit all.

In fact, it’s quite the opposite.

To me, that is what separates the data scientist from the traditional data miner – the scale and variety of decision-making required to get the job done. And “the job” in this sense now includes data-based products that are closer to the consumer, beyond analytics and predictive modelling, and into the user experience and product platform itself. 

If data miners were race car drivers, traditional IT architects and database administrators would be their lead engineers. Today, data scientists are both the driver and the engineer; they now co-own the design and build of the car’s mechanics and internal systems, built for purpose based on their intimate understanding of what makes the man-machine combination most effective.

If I had to summarise the essential qualities or characteristics of a data scientist I would say there are three key attributes:

  1. An ability to learn and write code. Again, I’m not talking about using software. I’m talking about writing raw code from scratch.

    Python, R, Java, and the Pig/Hive/Hadoop family are the current popular choices, but these things change quickly. As important as it is for data analytics professionals to be fluent in the latest programming languages, it is equally important that they have the ability (and the interest!) to learn new ones.

    Most advanced data projects today require multiple layers of code to process and compute large data sets, and it is very common to find different languages leveraged across each layer.

    This non-linear approach to computing data is a significant departure from how traditional data miners worked just a short while ago. There is a lot more coding (and learning) involved today, and software that used to be a persistent standard is now a choice (a variable in and of itself).

  2. A detailed understanding of open source platforms/architecture. The days of the simple relational database are over. Qualified data scientists understand the differences in open source platform options, the pros and cons of each across various problem sets, and how effectively these platforms work together within a larger “Big Data” architecture.

    The growing popularity and success seen in cloud/service-based infrastructure/architecture deployments has been a huge factor.

    Data scientists I know and work with have some very specific and often passionate opinions about what works best for a given application use case: what platforms work best for streaming and collecting data in real time, what platforms work best to process and compute data at scale, and what platforms work best for ad hoc data mining and analytics.

    I haven’t met a qualified data scientist who didn’t have very strong opinions about all of these. So while an IT operations team is still essential, much of the platform design decision making is a joint effort.

  3. A strong sense of (and for) business intelligence (analytics and processes). This includes statistics to some degree (there is great debate at the moment about how important this will be, given how much of the “math” has become turnkey within these new analytics platforms).

    Mainly I mean an advanced understanding of business problem solving – being able to understand a business challenge, uncover opportunities, and then connect the dots between data, analytics, and a technology stack that will deliver an effective and measurable solution (so we know if it worked!).

    Advanced experimentation, personalisation, and predictive modeling/targeting are all good examples of what data science can provide. But each of these requires people who understand how to deploy these capabilities within the business. This requires them not just to work with but think like sales people, marketers, or UX experts.

    Essentially, this is the thinking and planning required before the coding and technology platform are chosen and deployed. It is the most important factor that distinguishes the good from the great in this line of work.
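The multi-layer, multi-language workflow described in point 1 can be sketched in miniature with Python’s built-in sqlite3 module standing in for the database layer: SQL handles the aggregation layer, and Python takes over for the analytics layer. The schema, column names, and event values here are invented purely for illustration – real pipelines would use Hive tables, Spark jobs, or similar at each layer.

```python
import sqlite3
from statistics import mean

# Layer 1: storage -- a SQL store holds raw transactional "events".
# (Hypothetical schema; in practice this might be Hive tables over HDFS.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0), (2, 15.0), (3, 40.0)],
)

# Layer 2: compute -- SQL aggregates the raw events per user.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
).fetchall()

# Layer 3: analytics -- a different language (Python) computes
# statistics on the aggregated result.
totals = [total for _, total in rows]
print(mean(totals))  # → 30.0
```

The point of the sketch is not the arithmetic but the hand-off: each layer speaks its own language, and the data scientist is expected to be fluent across all of them.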

I would be remiss if I didn’t mention other virtues such as patience, passion, and curiosity – all critically important as well. But the truth is I know great data science people with varying degrees of each of these characteristics.

I prefer not to make blanket statements about “the best” data scientists displaying certain behaviours or needing to “be” one way or another.

In fact, the single best data scientist I’ve ever met is a relatively introverted fellow, very pleasant to work with but certainly not what my sales or marketing colleagues would think of as “passionate” relative to their teams. It took some time before I got to know him well enough to see just how passionate and proud he was about what he thinks and what he’s accomplished.

So, yes, a charming guy but also a serious guy doing a serious job.

It will be fascinating to see old stereotypes fade (and new ones take shape) over the next few years. The proliferation of connected devices will mean more data, and innovations in data science technology such as deep learning and AI will blur the boundaries between human and machine learning.

The one thing we can predict with certainty is that the science (and scientists) of data will never be the same.