When Hearst scooped Rick McFarland out of Amazon, the company had a tall order for him: transport data from all 250 of its Web sites without disturbing Web site operations, create new metrics covering periods from one hour to one week, and make it all happen in under five minutes.
The Buzzing@Hearst project increased traffic to Hearst’s 250+ Web sites by 25%. This was no easy feat: McFarland, vice president of data science at Hearst, made it happen with US$200,000 and a team of two, he told the audience at the Big Data Media Conference, a joint venture of World Newsmedia Network (WNMN) and INMA.
McFarland says Hearst isn’t in the publishing business. Not even in the media business. “It is really in the data distribution business,” he said, as the company is creating and disseminating 100 to 200 GB of data a day. “Data is the gasoline.”
The world has lost the customer — that is why people are working around the clock to chase the customer, McFarland said. When he started in 1990, customer data was mostly gathered by surveys whose results came in a bound book or CD-ROM. And people would build their business on this.
Today, that data is called clickstream data, and it comes directly from the Web site.
Clickstream data is a term used to describe big data with velocity. It usually comes in daily from services like Omniture or Google Analytics. Today, every device from a phone to a watch is transmitting trillions of pieces of lifestream data. These devices are life-trackers, and the information from them is distributed to many different sources.
In a futuristic example that is actually happening today, a person in Australia can think, “Raise my right hand,” and a person in London would perform the action, McFarland said. Thought patterns can be collected through waves; McFarland believes this is the future, helping companies get as close as possible to the customer and their decision-making.
His objective at Hearst was to capture as much of the clickstream as he could: actions (scrolling, mouse movement), events (listening to audio, watching video), geospatial data (GPS, movement, proximity), and sensor data (pulse, gait, body temperature).
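As a rough illustration of how those four kinds of data might share one record format, here is a minimal Python sketch. The class names, field names, and sample values are all hypothetical; the talk did not describe Hearst's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    """The four clickstream categories named in the talk."""
    ACTION = "action"          # scrolling, mouse movement
    EVENT = "event"            # listening to audio, watching video
    GEOSPATIAL = "geospatial"  # GPS, movement, proximity
    SENSOR = "sensor"          # pulse, gait, body temperature


@dataclass
class ClickstreamRecord:
    """One captured signal (hypothetical layout, for illustration only)."""
    category: Category
    kind: str                  # e.g. "scroll", "video_play", "gps", "pulse"
    timestamp: float           # seconds since the Unix epoch
    payload: dict = field(default_factory=dict)


def count_by_category(records):
    """Tally how many records fall into each category."""
    counts = {category: 0 for category in Category}
    for record in records:
        counts[record.category] += 1
    return counts


# Made-up sample records, one or two per category.
sample = [
    ClickstreamRecord(Category.ACTION, "scroll", 1.0, {"depth_pct": 40}),
    ClickstreamRecord(Category.ACTION, "mouse_move", 1.5),
    ClickstreamRecord(Category.EVENT, "video_play", 2.0, {"video_id": "v1"}),
    ClickstreamRecord(Category.SENSOR, "pulse", 3.0, {"bpm": 72}),
]
```

Keeping the category as a fixed enumeration and the details in a free-form payload is one way to let very different signals travel through the same pipeline.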
So McFarland created a data pipeline that funnelled the most popular content across all of Hearst's Web sites in real time. It also let people click on that content, recirculate it, and share it. He had successfully leveraged the full weight of Hearst’s entire infrastructure.
Here is a practical guide to how he created a data highway that transported data between APIs in much less than five minutes:
Clean the data. No one can do much with raw, dirty data.
Data science: Create metrics from the cleaned data, such as regression models to forecast performance.
Expose the data.
The whole process took 105 seconds. “For a data geek, this is awesome,” McFarland said.
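The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not McFarland's implementation: the record fields (`url`, `hour`, `views`), the function names, and the choice of a simple least-squares line as the "regression model" are all hypothetical.

```python
import json
from collections import defaultdict


def clean(raw_events):
    """Step 1: drop records with a missing URL or an unusable hour/view count."""
    cleaned = []
    for event in raw_events:
        url = (event.get("url") or "").strip().lower()
        try:
            hour = int(event.get("hour"))
            views = int(event.get("views"))
        except (TypeError, ValueError):
            continue  # non-numeric or missing values are treated as dirty
        if url and views >= 0:
            cleaned.append({"url": url, "hour": hour, "views": views})
    return cleaned


def forecast_next_hour(points):
    """Step 2: fit views = a + b * hour by least squares, predict the next hour."""
    n = len(points)
    hours = [h for h, _ in points]
    mean_h = sum(hours) / n
    mean_v = sum(v for _, v in points) / n
    spread = sum((h - mean_h) ** 2 for h in hours)
    if spread == 0:
        return mean_v  # only one distinct hour: fall back to the average
    slope = sum((h - mean_h) * (v - mean_v) for h, v in points) / spread
    intercept = mean_v - slope * mean_h
    return intercept + slope * (max(hours) + 1)


def build_metrics(cleaned):
    """Group cleaned records by URL and forecast each URL's next-hour views."""
    by_url = defaultdict(list)
    for event in cleaned:
        by_url[event["url"]].append((event["hour"], event["views"]))
    return {url: round(forecast_next_hour(pts), 1) for url, pts in by_url.items()}


def expose(metrics):
    """Step 3: publish the metrics as JSON, most popular content first."""
    ranked = sorted(metrics.items(), key=lambda item: item[1], reverse=True)
    return json.dumps([{"url": u, "forecast_views": v} for u, v in ranked])
```

In a real deployment each step would run as its own service and the final JSON would sit behind an API endpoint; here the three functions are simply chained, as in `expose(build_metrics(clean(raw_events)))`.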
Having the right people is key to making the seemingly impossible happen, he said. Many people coming out of school are good at one narrow thing, but some are willing to experiment with several narrow things, and that is the kind of person a company needs, according to McFarland.