Hadoop 2, YARN set to shake up data management and analytics
Consultant Wayne Eckerson says Hadoop 2, with its key YARN component, qualifies as a flexible big data operating system. And it could quickly take the open source framework into the IT mainstream, he predicts.
Released in October 2013, Hadoop 2 turns the open source distributed processing platform into a multipurpose operating system for big data applications. Rather than supporting just one type of data processing, Hadoop 2 supports any data processing application written to the YARN interface. As such, it can support not only batch processing but also real-time queries, enterprise search apps, stream processing, in-memory computing, and whatever else anyone dreams up and writes to YARN.
The upshot is revolutionary: Rather than move data to a variety of specialized applications and systems for processing, companies can store the data in Hadoop 2 systems and process it there as well.
That message was trumpeted recently at an analyst day hosted by Cloudera, which was the first vendor to commercialize a Hadoop distribution and related support services. In his opening remarks, Cloudera CEO Tom Reilly said that Hadoop 2 will change how companies architect analytics systems: "Rather than move data to compute resources, companies will move compute resources to data, saving enormous amounts of time and money."
Data lake harbors new types of apps
The new version has given rise to the notion of a Hadoop-based data lake. Cloudera is also one of the first companies to commercialize a data lake offering, which it calls an enterprise data hub (EDH). With an annual subscription, Cloudera Enterprise Data Hub Edition customers can access core Hadoop plus six premium components, but the company thinks a raft of third-party applications is also on the way.
There are some perils lurking in the Hadoop 2 data lake -- see my blog post about them. But according to Reilly, the lake spawns a new breed of "converged applications" that can deliver enormous business value. For instance, a company can use Spark Streaming to stream data from a sensor network into a Spark in-memory database, where it is analyzed and turned into a model that gets embedded in a high-volume Web application running on HBase. All the while, the data never leaves the Hadoop cluster, which greatly simplifies data processing and reduces costs.
Although many skeptics claim that Hadoop isn't ready to support enterprise-caliber production applications, Cloudera says demand for its EDH is high. In fact, the company reportedly sold eight subscriptions within six weeks at the end of this year's first quarter after making the Enterprise Data Hub Edition commercially available.