Spark is the latest hyped-up big data framework
Apache Spark landed on the map of many data professionals on May 30 when the Apache Software Foundation announced the 1.0 release of the open-source platform. Spark has since continued to grab headlines, but is it ready for enterprise prime time?
Listening to the speakers at last week's Spark Summit, the answer seems to be yes, though reality may be more complicated. Spark is often described as a runtime environment, sitting on top of data stores like Hadoop, NoSQL databases, Amazon Web Services (AWS) and relational databases, and acting as an application programming interface (API) that allows programmers to manipulate data through common applications. Spark comes with a few applications, including an SQL query engine, a library of machine learning algorithms, a graph processing engine and a streaming data processing engine.
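For readers who want to see what that unified API looks like in practice, here is a minimal Scala sketch, not drawn from the article; the file path, record fields and query are hypothetical. It shows the pattern the description above implies: a single SparkContext driving both the core data API and the bundled SQL query engine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for one line of a clickstream log.
case class Click(userId: String, url: String, ms: Long)

object SparkApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("api-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion (Spark 1.0)

    // Core RDD API: load a text file (a local, HDFS or S3 path) and parse it.
    val clicks = sc.textFile("hdfs:///logs/clicks.tsv")
      .map(_.split("\t"))
      .map(f => Click(f(0), f(1), f(2).toLong))

    // SQL engine over the same data: register the RDD as a table and query it.
    clicks.registerAsTable("clicks") // renamed registerTempTable in later releases
    val top = sqlContext.sql("SELECT url, COUNT(*) FROM clicks GROUP BY url")
    top.collect().foreach(println)

    sc.stop()
  }
}
```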
There's an opportunity for Spark to become the "lingua franca" of big data, said Eric Baldeschwieler, a technology advisor and co-founder and former chief technology officer of Hortonworks. Hortonworks is one of several technology vendors, along with Cloudera, IBM, MapR and Pivotal, that have incorporated Spark into their Hadoop distributions.
Therein lies a major part of Spark's promise. Proponents say it complements Hadoop while extending the much-hyped framework beyond what it can do on its own. Spark advocates say no other platform provides such comprehensive integration of these disparate technologies and functions.
M.C. Srivas, CTO and co-founder of Hadoop distribution vendor MapR, is particularly excited about Spark paired with Hadoop. He says it offers an alternative to the clunky and much-maligned MapReduce programming model and, because Spark can process data in memory, enables real-time data processing on Hadoop.
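To make the in-memory point concrete, here is a minimal sketch, assuming a hypothetical log file on HDFS: once the dataset is cached, later computations reuse the in-memory copy instead of rereading from disk, which each MapReduce-style pass would otherwise do.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

    // Hypothetical HDFS path; cache() keeps the parsed lines in cluster memory
    // after the first action materializes them.
    val events = sc.textFile("hdfs:///logs/events.log").cache()

    // Both counts below run against the in-memory copy after the first pass.
    val errorCount   = events.filter(_.contains("ERROR")).count()
    val warningCount = events.filter(_.contains("WARN")).count()

    println(s"errors=$errorCount, warnings=$warningCount")
    sc.stop()
  }
}
```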
Most of the chatter around Spark has been about its ability to integrate disparate data sources and provide a single, simple interface. But it's beginning to offer more to data scientists who are less interested in the heavy lifting of data management.
Patrick Wendell, a software engineer at Databricks, the vendor that is leading Spark development, said the 1.0 release included 15 pre-defined machine-learning algorithms in its Machine Learning Library (MLlib). That is expected to double with the 1.1 release.
"The future of Spark is the libraries," Wendell said. "That's what the community has invested in and where the innovation is coming from. We're betting the future of Spark on these libraries."
Does all this mean enterprises should start planning their own Spark implementations? It may be too early for that. The idea of a single API for interacting with and managing both streaming and batch data, and for running both advanced analytics and simpler reporting against that data, is appealing. Users today are frustrated with the broad array of tools needed to manage, analyze and report on data. But Spark still has holes.
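That single-API appeal is easiest to see in a sketch. The following is a hypothetical example, not drawn from the article, in which the same word-count transformations run once over a static file and once over a live socket stream in 10-second micro-batches.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object BatchAndStreamSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-and-stream-sketch"))

    // Batch: word counts over a file on HDFS (hypothetical path).
    val batchCounts = sc.textFile("hdfs:///data/corpus.txt")
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    batchCounts.take(10).foreach(println)

    // Streaming: the same transformations over 10-second micro-batches
    // arriving on a socket (hypothetical host and port).
    val ssc = new StreamingContext(sc, Seconds(10))
    val streamCounts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    streamCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```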
"There's lots of opportunities to make it better, but I think Apache Spark is the most exciting thing happening in big data today," Baldeschwieler said.
Source: searchbusinessanalytics.techtarget.com