IBM bets big on Spark, calling it the Linux of Big Data analytics
IBM is putting a major stake in the ground in support of Apache Spark, the high-speed analytics and machine-learning engine that is the hottest thing in Big Data right now. IBM said it will embed Spark into all of its analytics and ecommerce platforms, commit more than 3,500 researchers and developers to work on Spark-related projects and open-source its SystemML machine-learning technology to plug a key hole in the Spark technology stack. It will also offer courses to train more than one million data scientists and engineers to use Spark.
Regarded by some people as both a complement and competitor to Hadoop, Spark is actually one of many components of the large Hadoop ecosystem. It's an in-memory analytics processing engine that works across many back-end file systems, including Hadoop's native HDFS. Spark has rapidly gained popularity among businesses that are struggling to analyze data in multiple formats scattered across incompatible databases and file systems.
Because it runs in memory, Spark performs up to 100 times faster than Hadoop's native MapReduce processing engine on native HDFS files. It also works just as fluidly on data stored in Amazon Web Services' S3, HBase, Apache Cassandra, MySQL and several other popular data stores, meaning that applications don't have to be rewritten for each back end. Spark is considered especially strong at working with unstructured data like Twitter streams.
In throwing its substantial weight behind Spark, IBM is casting a vote for simplicity, said George Gilbert, Wikibon's Big Data analyst. One of the chief complaints about Hadoop is its complexity, a function of the large ecosystem that surrounds it, Gilbert said. Hadoop-related projects such as Hive, Pig, Spark and Impala each follow their own update schedules, which leaves the integration work to users.
