Databricks thinks the open-source Spark engine is the next big thing for big data processing — so it has teamed up with analytics firm Alteryx to supercharge the software.


Through a formal partnership, the two data startups intend to put Spark in the hands of more data analysts, Databricks and Alteryx revealed to VentureBeat. Together they will become the primary committers to Apache Spark, the open-source, in-memory engine often seen as the leading candidate to replace MapReduce, the companies said.

MapReduce, originally conceived at Google, is the original programming model for the Hadoop ecosystem of open-source tools for analyzing many different kinds of data. But while MapReduce boasts strong scalability, fault tolerance, and throughput, it generally runs jobs on a batch basis. That is quite limiting in terms of latency and accessibility, argued Alteryx chief operating officer George Mathew in a conversation with VentureBeat.

“You need a custom MapReduce programmer every time you want to get something out of Hadoop, but that’s not the case for Spark,” said Mathew. Alteryx is working toward a standardized Spark interface for asking questions directly against data sets, which broadens Spark’s accessibility from hundreds of thousands of data scientists to millions of data analysts — folks who know how to write SQL queries and model data effectively, but aren’t experts in writing MapReduce jobs in Java.
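Mathew's accessibility argument can be illustrated with a toy example outside Hadoop entirely: the same question ("total sales per region") written as a hand-coded map/shuffle/reduce pipeline versus a single declarative SQL query. This is a plain-Python sketch of the two programming models, not actual MapReduce or Spark code; the table and column names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]

# MapReduce style: the programmer writes map, shuffle, and reduce by hand.
mapped = [(region, amount) for region, amount in rows]   # map: emit (key, value)
shuffled = defaultdict(list)
for key, value in mapped:                                # shuffle: group by key
    shuffled[key].append(value)
mr_result = {key: sum(values) for key, values in shuffled.items()}  # reduce

# SQL style: the analyst declares the question; the engine plans the job.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert mr_result == sql_result == {"east": 17, "west": 8}
```

The two produce identical answers, but the second requires only SQL literacy — which is the gap between "hundreds of thousands of data scientists" and "millions of data analysts" that the partnership is aiming at.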

The Spark framework is well equipped to handle those queries, as it exploits the memory spread across all of the servers in a cluster. That means it can run analytics models at blazing-fast speeds compared to MapReduce: Programs can go as much as 100 times faster in memory or 10 times faster on disk. Those performance enhancements, and the customer demand that followed, have prompted Hadoop distribution vendors like Cloudera and MapR to support Spark.

Databricks, founded by the creators of Spark, today announced $33 million in new funding, bringing its total venture financing to $47 million. It also revealed a new service for running Spark jobs and visualizing data on a Databricks-owned cloud. That’s another move by Databricks to make Spark as accessible as possible, a goal the Alteryx partnership will help push forward.

“We want to create a whole new generation of data blenders and analytics modelers that were never able to touch this stuff before,” Mathew said. “We’re just really excited to be working on this together.”

In building the big data future, architectural issues add up

Building the back-end systems that support business intelligence and analytics applications used to be relatively simple -- or at least straightforward. You'd set up a data warehouse and consolidate transaction data in it, then maybe spin out some data marts with subsets of the information for individual departments or groups of users. But as we move toward the big data future, things aren't so simple anymore. New technologies, such as Hadoop, stream processing systems and NoSQL databases, have entered the picture. Older ones -- columnar databases, in-memory processing tools -- have also become more prevalent in recent years, spurred partly by big data uses.

And there's no easy recipe for mixing all those technologies together with mainstream relational databases to create a big data architecture. William McKnight, president of McKnight Consulting Group in Plano, Texas, uses the term "no-reference architecture" to describe the current state of affairs. "Every company is different," he said in a video interview with SearchDataManagement in February 2014. "Gone are the days when a vendor or a consultant could walk into a shop with a laminated sheet of paper and say, 'This is what everybody needs to do.'"

Perhaps that's one reason why Gartner Inc. analyst Svetlana Sicular found twice as many data architect job listings as data scientist ones in a search for Hadoop-related positions in the New York area on the jobs site Dice.com, as detailed in an April 2014 blog post. Sicular added that inquiries from her clients had recently shifted to questions about "no-nonsense big data architecture, management and real-time use cases."

SearchDataManagement and its companion site, SearchBusinessAnalytics, have published a variety of content offering insight and advice to help organizations figure out the way forward on architecting a big data infrastructure. In his video interview, McKnight expands on the lack of uniformity in big data ecosystems. In another video Q&A, John Myers, an analyst at Enterprise Management Associates, discusses the mix of data management technologies being tapped to support big data applications. A case study looks at the deployment of a cloud-based big data platform at supermarket co-op Allegiance Retail Services, while another story delves into the clear-eyed thinking that's needed in evaluating and selecting big data technologies.

Writing as part of our BI Experts Panel, consultant Rick van der Lans examines the competitive importance of big data -- and of corporate execs who understand the technologies that can be used to exploit it. Also as part of the panel, consultants Claudia Imhoff and Colin White detail a proposed method for extending traditional data warehouse architectures to handle today's expanded data needs. But another panelist, Wayne Eckerson, says it's time to stop dissing the data warehouse -- according to Eckerson, it still has a key role to play in IT architectures, even in the bright, shiny big data future.

Source: venturebeat.com