With a new streaming system, performance enhancements, and API refinements, Apache Spark 2.0 offers a big umbrella to data users
By Serdar Yegulalp
Apache Spark, the in-memory processing system that's fast become a centerpiece of modern big data frameworks, has officially released its long-awaited version 2.0.
Aside from some major usability and performance improvements, Spark
2.0's mission is to become a total solution for streaming and real-time
data. This comes as a number of other projects -- including others from
the Apache Foundation -- provide their own ways to boost real-time and
in-memory processing.
[ Learn how to unlock the power of the internet of things analytics with big data tools in InfoWorld's downloadable Deep Dive. | Explore the current trends and solutions in BI with InfoWorld's Extreme Analytics blog. ]
Easier on top, faster underneath
Most of Spark 2.0's big changes have been known well in advance, which has made them even more hotly anticipated.
One of the largest and most technologically ambitious additions is Project Tungsten, a reworking of Spark's treatment for memory and code generation.
Pieces of Project Tungsten have showed up in earlier releases, but 2.0
adds more, such as applying Tungsten's memory management to both caching
and runtime execution.
For users, these changes, plus a great many other under-the-hood
improvements, provide across-the-board performance gains. Spark's
developers claim a two-to-tenfold increase in speed for common
DataFrames and SQL operations, thanks to a new code generation system. Window functions, used for tasks like moving averages in data, have been reimplemented natively for further speed-ups.
Spark 2.0 also brings a major shift in programming APIs. DataFrames and
Datasets, previously two different ways of accessing structured data,
are now the same under the hood; DataFrames are now "just a type alias
for Dataset of Row," per Spark's release notes. R language users can
also now write a small range of user-defined functions and leverage
better support for existing Spark features.
These changes make Spark more powerful without unnecessary complexity,
since Spark's straightforward APIs are one of its biggest attractions.
Spark has streaming -- and company
Spark has been refining its metaphors for streamed and real-time data as
well, and Structured Streaming makes its proper debut in 2.0. It
repurposes Spark's existing DataFrame/Dataset API to connect with
streaming data sources like Kafka 0.10, so such data can be processed
live.
Streaming has long been considered one of Spark's weaker features because it's harder to debug and keep running than it is to get set up. But it's emerged as a contender to another major streaming-data solution, Apache Storm, in big part because Spark's much easier to use overall.
With version 2.0, Spark is making a bid to be an all-in-one processing
framework accessed by a few overarching APIs. But in the run-up to Spark
2.0, other projects have emerged with their own conceits for how to
approach streaming and batch processing -- Twitter's Heron, Apache Apex, and Apache Flink, to name a few.
All these projects have their advantages. Heron reuses Apache Storm's
metaphors for streaming to make it easier for Storm users to get on
board. Apex is even easier than Spark to work with, especially when it
comes to fault tolerance or event ordering. And Flink uses a native
streaming model rather than a retrofitted version of Spark's existing
data model.
Still, Spark has managed to establish itself solidly over the past
couple of years as an ingredient in third-party software products (SnappyData, Splice Machine) and cloud-native data systems (IBM and more). Spark 2.0 is set on making that legacy harder to displace.
This story, "Spark 2.0 takes an all-in-one approach to big data" was originally published by
InfoWorld.
Sign up here with your email