At its second edition in Europe and with a number of more than 1000 participants, Spark summit Europe took place in Brussels on October 25th-27th. With a total of 80 sessions, the summit covered a range of topics in six tracks: development, data science, enterprise, research, spark ecosystem and use cases. Unravel actively participated at the summit with a talk on Java Agents.
Unravel Presented a talk on Extending Spark with Java Agents
Click to view video.
Spark engine uses the Spark listener interfaces to export some of the internal data structures and runtime metrics to applications. While applications can extend these interfaces to collect metrics data at runtime, one limitation of Spark listeners is that the interfaces allow only certain metrics to be collected.
Jaroslav Bachorik, backend engineer at Unravel with a strong experience in JVM performance, presented Java Agents, a powerful technology that can be used in Spark to access and manipulate any internal data structures at runtime.
He presented BTrace, an open source dynamic instrumentation framework that uses Java Agents, has an easy to use DSL for class transformations, and was optimized for performance.
Java Agents are a powerful tool that can be also used to extend the application beyond the original intent (e.g., hot-patch problems or add feature enhancements without the need to modify the original source code), and may be removed when not needed.
In the example below we show how to use BTrace to intercept the SparkContext object at runtime, immediately after its initialization. After the object was intercepted, it can be used for fetching internal fields or even for modifying the state of the object (e.g., conf.set() call). In this particular example, BTrace was used to adjust the spark.executor.cores configuration property at runtime, after more information was collected dynamically from the cluster (through the AdviceSystem interface).
As part of the same talk, Adrian Popescu, backend engineer at Unravel focusing in optimizing Big Data applications, showcased the challenges DevOps face when deciding which RDDs to cache in Spark. While caching can boost performance significantly, caching too many RDDs may also hurt performance if the cached blocks are never used or if they are evicted just before usage (as also summarized in this post). He presented a solution that exploits Java Agents to collect all the data required in a caching algorithm which can be further used to simplify the optimization process. The illustration below shows the output of the caching algorithm for a Spark application. The algorithm identifies that the RDD produced at line 77 of the application has the most performance benefit, by reducing the runtime of the application with up to 22 minutes if caching is enabled.
Spark 2.0 and Performance Related Talks in the Development Track
Matei Zaharia’s keynote “Simplifying Big Data Applications with Apache Spark 2.0“ summarized the new features set of Spark 2.0 that allow users to write applications in a high level API which can be still highly optimized by the runtime engine. One of the key features of Spark is that it simplifies both development and optimization by allowing users to write an entire workflow of different processing steps in one single engine. Dataframes & Datasets allow users to write applications using a high level API that can be later converted into highly optimal code. A lot of attention has been also put on continuous applications, I.e., applications that interact with both batch and streaming data. Structured streaming integrates batch processing with stream processing.
In “Spark’s Performance: The Past, Present, and Future”, Sameer Agarwal presented the results of Project Tungsten that allowed Spark engine to run one order of magnitude faster by transforming SQL queries into highly optimized bytecode. The generated code collapses the input query into one single function, thus it eliminates expensive virtual function calls and it can better exploit CPU registers for storing intermediate data.
In “A Deep Dive into the Catalyst Optimizer”, Herman van Hovell presented the latest features available in the Catalyst optimizer. He illustrated the steps followed when optimizing a query with Catalyst. Starting from an initial logical plan, he showcased a number of rule base optimization techniques that can be used to transform the logical plan into an optimized physical plan. Finally, he illustrated how the optimized physical plan is translated into RDD-level bytecode through code-generation to ensure maximum performance in the JVM. Logical and physical plans are represented as Trees. Catalyst uses a Rule Executor which applies a number of batched rules where each batch has a large number or rules. The optimizer has also support for custom optimization strategies when more knowledge about the input transformations is available.
Some of the other performance related talks include: “Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire!”, which shows the performance boost when using hardware aware compilation techniques for Spark SQL applications, “SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster“, a custom monitoring tool that exploits Spark metrics API and Spark listeners to identify and tune inefficient jobs within the application, and “Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs“, which presents performance monitoring and visualization tools that can ease the task of identifying bottlenecks of applications.
The slides and videos of all the talks are available on the Spark summit page here. Other blog notes about the summit from Databricks can be found: here (for day 1), and here (for day 2).