I ran a tech talk at the always-inspiring databricks.com/na-sais event in June in San Francisco. It's amazing to see the Spark community continue to thrive and remain so relevant. In my talk below I dig into optimization techniques that exist today, while introducing our work on a self-driving optimization tool. In the video you'll see a demo of the automatic optimization tool improving both the performance and the reliability of big data applications. As always, please feel free to connect with me directly at firstname.lastname@example.org to discuss any of the content in the slides in more depth.
Spark's Catalyst Optimizer uses cost-based optimization (CBO) to pick the best execution plan for a Spark SQL query. The CBO chooses which join strategy to use (e.g., a broadcast join versus a repartitioned join), which table to use as the build side of a hash join, which join order to use in a multi-way join query, which filters to push down, and more. To get these decisions right, the CBO relies on a number of assumptions: availability of up-to-date statistics about the data, accurate estimation of intermediate result sizes, and availability of accurate models to estimate query costs.
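To make the broadcast-versus-repartition decision concrete, here is a minimal sketch (not Spark's actual implementation) of how a cost-based optimizer might pick a join strategy from estimated table sizes. The 10 MB threshold mirrors Spark's default `spark.sql.autoBroadcastJoinThreshold`; the function and field names are illustrative only.

```python
# Toy sketch of a CBO join-strategy rule; names are hypothetical.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # ~Spark's default autoBroadcastJoinThreshold


def choose_join_strategy(left_size_bytes: int, right_size_bytes: int) -> dict:
    """Pick a join strategy and hash-join build side from size estimates."""
    smaller = min(left_size_bytes, right_size_bytes)
    if smaller <= BROADCAST_THRESHOLD_BYTES:
        # Ship the small table to every executor; no shuffle of the big table.
        strategy = "broadcast_hash_join"
    else:
        # Both sides are large: repartition both on the join key.
        strategy = "shuffle_join"
    # Build the hash table on the smaller side to keep it in memory.
    build_side = "left" if left_size_bytes <= right_size_bytes else "right"
    return {"strategy": strategy, "build_side": build_side}


# A 1 MB dimension table joined with a 10 GB fact table: broadcast wins.
print(choose_join_strategy(1 * 1024 * 1024, 10 * 1024**3))
```

The point of the sketch is that the rule is only as good as its size estimates: if statistics are stale and the "1 MB" table is really 5 GB, the optimizer broadcasts a huge table and the query degrades, which is exactly the failure mode discussed below.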
These assumptions may not hold in real-life settings such as multi-tenant clusters and agile cloud environments, unfortunately causing the CBO to pick suboptimal execution plans. Dissatisfied users then have to step in and tune the queries manually. In this talk, we will describe how we built Elfino, a Self-Driving Query Optimizer. Elfino tracks each and every query over time (before, during, and after execution) and uses machine learning algorithms to learn from mistakes made by the CBO in estimating properties of the input datasets, intermediate query result sizes, the speed of the underlying hardware, and query costs. Elfino can further use an AI algorithm (modeled on multi-armed bandits with expert advice) to "explore and experiment" in a safe and nonintrusive manner using otherwise idle cluster resources. This algorithm enables Elfino to learn about execution plans that the CBO would not otherwise consider.
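For readers unfamiliar with bandits with expert advice, here is a minimal Exp4-style sketch in Python. This is the standard algorithm family the abstract names, not Elfino's actual code; all names are hypothetical. Each "expert" recommends a probability distribution over candidate plans (arms), the learner samples a plan from the weighted mixture, observes a reward (e.g., a normalized speedup in [0, 1]), and reweights the experts by an importance-weighted reward estimate:

```python
import math
import random


def exp4(experts, get_reward, rounds, gamma=0.1, seed=0):
    """Exp4-style bandit with expert advice (illustrative sketch).

    experts: list of callables; experts[j](t) returns a probability
             distribution over arms (candidate plans) for round t.
    get_reward: callable mapping a chosen arm to a reward in [0, 1].
    Returns the final expert weights (higher = more trusted).
    """
    rng = random.Random(seed)
    n_arms = len(experts[0](0))
    weights = [1.0] * len(experts)
    for t in range(rounds):
        advice = [e(t) for e in experts]
        total_w = sum(weights)
        # Mix the experts' advice, plus gamma-uniform exploration.
        probs = [
            (1 - gamma) * sum(w * a[i] for w, a in zip(weights, advice)) / total_w
            + gamma / n_arms
            for i in range(n_arms)
        ]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = get_reward(arm)
        est = reward / probs[arm]  # importance-weighted reward estimate
        # Boost experts in proportion to how strongly they backed this arm.
        for j, a in enumerate(advice):
            weights[j] *= math.exp(gamma * a[arm] * est / n_arms)
    return weights
```

The gamma-uniform exploration term is what makes the exploration "safe": it bounds how often an untested plan is tried, so a bad plan cannot dominate the schedule, while still guaranteeing every plan is occasionally sampled on idle resources.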
Our talk will cover how these algorithms help guide the CBO towards better plans in real-life settings while reducing its reliance on assumptions and manual steps like query tuning, setting configuration parameters by trial-and-error, and detecting when statistics are stale. We will also share our experiences with evaluating Elfino in multiple environments and highlight interesting avenues for future work.
Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. He holds a PhD in computer science from EPFL, a master of applied science from the University of Toronto, and a bachelor of science from University Politehnica, Bucharest. His PhD thesis focused on modeling the runtime performance of a class of analytical workloads that include iterative tasks executing on in-memory graph processing engines (Apache Giraph BSP), and SQL-on-Hadoop queries executing at scale on Hive.