Best Practices | Video | 5 Min Read

How to Run Spark Reliably and Optimize Spark Performance


Demo by Shivnath Babu, CTO:

Many big data applications are built in Spark now. From data transformation and SQL applications to real-time streaming applications and data pipelines powered by AI and machine learning, Spark has made it easier than ever to create big data applications. However, moving these applications into production and running them in production (continuously and reliably) is challenging. Unravel powerfully helps you manage Spark performance…and meet your SLAs.

You might also be interested in this demo: CEO Kunal Agarwal shows how our big data APM monitoring software helps you manage applications in production.

Transcript of video (subtitles provided for silent viewing)

In today’s demo, I will focus on multiple ways in which you can derive value from Unravel, starting with how Unravel provides a single pane of glass for the entire Spark application and operations management. From there, I will show you how the automatic analysis Unravel does on the telemetry data brings out useful recommendations, so that you can optimize the performance of your applications, tune them to meet your SLAs, or automatically do a root cause analysis of application failures. From there I will go on to show you Spark pipelines running end-to-end: how you can monitor these pipelines, and how you can rapidly detect and diagnose when pipelines have unpredictable performance or their performance changes over time. And from that, I will take you through how the Auto Actions feature of Unravel can proactively alert you when problems happen, so you don’t have to do reactive firefighting. On top of that, Unravel can automatically fix these problems itself so that your cluster resource optimization and SLA needs can be met.

So let me switch over to the demo of Unravel. What you’re seeing here is the Unravel UI, which talks to a distributed Unravel platform in the backend. Unravel has many different entry points designed for different use cases: an operations entry point for operations teams, and an applications entry point, which is where we’ll spend most of our time today and where all the applications running on your cluster are loaded so you can interact with them. There are other entry points which we will not spend time on today but will cover in upcoming webinars, where we’ll look at data, the APIs that Unravel provides, and so on. The applications entry point gives a very rich search and selection interface for every application running on your cluster. They could be YARN applications like MapReduce, Spark, and Tez, SQL applications like Hive and Spark SQL, or pipelines like Cascading. Later on in the demo we’ll see how pipelines can be viewed end-to-end. These applications could be in any state, from running to succeeded or failed, and you can search and click into applications running in different queues.

Since the focus today is on Spark, let’s start by zooming into the successful Spark applications over the last 30 days. I’ve zoomed into all of these applications, and you can see the different applications and the key metadata and KPIs about those applications right here. Let’s sort by duration, for example, to find out which applications could be running for a long time or consuming a lot of resources on my cluster. As you can see here, whenever Unravel analyzes an application it automatically provides tuning suggestions, and that is what you see marked right here. Let’s jump into one of these applications.

So right here we have a Spark application page in Unravel. You can see how the page is laid out, starting with the key information about KPIs and metadata: who ran this application, which queue it was submitted to, which cluster it ran on. Unravel is multi-cluster aware, so on premises this might be your development cluster, a QA cluster, and a prod cluster, and on the cloud it could be any number of transient clusters that are spun up. You can also see the amount of time spent by this application. As you will see, Unravel provides a single pane of glass with all the information needed to understand the performance of these applications, starting with the source of the application. This specific application we are looking at is a Scala Spark program, right here, and you can see on the left side that this application ran with six stages. There is key metadata about all of those stages, along with rich visuals like a Gantt chart view, for example, which shows, across the 27 minutes of this application’s lifecycle, where most of the time is going for all those six stages. You can quickly see that the first and second stages, stages 0 and 1, account for almost 99% of the time spent by this application, and of that time, it seems like stage 1 is actually the bottleneck stage.

So let us try to see what stage 1 is doing. To understand what a particular stage might be doing in your program and in your execution, we have created execution graph views like this, which show all the different data sets being processed by your Spark application. Spark has a view of all your data sets, and the way Spark programs run is that different transformations are applied to these data sets. We can see that stage 1 includes this data set with a map operation being applied to it. I can click on any of those data sets and it immediately highlights which line of code that access is coming from, so I can quickly understand what stage 1 is doing in my program, because we have seen that stage 1 is actually the bottleneck stage in the running of this entire Spark application.
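
To make that mapping from code to stages concrete, here is a minimal, purely hypothetical Scala Spark job (the paths and field logic are made up, not from the demo). Narrow transformations like map stay inside a stage, while a shuffle such as reduceByKey starts a new stage, which is exactly the kind of boundary the execution graph view ties back to a line of code:

```scala
import org.apache.spark.sql.SparkSession

object StageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StageExample").getOrCreate()

    // Stage 0: reading the input and applying a narrow map transformation
    val events = spark.sparkContext.textFile("hdfs:///data/events")   // hypothetical path
    val pairs  = events.map(line => (line.split(",")(0), 1))          // stays in stage 0

    // Stage 1: reduceByKey forces a shuffle, so a new stage starts at this line
    val counts = pairs.reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///out/event_counts")                 // hypothetical path

    spark.stop()
  }
}
```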

Now, from the aspect of running time, we have seen where most of the time in this particular application is going. What about resources? Resources used in a big data cluster can be thought of in terms of the demand for resources: this particular Spark application had 12 tasks that it had to run. Then let’s look at the supply side: how is the cluster providing resources to run this application? There we have these rich views, where we can start by understanding how many of these tasks are running concurrently in the cluster at any point in time. We can see how many containers are allocated; this is a YARN-based cluster, but containers are the general fashion in which big data applications run today. You can see that three containers are allocated to run this application at any point in time. We can also see how many YARN vcores and how much memory are allocated to run this application. All of these are YARN metrics, and there are similar metrics when Spark is running on other types of containerized environments.

Along with this, we also want to know exactly how many resources are truly used by this application. This application is running, as we saw, three containers, and Spark runs a master container called the driver and worker containers called executors. So there are these three containers running, and Unravel brings together a rich set of metrics, starting with CPU metrics, to understand how much CPU was used on the hosts where these containers are running and how much of that usage can be accounted for by the containers corresponding to this program. That helps you quickly understand whether your applications are running on heavily contended clusters and not getting enough resources, or whether they are actually stealing a lot of resources and causing problems on the cluster. Along with information about CPU, we have information at the level of memory, and memory is allocated at multiple levels: at the operating system level, at the JVM level, and at the Spark level. Here is a quick view of the operating-system-level memory usage of all of the containers running for this program.

Along with understanding that, you would then quickly want to understand how much memory this application is even asking for. You can see how Unravel has brought together the entire configuration information, and with Spark, as with any system, you know that there are a lot of such configuration parameters. We can quickly see that the executor memory, the memory asked for each of the executor containers, is 3 gigabytes, and for the driver it is 1 gigabyte. All that information is right here. Along with that, we have an interesting algorithm to bring in the key logs, all the unstructured log files that correspond to this particular application. We will go into logs later on in the demo, because logs are a very useful source for root-causing problems when your application is failing. But along with showing this high-level information of where time is going and where resources are being used, you can always drill down to any stage. For example, I’ve drilled down into the bottleneck stage 1 right here, and I can see what the resource demand and supply for this particular stage is. I can even go into further detail to look at the parallel tasks being run on behalf of the stage: which machines they are running on, which executors they are running on, and key metrics about each of those tasks, which helps you understand if the stage is suffering from a data skew problem.
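
For readers less familiar with Spark’s memory settings, the numbers mentioned above correspond to standard Spark configuration properties. This is a minimal sketch with an illustrative application name and executor count that are not from the demo; in practice these values are usually passed with spark-submit rather than set in code:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: 3 GB per executor and 1 GB for the driver, as mentioned above.
// Driver memory in particular normally has to be set at submit time, before the
// driver JVM starts, e.g. via spark-submit --conf spark.driver.memory=1g.
val spark = SparkSession.builder()
  .appName("ConfiguredApp")                  // hypothetical application name
  .config("spark.executor.memory", "3g")     // heap requested per executor container
  .config("spark.driver.memory", "1g")       // heap requested for the driver container
  .config("spark.executor.instances", "2")   // two executors plus the driver gives three containers
  .getOrCreate()
```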

All this information is right there at your fingertips. Now, if you are an analyst who wants to understand why this application took a lot of time, then along with all of this data at your fingertips, what you also want to be told is: of all the configuration and application-level changes I could make, how can I get this application to run in the best possible manner? And best might mean the most resource-efficient and the fastest. Unravel automatically analyzes all of this information and gives you very crisp recommendations: across the size of containers, across parameters like the YARN memory overhead, across partitioning, across the type of joins, and so on. It gives you very concrete suggestions to get the best performance out of this application. And it doesn’t stop there. It also provides rich explanations in plain English, so that the person trying to understand the performance can also understand why certain recommendations were made and where the bottlenecks in the application itself are. As you can see here, in Spark one of the secret sauces for getting really good performance is to employ the right caching. This is today an art, which Unravel has reduced to a science: it can tell you exactly which RDDs, which datasets, which data frames to cache, and where in your program to cache them.
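
As a rough illustration of what acting on such a caching recommendation looks like, here is a minimal sketch with a hypothetical DataFrame that is reused by two downstream writes; the table path and aggregations are made up, not taken from the demo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CachingExample").getOrCreate()

// A dataset that is expensive to compute and reused more than once below.
val orders = spark.read.parquet("hdfs:///warehouse/orders")   // hypothetical path
  .filter("status = 'COMPLETE'")
  .persist(StorageLevel.MEMORY_AND_DISK)                      // cache before the reuse points

// Both actions reuse the cached data instead of re-reading and re-filtering it.
orders.groupBy("customer_id").count().write.parquet("hdfs:///out/by_customer")
orders.groupBy("order_date").count().write.parquet("hdfs:///out/by_day")

orders.unpersist()                                            // release the cache when done
```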

Let’s see how this application would actually run if you were to apply these recommendations. Let’s take this particular application name and go back and search in Unravel, because I took the same application, the one we are looking at, and applied these recommendations. You can see how this particular application got almost an order-of-magnitude, 10X performance improvement. These sorts of performance improvements are often possible because the users who are creating and running these applications are often not aware of all the details of how Spark runs, how to set parameters, or what else is happening on the cluster. So you can get very significant improvements, and in a cloud setting these performance improvements can translate into a drastic reduction in cost, because I’m running the same cluster and finishing the applications much faster.

Having seen such an example of performance tuning, let’s go and take a look at what Unravel can do for applications that are failing. I’ve zoomed into Spark applications that are failing, and you can see how we quickly show the status. Let’s now zoom into one of these applications. This is a very similar application to what we saw earlier, but this time the application was not able to finish any of its tasks successfully, and it ran four executors on the cluster. Let’s see what Unravel has to tell you in this particular case for a failed application.

Now you can see how Unravel, along with efficiency analysis, also does root cause analysis of applications. In this particular case it can detect and tell you right up front that this application failed with an out-of-memory problem in the executors, which caused the application to fail, and it guides you towards which metrics to look at, which parameters you need to change, and how to get this application running again. Let’s take a quick look at the parameters that are actually being mentioned. We can take a look at the maximum heap, the Java heap that has been allocated, which is around 1.9 to 2 gigabytes for this application. And if you look at the used heap, you will see how the executors that are running tasks show a gradual increase in their JVM heap usage, and that’s the reason why they are going out of memory.
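
The kind of fix such an analysis points at is usually a memory-configuration change. Here is a hedged sketch, with illustrative values rather than the ones recommended in the demo, of giving the executors more heap and more container overhead, and of reducing per-task memory pressure:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OomFixExample")                        // hypothetical application name
  .config("spark.executor.memory", "4g")           // raise the executor heap above the ~2 GB that overflowed
  .config("spark.executor.memoryOverhead", "1g")   // extra off-heap room for the YARN container
  .getOrCreate()

// Increasing shuffle parallelism shrinks each task's working set, which is another
// common way to avoid executor out-of-memory failures.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```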

In this specific case, along with telling you the root cause of the problem, Unravel also gives you a rich view, what we call the error view, where it extracts the key error messages from the logs and presents them in a useful way, so you don’t have to dig deep and look at all the really complicated logs. If you do want to look at them, that information is also there, because Unravel is providing you a single pane of glass. We will actually be giving a talk at Strata Hadoop World on all the rich capabilities for doing root cause analysis in Unravel. Now, in this particular case, let’s take a look at this particular application and see what happens if Unravel’s root cause insights are applied. We can search for this specific application, and if you now look at all the applications, you will see that the application we were looking at, which failed, now runs successfully with Unravel’s insights. Imagine a scenario where you have applications that are taking a long time to run and then failing; it isn’t a very pleasant experience. With Unravel, you can quickly detect the problem and be guided to the right fix.

So far, we have looked at individual applications. We saw how these particular applications can be optimized for performance, and diagnosed if they are failing. Now, a lot of the time production applications are pipelines consisting of multiple components. Unravel has a feature where it can automatically detect pipelines running on a cluster; as I said, sometimes they’re scheduled by Oozie, Airflow, Control-M, Tidal, or Autosys. Here is an application which consists of three components: there is a MapReduce component for data ingestion, there’s a MapReduce component for cleaning up the data and doing some transformations on it, and then it runs a Spark application to produce a model. This is the Prod ML Model pipeline. When you have these pipelines running, every dot you see on the right side represents a run of this application; that is, it’s running every day or every hour. Unravel is now providing you the end-to-end view of what is happening for your application, which is this pipeline, and you can always zoom into any component to understand its performance in much more detail.

What I’ll show today is that when you have these repeatedly running applications, a very common problem is, say, my application was running fine, as you can see here it was running in 4 minutes, but suddenly today it had a major change of performance. Troubleshooting these changes of performance, this unpredictable performance, takes a lot of time when you are operating Spark and other systems in production. In Unravel, trying to understand what happened and troubleshooting it is very easy. I can just click, and now what I have done is I am looking at a good run, which finished in 4 minutes, and a bad run, which took a lot of time, side by side. Just by eyeballing the components I can see that the component which finished in 1 minute is taking a long time in my bad run, and I can zoom in and understand why that component is taking a lot of time, so we basically have the single pane of glass here.

But in this case, let’s take a look at what Unravel has to tell us. I’m clicking on Unravel’s insights, and you can see how Unravel has actually done an SLA management analysis, where it automatically builds a baseline for these repeated runs. There are a lot of reasons why performance could change: it could be the I/O the application is doing, or the data set it is getting is much larger today, or there could be resource contention, or something else in the cluster could be much slower today. Unravel automatically analyzes all these possible causes and is saying that the cause of the performance change is the increased wait time to launch the application master. There are a lot of such resource contention problems that can happen, so very quickly Unravel can tell you what the cause of the problem is.

Now at this point, you might wonder why this application is actually suffering from contention. Who is causing that contention? Unravel has the insights and visuals to help you out there as well. This particular application ran on August 28 between 21:41 and 21:56. Let’s take a quick look at Unravel’s operations views, where we can go in and understand things from the entire-cluster or multi-cluster perspective.

We can zoom into the time range, in this case August 28th. Let’s go into that particular time range and understand what was going on in the cluster, and zoom into a fine granularity of one minute to see what was happening during that time. What you’re seeing here is a cluster-wide view where the black line represents the total resources in the cluster, from CPU to memory, I/O, and so on, and the green represents the resource usage. If we go into that timeframe around 21:50, let’s see what is going on in the cluster during that time.

Unravel can correlate the high-level cluster metrics with the applications that are contributing to them. This application, 584, is the application we were looking at, which is part of our SLA-bound workflow that is suffering from a performance change, and you can quickly see how that application is getting a very limited amount of resources in the cluster, while all the resources are being stolen by an ad-hoc Spark shell that was started. This shell is occupying all the resources and causing the contention. So not only does Unravel help you understand that the root cause is contention, it can help you pinpoint who is causing that contention.

Now, the thing that users actually want is: it’s great to understand who is causing the contention, but can I protect the SLA of my workflow? Can I actually ensure that these problems don’t happen? Notice how, in this case, Unravel was able to get the SLA of this application back to a state where it is being met. So how did that happen? For this run you can see the corresponding applications here, 586 and 588; this application was running between 21:50 and 22:02. Let’s see what Unravel did during that time. If you go into that time range, you’ll see that I have my SLA-bound application running and somebody did start a Spark shell, probably the same user, and it is indeed consuming a lot of resources. But in this case you can see how Unravel applied an Auto Action, a policy, that understood that this particular application is a rogue or runaway application, and it was able to terminate that application and protect the SLA of the workflow.

So let’s see how Unravel does this automation. Unravel has a feature called Auto Actions, which are created by administrators of the cluster. You can go to this Manage view, Auto Actions, and add new Auto Actions to the cluster. There is a rich set of Auto Actions that can be created. So what are Auto Actions? Auto Actions are policies that can be created and enforced by Unravel to protect SLAs, to enforce best practices, and to do cost optimization. In this specific case, the Auto Action that protected the SLA is a resource contention and queue Auto Action. And what does that look like?

Any such policy, as you can imagine, is a collection of rules. We can create a rule of the form: a rogue or runaway application is occupying more than 1 terabyte of memory and causing a state in the cluster where multiple other applications are stuck pending, not getting resources or having to wait a long time for them. You can further scope these policies by saying they should only apply to specific users, or only during specific timeframes. For example, SLA-bound workflows in your cluster might run between, let’s say, 12 a.m. and 5 or 6 a.m., so only during this window should these policies be applied. Then, when a violation of the policy is detected, you can send an alert, and we can alert through many different tools: by email, of course, as well as Slack, Asana, and others. You can take actions where the applications causing problems in the cluster are moved to a different queue or killed, like we saw in our example. And we have rich extension mechanisms where you can implement your own policies on what should be done.
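
To make the shape of such a policy concrete, here is a purely hypothetical sketch of a rule with a trigger, a scope, and an action; none of these type names or fields come from Unravel’s actual configuration or API:

```scala
import java.time.LocalTime

// Hypothetical policy structure: trigger thresholds, a time window, and an action.
case class AutoActionPolicy(
  name: String,
  memoryThresholdTb: Double,   // trigger: an app holding more than this much memory...
  minPendingApps: Int,         // ...while at least this many other apps sit pending
  applyFrom: LocalTime,        // scope: only enforce inside this time window
  applyTo: LocalTime,
  action: String               // e.g. "alert", "move-to-queue", or "kill"
)

// The 1 TB / overnight-SLA example from the demo, with illustrative thresholds.
val protectNightlySla = AutoActionPolicy(
  name              = "kill-rogue-apps-during-sla-window",
  memoryThresholdTb = 1.0,
  minPendingApps    = 3,
  applyFrom         = LocalTime.of(0, 0),   // 12 a.m.
  applyTo           = LocalTime.of(6, 0),   // 6 a.m.
  action            = "kill"
)
```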

With this Auto Actions feature, there is a rich set of Auto Actions that can be created in Unravel, and for any Auto Action we maintain an audit trail so you can see what the Auto Action did. On top of all of this, Unravel has a feature called Smart Auto Actions, where Unravel will automatically baseline your cluster and come up with the right policies for you, and you can always customize them by changing the thresholds or creating your own extensions to these policies.