Putting AI to Work on Apache Spark

Bay Area Spark Meetup – January 2019

Presenter: Dr. Shivnath Babu, CTO & Co-Founder at Unravel

Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.

This talk discusses algorithms that run as real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application; (ii) auto-tuning SLA-bound Spark streaming pipelines; (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables; (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud; and more.

Putting AI to Work on Apache Spark

Transcript of the video recording from the Bay Area Spark Meetup: January 17, 2019

Shivnath: Hello, everyone. So I’m Shivnath Babu, and thanks, Jules, for the awesome introduction. Again, it’s pretty challenging to figure out which way to face. But one thing I realized, being at the Microsoft venue, is that I have a Microsoft fellowship as well. Sorry, Microsoft, I should have acknowledged that too. I really appreciate it; it came as my first fellowship when I was a grad student at Stanford. So what I really wanted to quickly run by you is the same Shivnath you saw in the previous photo. Now that same Shivnath actually kind of seems like he’s in a fight. Who am I fighting?

I’m sure many of you here have run distributed applications in production. Please raise your hands if you actually have distributed applications in production. [00:01:00] A very large crowd. And it is pretty hard to run these applications reliably and efficiently in production. So I’m sure for a lot of you, when you tried your first Spark app, or maybe on a daily basis when you run your Spark apps, it fails. It ran out of memory. What do I do? I am a data scientist. I want to make sense of the data. I actually want to build my applications. Why am I fighting this battle of, “It went out of memory?” Can’t something just tell me how to make my application more reliable? I would fix that problem.

You may be a data scientist, or you might be an engineer, a data engineer who’s moving an application into production, or you have that need to move an application to production and it has to meet certain performance requirements. The app was too slow. Maybe you ported it from a system like Teradata or whatnot where it was running fast, but now I’m not able to get the performance. Once again, it’s a major battle to figure out how to make it faster. All of these knobs, all of these things have to change, what instance types to allocate [00:02:00] on the cloud, many, many decisions. And once you have put it in production, maybe there’s an SLA. And suddenly it stops meeting the SLA. How do you troubleshoot it? How do you figure it out? How do you fix it?

Once again, you just want to know the answer to: what can I do to guarantee the SLA? I don’t want to fight this battle of trying to tune the app and whatnot. And last but not least, if you’re an operations person, how many of you here are operations people, would call yourself somebody who actually manages apps in operation? Anybody here? Seems like a small crowd, but you will definitely agree that it is a challenge to ensure these applications run reliably. And now with the cloud, it’s very easy to burn money. A bad app might just run slower on-prem, but on the cloud it burns like 2X the amount of money. And I just want to know how to make the app efficient and finish on time.

So it’s not a surprise that these are challenging, because a lot of the time, if you go to an enterprise and ask them to draw [00:03:00] their data processing architecture, this modern data processing architecture, it looks something like this. Spark is a very key component, but you might have some Kafka to ingest the data. You might have some Redshift or a Snowflake or an MPP system to generate reports. It will be a collection of systems, not to mention some NoSQL in there. And if one distributed system is hard to manage, just imagine when you have this collection of systems being tied together by applications. Things can go wrong everywhere: at the app level, at the platform level, the scheduler level, noisy neighbors, you just name it, data layouts, lots of hard decisions to make, a lot of places where things can go wrong. And what do you do today?

I’m sure a lot of you have faced this problem, probably even here. When you have this problem, what you might do is go to the Spark UI or the Resource Manager UI, or you might look at logs. You’ll try to figure out what is happening. It’s just not one log. It is a driver log, it is an executor log, so many things to look at. And once again, why should [00:04:00] things have to be so hard? If this is what happens when you have one system, one user, one app, imagine as things scale, as you go into real production: you bring in multiple systems which are being combined by the application, you bring in multiple users, you bring in multiple applications, and things just become more and more challenging.

So what we have been working on is something which I’m pretty passionate about. I’ve been working on this for almost a decade now: can we have a world where, if I’m a user, if I’m a developer, if I’m an operations person, I have an app, which might be a simple Spark query, and I have a goal like you saw. Maybe the goal is to make it fast. Maybe the goal is to meet an SLA. Maybe the goal is to cut down cost while meeting some time in which the app completes. Can I just declare, “Here is my app, here is my goal?” And can a system basically tell me, “Hey, okay, for your app and for your goal, I have a configuration that can run it 30% faster [00:05:00] than what you have right now,” just immediately based on something.

And maybe then I go check email. And within some time it tells me, “By the way, I’ve been working on your behalf. I found another configuration that can run the app 60% faster.” I go for lunch. I come back. And the system tells me, “Even better. I have actually found a configuration that is 90% faster.” Well, great. I can take that. I want my app to be faster. I’m happy. I can go build my next app. Is this like virtual reality? Is this something that is far out? What I’d like to show you today is that if you actually focus on it and work on it with the right tools and the right focus, you can actually solve it. And what I’d like to show you is auto-tuning that we have enabled for a large class of applications. We’ll talk about Spark, and internally we call these sessions, with the idea of a tuning session, a session that helps a user get [00:06:00] her app to meet a goal.

So first, I’m just going to show you a video to kind of show you what this means, so then we can start drilling down into techniques and whatnot. So what you’ll see here is… I’m a developer. I have a whole suite of apps and there’s one app. I know you may not be able to see all the numbers here, but I’ll try to explain. That app, once it comes up on the screen, is an app that runs in around six minutes or so. You can see that it is Scala.

You see on the right side, it is a Scala Spark app, with all the things that you might be familiar with, like the jobs and stages and resource usage and all of that. All that information is there. Now under the covers, there is also that… you see that orange bar, like some recommendation engine is running under the covers, analyzing all of this data on my behalf and saying, “Hey, there are some recommendations I have to make this application better. I have a bunch of recommendations.”

Now, I’m a developer. I don’t necessarily [00:07:00] care about all of these knobs and things like that. So what I really want is, “Hey, great recommendations; auto-tune my application for me.” So the user clicks on auto-tune and then the system asks some questions like, “Hey, would you like to name your session as something so we can refer to it again?” So I name my session. I pick my application type, which is Spark. And then a very key thing that I’m going to emphasize is that applications are often tuned for a goal, and here I’d say my goal is efficiency. I just want to make it as fast as it can be, subject to the amount of resources available.

And then I might want to mention how the app can be run. It’s spark-submit, with the jars and the class name and arguments and whatnot, and I’m telling the system, “Don’t do more than three runs to figure out a good configuration.” And then I click on Add. So now our tuning session has been created. The system is doing some thinking and running some algorithms [00:08:00] on my behalf, trying to analyze all the data it has. This is a page that represents the system telling me, “On the right side, what you see is I created a session for you. I’m looking at some historic data. I’m trying to figure out what is a good configuration for this app.” Remember, it ran in six minutes.
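
To make that concrete, here is a minimal sketch of the information such a tuning session collects, based on the description above: a name, the application type, the goal, how to launch the app, and a probe budget. The field names and the spark-submit command are hypothetical illustrations, not the product’s actual API.

```python
# Hypothetical illustration of a tuning-session specification; field names
# and the launch command are invented for this sketch.
tuning_session = {
    "name": "daily-report-tuning",
    "app_type": "spark",
    "goal": "efficiency",          # could also be reliability or an SLA target
    "launch_command": (
        "spark-submit --master yarn "
        "--class com.example.ReportJob "   # placeholder class and jar
        "target/report-job.jar --date 2019-01-17"
    ),
    "max_probe_runs": 3,           # "don't do more than three runs"
}
```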

So the system does all of that under the covers, and you’ll see that very soon it’s able to come back and tell me, “Look. Hey, I found a configuration that can run this application in a minute.” So what is running in six minutes now like…some magic happened under the covers and there is a configuration that can run it in almost a minute. So I’m happy but I would like the system to keep trying. Maybe there is something even better. So it keeps trying on my behalf. And you’ll see here that sometimes the system might run an application on the cluster, just to test out a configuration, whatnot, and it does that. But basically, even after the run, it finds a configuration that still runs in the same amount of time. [00:09:00] The length of those green bars represents the time the application runs.

So it tried, but it was not able to find something that was faster. But let’s actually check it out. What the system did on my behalf is all of what you see on the right side. I’m looking back. Imagine there’s a screen behind me, but I guess it’s all around me. If you look at that configuration, Spark has tons and tons of knobs. So the system looked through all of those knobs, using some algorithm under the covers, and found a good configuration for me. I think that ran the app in one-sixth of the time. So this is an example of the kind of things we want to make possible. So, again, taking a step back and summarizing what we saw.
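
For readers less familiar with those knobs, this is the flavor of Spark configuration such a recommendation typically touches. The before/after values below are made up purely for illustration; the demo did not show the exact numbers.

```python
# Illustrative only: typical Spark knobs involved in this kind of tuning,
# with invented baseline and recommended values.
baseline_config = {
    "spark.executor.instances": "4",
    "spark.executor.cores": "2",
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "false",
}
recommended_config = {
    "spark.executor.instances": "10",          # more parallelism for heavy stages
    "spark.executor.cores": "4",
    "spark.executor.memory": "6g",             # less memory per executor, better packing
    "spark.sql.shuffle.partitions": "400",
    "spark.dynamicAllocation.enabled": "true",
}
```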

There are two dimensions to this auto-tuning: there is an application part and a goal. And it’s very important to understand that the application might be a SQL query. It might be a Scala program. It might be an actual data pipeline [00:10:00] that consists of multiple stages, maybe multiple individual Spark apps, like an Airflow workflow or an Informatica workflow. Or it might be an entire workload, all the jobs, all the Spark applications that run on my Spark cluster that has been brought up on the cloud. At the same time, the other dimension is the goal, because it’s not the case that users are always interested in making things faster and faster.

On the cloud, cost is an important perspective. You want the right price-performance tradeoff. Maybe you can run the app in 10 minutes and it costs you like $1,000 but maybe you can run it in 12 minutes and it might cost you $500. So like there are all of these different tradeoffs you want to make but at the same time, what we really want to build is a solution that can actually take my application and my goal and really get me there because I don’t want to spend time fighting the system or trying to tune it. I would rather build my application.

So one more quick video, just to illustrate another major challenge [00:11:00] that you might likely face. So this is an application where… let me play this video if it comes up. So making things faster is one thing. But I’m sure a lot of you spend a lot of your time dealing with failed applications. And here’s a Spark app in that orange-colored view. It failed. So when it fails, your first and foremost goal is just, “I want to make it run.” You’re not thinking about making it faster. You’re just thinking about, “How do I make it run?” So here you can see the same UI that I presented, where I can pick my application. In this case, it is an ETL pipeline. An ETL pipeline that fails at 2:00 a.m. is the worst thing; you have to wake up and fix it.

But my goal here is reliability, because things like ETL pipelines might run with X amount of data today, and tomorrow they might process 2X the amount of data. So making things reliable, so that they can handle all of these different changes in data sizes or contention in the cluster, becomes very important. And in this case, once the session [00:12:00] comes up, you’ll see that my app has failed. Now the auto-tuning solution, the algorithms, are running under the covers, and its first goal is to find a configuration that runs this app successfully. So this app failed in around a minute, and then the system is trying to find a configuration and it comes back with a verified configuration, verified as in it was able to run the app successfully. You’ll see, once it comes up, that it took around 10 minutes.

So the algorithm in this case is first trying to find something that runs, and then it continues to look. This time, what it’ll do is, “Hey, I have a configuration that runs, but now let me find a configuration that is actually efficient as well.” So if you look at this… let me quickly go through it. I’m sure it’s a little bit hard to see, but all of these configurations are the same configurations I was dealing with before, when we were trying to make things faster. [00:13:00] But here we are trying to make it reliable. And that often ends up being a different algorithm. Because if you want things to be reliable, you’re okay with some inefficiency, as long as the memory sizes and everything can actually handle the app at the different data sizes and whatnot.

One more quick demo, and then I’d love to tell you about the algorithms that are actually happening under the covers. This is an app which I picked up. It is a pretty complex app. It is SQL, but with a lot of Scala and different statements in it. A pretty complex app. So for this app, what I found pretty interesting is that, if you just look at this for a minute, let me pause right here. Actually, okey dokey. So what is interesting about this app is, well, it ran in around four minutes.

But the algorithm, from an efficiency perspective, was really able to look through all the configurations and find a configuration. If you’re focusing on where [00:14:00] that arrow is, the original run was taking a certain amount of memory, in this case 32 gigs of memory-seconds, because it’s memory usage over time. But the new configuration was able to cut that down in half.

Now, think about the cloud. In this case, the application was able to run in half the amount of time and it used half the amount of memory. That immediately means that if you had a pretty large configuration, a big VM instance, but the new configuration fits a smaller one, you can run the app faster even on the smaller instance, and that means reducing a lot of your cost. So a lot of things are possible. And how does this work? The entire sessions concept is really based on AI. I know you must hear in a lot of places that AI and ML are being applied. But this is indeed a problem where, without some sort of automated decision making and automated learning, it is impossible to solve.
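
As a back-of-the-envelope illustration of that point (with made-up numbers, not the exact figures from the demo), cloud cost scales roughly with memory reserved multiplied by time, so halving both runtime and memory cuts the memory-seconds to a quarter:

```python
# Illustrative arithmetic only; the numbers are invented.
def memory_seconds(memory_gb, runtime_s):
    """Rough proxy for cloud cost: memory reserved times how long it is held."""
    return memory_gb * runtime_s

before = memory_seconds(memory_gb=32, runtime_s=240)   # original run
after = memory_seconds(memory_gb=16, runtime_s=120)    # smaller instance, faster run
print(before, after, after / before)                   # 7680 1920 0.25
```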

So overall, [00:15:00] what does the sessions architecture look like? I’m a user. I have my app and a goal, and I have maybe a cluster, or it might be a cloud environment where clusters can be spun up on demand. The crux of all of this is what we call a probe algorithm. And you’ll see this term; probe means something. Probing is all about this algorithm that’s always looking at the data it has and trying to figure out a configuration which can get the user as close to the user’s goal as possible. But that algorithm is just one part of a much larger puzzle, where sometimes, as we saw, the algorithm might have to run an application on the cluster.

And we have something called the orchestrator that is in charge of taking this configuration and trying to run it on an on-prem multi-tenant cluster. Imagine running an app on a multi-tenant cluster and grabbing information about it. So that is a major challenge. But as this orchestrator runs these apps, collects information, we end up with a lot of historic information about apps and previous runs [00:16:00] of the apps as well as new information that comes in via probes. Sometimes people call these experiments. I’m sure you’re all familiar with A/B testing and whatnot. And there’s some similarities to that entire concept right here.

Another very important concept that we realized as we were working on this algorithm is that you have to keep the user in the loop, because a lot of the time, users will start off with a goal, which might be, “I want this app to run in a minute.” And then they realize that, for that, the cost of running it on the cloud is very large. And then they might say, “Oh, no, I don’t need a minute. My SLA is 10 minutes.” So as the algorithm collects more and more information, it’s very common for users to move the goalposts, basically. And the algorithms have to be resilient to that. So let’s drill down a little bit more. Now that we have seen overall what these algorithms are supposed to do, let’s take a look at what they are really doing under the covers.

So let’s take Spark as an example. Spark has all of these different configurations, from container sizing and cores and memory and broadcast joins and skew and this [00:17:00] and that and whatnot. What all of that means is there is some sort of a response space. So I’ve shown here a two-dimensional space. The X and the Y axes, just imagine them as two parameters. They could be your driver cores and executor cores, for example. And performance is some surface, which is unknown. And any point, any setting that you pick, is a point in this N-dimensional space, and it has some performance that it gives. And the challenge, of course, is that the surface is unknown.

If the surface were really known, you can have a good search algorithm that’ll find those blue regions. And in my pictures, I’m going to assume that the performance is some sort of running time, so lower is better. And you kind of want to find a setting in that blue region, not in the red ones, to get good performance. So that’s the challenge. So we can actually apply many different types of algorithms. So, as I mentioned, there is a lot of interest. And this problem [00:18:00] is a nice match for AI and it ends up being a very nice match for reinforcement learning. The same algorithms that are being used to create your self-driving cars, similar algorithms actually work here.

But what is also important to know is this is not a new problem by any means. In the entire manufacturing, automobile, chemical industry, there is an area called Response Surface Methodology, RSM, where they are dealing with this problem. So it’s not a new problem where AI algorithms have been the first to go after it. It’s also something where algorithms, data mining algorithms have been around for a while. As I mentioned, we have worked on it for a long time because it’s by no means an easy problem. So I’m just going to mention some of the papers that we have written on it. I’ll just give a high-level summary of what we have done here. You’re welcome to either ask questions after the talk or read them.

So the problem, again, is that there’s this unknown response surface and we’re trying to find a good configuration [00:19:00] that meets the user’s goal. I’m going to talk about one algorithm here, which is one of the simpler ones, but you’ll see how to go about solving this problem. So you can model this response surface using a regression model, and then the residuals can be modeled in different ways. What I’m going to show you is modeling the response surface as a regression model, and then the residuals as a Gaussian process. And you’ll see why picking a Gaussian process is important. But really, what the Gaussian process captures is that, at any point in time, we have some knowledge of the surface, and the uncertainty in our knowledge of the surface is what the Gaussian process captures.

So what this probe algorithm literally has to do now is, with the uncertainty and the knowledge it has, identify the point, the configuration, to try next, or to get more [00:20:00] information about, that, in a sense, will give it the maximum improvement, an improvement towards the goal. And if your goal is trying to make things faster, so basically reduce a running time, then this improvement function works like this: at any point you have a predicted performance and a density function, and with respect to your current best-performing point so far, the gap between that and the performance here is the improvement, and you can do the simple math around it. Basically, you combine the density function and the improvement and come up with an equation for the expected improvement, because everything is uncertain, of doing a probe at a particular configuration.
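
For reference, a standard closed form for that expected-improvement quantity when minimizing runtime under a Gaussian-process posterior is the one from the Bayesian-optimization literature; the cited papers may formulate it somewhat differently:

\[
\mathrm{EI}(x) \;=\; \mathbb{E}\big[\max\big(y^{*} - Y(x),\, 0\big)\big]
\;=\; \big(y^{*} - \mu(x)\big)\,\Phi(z) \;+\; \sigma(x)\,\phi(z),
\qquad z = \frac{y^{*} - \mu(x)}{\sigma(x)},
\]

where \(y^{*}\) is the best (lowest) runtime observed so far, \(\mu(x)\) and \(\sigma(x)\) are the posterior mean and standard deviation at configuration \(x\), and \(\Phi\) and \(\phi\) are the standard normal CDF and PDF.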

And, again, I don’t have the time to give you all the details here; I’ll show you some examples of how things work. But picking a Gaussian process is one choice; you can pick many other [00:21:00] things as well. What a Gaussian process actually enables you to get is a closed form for this expected improvement function. The details are in the paper. But let’s take a step back and see, if you have all of this, how can this probe algorithm actually work? Initially, once the user gives you the app and the configuration, maybe there are some runs of the app like I showed you. So you have some examples of “Here’s the configuration and here’s the performance.” You can think of that as bootstrapping a set of data points.

And what the probe algorithm will keep doing is, based on the information it has so far, pick a configuration where the expected improvement is maximized. And it keeps on probing. It gets more data. It keeps on running this until some stopping condition is met. The stopping condition might be, as I showed in my examples, “Do no more than three runs,” because that’s all my cost budget for tuning this app. Or it could be, “Do these runs until [00:22:00] diminishing returns start to kick in; I’m not seeing any better improvement with the resources I have.” So there can be different stopping conditions. But that’s roughly how things work.
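
Putting the pieces together, here is a minimal sketch of that probe loop, written in the style of standard Bayesian optimization. run_spark_app is a hypothetical stand-in for the orchestrator that actually launches the app with a configuration and returns its runtime; the production algorithm described in the papers is considerably more elaborate.

```python
# A minimal probe-loop sketch; run_spark_app and the candidate grid are
# placeholders, and the real system's models and stopping rules differ.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_runtime):
    """Closed-form EI for minimizing runtime under a GP posterior."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)               # guard against zero variance
    z = (best_runtime - mu) / sigma
    return (best_runtime - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def probe_loop(run_spark_app, candidates, bootstrap_configs, max_runs=3):
    """Bootstrap from known runs, then probe where expected improvement is maximized."""
    X = [list(c) for c in bootstrap_configs]
    y = [run_spark_app(c) for c in bootstrap_configs]    # historic / initial runs
    for _ in range(max_runs):                            # stopping condition: run budget
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(y))
        ei = expected_improvement(np.array(candidates), gp, min(y))
        next_config = candidates[int(np.argmax(ei))]
        X.append(list(next_config))
        y.append(run_spark_app(next_config))             # the probe: an actual run
    best = int(np.argmin(y))
    return X[best], y[best]
```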

Let’s actually look at an example. So here’s the response surface I was showing you; imagine it is projected onto a single dimension. It’s very hard to create all of these pictures in multiple dimensions. So let’s say there’s only one configuration parameter, maybe the container or executor memory size, that I’m trying to tune, and this red surface I’m showing is the true response surface, which is unknown. At this point in time, I’ve done five probes so far. So there are five configurations for which I have real data on what the real runtime looks like. Now, based on this, I can do my regression using any sort of model. The green line represents my current best approximation of the response surface based on the probes I have done so far. And at the same time, I can also come up with the uncertainty. [00:23:00]

If you see those bars, they represent the uncertainty. So between points 4 and 6, the uncertainty is pretty high because my probes are far apart; I haven’t explored that region. But between 6 and 8, the uncertainty is less, so the bars are much thinner, or less tall. So with that, I can combine this, use my model, and plug in that expected improvement. And you’ll see that the bottom blue line represents, for any point where I might do my next probe, what the expected improvement looks like. There are two places, one between 4 and 6 and one between 10 and 12, where that improvement function is large, and I’ll pick where it is maximized. So I’ll do my next probe between 10 and 12.

Let’s stay with this for a little bit. Why were there two peaks? We quickly realized that this is nothing [00:24:00] but the fundamental tension between exploration and exploitation. Between 4 and 6, I hadn’t done probes; my probes were far apart. So there, the algorithm is telling me, “You need to do a probe there to reduce the uncertainty, because maybe the best setting is over there.” And between 10 and 12, you see the best-performing point so far, again, lower is better on the Y-axis. That’s where my best point is. So exploitation is pushing me towards, “Do a probe there, because maybe the best performance is over there.” So there’s this tension between exploration and exploitation, which underlies so many problems.

You must have heard of the multi-armed bandit problems and things like that in AI. That is exactly what is happening here. One extreme is the data-starved regions where I don’t have performance points, where I need to do exploration. And in the regions where I do, I need to exploit. And that, in a sense, is the essence of the probe algorithm. You can slice and dice it [00:25:00] in multiple ways, but there’s this fundamental tension. Now, of course, as I mentioned, the algorithm is one part of a much larger puzzle. But this is the key thing that is driving where to do my next run.

Well, I gave an example. Is that it? We have one algorithm; is that going to solve this entire problem? If you take a step back and look at the overall problem I presented, there are many, many additional elements here. One is: what is the goal? Is the probe algorithm that is best for making an app faster the same one that is best for making the app more reliable? Or is it the same algorithm that’s best for meeting an SLA? So based on the goal, you start to think about different algorithms that you might have to develop. And then there’s the application itself.

It might be a SQL query. It might be an iterative machine learning app. So application patterns matter; once you recognize a pattern, maybe that can play into the algorithm. Or, [00:26:00] this is a world where, daily, companies are running hundreds of thousands of Spark apps. So there’s tons and tons of historic data. Not to mention, a lot of the time you’re tuning pipelines which are repetitive, so you might have historic data. Can I use historic data to make this probe algorithm better? What about ad hoc apps? Although people talk about ad hoc apps a lot of the time, if you look at apps other people might be running, they might be similar. Can I use information about that?

And another key question: how soon do you need the answer? Do you need it right now? Can you wait for a day? Can you give me a budget of $1,000 to tune the app? So a lot of interesting questions come up, and I’d love to take you through all the details, but I’m just going to summarize now what some of the key learnings are that we got out of all of this. And one of the key learnings is that there is no one-size-fits-all. Don’t even try to build one model that caters to everything. Take the overall problem and slice and dice it. So an ensemble approach, which I’m sure some of you [00:27:00] must have heard of, where you create different models that are good at different things. And one model might be a set of expert rules a lot of the time.

You’ll probably realize that, say, turning on dynamic allocation is a good idea. There are things like that; knowledge can be codified into a model. Or you could have a machine learning model. What I talked about is one example of a model that can, based on the information that comes in, and assuming no prior knowledge about what is happening, suggest the next probe. So create these different models, and at runtime, based on the information that’s coming in, the goal, and how you’re trending towards the goal, make dynamic selections. So this is like many years’ worth of learning collapsed into one slide. I know there’s not much detail here, but I’m really happy to talk about some of the details, and a lot of these things are in the papers I cited.
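
As a hedged sketch of what that ensemble-with-dynamic-selection idea could look like (all names here are invented for illustration and are not the actual system), a rule-based model and a learned model each propose a configuration, and a simple selector picks between them at runtime based on how much probe data is available:

```python
# Hypothetical illustration of an ensemble of tuning models with dynamic selection.
class ExpertRulesModel:
    def propose(self, app_profile):
        config = dict(app_profile.get("current_config", {}))
        # Codified knowledge, e.g. "turning on dynamic allocation is a good idea".
        config["spark.dynamicAllocation.enabled"] = "true"
        return config

class LearnedModel:
    def __init__(self, probe_history):
        self.history = probe_history          # list of (config, runtime) pairs
    def propose(self, app_profile):
        # In the real system this would be the GP-based probe algorithm;
        # here we simply return the best configuration observed so far.
        best_config, _ = min(self.history, key=lambda pair: pair[1])
        return best_config

def choose_next_config(app_profile, probe_history, min_probes=3):
    """Fall back to expert rules until there is enough probe data to learn from."""
    if len(probe_history) < min_probes:
        return ExpertRulesModel().propose(app_profile)
    return LearnedModel(probe_history).propose(app_profile)
```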

And there are similar learnings others have come up with as well. So that’s about the probe algorithm itself. Now, there [00:28:00] are other parts which have equally big roles to play. Because everybody I have seen who works on this problem really gets stuck on the probe algorithm, but it’s just one part. How do you get data? How do you do an experiment? How do you run an experiment on a multi-tenant cluster with other apps running? How do you run an experiment on the cloud where everything costs money? Those things are what the orchestrator has to deal with.

And one of the things where we have gotten some traction is leveraging these proxies, which can get you some data about performance. It may not be very accurate, but you can factor that in, and the best example is a cost-based optimizer. Cost-based optimizers don’t really try to predict performance. What they often try to do is say, “Between this plan and that plan, this plan is better.” So they’re able to come up with some relative estimates of cost, and that can be leveraged to cut out the need to do an actual run on a cluster [00:29:00] to collect information. Well, in Spark the cost-based optimizers are still catching up, so we have not been able to leverage that as much in the Spark world just yet. But in the SQL world, where there are better cost-based optimizers, the orchestrator can often avoid doing an actual run. It can still gain information and get the user to the goal quicker.

Another huge lesson that came up: any time you do a run, you should collect as much monitoring information about the system and about the application as possible. Because what ends up happening is that the bottleneck might be somewhere else entirely. We were dealing with an application literally yesterday where the real bottleneck was in the metadata operations, the rename operations that you have to do to create the final output. So there are these problems [00:30:00] where changing some configuration does not easily fix the problem; you really have to redesign something, you might have to change the data layout and things like that. Unless you have that information, you just will not get the user to the goal.

So don’t get hung up on, “I need to get the user to the goal through automatic configurations and runs all the time.” If the bottleneck is in something which cannot be fixed by a configuration, bring that insight out to the user so they can make use of the information and proceed. And that brings me to the next observation. You really, really, really have to keep the user in the loop.

Because, as I mentioned, one key reason is that when the user gives the goal, a lot of the time they start out pretty strict with the goal, and as more information comes in and they understand the performance, they can fine-tune the goal. That’s one key reason. Another reason you need to keep the user in the loop is that not every problem can be fixed by changing the configuration. [00:31:00] Sometimes you have to change the app. Sometimes you have to allocate more resources. Sometimes you actually have to change the data layout. So all of these things have to be incorporated into this entire mechanism so that we can actually build a usable system.

So don’t get lost in creating the best AI algorithm. Really think about whose problem you are solving and create something where the user’s problem can actually be solved. That brings me to two pieces of follow-up work that we are doing here, and we’ll be talking about both of these at the Spark Summit that Jules mentioned is happening in April. So we’d love to tell you more about these areas of work. One is, we’re creating a chatbot on top of all of this complicated machinery for predictions and whatnot, because a lot of the time it’s all about demystifying performance.

Tell the user in plain English what is happening and where the bottleneck is, [00:32:00] rather than getting lost in all of this while you’re trying to fix problems automatically. So as we took this out to more and more users, it became clear that you can do some things with the UI, and you can do some things with intelligent algorithms, but ultimately it’s all about simplifying the information, demystifying it, and presenting it to them in a way they can consume. There’s nothing easier than having them interactively explore. And there’s nothing that seems easier than having them type in some messages or questions on Slack and presenting that information in a consumable fashion.

So that’s one of the talks that I will be giving. I’m really excited about that and how it’s turning out. The other one: I talked about individual applications, a Spark query, a PySpark program, and whatnot. A lot of the time, what operations teams are interested in is not just individual apps; they’re interested in optimizing the entire cluster, they’re interested in cutting down costs. So focusing on individual apps is great, but can you apply these things at the entire workload level, at the entire cluster level? [00:33:00] And sometimes across transient clusters that come up and go away as an ETL pipeline runs.

So that is the second dimension, holistic cluster optimization, where we’re trying to apply this. As you can imagine, it’s much more challenging than dealing with an individual app, though there are some things that become easier in that world too. So these are the two things that we’re doing. And that is my last slide, everyone. So in summary, there are exciting opportunities to apply AI and machine learning algorithms to solve real problems that a lot of the app devs and operations teams here are actually facing. And we would love your collaboration. We would love your feedback.

We would love for you to try some of the things that we are developing. Please attend our Spark Summit talks. And last but not least, we really are looking for great talent, because these sorts of problems can only be solved by what I like to call a performance data scientist: someone who understands systems, who understands performance, who has faced these problems, and who has a knack for working with data and working with algorithms. So that’s it. Thanks, everyone. I would love to take your questions.

Moderator: Thank you. [00:34:00] Thank you for that wonderful insightful talk. We have got a few minutes. Any questions?

Shivnath: You can start here. Yes. Please go ahead.

Moderator: Where?

Man 1: Right here.

Moderator: All right.

Man 1: In your talk, you’ve done a great job automating the configuration of Spark for my application. But that’s a bit of an untrusted-user model, where the system has taken care of the optimizations. What if one of my stages has inherent skew? What if there’s a join mid-stage that’s causing the performance problem? Do you have the capability to point that out to me, as a trusted data scientist who could understand performance? I didn’t see it here. Do you have that kind of capability?

Shivnath: Terrific question. And that’s exactly what I was trying to mention when I was telling you about some of the follow-up we’re doing. So the point is that sometimes, and this is not just sometimes, a lot of the time, the problem is not something you can fix easily [00:35:00] by changing the configuration. You have to change your data layout. You have to provision a new type of nodes. You have to solve the Hive metastore becoming a bottleneck. So in these cases, what we do is think of the entire problem as: can we bring all the data about the performance together in a way that is easy for the user to consume? Sometimes just through plain English, just telling them, “Look, the Hive metastore is becoming the bottleneck.”

So there’s nothing you can do in your configuration, unless… well, sometimes you can, by changing the number of partitions, but a lot of the time you can’t. So we do do that. We think of the entire performance management problem as multiple stages, stages where, if there was no automated system, a user or some performance architect would have to solve it. And what do they do? They first collect all the data and bring it to a single place, and that itself has a lot of value. Then they iterate: do some analysis, find bottlenecks, find the data skew that he mentioned, which can all be [00:36:00] simple algorithms, no AI and machine learning needed; they’re mostly straightforward, classical analyses.

But then can we do two more steps? One is give them recommendations, where you start predicting something and take that leap of faith, and then auto-tuning. So it’s really like a pyramid, where some problems can be auto-tuned; for a larger space of problems, you can give recommendations; for an even larger space, you can give insights like the data skew; and for an even larger space, just bringing all the data together and making it easy to consume actually solves it. So that’s how we think. That was a great question. Thank you.

Moderator: Any other question? That side.

Man 2: A very interesting talk, thank you. So I noticed that you use kind of a Gaussian method to estimate the uncertainty, and you also kind of assume a continuous surface for the general performance shape. In your experience, is that a safe assumption to make, especially when you deal with some discrete values like numbers of executors and things? [00:37:00]

Shivnath: Excellent question. So in all those response surfaces I was drawing, I made it seem like it’s a nice continuous surface. And what the question is about is that sometimes you have these Boolean parameters or discrete options where there could be big discontinuities. So, long story short, with Gaussian processes we have had good performance, reasonably good convergence and whatnot.

The best thing is it gives a nice closed form that makes the overall algorithm much easier to track and, more importantly, to understand, whereas other approaches don’t do as well there. And there are ways to deal with the discontinuities. Sometimes you deal with them in the regression model part, and then the Gaussian process can deal with the residuals. So it’s an interesting conversation. I’m happy to point you to some research we have done, but it’s something where we are continuously exploring and trying to make things better.

Moderator: Thank you. We got a question here.

Man 3: Yeah. Is there any overlap with Apache Calcite? And [00:38:00] are there any recommendations to scale out horizontally?

Shivnath: So the answers to both of them are yes and yes. Calcite is what I was kind of referring to here, with Catalyst and the optimizers. So you can use Calcite as a rule engine to guide things. But where we have had more traction is that, sometimes, instead of doing an actual experiment, you can use the performance model that is embedded in some of these optimizers as a proxy. So that is the first part. And, of course, the answer to the second question is auto-scaling. A lot of the time, finding the right number of machines, the right instance type to meet a performance goal, is challenging, and you can solve that with these methods. And we have done that too.

Moderator: Any other question?

Man 4: Thank you. Assuming that auto-tuning is a constant exercise that has to be done all over again: the processed dataset is [00:39:00] changing, so the surface has actually changed, or I changed the data layout in my app. So how could I give all the data points in your surface less weight, to start re-exploration of that space, if the underlying data or parameters have changed?

Shivnath: Great. That’s a terrific question. So what the question is asking is: is this response surface static? Will it change? And what is the application we are talking about? Is the application itself changing? So a lot of these changes, yes, have to be dealt with. This is why we actually call this a session. You’re sort of putting an app into what we call a session.

So the session itself can monitor, “Hey, what I was recommending has now become far from optimal,” and try to detect that, because something is telling the session, “This is my app that is running again and again.” So you can apply those learning algorithms. And sometimes some change [00:40:00] that was done to the application changes the performance entirely, like a join algorithm changing entirely, and you have to redo it from scratch. So all of those are real challenges that you have to solve there. Luckily, a lot of the time, especially since this was needed for applications running in production, the tuning is done once when things are moved into production, and the only two things that change are contention on the cluster and the size of the data. And that we have some ways of handling.

Moderator: Okay, that’s all the time we actually have for the questions. Shivnath is gonna hang around. So if you have any questions after the second talk, feel free to hang around. Thanks a lot. Give a big hand to Shivnath.

Shivnath: Thanks.