
An AI Powered Chatbot to Simplify Apache Spark Performance Management

By: Shivnath Babu

On April 25, 2019, Unravel CTO, Co-founder, and Duke University Adjunct Professor Shivnath Babu presented the session "An AI Powered Chatbot to Simplify Apache Spark Performance Management" at Spark + AI Summit in San Francisco, CA. You can view a video of the session or read the transcript below.


Transcript

Shivnath Babu:

Thanks, everyone. Welcome to the session. We would like to tell you about how you can simplify Spark performance management using chatbots. A little bit about myself: I'm Shivnath Babu, co-founder and CTO at Unravel, where we work on the problem I'm going to tell you more about. Very specifically, when you're running big-data applications, maybe AI/ML applications, maybe BI applications, maybe IoT applications, you're running them in production at scale.

A lot of challenges can arise: your application might be slow, or its performance might be different today than it was yesterday. We have created software that can help you simplify and deal with all of these problems, and we'll see some of that in the talk today. I'm also an adjunct professor of computer science at Duke University, where I've been working on this problem, especially automated performance management for different kinds of systems, for a long time.

So we're gonna start with the real question of, what is a chatbot? In this day and age, when you know Amazon Alexa and you know Siri, chatbot is probably not a new term for you. A chatbot is really a program that can have a conversation with humans, or even with other chatbots, through text or audio. Many of you may have heard of chatbots that are interesting, fun, entertaining, but chatbots are also making a real difference in society.

Here are some examples, pulled from a ranking of chatbot popularity. There are chatbots like Swelly, which is for social voting; chatbots like Einstein, which, when you're taking an educational course, helps you stay the course and keeps you motivated; chatbots that help you find the right gift if, for example, you're into Legos, where it often becomes pretty hard to buy the right one; chatbots that are helping with reproductive health; and chatbots like Mitsuku that will really fool you into thinking you're chatting with a human.

There are lots of interesting chatbots. Here are a few in the commercial space that are making a big difference. One is TOBi, a chatbot created by Vodafone. They created it to solve the following problem: they noticed that a lot of customers were coming to their website looking for phones, or very specifically SIM-only plans, but not converting. So they created a chatbot that can understand that particular need and help customers through it.

And the chatbot became really, really popular; it could drive more than 200 percent more conversions in half the time. So TOBi is a chatbot that's made a big difference. Another one is Zara, from Zurich Insurance. Zara handles 20% of all the customer service requests and interactions that happen with Zurich Insurance. So these are two examples of chatbots, one in the e-commerce space, one in the insurance space, and in things like customer service, or in converting customers and generating more revenue, they're making a big difference.

And one of my favorites is this chatbot called Woebot. It is not in the commercial space. It's a chatbot that you can chat with; it tries to identify your mood and uplift it, and it tries to keep you interested and entertained. What was really interesting about Woebot, which was created by researchers from Stanford, is that in the very first week after it launched, this Woebot, which is essentially a therapist chatbot, talked to more humans than a human therapist talks to in an entire lifetime.

So this is an example of how chatbots are not just cool technology but technology that you can scale; there is no limit to how much you can scale it. Terrific, interesting technology. So chatbots are great, but this is a talk about Spark performance. What is the connection between chatbots and Spark performance? I'm sure a lot of you here are happy Spark users, and if you've been creating Spark applications you want to be this very happy Spark user. Who's a happy Spark user? Somebody whose applications are running really fast, who is very happy because you can build very interesting applications, ML, AI, IoT applications, everything, with all the primitives in one platform.

You can have streaming functionality, you can have graph processing, you can do IoT, and you can do SQL and BI, everything in one platform. Not to mention there's a rich ecosystem around Spark. So great, the happy, happy Spark user. But if you have been writing Spark applications, I'm sure a lot of the time you have been, unlike in the previous slide, a frustrated Spark user. And who's a frustrated Spark user? Hey, I created an app that's supposed to be running very fast but it's slow; it's running much slower than what I used to get in my MPP database. Or that application I created is just failing; I have no clue why it's failing and no idea how to make it work.

Or you might be running on the cloud, on Amazon, or Azure, or Google, where there are a lot of instance types, and you have no idea which instance type to use for your workload. Or the one that we keep hearing a lot: "My cloud costs are going through the roof. I'm getting charged so much and I have no idea why my application is costing this much."

So here is an example, if you have, again, been in the Spark world. I'm sure when you created your really nice and shiny Spark application and ran it, it failed. And what did you get? Five levels of nested stack traces, very hard to figure out what is wrong and how to fix the problem. Now wouldn't it be great if, as this frustrated Spark user, you could just go and ask a Spark chatbot, "Hey, my app failed. I have no idea why." And what would the chatbot say? "I know that sucks, it's really bad. But hold on, let me try to help you out. Let me take a look." And then it says, "I see the problem. Don't look at those deeply nested stack traces; I found out that your executor was running out of memory."

And by the way, it's gonna get even better. It'll say, "Set your executor memory size to 12 gigs and it'll run. Not only that, I've actually verified it for you, and here is a run that you can see." What will the user say? From being that frustrated user, it will leave them happy: "Well, thanks. You're awesome." This is an entirely new world: instead of dealing with stack traces, and UIs, and all of that stuff, you're dealing with an expert, an AI-driven expert in the box…oh, sorry…in the bot. So that's what I will be telling you more about.
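
For readers who want to see what applying such a fix might look like, here is a minimal sketch in PySpark. The 12-gig figure is the one from the example above; the app name and script are hypothetical stand-ins.

```python
# Minimal sketch (not the chatbot's actual output): applying the suggested
# fix in PySpark. "ceo-report" is a hypothetical app name;
# spark.executor.memory is the standard Spark setting for executor heap size.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ceo-report")
    .config("spark.executor.memory", "12g")   # the chatbot's recommendation
    .getOrCreate()
)
# The same setting can also be passed on the command line, e.g.:
#   spark-submit --conf spark.executor.memory=12g my_app.py
```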

We have built such a chatbot, and just like all the other chatbots, including Woebot, this Spark therapist chatbot, as we'll show you, can help you have a good time: you become more productive in creating applications and in ensuring that your apps are reliable, and at the same time it saves you time and money, which comes with being more productive. So let's take an example. I'll take a few examples and lead you through how such a chatbot can make a difference in your applications. A typical data engineer creating Spark pipelines: my app is too slow, and what she really wants is to make it faster, and the sooner she can make that happen the better.

And what does she do today? Today the world looks like: go to the Spark UI, or the YARN resource manager, or the Spark standalone page; look at metrics, look at logs, look at stages, and jobs, and problematic jobs, and data skew, and this, and that, and whatnot; and it's just a mess. Let's see if there's a better way.

Okie-doke, so some of you might have seen this interface before. It is Slack, one of the more popular chat tools out there. And I'm the Spark user; let's say my app failed. I go and ask the chatbot; let's see if it's up and running. So I said, "Hi, chatbot." And the chatbot is named Unravel here. So this is the…okay, give me a second here, if I just end that show. This is better. So this is the Slack interface, a nice, great chat interface. I'm sure a lot of you must have seen it and might be using it in your company.

I go in: "Hey, chatbot." It's named Unravel. "Hi." So it's saying that…let's see if we can increase the size a little bit. It's the Unravel chatbot, the chatbot we have created for application and operations management for data applications: "How can I help you?" "Well, Unravel, how can I make my…my CEO report app faster?" So the chatbot has said, "I can help you with that." Then it's asking me, "Is that the application that you have been running as applications 163 and 176?" Yes, that is what I was doing. "Okay, Unravel, yes."

So it's asking me another question: what is the goal? Again, you can tune an application for different things: you can make it faster, you can make it resource efficient, you can try to cut down on the cost the app incurs on the cloud. It's asking me, "What is the goal? Is application speed-up your goal, that is, would you like to make it faster?" You bet, yes. Okay, application speed-up is indeed my goal. Well, Spark applications can be made faster by tuning things like the container sizes, the partitions, the join algorithm, many different things.

So basically it's telling me, "I'm able to tune these things. Do you want me to verify?" Yeah, I don't really want to go in and try to change all of these things myself; just make my app faster for me. So now it's asking me, "How are you running the application? Enter the spark-submit command used to run this application." So basically I run my application using spark-submit, using a command like this, and I've given it the proper command.

So far, what have I done? I've told the chatbot what my application is. I'd been trying to tune it myself, so there were some applications I ran, and I told it, "Yeah, that CEO report is indeed those few applications." And then I told it my goal as well. So now it's telling me that it's created a tuning session called Application AutoTune; it's doing some work under the covers to help me tune my application. Let's see what it is able to come up with. So it's doing something which I will drill down into in just a bit.

So: the application, making it faster, trying to auto-tune it. Well, it came up with something. What it's telling me here is it found a configuration that can run my particular application faster, and that I should tune the sizes of these containers. For a lot of users, BI users for example who don't really want to know Spark, these things may not even make sense, but if you're familiar with them, these relate to sizing the containers, the memory, the CPU, the buffer you give so that you can run your PySpark applications, and the number of instances in that particular allocation.

Now it's given me a link, so let's go to that link. So I'm on that particular link, and what I can see here is the particular application that I'm trying to tune. And I have these runs of that application, the 163 and 176 that I told the bot about. And if I drill down into them, what you actually see is that I've been trying to tune this application. If you start drilling down into these runs, Spark has dozens of such configuration parameters, and I've been playing with them, trying to tune: I tuned a particular allocation, and I tuned the buffer settings, and all these join settings, and whatnot.

Still, I didn't get to good performance, since this app still takes more than two minutes to run, and I'm not happy with that. So the chatbot has told me that there's a better configuration that could make my application run faster. And what it is doing now is actually verifying that: because I gave it my spark-submit command, it can take that application and run it, verifying all of that good stuff. So I got some notifications: the verification is complete, and I can see the full results at that same URL. Let me refresh and see.

Great. So now it's actually able to run the application, and it's faster. From the worst run, which was taking 2 minutes and 21 seconds, it was able to cut it down to around 1 minute, 50 seconds, something like a 20% speed-up. So notice I didn't really have to do anything, just tell the chatbot what it was supposed to be doing for me, and everything got done, as simple as that. Well, great, chatbot, thanks.

So that was one full interaction with the chatbot, where I got a configuration that is verified and my app is now running faster. And this process can continue. Let's take a step back and see what really went on under the covers. What actually happened? From a very high level, this is what happened: I told the chatbot, in Slack, what my application and my goal were. Some natural language processing happened to convert my request from text into something structured; under the covers I used something called DialogFlow, which we'll drill down into later. And then all the magic happened in the bot's backend.

And what is that magic? That magic was really about taking the app and the goal, running some sort of recommendation algorithm under the covers to find a better configuration, and also validating that by running it through an orchestrator that can run this application on my multi-tenant cluster. In this case it was a development cluster on the cloud, and based on those runs you get more data, and this entire loop can help make my application as fast as it can be, all without any interaction from me apart from having given the application and the goal.

So let's drill down a little bit into what really happened. How did this magic happen? Spark has dozens of configuration parameters, like we looked at, and these are the things I was trying to tune myself before the chatbot came and helped me out. And what this translates into is a nasty-looking response surface, a performance response surface: as you change these parameters, performance changes. If you take performance as the running time, you want the running time to be as small as possible, so on the surface you want to find the configuration that lands in that blue region. And all those black dots you see on the surface are where some manual runs or other interactions told me what the performance at those points is.

Now the challenge, of course, is that this entire surface, for any Spark application, is unknown. And it's all about getting to the best configuration as quickly as possible, where best might be a function of whether you want to make it faster or more resource efficient; it can vary. So here I'll tell you a little bit more about the underlying AI driving the algorithm. You can model the surface based on your best understanding and the data available so far: a regression model captures the surface, and then all the residuals, the parts that cannot be captured by the regression model, can be captured using a Gaussian process.

And what the Gaussian process intuitively does is capture the uncertainty in the surface based on all the information you have available so far. It's trying to tune my application: I had some runs that I'd already done, and maybe some other devs had too, or this might be an application that runs repeatedly, so there might be some historic information available. Along with that, I might be doing additional runs. So with some interesting mathematical machinery that happens under the covers, you can model the essence of the problem as always looking for the configuration that will give you the maximum improvement over the best available performance so far.

And I don't expect you to really understand this equation, but the essence here is that x-star is the best configuration you have available so far and you want to go to another configuration that is much better. So the improvement is the difference between the known best performance and the expected performance of a new point, and the density function used here is based on the uncertainty you have. And the best thing about the modeling I told you about, the regression surface and the Gaussian process, is that it helps you come up with a closed form for this expected improvement.
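
The equation itself was on a slide and isn't reproduced in the transcript. The description matches the standard expected-improvement criterion from Bayesian optimization; a hedged reconstruction in that standard notation, rather than the slide's exact notation, looks like this for minimizing running time:

```latex
% Hedged reconstruction of the expected-improvement criterion described
% above, for minimizing running time y: x* is the best configuration so
% far, Y(x) the modeled performance at a new configuration x, and mu,
% sigma the Gaussian-process posterior mean and standard deviation.
\[
  \mathrm{EI}(x)
    = \mathbb{E}\left[\max\bigl(y(x^{*}) - Y(x),\, 0\bigr)\right]
    = \bigl(y(x^{*}) - \mu(x)\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
  \qquad z = \frac{y(x^{*}) - \mu(x)}{\sigma(x)}
\]
```

Here Φ and φ are the standard normal CDF and density (the "density function" mentioned above); the closed form exists precisely because the Gaussian process makes Y(x) normally distributed at each point.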

So putting that all together, what it boils down to in plain English is: at any point you have a bunch of performance observations so far, based on history or the probes you've done. And as long as there are more resources available to verify and validate these runs, you're looking for the configuration that will give you the maximum improvement, and you keep doing that until completion. In my case I just told the bot, "Run once," and it gave me a configuration that, based on that, was the best.

So again, a quick illustration of what is really happening. I've simplified it to a one-dimensional surface here. The red line represents the surface so far; at any point we might have some observations about performance, and at the same time you also have uncertainty in all the regions of that performance surface you have not explored. With this Gaussian process modeling you can find that blue line there. If you look at the blue line, it has two peaks: one peak is close to where the best performance so far exists, between 10 and 12, and the other is where you don't have any observations about the surface.

This, in a nutshell, is the quintessential exploration-exploitation problem that often appears in AI: you want to exploit around regions where you know, from what you've seen and observed, that good performance exists, and you want to explore regions where you have no data. That, under the covers, is what we package into this probe algorithm and the recommendations we give, which is what really went on in the bot's backend. So in five minutes I've tried to summarize multiple years of work to extract out and simplify away all the trial-and-error performance tuning that users have to do, into something where you just ask a question, the algorithm runs as a service in the backend, gives you a better configuration, and works with you toward your end goal.
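
To make the probe loop concrete, here is a minimal sketch of the idea under stated assumptions: a Gaussian-process surrogate fit to a few observed runs over a made-up two-dimensional configuration space, with expected improvement choosing the next probe. This illustrates the technique, not Unravel's implementation.

```python
# A minimal sketch of the probe loop described above, not Unravel's code:
# a Gaussian-process surrogate over a hypothetical 2-D configuration space
# (executor memory in GB, shuffle partitions), with expected improvement
# picking the next configuration to try.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_runtime):
    """Closed-form EI for minimizing runtime, from the GP posterior."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)                 # guard against zero std
    z = (best_runtime - mu) / sigma
    return (best_runtime - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Observed runs so far: (memory_gb, partitions) -> runtime in seconds.
# 141 s is the 2m21s worst run from the demo; the rest is invented.
X = np.array([[4, 200], [8, 200], [8, 400]], dtype=float)
y = np.array([141.0, 120.0, 130.0])

gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# Score a grid of untried configurations; probe where EI is highest.
grid = np.array([[m, p] for m in (4, 8, 12, 16) for p in (100, 200, 400)],
                dtype=float)
next_probe = grid[np.argmax(expected_improvement(grid, gp, y.min()))]
print("next configuration to try:", next_probe)
```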

Happy to talk more about it, but I would like to show you one more interesting use case. Making application performance faster is all great, but at the same time another huge problem is that applications fail. What do I do? I need to get it running. And in Spark today, unfortunately, a lot of the time you end up with nasty-looking stack traces like this. Does the world have to be this way? Let's see if we can make it better. Back to my chatbot.

Let's check this out. Okie-doke, so I tuned my application. Now I have another application: I ran PageRank on a data set and it failed. So we can ask the chatbot if my PageRank app failed. The chatbot is looking through the applications and saying, "Okay, the application PageRank, I can see that; with this ID, 14, that app actually failed."

So one of the things that most users will want to do is…let me get that. Okay, it gave me the app ID, great. Let me take the app ID and say, "Unravel, fetch errors for this app ID." So it's actually gone and done that. The application failed, and all these error messages appear in the logs: Spark has a driver log, and the executors all write their own logs. In a large cluster those logs can be really spread out, and I don't want to go clicking here and there trying to collect these error messages. So great, it fetched all the errors. Let's take a look.

So it's a text file with all the errors, and I want to see it in full, yes. I'm sure you've seen errors like this. So I'm the user, my app failed, and when I start looking at the errors, what am I seeing? "Failed while starting block fetches," a java.io.IOException on some particular machine, and some big stack traces. That doesn't make much sense to me. Let me look at the next one: another java.io.IOException, retries, and whatnot. Let's say I'm a BI user; all I see is these sorts of stack traces, one after the other, after the other, nasty-looking things.

So I'm totally clueless and blocked. What I really would like to do is go back and ask, "Hey, Unravel, I know this application ID. I just want to ask this question: why did the app fail?" So it comes back and tells me, "It failed because of insufficient memory allocated to the Spark executors." And it's asking me, "Do you want to tune the application?" And now in this case the goal is reliability: it failed, and I want to make the app more reliable. If I go back to those errors and search for "memory," then yes, hidden somewhere in that big nasty stack trace is the evidence that something went out of memory.

But what happens in existing systems is that a failure happens, it cascades into other failures, timeouts happen, and it becomes hard to figure out what caused the entire failure. That becomes the real challenge. So let's take a look at how these sorts of problems can be addressed, getting back to my slides.

Now you saw a much better way: I'm able to fetch the errors, and I'm able to ask why the failure happened. So something pretty interesting happened in this particular case. The app had a failure; there are all of these errors in all the container logs; we extract as much structured information as possible from these logs and convert that into some sort of feature vector; and then there is a predictive model that can take that input and predict the root cause of the failure. That's really what happened, and I learned that the failure happened because of executors running out of memory.

But to really create something like this, you'd need similar logs for millions of app failures, so that you can extract feature vectors from those logs. And if magically you also had root cause labels for all of those failures, then you could apply any supervised machine-learning algorithm to build this model. Something like that happened under the covers. So let's see how it's even possible. When you start looking at this, you start asking: where could these root cause labels even come from? If, whenever a failure happens, somebody manually diagnoses it, say a Spark expert, then they can come up with a root cause label.

But there's a much better idea, where we can automate this whole process by automatically injecting root causes. So what we did at Unravel is create an environment where we can bring up a Spark cluster, run different kinds of applications on the cluster, like ML applications, SQL applications, streaming applications, identify different kinds of problems that can happen, and write failure injectors that can inject those problems continuously.

So something like this: an application is running, let's say a Spark SQL application, and we inject a failure, which could be an out-of-memory problem. Once we do that, we can of course collect the logs and convert them into whatever structured representation we want, but we know the root cause label because we injected it. And this strategy works because, in most environments, 80% of problems are created by 20% of unique root causes, and out of memory is a very, very common one. So inject out of memory, running out of disk space, a slow node causing timeouts, a very slow Hive metastore; all of this can be injected.
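
Here is a toy sketch of that label-generation loop, with everything stubbed out: the injectors below just return canned error lines, whereas the real environment runs Spark applications on a live cluster and perturbs it. All the names here are hypothetical.

```python
# A hedged sketch of the failure-injection idea: run a workload, inject a
# known root cause, collect the logs, and keep the injected cause as the
# label. The injectors and workload runner are stand-ins, not real ones.
import random

INJECTORS = {  # root-cause label -> injector (stubs returning canned errors)
    "executor_out_of_memory": lambda: "java.lang.OutOfMemoryError: Java heap space",
    "disk_full":              lambda: "java.io.IOException: No space left on device",
    "slow_node_timeout":      lambda: "java.util.concurrent.TimeoutException: heartbeat",
}

def run_workload_with_injection(label):
    """Pretend to run a Spark app while the chosen failure is injected,
    returning the (simplified) log text it produced."""
    error_line = INJECTORS[label]()
    return f"INFO starting job\nERROR {error_line}\nINFO job failed"

# Generate (log, root-cause label) training pairs; the label is known
# because we injected the failure ourselves.
training_data = [
    (run_workload_with_injection(label), label)
    for label in (random.choice(list(INJECTORS)) for _ in range(1000))
]
```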

In this way you can generate a lot of training data, and this becomes the fodder for any learning technique you want to apply. So we have created a taxonomy of different kinds of failures, like what I showed in the previous slide. Failures fall into configuration error failures, data error failures, resource error failures, and sometimes things due to the deployment itself and misconfigurations. A pretty nice taxonomy, and that gives the labels.

Now there are a lot of ways in which you can convert the nasty-looking logs into more structured representations. One is to extract out just the key messages, like the errors and stack traces, and a lot of interesting tokenization can be done here: you can tokenize and, from that, extract out the out-of-memory expressions and whatnot. So pretty interesting things can be done here.

In the way we have actually done this, we have applied two kinds of techniques. From the logs you can build a document on which you apply TF-IDF, term frequency-inverse document frequency, one of the more traditional ways of converting a document into a structured feature vector representation. Or you can use one of the more modern ones, like Doc2Vec, which is pretty popular: it gives a multi-dimensional representation of each word, can capture relationships between words that occur together, and is built using a neural network.
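
As an illustration, here is a minimal sketch of both featurization routes on a made-up two-line log corpus, using scikit-learn for TF-IDF and gensim for Doc2Vec; the corpus, vector size, and epochs are assumptions for the sketch.

```python
# A minimal sketch of the two featurization routes just described.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logs = [
    "ERROR java.lang.OutOfMemoryError Java heap space",
    "ERROR java.io.IOException No space left on device",
]

# Route 1: TF-IDF weights each term by how distinctive it is across
# documents, giving a sparse feature vector per log.
tfidf_vectors = TfidfVectorizer().fit_transform(logs)

# Route 2: Doc2Vec learns a dense vector per document with a shallow
# neural network, capturing relationships between co-occurring words.
tagged = [TaggedDocument(words=log.lower().split(), tags=[i])
          for i, log in enumerate(logs)]
d2v = Doc2Vec(vector_size=32, min_count=1, epochs=40)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
dense_vector = d2v.infer_vector(logs[0].lower().split())
```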

So now, stitching the entire story together: I told you how we can generate logs from millions of failures where the root cause labels are known. You can convert those into feature vectors and feed them into any supervised learning technique. We have tried out a few, the shallower ones like logistic regression and random forests, as well as the deep learning ones, neural networks. Just one quick result.

What it's showing is one of our cases where we had around 14 unique types of failures injected. We generated a lot of such examples and split them into a training and test set; the test set was 25%. The accuracy numbers you can see here are pretty high, higher than 95%. So these techniques are very good at identifying problems that happen repeatedly: somebody's app failed for reason X, and the same failure might happen to my app today; this technique can very quickly identify that root cause. And we've also experimented with pinpointing when a totally unseen root cause happens.
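
A hedged sketch of that supervised step, on toy data standing in for the injected-failure logs: TF-IDF features feeding a logistic regression classifier, with the 25% test split mentioned above. Real training data would come from the injection harness, not a hard-coded list.

```python
# Toy sketch of the supervised pipeline: TF-IDF features plus logistic
# regression, with a 25% held-out test set. The hard-coded (log, label)
# pairs below stand in for the injected-failure training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

training_data = [
    ("ERROR java.lang.OutOfMemoryError: Java heap space", "executor_out_of_memory"),
    ("ERROR java.io.IOException: No space left on device", "disk_full"),
] * 100  # repeated so both classes appear on each side of the split

logs = [log for log, _ in training_data]
labels = [label for _, label in training_data]
X_train, X_test, y_train, y_test = train_test_split(
    logs, labels, test_size=0.25, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("diagnosis:", model.predict(["ERROR java.lang.OutOfMemoryError"])[0])
```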

Quickly: in our experiments, TF-IDF seemed to work slightly better than Doc2Vec, and logistic regression better than random forests. We also have some results with deep learning, but the real idea here is that you don't need fancy techniques; it's all about creating the right data to train your models. So I've talked a lot about the backend layer of the bot. Once again, at a high level, in all the examples I was showing you there's a messaging platform (I used Slack, but there are many others); from there, there's an NLP layer, which I didn't really talk about yet and will spend some time on now; and then the backend where, once the actions are known, you can do things like tuning applications from the chatbot and finding the cause of failures from the chatbot, as in my examples.

Now let's spend some time looking at that NLP layer, the natural language processing layer, which converts the text into a format that you can start applying algorithms to. At a high level, it looks like the following. When I went in and typed a query, "How can I make the CEO report application query faster?", the first thing that has to happen is to convert that into something from which you can extract the intent I'm referring to. The possible intents depend on the bot; in my case, I'm building a performance chatbot to help with performance and operations management.

So an intent could be I want to fetch all the errors, or I want to tune an application, or I want to set an alert. In this case the intent should automatically be identified based on what I said: if I'm asking how I can make the CEO report query faster, it is a tune-an-application intent. And for every intent there are usually parameters that you want to extract. So in this particular case, the natural language processing should take the intent, which is tune the app, and start looking for parameters like: what is the app name? How do I identify the app? Here "CEO report" identifies the app; that's the entity that has to be extracted.
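
As a rough illustration, once the NLU layer has done its work, the backend might receive something like the structure below and dispatch on the intent. The field names and handlers are hypothetical, not DialogFlow's actual schema.

```python
# A hedged sketch of the structured request the NLP layer hands to the
# backend, and a dispatch on the intent. Names are illustrative only.
parsed = {
    "intent": "tune_application",
    "parameters": {
        "app_name": "CEO report",   # the entity that identifies the app
        "goal": "speedup",          # vs. reliability, cost, ...
    },
}

def dispatch(request):
    """Route the structured request to the right backend algorithm."""
    handlers = {
        "tune_application": lambda p: f"autotune {p['app_name']} for {p['goal']}",
        "fetch_errors":     lambda p: f"collect logs for {p['app_name']}",
    }
    return handlers[request["intent"]](request["parameters"])

print(dispatch(parsed))
```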

And what is the goal? The goal is to make the query faster. So these things can be extracted automatically, and if you saw what I was doing, the chatbot then goes back and asks, "Hey, did you mean the application CEO report, which was run with these IDs?" or "Did you mean the application speed-up goal or the reliability goal?" So the chatbot can further confirm, and once this happens it's all about invoking the algorithms in the backend, like the auto-tuning algorithm I showed you.

So that's how all the pieces fit together in the chatbot. Once you put these pieces together, there are so many interesting use cases that can be addressed. An operations person would want to know: who are the top resource-wasting users in the cluster? If you start to think about that, you can convert it into an algorithm. You extract out what "top resource-wasting" means, and "users"; then the algorithm on the backend can take the data from last week or last month, look at the resource usage of different users, and, using a variant of the algorithm I described, identify which users are wasting resources.

A tuned version of their applications doesn't need anywhere close to those resources to run in the same amount of time. So you can start addressing use cases like that. Or: right now, which app is causing contention in the cluster? Those are more operational use cases that can be nicely solved using a chatbot like this. Or, if you're an app developer: my app is stuck. Is it stuck? Where is it stuck? This can be converted into taking a quick stack trace to see where the app might be, in which stage, in which particular line of code; maybe it's running a UDF. All of this can be nicely converted into interactions where I don't have to go to UIs and figure it all out.

Or even things like: kill this app if it consumes more than $25 on Amazon. The alerting and the action can be taken automatically, like in some of the examples I showed. The possibilities are endless, and it's really nice work. So, to summarize everything.

What I showed you is all those challenges that you have with Spark as an operations person or an application developer, which often give you the feeling that Spark is just so hard to use: it looks really great on paper, but once you start using it, it becomes pretty hard. To really solve those challenges, you would often wish you had somebody like Matei sitting right next to you, helping you out with all of these use cases; wouldn't that be terrific? And we can do that: we can use this entire bot idea to package Spark knowledge, and the ways in which you can make your applications faster, or more resource efficient, or whatever the goal might be, using AI, in a way that you consume very easily, just like the ways you are used to consuming many other apps.

It's a chat. It makes you more productive. It saves you time and money, not to mention that the overall experience of using Spark becomes so much better and applications become reliable. So that's all from my side. We'd love your feedback. A lot of the work we're talking about is still in progress, but we have a lot of it done, and I would love it if you could try it out and give us your feedback at unraveldata.com/free-trial. Not to mention, if you're really passionate about these problems, if you're passionate about Spark and distributed systems, and you want to apply AI and combine it all to solve real problems that can have an impact in industry and society, come talk to us. We're definitely hiring. Thank you.

If you have any questions, please come up to the mic.

Attendee 1:

Could you talk a little bit more about some of the open-source NLP tools that you used, and maybe even lessons learned? Did you find some tools didn't work well? Did you use any transfer learning with open-source packages? I'm just curious about some of the NLP things you did.

Shivnath:

Great question. The question was: in building chatbots, NLP, the natural language processing, is a major component, so what tools do we use, what experience do we have, what works and what doesn't? I've listed the main tool that we have been using, DialogFlow. It is software that has been developed over the course of time and is now pretty advanced. DialogFlow helps break down the overall problem into these three parts: extracting the intent, extracting the entities, and then connecting with an action.

DialogFlow is not open source. It's very easy to use, but on the other side there's this thing called Rasa. Rasa is fully open source, and it's actually a great tool if most of your work happens in Python, since it's built entirely in Python and integrates well there. That's the open-source side of things. It's a little bit more cumbersome, but pretty much all the functionality is there. So on the NLP layer, those are the two things that we have a lot of experience with, and the pros and cons become pretty evident as you start using them.

Now what we have seen, though, is that once you start really understanding your chatbot and what the intents are, you will sometimes realize that you don't need a very sophisticated NLP tool; maybe you can get away with even a simple rule-based engine and a parser (see the sketch below). So the key thing with all of this is to start by understanding the spec: what are the key intents your chatbot should handle? Use that to decide whether you need more emphasis on the NLP side or on the algorithm side. Once you figure out what you need on the NLP side, there is a great set of tools. We'll be publishing a blog on this soon, and I'm more than happy to describe what we've done there.
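
For a sense of what that simple rule-based alternative might look like, here is a minimal sketch: a few regular expressions mapping utterances to intents, with the app name captured as the entity. The patterns and intent names are illustrative assumptions.

```python
# A minimal sketch of a rule-based intent parser: regexes map utterances
# to intents, and the captured group serves as the app-name entity.
import re

RULES = [
    (re.compile(r"make (?:my )?(.+?) (?:app |query )?faster", re.I), "tune_application"),
    (re.compile(r"fetch errors for (.+)", re.I),                     "fetch_errors"),
    (re.compile(r"why did (.+) fail", re.I),                         "diagnose_failure"),
]

def parse(utterance):
    for pattern, intent in RULES:
        match = pattern.search(utterance)
        if match:
            return {"intent": intent, "app": match.group(1).strip()}
    return {"intent": "unknown", "app": None}

print(parse("How can I make my CEO report query faster?"))
# -> {'intent': 'tune_application', 'app': 'CEO report'}
```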

Attendee 2:

Really impressive methodology, I have to say, very good. My question is: you had this surface that I guess amounts to a kind of cost surface in this high-dimensional space. The integral you use over that surface, what prevents it from just saying, "Use as many resources as you can afford"? Because I'm having trouble figuring out what the competing force would be to prevent you from doing that.

Shivnath:

That is a great question. So the question is: I showed a response surface, and sometimes the number of dimensions in these response surfaces is very large; Spark has at least 25 parameters that can have a significant impact on performance. So when we tried to do the learning and bring in information, the question was, why not use parallelism and run 100 such explorations in parallel? This is something we have worked on. I'm happy to point you to how you can extend the sequential approach to plan a set of experiments in parallel and speed up the learning. And on the cloud this becomes very interesting, because you can say, "I'm willing to spend up to $50 to tune this app."

You can bring in a lot of resources in a short span of time, or bring in resources over time as you learn, and the learning algorithms become very interesting. So in a nutshell, parallelism is a function of how many experiments you can run in parallel. Sometimes on large, multi-tenant clusters the ops person might say, "Your runs for tuning should only be done in this queue where the resources are limited," and that limits how many you can run in parallel. So interesting trade-offs, leading to interesting algorithms. I can talk more offline.

Attendee 2:

I actually meant more so…I'm sorry, the recommendation itself. What keeps it from saying, "Use 100 gigabytes of RAM"?

Shivnath:

Okay, great. So the question then is about when you want an application to run with an SLA, some sort of goal, say an SLA goal of finishing in two minutes. Then the algorithm has to look at it differently. What I showed was the specific goal of application speed-up: subject to the amount of resources the application is getting, what configuration will make it the fastest? The SLA goal would be: "I want this application to run in two minutes; minimize the amount of resources that enable me to get there." The goal is different, and that's really why, with tuning, the goal has to be something the user inputs; then you can build the right algorithm and use the right one. We have algorithms for the SLA case as well.

Attendee 2:

Okay, cool. Thank you.

Attendee 3:

Hi. For the ML, what kinds of model algorithms were you using? That's one side. And how much data were you using for your training?

Shivnath:

So the question is: for these different algorithms, how much data are we using? I gave two examples. In one case, the failure case, a lot of the data is pre-generated, and it is huge: every application, if you're running on a 100-node cluster, can generate huge logs, so the data size can be very large. For the other problem, the app-tuning problem, what you often realize is that the data for tuning doesn't even exist, because it's a new app, so you collect it on the fly, and it becomes more of a small-data problem. So across the space of different use cases, these can be big-data or small-data problems; they differ.

Attendee 3:

Oh, okay. Thank you.

Attendee 4:

So for the speed-up goal, besides the probes are you using something else also? You mentioned some recommendation algorithm.

Shivnath:

So the question is: for the speed-up goal, is it only the probes, or are we using something else, too? The reality is that a lot of apps, either the same app or similar apps, have been run before. So if you're continuously collecting and observing information about every application, for a lot of them you realize you don't even have to do more probes; historical information is good enough. So it's really a combination: history, with probes just to supplement when you don't have information in the history.

Attendee 4:

How do you see that the apps are similar? Do you look into the application?

Shivnath:

Well, if it's SQL, it becomes very easy, because SQL is very structured. When it goes beyond SQL, then things like extracting hashes and whatnot may not work, but we can work with the user to help identify similar apps in the past. That way the user gives some examples, and we can use those to search for more and bring them in. So we're asking the user, "All these IDs, do you think they are similar apps?" So there are a lot of ways in which this problem can be tackled. Great question, thank you.

Attendee 5:

One more question, it’s from me. So are people using it today? I know that you have a free trial. How many people are using it and what is the response so far?

Shivnath:

Absolutely. So all these algorithms that I'm talking about in the backend are already in the product, which a lot of people are using: large enterprises, on everything from huge to small clusters, on-prem or in the cloud. The chatbot element is something new we are introducing, because as we took the product to a lot of users, especially application end users, they might be a BI analyst who says, "I don't want to look at the UI and figure it out; I just want to ask a question and get an answer." So this is more about improving the usability of the product, so that somebody can consume it the way they would like to, rather than learning another interface.


This video first appeared here on the Spark AI Summit site.