Best Practices Video 5 Min Read

Simplifying DataOps: Unravel demo video with Henry Eckerson

Demo by Kunal Agarwal, CEO:

Unravel is all about making sure that your Big Data applications are fast and reliable, and your entire Big Data infrastructure and setup is cost-efficient and highly utilized as well. In production, we have about 100 million applications that we’ve currently analyzed across all of these different customers, from large banks and financial institutions to healthcare companies and high-tech companies. We have about 10,000-plus Big Data machines under Unravel support today. And Unravel is backed by a number of top VC firms here in Silicon Valley: Microsoft Ventures, Menlo Ventures, as well as GGV Capital.

Transcript of video (subtitles provided for silent viewing)

Henry: Hey, everyone. Today, we have Kunal Agarwal here from Unravel Data, and he’s going to give us a demo of their product. For those that don’t know, Unravel Data is a DataOps vendor that provides an application performance management solution for Big Data. The solution tracks, correlates, and interprets performance data across the full stack in order to optimize, troubleshoot, and analyze from a single pane. Okay, guys, take it away.

Kunal: Thank you so much, Henry. So, I’m the co-founder and CEO of Unravel, and we’re simplifying DataOps in a number of different ways. But at the crux of it, Unravel is all about making sure that your Big Data applications are fast and reliable, and your entire Big Data infrastructure and setup is cost-efficient and highly utilized as well. So, just a quick background about Unravel. We are a team that was founded by a number of Big Data practitioners and leaders in the industry, from companies that are commercializing Big Data technologies such as Hadoop, Spark, Kafka, etc., from Cloudera, Hortonworks…

A lot of our research work was actually done at Duke University when we first started out. A lot of companies are using Unravel in production; we have about one hundred million applications that we’ve currently analyzed across all of these different customers, from large banks and financial institutions to healthcare companies and high-tech companies. We have about ten thousand plus Big Data machines under Unravel support today. And Unravel is backed by a number of top-tier VC firms here in Silicon Valley: Microsoft Ventures, Menlo Ventures, as well as GGV Capital.

So, in a nutshell, what we do in Unravel is we provide our customers a single pane of glass. It’s a rather, you know, overused term, but in Big Data, the need for it could not be more true, because you have multiple systems that create a Big Data stack. Unravel monitors and manages all of these different systems. We don’t just give our customers graphs, and metrics, or performance data, but you’ll see in one of our examples over here on the application side that we actually help auto-tune and auto-remediate a lot of these commonly occurring problems that happen on this Big Data stack.

And on the operations side, Unravel also has very fine-grained visibility into resource, data, and user utilization of the cluster, which can be used for a variety of different purposes, from chargeback reports and usage analytics, to even a cost savings report or a migration plan if you’re trying to switch between environments. And last but not least, we use Big Data technologies ourselves to be able to make the Big Data team or the DataOps team more proactive. And I’ll show you in one of these examples today as well how we could resolve problems even before they hairball into a bigger issue, and how Unravel uses machine learning and AI within the software to be able to drive some of these features.

So, before we jump into a demo, I just want us to be clear on what I mean when I say the Big Data stack: it’s really this typical merger of different systems that companies put together in order to drive value from their Big Data. You may have certain tools for data collection and data ingestion such as Hadoop and Kafka. You may be storing this data on HDFS or S3. You’re running a variety of different types of systems on top of it such as Spark, Hadoop, HBase, Impala, etc., all for the purpose of driving value through what we call apps. And these apps could be anything from a complicated report, to a machine-learning algorithm, to a Connected Cars or any other IoT application, to just simple BI analytics in the form of SQL queries, and so on and so forth.

So, a typical, you know, a couple of scenarios that people run into when they’re running these Big Data systems is, you know, you have these data engineers and data scientists creating applications in a dev environment, and then running them in a production environment on top of these Big Data ecosystems. And they say, “Hey, you know, my app is failing,” or, “My app is super slow today, I wanna go and fix that.” And then you have the operations people who are sitting down at the center of it all, who are getting all these inbound requests from the end consumers of Big Data. Then, they too may be looking at some of these problems around resource contention, capacity planning, etc., or even everyday troubleshooting issues such as, “Why has a mission-critical machine learning app missed its SLA, missed its Service Level Agreements?” That, you know, we may have sat with an outside customer and said, “We will give you these results in X-minutes,” but today it’s not following that pattern.

Now, let’s look at one of these problems. So Tom’s problem was probably with a Spark application that is running slow. And how would Tom go about resolving that problem today without a tool like Unravel? Now, there’s multiple reasons why an application could be slow or why an application could be failing. And these reasons could be multiple faults from resource problems to how he wrote the code, to the configuration settings that this application took to actually process on the cluster, to the data layout and the RDDs and the caching, etc., you know, that Tom provided for that particular application. So, today, Tom would have to jump around in a system monitoring tool like the one you see here in front of you, which is giving information about cluster, CPU utilization, disk utilization, and on the left side, you see if the services are working or not working.

So, in this case, for Tom’s Spark app, there’s no information about that app, but at least you can get some information about, “Hey, yes, Spark as a service is up and running over here, I’ve got a green light next to it.” The cluster seems to be okay, it’s not at a hundred percent utilization, but then there’s no information particularly about his app over here. And then, he has to jump into another screen, you know, a popular one like the Spark UI, for example. But if you look at the screen, it says, “Hey, are the tasks completed or not completed,” but why is it slow? That information is, again, lacking over here, not to mention, this screen is showing Tom information at a component level instead of tying up these several components and showing him the view that he needs at his app level.

Tom may have hundreds of different components that comprise his entire app, and he would have to jump into hundreds of such screens to go and figure out what really happened, and even these screens won’t provide that complete information. Tom may then have to go into logs, JobTracker, TaskTracker, to get some more information about his application. And these logs come in, again, thousands of screens. And Tom would, again, have to connect the dots, stitch up the story to go and figure out, “Is it a code-level issue? Is it a resource-level issue? Is it a container-level issue? Is it something to do with data layout, etc.?” And then, Tom may also have other tools in his arsenal, you know, general-purpose monitoring tools, which, again, provide you more, and more, and more metrics.

We want to show you now how Unravel helps people like Tom resolve issues like this. So, you would come into the Unravel interface, which is now monitoring and managing all the applications that are running on your cluster. It could be a Hive application, Impala, Spark, Oozie-based applications, which may be doing AI, machine learning, IoT, it doesn’t matter. And, in this case, Tom will be able to find his application very easily, jump into it. And what he’s presented with is a view, first and foremost, that’s the best view to understand what’s happening with his application, with all those different angles in mind: “How is the code doing? How are the resources doing? How is the data layout and partitioning doing? Were my configuration settings okay?”

So, if you bring your attention to this page, you will see that Unravel has the app itself, and then all the components of the app underneath it, to be able to further drill down and understand what each of these different components did. We also have very excellent views for Tom to be able to correlate his code to how his code actually ran on the cluster.

A lot of times, programmers may not have the information around, “Yeah, I wrote a piece of Scala code, that’s what I understand, but how does this code actually execute as a distributed computing application?” That may be hard to understand. So, just by a click of a button, Unravel actually shows which execution pattern here correlates to which line of code, and that may help him further drill down, and troubleshoot, and understand what his issues actually are.

But Unravel takes it a step further. Unravel has all these resource graphs and understanding of how containers were used and data was used, etc.

But what we figured is we wanted to connect the dots for Tom, so that Tom doesn’t waste a lot of time trying to drill down and firefight these issues. Instead, he can come over here and say, “Hey, this application is taking 27 minutes to run. Unravel has got three recommendations in which I can improve that.” And upon clicking here, Unravel actually gives Tom the changes that he needs to make, in order to make his application run faster and/or in order to make this application run more efficiently. Efficiently, meaning not using too many resources on the cluster.

So, Unravel has done all the checks and analysis that a Big Data expert would do, and it’s telling you, “Hey, there are some configuration settings that are the problem in this particular case.” And, as Tom knows, there are probably five or six hundred of these settings. And we shortlisted them and said, “Hey, go and change these settings from this value to this particular value when you run this app, and this app will run much better.” But not only that, we actually explain to Tom, in plain English, why we came up with those values, so Tom can become a better programmer, he can become a better data engineer, and not repeat the same problems that he’s had in the past. And he also gets a good understanding of, you know, “What are some common problems and inefficiencies with Big Data applications?”

So, let me walk you through some of these. For example, we’re saying, “Hey, your container resources are underutilized. You allocated a certain size and certain number of containers, you’re not using that. So, if you want to speed up your app, you should use that properly.” And the way to do that is to tune a couple of settings around parallelism, around how much memory you’re allocating per container, etc., and Unravel is actually walking you through those kinds of things. In addition, you know, maybe there are some code changes that you need to make in order to run this application better. For example, it’s saying, “Hey, in this case, Unravel has figured out there are some RDDs that are worth caching. Don’t go and cache everything because that’ll fill up your memory again, but if you go and cache this particular RDD, and the way to cache this is go and add this particular statement in front of that line of code, and that alone will save you a ton of time.”
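As a rough illustration of the "container resources are underutilized" check described above, the following Python sketch compares the memory allocated to each container with its peak usage. The data structure, the names, the sample numbers, and the 50% threshold are all assumptions made for illustration; the transcript does not describe how Unravel performs this analysis.

```python
from dataclasses import dataclass


@dataclass
class ContainerUsage:
    """Hypothetical per-container sample: memory allocated vs. peak memory used (MB)."""
    container_id: str
    allocated_mb: int
    peak_used_mb: int


def underutilized(containers, threshold=0.5):
    """Return IDs of containers whose peak usage is below `threshold` of their allocation."""
    return [
        c.container_id
        for c in containers
        if c.allocated_mb > 0 and c.peak_used_mb / c.allocated_mb < threshold
    ]


# Made-up numbers, for illustration only.
sample = [
    ContainerUsage("c1", allocated_mb=8192, peak_used_mb=2048),  # 25% of allocation used
    ContainerUsage("c2", allocated_mb=8192, peak_used_mb=7000),  # about 85% used
    ContainerUsage("c3", allocated_mb=4096, peak_used_mb=1024),  # 25% used
]
flagged = underutilized(sample)
print(flagged)
```

Containers flagged this way are candidates for smaller memory allocations or higher parallelism, which is the kind of tuning the recommendation in the demo points toward.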

So, Tom can apply these settings or he can ask Unravel to apply these settings automatically, and that same application that was running in 27 minutes has now shrunk its execution time by seven X to about three minutes. This is just to show you the vast inefficiencies that these Big Data applications can have. We saw how easy it is to get all this information in one place or to have a correlated, imputed view of this information, and then actually take an action on that to get out of your problems.

Now, let’s look at the other problem, which was Jackie’s issue. Jackie’s the operations person again, and she got a complaint from a data scientist, in this case, saying, “Hey, my machine learning model, which runs every day, today it didn’t finish on time. What’s going on?” So, we’ve got very specific views for different types of applications. What we showed you in Tom’s example was one Spark app. In the case where Jackie has to investigate an issue, this is an app that’s actually made up of some MapReduce stages and some Spark stages put together. So, the Spark UI, in this case, won’t even help out at all because there are some MapReduce components, maybe there are some HBase components, maybe there’s some Kafka components.

And Unravel can tie all of these different components up, and call it this one app so that Jackie or anybody else can jump inside that and truly figure out what’s happening with the application and not with the components of the application. On the right side, Jackie can start her investigation by figuring out how long this app usually takes. So, it says it takes an average of about four minutes, today it’s taking fourteen minutes, but it’s not processing any more data than it usually does. What’s going on? On the right side here, you see an instance compare view. So what Unravel does is it ties up every run of this workflow, and what you’re seeing here is a duration view saying these are all the previous runs, and they ran in about three minutes twenty seconds, three minutes ten seconds, but today’s run, boom, it just went up to fourteen minutes fifty-one seconds. What happened?

In this case, what Jackie can do is, of course, understand the insights Unravel is bringing to the forefront, and saying, “Hey, out of the hundreds of possible things that can go wrong with this application, today’s reason of why it did not perform as well as it usually does is because the wait time to launch application master is very, very bad compared to our baseline model that Unravel has been able to generate.”
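To make the idea of comparing a run against a baseline concrete, here is a minimal Python sketch. It stands in a simple "mean plus a few standard deviations" rule for the baseline model; that rule, the function name, and two of the four historical durations are assumptions for illustration, since the transcript does not say how Unravel builds its baseline.

```python
from statistics import mean, stdev


def is_anomalous(previous_durations_s, current_duration_s, num_stdevs=3.0):
    """Flag a run whose duration exceeds the historical mean by `num_stdevs` standard deviations.

    Requires at least two previous durations, because `stdev` needs two data points.
    """
    baseline = mean(previous_durations_s)
    spread = stdev(previous_durations_s)
    return current_duration_s > baseline + num_stdevs * spread


# Previous runs in seconds: 3m20s and 3m10s come from the demo, the other two are made up.
history = [200, 190, 195, 205]
today = 14 * 60 + 51  # 14m51s, the slow run from the demo

print(is_anomalous(history, today))
```

The same comparison could be applied to any per-run metric, such as the wait time to launch the application master, rather than only to total duration.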

And then, it says, “Hey, you can click here to compare against a better run of the same application.” So, now, Jackie can see side by side a good run which ran in three minutes, and a bad run that ran in fourteen minutes and fifty-one seconds, and see what changed. And if you bring your attention here, you’ll see that stages that used to take one minute twenty-three seconds to run are now taking eleven minutes to run, and this stage that is now taking two minutes forty seconds to run used to take about thirty-two seconds to run. So there’s something going on that makes these stages take so much time, and Unravel is able to decipher that for you and tell you, “The reason for that is there’s a lot of wait time for these particular resources.”
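The side-by-side comparison of a good run and a bad run can be sketched as a per-stage diff. The Python below is an illustrative sketch only: the stage names are hypothetical, and the durations are the ones quoted in the demo converted to seconds.

```python
def stage_slowdowns(good_run, bad_run):
    """Return (stage, good_s, bad_s, ratio) for stages present in both runs, largest slowdown first."""
    rows = []
    for stage, good_s in good_run.items():
        if stage in bad_run and good_s > 0:
            bad_s = bad_run[stage]
            rows.append((stage, good_s, bad_s, bad_s / good_s))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Stage names are hypothetical; durations are taken from the demo narration.
good = {"stage_a": 83, "stage_b": 32}    # 1m23s and 32s
bad = {"stage_a": 660, "stage_b": 160}   # 11m and 2m40s

slowdowns = stage_slowdowns(good, bad)
for stage, good_s, bad_s, ratio in slowdowns:
    print(f"{stage}: {good_s}s -> {bad_s}s ({ratio:.1f}x slower)")
```

Sorting by slowdown ratio puts the stages that regressed the most at the top, which is where an investigation like Jackie's would start.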

So, what we can do next is: Jackie can jump into the Unravel operations dashboard and understand what are the other applications running on the cluster at that time, who else was stealing all these resources, are there any alerts that were fired up when this application was running? And she discovers that, “Hey, there was a rogue application running on the 28th of August at that same time that this app was running, maybe that caused it.” And clicking on that gives you the full context, in the system view, around, “Hey, what are the applications that were running at that particular time, and what fired off this alert of a rogue app?” And on further investigation, it becomes obvious that there was this user that ran a Spark shell that started hogging seventy thousand megabytes of memory, slowing down the job that Jackie’s investigating.

So, we found out over here with one click of a button that there was actual contention happening on the cluster, and that’s why that SLA-bound application did not finish on time. So, what Jackie can now do is actually set up Unravel’s auto-actions, which is a policy mechanism which automatically takes an action on your behalf. So, in this case, she can say, “Hey, if there is a rogue user that is detected at a time when my SLA-bound jobs are running between, you know, 12:00 p.m. and 2:00 p.m. every day on my production cluster only, and if that is detected, I want you to kill that rogue application, or I want you to move that bad application to a quarantine queue, or I want you to send me an email, or the application owner an email so that we can be notified and we can resolve these problems in a much faster way rather than facing this problem and then resolving these issues.”
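The policy Jackie describes (a cluster, a time window, a condition, and an action) can be sketched as a small rule object. This is a hypothetical Python model written for illustration; the class names, fields, action strings, and the 50,000 MB limit are assumptions and do not reflect Unravel's actual auto-actions interface.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class RunningApp:
    """Hypothetical snapshot of an application running on a cluster."""
    app_id: str
    user: str
    cluster: str
    memory_mb: int


@dataclass
class AutoActionPolicy:
    """Hypothetical rule: act on apps over a memory limit, on one cluster, in a daily window."""
    cluster: str
    window_start: time
    window_end: time
    memory_limit_mb: int
    action: str  # for example "kill", "move_to_quarantine_queue", or "email"

    def evaluate(self, app, now):
        """Return the action to take for `app` at time `now`, or None if the policy does not apply."""
        in_window = self.window_start <= now <= self.window_end
        if app.cluster == self.cluster and in_window and app.memory_mb > self.memory_limit_mb:
            return self.action
        return None


# Window and cluster follow the demo; the memory limit is an assumed value.
policy = AutoActionPolicy("production", time(12, 0), time(14, 0), 50_000, "move_to_quarantine_queue")
rogue = RunningApp("spark-shell-1", "some_user", "production", 70_000)

print(policy.evaluate(rogue, time(12, 30)))  # inside the window
print(policy.evaluate(rogue, time(15, 0)))   # outside the window
```

Keeping the condition and the action separate means the same detection rule can be reused with a gentler action, such as sending an email, before moving to killing applications.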

So, these are a few ways in which Unravel is simplifying DataOps by giving you the full view and performance intelligence on top of it. The current approach without Unravel, to summarize, is go to a bunch of different places to go and get all of these little pieces of data, and then try to correlate all of these different pieces in your head. It’s literally like finding a needle in the haystack, and that takes a lot of time. And not only that, that approach may not even provide you accurate results.

Instead, what Unravel does is it provides a full stack performance intelligence platform. Full stack, meaning it gathers data from applications all the way down to infrastructure and everything in between. But also full stack horizontally, meaning Unravel is designed for Hadoop, Spark, Kafka, NoSQL, MPP systems, really all the systems that make up a Big Data stack. And once the information comes inside Unravel, we don’t just aggregate all this information and show you five hundred graphs, we actually correlate all of this information into a more meaningful imputed form in which you can understand truly what’s happening. And we don’t stop just there, we actually apply machine learning and artificial intelligence ourselves to be able to uncover these issues, to be able to give you a resolution for some of these things.

And in a nutshell, we are allowing users to optimize, which means making their applications faster and making their resources more efficient, we help you troubleshoot, which you just saw today, and last but not least, actually analyze. So, if you have to go back and say, “Who used the cluster last month?” Unravel will be able to give you a very detailed chargeback report, for example. That’s the short demo that I have for you today. To learn more, feel free to go to unraveldata.wpengine.com, where we also have our free thirty-day trial. Thank you very much.