Best Practices Video 5 Min Read

Monitor and Optimize Big Data Pipelines and Workflows End-to-End

In this demo, Unravel co-founder and CEO Kunal Agarwal shows how our big data APM software helps you manage applications in production. Unravel’s AI and ML-based knowledge base incorporates millions of customer use cases to support automated recommendations, and even allows you to establish rules that prioritize mission-critical processes to help meet SLAs. With an intuitive cross-platform dashboard, our APM software lets you drill down to a stuck app or resource hog and determine exactly what (or who) is causing the problem. This level of end-to-end application optimization manages apps running on Azure, Spark, Hadoop, Kafka, Impala… a comprehensive, transparent view, not point tools and endless poring over logs.

Transcript of video (subtitles provided for silent viewing)

Audrey: Welcome everyone, and thanks for joining today’s webinar, “Monitor and Optimize Big Data Pipelines and Workflows End-to-end.” Today’s Webinar will take about thirty minutes. After the presentation and demo, we’ll close it out with about ten minutes of Q&A. Just a couple of housekeeping items before we begin. If you have any questions, please submit them in the GoToWebinar questions panel at any time and we’ll be sure to get to them during the Q&A portion. Today’s webinar is being recorded. We’ll make sure to send you a link to the recording after the webinar. With that, I’d like to introduce today’s speaker, Kunal Agarwal, CEO and co-founder of Unravel. Kunal?

Kunal: Hi guys, I’m Kunal Agarwal, co-founder and CEO of Unravel Data. Today we’re going to be focusing on how you monitor, optimize, and really operationalize data pipelines and workflows. These are a very, very important type of application that you’re running on your big data stack. Today’s webinar will cover four different types of problems that occur every day when you’re running these workflows at scale, and how you go about resolving those common issues.

Before we jump into talking about problems, let’s quickly understand what we mean by a data pipeline or a workflow. A data pipeline and a workflow, first of all, are interchangeable terms. What it really means is a big data application that you may be putting together, which is composed of several stages to achieve a goal, which could be creating a recommendation engine, creating a report, creating a dashboard, etc. You’re usually stitching together multiple stages in the form of multiple processes, and these data pipelines or workflows may then be scheduled on top of Oozie or Airflow or BMC, etc., running repeatedly, maybe once every day or several times during the day.

This is one such example. In this case, for example, if I’m trying to create a recommendation engine I may have all of my raw data in separate files, separate tables, separate systems, that I may put together using Kafka and then I may load this data into HDFS or S3 after which I’m doing some cleaning and transformation jobs on top of it meaning trying to merge these datasets together, trying to create a single table out of this or just trying to clean it up and make sure that all the data is consistent across these different tables. And then I need to run a couple of Spark applications on top of this cleaned-up dataset to then create my model or analyze this data to create some reports and views, etc. and that’s the end result that I’m trying to achieve from this particular type of a data pipeline.
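
As a rough illustration of the pipeline just described, the stages and their dependencies can be written down as a small graph and put into a valid run order. This is a minimal sketch in plain Python, not how Oozie or Airflow define workflows; the stage names are hypothetical labels for the steps in the example.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names modeled on the recommendation-engine example;
# each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest_with_kafka": set(),
    "load_to_hdfs_or_s3": {"ingest_with_kafka"},
    "clean_and_transform": {"load_to_hdfs_or_s3"},
    "spark_build_model": {"clean_and_transform"},
    "spark_build_reports": {"clean_and_transform"},
}


def execution_order(pipeline):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
for step, stage in enumerate(order, start=1):
    print(step, stage)
```

The two Spark stages depend only on the cleaned-up dataset, so a scheduler would be free to run them in parallel.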

So there are many examples of data pipelines, and these workflows or data pipelines may run either in real time or as a batch application, depending on your use case. And there’s a variety of tools that you may use for running these data workflows and pipelines end-to-end.

A lot of industries are already running a lot of these data pipelines and workflows in production. In the telecom industry, we see a lot of companies using these pipelines for churn prevention, making sure that customers are always happy with their service. So it goes all the way from making sure the customer service is great to, actually, the product itself is great.

Now in the automotive industry, a lot of companies are using this for predictive maintenance using IoT applications. So they’ve got similar datasets flowing in through Kafka in a real-time fashion and maybe running some Spark streaming applications on the end of it to be able to predict any faults or problems that may happen with the entire manufacturing line itself.

The financial departments may be using this for fraud detection. A lot of the companies that we work with today that are banks and major financial services firms use this extensively to detect a fraudulent transaction, say, for example, on your credit card, in a quick and fast manner. And so on: we’ve got e-commerce companies with recommendation engines, and healthcare companies doing their clinical decision making.

So as you can see, these data pipelines are mission-critical applications, so getting them to run well, and getting them to run reliably, is very, very important. However, that’s not usually the case when we try to create these pipelines and workflows on these big data systems. And these are some common problems that occur every time you’re running these kinds of pipelines. You may have runtime problems, meaning an application just won’t work. You may have configuration settings problems, where you’re trying to understand, “Hey, what should I set these different configuration settings to, to make these applications run in the most efficient way possible?”

You may have problems with scale, meaning this application ran with ten gigs of data just fine, but what about when it has to run with one hundred gigs or one terabyte of data, will it then still finish in the same amount of time or not? And last but not the least, you have multi-tenancy issues. I’m very sure that everybody on this call today does not run their applications in an isolated environment. Everybody probably runs it with hundreds of possible users, tens of thousands of potential apps, all running together on the same cluster. So a lot of times you may have other applications affecting the performance of your application.

So today, let’s look into all of these different types of problems in a little more depth and understand how you can solve them using the out-of-the-box tools that you may have, and then how a company like Unravel can help you simplify problem detection, root cause analysis, as well as resolution for a lot of these different types of issues.

So let’s dive right in… into run-time problems. Now, run-time problems come in multiple types. You may have applications that are just stuck. How many times have you all seen a workflow stuck at 95% and it just won’t finish? You may have applications that are waiting for tasks, or dependencies, or are slow. In today’s case, we’ll focus on a very common problem that keeps occurring, which is an application that got killed or failed. So I’ve got a couple of demo scenarios here set up for you. In this case, we ran an Oozie application that actually failed. We’re focusing on this application here which got killed. So I clicked over here and I’m trying to understand the root cause for why this particular workflow did not run.

And as you know, there can be tons of reasons for it. An application could fail because something went wrong with the code itself. It could fail because it didn’t get enough resources, or because the services that this particular workflow was running on failed, etc. Now, trying to detect that problem using the Oozie UI is not very useful because Oozie restricts it, first of all, only to the actions that I am actually running. It does not show me anything about resource utilization, whether CPU is available, or whether Oozie itself is working. For that I may have to open up my system monitoring tools. In this case, we’re using Cloudera, so that means Cloudera Manager or Ganglia. So let’s go ahead and figure out, you know, what information Oozie actually provides you.

So, in this case, I’ve opened up this particular Oozie ID, which is number 53, and it shows me over here all of the different actions that were performed. If I bring my attention to the statuses, it shows me, “Hey, all of these are okay, but this table scan here had an error and it failed.” So I click on this to get the reason: “Hey, why did it fail? What went wrong over there?” And the only thing I see here in the error message is the main class and an exit code. Why? Why did that main class not execute? Why did this execution actually just stop? It doesn’t give me that information.

Now, I can also go and check out logs. By the way, this log screen here may actually be closed off for a lot of you users, because in a lot of companies this view is restricted to only the admins. But regardless, even if I look at this, it actually shows me that the MapReduce job that ran as part of this Oozie workflow succeeded, and that confuses me again. It failed here, it says succeeded here, so let’s go check out the log and figure out what happened. I can go and click on the log screen here and check: is there anything those logs are giving me to understand why this failed?

And again, over here there’s incomplete information, because it’s just showing me that there was an exit code but it’s not showing me why that particular problem even happened. Now, I should go and see, you know, all my system monitoring as well, to figure out whether maybe it wasn’t an application issue but something to do with the infrastructure itself. Let me go and see if I had CPU available. Absolutely: I was only using about five to seven percent of CPU, and even on my peak days I was using about, you know, thirty percent or something, so I definitely have CPU cycles available. Then why did this application get killed or why did it fail? Well, maybe it has something to do with services.

This app was running some Hive jobs on Oozie, but both of them seem to be in good health, and now I’m left scratching my head. This is usually the point where I’m stuck, confused about, “What do I even do next and how do I get this application to run properly?” So as you can see, a variety of factors can affect the performance of an app, and jumping around from screen to screen trying to triangulate this information is pretty hard to do. So we’ve simplified that with Unravel. We’ve got the same app here, number 53, that actually executed in Oozie. We’ve brought all of that data into Unravel and given it a nice, clean UI, but more importantly, we analyze all of this information automatically to give you the root cause of this problem and help you understand how to get out of it.

In this case, first of all, you’ve got a very clean UI to help you understand where your time went, which stages were processing in parallel, which were processing in serial, and where time was actually going. And in this case, I see one of the components here has actually failed, which is the table scan component that we saw earlier. But now, clicking on this particular failed job or component, I can go inside. So I clicked on that particular stage there and I opened up a drill-down of this particular stage. I was on this Oozie piece here, with one of the components that had failed. I clicked on that. Now I’m in the drill-down view of that particular component. And over here, when I click on the Unravel analysis button, it actually shows me why this application failed. So in this case, it’s saying that a job failed with this “null pointer exception” error, which means I had some datasets in there that had null values in them, and that’s why this application could not execute. I can get even more fine-grained details about that particular component to understand more about that error. In fact, I can even get the detailed log messages where that error was applicable. But you see how we didn’t have to jump around different screens to figure out what the cause of the error was, because Unravel triangulates all the information for you and then, in one click, shows you what the problem was and how you can actually go and resolve it.
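
To make the idea of pulling a root cause out of the logs concrete, here is a minimal sketch that scans log lines for the first fully qualified Java exception name. It is not how Unravel performs its analysis, which the talk does not describe; the function name and the sample log lines are made up for illustration.

```python
import re

# Matches a fully qualified Java exception or error class name,
# e.g. java.lang.NullPointerException.
EXCEPTION_PATTERN = re.compile(
    r"\b(?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error)\b"
)


def first_exception(log_lines):
    """Return (line_number, exception_name) for the first exception found, or None."""
    for number, line in enumerate(log_lines, start=1):
        match = EXCEPTION_PATTERN.search(line)
        if match:
            return number, match.group(0)
    return None


# Made-up log lines for illustration only; real Oozie or Hive logs look different.
sample_log = [
    "INFO  starting action table_scan",
    "INFO  launching job",
    "ERROR task failed: java.lang.NullPointerException",
    "ERROR main class exited with a non-zero exit code",
]

print(first_exception(sample_log))
```

The exit-code line on its own says nothing about the cause; the useful signal is the exception that appears before it.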

To summarize, without Unravel, you would have to go to multiple different screens, you would have to read through all of these logs, and you would really have only a manual way to triangulate all of this information. With Unravel, with one click, you can get to the root cause of these problems and move on with your day very quickly.

Another very common problem that we see often is with configuration settings. Now, configuration settings, as you all know, come in the hundreds. And configuration settings dictate everything about the performance of an application. They dictate how many resources an app gets, what the size of each task is, whether there’s a good degree of parallelism, what the batch size will be, what the buffer memory size is for Spark or Kafka, etc. These configuration settings also dictate how the different components talk to each other, because your workflow may have Kafka, it may have Spark, it may have some Hive jobs as part of it, and you want to make sure that everything works properly, like a well-oiled machine.

So getting these configuration settings right is absolutely critical. Now, without a tool like Unravel that helps you understand and then recommend these settings, you will have to use a trial-and-error manual method to go and get these best configuration settings. So the way you would go about solving this problem is you would first understand your application behavior by using a set of default settings that your cluster currently has. You would then open up that application either in Oozie or JobTracker or Airflow and understand, “Hey, how is this application behaving? Which execution stages are bottlenecked right now? Are there any slow tasks or stragglers in this particular execution?”

And then you would pick maybe the top ten configuration settings for each of these because, again, all these settings are in the hundreds. It’s not possible to manually tune all of them. So maybe pick ten of them, and then you wanna try adjusting these configuration settings a little bit at a time and rerun that application to see whether you affected the performance in a positive or negative way when those configuration settings were deployed. Not to mention, this gets even more complicated because you are running in a production environment where things are changing constantly and applications are taking resources and giving resources away. So a lot of these factors will slow you down further in getting to the proper configuration settings for that application.
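
The manual trial-and-error loop described above can be sketched as follows: change one setting at a time, rerun, and keep the change only if the run got faster. This is a minimal sketch under stated assumptions. `fake_run` stands in for actually rerunning the application, and the setting names, candidate values, and cost model are all invented.

```python
def tune_one_at_a_time(run_app, defaults, candidates):
    """Try each candidate value for each setting, keeping a change only if it helps.

    run_app(settings) returns a runtime in seconds; it stands in for rerunning
    the real application and reading its duration.
    """
    best = dict(defaults)
    best_runtime = run_app(best)
    for name, values in candidates.items():
        for value in values:
            trial = {**best, name: value}
            runtime = run_app(trial)
            if runtime < best_runtime:
                best, best_runtime = trial, runtime
    return best, best_runtime


# Toy stand-in for a real run. The setting names and the cost model are invented.
def fake_run(settings):
    parallelism_penalty = abs(settings["parallelism"] - 8) * 30
    memory_penalty = abs(settings["memory_gb"] - 4) * 20
    return 240 + parallelism_penalty + memory_penalty


defaults = {"parallelism": 2, "memory_gb": 1}
candidates = {"parallelism": [4, 8, 16], "memory_gb": [2, 4, 8]}

best, runtime = tune_one_at_a_time(fake_run, defaults, candidates)
print(best, runtime)
```

Even in this toy case the loop needs seven runs for two settings. With ten settings on a busy production cluster, where each rerun is slow and noisy, the cost grows quickly, which is the speaker’s point.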

Now I want to show you how configuration settings problems can be resolved using Unravel. So in this case, we’ve got an application here that is also a workflow, but in this case it’s coming from Airflow, another popular workflow engine which a lot of our customers are using. Now we’ve got all the information about that particular workflow right here. For example, this workflow is called CEO report, and it is made up of these four Spark applications underneath it. On the right side, I have different runs of this particular workflow to be able to quickly compare and contrast how one run behaves in relation to another. So this is a bad run that took thirteen minutes, forty seconds to run, and I can see the next run also took about the same amount of time. And then I was able to analyze this and bring the workflow execution time down to four minutes. So let’s go and understand how we actually did that.

So with Unravel, you get these insights which help you understand, “Hey, where should we focus on? Can I reduce the amount of analysis I have to do as the end user, to figure out where do I even begin to resolve these problems and speed up this particular workflow?” So in this case, it’s saying, “Hey, we detected work…we detected bottlenecks. This is that particular stage that you should be looking at.” So we’ve opened that up and we get a more detailed view about that particular stage itself. In this case, just to make it clear, we’ve got these different Spark applications that are running underneath it and it’s picked up one of these stages and said, “Hey, this is where you should go and spend time to try reducing the amount of resources or execution time that this particular workflow is taking.”

So when I click on that, I get fine-grained details about that Spark app. Again, all the information is in one place: I get information about what program ran, how the Spark app actually uses resources, and what the execution pattern of this particular Spark app looks like. I can drill down into each of these Spark stages to further figure out what was going on in there. But Unravel has three recommendations that it has made automatically by analyzing all of these applications. And it said, “Hey, these are the three configuration settings that, if you change them from this current value to this new value that Unravel has figured out, your application would run much faster.” And it also explains to you in plain English why it’s recommending those values, so that you can learn and become better end users yourself. In this case, it’s saying, for example, that container resources are underutilized, which means that this application did ask for some resources from the system, but it’s not even using them fully. So either give those resources up or use those resources properly so that this application can run much better, etc.
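
The “container resources are underutilized” finding boils down to comparing what an app requested with what it actually used. A minimal sketch of such a check follows. The thresholds, field names, and numbers are assumptions chosen for illustration and are not Unravel’s actual rule.

```python
from dataclasses import dataclass


@dataclass
class ContainerUsage:
    requested_mb: int   # memory the app asked the cluster for, per container
    peak_used_mb: int   # highest memory actually observed in use


def recommend_memory(usage, low_utilization=0.5, headroom=1.25):
    """Suggest a smaller request when peak usage is well below what was requested.

    Returns the suggested request in MB, or None when the request looks reasonable.
    Both thresholds are arbitrary illustration values.
    """
    utilization = usage.peak_used_mb / usage.requested_mb
    if utilization >= low_utilization:
        return None
    return int(usage.peak_used_mb * headroom)


print(recommend_memory(ContainerUsage(requested_mb=8192, peak_used_mb=2048)))
print(recommend_memory(ContainerUsage(requested_mb=4096, peak_used_mb=3500)))
```

The first container uses a quarter of its request, so the sketch suggests a smaller request with some headroom; the second is well used and is left alone.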

So it will go through all of these different checks, balances and inefficiency hunts for you and recommend the right values. And then we can see here a good run: that same app that was running earlier, taking about thirteen minutes, thirty-five seconds, is, with these improved configuration settings, actually running in about four minutes. We didn’t add more resources to this particular application. In fact, we used the resources that we currently have in a more efficient manner and made a few other changes on the configuration side so that all of these Spark apps have a better degree of parallelism and execute better on this particular cluster. And you see how we were able to reduce time by a big factor over here, by 3X, and make this application run much faster.

Configuration settings are very, very crucial to get right, and Unravel will also help you automate and get you to the best configuration settings automatically. And you may question what does “best” mean? Broadly, it means two things, can I run this application the fastest possible, and best configuration settings may also mean, can I reduce the resource footprint of this application? Because again, you’re running in a multi-tenant environment, you wanna make sure that every application running on the cluster is a good multi-tenant citizen and it’s not hogging or blocking resources away from other applications that may need to execute on time.

Another problem that we have is around scale. Scale meaning, will this application perform at a bigger scale that may be on a production cluster that may be in cases where my data size is increased by 100X? And “how do I ensure reliability?” is usually the question. Which is, take for example, you’re running an app right now that we’re developing in a development environment or a dev environment. Now that dev environment may have limited data sets. It may also have limited multi-tenancy, meaning I along with a few other developers may be the only people accessing it, and I may be running this application with ten gigs of data to test it out. And now I’m trying to put this in production. Now in dev, this application performed in about two, two and a half minutes and I want to get that same performance in production. How do I go about solving that kind of a problem?

So to do that, you must fully understand all of the factors that are affecting the performance of the app. You will need to dig in and understand, for example, where are the expensive joins? Where are the bottlenecks? How is this application processing hot keys? How’s this application getting resources? Is it getting enough containers? Is it getting enough CPU? And then pose the what-if questions: what will happen with 5X more data, half the CPU, more contention happening on the cluster, etc.? And as you can see, the ability to spot these problems and understand the behavior of the application in screens like Oozie or Hue or Airflow is pretty limited.

So in this case, I’ve got an application that I ran in my dev environment that finished in about two, two and a half minutes over here. But if I try to understand over here which operations or actions took the most amount of time, why they took the most amount of time, what kind of configuration settings this was actually running with, and whether that is even appropriate… how did it use the resources that it actually got? I don’t have a full understanding of those components to be able to then project how this will run in production. So with Unravel, you also have these views where, for this application which ran in dev in about two minutes and seven seconds, you can understand how it actually ran. What were the tasks that were running in parallel? What was serial? So now you can figure out where the bottlenecks were, or what the critical path of execution was.

I can also use a DAG view to understand not only what the DAG looks like, but also, through the highlighted parts here, what the most important path of execution is. So in this case, we had a fork and then we had two Hive applications here running in parallel, and the view shows which Hive application, which side, is the more expensive and more critical side for this application to finish. So if I was looking for a performance improvement, I would start looking on this side and not focus on this other side. So you see how Unravel simplifies how you can drill down inside and figure out where we should even spend our energy when trying to find where bottlenecks and problems, etc., can happen. And then say I do run this at scale, maybe at 3X, 4X or 10X more data, and I still want it to finish in a reasonably good time of about two minutes, or two and a half minutes, or so.
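
The “more critical side” of a fork is the critical path: the longest chain of dependent stages, which sets the earliest time the whole workflow can finish. A minimal sketch of computing it follows. The stage names and durations are invented, and this is not a description of how Unravel’s DAG view is built.

```python
from graphlib import TopologicalSorter


def critical_path(durations, depends_on):
    """Return (total_seconds, path) for the longest chain of dependent stages."""
    finish = {}     # earliest finish time of each stage
    previous = {}   # predecessor on the longest chain into each stage
    for stage in TopologicalSorter(depends_on).static_order():
        deps = depends_on.get(stage, set())
        slowest = max(deps, key=lambda d: finish[d], default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[stage] = start + durations[stage]
        previous[stage] = slowest
    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = previous[last]
    return total, path[::-1]


# Invented durations in seconds: a fork into two Hive stages that run in parallel.
durations = {"start": 5, "hive_a": 30, "hive_b": 90, "report": 10}
depends_on = {
    "start": set(),
    "hive_a": {"start"},
    "hive_b": {"start"},
    "report": {"hive_a", "hive_b"},
}

print(critical_path(durations, depends_on))
```

Speeding up `hive_a` here would not shorten the workflow at all, because `report` still has to wait for `hive_b`. That is why tuning effort belongs on the critical side.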

Now on the right side, as we originally saw, Unravel also does instance compare for the same workflow across runs. It will track the workflow and show you what its duration was across different runs. So over here we have the dev run, and in production, we actually ran it using Unravel. As you can see over here, it ran in under two and a half minutes, so it met our SLA, but if you look at the data I/O, in production it was about 2.6 gigs and in dev about 300MB or 400MB. So that’s about 10X more data, but the execution time was still reasonable because Unravel was able to figure out how this application uses resources and make sure that it utilizes them in a very efficient fashion, so that it can even scale up for this 10X amount of data. Not to mention, Unravel can also connect your dev cluster and your prod cluster, and bring all of their data together in one place. So then you can start to see how behavior changes across all of these different runs very easily, and compare and contrast them in a very simple fashion.

To summarize here, without Unravel, you would have to use a very manual way of figuring out “what are the reasons that an application may not scale?”, try to control those different parameters, and then try to scale that application up. With Unravel, you have a drill-down view to understand application behavior, and you’ve also got these easy compare screens to understand differences between runs, so you can figure out how this application will actually run in another environment and project that, because all the factors that affect performance are in one place. It will help you understand those reasons in a very easy, accurate manner.

The last problem that we’re going to focus on today is multi-tenancy issues. Now, multi-tenancy is a very big problem because a lot of other users may be submitting applications on the same cluster at the same time and stealing resources away from you. And it’s very important to make sure that we control these kinds of multi-tenancy problems so that your applications always meet the SLAs that you promised your business users. Now, without Unravel, you would have to figure these multi-tenancy aspects out manually. So for example, you have to go and see what other apps are running and what resources other applications are using. That information itself is not very easily available. You would have to go to, say, Ganglia, or Ambari, or CloudWatch to figure out what the total CPU is, but then you may not be able to break down how these hundred applications running right now are using that particular resource, like memory or CPU, and how much of it your application is getting.

And then you want to compare to see what’s available and then allocate resources to your particular application. And if your application for some reason is affected by multi-tenancy issues, like it’s waiting too long or it didn’t start or it just fails because of preemption, etc., then you would be able to gather that, “Hey, maybe it’s a multi-tenancy problem and we should slow down or stop other applications from running so that my application can finish on time.” So there are a lot of things over here that can go wrong, and that’s why having the view as well as the control for managing this problem is very, very crucial.

We’ll show you how a multi-tenancy issue will be solved with a product like Unravel. So again, we’re back into one of the workflows, in this case, this workflow is called “Prod ML” model. This came from Airflow as well. And in this case, we’ve got two MapReduce jobs as part of this workflow, that may be doing some transformations, and then I’ve got a Spark job at the end of it, which may be doing some analytics or building out a report, etc.

Also, if I bring my attention to the right side, on the instance compare, I can see previous runs of this workflow. This took about three minutes and twenty seconds or so. But today’s run is fifteen minutes long. What happened? And why did this particular run take so much time? So you remember how we can always click on a good run and compare good runs and bad runs side by side. So I can see this good run that finished in three minutes and the bad run that finished in fifteen minutes, and start to ask, “Hey, what happened? Why did it take so much time when it’s processing the same amount of data?” So it’s not that my bad run has 10X or 20X more data to run. It’s got the same components as well. I didn’t change the structure of this application, otherwise that would have been caught over here as well. But if I bring my attention here, I can see that these MapReduce stages that usually take about a minute, twenty seconds are taking about eleven or twelve minutes to finish now. So something’s going on that explains why these particular stages or components are taking so much time.
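
Comparing a good run and a bad run side by side amounts to lining up per-stage durations and flagging the stages that slowed down the most. A minimal sketch follows. The stage names and all the numbers are made up, loosely modeled on the runs described above, and the 2X threshold is an arbitrary choice.

```python
def slow_stages(good_run, bad_run, threshold=2.0):
    """Return {stage: slowdown_ratio} for stages at least `threshold` times slower."""
    flagged = {}
    for stage, good_seconds in good_run.items():
        bad_seconds = bad_run.get(stage)
        if bad_seconds is None:
            continue  # stage missing from the bad run: a structural change, not a slowdown
        ratio = bad_seconds / good_seconds
        if ratio >= threshold:
            flagged[stage] = round(ratio, 1)
    return flagged


# Per-stage durations in seconds; invented values for illustration.
good_run = {"mapreduce_1": 40, "mapreduce_2": 40, "spark_report": 120}
bad_run = {"mapreduce_1": 320, "mapreduce_2": 360, "spark_report": 125}

print(slow_stages(good_run, bad_run))
```

Only the two MapReduce stages are flagged, which narrows the investigation to them before any logs are opened.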

So now I can drill inside those particular components and understand what was going on, but of course Unravel will point out the problems to you by analyzing all of this information so you don’t have to manually dig into it. And in this case, it’s saying, “Out of the hundreds of possible reasons why this particular workflow could be slow, today it’s slow because the wait time to launch application master is much worse than usual.” So there’s a lot of different times and wait times, etc., that can be analyzed. Unravel has gone in and said, “Hey, the wait time to launch application master is slow,” which then we can infer is a problem where another application is running right now and that’s slowing down this particular app.
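
“Much worse than usual” implies a baseline built from earlier runs. One simple way to express that is to compare today’s wait time against the historical mean plus a few standard deviations. This is only a sketch of that idea with invented wait times; the talk does not say which method Unravel uses.

```python
from statistics import mean, stdev


def is_unusually_slow(history, today, sigmas=3.0):
    """True when today's value exceeds the historical mean by more than `sigmas` standard deviations."""
    baseline = mean(history)
    spread = stdev(history)
    return today > baseline + sigmas * spread


# Invented wait times, in seconds, before the application master launched.
past_waits = [4, 5, 6, 5, 4, 6, 5]

print(is_unusually_slow(past_waits, today=5.5))
print(is_unusually_slow(past_waits, today=420))
```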

So Unravel would also be able to alert you on this before you miss your SLA by telling you that there is resource contention in your queue, because Unravel has correlated information about resources. In this case it’s showing you, “Hey, my CPU and my memory were being used in this way at a particular point in time.” I had all these apps running over here, but then this particular application started running and started taking 70,000 MB of my memory, further slowing down or even stopping all the other applications completely.

And what I can also do is set up an automatic action in Unravel to go and kill such types of bad applications in the cluster automatically, so that I can protect my SLA-bound job and make sure that this particular app finishes on time, every time. So you see how multi-tenancy issues can be resolved using Unravel by providing priority levels to all of these different applications that are SLA-bound, and by making sure that there are policies and practices in place which prevent this kind of rogue or anomalous behavior from even occurring on the cluster, so that all these applications finish on time, every time.
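
A policy like the automatic action described above can be thought of as a rule over the running applications: flag anything over a resource limit unless it is SLA-protected. The sketch below only decides which apps to flag and does not kill anything. The app names, the limit, and the data layout are hypothetical; only the 70,000 MB figure comes from the demo.

```python
def apps_to_kill(running_apps, memory_limit_mb, protected):
    """Return names of apps over the memory limit that are not SLA-protected."""
    return [
        app["name"]
        for app in running_apps
        if app["memory_mb"] > memory_limit_mb and app["name"] not in protected
    ]


# Hypothetical snapshot of a shared queue.
running_apps = [
    {"name": "prod_ml_model", "memory_mb": 12_000},
    {"name": "adhoc_query", "memory_mb": 70_000},
    {"name": "nightly_etl", "memory_mb": 8_000},
]

print(apps_to_kill(running_apps, memory_limit_mb=50_000, protected={"prod_ml_model"}))
```

Keeping the SLA-bound workflow in a protected set means the rule can never pick it, however much memory it uses.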

So to summarize, managing workflows is very frustrating because of the kinds of issues that we’ve seen. First of all, there’s no end-to-end view, meaning that when I’m looking at a workflow, I want to make sure that I understand what the entire workflow is doing. I may not care about individual components till something goes wrong, so I want to see how all of these different components play with each other and how they perform holistically. I also want to be able to see how different runs of that workflow did, so that I can compare and contrast and figure out why one run versus the other one was bad or good and get down to those root causes very, very easily.

Another reason is there’s just too much to look at. In our first example, when we saw the failed application and the killed application, we had to jump from application view to resource view to configuration settings and everything else in between, and even then, there wasn’t complete information to triangulate why that application actually failed. So there’s just too much to look at and it’s a very manual way to resolve these problems. And when we do this resolution, say for example, fixing configuration settings, we want to get those answers quickly. We want instant gratification: something that just tells me why these configurations are bad and which problems we’re actually running into.

A lot of times we see, even among the different customers that use the product, that they would spend a lot of time trying to get these things right, and they spend about seventy percent of their day fighting these issues. Now, we don’t want those things to happen. And the way Unravel simplifies workflow management is, as you saw, we can give you that end-to-end view of a workflow, so you can truly understand what’s happening. We will give you all the correlated information in one place. Application performance is affected by a lot of different factors, and Unravel connects all those dots for you and helps you understand why something happened and how you can go and resolve it as well. And that’s through our automated root cause analysis and resolution.

So to summarize, when you’re looking at fragmented metrics, Unravel will give you unified KPIs. When you’re manually trying to sift through these logs, which may be locked up and may run to hundreds of thousands of screens, Unravel will do that parsing automatically and help you get to the cause very, very quickly. And Unravel’s got correlated information and, using machine learning and AI, we actually provide you insights about what’s going on and recommendations for how you get out of these particular issues. That’s how we are taking a step forward to actually fix all these mission-critical workflows that affect every company today that’s running big data. That’s all we have today, happy to take any questions that you may have.

Audrey: The first question is, “Does this work with other workflow engines like BMC?”

Kunal: Thank you, Audrey. So Unravel is designed to work with a variety of different workflow engines. I know today’s example was restricted to Oozie and a couple of examples on Airflow. Let me show you a quick slide here that will help you answer the question. So Unravel is deeply integrated and if you bring your attention to depth tools and workflow schedulers, these are all the different workflow tools that we integrate natively with, meaning we will connect to your Oozie server, or your Airflow, or Control-M, or Cron, etc., to be able to extract all those workflow definitions from there and be able to create those nice views and analysis that you saw in the product in the demo today.

Audrey: Great. Next question, “How are you bringing data about workflows?”

Kunal: Great question. So just a follow-up of that, really. As you saw, our workflow pages have a tremendous amount of information about execution, resource usage, container usage, configuration settings, and a bunch of other things, all the different components and compared views and things like that. So Unravel is a full-stack performance intelligence tool, meaning we gather data from all the layers of the ecosystem, from applications all the way down to infrastructure and everything in between, and bring that into the Unravel service. So we’ve got multiple plug points, API-driven plug points primarily, to be able to get job info, workflow info, your CPU and memory information, all of your service-level information, etc. But the magic really is in not just aggregating this information but correlating and analyzing it to be able to show you those insights and those recommendations. You can deploy Unravel either on-premises or in the cloud, depending on where you’re running your cluster.

And the Unravel service will be maintained within your environment so that you can send all that data, and then you have a nice clean web UI to actually be able to view and play around with those datasets.

Audrey: Okay. Next question, “Will this work with Amazon?”

Kunal: Absolutely. So, with Amazon, if you’re running EMR or any of these other different Amazon services, or even on Azure, we’ve got very neat integrations for the cloud. Unravel works on Azure, Amazon, Databricks, Qubole, all these different cloud vendors, in the same way as we work for Cloudera, Hortonworks, MapR. So you’ve got similar plug points to be able to extract all of this telemetry data, logs, profiles about application, resource, fine line server graphs, etc. that we can bring in. And once Unravel has all of these datasets, it can then start to analyze them to help you optimize your applications, to troubleshoot any problems that are going on, and last but not least, to figure out who’s using the cluster, how I should set my priority levels, and how I should set up auto actions.

And on the cloud especially, when you’ve got these autoscaling clusters, it becomes even more important to figure out, “Hey, if I have a slow app, will adding more nodes speed up my application?” And Unravel can do that kind of analysis and help you understand not only how you get the best performance but, on these cloud clusters, do you really need to autoscale? Do you really need to spend that much money to run these applications within your timeframe? And how do you get more of a cost view of your application as well and try to bring those costs down?

Audrey: Great. It looks like we have time for maybe one more question, since we’re past the half hour, and you may have already touched on this but it says, “Does it look into YARN as well as multi-tenant queues?” That’s the first part of the question. The second part is, “Does it support NiFi as well for processors and containers?”

Kunal: Very good question. So, absolutely, one of the data sources that we look at is Fair Share Scheduler, Capacity Scheduler, even your YARN pool’s queue levels. And why? Because, as that last example that we saw on multi-tenancy showed, it’s very important to understand what your application is doing as well as what else is happening in your environment. So meaning, whenever we’re trying to set these levels, you know, in this case there was an application that wasn’t supposed to be running at this particular time and wasn’t supposed to be taking so many resources, and we were able to detect that because we knew what else runs in that cluster and that queue at that particular time.

Another example may be in some cases that you’re supposed to give more resources to your application for it to run properly. But what resources can you definitely guarantee to it while making sure other applications are not affected by it? So Unravel takes into account not only what’s happening with that particular app but what’s happening holistically across your entire cluster so that it can recommend those values within those fringes that are possible and not interfering with everything else running on your cluster.

Now, about NiFi or any other, you know, real-time workflow or any real-time data pipelining that you may want to do, Unravel does have support for that. We’re building out NiFi support in particular. It’s in beta right now, it hasn’t been GA’d just yet. But when you’re looking at real-time applications, they have their own particular challenges. We actually did a webinar a week or two ago on real-time workflows on these data pipelines and how IoT applications and Spark Streaming applications can be controlled and managed as well. So I would definitely encourage you to check out those webinar pieces too, to get more of a view into how Unravel not only solves problems with these batch workflows but how it solves problems with Kafka, Spark Streaming, NiFi kind of workflows as well.

Audrey: Great. Thank you, Kunal. I think that’s all we have time for today. But if you have any additional questions about today’s content, I’m sure Kunal would be happy to answer them. Just let us know in the post-webinar survey, or you can go to unraveldata.wpengine.com and contact us for more information, to get a one-on-one demo, or to start a free trial. Also, let us know in the post-webinar survey if there are specific use cases or topics you’d like to see us cover in future webinars. We’d greatly appreciate it. Thanks again, Kunal, and thanks, everyone, for joining today’s webinar. We’ll see you next time.

Kunal: Thanks so much for joining.