How to Build Fast and Reliable Big Data Apps on Microsoft Azure HDInsight

Webinar: How to build fast and reliable big data apps on Azure HDInsight clusters, on premises or in the cloud, using Azure, Spark, Kafka, Hadoop, etc. with transparent performance management provided by Unravel. Features Pranav Rastogi, Azure Big Data program manager, and Shivnath Babu, Cofounder and CTO of Unravel Data.

Transcript of video (subtitles provided for silent viewing)

– [Audrey] Hi, everyone and welcome to today’s webinar, How to Build Fast and Reliable Big Data Apps on Microsoft Azure HDInsight and Azure Analytical Services. Today’s webinar will take about 45 minutes. After the presentation, we’ll close it out with about ten minutes of Q&A. Just a few housekeeping items before we begin.

If you have any questions, please submit them in the GoToWebinar panel at any time and we’ll get to them during the Q&A portion. Today’s webinar is being recorded and we’ll make sure to send you an email with a link to the recording after the webinar. With that, I’d like to introduce today’s speakers, Pranav Rastogi, Program Manager for Microsoft Azure Big Data, and Shivnath Babu, CTO at Unravel. And now, I’ll pass it over to Pranav.

– [Pranav] Thank you so much and thank you, everyone, for attending this webinar. So, I am part of the Microsoft Azure Big Data Group, mostly working on our Big Data ecosystem and on how we can make customers more successful.

And one of the key challenges that we’ve seen with the Big Data stack as it has been used on-premises for analytics is: getting started with Big Data is hard. It is challenging because you have to plan out what your hardware configuration is going to look like, and you have to install the required software components: Hadoop, Spark, and all the other open source projects that you can use for analytics.

Then, from an enterprise perspective, once you’ve installed everything, you need to secure it, which means ensuring that you have network isolation, and ensuring that you have authentication so that only authorized users can access the system and perform operations.

And while you’re in production, you spend a lot of time optimizing all of the various engines that make up a solution. So, you’ll optimize, for example, the executors in Spark, or you’ll optimize the resource settings, like the YARN settings.
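For illustration, here is a minimal sketch, not from the webinar, of the kinds of executor- and YARN-level knobs that typically get hand-tuned this way; all values are illustrative.

```scala
// A minimal sketch of the settings people iterate on by trial and error;
// the values shown are illustrative, not recommendations.
import org.apache.spark.sql.SparkSession

object HandTunedJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hand-tuned-etl")
      .config("spark.executor.instances", "10")      // how many executors YARN launches
      .config("spark.executor.cores", "4")           // cores per executor
      .config("spark.executor.memory", "8g")         // JVM heap per executor
      .config("spark.executor.memoryOverhead", "1g") // off-heap headroom YARN reserves
      .getOrCreate()

    // ... application logic would go here ...
    spark.stop()
  }
}
```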

And then, you’ll spend a lot of time debugging. And then, eventually, you’ll run into a wall where you will have to scale the entire system up or down to meet your SLA requirements for the solution. And once you hit that wall, you’re back to Step One, where you now need to provision more hardware, which means going to your IT team and making budget asks.

And then you go through the entire process again, where you end up spending a lot of time configuring, optimizing, and debugging your Big Data solution. So, what we’re trying to do here is simplify the problem, and part of it is solved with Azure HDInsight, which makes it very easy to get started with the Big Data open source analytical ecosystem: you can just browse to a UI-based portal, the Azure portal, and provide some very basic cluster details in terms of what kind of cluster you want.

Do you want a Hadoop cluster or do you want Spark, Kafka, R? Then, you can click “create,” which is going to create a cluster for you, and you can start focusing on your analytics. And Unravel complements the scenario where, once you have started with this project and you want to optimize your solution, Unravel can help you with the application side of the house in terms of detecting bottlenecks from an application perspective.

And the key reason why customers are picking up Azure HDInsight is because of the following value proposition: it is fully managed, full-spectrum, open source analytics for the enterprise. What this means, from a customer perspective, is customers don’t have to worry about setting up clusters for all of these open source workloads and more.

HDInsight is a fully-managed service where Microsoft will set up a cluster with one or more of these components and provide a financially backed SLA around the availability of the cluster. That allows you, as the user, to focus more on your application rather than worrying about low-level infrastructure concerns like, “How do I install Hive? How do I configure Hive? How do I make sure that it’s highly available?”

So, those are the platform benefits that you get by using HDInsight. It’s also globally available. It’s available in 26-plus regions in Azure, and it’s also available in some of the sovereign clouds in the U.S., Germany, and China, which allows users to meet the regulatory and compliance requirements of the local regions and ensures that the analytical solutions they’re building can be used in highly regulated industries, the financial sector being a key one of them.

It’s also secure and compliant. That means you can deploy an HDInsight cluster inside a virtual network, which gives you network-level isolation, meaning you can control what traffic goes in and out of the network. You can encrypt your data, and you can authenticate users with your Active Directory.

So, you can connect with LDAP, with OAuth, and with all these protocols so that only authenticated users have access to the cluster. You can also have authorization policies, based on Active Directory or Ranger, which allow you to define fine-grained access: which users can access which data, which table, or which column. And you can even mask some of the sensitive data so that it’s not visible to users.

That allows you to protect your data in a fairly fine-grained, access-controlled manner. It’s also productive. There are a lot of productivity tools supported: there’s Visual Studio support, there’s IntelliJ and Eclipse support, there’s support for Notebooks like Jupyter and Zeppelin, and there’s support for BI tools like Power BI and Tableau.

So, there’s a wide variety of developer personas supported on HDInsight. And it’s fairly cost-effective. There was a study done by IDC showing that, as enterprises move from on-premises to the cloud, the total cost of ownership gets lowered by about sixty percent.

And it’s extensible, which means that you can bring in any other open source project that you like. For example, you can bring in Presto, you can bring in OpenTSDB, and you can also install some of the industry’s leading solutions on top of HDInsight.

So, there’s an offering called the Azure HDInsight Application Platform, which gives customers a one-click deploy experience for deploying any of these applications on the cluster itself. And one of the key innovations that we have done in the cloud is the separation of the metastore and storage from the cluster. So, the clusters are stateless: you can delete your cluster and recreate your cluster without losing any data, because all of your state is stored externally. You can use Azure SQL Database for the Hive metastore, and you can store your actual data in Azure Data Lake or Azure Blob Storage.
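Here is a small sketch of what that statelessness looks like in practice. The storage account and container names are hypothetical placeholders; on HDInsight, Azure Blob Storage is addressed with the wasbs:// scheme (abfss:// is used for Azure Data Lake Storage Gen2).

```scala
// A sketch of a Spark job whose input and output both live in remote
// storage, so the cluster itself holds no state worth keeping.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StatelessClusterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-prep").getOrCreate()

    // Read raw input from the remote store, not from cluster-local disk.
    val raw = spark.read.parquet("wasbs://raw@mydata.blob.core.windows.net/events/")

    // Write prepared output back to the remote store, so this cluster can be
    // dropped and a different cluster (say, data science) can pick it up.
    raw.filter(col("status") === "ok")
      .write.mode("overwrite")
      .parquet("wasbs://curated@mydata.blob.core.windows.net/events-clean/")

    spark.stop()
  }
}
```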

And so, this is a diagram which shows that you can have one cluster that is processing some data, maybe running a batch ETL job, against data that is stored in a remote store. Then, you can have another cluster, which you can drop and recreate, that could be your data science cluster, which can use the data that was prepared by your ETL cluster.

Or you can have many more clusters for, say, BI scenarios and R users. Each of these clusters can be sized differently based on its workload, and each of these clusters can scale up and down based on the load of the workload, which allows you to get the best benefit in terms of a cost-to-performance ratio.

And HDInsight has got fairly good coverage in the press. As of December of last year, the price of the service was cut by up to fifty percent for all workloads and up to eighty percent for R Server, and this was in recognition of the fact that lots of customers are moving to the cloud, and we want to make sure that Azure is the best place to move all of your analytical workloads.

This is a case study done by Microsoft, where Microsoft is using the Kafka service on HDInsight, and it’s operating at a fairly high scale. So, this is doing all of the ingestion using Kafka. And as you can see, the scale at which we’re operating is fairly high: about eight million events per second, about 800 terabytes of ingress per day. And this is one of the largest deployments of Kafka that exists.
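To picture that ingestion path, here is a minimal sketch using the standard Kafka producer API. The broker address, topic name, and payload are placeholders, not details from the case study.

```scala
// A minimal event-ingestion sketch with the standard Kafka producer client.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "wn0-kafka.example.net:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Each event becomes one record on a topic; Kafka spreads records across
    // partitions, which is what lets throughput scale to millions of events
    // per second across many brokers.
    producer.send(new ProducerRecord[String, String]("events", "device-42", """{"temp":21.5}"""))
    producer.close()
  }
}
```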

HDInsight is also used across all verticals. There’s Johnson Controls, with a connected-buildings scenario in the manufacturing space.

Heartland Bank is doing fraud detection, and Roche Healthcare is driving transformation in medical devices in terms of how diagnostics happen. And there are other verticals as well. For example, we have Jet, a retailer, doing online retailing on HDInsight.

So, it spans a variety of scenarios. And I’m going to hand it over to Shivnath to talk about how you can leverage Unravel on HDInsight to do, sort of, application performance management for a variety of Big Data scenarios.

– [Shivnath] Thanks, Pranav. So, I’m Shivnath Babu, I’m CTO and co-founder at Unravel. So, as Pranav described, HDInsight, and Azure in general, is being used to power many mission-critical applications across a variety of enterprises. And these applications range from more traditional data warehousing and business intelligence applications, to more modern real-time streaming and IoT applications, to the very innovative AI and machine learning applications.

And all of these applications are being built on a new distributed stack that comprises systems like Apache Spark, Hive, and Azure Blob Storage. And all of these are really distributed systems, so it should not be a surprise, to the many enterprises getting into Big Data and especially into the cloud, that things can go wrong with applications.

So, maybe an application that is very mission-critical fails all of a sudden at 2 a.m., and it becomes a nightmare to quickly troubleshoot, understand what happened, and get the application running again. Or an application that is supposed to finish within an SLA may not be finishing within that time, or there could be suboptimal applications that are using up a lot of resources and burning money on the cloud.

It’s very hard to troubleshoot these sorts of problems in a manual fashion. For example, many of these applications generate logs, and these are not just one log file but could be hundreds, thousands, if not tens of thousands of log files, and trying to understand why an application failed or why it’s slow by going through all of these logs often takes a long time, and sometimes it’s just infeasible, right?

And many of these systems have metrics, literally many thousands of metrics, and what is unfortunate is these metrics lack application context. So, without good tools, manually troubleshooting these problems is hard. And if you think about it, even one application from one user can be hard to troubleshoot and make reliable.

And as more applications come onto the platform and more users come onto the platform, with a huge diversity in the skill sets of these users, and as applications go beyond BI and data warehousing to AI and ML, combining Kafka, Spark, and HBase, this problem just becomes very hard to handle, and that can directly impact the business and productivity.

And the very fast time-to-market that every business wants can be affected, because DevOps teams are spending more time troubleshooting issues rather than getting applications to market. At the same time, inefficient applications can lead to high cost on the cloud, and it might not be possible to get the very high cost benefits that the cloud promises, the sixty percent that Pranav was talking about.

And the worst thing is, sometimes, for a data-driven business, the SLAs of the business may not be met. Now, APM is not a new industry; it has existed for different kinds of applications in the past. In the late 1990s and early 2000s, when J2EE applications were very popular and heavily used, companies like Wily and Quest served that need.

And fast-forward into the late 2000s: as more transactional applications, especially driven by the web, and mobile applications came about, companies like AppDynamics and New Relic started addressing that need of the market. Today, the need is for distributed data applications running on systems like Spark, Hadoop, Hive, and whatnot, and that really requires a new kind of thinking, because this is a new breed of applications.

And we have built solutions in Unravel to provide application performance management for these modern and emerging applications. Unravel collects monitoring information from the entire stack, not only from the application level but also from the Spark and services level, as well as the infrastructure level, from the VMs, from the cloud as well as on-prem systems.

All of this data continuously streams into the Unravel platform where it’s analyzed using machine learning and AI algorithms to provide these key value props to all enterprises that are using systems like Azure HDInsight. You can guarantee performance with literally one click by having automated solutions to understand what caused a problem to happen and how to fix it.

At the same time, the same kind of analysis can also tell you inefficiencies in applications and clusters that can help you maximize utilization and really control your cloud expenses. Last but not least, Unravel can provide you a single pane of glass to combine all the monitoring information from what is happening at the application and infrastructure level to what is happening at the business SLA level, in one place, to help with even things like migrating from on-prem to the cloud or jointly running hybrid solutions which are on-prem and cloud as well as help with the growth and planning.

So, here is a case study from a very large healthcare firm that is a joint customer of Azure HDInsight and Unravel. So, this firm, while they were creating and bringing more applications to the cloud, ran into a whole bunch of issues. One of the first things they realized is it was taking them a long time to troubleshoot application performance issues.

And as they moved more and more applications into the cloud, they also felt that the inefficiencies in the applications were costing them a huge amount of money. And the worst thing was, bringing new applications into production was taking months, right? And the operations team felt that they did not have the visibility to understand, control, and fix these problems.

And they started using Unravel and within a matter of three months, here are the KPIs that the team has been getting. First and foremost, they now feel confident that they have the end-to-end visibility to understand what is happening. But from a very specific day-to-day perspective, they’re seeing a huge reduction in the time it takes to troubleshoot issues.

They’re seeing at least fifty percent savings compared to what they were spending earlier to run the same, and even a larger and growing, number of applications. But the key thing, and the most important thing for them, is that it is taking far less time, often literally just a week, to move a new application into production. So, what I’d like to do in the rest of the webinar is quickly show you a demo.

And I’ll cover a few aspects of Unravel on Azure HDInsight. First, I want to tell you how easy it is to bring up Unravel on an Azure HDInsight cluster, or to monitor any number of transient or scaling Azure HDInsight clusters. And then I will lead you through how Unravel can provide each of these key value props to your enterprise: reduction in troubleshooting times, reduction in cost, as well as making you feel very confident that you can actually meet all of your business SLAs and move more and more applications into the cloud.

So, I’m going to start here, right here, with the Azure Marketplace portal, right? If you go and search for Unravel right in the Azure Marketplace, the Unravel app will show up right here. We can click and go into the Unravel app page on the Azure Marketplace, right?

This is a great place to get started. All the information that you need to get started, for example, how to bring up Unravel, right, on an Azure HDInsight cluster or any number of Azure HDInsight clusters is right here. We have two ways in which Unravel supports HDInsight.

One is the HDInsight app: literally, when you’re bringing up a cluster from the Azure portal, like spinning up, in this case, a Spark cluster, Unravel is an application that you can add on to the cluster with one click. So, the cluster literally comes up with Unravel, right? Now, if you’re using many clusters, and a lot of our customers use Kafka, Spark, Hive, HBase, many different types of clusters, many of which could be running concurrently, Unravel has a solution where Unravel itself runs outside of these clusters, collecting information and giving a single pane of glass across all of these clusters.

And that’s called the Unravel VM installation mode. And both of these are supported so, now, if I go to Azure portal, this is Unravel’s Azure portal where you can see all the different kinds of HDInsight clusters that we are running currently; I can go into one of these clusters.

So, this is an HDInsight cluster, a Spark cluster, that is being monitored and managed by Unravel. If you click on this application link right here, you’ll be able to go directly into the Unravel portal. So, see how, literally with one click from the Azure HDInsight cluster, I was able to come to the Unravel interface.

So, Unravel, from a very quick overview perspective, has many different capabilities. We have operations capabilities around understanding what is going on in the cluster and where cost is being spent. We have capabilities around applications, which is where I’m going to start right here, and we can monitor and manage many kinds of applications across SQL, across MPP systems, across Kafka and NoSQL, but let’s start with Spark.

So, we’re going to go into a Spark application right here; this is a Spark application that was run on an HDInsight cluster. And first and foremost, on any Unravel application page, you can see the source of the application. As I mentioned, this is a complex Spark application, and you can see the Scala source of the application right here, right?

This application took a significant amount of time to run. So, the developer of this application, or the operations person in charge of it, may want to understand why this application is spending so much time and how to optimize it. We have created very good visuals to help you understand and answer questions like this.

For example, this Spark application seems to have a bottleneck job that’s taking six minutes of the overall seven-minute run of the application. You can drill down to the level of the individual job and get into the level of the actual tasks and containers this application is spinning up; it seems this application is running two tasks on two different hosts.

And you can hover on it and see all the different kinds of metrics at whatever fine granularity you want, right? But most of the time, what people are interested in is where this time is actually going in the application itself. So, in Unravel, we help you connect time and resource usage with the lines of code in your applications.

See how easy it was for me to understand that line 68 in my program is where all this bottleneck is coming from, right? And along with this, you can understand resource usage at a fine grain, at the level of executors and containers, in application logs, errors in the application configuration, and whatnot. But along with all of this, and while being a single pane of glass, the very critical functionality that Unravel provides is that it analyzes all of this monitoring data, the full-stack data from applications to platform to resources, and gives you crisp recommendations like this.

It shows what the current settings for the application are, at a source or at a configuration level, and how they can be changed in order to improve performance. And instead of just giving you values like this, Unravel also explains why: why certain settings are causing, maybe, resources to be underutilized or contended, right?

And everything is right here, in a single pane of glass, right? And if you take this particular application and apply the Unravel recommendations, that same application can run, in this case, fifty percent faster, sometimes even faster. On the cloud, faster does not just mean that you are getting good performance; it also means that you are running your clusters for much less time and getting savings in terms of cost as well, right?

So, along with Spark and Hive applications like this, Unravel can also monitor data pipelines end-to-end, right? Here, I’m going to go into a data pipeline. So, this data pipeline is called the Production Machine Learning model. It’s representative of the kind of data pipelines that a lot of our customers run in production.

Notice that this data pipeline basically consists of, like, you know, some MapReduce applications to ingest data into the cluster along with some Spark applications that are building machine learning models on top of this data, right?

So, what you’re seeing right here is this application is running repeatedly, right? So, every data point here represents a run of this application. So, the machine learning model is getting constantly updated. The SLA for this application is around five minutes, right?

And there was a point in time when the application really misses this SLA. So, I can drill down into why the SLA was missed. So, right now, I’m drilling down into that bad run where the SLA was missed, and you can see how there was a component, component 580 of the application, that ran for much longer and caused the SLA to be missed.

Along with this, Unravel can actually provide you SLA analysis to say that, “Look, this application missed the SLA because it was actually suffering from contention in the cluster,” right? And there could be many reasons why SLAs could be missed. It could be that maybe there was a change in the amount of data that the application needed to process or there was some issue with the change in the configuration of the cluster or some noisy neighbors that are going on.

So, Unravel can actually analyze all of these lower-level scenarios that can happen and give you a very crisp understanding of why problems might be happening in the cluster. So, along with helping you troubleshoot problems like this, Unravel is proactive: it can alert you on why an application issue happened.

So, if I were to drill down into the level of the operations where Unravel provides alerts, right, you can see that not only will Unravel help you understand why the problem happened, it can help you prevent these problems from happening.

In this specific case, the contention for our SLA-bound application 584 was caused by a rogue application that came up in the cluster. So, Unravel has this feature called Auto Actions which, in this case, triggered and proactively flagged that this rogue application is what is causing the problem in the cluster.

Now, to deal with use cases like SLA management across one or many clusters, we have the capability called Auto Actions. Auto Actions is Unravel’s feature where you can register rules, with business context, on the cluster, and these rules can span best practices that you want to enforce, contention, SLA management, all of these things.
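To make the idea concrete, here is a purely illustrative sketch of the shape of such a rule; this is not Unravel’s actual API or rule format. The threshold, the hard-coded application, and the ResourceManager host are hypothetical, though the YARN REST endpoint for changing an application’s state is the standard one.

```scala
// An illustrative rogue-application rule: if an app exceeds a memory
// threshold, kill it via the YARN ResourceManager REST API.
import java.net.{HttpURLConnection, URL}

object RogueAppRuleSketch {
  val memoryLimitMb = 512 * 1024 // hypothetical per-app memory threshold

  def killApp(appId: String): Unit = {
    // rm-host.example.net is a placeholder ResourceManager address.
    val url = new URL(s"http://rm-host.example.net:8088/ws/v1/cluster/apps/$appId/state")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write("""{"state": "KILLED"}""".getBytes("UTF-8"))
    println(s"kill $appId -> HTTP ${conn.getResponseCode}")
  }

  def main(args: Array[String]): Unit = {
    // In a real rule engine this check would run on live cluster metrics;
    // a hard-coded pair stands in for a detected rogue application here.
    val apps = Seq(("application_1520000000000_0042", 600 * 1024))
    for ((appId, usedMb) <- apps if usedMb > memoryLimitMb) killApp(appId)
  }
}
```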

And if you look at any of these individual rules, these are things that can be configured via an Unravel API. I’m going to show you a very interesting use case that has been powered using these Unravel Auto Actions. So, here, what you’re looking at is an auto-scaling cluster.

So, this is a view where the black line represents the total amount of resources in the cluster, and you can see how Unravel is automatically scaling the cluster up and down. And let’s go to a finer grain to understand what really is happening. So, if you notice, now we’re looking at a five-minute interval view of the cluster where you can see how, based on the green line, which is the actual usage and the SLA needs of applications, the cluster is being scaled up and down.

And you can go to any point of time and understand which application, which SLA need, required the cluster to be scaled up. So, this is a very powerful use case that helps you connect applications, SLAs, and business needs in a way where you are not provisioning very large clusters to meet those SLAs but having the cluster cost really be applied and optimized for the use case.

Business-SLA-aware auto-scaling is something that is now very much possible with the features in Unravel. Along with tracking data pipelines, Spark and Hive kinds of data pipelines, Unravel also has very rich support for new-age applications, like real-time streaming applications that might come from IoT, right?

Here, we have a sentiment monitor application that is basically going across Kafka/HBase. If I were to show you the actual source of this application, you can see that the application comprises Kafka, it comprises HBase.

It’s really extracting the sentiment from the Tweets, and all of these green lines here represent the rate at which data is coming from Kafka into the cluster. And the actual application itself is written in Spark Streaming, processing this and trying to deliver the results in real time. Now, Unravel has the recommendations and monitoring even to improve applications like this that, in an HDInsight world, cut across multiple clusters.
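As a rough sketch of the shape of such a pipeline: the broker and topic names below are placeholders, the sentiment scorer is a toy stand-in for the real model, and console output stands in for the HBase sink the actual application uses. This uses Spark Structured Streaming and requires the spark-sql-kafka connector on the classpath.

```scala
// Consume tweets from Kafka, score sentiment, emit results continuously.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object SentimentMonitorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sentiment-monitor").getOrCreate()

    // Read the tweet stream from Kafka; each record's value is the text.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "wn0-kafka.example.net:9092") // placeholder
      .option("subscribe", "tweets")                                   // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS text")

    // Toy scorer; the real application would apply an NLP model here.
    val scoreSentiment = udf((text: String) => if (text.contains(":)")) 1.0 else -1.0)

    val scored = tweets.withColumn("sentiment", scoreSentiment(col("text")))

    // Deliver results continuously (the real app writes to HBase instead).
    scored.writeStream.format("console").start().awaitTermination()
  }
}
```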

There is a Kafka cluster, there is a Spark Streaming cluster, and an HBase cluster, all of them tied together by the application. And, as I mentioned, Unravel still is a single pane of glass across all these different applications. I can zoom in to the level of Kafka itself.

So, here, I can see multiple Kafka clusters being monitored, right? And Unravel can track exactly what is going on at the level of Kafka, and we can even go into the level of the individual topics, right, or the exact list of consumers that might be consuming from these clusters.

So, all the information comes in a single pane of glass. So now, as you can imagine, as application information, resource information, data information, everything comes together, this rich suite of data can power use cases beyond day-to-day application monitoring and SLA management; it can also power use cases such as migrating to the cloud, right?

A lot of our customers run on-prem clusters as well as cloud clusters, and many of them are, today, in that phase where they’re running on-prem clusters and, slowly but surely, want to move to the cloud. So, what you’re seeing here is Unravel’s capability to connect applications to departments and projects, business-specific metadata like that, and help you understand, from the perspective of a project or a department, how resource usage in the cluster is happening, how much has been consumed, what the cost is, what the cost of running an application on-prem is, as well as what the potential cost would be of running it on the cloud.

And not only can we give you the detail at this level, we can also drill down to any level of granularity, like individual applications, right? Which of the applications coming from a project are taking more resources than others? And sometimes, what happens is, to migrate applications to the cloud, you have to understand what datasets those applications are using.

So, notice how easily I can come to the level of the application itself, the exact application that is part of the project, and understand which datasets and which tables are being accessed. And I can also go into that level of detail to know where those datasets are and what the other dependencies are: other applications that these datasets might be consumed by, right?

So, once all of this information comes together, migration can be done very confidently without any risk of breaking applications that are already consuming these datasets, not to mention you can bring these applications on the cluster, manage and control cost, like I showed you earlier.

So, overall, we saw a lot of different use cases of how Unravel can help you improve your productivity and bring applications to production very, very quickly, while, at the same time, being in good control of cost and ensuring that all your business SLAs are being met. Unravel is very deeply integrated into the entire ecosystem.

We saw examples of different applications that Unravel can support, along with different data pipelines, from Spark-oriented pipelines to real-time streaming pipelines. We support on-prem as well as cloud clusters so that you can confidently migrate as well as ensure that you’re getting the best cost-benefit ratio.

We are also integrated very well with all the security mechanisms that are very critical to enterprises today from Kerberos to Active Directory and LDAP as well as alerting and collaboration tools that you might be using like PagerDuty.

And together, what we saw in the webinar is that Azure HDInsight and Unravel give you a very complete solution for all your Big Data needs: from the variety of open source software that you can easily spin up on the cloud, to managing it via the Azure portal, monitoring all of the platforms using Azure Log Analytics, and then getting deep application performance management with Unravel.

So, all the different use cases for the entire lifecycle of Big Data projects can be managed using Azure and Unravel together. So, with that, we’d love to take questions.

– [Audrey] Great. Thank you, Pranav, and thank you, Shivnath. We have a few questions and we’ll try to get to as many of them as possible. Okay. First question, “What open source frameworks do HDInsight and Unravel support?”

– [Pranav] So, Azure HDInsight supports a variety of open source projects. Some of the key ones: you can get a fully-managed service for Hadoop, which includes all the Hadoop components like Hive and Pig, plus Spark, HBase, Kafka, Storm, and Hive LLAP, as well as R, via the open source R Server.

And beyond these open source components which are fully supported, you can also install any of your favorite open source projects via an extensibility mechanism called Script Actions. So, you can install open source projects like Presto, OpenTSDB, you know, and anything else from the open source ecosystem.

– [Audrey] Okay. Thank you, Pranav. “What types of applications can I monitor using Unravel?”

– [Shivnath] I can take this question. So, as I showed in the demo, Unravel can monitor and manage a suite of applications, from SQL and BI to programmatic applications coming from Spark or Pig or Cascading as well as streaming applications that might be combining multiple systems like Spark, HBase, and Kafka.

And more recently, we have been adding richer and richer support for AI and ML applications, and more iterative applications that might even be running on GPUs.

– [Audrey] Okay. Thank you, Shivnath. Next question, “Can Unravel work on HDInsight when the cluster is Kerberized?”

– [Shivnath] Absolutely. Unravel supports Kerberized clusters and all of these different security mechanisms, including the authentication as well as the authorization mechanisms, like Ranger, that Pranav talked about.

– [Audrey] Great. All right. We have a few more. I think we have a few more minutes to take a couple. “Is there any effect on compliance? Does all the data live in the cluster or is it being sent off to another system?”

– [Shivnath] So, all the data resides within the VNET, as it’s called in Azure, of the customer itself, so no data goes outside the customer’s network. And going even further, the HDInsight app actually runs Unravel within the individual HDInsight cluster.

So, compliance is something that Unravel takes very, very seriously, and the customer is in full control of where all the data collected by Unravel resides.

– [Audrey] Okay.

– [Pranav] And to add to it, just in general for Azure, HDInsight, and other services, all of the data is stored in an account that the customer owns, and you can encrypt the data using keys that you own. You can also restrict network traffic in terms of what ports are open and closed.

And the services also meet a lot of compliance standards, like PHI and HIPAA, and are pursuing GDPR as well, which should allow you to build applications that can process and store sensitive data in an account that you own.

– [Audrey] Okay, great. Next question, “HDInsight uses Hortonworks flavors. What’s the plan for moving to Hadoop 3 and container support similar to new HDP?”

– [Pranav] So, Hadoop 3 is a fairly significant change for the open source ecosystem. I believe it’s been a couple of years since we had the last major rev to Hadoop. And in this release, even though it’s just called Hadoop 3, all of the major open source projects are going to go through a major version upgrade. So, you’re going to get a new version of Hive, you’re going to get new versions of HBase, Ambari, all of these open source projects. Plus, there are some fairly key enhancements coming in 3.X around containerization of apps, which means that you can containerize your apps and run them on YARN.

And there’s also support for deep learning, where, now, you can manage GPU resources through YARN itself. So, it’s a fairly significant set of changes and improvements that are coming to the Hadoop ecosystem. And given that the change from Hadoop 2.X to 3.X is going to be a major change, we want to make sure that, when enterprises make this change, the stack is ready.

So, we’re going through a fairly rigorous certification process as we speak, and we will make Hadoop 3.X available at some point in time.

– [Audrey] Great. This is a good one… “What is the best way to size an HDInsight cluster? How many worker nodes are needed for this much data?”

– [Pranav] That’s a really good question. So, I think the best way to think about that problem is: a) it’s a very hard problem, and b) a few factors go into it. Basically, sizing of the cluster matters based on what SLA you want.

Do you want to process the job faster, or do you want a high-concurrency option where lots of users can query the system, or do you want to size the cluster to ensure that the downstream and upstream processes are taken care of? So, let’s say you’re running an ETL job and you want to serve it to your BI users in under five minutes or so. Or you want to size your cluster based on the amount of data that you’re processing.

So, it’s a fairly dynamic space where, based on the SLAs that you want to set, the sizing can vary. And even within the sizing itself, you can choose different VM sizes, which will give you a different throughput and a different memory-to-core ratio. So, in general, a good place to start is the defaults: all of our clusters are optimized by default for good performance.

And then, as you, sort of, think about your SLA, you can start, sort of, using Unravel in your application stack to figure out, like, what does your current processing look like from an application perspective? What SLAs are you getting for your job processing? What SLAs are you getting for your pipeline, like, ETL pipeline processing or an ML pipeline processing?

And then, try to find out what the pattern looks like for your application, and use that insight to scale or size your cluster, or set specific auto-scaling rules in Unravel to scale the cluster up and down. So, just to summarize, there are a lot of vectors which you can use to size the cluster, but start with the defaults.

Use Unravel to figure out what your current work distribution or query performance looks like, set some SLAs, and then use some of the Unravel capabilities around auto-scaling to get you to the right cluster sizing.

– [Audrey] Great. Shivnath, do you have anything to add there?

– [Shivnath] What Pranav said is the exact methodology. If I were to add one more thing, a feature that I didn’t demo today: Unravel also has a capability to do entire-cluster workload analysis to help you with decisions like this, and to configure the cluster-level defaults so that the cluster sizes are optimized for the workloads and SLAs being run on them.

– [Audrey] Great. Okay. I think we have time for maybe one or two more questions. “In the case of Spark applications which are running with dynamic allocation, how do the Unravel recommendations work?”

– [Shivnath] So, Unravel supports Spark applications that are running with dynamic allocation enabled, as well as not. In the case where dynamic allocation is enabled, one of the key types of configuration that becomes super important is the size of the containers themselves.

And container sizing is basically a pretty hard problem. The way Unravel tackles it is by observing what is happening at the level of each and every container, correlating the vcores and the memory usage of the container with how much data and how much actual JVM-level usage the tasks within the container are taking.

And it uses all of this information, along with some machine learning analysis, to size the containers appropriately for efficiency or reliability. And once the containers are sized, the next important question usually is, “Do you have the right degree of parallelism?” In Spark’s world, this is the number of partitions, which is controlled in many different ways depending on whether the RDDs and DataFrames are generated from the input, are intermediate ones, or are generated by specific kinds of transformations.

So, all of these things become pretty hard for users to understand and control, and Unravel automates everything using its machine learning models. So, dynamic allocation turned on is something for which Unravel’s recommendations work just out of the box. And sometimes, when users want strict SLAs met, and predictability, we also see that they will turn off dynamic allocation and then configure the number of executors just to meet the SLA.

And, of course, going back to the earlier question, the size of the cluster itself. So, absolutely. It’s a great question and something which Unravel supports out of the box.
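To make the two regimes contrasted above concrete, here is a minimal sketch. The configuration keys are standard Spark settings, but the values are illustrative, not Unravel’s actual recommendations.

```scala
// Regime 1: dynamic allocation on; Spark grows and shrinks the executor
// pool itself, so container sizing and parallelism matter most.
// Regime 2 (in the trailing comments): fixed executors for strict SLAs.
import org.apache.spark.sql.SparkSession

object AllocationRegimesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("allocation-regimes")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")      // required on YARN
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .config("spark.executor.memory", "8g")                // the container size being tuned
      .config("spark.executor.cores", "4")
      .config("spark.sql.shuffle.partitions", "200")        // the degree of parallelism
      // For predictability, turn dynamic allocation off and pin a count:
      //   .config("spark.dynamicAllocation.enabled", "false")
      //   .config("spark.executor.instances", "20")
      .getOrCreate()

    spark.stop()
  }
}
```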

– [Audrey] All right. Let’s take one last question. “Can alerts from Unravel be integrated with ICM? Are the alerts seen as general Azure alerts?”

– [Shivnath] Yes. So, the entire alerting mechanism in Unravel supports a delivery mode that is based on REST POSTs. So, using that, you can integrate it with ICM, and you can have your alerts delivered via Azure.

So, we have customers who are using PagerDuty or even homegrown alerting tools, and we’ve built Unravel for these kinds of extensibility needs. Absolutely.
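As an illustration of the “homegrown alerting tool” end of such an integration, here is a small sketch of an HTTP endpoint that could accept REST POSTs. The /unravel-alerts path and the idea of a JSON body are assumptions, not Unravel’s documented alert format.

```scala
// A tiny webhook receiver built on the JDK's built-in HTTP server.
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import scala.io.Source

object AlertReceiverSketch {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/unravel-alerts", new HttpHandler {
      override def handle(ex: HttpExchange): Unit = {
        val body = Source.fromInputStream(ex.getRequestBody).mkString
        println(s"alert received: $body") // forward to ICM, PagerDuty, etc. here
        ex.sendResponseHeaders(200, -1)   // 200 OK, no response body
        ex.close()
      }
    })
    server.start()
    println("listening on http://localhost:8080/unravel-alerts")
  }
}
```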

– [Audrey] Okay, great. Well, I think that’s all we have time for. Pranav, Shivnath, any last comments?

– [Shivnath] So, this is a really exciting opportunity for us to work very closely with Pranav and the Azure HDInsight team to bring the value props of Unravel to the broad set of enterprises that are using and benefiting on a day-to-day basis from running Big Data on the cloud.

Absolutely excited to, like, you know, present this to a large audience right here.

– [Pranav] Thank you so much for joining us as well.

– [Audrey] Right. All right. Thank you, everyone, for joining today. I’d like to thank our speakers again, Pranav and Shivnath. If you’d like to learn more about Azure or Unravel or both, send us a note at unraveldata.com/contact-us and we’ll put you in touch with Pranav and/or Shivnath.

If there are specific use cases you’d like to see us cover in future webinars, you can let us know there as well. Again, thanks for joining and have a great day. See you next time.