
How to Build and Run Reliable Machine Learning Applications


This demo shows, through a suite of different kinds of applications, how Unravel provides a single pane of glass for your entire artificial intelligence/machine learning (AI/ML) big data application mix and operations management. It then shows examples of how Unravel uses advanced algorithms to quickly pinpoint the root cause of problems, or to warn you about problems before they happen, and how it uses that AI/ML intelligence to give you recommendations on how to fix them. A lot of the time, it can fix these problems before they even occur.

Transcript of video (subtitles provided for silent viewing)

Shivnath: Hello, everyone. Welcome to the webinar on How to Build and Run Reliable Machine Learning Applications. First and foremost, I’m going to point out that Unravel has an entire webinar series on operationalizing big data applications. The series started late last year with a holistic webinar on big data operations.

And since then, in January, we presented a webinar on how to operationalize applications on the cloud, in elastic auto-scaling environments. In February, we presented how to operationalize real-time and streaming applications on systems like Kafka, Spark Streaming, and Flink. And today, we are going to jump into machine learning applications.

Most enterprises that adopt big data go through a typical journey where they create many interesting and mission-critical applications. Usually, it all starts by bringing together data that has traditionally been siloed into one platform and then running ETL-style applications that clean up the data, combine multiple data sources, and create holistic views.

And as the enterprise continues its journey, the next set of applications is on the business intelligence side of things, where interesting insights can be extracted from the data. And as enterprises keep going on this journey, they quickly get into a regime where the insights in the data can be used to power predictive analytics, forecasting, and an entire set of use cases around automated decision-making.

So, today, I’m actually going to focus on these applications, and all of these applications tend to be powered by artificial intelligence and machine learning algorithms. We have a whole suite of customers across different verticals, and we have seen how these companies have been gaining a lot of value from creating AI and ML applications.

For example, in the telecom sector, we have customers who are driving a lot of value by using machine learning to power churn prediction applications. Churn is a big issue in telecom, where it is critical to know that a subscriber is in danger of switching to a different provider. So, algorithms have been used to quickly identify churn.

In the automotive industry, beyond self-driving cars, there are many other applications, for example, applications that collect data from IoT sensors and can predict when different parts of the assembly line might need maintenance. And, this is huge in terms of saving cost.

In the financial industry, there are a lot of challenges around fraud. And, this is an industry where models for detecting fraud, predicting fraud, and controlling the impact of fraud have had terrific impact.

In the e-commerce domain, arguably one of the domains where AI and ML applications were created and have been successful from the very beginning, recommendation engines are very, very important. These engines can give very personalized recommendations on products to buy or pages to look at.

And, in healthcare, an industry which is very important going forward, AI and ML have powered automated decision-making in clinical analysis and clinical trials, diagnosing causes of diseases very, very quickly.

So, across all of these different verticals, what we consistently see is that AI and ML apps tend to take on a lifecycle of their own. And, this lifecycle is usually powered by systems like Kafka for ingesting data; systems like Hadoop, HDFS, NoSQL, and cloud storage systems to store this data; and systems like Spark and MPP engines, and more recently systems like TensorFlow, to power the ingestion, analysis, and computation that happens on this data.

It all starts with identifying the data sources that are needed by your AI and ML applications. It could be IoT data, financial transactions, or machine data. And once the data is captured and stored, that’s when the AI/ML journey really starts, usually with feature engineering, which involves identifying the right attributes to extract from the data, on which you can then start to build models.
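
For readers who want to follow along in code, here is a minimal sketch of what this feature-engineering step might look like in Spark MLlib. The dataset path, schema, and column names are illustrative assumptions, not the demo’s actual code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("feature-engineering-sketch").getOrCreate()

# Hypothetical subscriber dataset; the path and columns are assumptions.
df = spark.read.parquet("hdfs:///data/subscribers")

# Encode a categorical attribute as a numeric index, then assemble the
# chosen attributes into the single feature vector that MLlib models expect.
indexer = StringIndexer(inputCol="plan_type", outputCol="plan_index")
assembler = VectorAssembler(
    inputCols=["call_minutes", "data_usage_gb", "support_calls", "plan_index"],
    outputCol="features",
)
features_df = assembler.transform(indexer.fit(df).transform(df))
```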

And, there is a whole suite of different kinds of models, with deep-learning models on one end, though a lot of innovation is also happening today around traditional models like support vector machines and random forests. And then, these models have parameters that might need to be tuned.

Once we have all of these in place, that’s when the final step of productionizing the app comes in, which might involve running the app continuously as new data comes in and keeping it up-to-date. And, this model is what powers all the predictive analysis. We also consistently see how companies have to keep iterating on this process: the models may not be optimal, or the right features may not have been extracted. And, this loop keeps going as the AI and ML app becomes mission-critical.

And, as that process continues, enterprises often realize there are many roadblocks along the way. While the application is being created and initially tested, failures can happen. Out-of-memory issues happen all the time; sometimes wrong results are generated, or the training process might be very slow, and the application might require a lot of resources to run. And, slowly, as companies productionize these apps, they might realize that the models and apps do not scale to the amount of data coming in.

Real-time streaming apps may start to lag and not be real-time anymore. Sometimes, there is a mismatch between the sizes of data coming in and what these models need to process, leading to issues like small files in the cluster or a very high cost of running these applications. And sometimes, applications that were performing well suddenly start to degrade.

So, all of these roadblocks mean that if you don’t think, from the very beginning, about a good approach to operations and performance management for AI and ML applications, then there’s a very good chance that you will not get the right return on investment from these applications. And that’s exactly the problem that Unravel is focused on. We have developed a full-stack, intelligent, and autonomous solution for performance management across a broad suite of applications. And today, I will show you how we apply it to machine learning applications.

In a nutshell, what Unravel does is collect telemetry information from every level of the stack: from the applications, from the services and platforms, all the way down to the infrastructure. All of this telemetry information continuously streams into the Unravel platform, where, through advanced visualizations as well as AI and ML algorithms employed within Unravel, we can correlate and convert all of this data into deep insights.

Without Unravel, companies are often left without one single tool, one single holistic view. They have to combine monitoring information from many levels of the stack, from Cloudera Manager and Ambari, to Ganglia, to the resource manager, and so on. Unravel, first and foremost, combines all of this data and gives you a very app-centric, correlated view, but it does not stop there.

It analyzes all of this data coming into Unravel and then gives recommendations on how to fix problems. It can actually find problems even before they happen. And, in a nutshell, with all of this combined, Unravel gives you a very good way to get the SLAs, predictability, and reliability you need from your applications. And, that’s what I’ll be focusing on in the demo today.

I’ll first show you, through a suite of different kinds of applications, how Unravel provides a single pane of glass for your entire AI/ML application and operations management. Then, I’ll give you examples of how Unravel uses advanced algorithms to quickly pinpoint the root cause of problems, or tell you about problems even before they happen, and how it uses that to give you recommendations on how to fix them. A lot of the time, it can fix these problems even before they happen.

Now, I’m going to jump into the demo. I’m going to pick two applications here. First, you are looking at the Unravel UI, and I have here a supervised learning application. Supervised learning is where the dataset comes with labels. What I have here is a churn prediction application, built using Spark. The dataset itself comes from the telecom domain: it’s about calls, it’s about subscribers. And, the key challenge here is: how quickly can you predict that a customer is likely to churn, that is, stop being a subscriber and switch to a different provider?

And this is where, as I was mentioning, the typical lifecycle and processing of machine learning applications starts to become visible. It’s very common for people to take the dataset and do some training of machine learning models on it. As you can see here, it’s reading an input dataset and then trying to build a model on it. But before it can start building the model, there is an entire phase around features and feature engineering, where you pick the right features to include in the model and drop the features that could end up confusing it. And once the feature engineering is done, that’s when the model generation happens.

So, there’s an entire pipeline that needs to be built to continuously train these models. In this case, the models are decision trees, a simple and easy-to-understand model that can achieve very good accuracy a lot of the time. The models are built and evaluated using a cross-validation approach, one of the most popular approaches to measuring the accuracy a model is getting. And once the models have been fit, and a search has been done across the space of available models, the best model is picked. Then, predictions from the best model are used as part of operationalizing the entire app.
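
As a rough illustration of the pipeline described here, the following is a hedged sketch in Spark MLlib: fit decision trees over a small hyperparameter grid and pick the best model by cross-validation. The label column, grid values, and `features_df` (the assembled data from the earlier sketch) are assumptions.

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Decision tree over the assembled feature vectors; "churned" is an
# assumed 0/1 label column marking subscribers who left.
dt = DecisionTreeClassifier(featuresCol="features", labelCol="churned")

# A small search space of candidate models, as described in the demo.
grid = (ParamGridBuilder()
        .addGrid(dt.maxDepth, [3, 5, 7])
        .addGrid(dt.minInstancesPerNode, [1, 10])
        .build())

# Cross-validation evaluates each candidate and keeps the most accurate one.
cv = CrossValidator(
    estimator=dt,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="churned"),
    numFolds=3,
)
best_model = cv.fit(features_df).bestModel
predictions = best_model.transform(features_df)
```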

So, that’s a very quick example of an entire supervised learning app right here. I also have another, simpler example: unsupervised learning. In unsupervised learning, the dataset does not come with labels, and the more common techniques used are things like clustering; k-means clustering is a very good example.

What I have here is an example of customer segmentation. The dataset contains information about customers, and you want to group them into segments with different profiles. It’s usually one of the phases used in algorithms like collaborative filtering, where you want to recommend that if person A bought something, then person B, who has a similar profile, is likely to buy that same thing.
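
A minimal sketch of that segmentation step, assuming the same assembled feature vectors as in the earlier sketches; the choice of k=5 is purely illustrative.

```python
from pyspark.ml.clustering import KMeans

# Group customers into k segments based on their feature vectors.
kmeans = KMeans(featuresCol="features", k=5, seed=42)
model = kmeans.fit(features_df)

# Each customer is assigned a cluster ID in the "prediction" column.
segments = model.transform(features_df)
segments.groupBy("prediction").count().show()
```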

So, I’m going to use this example to illustrate…if I am a data scientist or a user who’s created this application, then I would want to understand many different aspects of this application. How is it performing? Is it performing correctly? How do I improve the performance of this application? What can I do if the application were to fail? How can I keep tracking the performance of this application once I productionize it and see if it’s performing consistently and reliably?

Let’s start with the very first question. With this application, let me try to understand, as the user or developer who created it, “how is it running?” Unravel has a lot of views that have been created to help answer questions like this. I’m gonna start with the Gantt chart. The Gantt chart view shows how the entire application got converted into a series of execution steps. As you can see, this is running in Spark, with all the different jobs created by Spark going from top to bottom. This part right here is the Gantt chart, a timeline showing where in the program time is being spent.

So, very quickly, you will see a pattern which is very common in machine learning applications. We can see there are some initial jobs that take some time to run. They are extracting features and doing analysis, like SQL analysis, on the data. After that, you will see a series of very small stages. This is where machine learning applications do their iterative computation. And, if you look at the sizes of data being processed, they’re all the same. This is also a very common pattern, the iterative pattern, where machine learning algorithms take an input dataset and keep working on it, applying iterative steps like clustering, to generate the final result.

So, you can quickly understand the pattern of execution of the application. Now, as a developer, I might want to further understand where time is going and which parts of my application are consuming it. So, we have created an entire execution view here. For example, I can see where my dataset is actually being read from and which part of the code it corresponds to. This is where the dataset is being read and processed, and this is the part where I’m extracting all of the records that will eventually be fed into the clustering.

As we can see here, once these vectors are extracted from the data, the k-means clustering algorithm is applied. And, as we saw in the Gantt chart, this algorithm is a very iterative process. If you look at this execution graph, you will quickly see that a dataset is being created and the same kinds of operations are being run on it again and again. This is the iterative part of the algorithm. So, you can quickly understand how the parts of the program source correspond to time spent and execution on the data.

So, once such a holistic understanding of runtime performance with respect to the code is obtained, of course the very first question that comes up is, “Is this all doing the right thing? Is my code actually generating the right output? Is it processing the right input?” To answer those questions, I’m going to show you the same program that was run earlier.

With Unravel, one of the features we provide is that, along with looking at the performance, the user can get a quick set of samples of the data. Let me look at the data. This is not the full dataset but a quick collection of samples, so the user can understand how the data flows through the entire program and, of course, map it back to their source as well.

So, this enables them to check whether the program is doing the right thing. Is it generating the right results? And if not, where could things be going wrong? Now, from a correctness perspective, of course, you need to understand things like the data flow, and it’s equally important to understand whether there are any errors being thrown in the program.

And, this is actually a pretty important aspect, because in big data platforms, the source that the user writes goes through many different layers. Here, Unravel is very quickly telling the user that there were no fatal exceptions or errors. There are some warnings that might be worth looking at, and Unravel is able to map each of these individual errors and warnings to its component: the executors in Spark, or the master, the driver. So, this is pretty good; this particular application did not have any errors.

So, along with understanding performance and correctness, the next question the user might have is, “Should it take this much time to run? Is it running optimally? How is the performance looking?” To answer this, Unravel brings together information from the logs and the different configuration metrics, as well as how resources are being used by the application, across the resource manager and YARN, and the actual containers running in the cluster, consuming CPU, memory, JVM heap, and so on.

What Unravel really does is analyze all of this data automatically and give the user concrete recommendations on how to improve performance. Unravel analyzes a large suite of configuration parameters around containers, container sizing, degree of parallelism, joins, data skew, and so on. It can give exact, pinpointed suggestions on how to improve performance, and it explains, in plain English, why it is giving a certain recommendation. As part of that, you will see how Unravel can also analyze the code in the application and suggest when there are opportunities for improving performance via partitioning or, in this case, by caching.

So, as you can imagine, in machine learning and iterative computations, there are often a lot of opportunities for caching. And, by applying these sorts of recommendations from Unravel, we can improve performance significantly. Here is one example where I applied the recommendations for this customer segmentation app, which was running in three-and-a-half minutes, and there was almost a 2x improvement in the run time of the application after applying the recommendations. Not to mention, you can see how, by employing caching, resources can also be saved significantly: almost a 4x reduction in the actual amount of I/O performed while generating the same result.
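
Unravel’s exact recommendation text isn’t reproduced here, but a caching fix for an iterative job like this one typically looks something like the following sketch: persist the training vectors once so each k-means iteration reads from memory instead of re-reading and re-parsing the input.

```python
from pyspark import StorageLevel
from pyspark.ml.clustering import KMeans

# Cache the vectors that the iterative algorithm will scan repeatedly.
vectors = features_df.select("features").persist(StorageLevel.MEMORY_AND_DISK)
vectors.count()  # materialize the cache once, up front

model = KMeans(featuresCol="features", k=5, seed=42).fit(vectors)

vectors.unpersist()  # release the cached data when training is done
```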

These improvements, in terms of speed-up and reduced resources, can make this application a better citizen in a multi-tenant cluster, not to mention better set up for scale. Or, if you are running on the cloud, improvements like this can directly result in cost savings. So far, we have seen understanding of performance, understanding of correctness, and tuning of performance.

But another major challenge, as I mentioned, is that when data scientists are creating these applications, they’re also trying to understand the data. So, it’s very common, and we see this across all of the enterprises and verticals employing machine learning and AI applications, that a lot of failures happen.

Here’s a quick example of an application that failed. Now, if I open the application right here, this is an application with a lot of Spark SQL in it. Just like it gives suggestions about performance, Unravel can quickly pinpoint the root cause of the failure. It’s saying that the driver and executors failed with out-of-memory errors. So, let’s try to understand what really happened.

Here is the job in the application that failed. If I open the job, I can see that it failed because a stage in the spawned job failed. And, you can quickly see that it was able to run 199 out of the 200 tasks it had to run. One of the tasks failed, and Unravel is pinpointing that particular task.

Now, why did it fail? Just by opening the actual timeline of execution, you can very quickly see that there was this one task right here that was processing a much larger amount of input than all of the others. If I select that particular task, you can quickly see that task ID 244 is the one having this failure. So, it’s great to have all of this data in a single place. You can drill down and understand that skew in the data, that is, some load imbalance or incorrect partitioning of the data, led to the failure.

But, Unravel actually goes one step better: you don’t even have to get to this level of detail. In the error views that Unravel captures, notice how this application has some fatal exceptions and errors. Unravel quickly pinpoints that it is an out-of-memory problem, and it can tell you that this particular task is the one causing it. So, users who are not initiated into all of these systems, like Spark, and Flink, and H2O, and Kafka, can very quickly, with a very good UI, understand what the root cause of the problem is.

And with data skew, a lot of the time you can fix it by re-partitioning the data, and some of the time you can fix it by allocating more resources to the failed application. As we saw, this particular task was running in executor two. And, if you look at Unravel’s UI, you can see that the application was consuming too much memory and failing. In this way, Unravel helps quickly troubleshoot the causes of failure.
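
In practice, the two fixes mentioned here usually look something like the sketch below; the key name, partition count, and memory settings are illustrative assumptions, not values from the demo.

```python
# Spread a skewed key across more partitions so no single task receives
# a disproportionate share of the input. Both "skewed_df" and the
# "customer_id" key are hypothetical.
balanced_df = skewed_df.repartition(400, "customer_id")

# Alternatively, allocate more memory to the executors at submit time:
#   spark-submit --executor-memory 8g \
#       --conf spark.sql.shuffle.partitions=400 app.py
```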

Now, moving right along, we have seen how Unravel can help you deal with performance, correctness, and failures. A lot of the time, once you move beyond creating the models and choosing the right features, productionizing the application means it might have to run constantly, as part of a data pipeline. This data pipeline might be scheduled by an enterprise scheduler like Control-M, Airflow, or Oozie.

So, here we have our Spark application, the model that was generated. What I’m showing you now is how, in Unravel, you can get a holistic view of an entire data pipeline around the production machine learning model. Such a pipeline might have other stages, like a MapReduce job that uses Sqoop or distributed copy to ingest data into the cluster, then maybe some Hive or MapReduce jobs to do the ETL cleaning of the data, and then maybe Spark to generate the machine learning models.
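
As a hedged example of what such a productionized pipeline might look like when scheduled with Airflow, one of the schedulers mentioned above, here is a minimal sketch; the task names, commands, and schedule are placeholders, not the demo’s actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ml_model_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Ingest data into the cluster (placeholder Sqoop command).
    ingest = BashOperator(task_id="ingest", bash_command="sqoop import ...")

    # ETL cleaning of the data (placeholder Hive script).
    clean = BashOperator(task_id="etl_clean", bash_command="hive -f clean.hql")

    # Generate the machine learning model with Spark.
    train = BashOperator(task_id="train_model", bash_command="spark-submit train.py")

    # Run the stages in order: ingest, clean, then train.
    ingest >> clean >> train
```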

And, when such a pipeline is productionized, it might run every day or every hour, or it might even run in a streaming fashion on live data. Unravel can give you a holistic view of the entire data pipeline. And, what is very interesting is that with such pipelines, a very common problem is that everything was running barely within the SLA of, say, five minutes, but today’s performance is much worse. Suddenly, there’s a major change in performance.

So, when you have such problems, Unravel helps you very quickly diagnose what is going on. You want to understand, “What changed between my bad run, which took almost 3x the amount of time, and the good run?” You can quickly eyeball the components from a good run and compare them with the components from a bad run. And quickly, you see that this particular component, the MapReduce job that was cleaning up the data, is taking ten minutes to run even though the data size did not change; it’s almost three gigabytes in both runs. And, you can drill down into that component.

Unravel goes a step better here, where its advanced insights help you quickly pinpoint the cause of the change. Unravel baselines performance, so it can tell that today’s run is much worse compared to the baseline. And because it does this baselining automatically, it can pinpoint, through correlation analysis, which lower-level cause is driving the higher-level performance change. Here, Unravel is pointing out that the wait time in the cluster to launch the application master is the culprit. So, it’s good to know this after the fact, but wouldn’t it be even better if you could catch these sorts of problems before they become major, and hopefully even fix them?

So, Unravel has an entire workload manager built in, and I’m gonna show you how Unravel detected this problem ahead of time. If I click on this alert that came from Unravel, it takes me to that point in time in the cluster where my mission-critical application, application 584, had its performance affected and missed its SLA. The alert that Unravel is sending right here caught a rogue application, a shell that some user was experimenting with, which was consuming a lot of resources; it stole resources from the mission-critical app, causing it to miss its SLA.

But what’s even more interesting is that although this problem happened, the next run did not have a problem and met its SLA. So, applications 586, 588, and 589, coming from the production machine learning model, are fine. And here, Unravel actually caught a problem: if I go and click on this alert, you’ll see how the same problem repeated. There was a shell consuming a lot of resources, but this time my mission-critical application, 586, was not affected, because Unravel was configured, in this case, to terminate the rogue application.

Now, as I mentioned, Unravel has rich features around workload management. This is the Workload Management UI, where you can specify policies for applying best practices to your applications, like understanding and enforcing SLAs, and for taking charge of SLAs and contention in the cluster; Unravel brings in all the telemetry information and can apply your business policies to enforce them. So, with that, I’m going to wrap up the demo and come back to the presentation.

So today, in the entire demo, I showed you how Unravel gives you a single pane of glass across all applications. And, in particular, we looked at the entire lifecycle of machine learning applications, we looked at a supervised learning application, we looked at an unsupervised learning application.

I showed you how it can go from understanding performance, understanding correctness, and debugging the application, to quick troubleshooting and recommendations on how to fix problems when they occur, as well as the entire set of workload management capabilities to help you enforce SLAs and stay in control of the reliability of all your applications.

And once again, a quick reminder that we have an entire webinar series that is already taking shape, with more to come. It covers the challenges involved in operationalizing any type of big data application, be it your ETL or business intelligence applications, your streaming and real-time applications, or your AI and machine learning applications. We present the best practices for operationalizing big data applications, and especially how to use Unravel to achieve that.

With that, I’m going to take a pause for questions. Thank you.

Audrey: Thank you, Shivnath. Looks like we have time for a few questions. Please submit any of your additional questions in the GoToWebinar panel, and if we don’t get a chance to get to them, we’ll follow up on email and also post the Q&A summary on our blog. Okay, Shivnath, first question. “If I am creating my machine learning application from H2O or Notebooks, would Unravel be able to support that?”

Shivnath: Yes. So, that’s a great question. In the presentation today, I focused on machine learning applications created using Spark, Kafka, and HBase. But, you can create your applications in a notebook, or in H2O, which you mentioned and which is a very popular tool for creating applications, or even DataRobot.

None of these applications have to be created in Unravel. What I showed you is a way in which Unravel can ingest the source of your applications and then correlate that with performance. And, the source can come from the Spark shell, from spark-submit as in this case, from DataRobot, or from H2O.ai; it doesn’t matter. Unravel has been created to integrate with all of these solutions.

Audrey: Excellent. Okay, the next question. “It was interesting to see how your tool helps me debug my applications by showing data samples as they go through the pipeline. How do you protect data privacy?”

Shivnath: If I understand the question correctly, it is referring to Unravel’s features for debugging, where it can capture a sample of records that go through the entire machine learning pipeline, from the input dataset, to the feature engineering part, to the model generation part.

And, if that is the source of the question, then what Unravel does, and this applies across the board to the entire product, is keep the data it captures on your side; it never leaves your premises. It could be that you’re running entirely on-premises, or entirely on the cloud. No data is sent to any Unravel servers; it all remains within your servers.

On top of that, Unravel also has role-based access control capabilities, which operations teams can use to enforce whatever privacy policies they want. For example, this feature of collecting data samples as they flow through the pipeline can be turned off entirely. And even if it is turned on, user A will not be able to see user B’s applications unless role-based access control has been configured that way. So, for data privacy and security, as well as authentication, all of the enterprise-grade features are there in Unravel, and we take that very seriously.

Audrey: Great. I think we have time for one more question. “What platforms does your solution support? Can I run it on apps that run in the cloud?”

Shivnath: This question seems related to the first question. As far as authoring platforms are concerned, Unravel supports a range of them, from notebooks, Jupyter notebooks for example, to H2O, to DataRobot, and newer systems like TensorFlow, for that matter. As far as the Hadoop, Spark, and Kafka platforms where the compute and storage happen are concerned, we support all of the vendors, like Cloudera’s CDH platform, MapR, and HDP, and on the cloud, we support Azure, Google Cloud, and Amazon.

And, along with compute platforms like Hadoop and Spark, we also support ingestion platforms like Kafka, where we are seeing a lot of real-time machine learning applications being built; this is a key part of the pipeline. We also support newer engines for massively parallel processing (MPP), like Impala.

Audrey: Great. Thank you, Shivnath. That’s all we have time for today, but if you’d like to learn more about Unravel, get a demo, or start our free trial, please contact us at unraveldata.wpengine.com or let us know in the survey that will pop up shortly. Also, let us know if there are specific use cases or topics you’d like to see us cover in our future webinars in that survey as well. And, as Shivnath mentioned, all our previous webinars are available on unraveldata.wpengine.com for you to watch on demand.

Thanks again to our speaker, Shivnath, and thanks for joining the webinar today. Have a great day, everybody.

Shivnath: Thank you.