Blog Events Video 12 Min Read

Migrating and Scaling Data Pipelines on AWS

By: Kunal Agarwal

On March 27, 2019, Unravel CEO and Co-Founder Kunal Agarwal and Unravel CTO and Co-Founder Shivnath Babu presented a session on Migrating and Scaling Data Pipelines on AWS at the AWS Global Summit in Santa Clara, CA. You can view a video of the session or read the transcript.


Transcript

Kunal Agarwal:

Hi, everyone. I’m Kunal. And with me, I have my cofounder, Shivnath.

Shivnath Babu:

Hello.

Kunal:

We’re founders of Unravel Data, which is an application and operations management company that helps you migrate and then manage all your data workloads and pipelines on the commonly used Amazon systems. So, before we jump in, a quick history about ourselves. Unravel was founded by Shivnath, me, and a couple of us who’ve done performance management in the past. Also, some data scientists and specialists in the data arena.

Our team consists of people from AppDynamics, Cloudera, AWS. Our product’s being used by a lot of different enterprises in every vertical to help them monitor and manage their big data workloads and help them get to the cloud, or move between environments as well. And it really boils down to three things. We have performance management, reducing costs and optimizing your cluster itself, as well as helping your migration. And in each of these areas, we’ll show you a few insights.

So, when you try to move environments, whether that’s on-premises to the cloud, between clouds, or between data systems, there are a lot of common questions that come up. Questions around, how much will this cost us? What are the right types of instances for me to get on Amazon to actually run these types of workloads? Which apps are actually best suited for this particular cloud environment or this particular infrastructure that I’m going to be moving to? And how do I know if my migration is successful?

Now, the answers to all of these questions are usually guesstimates. I’m sure when you were doing your exercise, or if you are doing your exercise, to move to a new environment like Amazon, it’s basically based on a few factors like, what is your size of data to process? And then you pick your instance types, and then you pick a quantity of that particular instance type. Now, that technique doesn’t get you the accuracy in terms of cost and performance that you need. What Unravel can do is help you get really scientific and surgical in understanding what’s the best instance type, what’s the right environment, and how big the environment should be on any given day of the week, so you can use auto-scaling features up and down to actually run your code there in a very optimized and cost-effective fashion.

This is where our Unravel Migration features help. Everything from helping you discover your current environment, to helping you do the planning, to the migration project management work is done as software. So, you don’t need people involved to help you understand all of these different complexities, where you’d otherwise be spending days, weeks, sometimes months, figuring out the answers to all these different questions. And let me walk you through some of these different feature sets that we have.

So first and foremost, you have a current environment. It may be on-premises, it may be on another provider, and you try to get to, say, Amazon. Firstly, understanding what you’re currently running is super, super key, so then you can start to figure out whether you’ve got compatibility, whether you will be able to scale up in this new environment or not. So Unravel, once you install it, instantly populates a report like this that tells you about everything from what systems you are running, what services you are actually leveraging on your current environment, and then helps you break down other usage information by how many apps do I have, how many of these applications are Spark versus Kafka, or Hive or Hadoop, for example, which users actually access these particular applications.

Then you start to look at your resource information around how much CPU, memory, containers, etc., do I currently have, and how much of that do I actually use? All of this information is baseline information for you to understand what the shape of your workload, or the cost in your new environment, will start to look like. Today, this is a very manual exercise, and Unravel can give you that in an instant.

Once you have all that data collected, Unravel can also help you understand which are the workloads that are best suited for the cloud. So, you may have a consistently running environment. And maybe you’re trying to get to an auto-scaling, dynamic ecosystem. Unravel can help you understand which are these applications that are very bursty in nature, for example. Which of these applications sometimes process 50 gigs of data, another day, 500 gigs of data, so they need a lot of different resource changes. In this way, you can pick and choose, and only migrate those types of workloads that are best suited for the cloud.
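To make the idea of a "bursty" workload concrete, here is a minimal sketch that ranks applications by how much their daily input size varies. It is an illustration only, not Unravel’s algorithm: the coefficient-of-variation measure, the 0.5 threshold, the function names, and the sample data are all assumptions.

```python
from statistics import mean, pstdev


def burstiness(daily_input_gb):
    """Coefficient of variation of daily input size (0 means perfectly steady)."""
    avg = mean(daily_input_gb)
    if avg == 0:
        return 0.0
    return pstdev(daily_input_gb) / avg


def bursty_apps(history, threshold=0.5):
    """Names of apps whose input size varies more than the threshold, sorted."""
    return sorted(
        name for name, sizes in history.items() if burstiness(sizes) > threshold
    )


# Hypothetical history: GB processed per day by each application.
history = {
    "nightly_etl": [100, 102, 98, 101, 99],
    "campaign_report": [50, 500, 60, 480, 55],
}

print(bursty_apps(history))
```

An application that processes 50 GB one day and 500 GB the next scores high on this measure, which is the kind of workload that benefits from auto-scaling rather than a fixed-size cluster.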

Other ways of slicing this could be in your organization, you may be thinking, “Hey, we’ve got 15 business groups on the cluster right now. We want to move just the marketing department to the cloud.” Unravel can help you slice and dice that data as well to help you understand, “Hey, you don’t need to move your entire 5 petabytes of data. We can show you all the apps that marketing department touches, all the datasets the marketing department accesses, and you can only move those particular workloads to your cloud environment as well.” Sometimes you want to stage it, sometimes you want a very serial effort in getting all of your different organizations to the cloud as well. That’s something that Unravel can also help you out with.

But now, here’s the cool part. Now you’ve figured out what you want to move to the cloud. You’ve selected your different workloads. Unravel can now start to shape your workloads as well, based on your resource requirements, and then map them to the right instance type using an intelligence engine that Unravel has built internally. Again, this is something that software does. You don’t need people to do this.

For example, I identified my marketing department that needs to move to the cloud, they’re running a ton of Spark, they may be doing some Kafka-based streaming applications as well, and some NoSQL and MPP applications too. And now, when I come here, I’ve got a couple of different ways in which I can think about which instance type I should get.

Number one option is lift and shift, which is I wanna move to exactly the similar environment that I had on my previous ecosystem onto the Amazon environment. And what we do here is a simple mapping exercise to help you understand what your current resources look like, what their specs were, and what resources on Amazon should you be mapping to if you wanna run one for one. And over here, you see your total cost for running those particular applications or a dataset in that lift and shift environment.
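A one-for-one mapping of this kind can be sketched as picking, for each existing node, the cheapest instance type that has at least the same CPU and memory. The catalog below is hypothetical: the instance names, specs, and prices are placeholders, not real AWS offerings or quotes, and the 730 hours per month is an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: int
    hourly_usd: float  # placeholder price, not a real quote


# Hypothetical catalog; real names, specs and prices come from the provider.
CATALOG = [
    InstanceType("small", 4, 16, 0.20),
    InstanceType("medium", 8, 32, 0.40),
    InstanceType("large", 16, 64, 0.80),
    InstanceType("large-mem", 16, 128, 1.00),
]


def lift_and_shift(node_vcpus, node_memory_gib, catalog=CATALOG):
    """Cheapest instance type with at least the node's CPU and memory."""
    fits = [
        i for i in catalog
        if i.vcpus >= node_vcpus and i.memory_gib >= node_memory_gib
    ]
    if not fits:
        raise ValueError("no instance type is large enough for this node")
    return min(fits, key=lambda i: i.hourly_usd)


def monthly_cost(instance, node_count, hours=730):
    """Cost of running node_count instances for roughly one month."""
    return instance.hourly_usd * node_count * hours


# Example: 20 on-premises nodes, each with 12 cores and 96 GiB of memory.
choice = lift_and_shift(12, 96)
print(choice.name, monthly_cost(choice, 20))
```

Because the mapping is driven by what each node has rather than what it uses, the result tends to be an upper bound on cost.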

Where this gets even cooler is if you think about, “Okay, I don’t wanna do a lift and shift. I wanna use Amazon’s capabilities to be able to reduce my costs.” Unravel can now start mapping all of your requirements based on instance types that you really need. The way we do this is twofold. One, we understand not just all the resources currently available, but what’s actually used on your cluster, by day of the week and time of the day.

And then we realize that, “Hey, you don’t use your cluster to 100% utilization.” And we do that mapping exercise to figure out what your actual usage is, and then map that to an Amazon environment, for example. So you see, there’s huge cost savings. And I’m sure a lot of us want to leverage Amazon’s auto-scaling capabilities as well, to keep our costs low. This is another way that Unravel helps you leverage that.
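The difference between sizing for provisioned capacity and sizing for actual usage can be illustrated with a small sketch. Given observed vcore usage per time slot, it computes how many nodes each slot needs (with some headroom) and how many node-hours that saves compared with an always-on cluster. The 20% headroom, the function names, and the sample numbers are assumptions made for illustration.

```python
import math


def nodes_needed(used_vcores, vcores_per_node, headroom=0.2):
    """Nodes required to serve the observed usage plus a safety margin."""
    return max(1, math.ceil(used_vcores * (1 + headroom) / vcores_per_node))


def scaling_schedule(used_vcores_per_slot, vcores_per_node, headroom=0.2):
    """Node count for each time slot, following actual usage."""
    return [
        nodes_needed(used, vcores_per_node, headroom)
        for used in used_vcores_per_slot
    ]


def savings_fraction(schedule, provisioned_nodes):
    """Share of node-hours saved versus running provisioned_nodes all the time."""
    return 1 - sum(schedule) / (provisioned_nodes * len(schedule))


# Hypothetical usage for six time slots on a 25-node cluster of 16-vcore nodes.
usage = [30, 30, 150, 300, 70, 30]
schedule = scaling_schedule(usage, vcores_per_node=16)
print(schedule, round(savings_fraction(schedule, provisioned_nodes=25), 2))
```

A schedule like this is what an auto-scaling policy would approximate: the cluster grows for the busy slots and shrinks for the rest.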

A third way is to help you understand cost also with the lens of performance and reliability. So cost reduction is great but will all my workloads actually finish on time every time? This is another view to help you do that, where you can choose the number of workflows, or your number of data pipelines, or number of your data applications and say, “I want to match 100% of my SLAs. I want to make sure that everything that I’m moving to the cloud finishes within a certain amount of time, by that time of the day.” And Unravel can help you actually match those workload requirements with the cloud as well.

So just to summarize, Unravel, once you install it, instantly discovers everything that’s happening on your cluster, which has rich information about applications, datasets, resources used, all the users in your system. And once you have all of this data, you can start to very intelligently plan and migrate your workloads to the cloud based on your cost and performance requirements.

This is another view of that same workload fit that we were talking about, which is showing you 100% SLA match would cost you $5.48. But a 95% or an 85% SLA match would actually reduce your costs a little bit more. So, based on the leeway you have within your organization of, “Hey, I’m fine living with 85% of SLAs,” Unravel can also help you map your requirements accordingly.
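The trade-off between SLA match and cost can be sketched as a simple selection problem: for each candidate cluster plan, compute the fraction of applications predicted to meet their SLA, then pick the cheapest plan that reaches the target. The plan names, predicted runtimes, and prices below are hypothetical, and how runtimes would be predicted is outside the scope of this sketch.

```python
def sla_match(predicted_minutes, sla_minutes):
    """Fraction of apps whose predicted runtime is within its SLA."""
    met = sum(
        1 for app, limit in sla_minutes.items()
        if predicted_minutes[app] <= limit
    )
    return met / len(sla_minutes)


def cheapest_plan(plans, sla_minutes, target):
    """Cheapest plan whose SLA match is at least target, or None if none qualifies."""
    ok = [
        p for p in plans
        if sla_match(p["predicted_minutes"], sla_minutes) >= target
    ]
    return min(ok, key=lambda p: p["hourly_usd"]) if ok else None


# Hypothetical SLAs (minutes) and candidate plans with predicted runtimes.
sla = {"a": 10, "b": 30, "c": 60, "d": 15}
plans = [
    {"name": "big", "hourly_usd": 6.0,
     "predicted_minutes": {"a": 8, "b": 20, "c": 40, "d": 12}},
    {"name": "mid", "hourly_usd": 4.0,
     "predicted_minutes": {"a": 9, "b": 28, "c": 55, "d": 18}},
    {"name": "small", "hourly_usd": 2.5,
     "predicted_minutes": {"a": 14, "b": 35, "c": 58, "d": 25}},
]

for target in (1.0, 0.75):
    print(target, cheapest_plan(plans, sla, target)["name"])
```

Relaxing the target lets a cheaper plan qualify, which is the same effect as accepting an 85% or 95% SLA match instead of 100%.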

And now that you have your migration exercise going on, it’s usually a multi-month exercise. How do you even guarantee and tell your business constituents that the migration actually happened successfully? Unravel starts becoming a project management tool at that point and says, “Hey, you know what? We have this application, a Spark app, that was running in about two minutes. And now in the new environment, it’s actually taking 35 minutes to run.”

And this is where Unravel’s optimization features kick in. And you can actually start to optimize this application to make it go under that two-minute SLA. And then you can check it off and say, “Yup, that application was migrated successfully.”
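A migration checklist of this kind boils down to comparing each application’s runtime before and after the move. The sketch below is one hypothetical way to do that; the 10% tolerance, the status labels, and the sample data are assumptions, apart from the two-minute and 35-minute figures taken from the example above.

```python
def migration_report(baseline_sec, migrated_sec, tolerance=0.10):
    """Classify each app as 'ok', 'regressed', or 'not run' after migration."""
    report = {}
    for app, before in baseline_sec.items():
        after = migrated_sec.get(app)
        if after is None:
            report[app] = "not run"  # no run observed in the new environment
        elif after <= before * (1 + tolerance):
            report[app] = "ok"  # within tolerance of the old runtime
        else:
            report[app] = "regressed"  # needs tuning before sign-off
    return report


# Hypothetical runtimes in seconds; spark_app goes from 2 minutes to 35 minutes.
baseline = {"spark_app": 120, "hive_etl": 600, "export_job": 300}
migrated = {"spark_app": 2100, "hive_etl": 630}

for app, status in migration_report(baseline, migrated).items():
    print(app, status)
```

An application moves to "ok" only once its runtime in the new environment is back within tolerance, at which point it can be checked off as migrated.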

So, with that, I’m gonna hand it over to Shivnath to talk a little bit more about how do we actually optimize those workloads once they’re in the cloud.

Shivnath Babu:

Hello, everyone. Once again, I’m Shivnath Babu, CTO and cofounder at Unravel. So Kunal showed you how you can get your workloads in a quick, timely, optimal way to the cloud. Now that it’s on the cloud, what you might end up with is an architecture that might look something like this. You have a lot of different kinds of data, from structured to unstructured data, maybe ingested via Kafka or Kinesis, stored on Amazon S3, DynamoDB. Maybe the preparation, cleaning of the data might happen with Spark.

And the result might be visualized using Tableau. Maybe you are running Athena with serverless queries to power your ETL pipelines, report-generation pipelines, BI use cases. Or you might be even more fancy, where all this data that’s actually getting collected in real time, as well as in a batch fashion, you might want to do machine learning on it. You might use Amazon SageMaker, or you might roll up your sleeves and do a lot of modeling with Spark or with Python. Or you might actually have IoT applications or real-time applications, maybe combining the machine learning models with the serving or maybe generating alerts in real time.

In short, once you’re on the cloud, there is this plethora of different services you can use. We’ll end up with a very sophisticated pipeline. And if you look at all of these systems here, they are distributed systems. And the moment you have all of these systems that you’re trying to rely on, which are powering mission-critical applications, a lot of things can go wrong. The application fails, or maybe your IoT pipeline starts to lag. So the operations teams or the developers now need to get involved to figure out what is happening and solve the problem, or it might be that the pipeline, which was actually running well last week, last month, suddenly starts to miss the SLA.

So what went wrong? How do I really find the problem quickly and fix it? Or it might be problems like I’m getting a flood of alerts. I have these CloudWatch alerts configured. And all of these alerts are coming and I don’t know where to start, how to de-duplicate these alerts, how to find the root cause. Or it could be that your teams, you have the dev teams, the operations teams, somebody says the problem is in how you’ve architected things. Or the problem is in the cluster, or the problem could actually be in the application design itself, in the code.

A lot of finger pointing, fighting could happen. And in all these efforts, it could be that you’re missing SLAs. Maybe your costs are going through the roof. So to make a complex pipeline mission critical on the cloud, you actually have to solve all of these challenges. So in Unravel, that’s the challenge that we took on. So once your app’s in the cloud, all that data that Unravel is ingesting about your applications, about your infrastructure, about the usage, we bring it into a single pane of glass.

So instead of having to look at three CloudWatch dashboards here, logs there, metrics pipelines here, you can come to a single pane of glass, a single place where the data is brought in and visualized. But the key challenge we have actually solved is, how can we take all this data, apply AI and ML algorithms on it, and help operations teams, architects and devs not waste their time troubleshooting problems, but truly build these applications and make them reliable?

So overall, we support a huge class of systems, from Amazon EMR, NoSQL systems, Spark, Kafka, HBase, as well as the born-in-the-cloud, native systems like Redshift or Databricks, as well as the most serverless kind of usage like Athena. Across all of these systems, you could have transient clusters on EMR, you could have auto-scaling systems with Redshift, or you could be entirely serverless. On all of these environments, Unravel is constantly collecting the telemetry data, from metrics to logs to configurations, and constantly bringing it into the Unravel system, where it can analyze this data to give you dashboards, to give you reports, and all the way to actually giving you recommendations on how to fix the problems, while automatically fixing a lot of these problems.

Let me drill down a little bit more and tell you what goes on under the covers. High-level at Unravel, we support application performance management. So your application might be a Spark application or a Kafka-based pipeline. Unravel can, on one end, auto-tune. And what is auto-tuning? Auto-tuning is where, for the application, you have a goal or an SLA, which might be, “I want it to finish in 10 minutes each time it runs.” So Unravel can analyze all of this data and automatically tune and ensure that you’re getting the goal, which might be an efficiency or speed-up goal.

For some of these use cases, what Unravel can do is go to the level of logs and metrics, apply machine-learning algorithms, automatically pinpoint root causes, as well as give you recommendations – what configuration should I set my Spark containers to so that the application will run reliably and finish in time – as well as views like this, where all the errors happening at any level can be brought together with an analysis of what the root cause of the problem is.

And on the other side, we don’t actually look at just individual applications at a time, we look at your entire pipeline, your entire cluster, which could be from an operational kind of a use case. Unravel can analyze the entire workload running in a cluster and give you recommendations like, “Use these types of instances to cut down the costs while meeting the performance SLAs,” or, “Configure the overall configuration of the cluster so that you can optimize costs while meeting all the SLAs.” Or at this extreme, Unravel has sophisticated forecasting algorithms that can help you with capacity planning. How do you plan for the server instances, Spot instances, costs on the cloud? All of these can be handled automatically by Unravel.

So to summarize, we actually have a system that we have released. Now, it’s on the Amazon Marketplace, where you can use it for application-level use cases, remediation, as well as operational use cases, while bringing all your apps to the cloud, as well as supporting a hybrid environment. So there are a lot of customers who are using Unravel at a very high level. These are some of the numbers that we have seen from customers. Huge gains in productivity. No time spent on troubleshooting and application-performance tuning, and things like that. Reliability of the apps. Meet SLAs all the time, as well as reduce costs. You can get Unravel very easily. It’s on the Marketplace.

Summary: AI-driven, full-stack analysis of your entire cloud environment and hybrid environments to save costs. And that’s it.

Thanks a lot.


To learn more about Unravel and to see a product demo, check out Simplifying DataOps: Unravel demo with Kunal Agarwal.