Troubleshooting & DataOps

CIDR 2019

As enterprises deploy complex data pipelines into full production, AI for operations (AIOps) is key to ensuring reliability and performance.

I recently traveled down to Asilomar, Calif., to speak at the Conference on Innovative Data Systems Research (CIDR), a gathering where researchers and practicing IT professionals discuss the latest innovative and visionary ideas in data. There was a diverse range of speakers and plenty of fascinating talks this year, leaving me with some valuable insights and new points of view to consider. However, despite all of the innovative and cutting-edge material, the main theme of the event was an affirmation of what some of us already know: the challenges of managing distributed data systems, the problems we’ve been addressing at Unravel for years, are very real and are being experienced in academia and the enterprise, in small and large businesses, and in the government and private sectors. Moreover, organizations feel like they have no choice but to react to these problems as they occur, rather than preparing in advance and avoiding them altogether.

My presentation looked at the common headaches in running modern applications that are built on many distributed systems. Apps fail, SLAs are missed, and cloud costs spiral out of control. These issues have a wide variety of causes, such as bad joins, incorrect configuration settings, bugs, misconfigured containers, and machine degradation. It’s tough (and sometimes impossible) to identify and diagnose these problems manually, because monitoring data is highly siloed and often not available at all.

As a result, enterprises take a slow, reactive approach to addressing these issues. According to a survey from AppDynamics, most enterprise IT teams first discover that an app has failed when users call or email their help desk or when someone outside IT alerts the IT department. Some don’t even realize there’s an issue until users post on social media about it! This of course results in a high mean time to resolve problems (an average of seven hours in AppDynamics’ survey). Clearly, this is not a sustainable way to manage apps.

Unravel’s approach starts by collecting all the disparate monitoring data in a unified platform, eliminating silo issues. Then the platform applies algorithms to analyze the data and, whenever possible, takes action automatically, providing automatic fixes for the kinds of problems listed above. The use of AIOps and automation is what really differentiates this approach and provides so much value. Take root cause analysis, for example. Manually determining the root cause of an app failure is time-consuming and often requires domain expertise. It’s a process that can often last days. Using AI and our inference engine, Unravel can complete root cause analysis in seconds.

How does this work? We draw on a large set of root cause patterns learned from customers and partners. This data is constantly updated. We then continuously inject this root cause data to train and test models for root-cause prediction and proactive remediation.
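To make the idea of matching a failure against known root cause patterns concrete, here is a minimal sketch in Python. It is not Unravel's inference engine. The pattern set, the symptom tokens, the function names, and the similarity threshold are all hypothetical, and a simple Jaccard overlap stands in for whatever learned models a production system would use.

```python
# Hypothetical labeled patterns: a set of symptom tokens and the root cause
# they were associated with. In practice this set would keep growing.
PATTERNS = [
    ({"oom", "executor", "container_killed"}, "insufficient executor memory"),
    ({"skew", "long_tail_task", "shuffle"}, "data skew in join"),
    ({"timeout", "disk_latency", "node_slow"}, "machine degradation"),
]


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_root_cause(symptoms, patterns=PATTERNS, threshold=0.3):
    """Return (label, score) for the closest known pattern.

    The label is None when no pattern reaches the (assumed) threshold,
    which signals that the failure needs manual diagnosis.
    """
    best_label, best_score = None, 0.0
    for tokens, label in patterns:
        score = jaccard(symptoms, tokens)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    print(predict_root_cause({"oom", "executor", "gc_pressure"}))
    print(predict_root_cause({"unknown_symptom"}))
```

Failures that come back with no label are natural candidates for expert diagnosis, and the resulting diagnosis can then be added to the pattern set.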

During the Q&A portion of the session, an engineering lead from Amazon asked a great question about what Unravel is doing to keep its AIOps techniques up to date as modern data stack systems evolve rapidly. The short answer is that the platform doesn’t stop learning. We continually run careful probes to identify places where we can enhance the training data for learning, then collect more of that data to do so.

There were a couple of other conference talks that also nicely highlighted the value of AIOps:

  • SageDB: A Learned Database System: Traditional data processing systems have been designed to be general purpose. SageDB presents a new type of data processing system, one that specializes heavily to an application through code synthesis and machine learning. (Joint work from Massachusetts Institute of Technology, Google, and Brown University)
  • Learned Cardinalities: Estimating Correlated Joins with Deep Learning: This talk addressed a critical challenge of cardinality estimation, namely correlated data. The presenters have developed a new deep learning method for cardinality estimation. (Joint work from the Technical University of Munich, University of Amsterdam, and Centrum Wiskunde & Informatica)
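For readers who have not run into the correlated-data problem in cardinality estimation, the small Python sketch below shows why it matters. It is not the method from the talk. It uses made-up rows and two filter predicates on a single table (a simplification of the join case) to show how the classic independence assumption misestimates the result size when columns are correlated.

```python
# Made-up table of (city, country) rows. City fully determines country,
# so the two columns are strongly correlated.
ROWS = (
    [("Munich", "Germany")] * 40
    + [("Berlin", "Germany")] * 30
    + [("Amsterdam", "Netherlands")] * 30
)


def selectivity(rows, predicate):
    """Fraction of rows that satisfy a single predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


def independence_estimate(rows, predicates):
    """Estimate the result size by multiplying per-predicate selectivities,
    i.e. by assuming the predicates are statistically independent."""
    estimate = float(len(rows))
    for p in predicates:
        estimate *= selectivity(rows, p)
    return estimate


def true_cardinality(rows, predicates):
    """Count the rows that actually satisfy every predicate."""
    return sum(1 for r in rows if all(p(r) for p in predicates))


def is_munich(row):
    return row[0] == "Munich"


def is_germany(row):
    return row[1] == "Germany"


if __name__ == "__main__":
    preds = [is_munich, is_germany]
    print("independence estimate:", independence_estimate(ROWS, preds))
    print("true cardinality:", true_cardinality(ROWS, preds))
```

On this toy data the independence estimate comes out at roughly 28 rows while the true answer is 40, because every Munich row is also a Germany row. Gaps like this compound across multi-way joins, which is the motivation for learning the estimate from the data instead.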

Organizations are deploying highly distributed data pipelines into full production now. These aren’t science projects; they’re for real, and the bar has been raised even higher for accuracy and performance. And these organizations aren’t just growing data lakes like they were five years ago. They’re now trying to get tangible value from that data by using it to develop a range of next-generation applications. Organizations are facing serious hurdles daily as they take that next step with data, and AIOps is emerging as the clear answer to help them with it.

Big data is no longer just the future; it’s the present, too.