Predict Databricks Costs in Four Steps

Modern data analytics platforms like Databricks have become indispensable for many companies. However, Databricks costs can become unpredictable. This article gives four steps to help companies like yours predict Databricks costs and save time and money.

Overview

J.B. Hunt Transport, Inc., a leader in North American logistics, implemented the Databricks Data Intelligence Platform to improve visibility for IoT and location tracking data across its vast data infrastructure. However, the usage and infrastructure costs for the Databricks solution have been unpredictable and difficult to manage. This has made it impossible to accurately assess ROI and has limited scaling to additional workloads.

Modern data analytics platforms like Databricks have become indispensable for many companies grappling with legacy systems, rapid data growth, and expanding AI operations. Utilized by over 10,000 global customers, Databricks empowers companies to deliver operational solutions that improve supply chain efficiency and productivity.

However, more significant usage also creates issues with predicting and managing cost, assessing ROI, and ongoing governance. These problems disrupt data operations and can limit further deployment of Databricks into additional use cases. This article identifies the root causes of these challenges and provides suggested paths for remediation, leading to maximum ROI and extended impact from the Databricks data platform.

The Challenges of Cost Prediction

The recently published Gartner Market Guide for Data Observability Tools [1] exposes critical shortcomings in applying traditional, general-purpose monitoring approaches to modern data platforms like Databricks. These general observability tools (like DataDog or Dynatrace) are designed for infrastructure and web application performance observability. They are built around event-based models focused on service response times rather than the more complex and varied data pipeline dependencies of modern data stacks.

As a result, companies using general observability tools struggle to anticipate and respond to problems that arise in their increasingly complex data environments. Without granular visibility into ongoing data job activity correlated with logs, metrics, traces, and other telemetry data, these tools cannot provide the real-time, comprehensive support necessary to prevent critical data issues, cost overruns, or system downtime.

These shortcomings translate into a significant business challenge for J.B. Hunt in managing Databricks costs, performance, and reliability.

Top Four Ways to Predict J.B. Hunt’s Databricks Costs

Understanding and managing Databricks costs is pivotal for optimizing J.B. Hunt’s data operations and ensuring an optimal return on investment, particularly given its rapid data volume expansion and business objectives to unlock value from data within legacy enterprise data warehouses.

The following four strategies, aligned with the recommendations outlined in Gartner’s Market Guide for Data Observability Tools [1], offer a comprehensive approach to cost optimization and spend management. By adopting these methodologies, J.B. Hunt can navigate the complexities of processing and storing extensive location-tracking datasets, enhance visibility of Databricks usage, pinpoint inefficiencies in data processing, execute targeted actions to optimize resources, and ensure continuous governance of Databricks costs.

Step #1. Gain Granular Visibility

Effective cost prediction begins with detailed insights into the data pipelines and jobs. This level of understanding allows organizations to optimize performance, troubleshoot issues quickly, and ensure data quality and compliance. With detailed visibility, data engineers and analysts can identify bottlenecks, track resource utilization, and make informed decisions about scaling and optimization.

General observability platforms do not provide this granular visibility for modern data stacks. This gap has given rise to a new crop of Data Observability tools developed to monitor and detect issues effectively across the modern data ecosystem, including data quality monitoring, catalog management, data profiling, and anomaly detection.
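
As a concrete starting point, the sketch below shows one way to surface per-job spend directly from Databricks itself. It is a minimal example, assuming the workspace has the system billing tables (system.billing.usage and system.billing.list_prices) enabled and that their schemas match current documentation; a dedicated Data Observability platform performs this correlation automatically and enriches it with logs, metrics, and traces.

```python
# Minimal sketch: per-job DBU consumption and estimated list-price cost over
# the last 30 days, run from a Databricks notebook where `spark` is predefined.
# Assumes the system billing tables are enabled and that usage_metadata.job_id
# and pricing.default exist as documented.
daily_job_cost = spark.sql("""
    SELECT
        u.usage_date,
        u.usage_metadata.job_id                   AS job_id,
        u.sku_name,
        SUM(u.usage_quantity)                     AS dbus,
        SUM(u.usage_quantity * p.pricing.default) AS est_cost_usd
    FROM system.billing.usage u
    JOIN system.billing.list_prices p
      ON u.sku_name = p.sku_name
     AND u.cloud    = p.cloud
     AND u.usage_end_time >= p.price_start_time
     AND (p.price_end_time IS NULL OR u.usage_end_time < p.price_end_time)
    WHERE u.usage_metadata.job_id IS NOT NULL
      AND u.usage_date >= date_sub(current_date(), 30)
    GROUP BY 1, 2, 3
    ORDER BY est_cost_usd DESC
""")
daily_job_cost.show(20, truncate=False)
```

Even this simple view answers questions such as which jobs dominate spend and how their cost trends day over day, which is the foundation the later steps build on.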

Step #2. Identify Inefficiencies

After establishing granular visibility of data jobs and pipelines, the next step is identifying the weaknesses and performance issues that must be addressed to lower costs, improve performance, and ensure reliability. Pinpointing areas of suboptimal performance, such as poorly written queries, inefficient data transformations, or underutilized compute resources, will eliminate unnecessary expenses and sluggish performance.

However, identifying these inefficiencies is difficult because of the extreme complexity and enormous scale of modern data environments and the critical need for engineers with the required skill and experience. The good news is that this is where the AI capabilities of modern Data Observability tools can help. AI algorithms trained on specific functions, requirements, patterns, and ideal workflows can instantly detect inefficiencies and even recommend corrective actions.

Implementing an effective AI Data Observability engine can help identify bottlenecks, allowing data teams to refactor code, optimize query plans, and right-size infrastructure. This proactive approach reduces cloud computing costs and improves data processing speeds, query response times, and overall platform responsiveness. As data volumes continue to grow, efficiently managing and processing information becomes a critical competitive advantage, making identifying and eliminating inefficiencies an essential practice for organizations like J.B. Hunt to maximize the impact of Databricks.
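
To illustrate the underlying idea in its simplest form, the hypothetical sketch below flags jobs whose daily cost jumps well above their own recent baseline. Production observability engines use far richer AI models trained on correlated telemetry, but the principle of comparing current behavior against a learned baseline is the same; the column names are assumed to match the per-job cost output from Step #1.

```python
# Simplified stand-in for automated inefficiency/anomaly detection:
# flag days where a job's cost exceeds its own rolling baseline.
import pandas as pd

def flag_cost_anomalies(df: pd.DataFrame, window: int = 14, threshold: float = 3.0) -> pd.DataFrame:
    """Return rows where est_cost_usd exceeds the rolling mean by `threshold` standard deviations."""
    df = df.sort_values(["job_id", "usage_date"]).copy()
    grouped = df.groupby("job_id")["est_cost_usd"]
    df["baseline"] = grouped.transform(lambda s: s.rolling(window, min_periods=5).mean())
    df["stddev"] = grouped.transform(lambda s: s.rolling(window, min_periods=5).std())
    df["anomaly"] = df["est_cost_usd"] > df["baseline"] + threshold * df["stddev"]
    return df[df["anomaly"]]

# Hypothetical usage, reading an export of the Step #1 query results:
# costs = pd.read_csv("daily_job_cost.csv", parse_dates=["usage_date"])
# print(flag_cost_anomalies(costs)[["job_id", "usage_date", "est_cost_usd", "baseline"]])
```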

Step #3. Implement Your Action Plan

After identifying inefficiencies, the next step is for J.B. Hunt to prioritize problems based on their impact on cost and performance, focusing on quick wins that offer maximum savings, and to develop a roadmap that addresses inefficiencies in areas such as query optimization, storage management, and compute resource allocation. It is also essential to foster a culture of cost awareness among data teams, encouraging them to write efficient code and leverage built-in platform optimizations.
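
The short sketch below illustrates one way to rank findings so that quick wins surface first. The specific inefficiencies, savings figures, and effort estimates are hypothetical; in practice, the inputs would come from the visibility and detection work in Steps #1 and #2.

```python
# Hypothetical prioritization: rank identified inefficiencies by estimated
# monthly savings per engineering day of remediation effort.
from dataclasses import dataclass

@dataclass
class Finding:
    name: str
    est_monthly_savings_usd: float
    effort_days: float  # rough engineering effort to remediate

    @property
    def priority_score(self) -> float:
        # Savings per engineering day: higher means a quicker win.
        return self.est_monthly_savings_usd / max(self.effort_days, 0.5)

findings = [
    Finding("Right-size over-provisioned job clusters", 12_000, 2.0),
    Finding("Rewrite skewed join in nightly ETL pipeline", 4_500, 5.0),
    Finding("Enable auto-termination on interactive clusters", 3_000, 0.5),
]

for f in sorted(findings, key=lambda x: x.priority_score, reverse=True):
    print(f"{f.name}: ~${f.est_monthly_savings_usd:,.0f}/month, score {f.priority_score:,.0f}")
```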

Advanced Data Observability tools can significantly enhance this process by providing AI-enabled suggestions for optimizing queries, workflows, and computing resources. These suggestions empower engineers with the insights they need to optimize their code and troubleshoot issues more rapidly and effectively without relying on scarce expert resources.

Continuously reassessing and adjusting remediation efforts and balancing immediate gains with long-term architectural improvements will help ensure sustained benefits in cost savings and performance enhancements.

Step #4. Establish Ongoing Governance

Continuous governance is essential for J.B. Hunt to maintain performance reliability and cost predictability. As data volumes and user bases grow, unchecked usage can cause expenses to spiral and system performance to degrade. Effective governance will enable J.B. Hunt to implement consistent data retention, security, and compliance policies across the entire Databricks platform.

By proactively managing these aspects with an advanced Data Observability platform, J.B. Hunt can optimize their investments in data infrastructure, prevent budget overruns, and maintain the high-performance levels necessary for critical data operations. These advanced platforms leverage the granular visibility, correlation, and AI capabilities referenced above to provide real-time alerts before problems occur, guardrails to prevent deployment of inefficient code into production environments, automation to prevent “runaway” jobs from inadvertently running up cloud costs, and budgeting to ensure predictability of cost.
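
As one concrete example of such a guardrail, the sketch below creates a Databricks cluster policy that caps autoscaling, restricts node types, and enforces auto-termination so that runaway clusters cannot silently run up costs. The policy is applied through the Databricks Cluster Policies REST API; the workspace URL and token environment variables, the specific limits, and the node types are illustrative assumptions, and an observability platform layers alerting, budgeting, and automation on top of controls like this.

```python
# Hypothetical guardrail: create a cost-limiting cluster policy via the
# Databricks Cluster Policies REST API. Limits and node types are examples.
import json
import os
import requests

policy_definition = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "name": "cost-guardrail-standard-jobs",
        "definition": json.dumps(policy_definition),  # definition must be a JSON string
    },
    timeout=30,
)
resp.raise_for_status()
print("Created policy:", resp.json().get("policy_id"))
```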

Case Study Example

Peter Rees, Lead Architect & Engineer at Maersk, a leading global shipping and logistics company, stated the following about Unravel, a leading Data Observability and Actionability platform:

“Using Unravel helped us prioritize our focus and efforts to reduce cost. For example, more than half of our jobs were waiting for 70% of their duration to start because of a mismatch in the sizing of our Spark configurations for those jobs. As a developer, you’re thinking that if I assign more resources to this job, it will run faster and more efficiently. But, of course, that blocked resources, causing the second or third job to wait.”

“With Unravel, we get actionable recommendations on addressing problems and inefficiencies. Looking at our cluster configurations, we had expensive memory or processing-intensive nodes that were not required for the workloads we were putting through them. The size of the clusters and nodes that we were using was mismatched. Knowing this allowed us to use the specific recommendations provided by Unravel to change the cluster configurations and substantially reduce cost.”

“Unravel also helped us examine the frequencies of the pipelines we were running and challenged whether we were getting enough value to justify the cost. It also provides us with specific insights into code inefficiencies and recommends modifications to make it more cost-efficient and reliably meet SLAs.”

Key Takeaways for J.B. Hunt

Managing costs, optimizing performance, and ensuring reliability are critical for maximizing the value and business impact of the Databricks data platform. Implementing the strategies outlined above can enable J.B. Hunt to reduce the costs of existing workloads and expand usage of Databricks technology across two or three times the number of use cases for the same overall cost.

J.B. Hunt would benefit from implementing an advanced Data Observability system to enable this. As noted in the latest Gartner Market Guide for Data Observability Tools [1], these systems provide granular visibility into data pipelines, jobs, and queries, allowing a deeper understanding of usage patterns and costs when correlated with other telemetry data. By harnessing AI capabilities, they can also identify inefficiencies in code quality and infrastructure allocations, which is particularly valuable given the complexity of Databricks ecosystems and the scarcity of deep expertise. The AI-driven insights can quantify the impact of problems on cost and performance and also recommend optimal solutions. Finally, these advanced platforms establish ongoing governance through guardrails, alerts, and budgeting tools, thus ensuring predictable costs and reliable performance.

As J.B. Hunt continues to scale data operations, the adoption of a sophisticated Data Observability system becomes not just beneficial, but essential for maintaining competitive edge and operational excellence in the data-driven landscape.

Unravel Data provides a unique Data Observability platform for Databricks, Snowflake, and BigQuery that uses AI to enable insights, recommendations, and actionability not offered by any other provider. Because Unravel has been building and refining its platform for many years, it has the most extensive underlying knowledge graph to make the most accurate and insightful recommendations.

Unravel offers three specialized AI agents to assist data engineers, DataOps, and FinOps teams with customizable levels of automation that leave the operator in charge of what, when, and how to automate functions.

To learn more about how Unravel can help J.B. Hunt predict and optimize Databricks costs, request a health check at https://www.unraveldata.com/health-check-demo/ or a demo at https://www.unraveldata.com.

——————
References
[1] Gartner Market Guide for Data Observability Tools, 25 June 2024 – ID G00765184