Optimising End-to-End Data Management with Databricks

The history of Databricks

Databricks was founded in 2013 by the minds behind Apache Spark, Delta Lake, and MLflow. The platform was first made available for public use in 2015 as the world’s first lakehouse platform in the cloud. Databricks now combines the best of data warehouses and data lakes to offer an open and unified platform for data and Artificial Intelligence (AI), with a single security and governance model.

A lakehouse is an open architecture that combines the features and best practices of data warehouses and data lakes. It allows for transaction support, schema enforcement and governance, BI support, openness, decoupling of storage and compute, support for diverse data types and workloads, and end-to-end streaming. It provides a single system that supports enterprise-grade security and access control, data governance, data discovery tools, privacy regulations, retention, and lineage. A lakehouse is enabled by implementing data warehouse-like structures and data management features on top of low-cost cloud storage in open formats and provides an API for accessing the data directly.

Databricks is built to handle and manage all types of data and is cloud-agnostic, which means it can govern data wherever it is stored. The platform is intended to support a range of data and AI workloads, giving team members access to the data they need and letting them co-create to drive innovation.

The features mentioned above are present in the architecture of the Databricks Lakehouse platform, including the use of the Delta Lake foundation to ensure reliability and performance, fine-grained governance for data and AI achieved through the Unity Catalog, and support for persona-based use cases.

An Overview of the Databricks Lakehouse Architecture

Databricks has two main planes: control and compute.

The control plane contains the backend services that Databricks manages. Notebook commands and other workspace configurations are stored and encrypted at rest in this plane.

Data processing is handled by the compute plane. For most computing tasks, Databricks uses the classic compute plane, which comprises resources in the user’s AWS, Azure or Google Cloud Platform account. For serverless SQL warehouses and Model Serving, Databricks uses serverless compute resources running in a compute plane in the user’s Databricks account.

The E2 architecture, released in 2020, offers features such as multi-workspace accounts via the Account API, customer-managed VPCs, secure cluster connectivity using private IP addresses, and customer-managed keys for encrypting notebook and secret data. Token management, IP access lists, cluster policies, and IAM credential passthrough are also included in the E2 architecture, making Databricks easier to manage and more secure in its operations.
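To make one of these controls concrete, the sketch below shows the kind of check an IP access list performs: a request is admitted only if its source address falls inside an allowed range and outside every blocked one. It is a simplified, conceptual illustration built on Python's standard ipaddress module, not the Databricks IP access list API. The function name is_allowed and the example address ranges are assumptions made for illustration.

```python
from ipaddress import ip_address, ip_network


def is_allowed(source_ip, allow_list, block_list=()):
    """Return True if source_ip is inside an allowed range and no blocked one."""
    addr = ip_address(source_ip)
    # In this sketch, a match in the block list always wins over the allow list.
    if any(addr in ip_network(cidr) for cidr in block_list):
        return False
    return any(addr in ip_network(cidr) for cidr in allow_list)


# Hypothetical ranges, taken from address blocks reserved for documentation.
ALLOW = ["203.0.113.0/24", "198.51.100.7/32"]
BLOCK = ["203.0.113.128/25"]

print(is_allowed("203.0.113.10", ALLOW, BLOCK))   # inside an allowed range
print(is_allowed("203.0.113.200", ALLOW, BLOCK))  # allowed range, but blocked
print(is_allowed("192.0.2.1", ALLOW, BLOCK))      # not in any allowed range
```

Checking the block list first keeps the rule easy to reason about: a blocked address can never be re-admitted by a broader allow entry.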

Databricks supports Python, SQL, R, and Scala to perform data science, data engineering, data analysis, and data visualisation in various formats (bar chart, line chart, area chart, pie chart, histogram, heatmap, scatter chart, bubble chart, box chart, combo chart, cohort analysis, counter display, funnel visualisation, choropleth map visualisation, marker map visualisation, pivot table visualisation, sankey, sunburst sequence, table, word cloud).

In 2023, Databricks introduced an updated user interface that improves navigation and reduces the number of clicks needed to complete a task. The changes affect the home page, sidebar, and search functionality. They include a streamlined tile system that unifies the previously separate pages for data science and engineering, SQL, and machine learning. The sidebar has been revamped to give direct access to universal resources such as workspaces and compute resources, and the global search function now looks across all available assets, including notebooks, tables, dashboards, and more. Overall, these improvements make features easier to navigate and discover for Databricks users.

The main benefits of the Databricks Lakehouse platform

Databricks takes a unified approach that eliminates the challenges of earlier data environments, such as data silos, complicated structures, and fragmented governance and security models.

As a lakehouse platform, Databricks includes all of these key features and benefits: support for ACID (Atomicity, Consistency, Isolation, and Durability) transactions, schema enforcement and governance, openness, BI support, decoupled storage and compute, support for structured and unstructured data types, end-to-end streaming, and support for diverse workloads, including data science, ML, SQL and analytics.
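Two of these features, schema enforcement and atomic writes, can be illustrated with a toy in-memory table. The sketch below is plain Python, not Delta Lake code; the ToyTable class, the SchemaError exception and the column names are all hypothetical. It rejects any batch containing a row that does not match the declared schema, and because validation happens before anything is written, a bad batch leaves the table unchanged.

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table's declared schema."""


class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"{column!r} expects {expected.__name__}")

    def append(self, batch):
        """Append all rows or none: validate the whole batch before writing."""
        batch = list(batch)
        for row in batch:
            self._validate(row)
        self.rows.extend(batch)


table = ToyTable({"id": int, "city": str})
table.append([{"id": 1, "city": "Utrecht"}])

try:
    # The second row has the wrong type for "id", so the whole batch is rejected.
    table.append([{"id": 2, "city": "Gent"}, {"id": "3", "city": "Lyon"}])
except SchemaError as err:
    print("rejected:", err)

print(len(table.rows))  # the rejected batch added nothing
```

A real lakehouse table achieves the same all-or-nothing behaviour on cloud storage through a transaction log rather than an in-memory list; the sketch only shows the contract a writer can rely on.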

The platform is:

Simple

Databricks unifies data warehousing and AI use cases on a single platform. It also simplifies the user experience through natural language: the Data Intelligence Engine lets users discover and explore new data using natural language.

Open

Based on open source and open standards, Databricks provides users with complete control of their data and avoids the use of proprietary formats and closed ecosystems.

Collaborative

Delta Sharing enables secure sharing of live data from your lakehouse to any computing platform without requiring complex ETL or data replication.

Multi-cloud

Databricks Lakehouse runs on every major public cloud (Microsoft Azure, AWS and Google Cloud) and is tightly integrated with the security, compute, storage, analytics and AI services natively offered by the cloud providers.

What can Databricks be used for?

Databricks offers a suite of tools that enable users to aggregate their data sources on one platform and process, store, share, analyse, model, and monetise the datasets across a broad range of applications, from business intelligence (BI) to generative AI.

Use cases include:

Building an enterprise data lakehouse that combines the strengths of enterprise data warehouses and data lakes, providing a single source of truth for data;
ETL and data engineering, combining Apache Spark with Delta Lake and custom tools for data ingestion;
Machine Learning capabilities, extended by tools such as MLflow and Databricks Runtime for Machine Learning;
Large language models and generative AI, where Databricks Runtime for Machine Learning includes libraries such as Hugging Face Transformers that let users integrate existing pre-trained models;
Data warehousing, analytics, and BI, based on user-friendly UIs, scalable compute clusters, SQL query editor, and notebooks that support Python, R, and Scala;
Data governance and secure data sharing through the Unity Catalog, allowing controlled access to data;
DevOps, CI/CD, and task orchestration, reducing duplicate efforts and out-of-sync reporting while providing common tools to manage versioning, automation, scheduling, and deployment for monitoring, orchestration, and operations;
Real-time and streaming analytics, using Apache Spark Structured Streaming.

AI and ML with Databricks

Databricks facilitates the full machine learning (ML) lifecycle on its platform with end-to-end governance throughout the ML pipeline. Several built-in tools support ML workflows:

Unity Catalog – data catalog and governance tool;
Lakehouse Monitoring – for tracking model prediction quality and drift;
Feature Engineering and Serving – for finding and sharing features;
Databricks AutoML – for automated model training;
MLflow – for model development tracking;
Databricks Model Serving – for low-latency high-availability model serving;
Databricks Workflows – for automated workflows and production-ready ETL pipelines;
Databricks Repos – for code management with Git integration.
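To give a sense of what tracking drift involves, the sketch below computes the Population Stability Index (PSI), one common statistic for comparing a baseline sample with a more recent one. It is a plain-Python illustration, not the Lakehouse Monitoring API, and this article does not state which metrics that tool uses. The function name, the bin count and the sample values are assumptions made for illustration.

```python
import math


def population_stability_index(baseline, current, bins=4):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the first or last bin.
            index = min(max(int((v - lo) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical model scores: a baseline window and a later, shifted window.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted_scores = [0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 1.0, 1.0]

print(population_stability_index(baseline_scores, baseline_scores))
print(population_stability_index(baseline_scores, shifted_scores))
```

Identical distributions give a PSI of zero, and the value grows as the recent sample moves away from the baseline, which is what makes it usable as an alerting signal.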

Databricks Runtime for Machine Learning includes libraries such as Hugging Face Transformers and LangChain, which enable the integration of pre-trained models or other open-source libraries into your workflow. The MLflow integration makes it possible to use the MLflow tracking service with transformer pipelines, models, and processing components.

Databricks also provides AI functions that data analysts can use to access large language models (LLMs), including those from OpenAI, directly within their data pipelines and workflows.

Databricks in Action at Devoteam

As a Databricks Consulting Partner, Devoteam can help organisations build, deploy or migrate to the Databricks Lakehouse Platform. With our team’s specialised experience and industry knowledge, we can assist in implementing complex data engineering, collaborative data science, full lifecycle ML and business analytics initiatives.

Want to know how we used Databricks to implement a new data platform that empowers Omgevingsdienst, an environmental service in the Netherlands, to gain more control over its data? Read our success story.

In conclusion

Over 9,000 organisations across the globe now consider Databricks their data intelligence platform of choice. With its ability to support large-scale data engineering, collaborative data science, comprehensive machine learning, and business analytics, Databricks is driving the mission to democratise data and AI and to help data teams solve challenging problems.

Want to assess Databricks’s relevance and potential for your organisation?

Connect with one of our experts today and find out if Databricks is the right solution for you.

This article and infographic are part of a larger series centred around the technologies and themes found within the 2023 edition of the TechRadar by Devoteam report. To learn more about Databricks and other technologies you need to know about, please download the TechRadar by Devoteam.