
Data Vault 2.0 on the Databricks Lakehouse

Introduction

In the previous articles in this blog series, we have looked at the Data Vault 2.0 architecture from different perspectives: the hybrid Data Vault 2.0 architecture, how we apply it on the Azure platform, and real-time architecture. Also check out our initial article on the value of Data Vault 2.0. In this article, we take a high-level look at how Data Vault 2.0 can be applied to the Databricks platform.

Reference Architecture

Delta Lake gives us an open-source storage layer on top of cloud data lake storage, such as Azure Data Lake Storage. It can handle a variety of workloads (batch or streaming) and sources (structured, semi-structured, or unstructured) to bring all the data together under one roof while guaranteeing ACID transactions (atomicity, consistency, isolation, and durability).

A commonly used best practice in data warehousing is the concept of layers; Databricks typically uses its medallion architecture, which divides data into Bronze (raw ingestion or landing), Silver (curated), and Gold (business-reporting grade) layers.

The Data Vault 2.0 reference architecture has been presented in a previous article in this blog series:

[Figure: the Data Vault 2.0 reference architecture]

The architecture consists roughly of a data lake as a landing zone and staging area, the Raw Data Vault, the Business Vault, and information marts.

The following image shows how the layers of the Databricks medallion architecture map to the layers of Data Vault 2.0:

[Figure: mapping the medallion architecture layers to the Data Vault 2.0 layers]

The Bronze layer is the zone where data arrives, the landing zone. It can also be used for staging purposes (like hash key calculations). With Databricks, data can easily be landed in the Bronze area using Auto Loader, regardless of the loading frequency. Data can arrive in various formats like CSV, Parquet, Avro, or JSON, or come from external databases, and all of it gets converted to Delta-formatted tables (more ingestion options are described on the Microsoft blog). Change Data Feed, a feature that allows tracking row-level changes between versions of a Delta table, can help to provide CDC-like data ingestion, ready for processing into the Silver layer.
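To make this concrete, here is a minimal Auto Loader sketch of what landing a source into Bronze could look like. It assumes a Databricks notebook (where spark is predefined); the paths, file format, and table name are illustrative placeholders rather than a prescribed setup.

    import org.apache.spark.sql.streaming.Trigger

    // Incrementally pick up new files from the landing folder with Auto Loader.
    val bronzeStream = spark.readStream
      .format("cloudFiles")                                         // Auto Loader source
      .option("cloudFiles.format", "json")                          // format of the incoming files
      .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/customer")
      .load("/mnt/lake/landing/customer")                           // hypothetical ADLS landing path

    // Write the stream to a Bronze Delta table. AvailableNow processes all pending
    // files and then stops, which also fits batch-style loading frequencies.
    bronzeStream.writeStream
      .option("checkpointLocation", "/mnt/lake/_checkpoints/bronze_customer")
      .trigger(Trigger.AvailableNow())
      .toTable("bronze.customer_raw")
    // Once the table exists, Change Data Feed can be enabled via the
    // delta.enableChangeDataFeed table property.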

The Silver layer, or integration layer, is where the Raw Data Vault and Business Vault reside. The Raw Data Vault is modeled using hubs, links, and satellites, but as discussed in an earlier article, no business rules are applied here. It contains raw data from various sources and fully captures the history of those sources. This means it is always possible to reconstruct any given source, at any given time, from the Raw Data Vault.
Sometimes, Delta Time Travel is proposed as an alternative way to construct the satellites in the Raw Data Vault. That makes sense at first glance, because it allows you to access all historical versions of the data. However, we must keep in mind that the Raw Data Vault must always contain all versions of the source. If someone decided to execute a VACUUM command, for instance, you would lose part, if not all, of the history. Therefore, it is not recommended.
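As a minimal illustration (the names and columns are assumptions, not a prescribed standard), a hub and a satellite can simply be Delta tables in the Silver layer, with the satellite storing every version of the descriptive attributes as explicit rows, independent of Delta Time Travel or VACUUM retention:

    // Illustrative Raw Data Vault DDL in a Databricks notebook (`spark` predefined).
    spark.sql("""
      CREATE TABLE IF NOT EXISTS silver.hub_customer (
        hk_customer   STRING    NOT NULL,   -- hash key of the business key
        customer_id   STRING    NOT NULL,   -- business key
        load_dts      TIMESTAMP NOT NULL,
        record_source STRING    NOT NULL
      ) USING DELTA
    """)

    spark.sql("""
      CREATE TABLE IF NOT EXISTS silver.sat_customer_details (
        hk_customer   STRING    NOT NULL,
        load_dts      TIMESTAMP NOT NULL,
        hash_diff     STRING    NOT NULL,   -- change detection over descriptive columns
        customer_name STRING,
        customer_city STRING,
        record_source STRING    NOT NULL
      ) USING DELTA
    """)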

Let's focus on the Raw Data Vault as the integration layer, because this is vital to understanding why Data Vault is such a powerful concept. Data Vault is built for business, so if you start building a Data Vault, you should start by thinking about the business concepts that exist in your organization. Eventually, a well-designed Data Vault model uses these business concepts to simplify multi-source data integration. Data Vault provides the highest value when data about those business concepts is spread across many different data sources. The Silver layer acts as an integration layer, modeled in a way that data is grouped around the business concepts of your organization.


The Business Vault layer is where business rules are added. It is still modeled using Data Vault entity types, and data is still integrated around the business concepts.


You can add several structures to the Business Vault, such as computed satellites, Point-in-Time tables, computed aggregated links, and bridge tables, to apply business rules and speed up querying the Data Vault structures.


Point-in-Time (PIT) tables, for instance, help us to access all history that is stored in the Data Vault. For any desired snapshot time, they point to the combination of satellite records that were valid at that specific time.
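As a sketch of the idea (reusing the hypothetical hub and satellite from above, plus an assumed snapshot-date table and PIT target table), a PIT load could look like this:

    // For each hub key and snapshot date, keep the load timestamp of the
    // satellite record that was valid at that snapshot time.
    spark.sql("""
      INSERT INTO silver.pit_customer
      SELECT
        h.hk_customer,
        d.snapshot_dts,
        MAX(s.load_dts) AS sat_customer_details_ldts
      FROM silver.hub_customer h
      CROSS JOIN silver.snapshot_dates d
      LEFT JOIN silver.sat_customer_details s
        ON  s.hk_customer = h.hk_customer
        AND s.load_dts   <= d.snapshot_dts
      GROUP BY h.hk_customer, d.snapshot_dts
    """)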

Finally, the Gold layer consists of the information marts which are used to present information to end users using, for example, Power BI. Typically, the marts are modeled as a star schema, which is easy to derive from Data Vault structures. A customer dimension, for example, can easily be derived from a customer hub and its surrounding satellites. A purchased-product fact can easily be derived from a transactional purchases link.
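A hedged sketch of such a derivation, again using the hypothetical objects from the earlier examples, is a customer dimension view built on top of the hub, the PIT table, and the satellite:

    // Illustrative dimension view in the Gold layer.
    spark.sql("""
      CREATE OR REPLACE VIEW gold.dim_customer AS
      SELECT
        h.hk_customer AS customer_key,
        h.customer_id,
        s.customer_name,
        s.customer_city,
        p.snapshot_dts
      FROM silver.pit_customer p
      JOIN silver.hub_customer h
        ON h.hk_customer = p.hk_customer
      LEFT JOIN silver.sat_customer_details s
        ON  s.hk_customer = p.hk_customer
        AND s.load_dts    = p.sat_customer_details_ldts
    """)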

Streaming is enabled by tight integration with Structured Streaming, allowing both batch and streaming workloads to be integrated.

Automation helps to build both the data models for the Silver and Gold layers and the loading patterns to load the data. VaultSpeed automation provides these in the form of Databricks Spark SQL notebooks for DDL and DML code and Scala notebooks to run streaming processes. Automation should not stop at supplying the code, though: CI/CD pipelines should also be automated so that no time is lost deploying the new code.

More complex versions of the architecture can combine multiple technologies to host the Bronze, Silver, and Gold layers: for instance, using Databricks for landing, processing, and storage of the integration layer, but building the Gold layer virtualized or on a different platform like Azure Synapse Analytics. Performance and cost optimization should be the drivers of such decisions.

Real-time on Databricks

A common case for Data Vault 2.0 on Databricks is the integration of IoT data, for example trading data or inventory data, coming from Kafka topics. The goal is to integrate the data into the same integration layer as traditional data streams. In the previous article, we explained in depth how the Data Vault 2.0 architecture allows the integration layer to run at different speeds, whilst always making sure that the data is integrated into the same physical layer, the same physical data model.

To absorb the Kafka topics in the same integration layer, they need to be processed and transformed into the Data Vault structure. Just as with traditional data delivery, automation will accelerate this process. Kafka topic metadata can be harvested from the schema registry and used to speed up the design and building of the Data Vault model.

Besides automating the target model using the schema registry, it is also important to make sure that the model can be loaded. To process the streaming data into the Data Vault model, the ability to handle transformations is required, like the ones used to process static data using (Spark) SQL. This is where Spark Structured Streaming comes into play.

It allows us to express streaming computations the same way one would express a batch computation on static data.

In addition to the computations required to transform the data, the process must also be able to deal with typical streaming issues. In the short term, it needs to deal with issues such as latency. Long-term issues include late-arriving data and downtime. Automation will ensure that incidents are handled using various checkpoints. VaultSpeed automation supports this by watermarking data so that, later, this watermarked data can be used for referential integrity resolution.

VaultSpeed automation will generate two types of code:

  • Scala code to be deployed into the Databricks cluster. This is the runtime code.
  • DDL code to create the corresponding Data Vault structures on the target platform.

[Screenshot: example of a running Scala notebook for Spark Structured Streaming]

At runtime, messages will be read by Databricks from the Kafka topics and transformed into a Data Vault structure in the same integration area as data from batch loads. Data will be written to Delta storage.

[Figure: overview of the architecture for streaming]

The following code snippets explain in a bit more detail how the Scala code works.

First, we need to read the Kafka topic stream:

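A minimal sketch of what such a read could look like in a Databricks Scala notebook is shown below; the broker, topic, and column names are illustrative placeholders, not the generated code itself.

    // Read a Kafka topic as a streaming DataFrame (`spark` predefined in the notebook).
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")   // hypothetical broker
      .option("subscribe", "purchases")                      // hypothetical topic
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr(
        "CAST(key AS STRING)   AS msg_key",
        "CAST(value AS STRING) AS msg_value",
        "timestamp             AS kafka_ts")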

Then we need to define the transformations in Spark SQL and write the resulting stream into Delta storage:

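Again as a hedged sketch (the table, columns, and hash logic are assumptions for illustration): the streaming DataFrame is exposed as a temporary view, the Data Vault transformation is expressed in Spark SQL, and the result is written continuously to a Delta satellite table.

    kafkaStream.createOrReplaceTempView("stg_purchases")

    // Shape the messages into a Raw Data Vault satellite structure.
    val satStream = spark.sql("""
      SELECT
        sha2(get_json_object(msg_value, '$.customer_id'), 256) AS hk_customer,
        current_timestamp()                                    AS load_dts,
        get_json_object(msg_value, '$.amount')                 AS amount,
        'kafka.purchases'                                      AS record_source
      FROM stg_purchases
    """)

    // Stream the result into Delta storage in the Silver layer.
    satStream.writeStream
      .option("checkpointLocation", "/mnt/lake/_checkpoints/sat_purchase_details")
      .toTable("silver.sat_purchase_details")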

Disaggregated Storage & Compute Architecture

Databricks and its Lakehouse architecture allow for disaggregated storage and computation. This means that it is not necessarily required to use the same compute engine for all workloads; neither do the same compute resources need to be used by all users to access the data.

No matter how data is ingested, it will eventually be stored somewhere. In Data Vault scenarios, the recommendation is to use Delta-formatted tables for the Raw Data Vault, the Business Vault, and the Gold layer.

Another important storage setting is partitioning: as a general practice, over-partitioning should be avoided, as it will create too much overhead. This and other actions, like applying OPTIMIZE and Z-ORDER, can easily be added to DDL and ETL code using automation tools.
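For example, a hedged maintenance statement on the hypothetical satellite table from earlier might compact small files and co-locate data on the hash key that joins typically filter on:

    // OPTIMIZE compacts small files; ZORDER BY clusters the data on the join key.
    spark.sql("OPTIMIZE silver.sat_customer_details ZORDER BY (hk_customer)")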

[Screenshot: automation simplifies consistent use of OPTIMIZE actions in Spark SQL code]

To perform data transformations, compute is also required. There are many options to choose from among Databricks' compute clusters, which can be divided into "all-purpose compute" and "job compute". All-purpose compute is typically used for analysis and ad-hoc querying. These clusters are often shared by multiple people on the same team and are typically more expensive. Job compute, on the other hand, executes specific jobs like loading the Raw Data Vault. These clusters can be scheduled and terminated when a job is completed and are therefore less expensive.

The most important decision is to size the cluster correctly for each specific workload. The advantage of disaggregated storage and compute is that it is possible to test various setups with very little effort. Dynamic cluster configuration will even allow us to spin up clusters based on the required workloads.

A recommendation for Data Vault is to use the latest Databricks runtime version, because it is practically always faster than the previous one. At this moment, Photon is the fastest engine available, claiming almost double the speed compared to previous engines. Using the latest runtime version will also ensure that Adaptive Query Execution is on by default, so that the best join strategy is chosen automatically. That is important, given that Data Vault tends to increase the number of joins.

Unity Catalog

Another feature Databricks offers is Unity Catalog, the data governance solution for the Lakehouse. It allows organizations to centrally manage users and their access to data assets across all the workspaces in an Azure Databricks account. These assets can include data tables, ML models, and other files.

It consists of four layers: the metastore, the catalog, the schema, and the table (or external table or view). The metastore stores all metadata of the objects existing in the Lakehouse and the permissions applied to impose data governance. Every metastore hosts a three-level namespace model: <catalog>.<schema>.<table>.
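In practice, this means every object is addressed through its fully qualified three-level name; for example, with an assumed catalog name:

    // Fully qualified reference: <catalog>.<schema>.<table>.
    // The earlier examples relied on the current catalog and used two-level names.
    spark.sql("SELECT * FROM lakehouse.silver.hub_customer").show()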

Unity Catalog brings features like versioning, lineage, and access controls. It is possible to use these to apply the data governance principles often used in Data Vault 2.0 implementations. The same applies to Data Mesh principles, where organizations need solid governance to ensure data products are used only by authorized users.

An example is that access to the Raw Data Vault layer is often more restricted than access to the Business Vault layer. Or that specific Business Vault satellites are only accessible by privileged users (perhaps for GDPR reasons). All of this can easily be implemented with Unity Catalog.

Unity Catalog also allows sharing data outside of the organization by enabling Delta Sharing. In addition, Databricks is currently implementing row- and column-level access controls.

[Screenshot: granting privileges on a specific Raw Data Vault hub]
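A hedged example of what such grants might look like (the group and object names are assumptions):

    // Read access on a Raw Data Vault hub for a data engineering group;
    // modify rights stay limited to the group running the loading jobs.
    spark.sql("GRANT SELECT ON TABLE lakehouse.silver.hub_customer TO `data_engineers`")
    spark.sql("GRANT MODIFY ON TABLE lakehouse.silver.hub_customer TO `dv_loaders`")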

Databricks automatically generates lineage based on all Databricks SQL notebooks that have been run for the last 30 days:

[Screenshot: data lineage for a satellite table in the Raw Data Vault]

The screenshot above shows lineage down to the column level. This helps to understand in detail how a column value is generated.

Versioning allows users to track changes to assets over time and roll back to previous versions if necessary: changes made to satellites, data loads, new columns, and so on can all be tracked. These are very useful features in a Data Vault context.
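On Delta tables this history is also directly queryable; as a small illustrative example (table name assumed from the earlier sketches):

    // Each history entry records the operation, timestamp, and user of a change.
    // A table can be rolled back with RESTORE TABLE ... TO VERSION AS OF if needed.
    spark.sql("DESCRIBE HISTORY lakehouse.silver.sat_customer_details").show(truncate = false)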


Unity Catalog allows for integration with other data governance solutions. For instance, Azure Purview and Unity Catalog can work together to provide organizations with a holistic data management solution, with Unity Catalog used as the source of truth for the data within the Databricks workspace.

Workflows

Loading a Data Vault can become complex when considering all layers. While the methodology maximizes parallelization, there are so-called synchronization points: in the simplest approach, for example, the Raw Data Vault should be loaded before loading the Business Vault. To achieve this, there are several options to orchestrate and schedule the runtime on Databricks:

The first option is Azure Data Factory (ADF):

[Screenshot: ADF pipeline loading a specific source into the Data Vault]

ADF provides a visual interface for designing, building, and managing data pipelines and integrates seamlessly with Databricks. At the same time, ADF can handle various other sources and destinations within the Azure ecosystem. It can even provide a managed Airflow instance that integrates with your existing ADF pipelines. It is also very capable of handling exceptions and restarts in case of a load failure. This is the preferable solution in complex environments.

In less complex use cases, Databricks' Workflows and scheduler offer an out-of-the-box option:

[Screenshot: setting up a job scheduler in Databricks]

This option allows the orchestration of the Data Vault in the centralized Databricks environment.

Creating all these workflows and schedules manually can be cumbersome. Thus, automation can be used to build a Data Vault on Databricks at scale.

Deployment

DevOps is a core component in any Data Vault project. DevOps and continuous integration and continuous deployment (CI/CD) enable teams to build and quickly deploy their code and deliver minimum consumable data products within their organization, including governed version control.

Key to an automated and governed CI/CD setup is integration with the leading version control system, Git, and the services built around it, such as Azure DevOps. This integration is a core feature of both Databricks and VaultSpeed.

[Figure: high-level CI/CD setup with VaultSpeed, Git, and Databricks]

The diagram shows an example of a high-level setup. The arrows show the workflow and flow of the code. The developer pushes the code from an automation tool like VaultSpeed to either Databricks or Git into a development branch. When all code is available in the development branch, it can be pulled into the main branch. This will trigger the automated deployment into the test and production environment (with optional approval requests). Once everything is successfully deployed, the metadata is updated in VaultSpeed to allow a new cycle.

Databricks has native Git integration, allowing developers to develop and deploy notebooks directly in the Databricks workspace and to push them to and pull them from a remote repository for version control:

[Screenshot: Databricks Git integration in the workspace]

It also allows consolidation with all code generated by Data Vault automation tools such as VaultSpeed.

If your work is organized in an agile project in Azure DevOps, it is possible to link work items to commits and pull requests:

[Screenshot: linking Azure DevOps work items to commits and pull requests]

It also makes it possible, for instance, to close stories automatically when their code is part of a pull request, and to have the entire code base deployed automatically, all the way up to the production environments. You can even automate the setup of your testing and experimentation environments using Databricks' deep and shallow cloning capabilities.
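As a small illustration of that last point (the schema and table names are assumptions), a test environment can be populated with a metadata-only copy of a production table using a shallow clone:

    // A shallow clone references the source table's data files instead of copying them,
    // making it cheap to spin up test copies of large Data Vault tables.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS test.sat_customer_details
      SHALLOW CLONE lakehouse.silver.sat_customer_details
    """)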

A setup with Databricks, Azure DevOps and VaultSpeed allows organizations to significantly decrease deployment time of their Data Vault projects in an automated way with proper version control and governance:

[Figure: automated deployment with Databricks, Azure DevOps, and VaultSpeed]

VaultSpeed capabilities, such as deploying DDL scripts as notebooks and automatic deployment to either Azure DevOps or the Git folder in the Databricks workspace, allow teams to customize the CI/CD setup to their needs and preferences.

Outlook

In this article, we have described how to implement Data Vault 2.0 on Databricks. The platform provides all necessary tools and services to build a data analytics platform based on the Data Vault 2.0 principles that also works at scale.

In the next article, we will look at data science with Azure Synapse Analytics and Data Vault 2.0.

About the Authors

  • Jonas De Keuster is VP Product Marketing at VaultSpeed. He gained close to 10 years of experience as a DWH consultant in various industries like banking, insurance, healthcare, and HR services before joining the data automation vendor. This background helps him understand current customer needs and engage in conversations with members of the data industry.
  • Koen Moermans is responsible for professional services at VaultSpeed, the leading data warehouse automation platform. Koen works daily with clients from various industries to help them build their data warehouse with data vault and VaultSpeed. Koen has a vast experience in data warehousing on all leading data platforms, especially on Azure Synapse and Databricks.
  • Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia, and publishes these topics on a regular basis.
