The 10 Capabilities of Data Vault 2.0 You Should Be Using (2022)

A Data Engineer’s Guide to Unlocking Creative Solutions with the Capabilities of Data Vault 2.0

Not all clients are created equal, and with this logic in mind, neither are the methods for data acquisition and ingestion.

One of the many challenges of working in data engineering is continually coming up with creative solutions to your clients' data problems. For example, if you have a client that cannot afford to implement CDC replication with a paid subscription or license, what would you propose instead? What if you instead had to handle a source that pushes data via Kafka, an API, or file-based extracts?

Over the course of my experience working with clients in the data engineering space, the set of source systems that are candidates for data lake and data warehouse use cases has grown increasingly diverse. This trend, coupled with quality data from various providers in the data marketplace, brings some unique challenges to the data engineering life cycle. While the ability to adapt to these ever-present challenges is a necessity, it is equally important to have data processors that can maintain and retain both the recently active records and the full history.

Since these data sources are not static, they can evolve during the development life cycle. Common evolutions include adding or removing columns, renaming or updating tables, schema drift, and so on. When considering the data acquisition and ingestion components of a data lake, it's important to ask: "What would happen if schema drift were to occur? How would this event affect my data and processes?"

Enter Data Vault 2.0

By and large, there has been a consistent (and welcome!) influx of information about the Data Vault model and methodology, which has helped data teams handle a wide range of scenarios with ease.

For example, take Data Vault with its Hub-Link-Satellite approach. In my opinion, this combination offers agility by taking a Property Graph-like approach with nodes (Hubs), edges (Links), and properties (Satellites). As more and more source systems are added, or as the schema evolves, your underlying model remains viable and doesn’t get invalidated.

How is this possible?

Well, the short answer is that the Data Vault methodology facilitates forward progression: just because schema drift has happened doesn't mean the existing data ingestion (ET) processes need to change.

Data Vault 2.0 has a number of other lesser-known or overlooked specifications (check out the full list here!), so I won’t be focusing on just the Hub-Link-Satellite example from above. Instead, I’m going to walk you through the ten capabilities that piqued my interest when I began to learn about Data Vault. If you’re looking to improve your data ingestion process, a great way to get started today is by familiarizing yourself with these capabilities.

1. Always Insert

With this capability, as sources send in inserts, updates, and deletes, we maintain each incoming change as a separate record. Think of it as similar to what Kafka does, in the sense that the table behaves like an append-only log.

By avoiding the update operation on the "Load End Date" column, the pattern is much easier to adopt on modern cloud object stores like S3, Azure Blob Storage, etc. Insert operations are, in general, much faster as well, since loading times stay (for the most part) consistent.
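
To make this concrete, here is a minimal Python sketch of an insert-only satellite load. Everything here is an illustrative assumption rather than a prescribed implementation: the in-memory satellite list stands in for a real table, the column names are hypothetical, and MD5 is just one common hash-function choice in DV2 implementations.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Surrogate hash key derived from a business key (illustrative MD5)."""
    return hashlib.md5(business_key.upper().encode("utf-8")).hexdigest()

def hash_diff(attributes: dict) -> str:
    """Fingerprint of the descriptive attributes, used to detect changes."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

satellite: list = []  # stand-in for an append-only satellite table

def load_satellite(business_key: str, attributes: dict) -> None:
    """Insert-only: every insert/update/delete from the source becomes a new row.
    No "Load End Date" is ever updated on existing rows."""
    satellite.append({
        "hub_hash_key": hash_key(business_key),
        "hash_diff": hash_diff(attributes),
        "load_date": datetime.now(timezone.utc),
        **attributes,
    })

load_satellite("CUST-001", {"name": "Ada", "city": "London"})
load_satellite("CUST-001", {"name": "Ada", "city": "Paris"})  # change = new row
print(len(satellite))  # 2 rows; history preserved, nothing updated in place
```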

2. Tooling and Data Warehouse Vendor

In a way, this also means that you may be able to get away with not adopting a CDC replication tool for some of your data sources. For example, if the source database tables are small or medium-sized, a full extract may be perfectly acceptable.

The insert-only approach is also very friendly to modern cloud-based data warehouses and data lakes. Regardless of whether you choose a vendor like Snowflake, Databricks, or BigQuery, the underlying object storage never needs to be rewritten in place: existing micro-partitions in Snowflake, or Parquet/Delta files in Databricks, remain untouched.

This means you'll experience faster loads and ingestion. Additionally, it's worth noting that Data Vault has been adopted on non-SQL databases like MongoDB as well.

3. Active Records

To get the active record, we can use analytical window functions like LAG and LEAD. You can also complement this with query-assistance tables like PITs ("Point-in-Time" tables) and Bridges.
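
As a rough, runnable illustration (using SQLite, which supports window functions from version 3.25 onward; the table and column names are hypothetical), the active record can be derived at read time with an analytical function, here ROW_NUMBER, a sibling of LAG/LEAD:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (hub_hash_key TEXT, load_date TEXT, city TEXT);
INSERT INTO sat_customer VALUES
  ('abc', '2022-01-01', 'London'),
  ('abc', '2022-03-01', 'Paris'),
  ('def', '2022-02-01', 'Berlin');
""")

# Derive the active record per hub key at read time: the newest row wins.
# No end-date column is maintained; the window function does the work.
rows = con.execute("""
SELECT hub_hash_key, load_date, city
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY hub_hash_key ORDER BY load_date DESC
  ) AS rn
  FROM sat_customer
)
WHERE rn = 1
ORDER BY hub_hash_key
""").fetchall()
print(rows)  # [('abc', '2022-03-01', 'Paris'), ('def', '2022-02-01', 'Berlin')]
```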

This might look like overhead at read time, but my argument is that it simplifies the ingestion implementation logic while gracefully handling out-of-sequence or late-arriving records.

If querying the satellite directly is not meeting your SLA, I recommend adopting the PIT pattern. The PIT is technically a snapshot that can be materialized as often as you want and preserved as long as you want. In fact, the PIT can even be a single table: just add a column named "PIT-Snapshot-Date" and partition/cluster on this new column.
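
A minimal sketch of that PIT idea, again on SQLite; the pit_customer table, its columns, and the MAX(load_date) selection logic are all assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (hub_hash_key TEXT, load_date TEXT, city TEXT);
CREATE TABLE pit_customer (hub_hash_key TEXT, pit_snapshot_date TEXT, sat_load_date TEXT);
INSERT INTO sat_customer VALUES
  ('abc', '2022-01-01', 'London'),
  ('abc', '2022-03-01', 'Paris');
""")

def build_pit(snapshot_date: str) -> None:
    """Materialize one snapshot: for each hub key, record which satellite row
    was active as of snapshot_date. Rebuild as often (and keep as long) as needed."""
    con.execute("""
        INSERT INTO pit_customer
        SELECT hub_hash_key, ?, MAX(load_date)
        FROM sat_customer
        WHERE load_date <= ?
        GROUP BY hub_hash_key
    """, (snapshot_date, snapshot_date))

build_pit("2022-02-01")  # snapshot points at the 'London' row
build_pit("2022-04-01")  # snapshot points at the 'Paris' row
print(con.execute("SELECT * FROM pit_customer ORDER BY pit_snapshot_date").fetchall())
```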

4. SCD Type 2/3 Tables

One of the benefits I've observed with this type of table is that SCD Type 2/3 structures can be added to the information data mart as the need arises. You can even add these tables to the data mart later in the delivery lifecycle and still have the historical records from day 1.
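
For instance, assuming an insert-only satellite like the one sketched earlier, an SCD Type 2 shape (effective-from/effective-to ranges) can be computed on demand in the mart; the '9999-12-31' open-end convention and all names below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (hub_hash_key TEXT, load_date TEXT, city TEXT);
INSERT INTO sat_customer VALUES
  ('abc', '2022-01-01', 'London'),
  ('abc', '2022-03-01', 'Paris');
""")

# Derive an SCD Type 2 shape on demand: the insert-only satellite already
# holds full history, so effective-date ranges are just a window function away.
rows = con.execute("""
SELECT hub_hash_key, city,
       load_date AS effective_from,
       LEAD(load_date, 1, '9999-12-31') OVER (
         PARTITION BY hub_hash_key ORDER BY load_date
       ) AS effective_to
FROM sat_customer
ORDER BY load_date
""").fetchall()
print(rows)
# [('abc', 'London', '2022-01-01', '2022-03-01'),
#  ('abc', 'Paris', '2022-03-01', '9999-12-31')]
```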

5. Out-of-Sequence Data

One common pain point in data ingestion, specifically with IIoT datasets, is receiving events out of sequence. It is typically painful to identify the older records and then take the time to update multiple records, repositioning and reordering them to accommodate the late-arriving record.

Another constant pain point is identifying and avoiding duplicate records loaded from a stream. With the "Always Insert" method, coupled with "HashDiff" columns, the aforementioned repositioning never needs to be implemented, which keeps the data ingestion process much simpler.

The "HashDiff" column, with some caveats, prevents loading duplicate records and maintains the ease and simplicity of ingestion. Additionally, if the active-record implementation logic uses an analytical window function, it does not need to change.
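
A minimal sketch, assuming an in-memory satellite and an MD5-based HashDiff, of how exact duplicates get skipped while late arrivals are simply inserted, with ordering deferred to read time:

```python
import hashlib

def hash_diff(attributes: dict) -> str:
    """Fingerprint of the descriptive attributes (hypothetical MD5-based scheme)."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

satellite: list = []  # append-only; rows carry the source event time

def load(hub_hash_key: str, event_time: str, attributes: dict) -> None:
    """Skip exact duplicates; insert everything else, even when it arrives late.
    Existing rows are never repositioned or updated."""
    diff = hash_diff(attributes)
    duplicate = any(
        r["hub_hash_key"] == hub_hash_key
        and r["event_time"] == event_time
        and r["hash_diff"] == diff
        for r in satellite
    )
    if not duplicate:
        satellite.append({"hub_hash_key": hub_hash_key, "event_time": event_time,
                          "hash_diff": diff, **attributes})

load("dev-1", "2022-05-02T10:00", {"temp": 21})
load("dev-1", "2022-05-02T10:00", {"temp": 21})  # replayed duplicate: skipped
load("dev-1", "2022-05-01T09:00", {"temp": 19})  # late arrival: just inserted
print(len(satellite))  # 2; ordering is resolved at read time, not at load time
```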

6. CDC and Replication Data Streams

When using the "Always Insert" approach, the data ingestion pattern, as well as the underlying implementation, remains the same regardless of the data acquisition pattern: CDC, file-based, streams, or API pull.

I feel that this is a huge plus for data engineers. Without it, successfully implementing data ingestion can be difficult, since the approach would depend on the type of source and its replication pattern. That issue, coupled with source-system drift and the variety of delivery channels (Kafka, REST, etc.), means you would have to accommodate numerous scenarios and error-processing paths in the ingestion components, and write source-specific ingestion code.

Because you would have to keep track of each source system's behavior and how its ingestion pattern implementation changes over time, the ingestion process becomes prone to miscommunication, misunderstanding, and a minefield of potential errors.
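
The point can be reduced to a deliberately simple sketch: one shared insert path, regardless of the channel that produced the records (the record shape below is hypothetical):

```python
from typing import Iterable

satellite: list = []  # one append-only target, whatever the channel

def ingest(records: Iterable[dict]) -> None:
    """One shared implementation: every record, whether from CDC, a file
    extract, a stream, or an API pull, is simply inserted; nothing is merged."""
    for record in records:
        satellite.append(record)

ingest([{"id": "A", "v": 1}])  # e.g. rows parsed from a file extract
ingest([{"id": "A", "v": 2}])  # e.g. a CDC event consumed from Kafka
print(len(satellite))          # 2; same code path either way
```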

7. Loading Hubs, Links, and Satellites

With the following specification changes:

  • Hub and Satellite Sequence ID (deprecated)
  • Hub and Link Last Seen Dates (deprecated)
  • Adoption of hash keys

Loading the various entities becomes much faster, and a great deal of parallelism is achieved during ingestion into the Raw Vault, which ultimately leads to faster data availability later in the pipeline.
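
A small sketch of why hash keys unlock this parallelism: each table's key is computed from business keys alone, so no load has to wait on a sequence generator or a lookup in another table. The entities, MD5 hashing, and in-memory tables are illustrative assumptions:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hk(*business_keys: str) -> str:
    """Deterministic hash key from one or more business keys (illustrative MD5)."""
    return hashlib.md5("|".join(k.upper() for k in business_keys).encode()).hexdigest()

record = {"customer_id": "CUST-001", "order_id": "ORD-9", "amount": 42}
hub_customer, hub_order, link_order, sat_order = [], [], [], []  # in-memory stand-ins

def load_hub_customer(): hub_customer.append({"hk": hk(record["customer_id"])})
def load_hub_order(): hub_order.append({"hk": hk(record["order_id"])})
def load_link(): link_order.append({"hk": hk(record["customer_id"], record["order_id"])})
def load_sat(): sat_order.append({"hk": hk(record["order_id"]), "amount": record["amount"]})

# Hash keys are derived from business keys alone, so no table waits on a
# sequence or an ID lookup in another table: all four loads run in parallel.
with ThreadPoolExecutor() as pool:
    for job in (load_hub_customer, load_hub_order, load_link, load_sat):
        pool.submit(job)

print(len(hub_customer), len(hub_order), len(link_order), len(sat_order))  # 1 1 1 1
```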

8. Hub as an Index

The hub is defined as "a unique list of business keys," and it is technically an index. Records are inserted into the hub only for new entities, such as new customers, addresses, or IoT devices. If the input data contains an address change for an existing customer, or records whose keys are already in the hub, nothing is inserted again, nor is the existing record updated.

Technically, this means that you are only inserting delta records that represent new entities, such as new customers. If the changed record set contains no new entities, then no write operations occur at all, which improves ingestion speed.
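
A minimal sketch of the hub-as-index behavior, assuming an in-memory dict as the hub and MD5 hash keys:

```python
import hashlib

hub_customer: dict = {}  # hash key -> business key: "a unique list of business keys"

def load_hub(business_key: str) -> None:
    """Insert only brand-new entities. Keys already present are neither
    re-inserted nor updated, so the hub behaves like an index."""
    key = hashlib.md5(business_key.upper().encode()).hexdigest()
    if key not in hub_customer:
        hub_customer[key] = business_key

load_hub("CUST-001")
load_hub("CUST-001")  # e.g. an address change for an existing customer: no write
load_hub("CUST-002")
print(len(hub_customer))  # 2
```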

9. Satellites Based on Data Classification

Due to the ever-increasing demands of data governance, privacy, and regulation, it is a must to keep track of PII data and maintain it in secured datasets.

Satellites, which contain context data like name, date of birth, sex, and the role of the entity (e.g., customer), need not be kept together. You could implement satellites that are split by data classification: store all PII-related context in one table and the non-PII data in another. Using the hash keys, the full record set can still be reconstructed even though the columns reside in different satellites.

Another observation here is that if the PII columns have not changed, no new records need to be inserted into the satellite holding the PII data.
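
Here is a hedged sketch of the split; the PII column list, the table shapes, and the presence-based check are illustrative assumptions (a real load would also compare HashDiffs per satellite):

```python
import hashlib
from datetime import datetime, timezone

PII_COLUMNS = {"name", "date_of_birth"}  # hypothetical classification list

sat_pii: list = []     # secured satellite for PII context
sat_nonpii: list = []  # ordinary satellite; both share the same hash key

def load_split(business_key: str, delta: dict) -> None:
    key = hashlib.md5(business_key.upper().encode()).hexdigest()
    now = datetime.now(timezone.utc)
    pii = {k: v for k, v in delta.items() if k in PII_COLUMNS}
    rest = {k: v for k, v in delta.items() if k not in PII_COLUMNS}
    if pii:  # a delta with no PII columns writes nothing to the secured satellite
        sat_pii.append({"hub_hash_key": key, "load_date": now, **pii})
    if rest:
        sat_nonpii.append({"hub_hash_key": key, "load_date": now, **rest})

load_split("CUST-001", {"name": "Ada", "date_of_birth": "1990-01-01", "tier": "gold"})
load_split("CUST-001", {"tier": "platinum"})  # non-PII change: PII table untouched
print(len(sat_pii), len(sat_nonpii))  # 1 2
# The full record can be rebuilt by joining both satellites on hub_hash_key.
```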

10. Links are Immutable

The link/relation is immutable: once a relation between two hubs has been observed, it is always there and remains unchanged over time. The link satellite holds the "End-Date" information (if applicable), and again, if there is no change to the relation, there is no write operation. The link table is thus essentially another index-like store.
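
And a last minimal sketch, assuming a set-backed link table keyed on a hash of the related business keys, showing why re-observing a relation causes no write:

```python
import hashlib

link_customer_address: set = set()  # immutable relations, keyed by hash

def load_link(customer_key: str, address_key: str) -> None:
    """A relation, once observed, is inserted exactly once and never changed.
    Effectivity (e.g. an end date) would live in a link satellite instead."""
    key = hashlib.md5(f"{customer_key}|{address_key}".upper().encode()).hexdigest()
    link_customer_address.add(key)  # set semantics: re-adding is a no-op

load_link("CUST-001", "ADDR-9")
load_link("CUST-001", "ADDR-9")  # relation unchanged: no effective write
print(len(link_customer_address))  # 1
```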

I hope you found these 10 Data Vault 2.0 capabilities as interesting as I did; they offer plenty of positive effects for the data ingestion process. There are additional specifications that might be of interest, like the "Same-as" Link, the "Exploration" Link, and many more.

If you want to read further on more characteristics of DV2, I would recommend this article: Do you know these 7 Characteristics Of Data Vault 2.0? Keep an eye out for my corresponding story as I reflect on my journey and experience with learning about DV2, along with some helpful tips for starting the adoption process.

Related reading:
  • Getting Started with Data Vault 2.0: My top recommendations for beginning your journey with DV2 (medium.com)
  • 3NF and Data Vault 2.0, A Simple Comparison: Data Vault, from its first edition '1.0' as a data modeling specification, has evolved into a more elaborate version… (medium.com)
  • Digging into Data Governance and Data Modeling: Testing and validation are where data governance begins to play a massively important role… (medium.com)

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges — we can help you shorten time to value!

We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products. We would be glad to work through your specific requirements. Connect with us here.

Venkat Sekar is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
