Data Vault 101

This post is intended for those who are about to dip their toe into the world of Data Vaults. It is not intended to dissuade or persuade you in the choice of using a Data Vault (DV), there are many of those posts already and typically the stray into fundamentalism

What is a data vault

A data vault (DV) is a layer within a data warehouse architecture that is designed to:

  • Retain all historical data
  • Enable parallelism (loading data in parallel)
  • Provide flexibility when a source schema change
  • Base data modelling on three main objects (hubs, links, satellites) that create a logical Ensemble
  • Allow data loading to commence before data modelling has finished (see Raw Vault)

The Data Vault approach has evolved in several areas:

  • Raw Vaults (No business rules, auditable)
  • Business Vaults (Applied business rules and business key alignment)
  • Data Vault 2.0 (use of hash keys as the surrogate key, evolution of links, etc.)

It does not:

  • Replace dimensional modelling
  • Have dimensions or facts
  • Encourage direct access by business users (some data scientist types might stray in there)

What is the modelling approach?

As mentioned above, the core objects are Hubs, Links and Satellites. There are other options like helpers, bridge, etc. but they are beyond the scope of this post

Hub

  • A Hub table reflects a core business concept like Customer, Sale, Invoice, Employee etc. Hub tables contain the business key from the source system(s).​
  • Accompanying the business key, is a:​
    • Surrogate key​
    • Load date time, for when the record entered the data warehouse​
    • Record source identifier​
    • Any other meta data deemed necessary by the solution architect.​
  • A hub record is never updated. It reflects the state of the record as it was first received

Link

  • Link tables form a relationship between Hubs.  This is how one business subject area connects to another. Link tables contain:​
    • Surrogate keys from each joining Hub that form that relationship. The join can relate two or more hubs.​
    • Load date time, for when the record entered the data warehouse.​
    • Record source identifier.​
    • Any other meta data deemed necessary by the solution architect.​
  • All links are many to many, therefore you do not need to change to accommodate changing relationships over time.​
  • A link record is never updated. It reflects the state of the record as it was first received.​

Satellite

  • Satellite tables are where all the descriptive information about a Hub, or Link, are stored. The key of a Satellite is made up of the surrogate key from the connecting Hub and the date time the record was loaded which enables tracking of all history from the source​
    • Surrogate key(s) from link or hub.​
    • Load date time, for when the record entered the data warehouse​
    • Record source identifier​
    • Any other meta data deemed necessary by the solution architect.​
  • A satellite record can be updated to effectively end historic record. This is a personal/system based choice​

When Hubs, Links and Satellites are linked together they form a logical Ensemble

Rules and Concepts

  • Avoid​
    • Having satellites on links if possible.​
    • Building an overly complex model that links all reference tables.​
    • Multi valued satellites (i.e. a type attribute on an address satellite).​
    • Updating a satellite record. If you need to, keep it to a minimum, so as not to compromise the Data Vaults SOR responsibility.​
  • Never ​
    • Update a hub record.​
    • Update a link record.​
    • Place a measure or descriptive attribute on a hub or link.​
    • Record history on hubs or links.
  • Other
    • Use the date/time from the staging area as meta data, not from the source system.​
    • Start building a model from business interviews, not from source systems. This avoids missing core concepts (i.e. hubs).​
    • Concatenate source system ID and source system business key to ensure uniqueness if required.​
    • If you need a link satellite, keep the attributes to a minimum (i.e. Qty).​
    • Hash the business keys to generate a surrogate key. This means that you can create link records without doing a lookup to the relevant Hubs you are joining.

Challenges

DVs present some unique challenges due to the following

  • Proliferation of tables therefore a larger number of ETL jobs to build.​
  • Takes longer to build than a simple staging area.​
  • Does not provide added tangible business value.
  • Adds complexity when populating the Dims and Facts in the presentation layer.​
  • Querying a DV can be complex due to the level or normalisation and amount of history. Views can help to overcome this, however, they may not perform well.

Ignore these at your peril.

Recommendations

In no particular order, these are some painful lessons learnt:

  • If your are going to capture all changes in the source system, within the raw vault, joining the data together in order to create values in the type 2 dimensions will be a nightmare if you haven’t decided on the grain of changes you want to present to end-users.
  • Get your patterns and framework designed, tested and signed off up front. Changing your mind later will be potentially very costly.
  • Make sure the build team (e.g. data modelers and ETL developers) have a solid understanding of the DV 2.0 concepts. Any confusion will be costly.
  • Don’t hand code a DV. This is not a project where you want to be dragging and dropping, ETL has to be meta data driven. DVs are all about efficient build processes due to the shear volume of tables you will need to load. This is even more important when you realise that there is an issue in the patterns/framework and you need to recreate them.
  • Don’t go for a hybrid approach where you apply your own concepts to the Data Vault build (i.e. stick to the core concepts of DV 2.0). You run the risk of throwing out the benefits of a Vault.
  • Make sure your hardware and orchestration tool supports parallel loading of data. Before you know it, your DV will contain hundreds of tables, so that you can load them in an acceptable time frame they have to be loaded in parallel.
  • Don’t think that views are cheap and easy alternative to ETL. Often developers try to virtualise the presentation layer over the DV, or create views that avoid the need for a Business Vault, they invariably perform badly.

Thought leaders in the Data Vault space