Reference

Data Vault Basics

Use this page as a practical reference to design and implement a Data Vault model.

Goal:

ingest changing source systems safely
preserve history and lineage
produce a robust integration layer that feeds analytics marts

When to Use Data Vault #

Data Vault is a strong fit when:

source systems change frequently
you need auditable history and traceability
multiple domains must be integrated in parallel
governance/compliance requirements are strict

Core Structures (Hubs, Links, Satellites) #

Hub: unique business keys only (customer_id, policy_number)
Link: relationships among hubs (customer-policy, policy-product)
Satellite: descriptive context/history plus metadata (load_ts, source)

Raw Vault vs Business Vault #

Raw Vault: source-aligned ingestion with minimal business transformation
Business Vault: standardized business rules, PIT/bridge logic, derived entities
Marts/Semantic Layer: consumption-friendly analytical models for end users

Build Workflow (Recommended) #

Identify business keys and source-to-key mapping.
Build hubs for each durable business concept.
Build links for relationships and event associations.
Build satellites for attributes/history by source and change pattern.
Define load metadata contracts (load_ts, source, hashdiff).
Implement idempotent incremental loading.
Add PIT/bridge structures where needed.
Publish dimensional marts for analysts.

Starter SQL Templates #

sql

-- Hub template
CREATE TABLE hub_customer (
  customer_hk VARCHAR PRIMARY KEY,   -- hashed business key
  customer_bk VARCHAR,               -- business key from source
  load_ts TIMESTAMP,
  record_source VARCHAR
);

-- Link template
CREATE TABLE link_customer_policy (
  customer_policy_hk VARCHAR PRIMARY KEY,
  customer_hk VARCHAR,
  policy_hk VARCHAR,
  load_ts TIMESTAMP,
  record_source VARCHAR
);

-- Satellite template
CREATE TABLE sat_customer_profile (
  customer_hk VARCHAR,
  load_ts TIMESTAMP,
  hashdiff VARCHAR,
  record_source VARCHAR,
  customer_name VARCHAR,
  customer_segment VARCHAR,
  status VARCHAR
);

Key and Hash Strategy #

Decide and document:

canonical business key normalization rules
hash algorithm consistency across domains
collision policy (rare but must be addressed)
reprocessing determinism (same source -> same key)

Incremental Loading Pattern #

sql

-- Satellite incremental load idea
-- Reversion trap: anti-joining hashdiff against ALL history means an
-- A -> B -> A change never inserts; compare to the CURRENT row only.
INSERT INTO sat_customer_profile (
  customer_hk, load_ts, hashdiff, record_source,
  customer_name, customer_segment, status
)
SELECT
  s.customer_hk,
  s.load_ts,
  s.hashdiff,
  s.record_source,
  s.customer_name,
  s.customer_segment,
  s.status
FROM stg_customer_changes s
LEFT JOIN (
  SELECT customer_hk, hashdiff
  FROM sat_customer_profile
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_hk ORDER BY load_ts DESC
  ) = 1
) t
  ON t.customer_hk = s.customer_hk
WHERE t.customer_hk IS NULL
   OR t.hashdiff <> s.hashdiff;

PIT and Bridge Guidance #

Use PIT (Point-In-Time) tables when consumers repeatedly need latest-as-of joins across many satellites.
Use bridge tables when relationship traversal is expensive or frequently reused.

These structures improve query ergonomics and performance but add maintenance overhead.

Validation and Data Quality Checks #

sql

-- 1) Hub business key uniqueness
SELECT customer_bk, COUNT(*)
FROM hub_customer
GROUP BY customer_bk
HAVING COUNT(*) > 1;

-- 2) Orphan link detection
SELECT COUNT(*) AS orphan_link_rows
FROM link_customer_policy l
LEFT JOIN hub_customer h ON h.customer_hk = l.customer_hk
WHERE h.customer_hk IS NULL;

-- 3) Satellite duplicate version detection on the true PK (hub key + load_ts)
-- do not group by hashdiff: repeated hashdiffs are legal after an A -> B -> A reversion
SELECT customer_hk, load_ts, COUNT(*)
FROM sat_customer_profile
GROUP BY customer_hk, load_ts
HAVING COUNT(*) > 1;

Advanced Engineer Concerns #

idempotency for retries and replays
load window backfills and late-arriving records
schema drift in semi-structured sources
storage growth from historization
dependency graph orchestration and recovery behavior

Where Data Vault Fails #

Data Vault is not a direct analyst model.
Without curated marts/semantic layers, query complexity is too high and productivity drops.

Do not stop at Raw Vault if self-serve analytics is a target outcome.

Interactive SQL Check #

Build the link_customer_policy hash keys from the policy source, following the templates above.

For every row in ins_policies, return:

customer_policy_hk: md5 of the two business keys cast to VARCHAR and joined with a pipe (customer first, for example 7|1008)
customer_hk: md5 of customer_id cast to VARCHAR
policy_hk: md5 of policy_id cast to VARCHAR
the raw customer_id and policy_id for traceability

The pipe delimiter is the point: hashing a bare concatenation makes keys like 12|3 and 1|23 collide. Sort by policy_id.

Loading SQL editor...

Warning

Reference architecture that works in practice:
Raw Vault (integration) -> Business Vault (business rules) -> Dimensional marts (consumption).