Data Vault Basics
Use this page as a practical reference to design and implement a Data Vault model.
Goal:
- ingest changing source systems safely
- preserve history and lineage
- produce a robust integration layer that feeds analytics marts
When to Use Data Vault #
Data Vault is a strong fit when:
- source systems change frequently
- you need auditable history and traceability
- multiple domains must be integrated in parallel
- governance/compliance requirements are strict
Core Structures (Hubs, Links, Satellites) #
- Hub: unique business keys only (customer_id, policy_number)
- Link: relationships among hubs (customer-policy, policy-product)
- Satellite: descriptive context/history plus metadata (load_ts, source)
Raw Vault vs Business Vault #
- Raw Vault: source-aligned ingestion with minimal business transformation
- Business Vault: standardized business rules, PIT/bridge logic, derived entities
- Marts/Semantic Layer: consumption-friendly analytical models for end users
Build Workflow (Recommended) #
- Identify business keys and source-to-key mapping.
- Build hubs for each durable business concept.
- Build links for relationships and event associations.
- Build satellites for attributes/history by source and change pattern.
- Define load metadata contracts (load_ts, source, hashdiff).
- Implement idempotent incremental loading.
- Add PIT/bridge structures where needed.
- Publish dimensional marts for analysts.
Starter SQL Templates #
-- Hub template
CREATE TABLE hub_customer (
customer_hk VARCHAR PRIMARY KEY, -- hashed business key
customer_bk VARCHAR, -- business key from source
load_ts TIMESTAMP,
record_source VARCHAR
);
-- Link template
CREATE TABLE link_customer_policy (
customer_policy_hk VARCHAR PRIMARY KEY,
customer_hk VARCHAR,
policy_hk VARCHAR,
load_ts TIMESTAMP,
record_source VARCHAR
);
-- Satellite template
CREATE TABLE sat_customer_profile (
customer_hk VARCHAR,
load_ts TIMESTAMP,
hashdiff VARCHAR,
record_source VARCHAR,
customer_name VARCHAR,
customer_segment VARCHAR,
status VARCHAR
);Key and Hash Strategy #
Decide and document:
- canonical business key normalization rules
- hash algorithm consistency across domains
- collision policy (rare but must be addressed)
- reprocessing determinism (same source -> same key)
Incremental Loading Pattern #
-- Satellite incremental load idea
-- Reversion trap: anti-joining hashdiff against ALL history means an
-- A -> B -> A change never inserts; compare to the CURRENT row only.
INSERT INTO sat_customer_profile (
customer_hk, load_ts, hashdiff, record_source,
customer_name, customer_segment, status
)
SELECT
s.customer_hk,
s.load_ts,
s.hashdiff,
s.record_source,
s.customer_name,
s.customer_segment,
s.status
FROM stg_customer_changes s
LEFT JOIN (
SELECT customer_hk, hashdiff
FROM sat_customer_profile
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_hk ORDER BY load_ts DESC
) = 1
) t
ON t.customer_hk = s.customer_hk
WHERE t.customer_hk IS NULL
OR t.hashdiff <> s.hashdiff;PIT and Bridge Guidance #
Use PIT (Point-In-Time) tables when consumers repeatedly need latest-as-of joins across many satellites.
Use bridge tables when relationship traversal is expensive or frequently reused.
These structures improve query ergonomics and performance but add maintenance overhead.
Validation and Data Quality Checks #
-- 1) Hub business key uniqueness
SELECT customer_bk, COUNT(*)
FROM hub_customer
GROUP BY customer_bk
HAVING COUNT(*) > 1;
-- 2) Orphan link detection
SELECT COUNT(*) AS orphan_link_rows
FROM link_customer_policy l
LEFT JOIN hub_customer h ON h.customer_hk = l.customer_hk
WHERE h.customer_hk IS NULL;
-- 3) Satellite duplicate version detection on the true PK (hub key + load_ts)
-- do not group by hashdiff: repeated hashdiffs are legal after an A -> B -> A reversion
SELECT customer_hk, load_ts, COUNT(*)
FROM sat_customer_profile
GROUP BY customer_hk, load_ts
HAVING COUNT(*) > 1;Advanced Engineer Concerns #
- idempotency for retries and replays
- load window backfills and late-arriving records
- schema drift in semi-structured sources
- storage growth from historization
- dependency graph orchestration and recovery behavior
Where Data Vault Fails #
Data Vault is not a direct analyst model.
Without curated marts/semantic layers, query complexity is too high and productivity drops.
Do not stop at Raw Vault if self-serve analytics is a target outcome.
Interactive SQL Check #
Build the link_customer_policy hash keys from the policy source, following the templates above.
For every row in ins_policies, return:
- customer_policy_hk: md5 of the two business keys cast to VARCHAR and joined with a pipe (customer first, for example 7|1008)
- customer_hk: md5 of customer_id cast to VARCHAR
- policy_hk: md5 of policy_id cast to VARCHAR
- the raw customer_id and policy_id for traceability
The pipe delimiter is the point: hashing a bare concatenation makes keys like 12|3 and 1|23 collide. Sort by policy_id.
Related Lessons #
Reference architecture that works in practice:
Raw Vault (integration) -> Business Vault (business rules) -> Dimensional marts (consumption).