Incremental and Backfill Strategy
Use this page to design safe incremental pipelines and controlled backfills for analytical models.
Goal:
- keep daily runs fast and idempotent
- handle late-arriving and corrected records
- restate historical periods without breaking trust
Core Principles #
- Incremental logic must be deterministic.
- Re-runs should produce identical results (idempotency).
- Late-arriving records require bounded restatement windows.
- Backfills should be isolated, observable, and reversible.
Incremental Load Pattern #
-- Example: upsert-like incremental refresh by date partition
DELETE FROM fact_order_item
WHERE order_date >= DATE '2025-01-01';
INSERT INTO fact_order_item
SELECT
order_item_sk,
order_id,
order_item_id,
order_date_key,
customer_key,
product_key,
channel_key,
quantity,
gross_amount,
discount_amount,
net_amount,
cost_amount
FROM stg_order_item_transformed
WHERE order_date >= DATE '2025-01-01';Watermark Mechanics #
The date filter above needs a source of truth: a stored high-water mark. Track it on a load/update timestamp (updated_at), never on the event date: a record created weeks ago but corrected today carries an old order_date and slips past an event-date watermark. Subtract a trailing reprocess window so late-arriving rows inside the window get picked up on the next run.
-- 1) Read the stored high-water mark (one control row per pipeline)
SELECT watermark_ts
FROM etl_watermark
WHERE pipeline_name = 'fact_order_item';
-- 2) Reload everything updated since the watermark, minus a trailing
-- reprocess window for late-arriving data
DELETE FROM fact_order_item
WHERE order_id IN (
SELECT order_id
FROM stg_order_item
WHERE updated_at > :watermark - INTERVAL 3 DAY
);
INSERT INTO fact_order_item
SELECT *
FROM stg_order_item_transformed
WHERE updated_at > :watermark - INTERVAL 3 DAY;
-- 3) Advance the watermark only after the load commits
UPDATE etl_watermark
SET watermark_ts = (SELECT MAX(updated_at) FROM stg_order_item)
WHERE pipeline_name = 'fact_order_item';| Dimension | Watermark incremental | Full refresh |
|---|---|---|
| Run cost | Scans only rows changed since the watermark | Rescans the entire source every run |
| Late-arriving data | Needs an explicit trailing reprocess window | Picked up automatically |
| Idempotency | Must be engineered (delete + insert by key/window) | Trivially idempotent |
| Failure modes | Silent watermark drift, updates older than the window missed | Slow runs and serving downtime during rebuild |
Late-Arriving Data Policy #
Define a restatement window by domain (for example 30 or 90 days).
Rows arriving after that window should follow explicit exception handling:
- manual correction workflow
- periodic deep restatement job
- business sign-off on reporting impact
Backfill Playbook #
- Scope period and impacted models.
- Estimate compute/cost and concurrency impact.
- Freeze or isolate dependent serving refreshes.
- Run backfill in controlled batches.
- Reconcile totals before publish.
- Announce completion and version notes.
Reconciliation SQL #
WITH src AS (
SELECT
order_date,
ROUND(SUM(net_amount), 2) AS src_revenue
FROM stg_order_item
WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
GROUP BY order_date
),
mart AS (
SELECT
order_date,
ROUND(SUM(net_amount), 2) AS mart_revenue
FROM fact_order_item
WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
GROUP BY order_date
)
SELECT
s.order_date,
s.src_revenue,
m.mart_revenue
FROM src s
JOIN mart m ON m.order_date = s.order_date
WHERE s.src_revenue <> m.mart_revenue;Operational Controls #
- Keep run metadata (run_id, start/end, source watermark).
- Alert on unexpected row-count deltas.
- Set hard limits on backfill batch size.
- Maintain a rollback path for serving tables.
- Publish data freshness and backfill status to users.
Senior Engineer Playbook #
- Query-plan diagnostics: validate incremental filters and confirm affected partitions are the only ones scanned.
- Failure postmortem pattern: document the run where the high-water mark advanced past rows that never loaded (watermark drift) and quantify the KPI impact of the missed window.
- Migration playbook: transition from full-refresh to incremental through dual-run comparison and controlled backfill cutover.
Interactive SQL Check #
Run this bounded-window aggregate to simulate an incremental refresh slice.
Related Lessons #
A fast incremental pipeline without restatement controls will eventually produce silent historical defects.