Reference

Incremental and Backfill Strategy

Use this page to design safe incremental pipelines and controlled backfills for analytical models.

Goal:

  • keep daily runs fast and idempotent
  • handle late-arriving and corrected records
  • restate historical periods without breaking trust

Core Principles #

  • Incremental logic must be deterministic.
  • Re-runs should produce identical results (idempotency).
  • Late-arriving records require bounded restatement windows.
  • Backfills should be isolated, observable, and reversible.

Incremental Load Pattern #

sql
-- Example: upsert-like incremental refresh by date partition
DELETE FROM fact_order_item
WHERE order_date >= DATE '2025-01-01';

INSERT INTO fact_order_item
SELECT
  order_item_sk,
  order_id,
  order_item_id,
  order_date_key,
  customer_key,
  product_key,
  channel_key,
  quantity,
  gross_amount,
  discount_amount,
  net_amount,
  cost_amount
FROM stg_order_item_transformed
WHERE order_date >= DATE '2025-01-01';

Watermark Mechanics #

The date filter above needs a source of truth: a stored high-water mark. Track it on a load/update timestamp (updated_at), never on the event date: a record created weeks ago but corrected today carries an old order_date and slips past an event-date watermark. Subtract a trailing reprocess window so late-arriving rows inside the window get picked up on the next run.

sql
-- 1) Read the stored high-water mark (one control row per pipeline)
SELECT watermark_ts
FROM etl_watermark
WHERE pipeline_name = 'fact_order_item';

-- 2) Reload everything updated since the watermark, minus a trailing
--    reprocess window for late-arriving data
DELETE FROM fact_order_item
WHERE order_id IN (
  SELECT order_id
  FROM stg_order_item
  WHERE updated_at > :watermark - INTERVAL 3 DAY
);

INSERT INTO fact_order_item
SELECT *
FROM stg_order_item_transformed
WHERE updated_at > :watermark - INTERVAL 3 DAY;

-- 3) Advance the watermark only after the load commits
UPDATE etl_watermark
SET watermark_ts = (SELECT MAX(updated_at) FROM stg_order_item)
WHERE pipeline_name = 'fact_order_item';
DimensionWatermark incrementalFull refresh
Run costScans only rows changed since the watermarkRescans the entire source every run
Late-arriving dataNeeds an explicit trailing reprocess windowPicked up automatically
IdempotencyMust be engineered (delete + insert by key/window)Trivially idempotent
Failure modesSilent watermark drift, updates older than the window missedSlow runs and serving downtime during rebuild

Late-Arriving Data Policy #

Define a restatement window by domain (for example 30 or 90 days).
Rows arriving after that window should follow explicit exception handling:

  • manual correction workflow
  • periodic deep restatement job
  • business sign-off on reporting impact

Backfill Playbook #

  1. Scope period and impacted models.
  2. Estimate compute/cost and concurrency impact.
  3. Freeze or isolate dependent serving refreshes.
  4. Run backfill in controlled batches.
  5. Reconcile totals before publish.
  6. Announce completion and version notes.

Reconciliation SQL #

sql
WITH src AS (
  SELECT
    order_date,
    ROUND(SUM(net_amount), 2) AS src_revenue
  FROM stg_order_item
  WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
  GROUP BY order_date
),
mart AS (
  SELECT
    order_date,
    ROUND(SUM(net_amount), 2) AS mart_revenue
  FROM fact_order_item
  WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31'
  GROUP BY order_date
)
SELECT
  s.order_date,
  s.src_revenue,
  m.mart_revenue
FROM src s
JOIN mart m ON m.order_date = s.order_date
WHERE s.src_revenue <> m.mart_revenue;

Operational Controls #

  • Keep run metadata (run_id, start/end, source watermark).
  • Alert on unexpected row-count deltas.
  • Set hard limits on backfill batch size.
  • Maintain a rollback path for serving tables.
  • Publish data freshness and backfill status to users.

Senior Engineer Playbook #

  • Query-plan diagnostics: validate incremental filters and confirm affected partitions are the only ones scanned.
  • Failure postmortem pattern: document the run where the high-water mark advanced past rows that never loaded (watermark drift) and quantify the KPI impact of the missed window.
  • Migration playbook: transition from full-refresh to incremental through dual-run comparison and controlled backfill cutover.

Interactive SQL Check #

Run this bounded-window aggregate to simulate an incremental refresh slice.

Loading SQL editor...
Warning

A fast incremental pipeline without restatement controls will eventually produce silent historical defects.