Lesson Advanced

Normal Forms

Normalization is the practice of storing each fact in exactly one place. The goal: prevent update anomalies, where the same piece of information is written in five rows and three of them drift out of sync.
The short version: an OLTP application database should be normalized (usually to 3NF); an OLAP analytics warehouse usually denormalizes it back for query speed. Understanding why is the whole point of this lesson.

The Problem: A Wide Table #

Imagine one orders table that stores everything:

sql

-- Denormalized — everything in one row
orders (
  order_id, order_date, status,
  user_id, user_email, user_name,
  product_id, product_name, product_category, product_price,
  quantity, line_total
)

If customer alice@example.com changes her name, you have to update every one of her order rows. Miss one, and two different names describe the same user. Same problem for product renames, category reorganizations, price changes. These are update anomalies, and normalization is the fix.

1NF / 2NF / 3NF #

Three levels cover 99% of practical design:

1NF: every column holds a single atomic value. No comma-separated lists, no JSON stuffed into a TEXT field, no repeating groups. One fact per cell.
2NF: every non-key column depends on the whole primary key. Matters mostly for composite keys: if (order_id, product_id) is the key and product_name depends only on product_id, product_name belongs in a separate products table.
3NF: no non-key column depends on another non-key column. If you store user_id and user_email in orders, then user_email is a property of the user, not the order, so move it to users.

Higher forms (BCNF, 4NF, 5NF) exist but rarely change real-world decisions.

The Normalized Shape #

sql

-- 3NF: each fact lives in one place
users    (id, email, name, created_at)
products (id, name, price, category, stock)
orders   (id, user_id, product_id, quantity, order_date, status, total)

Rename a user? One UPDATE on users. Reprice a product? One UPDATE on products. No drift possible: the fact only exists once. Reports rebuild the wide view on demand via joins.

A wide orders table repeats each user email and name on every row; normalizing to 3NF splits it into users, products, and orders tables so each fact is stored once, with orders linking to users and products by foreign key — Normalizing to 3NF: the repeated user and product facts move into their own tables, and orders references them by foreign key — so each fact is stored (and updated) in exactly one place.

OLTP Normalizes. OLAP Denormalizes. #

Here's the practical decision most engineers actually face:

Application databases (OLTP) optimize for correctness under many small writes. Normalize. Every write is cheap and consistent.
Analytics warehouses (OLAP) optimize for large read queries across billions of rows. Denormalize into star schemas: one wide fact table plus dimension lookups. Fewer joins = much faster aggregations.

The same fact exists in both worlds on purpose: the OLTP system is the source of truth, a pipeline flattens it into the warehouse. The analytics question "revenue by category" is one GROUP BY on a denormalized fact, not a 4-way join.

Info

Star schema design is covered in depth in the Modeling section. This Learn lesson's job is the decision: when you're designing a schema, ask "is this data being written by an app or read by analysts?" If written → normalize. If read-heavy aggregations → denormalize into facts and dimensions.

Try It #

The seeded schema is already normalized. Reconstruct a denormalized "order report" by joining the three tables, the shape a pipeline would materialize into a fact table.

Loading SQL editor...

Practice #

Build revenue by product category from the normalized tables. Include only paid orders. Output: category, revenue (SUM of total, rounded to 2 decimals). No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

Normalizing the analytics warehouse. A 5-table join on every dashboard hit burns far more compute than a pre-joined fact table. Analytics storage isn't application storage.
Denormalizing the application database. Copying user_email into orders turns every user edit into a multi-row UPDATE: update anomalies by design.
Treating 3NF as a rigid rule in analytics. Dimension tables in a star schema deliberately violate 3NF (category stored on product, not in a separate category table) because the read-speed win outweighs the duplication.
Storing comma-separated lists ("tag1,tag2,tag3") in one column. That's a 1NF violation. You'll regret it the first time someone asks "how many orders have tag X?"
Designing without asking who writes and who reads. "Normalize or denormalize?" is the wrong starting question. "Is this table in the write path or the read path?" is the right one.

Knowledge check #

3 questions

0 / 3 answered

1NF (First Normal Form) requires that each cell hold:
3NF removes:
For OLAP (analytical) warehouses, the typical default is:

Next Step #

Continue to Explain plans: reading the execution plan to see which operators are doing the work and where rewrites actually pay off.