Normal Forms
Normalization is the practice of storing each fact in exactly one place. The goal: prevent update anomalies, where the same piece of information is written in five rows and three of them drift out of sync.
The short version: an OLTP application database should be normalized (usually to 3NF); an OLAP analytics warehouse usually denormalizes it back for query speed. Understanding why is the whole point of this lesson.
The Problem: A Wide Table #
Imagine one orders table that stores everything:
-- Denormalized — everything in one row
orders (
order_id, order_date, status,
user_id, user_email, user_name,
product_id, product_name, product_category, product_price,
quantity, line_total
)If customer alice@example.com changes her name, you have to update every one of her order rows. Miss one, and two different names describe the same user. Same problem for product renames, category reorganizations, price changes. These are update anomalies, and normalization is the fix.
1NF / 2NF / 3NF #
Three levels cover 99% of practical design:
- 1NF: every column holds a single atomic value. No comma-separated lists, no JSON stuffed into a TEXT field, no repeating groups. One fact per cell.
- 2NF: every non-key column depends on the whole primary key. Matters mostly for composite keys: if
(order_id, product_id)is the key andproduct_namedepends only onproduct_id,product_namebelongs in a separateproductstable. - 3NF: no non-key column depends on another non-key column. If you store
user_idanduser_emailinorders, thenuser_emailis a property of the user, not the order, so move it tousers.
Higher forms (BCNF, 4NF, 5NF) exist but rarely change real-world decisions.
The Normalized Shape #
-- 3NF: each fact lives in one place
users (id, email, name, created_at)
products (id, name, price, category, stock)
orders (id, user_id, product_id, quantity, order_date, status, total)Rename a user? One UPDATE on users. Reprice a product? One UPDATE on products. No drift possible: the fact only exists once. Reports rebuild the wide view on demand via joins.
OLTP Normalizes. OLAP Denormalizes. #
Here's the practical decision most engineers actually face:
- Application databases (OLTP) optimize for correctness under many small writes. Normalize. Every write is cheap and consistent.
- Analytics warehouses (OLAP) optimize for large read queries across billions of rows. Denormalize into star schemas: one wide fact table plus dimension lookups. Fewer joins = much faster aggregations.
The same fact exists in both worlds on purpose: the OLTP system is the source of truth, a pipeline flattens it into the warehouse. The analytics question "revenue by category" is one GROUP BY on a denormalized fact, not a 4-way join.
Star schema design is covered in depth in the Modeling section. This Learn lesson's job is the decision: when you're designing a schema, ask "is this data being written by an app or read by analysts?" If written → normalize. If read-heavy aggregations → denormalize into facts and dimensions.
Try It #
The seeded schema is already normalized. Reconstruct a denormalized "order report" by joining the three tables, the shape a pipeline would materialize into a fact table.
Practice #
Build revenue by product category from the normalized tables. Include only paid orders. Output: category, revenue (SUM of total, rounded to 2 decimals). No ordering requirement.
Mistakes to Watch For #
- Normalizing the analytics warehouse. A 5-table join on every dashboard hit burns far more compute than a pre-joined fact table. Analytics storage isn't application storage.
- Denormalizing the application database. Copying
user_emailintoordersturns every user edit into a multi-rowUPDATE: update anomalies by design. - Treating 3NF as a rigid rule in analytics. Dimension tables in a star schema deliberately violate 3NF (category stored on product, not in a separate category table) because the read-speed win outweighs the duplication.
- Storing comma-separated lists ("tag1,tag2,tag3") in one column. That's a 1NF violation. You'll regret it the first time someone asks "how many orders have tag X?"
- Designing without asking who writes and who reads. "Normalize or denormalize?" is the wrong starting question. "Is this table in the write path or the read path?" is the right one.
Knowledge check #
3 questions
1NF (First Normal Form) requires that each cell hold:
3NF removes:
For OLAP (analytical) warehouses, the typical default is:
Next Step #
Continue to Explain plans: reading the execution plan to see which operators are doing the work and where rewrites actually pay off.