Lesson Intermediate

String Cleanup

Real source data is messy. " Alice ", "alice", and "Alice" are three different strings to a database, but the same person to a human. String cleanup is the defensive layer that makes your joins, filters, and groupings behave consistently.
The core toolkit is small:

FunctionPurpose
TRIM(x)Strip leading/trailing whitespace
LOWER(x) / UPPER(x)Normalize case
REPLACE(x, old, new)Literal replacement
REGEXP_REPLACE(x, pattern, repl)Pattern-based replacement
SPLIT_PART(x, delim, n)Grab the n-th piece of a delimited string
SUBSTRING(x, start, len)Extract by position

You compose these. TRIM(LOWER(x)) is the most common normalization; everything else is variations on that theme.

The Canonical Normalization #

sql
-- Normalize an email: lower-case and strip surrounding whitespace
SELECT
  id,
  email AS raw_email,
  TRIM(LOWER(email)) AS normalized_email
FROM users;

Collapse internal whitespace too when the source is user-entered text:

sql
-- Strip surrounding whitespace AND collapse multi-space runs to single spaces
SELECT
  id,
  name AS raw_name,
  TRIM(REGEXP_REPLACE(name, '\s+', ' ', 'g')) AS cleaned_name
FROM users;
Warning

Whitespace is the silent killer of joins.
WHERE status = 'paid' returns zero rows when status is actually 'paid ' (with a trailing space). The query runs fine, returns nothing, and looks like your data has no paid orders. If joins or filters are returning fewer rows than you expect, wrap both sides in TRIM(LOWER(...)) and re-check. At ingestion time, normalize once and store the clean column; doing it on every query is both slow and error-prone.

Literal vs Pattern Replacement #

REPLACE swaps a literal substring. REGEXP_REPLACE matches a pattern. Use the simpler one.

sql
-- Strip all dashes from a phone number
SELECT REPLACE('555-123-4567', '-', '') AS phone;
-- '5551234567'

-- Strip anything that isn't a digit (regex)
SELECT REGEXP_REPLACE('+1 (555) 123-4567', '[^0-9]', '', 'g') AS phone;
-- '15551234567'

Splitting and Extracting #

sql
-- Get the domain part of an email
SELECT
  email,
  SPLIT_PART(email, '@', 2) AS domain,
  SPLIT_PART(email, '@', 1) AS local_part
FROM users;

SPLIT_PART(x, delim, n) returns the n-th segment (1-indexed). If the delimiter doesn't appear, segment 1 is the whole original string and every later segment is ''. So SPLIT_PART(email, '@', 2) on a value with no @ returns an empty string, not the original. Check for that case before assuming the split worked.

NULL vs Empty String #

These are different and you should treat them as such.

  • NULL means unknown or absent.
  • '' (empty string) means present, but blank.

TRIM(x) = '' is not the same check as x IS NULL. Use NULLIF(TRIM(x), '') to collapse both into a single canonical "missing" representation when you don't want to distinguish them:

sql
-- Treat blank strings as NULL for consistent downstream handling
SELECT
  id,
  NULLIF(TRIM(name), '') AS name
FROM users;

Try It #

A messy raw_customers input: mixed case, stray whitespace, the odd blank value. Clean it up so duplicate names collapse and joinable keys become stable.

Loading SQL editor...

Practice #

Clean up the raw_users messy input so duplicates collapse. Return id, cleaned_name (lowercased, trimmed, internal whitespace collapsed to single spaces, blank → NULL), and cleaned_email (lowercased, trimmed). Preserve all 5 rows. No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

  • Filtering on col = 'value' when col has trailing whitespace gives silent zero-row results. Normalize on ingest, or TRIM both sides explicitly.
  • Treating NULL and '' as interchangeable: they aggregate and compare differently. Use NULLIF(TRIM(x), '') when you want to unify them.
  • Reaching for regex when REPLACE would do. Literal replacement is faster and clearer; regex belongs to real patterns.
  • Assuming LOWER is locale-independent. It isn't, in some engines (the Turkish "I" problem). Usually irrelevant, but worth knowing exists.
  • Normalizing on every query instead of at ingestion. Correctness is the same, but you pay the cost every time and risk one forgotten query corrupting a downstream metric.

Knowledge check #

3 questions

0 / 3 answered
  1. TRIM(s) removes:

  2. REPLACE(s, 'old', 'new') replaces:

  3. Where is the best place to normalize messy strings (trim, lowercase, fix accents) for a column queried often?

Next Step #

Continue to Date parsing: turning messy string dates into real DATE / TIMESTAMP values you can actually compute with.