Lesson Intermediate

Date Parsing

Dates arrive as strings more often than you'd like: "2024-03-15", "03/15/2024", "15-Mar-2024", "2024-03-15T14:22:11Z". To compare them, bucket them, or do arithmetic, you need real DATE or TIMESTAMP values.
The toolkit:

FunctionPurpose
CAST(x AS DATE) / x::DATEParse ISO-format strings (YYYY-MM-DD)
TRY_CAST(x AS DATE)Same, but returns NULL on parse failure instead of erroring
STRPTIME(x, format)Parse a custom format (DuckDB / Postgres TO_DATE)
DATE_TRUNC(unit, d)Bucket to month / week / year
EXTRACT(field FROM d)Pull out year, month, dow, etc.

ISO format (YYYY-MM-DD) is unambiguous and casts cleanly. Any other format needs STRPTIME.

Casting ISO Strings #

sql
-- Clean ISO string → DATE
SELECT CAST('2024-03-15' AS DATE) AS d;
-- 2024-03-15

-- TRY_CAST returns NULL on failure instead of erroring
SELECT TRY_CAST('not a date' AS DATE) AS d;
-- NULL

Use TRY_CAST when the input might be dirty. A single bad row won't abort the query; it'll quietly produce NULL so you can filter or fix it downstream.

STRPTIME: Custom Formats #

When the string isn't ISO, tell the database exactly how to read it with STRPTIME. The format specifier uses % codes, the same as strftime in Python or C.

sql
-- US format
SELECT STRPTIME('03/15/2024', '%m/%d/%Y') AS ts;
-- 2024-03-15 00:00:00

-- Day-first European format
SELECT STRPTIME('15/03/2024', '%d/%m/%Y') AS ts;
-- 2024-03-15 00:00:00

-- ISO 8601 with timezone
SELECT STRPTIME('2024-03-15T14:22:11Z', '%Y-%m-%dT%H:%M:%SZ') AS ts;

Common format codes: %Y (4-digit year), %m (month), %d (day), %H (hour 00–23), %M (minute), %S (second). STRPTIME returns a TIMESTAMP; cast to DATE if you don't need the time.

Warning

"03/04/2024" is ambiguous.
Is that March 4th (US) or April 3rd (most of the rest of the world)? SQL won't warn you. STRPTIME with the wrong format spec will silently produce a wrong-but-valid date. When working with source data, always confirm the intended format from the producer, not by guessing from the first row. A format like %Y-%m-%d (ISO) is your best friend because there's nothing to guess.

Bucketing with DATE_TRUNC #

Once the value is a real date, DATE_TRUNC buckets it to a calendar unit. That's how you build monthly revenue, weekly cohorts, yearly trends.

sql
-- Monthly paid revenue
SELECT
  DATE_TRUNC('month', order_date) AS order_month,
  ROUND(SUM(total), 2) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY order_month
ORDER BY order_month;

Valid units include 'year', 'quarter', 'month', 'week', 'day', 'hour'. Output is a date/timestamp, not a string, so downstream comparisons and sorts stay correct.

EXTRACT: Pulling Parts Out #

sql
SELECT
  order_date,
  EXTRACT(YEAR  FROM order_date) AS yr,
  EXTRACT(MONTH FROM order_date) AS mo,
  EXTRACT(DOW   FROM order_date) AS day_of_week   -- 0 = Sunday
FROM orders;

Use EXTRACT when you need a numeric part. For example, "orders placed on weekends" filters with EXTRACT(DOW FROM order_date) IN (0, 6). For bucketing by calendar period, DATE_TRUNC is the better tool; for filtering or feature engineering, EXTRACT.

Try It #

A messy input of event dates: ISO strings plus one value that isn't a date at all. Parse the good rows to real DATE values, then bucket by month.

Loading SQL editor...

Practice #

Parse the raw_orders messy input and bucket by month. Return id, parsed_date (using TRY_CAST so bad rows become NULL), and month_bucket (DATE_TRUNC('month', ...)). Preserve all 5 rows. No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

  • Using CAST when the input might be dirty: one bad row aborts the query. Use TRY_CAST.
  • Guessing the format of ambiguous strings like "03/04/2024". Check with the source before parsing, not after the dashboard is wrong.
  • Storing dates as strings "because it works." Comparisons and sorting on strings give wrong answers the moment format varies ("9/1/2024" sorts before "10/1/2024").
  • Using SUBSTRING to extract year/month from a date string is fragile. EXTRACT after a parse is the right tool.
  • Forgetting timezone when the source is a TIMESTAMP WITH TIME ZONE. Bucketing by day in UTC vs local time can shift entire rows between buckets. Be explicit about which timezone the report is in.

Knowledge check #

3 questions

0 / 3 answered
  1. WHERE EXTRACT(YEAR FROM order_date) = 2026 defeats an index on order_date. The index-friendly rewrite is:

  2. '04/05/2026' — is this April 5 or May 4?

  3. Storing event timestamps, the defensive default is:

Next Step #

Continue to Deduping. Once your strings and dates are clean, collapse the duplicate rows that inconsistent inputs created.