Date Parsing
Dates arrive as strings more often than you'd like: "2024-03-15", "03/15/2024", "15-Mar-2024", "2024-03-15T14:22:11Z". To compare them, bucket them, or do arithmetic, you need real DATE or TIMESTAMP values.
The toolkit:
| Function | Purpose |
|---|---|
CAST(x AS DATE) / x::DATE | Parse ISO-format strings (YYYY-MM-DD) |
TRY_CAST(x AS DATE) | Same, but returns NULL on parse failure instead of erroring |
STRPTIME(x, format) | Parse a custom format (DuckDB / Postgres TO_DATE) |
DATE_TRUNC(unit, d) | Bucket to month / week / year |
EXTRACT(field FROM d) | Pull out year, month, dow, etc. |
ISO format (YYYY-MM-DD) is unambiguous and casts cleanly. Any other format needs STRPTIME.
Casting ISO Strings #
-- Clean ISO string → DATE
SELECT CAST('2024-03-15' AS DATE) AS d;
-- 2024-03-15
-- TRY_CAST returns NULL on failure instead of erroring
SELECT TRY_CAST('not a date' AS DATE) AS d;
-- NULLUse TRY_CAST when the input might be dirty. A single bad row won't abort the query; it'll quietly produce NULL so you can filter or fix it downstream.
STRPTIME: Custom Formats #
When the string isn't ISO, tell the database exactly how to read it with STRPTIME. The format specifier uses % codes, the same as strftime in Python or C.
-- US format
SELECT STRPTIME('03/15/2024', '%m/%d/%Y') AS ts;
-- 2024-03-15 00:00:00
-- Day-first European format
SELECT STRPTIME('15/03/2024', '%d/%m/%Y') AS ts;
-- 2024-03-15 00:00:00
-- ISO 8601 with timezone
SELECT STRPTIME('2024-03-15T14:22:11Z', '%Y-%m-%dT%H:%M:%SZ') AS ts;Common format codes: %Y (4-digit year), %m (month), %d (day), %H (hour 00–23), %M (minute), %S (second). STRPTIME returns a TIMESTAMP; cast to DATE if you don't need the time.
"03/04/2024" is ambiguous.
Is that March 4th (US) or April 3rd (most of the rest of the world)? SQL won't warn you. STRPTIME with the wrong format spec will silently produce a wrong-but-valid date. When working with source data, always confirm the intended format from the producer, not by guessing from the first row. A format like %Y-%m-%d (ISO) is your best friend because there's nothing to guess.
Bucketing with DATE_TRUNC #
Once the value is a real date, DATE_TRUNC buckets it to a calendar unit. That's how you build monthly revenue, weekly cohorts, yearly trends.
-- Monthly paid revenue
SELECT
DATE_TRUNC('month', order_date) AS order_month,
ROUND(SUM(total), 2) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY order_month
ORDER BY order_month;Valid units include 'year', 'quarter', 'month', 'week', 'day', 'hour'. Output is a date/timestamp, not a string, so downstream comparisons and sorts stay correct.
EXTRACT: Pulling Parts Out #
SELECT
order_date,
EXTRACT(YEAR FROM order_date) AS yr,
EXTRACT(MONTH FROM order_date) AS mo,
EXTRACT(DOW FROM order_date) AS day_of_week -- 0 = Sunday
FROM orders;Use EXTRACT when you need a numeric part. For example, "orders placed on weekends" filters with EXTRACT(DOW FROM order_date) IN (0, 6). For bucketing by calendar period, DATE_TRUNC is the better tool; for filtering or feature engineering, EXTRACT.
Try It #
A messy input of event dates: ISO strings plus one value that isn't a date at all. Parse the good rows to real DATE values, then bucket by month.
Practice #
Parse the raw_orders messy input and bucket by month. Return id, parsed_date (using TRY_CAST so bad rows become NULL), and month_bucket (DATE_TRUNC('month', ...)). Preserve all 5 rows. No ordering requirement.
Mistakes to Watch For #
- Using
CASTwhen the input might be dirty: one bad row aborts the query. UseTRY_CAST. - Guessing the format of ambiguous strings like
"03/04/2024". Check with the source before parsing, not after the dashboard is wrong. - Storing dates as strings "because it works." Comparisons and sorting on strings give wrong answers the moment format varies (
"9/1/2024"sorts before"10/1/2024"). - Using
SUBSTRINGto extract year/month from a date string is fragile.EXTRACTafter a parse is the right tool. - Forgetting timezone when the source is a
TIMESTAMP WITH TIME ZONE. Bucketing by day in UTC vs local time can shift entire rows between buckets. Be explicit about which timezone the report is in.
Knowledge check #
3 questions
WHERE EXTRACT(YEAR FROM order_date) = 2026defeats an index onorder_date. The index-friendly rewrite is:'04/05/2026'— is this April 5 or May 4?Storing event timestamps, the defensive default is:
Next Step #
Continue to Deduping. Once your strings and dates are clean, collapse the duplicate rows that inconsistent inputs created.