Pipeline automation strategy — protect Kelvin's eye, automate Kelvin's hands¶
Goal: make the monthly Snowflake → Eyeota pipeline boring and reliable, without removing Kelvin's visibility into data-quality issues. Automate the mechanical 95%, ritualise the thoughtful 5%.
The dilemma¶
Today, every month Kelvin manually runs ~6 SQL statements to:
1. CTAS FAGG_MAID${YYYYMM} from external AGG_MAID
2. DELETE the SGP rows that fail the agg_maid_sgp_filter allowlist (~1.2-1.4B rows)
3. CREATE + INSERT HEM_${YYYYMM} from AGG_HEM joined to HUSKY_ID_DB3
4. Three COPY INTO @HUSKY_EYEOTA_UPLOADER statements (Android / iOS / HEM)
5. Verify stage file counts via LIST @...
This is reliable in practice (revenue-bearing for years) but fragile in design:
Cost of full manual (status quo):
- Bus factor of 1 — if Kelvin is sick on the 4th-7th, ~US$69K/month Eyeota revenue is at risk
- Typos clobber production (2026-04-26 incident: Kelvin's HKG-only test ran CREATE OR REPLACE TABLE FAGG_MAID202602 instead of a sandbox name → 4.09B rows wiped to 33.6M)
- Filter design lives only in filter_list_* table names; the why stays in Kelvin's head
- Audit trail is only SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
Cost of full automation (naïve Snowflake TASK):
- Kelvin loses the touch-point that makes him notice "wait, why is HK MAID count 200x population?"
- Silent failure modes — TASK runs, COPY succeeds, but supplier data was bad → bad data shipped to Eyeota
- Filter improvements (today's filter_260424_misc, the 10 BDXXX260423 country filters from 2026-04-23) stop happening
The tension is real. Kelvin's manual work is doing two things at once: routine + insight. Automation conflates them and removes both. The fix is to separate them.
Proposed structure — 4 layers¶
Layer 1 — Automate the boring pipeline (with assertions)¶
Wrap the 6-statement runbook as a Snowflake stored procedure, schedule via TASK on the 6th of each month at 06:00 UTC. Critical: fail loudly when assertions break, do not ship bad data.
CREATE OR REPLACE PROCEDURE HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_PIPELINE()
RETURNS VARIANT
LANGUAGE SQL
AS
$$
DECLARE
yyyymm VARCHAR := TO_CHAR(DATEADD(MONTH, -1, CURRENT_DATE), 'YYYYMM');
rows_actual NUMBER;
rows_baseline_p25 NUMBER;
BEGIN
-- Step 1: CTAS FAGG_MAID${YYYYMM}
EXECUTE IMMEDIATE 'CREATE OR REPLACE TABLE FAGG_MAID' || :yyyymm || ' AS ...';
-- Step 2: SGP scrub
EXECUTE IMMEDIATE 'DELETE FROM FAGG_MAID' || :yyyymm || ' WHERE country_part=''SGP'' AND ...';
-- ASSERTION: row count vs trailing-3-month p25 (catch >20% drop)
SELECT COUNT(*) INTO :rows_actual FROM IDENTIFIER('FAGG_MAID' || :yyyymm);
SELECT MIN(rows) INTO :rows_baseline_p25 FROM PIPELINE_RUN_LOG
WHERE table_name LIKE 'FAGG_MAID%' AND yyyymm IN (last 3 months);
IF (rows_actual < rows_baseline_p25 * 0.8) THEN
INSERT INTO PIPELINE_ALERT_LOG VALUES (CURRENT_TIMESTAMP(), 'FAGG_MAID', :rows_actual, :rows_baseline_p25, 'ABORT');
RETURN OBJECT_CONSTRUCT('status', 'ABORTED', 'reason', 'row count below 80% of trailing-3-month minimum');
END IF;
-- Step 3-5: HEM, COPY INTOs (with similar assertions)
...
-- Step 6: log success
INSERT INTO PIPELINE_RUN_LOG VALUES (...);
RETURN OBJECT_CONSTRUCT('status', 'SUCCESS', 'yyyymm', :yyyymm, 'rows', :rows_actual);
END;
$$;
CREATE OR REPLACE TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_TASK
WAREHOUSE = DEV
SCHEDULE = 'USING CRON 0 6 6 * * UTC'
AS CALL HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_PIPELINE();
Critically, the ABORT path notifies Kelvin (Slack webhook from a follow-up TASK that watches PIPELINE_ALERT_LOG). Monthly run is silent on success, loud on anomaly.
Layer 2 — Insight ritual (Kelvin reviews, doesn't run)¶
Auto-generated monthly data-quality dashboard (Streamlit on the existing HUSKY_DATA.PUBLIC schema, since DataCopilot / SNOWCHAT stages are already there). Refreshes after each pipeline run.
What the dashboard surfaces:
- Per-(country × platform) row count this month vs trailing 6 months — line chart
- SGP / HKG / IDN scrub ratios over time — flag deviations >30%
- Segment catalog drift: new seg_ids appearing, existing seg_ids disappearing
- filter_* table version diff (which seg_ids removed/added, by whom, when)
- Eyeota stage file count + total bytes vs prior — flag missing partitions
- Snowflake credit consumption per pipeline run
Kelvin's job changes from "run SQL" to "30-60 min/month of dashboard review + write a snapshot note". The snapshot lives at docs/internal/data-quality/YYYY-MM-snapshot.md — explicit capture of what he saw and what he did about it.
Layer 3 — Sandbox isolation¶
Today's HKG-only incident traces to one root cause: Kelvin's experiments share namespace with production. Fix:
| Use case | Object | Notes |
|---|---|---|
| Production CTAS | HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} |
Only TASK / approved manual runs touch these |
| Per-country experiments | HUSKY_DATA.PRIVATE.FAGG_MAID_${COUNTRY}_TEST |
Free-for-all sandbox |
| Filter version testing | HUSKY_DATA.PRIVATE.FAGG_MAID_DEV |
Replaceable scratch table |
| Promote-to-prod | Explicit RENAME or view-switch step |
Documented in runbook, never CREATE OR REPLACE on production target |
Naming convention extension to docs/reference/snowflake-naming-convention.md: any object containing _TEST or _DEV is sandbox; never wired to a SHARE.
Layer 4 — Document learnings¶
Filter tables get a _meta companion table (or row-level metadata):
CREATE TABLE HUSKY_DATA.PRIVATE.FILTER_REGISTRY (
filter_name VARCHAR,
created_at TIMESTAMP_NTZ,
created_by VARCHAR,
reason VARCHAR,
supersedes VARCHAR,
applies_to VARCHAR, -- 'USA', 'NON_USA', 'SGP_ONLY', etc.
active BOOLEAN
);
Every new filter Kelvin builds gets a row. Future review: query FILTER_REGISTRY to see the changelog of why each filter exists. Solves the "Kelvin's brain is the documentation" problem.
Each monthly ritual (Layer 2) writes a snapshot:
docs/internal/data-quality/2026-04-snapshot.md
## Anomalies seen this month
- HK MAID count up 32% vs trailing-3-mo mean. Investigated supplier; created BDHKG260423 boundary filter.
- SGP scrub ratio steady at 99.4%.
## Filter changes
- Added: filter_260424_misc (replaces filter_list_250418 for non-USA)
## Open follow-ups
- Verify whether AGG_MAID_HD lineage replaces AGG_MAID by 202604 (Kelvin's TODO)
Sprint plan¶
Sprint 1 (week 1)
─────────────────
• Create PIPELINE_RUN_LOG + PIPELINE_ALERT_LOG tables
• Wrap Kelvin's existing SQL into MONTHLY_EYEOTA_PIPELINE stored proc
(NO behavioural change — same statements, just logging added)
• Run procedure manually for 202602 to verify parity with hand-run
Sprint 2 (week 2-3)
───────────────────
• Streamlit dashboard reading PIPELINE_RUN_LOG: row count trends, country/platform
splits, segment catalog drift
• Slack/email webhook on ABORT or assertion failure
• Schedule MONTHLY_EYEOTA_TASK in SUSPEND state — Kelvin manually RESUME
after one successful dry-run
Sprint 3 (week 4)
─────────────────
• Sandbox tables (FAGG_MAID_DEV, FAGG_MAID_${COUNTRY}_TEST)
• Promote-to-prod runbook in docs/playbooks/
• Update naming convention doc
Sprint 4 (ongoing, monthly cadence)
───────────────────────────────────
• First monthly snapshot doc (2026-05 cycle)
• FILTER_REGISTRY backfill from existing filter_* tables
• Quarterly review: are assertions catching real issues, or only noise?
What we're NOT doing¶
- Not moving the pipeline off Snowflake. Snowflake TASK + stored proc is sufficient — no Airflow, no dbt, no third-party orchestrator.
- Not rebuilding Kelvin's filter logic. The
filter_list_*,BDXXX260423,agg_maid_sgp_filtertables stay as-is — they are domain knowledge encoded as data, not technical debt. - Not automating the supplier interaction. Reaching out to the supplier about column-level data quality (e.g. AlikeAudience-tagged columns in
monthly_segment_count, the orphanedagg-hem_sizeLimitedActiveSegmentsprefix) remains a Kelvin-and-Benjamin job. - Not activating the TASK on day one. Suspend by default; resume after Kelvin signs off on a dry-run.
Open questions¶
- Does Kelvin want Slack or email for alerts? Affects the assertion-violation path in Sprint 2.
- Should we expose the dashboard to Benjamin / non-Kelvin viewers? If yes, gate via Snowflake's row-level / object-level grants on
PUBLIC.MONTHLY_PIPELINE_DASHBOARD. - What's the cost ceiling for the new TASK? DEV warehouse is X-Small; one full pipeline takes ~30 min. Budget should be in line with current ad-hoc DEV usage.
- Do we backfill
FILTER_REGISTRYfor the historical filter tables (12+ months of accumulated filters), or only forward-going?
Provenance¶
Drafted 2026-04-26 after data-plane review session that surfaced (a) absent automation across Snowflake (zero TASKs / Streams), (b) production-clobbering incident on FAGG_MAID202602 same morning, (c) 95/5 split between Kelvin's mechanical and thoughtful work.