Skip to content

Pipeline automation strategy — protect Kelvin's eye, automate Kelvin's hands

Goal: make the monthly Snowflake → Eyeota pipeline boring and reliable, without removing Kelvin's visibility into data-quality issues. Automate the mechanical 95%, ritualise the thoughtful 5%.

The dilemma

Today, every month Kelvin manually runs ~6 SQL statements to: 1. CTAS FAGG_MAID${YYYYMM} from external AGG_MAID 2. DELETE the SGP rows that fail the agg_maid_sgp_filter allowlist (~1.2-1.4B rows) 3. CREATE + INSERT HEM_${YYYYMM} from AGG_HEM joined to HUSKY_ID_DB3 4. Three COPY INTO @HUSKY_EYEOTA_UPLOADER statements (Android / iOS / HEM) 5. Verify stage file counts via LIST @...

This is reliable in practice (revenue-bearing for years) but fragile in design:

Cost of full manual (status quo): - Bus factor of 1 — if Kelvin is sick on the 4th-7th, ~US$69K/month Eyeota revenue is at risk - Typos clobber production (2026-04-26 incident: Kelvin's HKG-only test ran CREATE OR REPLACE TABLE FAGG_MAID202602 instead of a sandbox name → 4.09B rows wiped to 33.6M) - Filter design lives only in filter_list_* table names; the why stays in Kelvin's head - Audit trail is only SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY

Cost of full automation (naïve Snowflake TASK): - Kelvin loses the touch-point that makes him notice "wait, why is HK MAID count 200x population?" - Silent failure modes — TASK runs, COPY succeeds, but supplier data was bad → bad data shipped to Eyeota - Filter improvements (today's filter_260424_misc, the 10 BDXXX260423 country filters from 2026-04-23) stop happening

The tension is real. Kelvin's manual work is doing two things at once: routine + insight. Automation conflates them and removes both. The fix is to separate them.

Proposed structure — 4 layers

Layer 1 — Automate the boring pipeline (with assertions)

Wrap the 6-statement runbook as a Snowflake stored procedure, schedule via TASK on the 6th of each month at 06:00 UTC. Critical: fail loudly when assertions break, do not ship bad data.

CREATE OR REPLACE PROCEDURE HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_PIPELINE()
RETURNS VARIANT
LANGUAGE SQL
AS
$$
DECLARE
  yyyymm  VARCHAR := TO_CHAR(DATEADD(MONTH, -1, CURRENT_DATE), 'YYYYMM');
  rows_actual NUMBER;
  rows_baseline_p25 NUMBER;
BEGIN
  -- Step 1: CTAS FAGG_MAID${YYYYMM}
  EXECUTE IMMEDIATE 'CREATE OR REPLACE TABLE FAGG_MAID' || :yyyymm || ' AS ...';

  -- Step 2: SGP scrub
  EXECUTE IMMEDIATE 'DELETE FROM FAGG_MAID' || :yyyymm || ' WHERE country_part=''SGP'' AND ...';

  -- ASSERTION: row count vs trailing-3-month p25 (catch >20% drop)
  SELECT COUNT(*) INTO :rows_actual FROM IDENTIFIER('FAGG_MAID' || :yyyymm);
  SELECT MIN(rows) INTO :rows_baseline_p25 FROM PIPELINE_RUN_LOG
    WHERE table_name LIKE 'FAGG_MAID%' AND yyyymm IN (last 3 months);
  IF (rows_actual < rows_baseline_p25 * 0.8) THEN
    INSERT INTO PIPELINE_ALERT_LOG VALUES (CURRENT_TIMESTAMP(), 'FAGG_MAID', :rows_actual, :rows_baseline_p25, 'ABORT');
    RETURN OBJECT_CONSTRUCT('status', 'ABORTED', 'reason', 'row count below 80% of trailing-3-month minimum');
  END IF;

  -- Step 3-5: HEM, COPY INTOs (with similar assertions)
  ...

  -- Step 6: log success
  INSERT INTO PIPELINE_RUN_LOG VALUES (...);
  RETURN OBJECT_CONSTRUCT('status', 'SUCCESS', 'yyyymm', :yyyymm, 'rows', :rows_actual);
END;
$$;

CREATE OR REPLACE TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_TASK
  WAREHOUSE = DEV
  SCHEDULE  = 'USING CRON 0 6 6 * * UTC'
AS CALL HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_PIPELINE();

Critically, the ABORT path notifies Kelvin (Slack webhook from a follow-up TASK that watches PIPELINE_ALERT_LOG). Monthly run is silent on success, loud on anomaly.

Layer 2 — Insight ritual (Kelvin reviews, doesn't run)

Auto-generated monthly data-quality dashboard (Streamlit on the existing HUSKY_DATA.PUBLIC schema, since DataCopilot / SNOWCHAT stages are already there). Refreshes after each pipeline run.

What the dashboard surfaces: - Per-(country × platform) row count this month vs trailing 6 months — line chart - SGP / HKG / IDN scrub ratios over time — flag deviations >30% - Segment catalog drift: new seg_ids appearing, existing seg_ids disappearing - filter_* table version diff (which seg_ids removed/added, by whom, when) - Eyeota stage file count + total bytes vs prior — flag missing partitions - Snowflake credit consumption per pipeline run

Kelvin's job changes from "run SQL" to "30-60 min/month of dashboard review + write a snapshot note". The snapshot lives at docs/internal/data-quality/YYYY-MM-snapshot.md — explicit capture of what he saw and what he did about it.

Layer 3 — Sandbox isolation

Today's HKG-only incident traces to one root cause: Kelvin's experiments share namespace with production. Fix:

Use case Object Notes
Production CTAS HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} Only TASK / approved manual runs touch these
Per-country experiments HUSKY_DATA.PRIVATE.FAGG_MAID_${COUNTRY}_TEST Free-for-all sandbox
Filter version testing HUSKY_DATA.PRIVATE.FAGG_MAID_DEV Replaceable scratch table
Promote-to-prod Explicit RENAME or view-switch step Documented in runbook, never CREATE OR REPLACE on production target

Naming convention extension to docs/reference/snowflake-naming-convention.md: any object containing _TEST or _DEV is sandbox; never wired to a SHARE.

Layer 4 — Document learnings

Filter tables get a _meta companion table (or row-level metadata):

CREATE TABLE HUSKY_DATA.PRIVATE.FILTER_REGISTRY (
  filter_name      VARCHAR,
  created_at       TIMESTAMP_NTZ,
  created_by       VARCHAR,
  reason           VARCHAR,
  supersedes       VARCHAR,
  applies_to       VARCHAR,  -- 'USA', 'NON_USA', 'SGP_ONLY', etc.
  active           BOOLEAN
);

Every new filter Kelvin builds gets a row. Future review: query FILTER_REGISTRY to see the changelog of why each filter exists. Solves the "Kelvin's brain is the documentation" problem.

Each monthly ritual (Layer 2) writes a snapshot:

docs/internal/data-quality/2026-04-snapshot.md
  ## Anomalies seen this month
  - HK MAID count up 32% vs trailing-3-mo mean. Investigated supplier; created BDHKG260423 boundary filter.
  - SGP scrub ratio steady at 99.4%.
  ## Filter changes
  - Added: filter_260424_misc (replaces filter_list_250418 for non-USA)
  ## Open follow-ups
  - Verify whether AGG_MAID_HD lineage replaces AGG_MAID by 202604 (Kelvin's TODO)

Sprint plan

Sprint 1 (week 1)
─────────────────
• Create PIPELINE_RUN_LOG + PIPELINE_ALERT_LOG tables
• Wrap Kelvin's existing SQL into MONTHLY_EYEOTA_PIPELINE stored proc
  (NO behavioural change — same statements, just logging added)
• Run procedure manually for 202602 to verify parity with hand-run

Sprint 2 (week 2-3)
───────────────────
• Streamlit dashboard reading PIPELINE_RUN_LOG: row count trends, country/platform
  splits, segment catalog drift
• Slack/email webhook on ABORT or assertion failure
• Schedule MONTHLY_EYEOTA_TASK in SUSPEND state — Kelvin manually RESUME
  after one successful dry-run

Sprint 3 (week 4)
─────────────────
• Sandbox tables (FAGG_MAID_DEV, FAGG_MAID_${COUNTRY}_TEST)
• Promote-to-prod runbook in docs/playbooks/
• Update naming convention doc

Sprint 4 (ongoing, monthly cadence)
───────────────────────────────────
• First monthly snapshot doc (2026-05 cycle)
• FILTER_REGISTRY backfill from existing filter_* tables
• Quarterly review: are assertions catching real issues, or only noise?

What we're NOT doing

  • Not moving the pipeline off Snowflake. Snowflake TASK + stored proc is sufficient — no Airflow, no dbt, no third-party orchestrator.
  • Not rebuilding Kelvin's filter logic. The filter_list_*, BDXXX260423, agg_maid_sgp_filter tables stay as-is — they are domain knowledge encoded as data, not technical debt.
  • Not automating the supplier interaction. Reaching out to the supplier about column-level data quality (e.g. AlikeAudience-tagged columns in monthly_segment_count, the orphaned agg-hem_sizeLimitedActiveSegments prefix) remains a Kelvin-and-Benjamin job.
  • Not activating the TASK on day one. Suspend by default; resume after Kelvin signs off on a dry-run.

Open questions

  1. Does Kelvin want Slack or email for alerts? Affects the assertion-violation path in Sprint 2.
  2. Should we expose the dashboard to Benjamin / non-Kelvin viewers? If yes, gate via Snowflake's row-level / object-level grants on PUBLIC.MONTHLY_PIPELINE_DASHBOARD.
  3. What's the cost ceiling for the new TASK? DEV warehouse is X-Small; one full pipeline takes ~30 min. Budget should be in line with current ad-hoc DEV usage.
  4. Do we backfill FILTER_REGISTRY for the historical filter tables (12+ months of accumulated filters), or only forward-going?

Provenance

Drafted 2026-04-26 after data-plane review session that surfaced (a) absent automation across Snowflake (zero TASKs / Streams), (b) production-clobbering incident on FAGG_MAID202602 same morning, (c) 95/5 split between Kelvin's mechanical and thoughtful work.