Skip to content

Pipeline automation — execution & migration plan

Why this doc exists: the 2026-04-26 strategy doc defines the what (4-layer architecture) and why (don't replace Kelvin's eye, replace his hands). This doc defines the how — concrete phased rollout, Snowflake account changes, deliverables, exit criteria, and rollback paths. No Snowflake objects are touched until Phase 1 starts. Shadow mode runs alongside Kelvin's manual pipeline for at least one full month before cutover.

Scope

In scope: - Wrap Kelvin's existing 4-step monthly process (CTAS FAGG_MAID, SGP scrub, HEM_ build, COPY INTO Eyeota stage) inside a Snowflake stored procedure - Add row-count and partition-existence assertions; abort + alert on anomaly - One Snowflake TASK to schedule monthly run - PIPELINE_RUN_LOG + PIPELINE_ALERT_LOG tables for observability - FILTER_REGISTRY table to capture filter-version metadata - Streamlit dashboard reading PIPELINE_RUN_LOG for monthly insight ritual - Sandbox tables (FAGG_MAID_DEV, FAGG_MAID_${ISO3}_TEST) for experiments

Not in scope: - Migration to a different Snowflake account, region, or org. The existing account ujb07324 (org ciqcvik, region AWS_US_WEST_2) is already aligned with the SG primary contracting entity (Husky Technology Pte Limited). Marketplace provider profile GZTYZ2Y10J8 is bound to this account; recreating it elsewhere would invalidate the live Marketplace listings. No account-level migration. - Changes to upstream Databricks pipeline. The supplier writes intermediate prefixes (*_activeSegments, *_distinctActiveSegments, etc.) that Husky originally requested for testing; whether to keep or drop them is a separate conversation with the supplier. - Re-architecting the HEM-side asymmetry (no SGP scrub, no segment-blocklist filter). Surfaced as Open Question; not blocking automation rollout. - Replacing agg_maid_sgp_filter allowlist generation. Today Kelvin runs the allowlist refresh ad-hoc from AGG_MAID_SL; automation reuses the existing allowlist as-is.

Snowflake account changes (in-place, no account migration)

The migration is additive — every new object lives alongside Kelvin's existing setup. Nothing existing gets renamed, dropped, or repointed during rollout.

Layer New objects Touches existing?
Roles MARKETPLACE_PIPELINE_OPERATOR (least-privilege role for procedure execution) No — Kelvin keeps ACCOUNTADMIN. New role gets only the grants it needs.
Warehouses None new in Phase 1-3. Phase 4 may introduce PIPELINE_S (Small, auto-suspend 60s) if DEV proves too small for assertion overhead. No — DEV continues for manual; new warehouse is opt-in.
Schemas HUSKY_DATA.OPS for run logs, alerts, filter registry, pipeline scratch tables (separates pipeline-internal from PRIVATE production tables) No — PRIVATE and SHARE untouched.
Tables OPS.PIPELINE_RUN_LOG, OPS.PIPELINE_ALERT_LOG, OPS.FILTER_REGISTRY, PRIVATE.FAGG_MAID_DEV (sandbox) No — production tables untouched.
Procedures OPS.MONTHLY_EYEOTA_PIPELINE() — Kelvin's 4 steps wrapped + assertions + logging Reads PRIVATE.AGG_MAID, AGG_HEM, etc. (read-only); writes FAGG_MAID${YYYYMM}_SHADOW during shadow mode, FAGG_MAID${YYYYMM} only after cutover.
Tasks OPS.MONTHLY_EYEOTA_TASK — runs procedure on schedule, starts SUSPENDED None — does nothing until Kelvin manually RESUMEs it.
Stages None new. Existing @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER reused as-is. No.
Grants MARKETPLACE_PIPELINE_OPERATOR gets: SELECT on PRIVATE.AGG_MAID/AGG_HEM/HUSKY_ID_DB3/filter tables; CREATE/MODIFY on OPS.*; WRITE on Eyeota stage; CREATE TABLE on PRIVATE (only for materialised monthly snapshots). No grants to the SHARE schema. No existing grants modified.
Streamlit / Apps One Streamlit app OPS.PIPELINE_DASHBOARD reading PIPELINE_RUN_LOG. Hosted in PUBLIC schema since that's where existing Streamlit lives. No.

Cost-of-rollback: because every change is additive and the production path stays manual until cutover (Phase 4), reverting is DROP SCHEMA OPS CASCADE; DROP TABLE PRIVATE.FAGG_MAID_DEV; REVOKE ALL ON ROLE MARKETPLACE_PIPELINE_OPERATOR; DROP ROLE MARKETPLACE_PIPELINE_OPERATOR;. Kelvin's manual workflow is unaffected at every phase.

Phased rollout

Phase 0 — Pre-flight alignment (Week 0)

Owner: Benjamin Wong + Kelvin Chan (joint) Deliverables: - 30-min sync with Kelvin reviewing the 2026-04-26 strategy + this execution doc - Confirm Kelvin agrees with sandbox-vs-production naming (FAGG_MAID_DEV vs FAGG_MAID${YYYYMM}) - Confirm Slack vs email for assertion alerts - Confirm row-count assertion thresholds (default ±20% from trailing-3-month median; Kelvin can tighten) - Identify any in-flight Kelvin work (e.g., AGG_MAID_HD lineage) that should be paused or coordinated

Exit criteria: signed-off scope. No code yet.


Phase 1 — Build automation in shadow (Week 1)

Owner: Benjamin Wong (build) + Kelvin Chan (review) Deliverables: 1. HUSKY_DATA.OPS schema created 2. OPS.PIPELINE_RUN_LOG + OPS.PIPELINE_ALERT_LOG tables created 3. OPS.MONTHLY_EYEOTA_PIPELINE() stored procedure — implements all 4 steps verbatim from playbook + logging + assertions, but writes to FAGG_MAID${YYYYMM}_SHADOW table (not production target) 4. OPS.MONTHLY_EYEOTA_TASK created in SUSPENDED state 5. MARKETPLACE_PIPELINE_OPERATOR role + grants 6. Migration SQL committed to docs/internal/snowflake-pipeline-migration-2026-04-27.sql for repeatability

Exit criteria: - Procedure callable manually: CALL OPS.MONTHLY_EYEOTA_PIPELINE(); - Returns success on a re-run of an already-shipped month (e.g., 202601) - FAGG_MAID202601_SHADOW row count = FAGG_MAID202601 row count (within 0.1%) - PIPELINE_RUN_LOG populated with one entry

Rollback: DROP SCHEMA HUSKY_DATA.OPS CASCADE; DROP TABLE FAGG_MAID202601_SHADOW;


Phase 2 — Dashboard + alerts (Week 2)

Owner: Benjamin Wong Deliverables: 1. Streamlit OPS.PIPELINE_DASHBOARD showing: row count history, country×platform splits, segment catalog drift, scrub-ratio trends. One page, no auth (gated by Snowsight login). 2. Slack webhook (or SES email) trigger on any PIPELINE_ALERT_LOG insert. Cloudflare Worker reads the table via Snowflake REST and posts to #data-pipeline Slack channel. 3. Test alerts by deliberately failing one assertion (e.g., set threshold so tight that 202601 re-run trips it) — confirm Slack receives the alert.

Exit criteria: - Kelvin confirms dashboard surfaces info he'd want to see - Test alert delivered to Slack within 5 min of trigger

Rollback: un-deploy Streamlit + delete webhook. No Snowflake side-effects.


Phase 3 — Shadow run for the next live month (Week 3-4, full calendar month)

Owner: Kelvin Chan (manual, as today) + automation runs in parallel Deliverables: 1. Kelvin runs the manual playbook for 202703 as usual (early May 2026, real production ship) 2. Same day, run CALL OPS.MONTHLY_EYEOTA_PIPELINE(); which writes FAGG_MAID202703_SHADOW (NOT the target — Eyeota gets the manual FAGG_MAID202703) 3. Compare FAGG_MAID202703 (manual production) vs FAGG_MAID202703_SHADOW (automated) 4. If row counts match within 0.5%, distinct user counts within 0.5%, and segment_id distributions within 1% K-S divergence → automation parity confirmed 5. Document any deltas in docs/internal/data-quality/2026-05-shadow-comparison.md

Exit criteria: - Parity confirmed for 202703 - No assertion false-positives during the run - Dashboard correctly reflects the shadow run

Rollback: if parity fails, do NOT cut over. Investigate diff. Manual continues. Phase repeats next month with fixes.


Phase 4 — Cutover (Week 5)

Owner: Kelvin Chan (executes), Benjamin Wong (on standby) Deliverables: 1. Modify OPS.MONTHLY_EYEOTA_PIPELINE() to write directly to FAGG_MAID${YYYYMM} instead of the _SHADOW target. One-line change, version-tagged. 2. Schedule one final dry-run for the next month via manual CALL OPS.MONTHLY_EYEOTA_PIPELINE(); — this becomes the actual production for that month. 3. ALTER TASK OPS.MONTHLY_EYEOTA_TASK RESUME; to enable scheduled runs from the following month. 4. Update docs/playbooks/monthly-eyeota-upload.md to reflect: "automated; manual fallback path retained below."

Exit criteria: - One full month shipped via automation - Eyeota supplier earnings report (60-day lag) confirms revenue unchanged

Rollback: - If a TASK run fails: PIPELINE_ALERT_LOG fires, Slack notifies Kelvin, Kelvin runs the manual playbook for that month. Maximum revenue exposure: one month delayed by 1-2 days. - If multiple consecutive failures: ALTER TASK OPS.MONTHLY_EYEOTA_TASK SUSPEND; and revert to manual until root cause resolved.


Phase 5 — 90-day soak (Week 6+)

Owner: Kelvin Chan (monthly review) + automatic alerts Deliverables: 1. Three consecutive successful automated runs (months 1-3 post-cutover) 2. Monthly snapshot doc at docs/internal/data-quality/YYYY-MM-snapshot.md capturing what dashboard surfaced + any actions taken 3. Quarterly review: are assertion thresholds catching real issues, or only noise? Tune. 4. Backfill OPS.FILTER_REGISTRY from existing filter_* tables (~12-18 historical filters) — capture the why behind each.

Exit criteria: stable for 90 days; quarterly review approved.


Phase 6 — Layer 4 documentation hardening (ongoing)

Owner: Benjamin Wong Deliverables: 1. OPS.FILTER_REGISTRY becomes the canonical source for "why this seg_id is blocked" — every new filter_* table gets a registry row at creation 2. Snowflake naming convention doc updated to reference the registry pattern 3. Per-country filter convention (BD${ISO3}260423 → standardise to FILTER_BD_${ISO3}_${YYMMDD})

Exit criteria: new filter tables follow convention; old ones tagged with legacy_naming=TRUE in registry.

Risk register

Risk Likelihood Impact Mitigation
Shadow run produces different row counts than manual Medium High — blocks cutover Phase 3 explicit comparison + diff investigation. If supplier external table state differs between manual and automated runs (e.g. partition-refresh timing), document and reconcile.
TASK fails silently overnight Low (alerts catch it) High (missed Eyeota ship) PIPELINE_ALERT_LOG triggers Slack on any non-success status; Kelvin oncall during first 90 days
--strict build / metadata drift breaks dashboard Low Low (dashboard only) Streamlit failure doesn't affect data pipeline; Kelvin can still review via direct table query
Kelvin in-flight AGG_MAID_HD work conflicts with shadow procedure Medium Medium Phase 0 alignment surfaces this; coordinate the procedure to use the chosen lineage
Permissions error: MARKETPLACE_PIPELINE_OPERATOR lacks a needed grant during real run Medium High (failed ship) Phase 1 exit criterion is a successful re-run of a historical month — exposes any missing grant before cutover
Storage cost spike from _SHADOW tables Low Low Each shadow table is one month's worth (~2 TB); auto-drop tables older than 60 days via a separate cleanup task
Eyeota supplier earnings drop after cutover (data quality regression we didn't catch) Low Critical 60-day delayed earnings report is the final truth; if drops detected, revert to manual immediately and investigate

Stakeholder & RACI

Activity Benjamin Kelvin Husky board
Strategy approval A C I
Execution plan approval A C I
Phase 1-2 build R C
Phase 3 shadow run review C R I
Phase 4 cutover C R I
Phase 5+ monthly review I R
Rollback decision A R I

R = Responsible, A = Accountable, C = Consulted, I = Informed.

Open questions

  1. Shadow table cleanup cadence — auto-drop after 60 days? 90 days? Affects storage cost.
  2. Slack channel ownership — does #data-pipeline exist, or do we create it? Who's on the alert recipient list besides Kelvin?
  3. Failure escalation chain — if Kelvin doesn't ack a Slack alert within 4 hours, who's the second escalation? (Today there's no fallback.)
  4. Backfill scope for FILTER_REGISTRY — every historical filter_* table, or only ones still actively referenced by current month's CTAS?
  5. Coordination with AGG_MAID_HD lineage — should the automated procedure assume the new HD path, or maintain compatibility with both AGG_MAID and AGG_MAID_HD?
  6. Per-country sanity-cap rollout — the 10 BDXXX260423 country filter tables created 2026-04-23 hint at extending SGP-style scrub to other markets. Does the automated pipeline need to add per-country DELETE steps, or is that a separate sprint after cutover?

S3 cleanup (parallel workstream)

The 2026-04-26 data-plane review surfaced multiple S3 hygiene items. Consolidated here so cleanup happens once alongside the Snowflake migration, rather than piecemeal later.

Account: 969735114743 (alias huskydata, profile huskyai).

Cleanup catalog

ID Object Issue Action Priority Owner Reversible?
S3-1 ✅ DONE 2026-04-27 s3://raw.huskydata.com/Acxiom.csv (258.3 GB, us-east-1) PII data, never used since 2024-02-01 ingest. Filename embedded blocked supplier brand. Held PII without documented purpose. DELETED 2026-04-27 via aws s3 rm. Bucket has no versioning → permanent. 258.3 GB freed. GDPR Art. 5(1)(b) compliance recovered. HIGH (closed) Benjamin N/A — completed
S3-2 s3://raw.huskydata.com/Twitter Hit 50M.txt, FB/Hong Kong.txt, FB/USA.zip, linkedin/part-*.gz, vaccine_data.json/jsonl Supplier-platform-named files (Twitter, FB, linkedin) — same brand-leak concern, plus stale (last-modified pre-2024). Audit each; if unused → archive to non-branded prefix. If used → namespace under legacy/ and update consumers. MEDIUM Benjamin Yes
S3-3 s3://agg-data.huskyai.com/identity_db/ (empty per 2026-04-26 audit) External table HUSKY_DATA.PRIVATE.IDENTITY_DB claims to read here; prefix has zero objects. Either data lives elsewhere or load never completed. Investigate: query IDENTITY_DB returns rows? If yes → find real S3 location, fix external table pointer. If no → drop external table. MEDIUM Benjamin + Kelvin Yes
S3-4 s3://data.huskyai.com/scanbuy/hem_maid/ Supplier still drops files here monthly; no Snowflake external table consumes them. Orphan ingest. Confirm with Kelvin if needed; if no → ask supplier to stop delivering, lifecycle-rule the existing files to expire after 90 days. LOW Kelvin Yes
S3-5 s3://agg-data.huskyai.com/agg-hem_sizeLimitedActiveSegments/ (15.7 GB/month, written but no Snowflake consumer) Asymmetric vs MAID side: MAID's SL prefix surfaces as AGG_MAID_SL and feeds SGP allowlist; HEM's SL prefix has no Snowflake surface and no consumer. Either we should be using it, or supplier shouldn't be writing it. Decide: (a) wire HEM-side SGP scrub equivalent (uses this prefix) — see automation strategy Open Questions, OR (b) ask supplier to stop delivering. MEDIUM Benjamin + Kelvin Yes
S3-6 s3://agg-data.huskyai.com/agg-maid_latest_month_base/ + agg-hem_latest_month_base/ (Delta intermediates) External tables AGG_MAID_CUR (pinned date=202412) + AGG_HEM_CUR (pinned date=202308, 30 months stale) point at these. Zero query history in 365 days. Likely DBX-internal Delta checkpoints we surfaced once and forgot. Drop the two _CUR external tables. Keep S3 prefixes (DBX needs them for next month's run). LOW Kelvin Yes — just drop external table; data unchanged
S3-7 ✅ DONE 2026-05-12 s3://agg-data.huskyai.com/agg-hem_distinctActiveSegments/ → ~~AGG_HEM1~~ 180-day re-audit (2026-05-12) confirmed 0 production reads — earlier 3 q30d count was DDL-maintenance + GET_DDL audit noise. DROPPED 2026-05-12 via DROP EXTERNAL TABLE HUSKY_DATA.PRIVATE.AGG_HEM1. S3 prefix retained (supplier still writes; n8n sync rule downgrade deferred to other workstream per owner direction). LOW (closed) Benjamin N/A — completed
S3-8 s3://agg-data.huskyai.com/SegmentedCount/ (XLS exports year=2019..2025) Manual reporting artifacts — emailed XLS, not consumed by Snowflake. Pre-2024 likely dead. Lifecycle-rule pre-2024 partitions to Glacier Deep Archive (~$1/TB/month vs S3 Standard ~$23/TB/month). Keep latest 24 months in Standard for active reference. LOW Benjamin Reversible (Glacier restore)
S3-9 s3://agg-data.huskyai.com/monthly_segment_count/ (Databricks Delta, AlikeAudience strings + 25 DSP brand columns in schema) Brand contamination at column-name level. Currently NOT exposed via Snowflake — but if anyone ever creates an external table over this prefix, supplier brand leaks. Supplier-side fix. Open ticket with supplier (Databricks owner: tech@alikeaudience.com, jobId 638399474177810) to rename branded columns to neutral names (e.g. dsp_1, dsp_2, ...) and replace AlikeAudience: ... strings in path_new column. HIGH (supplier ticket) Benjamin (open ticket) N/A
S3-10 s3://agg-data.huskyai.com/agg-{maid,hem}/date=202505..202510/ (6+ month old production partitions) Standard-tier storage at ~2 TB/month × 6+ months × ~$23/TB = ~$280/month for partitions Kelvin's monthly job no longer reads (only the latest month is registered in AGG_MAID). S3 Lifecycle rule: transition to Intelligent-Tiering after 60 days; to Glacier Instant Retrieval after 180 days. Saves ~60-80% on storage cost for cold partitions while preserving query-on-demand. MEDIUM Benjamin Yes — rule reversible
S3-11 s3://huskyai.com/ (433+ Dremio tutorial PNG/JPG files, legacy) No active website hosted here; old static-mirror of Dremio docs. Hundreds of MB. Move to s3://huskyai-archive/legacy-dremio-tutorials/ and lifecycle to Glacier Deep Archive. Or just delete after confirming no inbound links. LOW Benjamin Yes (archive) / No (delete)
S3-12 s3://code.huskyai.com/sync_agg.sh + sync_daily.sh (4 KB total) Pipeline driver shell scripts in S3. If they're production, they should be in git, not S3. If they're dead, drop. Read both files, commit to repo at scripts/data-pipeline/ if active, archive if not. LOW Benjamin Yes

Sequencing — when in the rollout

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 0 (Week 0) — pre-flight                                        │
│   • S3-1 (HIGH): rename Acxiom.csv. 1-hour task, do this immediately │
│   • S3-9 (HIGH): open supplier ticket re: AlikeAudience contamination│
│                                                                      │
│ Phase 1-2 (Week 1-2) — automation build                              │
│   • S3-3 (MEDIUM): investigate identity_db emptiness                  │
│   • S3-2 (MEDIUM): audit raw.huskydata.com supplier-named files      │
│   • S3-5 (MEDIUM): decide HEM SGP scrub — yes/no — drives wire-up    │
│                                                                      │
│ Phase 3 (Week 3-4) — shadow run                                      │
│   • S3-10 (MEDIUM): apply lifecycle rule to agg-data old partitions  │
│   • S3-6, S3-7 (LOW): drop stale `AGG_HEM_CUR`, `AGG_MAID_CUR`,     │
│     `AGG_HEM1` external tables                                        │
│                                                                      │
│ Phase 5+ (Week 6+) — soak                                            │
│   • S3-4, S3-8, S3-11, S3-12 (LOW): final hygiene sweep              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Cost impact estimate

Item Current cost Post-cleanup Annual saving
S3-10 lifecycle on cold agg-data partitions ~$280/mo ~$60/mo ~$2,640
S3-8 SegmentedCount Glacier transition ~$100/mo (estimate) ~$10/mo ~$1,080
S3-11 huskyai.com archive ~$5/mo ~$0.5/mo ~$54
S3-1 Acxiom.csv (rename only, no cost change) $6/mo $6/mo $0
Total annual cost saved ~$3,800

Not big money — but ~$300/mo recurring savings + brand-leak risk eliminated + dashboard clarity restored. The real ROI is risk reduction, not cost.

What we are NOT doing in S3 cleanup

  • Not deleting any production prefix (agg-data.huskyai.com/agg-maid/ and agg-hem/ core prefixes are untouched)
  • Not changing supplier-side Databricks job behaviour without their explicit confirmation (S3-9 is a ticket, not a unilateral change)
  • Not touching vendor-managed buckets (dremio-*, databricks-workspace-stack-*, elasticbeanstalk-*) — out of scope
  • Not dropping the agg-{maid,hem}_activeSegments/, _distinctActiveSegments/, _sizeLimitedActiveSegments/ S3 prefixes themselves — supplier needs them for their next-month pipeline runs. Only the Snowflake external tables that point at stale partitions get dropped.

Provenance

Drafted 2026-04-27 as the execution layer beneath the 2026-04-26 pipeline automation strategy. Designed in the context of a manual-pipeline-with-incident-the-day-before (FAGG_MAID202602 HKG-only overwrite) and an 8-deploy-failure wiki backlog (root-caused to README/index.md collision in MkDocs strict mode). Both incidents informed the additive-only / shadow-first rollout philosophy. S3 cleanup section added 2026-04-27 to consolidate hygiene items surfaced during data-plane review.