Pipeline automation — execution & migration plan¶

Why this doc exists: the 2026-04-26 strategy doc defines the what (4-layer architecture) and why (don't replace Kelvin's eye, replace his hands). This doc defines the how — concrete phased rollout, Snowflake account changes, deliverables, exit criteria, and rollback paths. No Snowflake objects are touched until Phase 1 starts. Shadow mode runs alongside Kelvin's manual pipeline for at least one full month before cutover.

Scope¶

In scope: - Wrap Kelvin's existing 4-step monthly process (CTAS FAGG_MAID, SGP scrub, HEM_ build, COPY INTO Eyeota stage) inside a Snowflake stored procedure - Add row-count and partition-existence assertions; abort + alert on anomaly - One Snowflake TASK to schedule monthly run - PIPELINE_RUN_LOG + PIPELINE_ALERT_LOG tables for observability - FILTER_REGISTRY table to capture filter-version metadata - Streamlit dashboard reading PIPELINE_RUN_LOG for monthly insight ritual - Sandbox tables (FAGG_MAID_DEV, FAGG_MAID_${ISO3}_TEST) for experiments

Not in scope: - Migration to a different Snowflake account, region, or org. The existing account ujb07324 (org ciqcvik, region AWS_US_WEST_2) is already aligned with the SG primary contracting entity (Husky Technology Pte Limited). Marketplace provider profile GZTYZ2Y10J8 is bound to this account; recreating it elsewhere would invalidate the live Marketplace listings. No account-level migration. - Changes to upstream Databricks pipeline. The supplier writes intermediate prefixes (*_activeSegments, *_distinctActiveSegments, etc.) that Husky originally requested for testing; whether to keep or drop them is a separate conversation with the supplier. - Re-architecting the HEM-side asymmetry (no SGP scrub, no segment-blocklist filter). Surfaced as Open Question; not blocking automation rollout. - Replacing agg_maid_sgp_filter allowlist generation. Today Kelvin runs the allowlist refresh ad-hoc from AGG_MAID_SL; automation reuses the existing allowlist as-is.

Snowflake account changes (in-place, no account migration)¶

The migration is additive — every new object lives alongside Kelvin's existing setup. Nothing existing gets renamed, dropped, or repointed during rollout.

Layer	New objects	Touches existing?
Roles	`MARKETPLACE_PIPELINE_OPERATOR` (least-privilege role for procedure execution)	No — Kelvin keeps `ACCOUNTADMIN`. New role gets only the grants it needs.
Warehouses	None new in Phase 1-3. Phase 4 may introduce `PIPELINE_S` (Small, auto-suspend 60s) if `DEV` proves too small for assertion overhead.	No — `DEV` continues for manual; new warehouse is opt-in.
Schemas	`HUSKY_DATA.OPS` for run logs, alerts, filter registry, pipeline scratch tables (separates pipeline-internal from `PRIVATE` production tables)	No — `PRIVATE` and `SHARE` untouched.
Tables	`OPS.PIPELINE_RUN_LOG`, `OPS.PIPELINE_ALERT_LOG`, `OPS.FILTER_REGISTRY`, `PRIVATE.FAGG_MAID_DEV` (sandbox)	No — production tables untouched.
Procedures	`OPS.MONTHLY_EYEOTA_PIPELINE()` — Kelvin's 4 steps wrapped + assertions + logging	Reads `PRIVATE.AGG_MAID`, `AGG_HEM`, etc. (read-only); writes `FAGG_MAID${YYYYMM}_SHADOW` during shadow mode, `FAGG_MAID${YYYYMM}` only after cutover.
Tasks	`OPS.MONTHLY_EYEOTA_TASK` — runs procedure on schedule, starts SUSPENDED	None — does nothing until Kelvin manually `RESUME`s it.
Stages	None new. Existing `@HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER` reused as-is.	No.
Grants	`MARKETPLACE_PIPELINE_OPERATOR` gets: SELECT on `PRIVATE.AGG_MAID`/`AGG_HEM`/`HUSKY_ID_DB3`/filter tables; CREATE/MODIFY on `OPS.*`; WRITE on Eyeota stage; CREATE TABLE on `PRIVATE` (only for materialised monthly snapshots). No grants to the SHARE schema.	No existing grants modified.
Streamlit / Apps	One Streamlit app `OPS.PIPELINE_DASHBOARD` reading `PIPELINE_RUN_LOG`. Hosted in `PUBLIC` schema since that's where existing Streamlit lives.	No.

Cost-of-rollback: because every change is additive and the production path stays manual until cutover (Phase 4), reverting is DROP SCHEMA OPS CASCADE; DROP TABLE PRIVATE.FAGG_MAID_DEV; REVOKE ALL ON ROLE MARKETPLACE_PIPELINE_OPERATOR; DROP ROLE MARKETPLACE_PIPELINE_OPERATOR;. Kelvin's manual workflow is unaffected at every phase.

Phased rollout¶

Phase 0 — Pre-flight alignment (Week 0)¶

Owner: Benjamin Wong + Kelvin Chan (joint) Deliverables: - 30-min sync with Kelvin reviewing the 2026-04-26 strategy + this execution doc - Confirm Kelvin agrees with sandbox-vs-production naming (FAGG_MAID_DEV vs FAGG_MAID${YYYYMM}) - Confirm Slack vs email for assertion alerts - Confirm row-count assertion thresholds (default ±20% from trailing-3-month median; Kelvin can tighten) - Identify any in-flight Kelvin work (e.g., AGG_MAID_HD lineage) that should be paused or coordinated

Exit criteria: signed-off scope. No code yet.

Phase 1 — Build automation in shadow (Week 1)¶

Owner: Benjamin Wong (build) + Kelvin Chan (review) Deliverables: 1. HUSKY_DATA.OPS schema created 2. OPS.PIPELINE_RUN_LOG + OPS.PIPELINE_ALERT_LOG tables created 3. OPS.MONTHLY_EYEOTA_PIPELINE() stored procedure — implements all 4 steps verbatim from playbook + logging + assertions, but writes to FAGG_MAID${YYYYMM}_SHADOW table (not production target) 4. OPS.MONTHLY_EYEOTA_TASK created in SUSPENDED state 5. MARKETPLACE_PIPELINE_OPERATOR role + grants 6. Migration SQL committed to docs/internal/snowflake-pipeline-migration-2026-04-27.sql for repeatability

Exit criteria: - Procedure callable manually: CALL OPS.MONTHLY_EYEOTA_PIPELINE(); - Returns success on a re-run of an already-shipped month (e.g., 202601) - FAGG_MAID202601_SHADOW row count = FAGG_MAID202601 row count (within 0.1%) - PIPELINE_RUN_LOG populated with one entry

Rollback: DROP SCHEMA HUSKY_DATA.OPS CASCADE; DROP TABLE FAGG_MAID202601_SHADOW;

Phase 2 — Dashboard + alerts (Week 2)¶

Owner: Benjamin Wong Deliverables: 1. Streamlit OPS.PIPELINE_DASHBOARD showing: row count history, country×platform splits, segment catalog drift, scrub-ratio trends. One page, no auth (gated by Snowsight login). 2. Slack webhook (or SES email) trigger on any PIPELINE_ALERT_LOG insert. Cloudflare Worker reads the table via Snowflake REST and posts to #data-pipeline Slack channel. 3. Test alerts by deliberately failing one assertion (e.g., set threshold so tight that 202601 re-run trips it) — confirm Slack receives the alert.

Exit criteria: - Kelvin confirms dashboard surfaces info he'd want to see - Test alert delivered to Slack within 5 min of trigger

Rollback: un-deploy Streamlit + delete webhook. No Snowflake side-effects.

Phase 3 — Shadow run for the next live month (Week 3-4, full calendar month)¶

Owner: Kelvin Chan (manual, as today) + automation runs in parallel Deliverables: 1. Kelvin runs the manual playbook for 202703 as usual (early May 2026, real production ship) 2. Same day, run CALL OPS.MONTHLY_EYEOTA_PIPELINE(); which writes FAGG_MAID202703_SHADOW (NOT the target — Eyeota gets the manual FAGG_MAID202703) 3. Compare FAGG_MAID202703 (manual production) vs FAGG_MAID202703_SHADOW (automated) 4. If row counts match within 0.5%, distinct user counts within 0.5%, and segment_id distributions within 1% K-S divergence → automation parity confirmed 5. Document any deltas in docs/internal/data-quality/2026-05-shadow-comparison.md

Exit criteria: - Parity confirmed for 202703 - No assertion false-positives during the run - Dashboard correctly reflects the shadow run

Rollback: if parity fails, do NOT cut over. Investigate diff. Manual continues. Phase repeats next month with fixes.

Phase 4 — Cutover (Week 5)¶

Owner: Kelvin Chan (executes), Benjamin Wong (on standby) Deliverables: 1. Modify OPS.MONTHLY_EYEOTA_PIPELINE() to write directly to FAGG_MAID${YYYYMM} instead of the _SHADOW target. One-line change, version-tagged. 2. Schedule one final dry-run for the next month via manual CALL OPS.MONTHLY_EYEOTA_PIPELINE(); — this becomes the actual production for that month. 3. ALTER TASK OPS.MONTHLY_EYEOTA_TASK RESUME; to enable scheduled runs from the following month. 4. Update docs/playbooks/monthly-eyeota-upload.md to reflect: "automated; manual fallback path retained below."

Exit criteria: - One full month shipped via automation - Eyeota supplier earnings report (60-day lag) confirms revenue unchanged

Rollback: - If a TASK run fails: PIPELINE_ALERT_LOG fires, Slack notifies Kelvin, Kelvin runs the manual playbook for that month. Maximum revenue exposure: one month delayed by 1-2 days. - If multiple consecutive failures: ALTER TASK OPS.MONTHLY_EYEOTA_TASK SUSPEND; and revert to manual until root cause resolved.

Phase 5 — 90-day soak (Week 6+)¶

Owner: Kelvin Chan (monthly review) + automatic alerts Deliverables: 1. Three consecutive successful automated runs (months 1-3 post-cutover) 2. Monthly snapshot doc at docs/internal/data-quality/YYYY-MM-snapshot.md capturing what dashboard surfaced + any actions taken 3. Quarterly review: are assertion thresholds catching real issues, or only noise? Tune. 4. Backfill OPS.FILTER_REGISTRY from existing filter_* tables (~12-18 historical filters) — capture the why behind each.

Exit criteria: stable for 90 days; quarterly review approved.

Phase 6 — Layer 4 documentation hardening (ongoing)¶

Owner: Benjamin Wong Deliverables: 1. OPS.FILTER_REGISTRY becomes the canonical source for "why this seg_id is blocked" — every new filter_* table gets a registry row at creation 2. Snowflake naming convention doc updated to reference the registry pattern 3. Per-country filter convention (BD${ISO3}260423 → standardise to FILTER_BD_${ISO3}_${YYMMDD})

Exit criteria: new filter tables follow convention; old ones tagged with legacy_naming=TRUE in registry.

Risk register¶

Risk	Likelihood	Impact	Mitigation
Shadow run produces different row counts than manual	Medium	High — blocks cutover	Phase 3 explicit comparison + diff investigation. If supplier external table state differs between manual and automated runs (e.g. partition-refresh timing), document and reconcile.
TASK fails silently overnight	Low (alerts catch it)	High (missed Eyeota ship)	`PIPELINE_ALERT_LOG` triggers Slack on any non-success status; Kelvin oncall during first 90 days
`--strict` build / metadata drift breaks dashboard	Low	Low (dashboard only)	Streamlit failure doesn't affect data pipeline; Kelvin can still review via direct table query
Kelvin in-flight `AGG_MAID_HD` work conflicts with shadow procedure	Medium	Medium	Phase 0 alignment surfaces this; coordinate the procedure to use the chosen lineage
Permissions error: `MARKETPLACE_PIPELINE_OPERATOR` lacks a needed grant during real run	Medium	High (failed ship)	Phase 1 exit criterion is a successful re-run of a historical month — exposes any missing grant before cutover
Storage cost spike from `_SHADOW` tables	Low	Low	Each shadow table is one month's worth (~2 TB); auto-drop tables older than 60 days via a separate cleanup task
Eyeota supplier earnings drop after cutover (data quality regression we didn't catch)	Low	Critical	60-day delayed earnings report is the final truth; if drops detected, revert to manual immediately and investigate

Stakeholder & RACI¶

Activity	Benjamin	Kelvin	Husky board
Strategy approval	A	C	I
Execution plan approval	A	C	I
Phase 1-2 build	R	C	—
Phase 3 shadow run review	C	R	I
Phase 4 cutover	C	R	I
Phase 5+ monthly review	I	R	—
Rollback decision	A	R	I

R = Responsible, A = Accountable, C = Consulted, I = Informed.

Open questions¶

Shadow table cleanup cadence — auto-drop after 60 days? 90 days? Affects storage cost.
Slack channel ownership — does #data-pipeline exist, or do we create it? Who's on the alert recipient list besides Kelvin?
Failure escalation chain — if Kelvin doesn't ack a Slack alert within 4 hours, who's the second escalation? (Today there's no fallback.)
Backfill scope for FILTER_REGISTRY — every historical filter_* table, or only ones still actively referenced by current month's CTAS?
Coordination with AGG_MAID_HD lineage — should the automated procedure assume the new HD path, or maintain compatibility with both AGG_MAID and AGG_MAID_HD?
Per-country sanity-cap rollout — the 10 BDXXX260423 country filter tables created 2026-04-23 hint at extending SGP-style scrub to other markets. Does the automated pipeline need to add per-country DELETE steps, or is that a separate sprint after cutover?

S3 cleanup (parallel workstream)¶

The 2026-04-26 data-plane review surfaced multiple S3 hygiene items. Consolidated here so cleanup happens once alongside the Snowflake migration, rather than piecemeal later.

Account: 969735114743 (alias huskydata, profile huskyai).

Cleanup catalog¶

ID	Object	Issue	Action	Priority	Owner	Reversible?
S3-1 ✅ DONE 2026-04-27	`s3://raw.huskydata.com/Acxiom.csv` (258.3 GB, us-east-1)	PII data, never used since 2024-02-01 ingest. Filename embedded blocked supplier brand. Held PII without documented purpose.	DELETED 2026-04-27 via `aws s3 rm`. Bucket has no versioning → permanent. 258.3 GB freed. GDPR Art. 5(1)(b) compliance recovered.	HIGH (closed)	Benjamin	N/A — completed
S3-2	`s3://raw.huskydata.com/Twitter Hit 50M.txt`, `FB/Hong Kong.txt`, `FB/USA.zip`, `linkedin/part-*.gz`, `vaccine_data.json/jsonl`	Supplier-platform-named files (Twitter, FB, linkedin) — same brand-leak concern, plus stale (last-modified pre-2024).	Audit each; if unused → archive to non-branded prefix. If used → namespace under `legacy/` and update consumers.	MEDIUM	Benjamin	Yes
S3-3	`s3://agg-data.huskyai.com/identity_db/` (empty per 2026-04-26 audit)	External table `HUSKY_DATA.PRIVATE.IDENTITY_DB` claims to read here; prefix has zero objects. Either data lives elsewhere or load never completed.	Investigate: query `IDENTITY_DB` returns rows? If yes → find real S3 location, fix external table pointer. If no → drop external table.	MEDIUM	Benjamin + Kelvin	Yes
S3-4	`s3://data.huskyai.com/scanbuy/hem_maid/`	Supplier still drops files here monthly; no Snowflake external table consumes them. Orphan ingest.	Confirm with Kelvin if needed; if no → ask supplier to stop delivering, lifecycle-rule the existing files to expire after 90 days.	LOW	Kelvin	Yes
S3-5	`s3://agg-data.huskyai.com/agg-hem_sizeLimitedActiveSegments/` (15.7 GB/month, written but no Snowflake consumer)	Asymmetric vs MAID side: MAID's SL prefix surfaces as `AGG_MAID_SL` and feeds SGP allowlist; HEM's SL prefix has no Snowflake surface and no consumer. Either we should be using it, or supplier shouldn't be writing it.	Decide: (a) wire HEM-side SGP scrub equivalent (uses this prefix) — see automation strategy Open Questions, OR (b) ask supplier to stop delivering.	MEDIUM	Benjamin + Kelvin	Yes
S3-6	`s3://agg-data.huskyai.com/agg-maid_latest_month_base/` + `agg-hem_latest_month_base/` (Delta intermediates)	External tables `AGG_MAID_CUR` (pinned date=202412) + `AGG_HEM_CUR` (pinned date=202308, 30 months stale) point at these. Zero query history in 365 days. Likely DBX-internal Delta checkpoints we surfaced once and forgot.	Drop the two `_CUR` external tables. Keep S3 prefixes (DBX needs them for next month's run).	LOW	Kelvin	Yes — just drop external table; data unchanged
S3-7 ✅ DONE 2026-05-12	`s3://agg-data.huskyai.com/agg-hem_distinctActiveSegments/` → ~~`AGG_HEM1`~~	180-day re-audit (2026-05-12) confirmed 0 production reads — earlier 3 q30d count was DDL-maintenance + GET_DDL audit noise.	DROPPED 2026-05-12 via `DROP EXTERNAL TABLE HUSKY_DATA.PRIVATE.AGG_HEM1`. S3 prefix retained (supplier still writes; n8n sync rule downgrade deferred to other workstream per owner direction).	LOW (closed)	Benjamin	N/A — completed
S3-8	`s3://agg-data.huskyai.com/SegmentedCount/` (XLS exports year=2019..2025)	Manual reporting artifacts — emailed XLS, not consumed by Snowflake. Pre-2024 likely dead.	Lifecycle-rule pre-2024 partitions to Glacier Deep Archive (~$1/TB/month vs S3 Standard ~$23/TB/month). Keep latest 24 months in Standard for active reference.	LOW	Benjamin	Reversible (Glacier restore)
S3-9	`s3://agg-data.huskyai.com/monthly_segment_count/` (Databricks Delta, AlikeAudience strings + 25 DSP brand columns in schema)	Brand contamination at column-name level. Currently NOT exposed via Snowflake — but if anyone ever creates an external table over this prefix, supplier brand leaks.	Supplier-side fix. Open ticket with supplier (Databricks owner: `tech@alikeaudience.com`, jobId 638399474177810) to rename branded columns to neutral names (e.g. `dsp_1`, `dsp_2`, ...) and replace `AlikeAudience: ...` strings in `path_new` column.	HIGH (supplier ticket)	Benjamin (open ticket)	N/A
S3-10	`s3://agg-data.huskyai.com/agg-{maid,hem}/date=202505..202510/` (6+ month old production partitions)	Standard-tier storage at ~2 TB/month × 6+ months × ~$23/TB = ~$280/month for partitions Kelvin's monthly job no longer reads (only the latest month is registered in `AGG_MAID`).	S3 Lifecycle rule: transition to Intelligent-Tiering after 60 days; to Glacier Instant Retrieval after 180 days. Saves ~60-80% on storage cost for cold partitions while preserving query-on-demand.	MEDIUM	Benjamin	Yes — rule reversible
S3-11	`s3://huskyai.com/` (433+ Dremio tutorial PNG/JPG files, legacy)	No active website hosted here; old static-mirror of Dremio docs. Hundreds of MB.	Move to `s3://huskyai-archive/legacy-dremio-tutorials/` and lifecycle to Glacier Deep Archive. Or just delete after confirming no inbound links.	LOW	Benjamin	Yes (archive) / No (delete)
S3-12	`s3://code.huskyai.com/sync_agg.sh` + `sync_daily.sh` (4 KB total)	Pipeline driver shell scripts in S3. If they're production, they should be in git, not S3. If they're dead, drop.	Read both files, commit to repo at `scripts/data-pipeline/` if active, archive if not.	LOW	Benjamin	Yes

Sequencing — when in the rollout¶

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 0 (Week 0) — pre-flight                                        │
│   • S3-1 (HIGH): rename Acxiom.csv. 1-hour task, do this immediately │
│   • S3-9 (HIGH): open supplier ticket re: AlikeAudience contamination│
│                                                                      │
│ Phase 1-2 (Week 1-2) — automation build                              │
│   • S3-3 (MEDIUM): investigate identity_db emptiness                  │
│   • S3-2 (MEDIUM): audit raw.huskydata.com supplier-named files      │
│   • S3-5 (MEDIUM): decide HEM SGP scrub — yes/no — drives wire-up    │
│                                                                      │
│ Phase 3 (Week 3-4) — shadow run                                      │
│   • S3-10 (MEDIUM): apply lifecycle rule to agg-data old partitions  │
│   • S3-6, S3-7 (LOW): drop stale `AGG_HEM_CUR`, `AGG_MAID_CUR`,     │
│     `AGG_HEM1` external tables                                        │
│                                                                      │
│ Phase 5+ (Week 6+) — soak                                            │
│   • S3-4, S3-8, S3-11, S3-12 (LOW): final hygiene sweep              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Cost impact estimate¶

Item	Current cost	Post-cleanup	Annual saving
S3-10 lifecycle on cold agg-data partitions	~$280/mo	~$60/mo	~$2,640
S3-8 SegmentedCount Glacier transition	~$100/mo (estimate)	~$10/mo	~$1,080
S3-11 huskyai.com archive	~$5/mo	~$0.5/mo	~$54
S3-1 Acxiom.csv (rename only, no cost change)	$6/mo	$6/mo	$0
Total annual cost saved			~$3,800

Not big money — but ~$300/mo recurring savings + brand-leak risk eliminated + dashboard clarity restored. The real ROI is risk reduction, not cost.

What we are NOT doing in S3 cleanup¶

Not deleting any production prefix (agg-data.huskyai.com/agg-maid/ and agg-hem/ core prefixes are untouched)
Not changing supplier-side Databricks job behaviour without their explicit confirmation (S3-9 is a ticket, not a unilateral change)
Not touching vendor-managed buckets (dremio-*, databricks-workspace-stack-*, elasticbeanstalk-*) — out of scope
Not dropping the agg-{maid,hem}_activeSegments/, _distinctActiveSegments/, _sizeLimitedActiveSegments/ S3 prefixes themselves — supplier needs them for their next-month pipeline runs. Only the Snowflake external tables that point at stale partitions get dropped.

Provenance¶

Drafted 2026-04-27 as the execution layer beneath the 2026-04-26 pipeline automation strategy. Designed in the context of a manual-pipeline-with-incident-the-day-before (FAGG_MAID202602 HKG-only overwrite) and an 8-deploy-failure wiki backlog (root-caused to README/index.md collision in MkDocs strict mode). Both incidents informed the additive-only / shadow-first rollout philosophy. S3 cleanup section added 2026-04-27 to consolidate hygiene items surfaced during data-plane review.