Pipeline automation — execution & migration plan¶
Why this doc exists: the 2026-04-26 strategy doc defines the what (4-layer architecture) and why (don't replace Kelvin's eye, replace his hands). This doc defines the how — concrete phased rollout, Snowflake account changes, deliverables, exit criteria, and rollback paths. No Snowflake objects are touched until Phase 1 starts. Shadow mode runs alongside Kelvin's manual pipeline for at least one full month before cutover.
Scope¶
In scope:
- Wrap Kelvin's existing 4-step monthly process (CTAS FAGG_MAID, SGP scrub, HEM_ build, COPY INTO Eyeota stage) inside a Snowflake stored procedure
- Add row-count and partition-existence assertions; abort + alert on anomaly
- One Snowflake TASK to schedule monthly run
- PIPELINE_RUN_LOG + PIPELINE_ALERT_LOG tables for observability
- FILTER_REGISTRY table to capture filter-version metadata
- Streamlit dashboard reading PIPELINE_RUN_LOG for monthly insight ritual
- Sandbox tables (FAGG_MAID_DEV, FAGG_MAID_${ISO3}_TEST) for experiments
Not in scope:
- Migration to a different Snowflake account, region, or org. The existing account ujb07324 (org ciqcvik, region AWS_US_WEST_2) is already aligned with the SG primary contracting entity (Husky Technology Pte Limited). Marketplace provider profile GZTYZ2Y10J8 is bound to this account; recreating it elsewhere would invalidate the live Marketplace listings. No account-level migration.
- Changes to upstream Databricks pipeline. The supplier writes intermediate prefixes (*_activeSegments, *_distinctActiveSegments, etc.) that Husky originally requested for testing; whether to keep or drop them is a separate conversation with the supplier.
- Re-architecting the HEM-side asymmetry (no SGP scrub, no segment-blocklist filter). Surfaced as Open Question; not blocking automation rollout.
- Replacing agg_maid_sgp_filter allowlist generation. Today Kelvin runs the allowlist refresh ad-hoc from AGG_MAID_SL; automation reuses the existing allowlist as-is.
Snowflake account changes (in-place, no account migration)¶
The migration is additive — every new object lives alongside Kelvin's existing setup. Nothing existing gets renamed, dropped, or repointed during rollout.
| Layer | New objects | Touches existing? |
|---|---|---|
| Roles | MARKETPLACE_PIPELINE_OPERATOR (least-privilege role for procedure execution) |
No — Kelvin keeps ACCOUNTADMIN. New role gets only the grants it needs. |
| Warehouses | None new in Phase 1-3. Phase 4 may introduce PIPELINE_S (Small, auto-suspend 60s) if DEV proves too small for assertion overhead. |
No — DEV continues for manual; new warehouse is opt-in. |
| Schemas | HUSKY_DATA.OPS for run logs, alerts, filter registry, pipeline scratch tables (separates pipeline-internal from PRIVATE production tables) |
No — PRIVATE and SHARE untouched. |
| Tables | OPS.PIPELINE_RUN_LOG, OPS.PIPELINE_ALERT_LOG, OPS.FILTER_REGISTRY, PRIVATE.FAGG_MAID_DEV (sandbox) |
No — production tables untouched. |
| Procedures | OPS.MONTHLY_EYEOTA_PIPELINE() — Kelvin's 4 steps wrapped + assertions + logging |
Reads PRIVATE.AGG_MAID, AGG_HEM, etc. (read-only); writes FAGG_MAID${YYYYMM}_SHADOW during shadow mode, FAGG_MAID${YYYYMM} only after cutover. |
| Tasks | OPS.MONTHLY_EYEOTA_TASK — runs procedure on schedule, starts SUSPENDED |
None — does nothing until Kelvin manually RESUMEs it. |
| Stages | None new. Existing @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER reused as-is. |
No. |
| Grants | MARKETPLACE_PIPELINE_OPERATOR gets: SELECT on PRIVATE.AGG_MAID/AGG_HEM/HUSKY_ID_DB3/filter tables; CREATE/MODIFY on OPS.*; WRITE on Eyeota stage; CREATE TABLE on PRIVATE (only for materialised monthly snapshots). No grants to the SHARE schema. |
No existing grants modified. |
| Streamlit / Apps | One Streamlit app OPS.PIPELINE_DASHBOARD reading PIPELINE_RUN_LOG. Hosted in PUBLIC schema since that's where existing Streamlit lives. |
No. |
Cost-of-rollback: because every change is additive and the production path stays manual until cutover (Phase 4), reverting is DROP SCHEMA OPS CASCADE; DROP TABLE PRIVATE.FAGG_MAID_DEV; REVOKE ALL ON ROLE MARKETPLACE_PIPELINE_OPERATOR; DROP ROLE MARKETPLACE_PIPELINE_OPERATOR;. Kelvin's manual workflow is unaffected at every phase.
Phased rollout¶
Phase 0 — Pre-flight alignment (Week 0)¶
Owner: Benjamin Wong + Kelvin Chan (joint)
Deliverables:
- 30-min sync with Kelvin reviewing the 2026-04-26 strategy + this execution doc
- Confirm Kelvin agrees with sandbox-vs-production naming (FAGG_MAID_DEV vs FAGG_MAID${YYYYMM})
- Confirm Slack vs email for assertion alerts
- Confirm row-count assertion thresholds (default ±20% from trailing-3-month median; Kelvin can tighten)
- Identify any in-flight Kelvin work (e.g., AGG_MAID_HD lineage) that should be paused or coordinated
Exit criteria: signed-off scope. No code yet.
Phase 1 — Build automation in shadow (Week 1)¶
Owner: Benjamin Wong (build) + Kelvin Chan (review)
Deliverables:
1. HUSKY_DATA.OPS schema created
2. OPS.PIPELINE_RUN_LOG + OPS.PIPELINE_ALERT_LOG tables created
3. OPS.MONTHLY_EYEOTA_PIPELINE() stored procedure — implements all 4 steps verbatim from playbook + logging + assertions, but writes to FAGG_MAID${YYYYMM}_SHADOW table (not production target)
4. OPS.MONTHLY_EYEOTA_TASK created in SUSPENDED state
5. MARKETPLACE_PIPELINE_OPERATOR role + grants
6. Migration SQL committed to docs/internal/snowflake-pipeline-migration-2026-04-27.sql for repeatability
Exit criteria:
- Procedure callable manually: CALL OPS.MONTHLY_EYEOTA_PIPELINE();
- Returns success on a re-run of an already-shipped month (e.g., 202601)
- FAGG_MAID202601_SHADOW row count = FAGG_MAID202601 row count (within 0.1%)
- PIPELINE_RUN_LOG populated with one entry
Rollback: DROP SCHEMA HUSKY_DATA.OPS CASCADE; DROP TABLE FAGG_MAID202601_SHADOW;
Phase 2 — Dashboard + alerts (Week 2)¶
Owner: Benjamin Wong
Deliverables:
1. Streamlit OPS.PIPELINE_DASHBOARD showing: row count history, country×platform splits, segment catalog drift, scrub-ratio trends. One page, no auth (gated by Snowsight login).
2. Slack webhook (or SES email) trigger on any PIPELINE_ALERT_LOG insert. Cloudflare Worker reads the table via Snowflake REST and posts to #data-pipeline Slack channel.
3. Test alerts by deliberately failing one assertion (e.g., set threshold so tight that 202601 re-run trips it) — confirm Slack receives the alert.
Exit criteria: - Kelvin confirms dashboard surfaces info he'd want to see - Test alert delivered to Slack within 5 min of trigger
Rollback: un-deploy Streamlit + delete webhook. No Snowflake side-effects.
Phase 3 — Shadow run for the next live month (Week 3-4, full calendar month)¶
Owner: Kelvin Chan (manual, as today) + automation runs in parallel
Deliverables:
1. Kelvin runs the manual playbook for 202703 as usual (early May 2026, real production ship)
2. Same day, run CALL OPS.MONTHLY_EYEOTA_PIPELINE(); which writes FAGG_MAID202703_SHADOW (NOT the target — Eyeota gets the manual FAGG_MAID202703)
3. Compare FAGG_MAID202703 (manual production) vs FAGG_MAID202703_SHADOW (automated)
4. If row counts match within 0.5%, distinct user counts within 0.5%, and segment_id distributions within 1% K-S divergence → automation parity confirmed
5. Document any deltas in docs/internal/data-quality/2026-05-shadow-comparison.md
Exit criteria: - Parity confirmed for 202703 - No assertion false-positives during the run - Dashboard correctly reflects the shadow run
Rollback: if parity fails, do NOT cut over. Investigate diff. Manual continues. Phase repeats next month with fixes.
Phase 4 — Cutover (Week 5)¶
Owner: Kelvin Chan (executes), Benjamin Wong (on standby)
Deliverables:
1. Modify OPS.MONTHLY_EYEOTA_PIPELINE() to write directly to FAGG_MAID${YYYYMM} instead of the _SHADOW target. One-line change, version-tagged.
2. Schedule one final dry-run for the next month via manual CALL OPS.MONTHLY_EYEOTA_PIPELINE(); — this becomes the actual production for that month.
3. ALTER TASK OPS.MONTHLY_EYEOTA_TASK RESUME; to enable scheduled runs from the following month.
4. Update docs/playbooks/monthly-eyeota-upload.md to reflect: "automated; manual fallback path retained below."
Exit criteria: - One full month shipped via automation - Eyeota supplier earnings report (60-day lag) confirms revenue unchanged
Rollback:
- If a TASK run fails: PIPELINE_ALERT_LOG fires, Slack notifies Kelvin, Kelvin runs the manual playbook for that month. Maximum revenue exposure: one month delayed by 1-2 days.
- If multiple consecutive failures: ALTER TASK OPS.MONTHLY_EYEOTA_TASK SUSPEND; and revert to manual until root cause resolved.
Phase 5 — 90-day soak (Week 6+)¶
Owner: Kelvin Chan (monthly review) + automatic alerts
Deliverables:
1. Three consecutive successful automated runs (months 1-3 post-cutover)
2. Monthly snapshot doc at docs/internal/data-quality/YYYY-MM-snapshot.md capturing what dashboard surfaced + any actions taken
3. Quarterly review: are assertion thresholds catching real issues, or only noise? Tune.
4. Backfill OPS.FILTER_REGISTRY from existing filter_* tables (~12-18 historical filters) — capture the why behind each.
Exit criteria: stable for 90 days; quarterly review approved.
Phase 6 — Layer 4 documentation hardening (ongoing)¶
Owner: Benjamin Wong
Deliverables:
1. OPS.FILTER_REGISTRY becomes the canonical source for "why this seg_id is blocked" — every new filter_* table gets a registry row at creation
2. Snowflake naming convention doc updated to reference the registry pattern
3. Per-country filter convention (BD${ISO3}260423 → standardise to FILTER_BD_${ISO3}_${YYMMDD})
Exit criteria: new filter tables follow convention; old ones tagged with legacy_naming=TRUE in registry.
Risk register¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Shadow run produces different row counts than manual | Medium | High — blocks cutover | Phase 3 explicit comparison + diff investigation. If supplier external table state differs between manual and automated runs (e.g. partition-refresh timing), document and reconcile. |
| TASK fails silently overnight | Low (alerts catch it) | High (missed Eyeota ship) | PIPELINE_ALERT_LOG triggers Slack on any non-success status; Kelvin oncall during first 90 days |
--strict build / metadata drift breaks dashboard |
Low | Low (dashboard only) | Streamlit failure doesn't affect data pipeline; Kelvin can still review via direct table query |
Kelvin in-flight AGG_MAID_HD work conflicts with shadow procedure |
Medium | Medium | Phase 0 alignment surfaces this; coordinate the procedure to use the chosen lineage |
Permissions error: MARKETPLACE_PIPELINE_OPERATOR lacks a needed grant during real run |
Medium | High (failed ship) | Phase 1 exit criterion is a successful re-run of a historical month — exposes any missing grant before cutover |
Storage cost spike from _SHADOW tables |
Low | Low | Each shadow table is one month's worth (~2 TB); auto-drop tables older than 60 days via a separate cleanup task |
| Eyeota supplier earnings drop after cutover (data quality regression we didn't catch) | Low | Critical | 60-day delayed earnings report is the final truth; if drops detected, revert to manual immediately and investigate |
Stakeholder & RACI¶
| Activity | Benjamin | Kelvin | Husky board |
|---|---|---|---|
| Strategy approval | A | C | I |
| Execution plan approval | A | C | I |
| Phase 1-2 build | R | C | — |
| Phase 3 shadow run review | C | R | I |
| Phase 4 cutover | C | R | I |
| Phase 5+ monthly review | I | R | — |
| Rollback decision | A | R | I |
R = Responsible, A = Accountable, C = Consulted, I = Informed.
Open questions¶
- Shadow table cleanup cadence — auto-drop after 60 days? 90 days? Affects storage cost.
- Slack channel ownership — does
#data-pipelineexist, or do we create it? Who's on the alert recipient list besides Kelvin? - Failure escalation chain — if Kelvin doesn't ack a Slack alert within 4 hours, who's the second escalation? (Today there's no fallback.)
- Backfill scope for FILTER_REGISTRY — every historical
filter_*table, or only ones still actively referenced by current month's CTAS? - Coordination with
AGG_MAID_HDlineage — should the automated procedure assume the new HD path, or maintain compatibility with both AGG_MAID and AGG_MAID_HD? - Per-country sanity-cap rollout — the 10 BDXXX260423 country filter tables created 2026-04-23 hint at extending SGP-style scrub to other markets. Does the automated pipeline need to add per-country DELETE steps, or is that a separate sprint after cutover?
S3 cleanup (parallel workstream)¶
The 2026-04-26 data-plane review surfaced multiple S3 hygiene items. Consolidated here so cleanup happens once alongside the Snowflake migration, rather than piecemeal later.
Account: 969735114743 (alias huskydata, profile huskyai).
Cleanup catalog¶
| ID | Object | Issue | Action | Priority | Owner | Reversible? |
|---|---|---|---|---|---|---|
| S3-1 ✅ DONE 2026-04-27 | s3://raw.huskydata.com/Acxiom.csv (258.3 GB, us-east-1) |
PII data, never used since 2024-02-01 ingest. Filename embedded blocked supplier brand. Held PII without documented purpose. | DELETED 2026-04-27 via aws s3 rm. Bucket has no versioning → permanent. 258.3 GB freed. GDPR Art. 5(1)(b) compliance recovered. |
HIGH (closed) | Benjamin | N/A — completed |
| S3-2 | s3://raw.huskydata.com/Twitter Hit 50M.txt, FB/Hong Kong.txt, FB/USA.zip, linkedin/part-*.gz, vaccine_data.json/jsonl |
Supplier-platform-named files (Twitter, FB, linkedin) — same brand-leak concern, plus stale (last-modified pre-2024). | Audit each; if unused → archive to non-branded prefix. If used → namespace under legacy/ and update consumers. |
MEDIUM | Benjamin | Yes |
| S3-3 | s3://agg-data.huskyai.com/identity_db/ (empty per 2026-04-26 audit) |
External table HUSKY_DATA.PRIVATE.IDENTITY_DB claims to read here; prefix has zero objects. Either data lives elsewhere or load never completed. |
Investigate: query IDENTITY_DB returns rows? If yes → find real S3 location, fix external table pointer. If no → drop external table. |
MEDIUM | Benjamin + Kelvin | Yes |
| S3-4 | s3://data.huskyai.com/scanbuy/hem_maid/ |
Supplier still drops files here monthly; no Snowflake external table consumes them. Orphan ingest. | Confirm with Kelvin if needed; if no → ask supplier to stop delivering, lifecycle-rule the existing files to expire after 90 days. | LOW | Kelvin | Yes |
| S3-5 | s3://agg-data.huskyai.com/agg-hem_sizeLimitedActiveSegments/ (15.7 GB/month, written but no Snowflake consumer) |
Asymmetric vs MAID side: MAID's SL prefix surfaces as AGG_MAID_SL and feeds SGP allowlist; HEM's SL prefix has no Snowflake surface and no consumer. Either we should be using it, or supplier shouldn't be writing it. |
Decide: (a) wire HEM-side SGP scrub equivalent (uses this prefix) — see automation strategy Open Questions, OR (b) ask supplier to stop delivering. | MEDIUM | Benjamin + Kelvin | Yes |
| S3-6 | s3://agg-data.huskyai.com/agg-maid_latest_month_base/ + agg-hem_latest_month_base/ (Delta intermediates) |
External tables AGG_MAID_CUR (pinned date=202412) + AGG_HEM_CUR (pinned date=202308, 30 months stale) point at these. Zero query history in 365 days. Likely DBX-internal Delta checkpoints we surfaced once and forgot. |
Drop the two _CUR external tables. Keep S3 prefixes (DBX needs them for next month's run). |
LOW | Kelvin | Yes — just drop external table; data unchanged |
| S3-7 ✅ DONE 2026-05-12 | s3://agg-data.huskyai.com/agg-hem_distinctActiveSegments/ → ~~AGG_HEM1~~ |
180-day re-audit (2026-05-12) confirmed 0 production reads — earlier 3 q30d count was DDL-maintenance + GET_DDL audit noise. | DROPPED 2026-05-12 via DROP EXTERNAL TABLE HUSKY_DATA.PRIVATE.AGG_HEM1. S3 prefix retained (supplier still writes; n8n sync rule downgrade deferred to other workstream per owner direction). |
LOW (closed) | Benjamin | N/A — completed |
| S3-8 | s3://agg-data.huskyai.com/SegmentedCount/ (XLS exports year=2019..2025) |
Manual reporting artifacts — emailed XLS, not consumed by Snowflake. Pre-2024 likely dead. | Lifecycle-rule pre-2024 partitions to Glacier Deep Archive (~$1/TB/month vs S3 Standard ~$23/TB/month). Keep latest 24 months in Standard for active reference. | LOW | Benjamin | Reversible (Glacier restore) |
| S3-9 | s3://agg-data.huskyai.com/monthly_segment_count/ (Databricks Delta, AlikeAudience strings + 25 DSP brand columns in schema) |
Brand contamination at column-name level. Currently NOT exposed via Snowflake — but if anyone ever creates an external table over this prefix, supplier brand leaks. | Supplier-side fix. Open ticket with supplier (Databricks owner: tech@alikeaudience.com, jobId 638399474177810) to rename branded columns to neutral names (e.g. dsp_1, dsp_2, ...) and replace AlikeAudience: ... strings in path_new column. |
HIGH (supplier ticket) | Benjamin (open ticket) | N/A |
| S3-10 | s3://agg-data.huskyai.com/agg-{maid,hem}/date=202505..202510/ (6+ month old production partitions) |
Standard-tier storage at ~2 TB/month × 6+ months × ~$23/TB = ~$280/month for partitions Kelvin's monthly job no longer reads (only the latest month is registered in AGG_MAID). |
S3 Lifecycle rule: transition to Intelligent-Tiering after 60 days; to Glacier Instant Retrieval after 180 days. Saves ~60-80% on storage cost for cold partitions while preserving query-on-demand. | MEDIUM | Benjamin | Yes — rule reversible |
| S3-11 | s3://huskyai.com/ (433+ Dremio tutorial PNG/JPG files, legacy) |
No active website hosted here; old static-mirror of Dremio docs. Hundreds of MB. | Move to s3://huskyai-archive/legacy-dremio-tutorials/ and lifecycle to Glacier Deep Archive. Or just delete after confirming no inbound links. |
LOW | Benjamin | Yes (archive) / No (delete) |
| S3-12 | s3://code.huskyai.com/sync_agg.sh + sync_daily.sh (4 KB total) |
Pipeline driver shell scripts in S3. If they're production, they should be in git, not S3. If they're dead, drop. | Read both files, commit to repo at scripts/data-pipeline/ if active, archive if not. |
LOW | Benjamin | Yes |
Sequencing — when in the rollout¶
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 0 (Week 0) — pre-flight │
│ • S3-1 (HIGH): rename Acxiom.csv. 1-hour task, do this immediately │
│ • S3-9 (HIGH): open supplier ticket re: AlikeAudience contamination│
│ │
│ Phase 1-2 (Week 1-2) — automation build │
│ • S3-3 (MEDIUM): investigate identity_db emptiness │
│ • S3-2 (MEDIUM): audit raw.huskydata.com supplier-named files │
│ • S3-5 (MEDIUM): decide HEM SGP scrub — yes/no — drives wire-up │
│ │
│ Phase 3 (Week 3-4) — shadow run │
│ • S3-10 (MEDIUM): apply lifecycle rule to agg-data old partitions │
│ • S3-6, S3-7 (LOW): drop stale `AGG_HEM_CUR`, `AGG_MAID_CUR`, │
│ `AGG_HEM1` external tables │
│ │
│ Phase 5+ (Week 6+) — soak │
│ • S3-4, S3-8, S3-11, S3-12 (LOW): final hygiene sweep │
│ │
└─────────────────────────────────────────────────────────────────────┘
Cost impact estimate¶
| Item | Current cost | Post-cleanup | Annual saving |
|---|---|---|---|
| S3-10 lifecycle on cold agg-data partitions | ~$280/mo | ~$60/mo | ~$2,640 |
| S3-8 SegmentedCount Glacier transition | ~$100/mo (estimate) | ~$10/mo | ~$1,080 |
| S3-11 huskyai.com archive | ~$5/mo | ~$0.5/mo | ~$54 |
| S3-1 Acxiom.csv (rename only, no cost change) | $6/mo | $6/mo | $0 |
| Total annual cost saved | ~$3,800 |
Not big money — but ~$300/mo recurring savings + brand-leak risk eliminated + dashboard clarity restored. The real ROI is risk reduction, not cost.
What we are NOT doing in S3 cleanup¶
- Not deleting any production prefix (
agg-data.huskyai.com/agg-maid/andagg-hem/core prefixes are untouched) - Not changing supplier-side Databricks job behaviour without their explicit confirmation (S3-9 is a ticket, not a unilateral change)
- Not touching vendor-managed buckets (
dremio-*,databricks-workspace-stack-*,elasticbeanstalk-*) — out of scope - Not dropping the
agg-{maid,hem}_activeSegments/,_distinctActiveSegments/,_sizeLimitedActiveSegments/S3 prefixes themselves — supplier needs them for their next-month pipeline runs. Only the Snowflake external tables that point at stale partitions get dropped.
Provenance¶
Drafted 2026-04-27 as the execution layer beneath the 2026-04-26 pipeline automation strategy. Designed in the context of a manual-pipeline-with-incident-the-day-before (FAGG_MAID202602 HKG-only overwrite) and an 8-deploy-failure wiki backlog (root-caused to README/index.md collision in MkDocs strict mode). Both incidents informed the additive-only / shadow-first rollout philosophy. S3 cleanup section added 2026-04-27 to consolidate hygiene items surfaced during data-plane review.