Monthly Eyeota upload — runbook¶

Purpose: Every month after the MAID + HEM aggregations finalize, Husky uploads the latest month's segment data to Eyeota's S3 intake so Eyeota can sync our segments into their global audience marketplace. This is the revenue engine for the unbranded Eyeota channel (~US$69K/month per Feb 2026).

Principle: Keep this process boring and reproducible. If Kelvin is out, anyone with Snowflake ACCOUNTADMIN should be able to follow this page and ship the month.

Schedule¶

When	What
1st–3rd of each month	Data pipeline produces `HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM}` and `HUSKY_DATA.PRIVATE.HEM_${YYYYMM}` for the prior month
4th–7th of each month	This runbook executes — Kelvin uploads android / ios / hem partitions to Eyeota stage
7th–15th of each month	Eyeota ingests the files into their global catalog; Husky segments become available to Eyeota buyers
14th–30th of next month	Eyeota supplier earnings report arrives (still titled "Global Next Buzz Media" until contract migration completes — see `docs/reference/corporate-entities.md`)

Prerequisites¶

Snowflake login with ACCOUNTADMIN role (or whatever role has SELECT on HUSKY_DATA.PRIVATE.* and WRITE on @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER)
DEV warehouse (or any X-Small warehouse) — each step takes ~5-15 min
Supplier (Databricks-side) has finalised the new month's aggregation in s3://agg-data.huskyai.com/agg-maid/date=${YYYYMM}/ and agg-hem/date=${YYYYMM}/. Confirm via aws s3 ls s3://agg-data.huskyai.com/agg-maid/date=${YYYYMM}/ --profile huskyai before starting.
External tables AGG_MAID and AGG_HEM need their partition for ${YYYYMM} registered (the externals are single-month views; ALTER EXTERNAL TABLE AGG_MAID REFRESH; if not auto-refreshing).

Pipeline overview¶

Step 1: CTAS  FAGG_MAID${YYYYMM} ← FROM AGG_MAID (external)        ~5 min
Step 2: DELETE SGP rows not in agg_maid_sgp_filter allowlist        ~3 min
Step 3: CREATE + INSERT HEM_${YYYYMM} ← FROM AGG_HEM × HUSKY_ID_DB3  ~5 min
Step 4: 3 × COPY INTO @HUSKY_EYEOTA_UPLOADER/{date}/{android,ios,hem}/  ~15-30 min

Steps 1-3 build the Snowflake-native materialised tables. Step 4 ships them to Eyeota's S3 stage. Total wall-clock ~30-50 min on DEV warehouse.

Variables to set per run¶

-- Source month to ship (the month whose data was aggregated):
SET YYYYMM = '202602';     -- e.g. Feb 2026 data (shipped early Apr)
-- Run date (goes into the S3 prefix so Eyeota knows which batch is which):
SET RUN_DATE = '20260407';  -- typically today's date in YYYYMMDD

Step 1 — Build `FAGG_MAID${YYYYMM}` (Filtered Aggregated MAID)¶

Why "F" prefix: AGG_MAID is the supplier-delivered raw aggregation (multi-row per user, ~4B rows). FAGG_MAID adds three Husky-side filters: DISTINCT user_id dedup, segment-id blocklist via ARRAY_EXCEPT, and SGP allowlist (Step 2). Result: one cleaned month-snapshot table.

CREATE OR REPLACE TABLE HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} AS
SELECT DISTINCT
    user_id,
    CASE
      WHEN country_part='USA' THEN
        ARRAY_TO_STRING(
          ARRAY_EXCEPT(seg, (SELECT ARRAY_AGG(seg_id) FROM HUSKY_DATA.PRIVATE.filter_260302_3_300w)),
          ','
        )
      ELSE
        ARRAY_TO_STRING(
          ARRAY_EXCEPT(seg, (SELECT ARRAY_AGG(seg_id) FROM HUSKY_DATA.PRIVATE.filter_260424_misc)),
          ','
        )
    END AS seg_list,
    country_part,
    date_part,
    platform_part
FROM HUSKY_DATA.PRIVATE.AGG_MAID;

USA and non-USA rows go through different filter_* blocklists (current versions: filter_260302_3_300w for USA, filter_260424_misc for everyone else, as of 2026-04 cycle). See docs/strategy/2026-04-26-pipeline-automation-strategy.md for the proposed FILTER_REGISTRY capturing why each filter exists.
Expected row count: ~3.9-4.1B for a normal full month. Sharply lower means the upstream supplier load is incomplete or the filter version is wrong.
Today's row count is logged for trend monitoring (see Step 6).

Step 2 — SGP sanity-cap scrub¶

Why this step exists: raw AGG_MAID contains ~1.2-1.4B SGP-tagged MAIDs per month, ~200x Singapore's population (5.9M). Cause is likely VPN/proxy mis-tagging, transit-traveller GPS triggers, ad-network spoofing, or supplier quality issues. agg_maid_sgp_filter is a curated allowlist of "real" SGP user_ids (~8.16M rows, close to Singapore's actual smartphone population). Anything not in the allowlist gets removed.

DELETE FROM HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} a
WHERE  a.country_part = 'SGP'
   AND NOT EXISTS (
     SELECT 1 FROM HUSKY_DATA.PRIVATE.agg_maid_sgp_filter b
     WHERE  a.user_id = b.user_id
   );

Typical magnitude: ~50-60M rows deleted (revised 2026-05-14 baseline). 6-cycle median = 55M (cycles 202510→202603 backfilled to HUSKY_DATA.OPS.PIPELINE_RUN_LOG). Earlier doc cited "1.2-1.4B" — that was supplier-upstream behaviour pre-2025; supplier appears to have fixed SGP inflation at source. If this step deletes <30M or >100M, investigate — supplier behaviour shifted again.
The agg_maid_sgp_filter allowlist itself is rebuilt periodically from AGG_MAID_SL (the supplier's agg-maid_sizeLimitedActiveSegments prefix, which contains only "active" users with multi-segment lists). See ~/Projects/.../memory/project_agg_maid_vs_sl_difference.md for empirical row-count evidence.
Per-country expansion is in flight: 10 BD${ISO3}260423 boundary-data filter tables created 2026-04-23 (HKG, IDN, IND, JPN, KOR, MYS, PHL, SGP, THA, VNM) suggest SGP-style sanity caps are extending to other markets in upcoming runs.

Step 3 — Build `HEM_${YYYYMM}` (Hashed Email rollup)¶

Why no "F" prefix: the HEM transform is identity remapping (MAID → email via HUSKY_ID_DB3 crosswalk), not filtering. There is currently no segment-blocklist filter and no SGP scrub on the HEM side — see Open Questions at the bottom.

CREATE OR REPLACE TABLE HUSKY_DATA.PRIVATE.HEM_${YYYYMM} (
  hem_id   VARCHAR,
  country  VARCHAR,
  seg_list VARCHAR
);

INSERT INTO HUSKY_DATA.PRIVATE.HEM_${YYYYMM}
SELECT  b.h_email,
        a.country_part,
        a.seg_list
FROM    HUSKY_DATA.PRIVATE.AGG_HEM       AS a
INNER JOIN HUSKY_DATA.PRIVATE.HUSKY_ID_DB3 AS b
        ON a.aaid = b.hem_id;

Expected row count: ~900M-950M for a normal month.
HUSKY_ID_DB3 is Husky's MAID ↔ hashed-email crosswalk (~1B rows). It's the join key that converts the supplier's MAID-keyed AGG_HEM into Husky's email-keyed delivery.

Step 4 — Ship to Eyeota stage (3 COPY INTO statements)¶

All three must succeed. Run sequentially; each takes 5-15 minutes on DEV warehouse.

4.1 MAID Android¶

COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/android/
FROM (
    SELECT  user_id,
            CASE WHEN country_part = 'KOR' THEN 'KR'
                 ELSE LEFT(country_part, 2)
            END AS country,
            seg_list
    FROM HUSKY_DATA.PRIVATE.FAGG_MAID{YYYYMM}
    WHERE platform_part = 'ANDROID'
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;

4.2 MAID iOS¶

COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/ios/
FROM (
    SELECT  user_id,
            CASE WHEN country_part = 'KOR' THEN 'KR'
                 ELSE LEFT(country_part, 2)
            END AS country,
            seg_list
    FROM HUSKY_DATA.PRIVATE.FAGG_MAID{YYYYMM}
    WHERE platform_part = 'IOS'
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;

4.3 HEM¶

COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/hem/
FROM (
    SELECT  hem_id,
            CASE WHEN country = 'KOR' THEN 'KR'
                 ELSE LEFT(country, 2)
            END AS country,
            seg_list
    FROM HUSKY_DATA.PRIVATE.HEM_{YYYYMM}
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;

Why the KOR → KR transformation¶

Husky's internal country codes use ISO 3166-1 alpha-3 (KOR, JPN, SGP, etc.). Eyeota's intake expects ISO 3166-1 alpha-2 (KR, JP, SG, etc.). LEFT(country, 2) gives the correct alpha-2 for all APAC markets except Korea, where KOR → KO would be wrong (the alpha-2 is KR). Hence the explicit CASE branch. Do not simplify this — the KOR case is the only exception.

Verification after upload¶

LIST @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/
  ORDER BY LAST_MODIFIED DESC;

Expected: three subfolders (android/, ios/, hem/), each with one or more Parquet files totaling ~500 GB combined. If any folder is missing or empty, re-run that COPY INTO.

Troubleshooting¶

Symptom	Cause	Fix
`Object 'FAGG_MAID${YYYYMM}' does not exist`	This month's aggregation hasn't finalized yet	Wait for the upstream pipeline; confirm with the data team
`Insufficient privileges`	Role is not ACCOUNTADMIN	`USE ROLE ACCOUNTADMIN` before running
`File format 'PARQUET' does not exist`	Schema qualifier missing	Use `HUSKY_DATA.PRIVATE.PARQUET` as shown
COPY INTO hangs >30 min	Warehouse too small / auto-suspend kicked in	Resize warehouse: `ALTER WAREHOUSE DEV SET WAREHOUSE_SIZE = 'SMALL'` for the run, reset to X-Small after
Eyeota reports empty file	OVERWRITE omitted and folder already contains older partial files	Use `OVERWRITE = TRUE`

What NOT to do¶

Do not rename HUSKY_DATA.PRIVATE.FAGG_MAID* or HEM_* tables. Column names user_id, country_part, seg_list, platform_part, hem_id, country must stay exactly as-is — they are referenced by hard-coded COPY INTO statements.
Do not drop the HUSKY_EYEOTA_UPLOADER stage or modify its S3 credentials. Eyeota reads from this bucket directly.
Do not disable the HUSKY_DATA.PRIVATE schema or the HUSKY_EYEOTA_UPLOADER stage on a whim; a lapsed monthly upload costs ~US$69K of revenue.

Open questions / known asymmetries¶

HEM has no SGP scrub equivalent. Steps 1-2 apply DISTINCT + segment-blocklist + SGP allowlist on MAID. Step 3 (HEM) only does the MAID→email join. If raw AGG_HEM carries the same SGP inflation pattern as AGG_MAID, the HEM Eyeota delivery may be shipping noise. Verify with: SELECT country, COUNT(*) FROM HEM_${YYYYMM} GROUP BY 1 ORDER BY 2 DESC; and compare against expected smartphone population per country.
HEM seg_list is unfiltered. a.seg_list from AGG_HEM is used as-is. If supplier's HEM-side aggregation doesn't apply the same segment-id blocklist that the MAID side does (per filter_* table), HEM Eyeota delivery may include blocked seg_ids that MAID excludes.
AGG_MAID_HD lineage (new, 2026-04-23). Kelvin created AGG_MAID_HD external table + 10 country-specific BD${ISO3}260423 boundary-data filter tables. Suspect this is a future replacement / refinement of the AGG_MAID → FAGG_MAID flow that introduces per-country sanity caps. Behaviour TBD; runbook to be updated when the new lineage stabilises.
External tables are single-month views. AGG_MAID and AGG_HEM only register one date partition at a time. Historical months sit on S3 (s3://agg-data.huskyai.com/agg-{maid,hem}/date=YYYYMM/) but are not query-visible. If you need to re-build a prior month's FAGG_MAID, you have to ALTER EXTERNAL TABLE ... REFRESH first. Snowflake-native FAGG_MAID${YYYYMM} snapshots persist 12+ months back.

Automation status (updated 2026-05-14)¶

Sprint 1 SHIPPED (artifacts): all 4 steps now wrapped in a Snowflake Notebook artifact HUSKY_DATA.OPS.MONTHLY_EYEOTA_NB_TEMPLATE (open in Snowsight). Each step pairs with OPS.LOG_STEP() + OPS.CHECK_BASELINE() for assertion-based verification (verdict ✅/🟡/🔴 vs trailing-3-month median).

For the 6月 cycle (data month 202604), Kelvin (or Benjamin) clones the template → MONTHLY_EYEOTA_NB_202604 → edits Cell 2 → runs cells top-to-bottom. Manual SQL fallback below remains the documented disaster-recovery path.

Manual fallback (this playbook's Steps 1-4) stays as the canonical source-of-truth SQL. The Notebook is a thin wrapper; if Notebook fails or feels uncomfortable, fall back to copy-pasting SQL from this doc into a Snowsight Worksheet.

The Step-4-only TASK stub below is preserved for reference but is superseded by the broader Sprint-1 plan in the strategy doc, which wraps Steps 1-4 together with row-count assertions:

-- Run the 7th of each month at 06:00 UTC; wrap the three COPY INTOs in a stored procedure
CREATE OR REPLACE PROCEDURE HUSKY_DATA.PRIVATE.UPLOAD_EYEOTA_MONTHLY()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    last_month     VARCHAR := TO_CHAR(DATEADD(MONTH, -1, CURRENT_DATE), 'YYYYMM');
    run_date       VARCHAR := TO_CHAR(CURRENT_DATE, 'YYYYMMDD');
    maid_table     VARCHAR := 'HUSKY_DATA.PRIVATE.FAGG_MAID' || :last_month;
    hem_table      VARCHAR := 'HUSKY_DATA.PRIVATE.HEM_' || :last_month;
    -- ... execute the three COPY INTOs with EXECUTE IMMEDIATE, dynamically substituting :last_month and :run_date
BEGIN
    -- Android / iOS / HEM COPY INTO blocks here, each wrapped in EXECUTE IMMEDIATE
    RETURN 'Uploaded ' || :last_month || ' to ' || :run_date || '/';
END;
$$;

CREATE OR REPLACE TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_UPLOAD
    WAREHOUSE = DEV
    SCHEDULE  = 'USING CRON 0 6 7 * * UTC'   -- 06:00 UTC on the 7th of each month
    COMMENT   = 'Monthly Eyeota upload. See docs/playbooks/monthly-eyeota-upload.md'
AS
    CALL HUSKY_DATA.PRIVATE.UPLOAD_EYEOTA_MONTHLY();

ALTER TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_UPLOAD SUSPEND;  -- start suspended; Kelvin resumes when comfortable

Kelvin's decision: activate automation only after a dry-run confirms it produces the exact same output as his manual run for at least one month. Not urgent — manual is working.

Changelog¶

2026-04-21 — Runbook created from reverse-engineering Kelvin's 2026-04-06 COPY INTO history. Latest month shipped: 202602 → 20260407 run date. No prior written playbook existed. Automation stub included but not activated. Captured Step 4 only (the COPY INTO ship-out); Steps 1-3 (FAGG_MAID build, SGP scrub, HEM_ build) remained undocumented.
2026-04-26 — Steps 1-3 reverse-engineered from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and added to this runbook: FAGG_MAID CTAS with USA/non-USA filter branching, SGP allowlist DELETE (~1.2-1.4B rows/month removed), HEM_ INSERT with HUSKY_ID_DB3 join. "F" prefix meaning explained (Filtered = DISTINCT + seg blocklist + SGP scrub). Open questions section added for HEM asymmetry and AGG_MAID_HD evolution. Pipeline overview block added.