Monthly Eyeota upload — runbook¶
Purpose: Every month after the MAID + HEM aggregations finalize, Husky uploads the latest month's segment data to Eyeota's S3 intake so Eyeota can sync our segments into their global audience marketplace. This is the revenue engine for the unbranded Eyeota channel (~US$69K/month per Feb 2026).
Principle: Keep this process boring and reproducible. If Kelvin is out, anyone with Snowflake ACCOUNTADMIN should be able to follow this page and ship the month.
Schedule¶
| When | What |
|---|---|
| 1st–3rd of each month | Data pipeline produces HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} and HUSKY_DATA.PRIVATE.HEM_${YYYYMM} for the prior month |
| 4th–7th of each month | This runbook executes — Kelvin uploads android / ios / hem partitions to Eyeota stage |
| 7th–15th of each month | Eyeota ingests the files into their global catalog; Husky segments become available to Eyeota buyers |
| 14th–30th of next month | Eyeota supplier earnings report arrives (still titled "Global Next Buzz Media" until contract migration completes — see docs/reference/corporate-entities.md) |
Prerequisites¶
- Snowflake login with
ACCOUNTADMINrole (or whatever role has SELECT onHUSKY_DATA.PRIVATE.*and WRITE on@HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER) DEVwarehouse (or any X-Small warehouse) — each step takes ~5-15 min- Supplier (Databricks-side) has finalised the new month's aggregation in
s3://agg-data.huskyai.com/agg-maid/date=${YYYYMM}/andagg-hem/date=${YYYYMM}/. Confirm viaaws s3 ls s3://agg-data.huskyai.com/agg-maid/date=${YYYYMM}/ --profile huskyaibefore starting. - External tables
AGG_MAIDandAGG_HEMneed their partition for${YYYYMM}registered (the externals are single-month views;ALTER EXTERNAL TABLE AGG_MAID REFRESH;if not auto-refreshing).
Pipeline overview¶
Step 1: CTAS FAGG_MAID${YYYYMM} ← FROM AGG_MAID (external) ~5 min
Step 2: DELETE SGP rows not in agg_maid_sgp_filter allowlist ~3 min
Step 3: CREATE + INSERT HEM_${YYYYMM} ← FROM AGG_HEM × HUSKY_ID_DB3 ~5 min
Step 4: 3 × COPY INTO @HUSKY_EYEOTA_UPLOADER/{date}/{android,ios,hem}/ ~15-30 min
Steps 1-3 build the Snowflake-native materialised tables. Step 4 ships them to Eyeota's S3 stage. Total wall-clock ~30-50 min on DEV warehouse.
Variables to set per run¶
-- Source month to ship (the month whose data was aggregated):
SET YYYYMM = '202602'; -- e.g. Feb 2026 data (shipped early Apr)
-- Run date (goes into the S3 prefix so Eyeota knows which batch is which):
SET RUN_DATE = '20260407'; -- typically today's date in YYYYMMDD
Step 1 — Build FAGG_MAID${YYYYMM} (Filtered Aggregated MAID)¶
Why "F" prefix: AGG_MAID is the supplier-delivered raw aggregation (multi-row per user, ~4B rows). FAGG_MAID adds three Husky-side filters: DISTINCT user_id dedup, segment-id blocklist via
ARRAY_EXCEPT, and SGP allowlist (Step 2). Result: one cleaned month-snapshot table.
CREATE OR REPLACE TABLE HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} AS
SELECT DISTINCT
user_id,
CASE
WHEN country_part='USA' THEN
ARRAY_TO_STRING(
ARRAY_EXCEPT(seg, (SELECT ARRAY_AGG(seg_id) FROM HUSKY_DATA.PRIVATE.filter_260302_3_300w)),
','
)
ELSE
ARRAY_TO_STRING(
ARRAY_EXCEPT(seg, (SELECT ARRAY_AGG(seg_id) FROM HUSKY_DATA.PRIVATE.filter_260424_misc)),
','
)
END AS seg_list,
country_part,
date_part,
platform_part
FROM HUSKY_DATA.PRIVATE.AGG_MAID;
- USA and non-USA rows go through different
filter_*blocklists (current versions:filter_260302_3_300wfor USA,filter_260424_miscfor everyone else, as of 2026-04 cycle). Seedocs/strategy/2026-04-26-pipeline-automation-strategy.mdfor the proposedFILTER_REGISTRYcapturing why each filter exists. - Expected row count: ~3.9-4.1B for a normal full month. Sharply lower means the upstream supplier load is incomplete or the filter version is wrong.
- Today's row count is logged for trend monitoring (see Step 6).
Step 2 — SGP sanity-cap scrub¶
Why this step exists: raw
AGG_MAIDcontains ~1.2-1.4B SGP-tagged MAIDs per month, ~200x Singapore's population (5.9M). Cause is likely VPN/proxy mis-tagging, transit-traveller GPS triggers, ad-network spoofing, or supplier quality issues.agg_maid_sgp_filteris a curated allowlist of "real" SGP user_ids (~8.16M rows, close to Singapore's actual smartphone population). Anything not in the allowlist gets removed.
DELETE FROM HUSKY_DATA.PRIVATE.FAGG_MAID${YYYYMM} a
WHERE a.country_part = 'SGP'
AND NOT EXISTS (
SELECT 1 FROM HUSKY_DATA.PRIVATE.agg_maid_sgp_filter b
WHERE a.user_id = b.user_id
);
- Typical magnitude: ~50-60M rows deleted (revised 2026-05-14 baseline). 6-cycle median = 55M (cycles 202510→202603 backfilled to
HUSKY_DATA.OPS.PIPELINE_RUN_LOG). Earlier doc cited "1.2-1.4B" — that was supplier-upstream behaviour pre-2025; supplier appears to have fixed SGP inflation at source. If this step deletes <30M or >100M, investigate — supplier behaviour shifted again. - The
agg_maid_sgp_filterallowlist itself is rebuilt periodically fromAGG_MAID_SL(the supplier'sagg-maid_sizeLimitedActiveSegmentsprefix, which contains only "active" users with multi-segment lists). See~/Projects/.../memory/project_agg_maid_vs_sl_difference.mdfor empirical row-count evidence. - Per-country expansion is in flight: 10
BD${ISO3}260423boundary-data filter tables created 2026-04-23 (HKG, IDN, IND, JPN, KOR, MYS, PHL, SGP, THA, VNM) suggest SGP-style sanity caps are extending to other markets in upcoming runs.
Step 3 — Build HEM_${YYYYMM} (Hashed Email rollup)¶
Why no "F" prefix: the HEM transform is identity remapping (MAID → email via
HUSKY_ID_DB3crosswalk), not filtering. There is currently no segment-blocklist filter and no SGP scrub on the HEM side — see Open Questions at the bottom.
CREATE OR REPLACE TABLE HUSKY_DATA.PRIVATE.HEM_${YYYYMM} (
hem_id VARCHAR,
country VARCHAR,
seg_list VARCHAR
);
INSERT INTO HUSKY_DATA.PRIVATE.HEM_${YYYYMM}
SELECT b.h_email,
a.country_part,
a.seg_list
FROM HUSKY_DATA.PRIVATE.AGG_HEM AS a
INNER JOIN HUSKY_DATA.PRIVATE.HUSKY_ID_DB3 AS b
ON a.aaid = b.hem_id;
- Expected row count: ~900M-950M for a normal month.
HUSKY_ID_DB3is Husky's MAID ↔ hashed-email crosswalk (~1B rows). It's the join key that converts the supplier's MAID-keyedAGG_HEMinto Husky's email-keyed delivery.
Step 4 — Ship to Eyeota stage (3 COPY INTO statements)¶
All three must succeed. Run sequentially; each takes 5-15 minutes on DEV warehouse.
4.1 MAID Android¶
COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/android/
FROM (
SELECT user_id,
CASE WHEN country_part = 'KOR' THEN 'KR'
ELSE LEFT(country_part, 2)
END AS country,
seg_list
FROM HUSKY_DATA.PRIVATE.FAGG_MAID{YYYYMM}
WHERE platform_part = 'ANDROID'
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;
4.2 MAID iOS¶
COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/ios/
FROM (
SELECT user_id,
CASE WHEN country_part = 'KOR' THEN 'KR'
ELSE LEFT(country_part, 2)
END AS country,
seg_list
FROM HUSKY_DATA.PRIVATE.FAGG_MAID{YYYYMM}
WHERE platform_part = 'IOS'
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;
4.3 HEM¶
COPY INTO @HUSKY_DATA.PRIVATE.HUSKY_EYEOTA_UPLOADER/{RUN_DATE}/hem/
FROM (
SELECT hem_id,
CASE WHEN country = 'KOR' THEN 'KR'
ELSE LEFT(country, 2)
END AS country,
seg_list
FROM HUSKY_DATA.PRIVATE.HEM_{YYYYMM}
)
FILE_FORMAT = (FORMAT_NAME = 'HUSKY_DATA.PRIVATE.PARQUET')
HEADER = TRUE
OVERWRITE = TRUE;
Why the KOR → KR transformation¶
Husky's internal country codes use ISO 3166-1 alpha-3 (KOR, JPN, SGP, etc.). Eyeota's intake expects ISO 3166-1 alpha-2 (KR, JP, SG, etc.). LEFT(country, 2) gives the correct alpha-2 for all APAC markets except Korea, where KOR → KO would be wrong (the alpha-2 is KR). Hence the explicit CASE branch. Do not simplify this — the KOR case is the only exception.
Verification after upload¶
Expected: three subfolders (android/, ios/, hem/), each with one or more Parquet files totaling ~500 GB combined. If any folder is missing or empty, re-run that COPY INTO.
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
Object 'FAGG_MAID${YYYYMM}' does not exist |
This month's aggregation hasn't finalized yet | Wait for the upstream pipeline; confirm with the data team |
Insufficient privileges |
Role is not ACCOUNTADMIN | USE ROLE ACCOUNTADMIN before running |
File format 'PARQUET' does not exist |
Schema qualifier missing | Use HUSKY_DATA.PRIVATE.PARQUET as shown |
| COPY INTO hangs >30 min | Warehouse too small / auto-suspend kicked in | Resize warehouse: ALTER WAREHOUSE DEV SET WAREHOUSE_SIZE = 'SMALL' for the run, reset to X-Small after |
| Eyeota reports empty file | OVERWRITE omitted and folder already contains older partial files | Use OVERWRITE = TRUE |
What NOT to do¶
- Do not rename
HUSKY_DATA.PRIVATE.FAGG_MAID*orHEM_*tables. Column namesuser_id,country_part,seg_list,platform_part,hem_id,countrymust stay exactly as-is — they are referenced by hard-coded COPY INTO statements. - Do not drop the
HUSKY_EYEOTA_UPLOADERstage or modify its S3 credentials. Eyeota reads from this bucket directly. - Do not disable the
HUSKY_DATA.PRIVATEschema or theHUSKY_EYEOTA_UPLOADERstage on a whim; a lapsed monthly upload costs ~US$69K of revenue.
Open questions / known asymmetries¶
- HEM has no SGP scrub equivalent. Steps 1-2 apply DISTINCT + segment-blocklist + SGP allowlist on MAID. Step 3 (HEM) only does the MAID→email join. If raw
AGG_HEMcarries the same SGP inflation pattern asAGG_MAID, the HEM Eyeota delivery may be shipping noise. Verify with:SELECT country, COUNT(*) FROM HEM_${YYYYMM} GROUP BY 1 ORDER BY 2 DESC;and compare against expected smartphone population per country. - HEM seg_list is unfiltered.
a.seg_listfromAGG_HEMis used as-is. If supplier's HEM-side aggregation doesn't apply the same segment-id blocklist that the MAID side does (perfilter_*table), HEM Eyeota delivery may include blocked seg_ids that MAID excludes. AGG_MAID_HDlineage (new, 2026-04-23). Kelvin createdAGG_MAID_HDexternal table + 10 country-specificBD${ISO3}260423boundary-data filter tables. Suspect this is a future replacement / refinement of the AGG_MAID → FAGG_MAID flow that introduces per-country sanity caps. Behaviour TBD; runbook to be updated when the new lineage stabilises.- External tables are single-month views.
AGG_MAIDandAGG_HEMonly register one date partition at a time. Historical months sit on S3 (s3://agg-data.huskyai.com/agg-{maid,hem}/date=YYYYMM/) but are not query-visible. If you need to re-build a prior month'sFAGG_MAID, you have toALTER EXTERNAL TABLE ... REFRESHfirst. Snowflake-nativeFAGG_MAID${YYYYMM}snapshots persist 12+ months back.
Automation status (updated 2026-05-14)¶
Sprint 1 SHIPPED (artifacts): all 4 steps now wrapped in a Snowflake Notebook artifact HUSKY_DATA.OPS.MONTHLY_EYEOTA_NB_TEMPLATE (open in Snowsight). Each step pairs with OPS.LOG_STEP() + OPS.CHECK_BASELINE() for assertion-based verification (verdict ✅/🟡/🔴 vs trailing-3-month median).
For the 6月 cycle (data month 202604), Kelvin (or Benjamin) clones the template → MONTHLY_EYEOTA_NB_202604 → edits Cell 2 → runs cells top-to-bottom. Manual SQL fallback below remains the documented disaster-recovery path.
Manual fallback (this playbook's Steps 1-4) stays as the canonical source-of-truth SQL. The Notebook is a thin wrapper; if Notebook fails or feels uncomfortable, fall back to copy-pasting SQL from this doc into a Snowsight Worksheet.
The Step-4-only TASK stub below is preserved for reference but is superseded by the broader Sprint-1 plan in the strategy doc, which wraps Steps 1-4 together with row-count assertions:
-- Run the 7th of each month at 06:00 UTC; wrap the three COPY INTOs in a stored procedure
CREATE OR REPLACE PROCEDURE HUSKY_DATA.PRIVATE.UPLOAD_EYEOTA_MONTHLY()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
last_month VARCHAR := TO_CHAR(DATEADD(MONTH, -1, CURRENT_DATE), 'YYYYMM');
run_date VARCHAR := TO_CHAR(CURRENT_DATE, 'YYYYMMDD');
maid_table VARCHAR := 'HUSKY_DATA.PRIVATE.FAGG_MAID' || :last_month;
hem_table VARCHAR := 'HUSKY_DATA.PRIVATE.HEM_' || :last_month;
-- ... execute the three COPY INTOs with EXECUTE IMMEDIATE, dynamically substituting :last_month and :run_date
BEGIN
-- Android / iOS / HEM COPY INTO blocks here, each wrapped in EXECUTE IMMEDIATE
RETURN 'Uploaded ' || :last_month || ' to ' || :run_date || '/';
END;
$$;
CREATE OR REPLACE TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_UPLOAD
WAREHOUSE = DEV
SCHEDULE = 'USING CRON 0 6 7 * * UTC' -- 06:00 UTC on the 7th of each month
COMMENT = 'Monthly Eyeota upload. See docs/playbooks/monthly-eyeota-upload.md'
AS
CALL HUSKY_DATA.PRIVATE.UPLOAD_EYEOTA_MONTHLY();
ALTER TASK HUSKY_DATA.PRIVATE.MONTHLY_EYEOTA_UPLOAD SUSPEND; -- start suspended; Kelvin resumes when comfortable
Kelvin's decision: activate automation only after a dry-run confirms it produces the exact same output as his manual run for at least one month. Not urgent — manual is working.
Changelog¶
- 2026-04-21 — Runbook created from reverse-engineering Kelvin's 2026-04-06 COPY INTO history. Latest month shipped: 202602 → 20260407 run date. No prior written playbook existed. Automation stub included but not activated. Captured Step 4 only (the COPY INTO ship-out); Steps 1-3 (FAGG_MAID build, SGP scrub, HEM_ build) remained undocumented.
- 2026-04-26 — Steps 1-3 reverse-engineered from
SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORYand added to this runbook: FAGG_MAID CTAS with USA/non-USA filter branching, SGP allowlist DELETE (~1.2-1.4B rows/month removed), HEM_ INSERT with HUSKY_ID_DB3 join. "F" prefix meaning explained (Filtered = DISTINCT + seg blocklist + SGP scrub). Open questions section added for HEM asymmetry andAGG_MAID_HDevolution. Pipeline overview block added.