CMP Scanner — Pre/Post Consent Diff

Last updated: 2025-09-14 (Consolidated under docs/cmp)

cmp-scanner is a lightweight Playwright-based tool that scans a URL in two phases (pre‑consent and post‑consent), then reports third‑party host and cookie diffs. It optionally classifies new hosts via the CMP Registry. A local Scan API is also available and is used by the Scanner Tool UI.

Install Playwright browser

npx playwright install chromium

Usage

# Basic
pnpm --filter cmp-scanner scan https://example.com tmp/report.json

# With classification via registry (point to API base `/api`)
CMP_REGISTRY_URL=http://localhost:3318 pnpm --filter cmp-scanner scan https://example.com tmp/report.json

# Pass site_key (for site-aware overrides)
pnpm --filter cmp-scanner scan https://example.com tmp/report.json --site-key=DEV_SITE_KEY

# Simulate Global Privacy Control (GPC)
pnpm --filter cmp-scanner scan https://example.com tmp/report.json --gpc=1

CMP_REGISTRY_URL should point to the base URL of your running Registry API (e.g., http://localhost:3318). When it is set, the scanner calls GET /v1/classify?host=…&site_key=… for each host newly observed after consent. Pass the site key with --site-key=… or via SCAN_SITE_KEY.
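
As an illustration of the classify call described above, the request URL could be assembled roughly like this. This is a minimal sketch, not the scanner's actual code: the helper name buildClassifyUrl is hypothetical, and it assumes the path is appended directly to the CMP_REGISTRY_URL base and that site_key is simply omitted when no site key is given.

```typescript
// Hypothetical helper: builds the GET /v1/classify URL for one host.
// Assumption: the path is appended directly to the CMP_REGISTRY_URL base.
function buildClassifyUrl(registryBase: string, host: string, siteKey?: string): string {
  // Trim trailing slashes so "http://localhost:3318/" and "http://localhost:3318" behave alike.
  const url = new URL(registryBase.replace(/\/+$/, "") + "/v1/classify");
  url.searchParams.set("host", host);
  // Assumption: site_key is left out when no site key was supplied.
  if (siteKey) url.searchParams.set("site_key", siteKey);
  return url.toString();
}

console.log(buildClassifyUrl("http://localhost:3318", "www.google-analytics.com", "DEV_SITE_KEY"));
// prints "http://localhost:3318/v1/classify?host=www.google-analytics.com&site_key=DEV_SITE_KEY"
```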

What it does

  1. Phase 1: Pre‑consent
  • Loads the page and waits for network idle
  • Records document.cookie, response Set-Cookie flags, and all requested hostnames
  2. Grant consent
  • Tries window.DWConsent.setAll(true)
  • Fallback: heuristically clicks a visible button that looks like “Accept all”
  3. Phase 2: Post‑consent
  • Waits briefly, then records the same data again
  4. Diff and classification
  • Computes newly observed third‑party hosts and cookie names
  • Classifies new hosts via the Registry (or a basic regex fallback when CMP_REGISTRY_URL is unset)
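
The diff step can be pictured as a set difference between the two phase snapshots. The sketch below is an assumption about how such a diff might be computed, not the scanner's implementation: the PhaseSnapshot type and the helpers cookieNamesOf, newItems, and diffPhases are hypothetical names, and cookie names are taken from both document.cookie and the Set-Cookie records.

```typescript
// Hypothetical snapshot of one phase, reduced to the fields the diff needs.
interface PhaseSnapshot {
  jsCookies: string;               // raw document.cookie string, e.g. "a=1; b=2"
  setCookies: { name: string }[];  // cookies seen in Set-Cookie response headers
  thirdParty: string[];            // third-party hostnames requested in this phase
}

// Extract cookie names from a document.cookie style string.
function cookieNamesOf(jsCookies: string): string[] {
  return jsCookies
    .split(";")
    .map((pair) => {
      const eq = pair.indexOf("=");
      return (eq === -1 ? pair : pair.slice(0, eq)).trim();
    })
    .filter((name) => name.length > 0);
}

// Items present in `after` but not in `before`, de-duplicated and sorted.
function newItems(before: string[], after: string[]): string[] {
  const seen = new Set(before);
  return Array.from(new Set(after)).filter((item) => !seen.has(item)).sort();
}

function diffPhases(pre: PhaseSnapshot, post: PhaseSnapshot) {
  const allNames = (p: PhaseSnapshot) =>
    cookieNamesOf(p.jsCookies).concat(p.setCookies.map((c) => c.name));
  return {
    newThirdPartyHosts: newItems(pre.thirdParty, post.thirdParty),
    newCookieNames: newItems(allNames(pre), allNames(post)),
  };
}

const preExample: PhaseSnapshot = {
  jsCookies: "session=abc",
  setCookies: [{ name: "session" }],
  thirdParty: ["www.googletagmanager.com"],
};
const postExample: PhaseSnapshot = {
  jsCookies: "session=abc; _ga=GA1.2; _gid=GA1.3",
  setCookies: [{ name: "session" }],
  thirdParty: ["www.googletagmanager.com", "www.google-analytics.com"],
};
console.log(JSON.stringify(diffPhases(preExample, postExample)));
// prints {"newThirdPartyHosts":["www.google-analytics.com"],"newCookieNames":["_ga","_gid"]}
```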

Report shape

{
  "scannedAt": "2025-09-12T12:34:56.789Z",
  "url": "https://example.com",
  "finalUrl": "https://consent.google.com/...",
  "redirectChain": [{ "from": "…", "to": "…", "status": 302 }],
  "pre": {
    "jsCookies": "...",
    "setCookies": [
      { "url": "...", "name": "...", "secure": true, "httponly": false, "samesite": "Lax" }
    ],
    "thirdParty": ["www.googletagmanager.com", "..."]
  },
  "post": {
    /* same fields */
  },
  "diff": {
    "newThirdPartyHosts": ["www.google-analytics.com"],
    "newThirdPartyClassified": {
      "www.google-analytics.com": {
        "host": "…",
        "category": "analytics",
        "reasons": ["dataset", "override?"]
      }
    },
    "newCookieNames": ["_ga", "_gid"]
  }
}
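
If you consume the report from TypeScript, types along the following lines may help. They are inferred from the example above only, so treat the field types (for instance samesite as a plain string) and the names ScanReport and summarizeReport as assumptions rather than a published schema.

```typescript
// Types inferred from the example report; not an official schema.
interface SetCookieRecord {
  url: string;
  name: string;
  secure: boolean;
  httponly: boolean;
  samesite: string; // e.g. "Lax"; the full value set is not documented here
}

interface RedirectHop {
  from: string;
  to: string;
  status: number;
}

interface PhaseData {
  jsCookies: string;
  setCookies: SetCookieRecord[];
  thirdParty: string[];
}

interface HostClassification {
  host: string;
  category: string;
  reasons: string[];
}

interface ScanReport {
  scannedAt: string; // ISO 8601 timestamp
  url: string;
  finalUrl: string;
  redirectChain: RedirectHop[];
  pre: PhaseData;
  post: PhaseData;
  diff: {
    newThirdPartyHosts: string[];
    newThirdPartyClassified: Record<string, HostClassification>;
    newCookieNames: string[];
  };
}

// Hypothetical helper: one-line summary of the diff section.
function summarizeReport(report: ScanReport): string {
  const { newThirdPartyHosts, newCookieNames } = report.diff;
  return `${report.url}: ${newThirdPartyHosts.length} new third-party host(s), ` +
    `${newCookieNames.length} new cookie name(s)`;
}
```

A report written to tmp/report.json could then be parsed with JSON.parse and cast to ScanReport before being passed to summarizeReport.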

See docs/cmp/scan-api.md for the HTTP API variant and curl examples. The UI at /scanner-tool displays the finalUrl and any redirect hops so you can see consent interstitials (e.g., Google under GPC).

Interpreting Results

  • GPC redirects to consent pages
    • When GPC is simulated (e.g., --gpc=1), some sites (e.g., Google) may redirect to consent.google.com. This is expected; check finalUrl and redirectChain to confirm.
  • Client‑side cookies appear as not HttpOnly
    • Cookies observed only via document.cookie were set by client‑side script and cannot be HttpOnly, because HttpOnly cookies are never exposed to document.cookie. The report highlights these as an issue for awareness.
  • No third‑party hosts after consent
    • If the banner blocks third‑party loading until an explicit accept, you may still see zero new third‑party hosts post‑consent. Verify the user flow and timing: some sites load third parties only after additional navigation or interaction.
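
The first two checks above can be approximated in a few lines when post-processing a report. This is a rough sketch under stated assumptions: landedOnDifferentHost and scriptOnlyCookieNames are hypothetical helpers, a redirect is detected purely by comparing hostnames, and a cookie counts as client-set when its name appears in jsCookies but in no Set-Cookie record.

```typescript
// True when the scan ended on a different hostname than it started on,
// e.g. a redirect to a consent interstitial.
function landedOnDifferentHost(url: string, finalUrl: string): boolean {
  return new URL(url).hostname !== new URL(finalUrl).hostname;
}

// Cookie names visible in a document.cookie string but absent from the
// Set-Cookie records. Assumption: these were set by client-side script.
function scriptOnlyCookieNames(jsCookies: string, setCookies: { name: string }[]): string[] {
  const fromHeaders = new Set(setCookies.map((c) => c.name));
  return jsCookies
    .split(";")
    .map((pair) => {
      const eq = pair.indexOf("=");
      return (eq === -1 ? pair : pair.slice(0, eq)).trim();
    })
    .filter((name) => name.length > 0 && !fromHeaders.has(name));
}

console.log(landedOnDifferentHost("https://example.com", "https://consent.google.com/"));
// prints true
console.log(scriptOnlyCookieNames("_ga=GA1.2; session=abc", [{ name: "session" }]));
// prints [ '_ga' ]
```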

Tips

  • Prefer real user flows (banner visible) for accurate results.
  • Use the Portal’s “Copy verify cmds” to quickly run header/cookie checks and the scanner with classification enabled.

Manual Gate (hosted workflows disabled)

Run the scanner locally against your canary URL and compare the diffs:

pnpm --filter cmp-scanner scan https://example.com tmp/report.json
pnpm --filter cmp-scanner scan https://example.com tmp/report-gpc.json --gpc=1

Compare diff.newThirdPartyHosts and diff.newCookieNames between runs and adjust CI_SCANNER_ALLOW_NEW_HOSTS / CI_SCANNER_ALLOW_NEW_COOKIES only when explicitly approved.
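
The comparison can be scripted once both reports exist. The sketch below is illustrative only: it assumes the allow variables hold comma-separated lists, which this page does not specify, and parseAllowlist and unapprovedItems are hypothetical helper names.

```typescript
// Assumption: allowlists are comma-separated strings (format not specified in this doc).
function parseAllowlist(value: string | undefined): Set<string> {
  return new Set(
    (value ?? "")
      .split(",")
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0)
  );
}

// Entries from a report's diff that are not covered by the allowlist.
function unapprovedItems(found: string[], allow: Set<string>): string[] {
  return found.filter((item) => !allow.has(item));
}

// Example values standing in for CI_SCANNER_ALLOW_NEW_HOSTS and the two reports' diffs.
const exampleAllowedHosts = parseAllowlist("www.googletagmanager.com, www.google-analytics.com");
const exampleBaselineHosts = ["www.google-analytics.com", "cdn.example-ads.test"];
const exampleGpcHosts = ["cdn.example-ads.test"];

console.log(unapprovedItems(exampleBaselineHosts, exampleAllowedHosts)); // [ 'cdn.example-ads.test' ]
console.log(unapprovedItems(exampleGpcHosts, exampleAllowedHosts));      // [ 'cdn.example-ads.test' ]
```

Any host or cookie name that this kind of check reports is a candidate for review, not for automatic addition to the allowlist.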