
CMP Scan API — Local Scanner Service

Last updated: 2025-09-14

The Scan API runs a Playwright-based two-phase scan (pre/post consent) and returns a single JSON report with security analysis and third‑party diffs. It is intended for local use by the Scanner Tool and CI.

Endpoints

  • GET /health

    • Returns: { ok: true, name: 'cmp-scan-api', version: 1, uptimeMs }
  • GET /health/ready

    • Attempts a Chromium launch/close to verify readiness.
    • Returns: { ok: true|false, name, version, details?, uptimeMs }
  • POST /scan

    • Body:
      • url string (required)
      • siteKey string (optional) — forwarded to classification
      • gpc boolean (optional) — simulates Global Privacy Control
      • registry string (optional) — CMP Registry base, e.g. http://localhost:3318/api
      • captureScreenshot boolean (optional, default false) — capture a full‑page PNG
      • captureTrace boolean (optional, default false) — capture a Playwright trace.zip
    • Returns: ScanReport (see below)
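The request body above can be written down as a type, with a small client-side check before sending. This is a sketch only: the type name ScanRequest and the function validateScanRequest are hypothetical, and the service's own validation rules may differ.

```typescript
// Fields transcribed from the POST /scan body list; only `url` is required.
type ScanRequest = {
  url: string;
  siteKey?: string;
  gpc?: boolean;
  registry?: string;
  captureScreenshot?: boolean;
  captureTrace?: boolean;
};

// Hypothetical pre-flight check; returns a list of problems (empty when valid).
function validateScanRequest(body: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const url = body["url"];
  if (typeof url !== "string" || url.length === 0) {
    errors.push("url is required and must be a string");
  }
  for (const key of ["siteKey", "registry"]) {
    if (key in body && typeof body[key] !== "string") {
      errors.push(`${key} must be a string`);
    }
  }
  for (const key of ["gpc", "captureScreenshot", "captureTrace"]) {
    if (key in body && typeof body[key] !== "boolean") {
      errors.push(`${key} must be a boolean`);
    }
  }
  return errors;
}

const example: ScanRequest = {
  url: "https://www.google.com",
  gpc: true,
  captureScreenshot: true,
};
console.log(validateScanRequest(example));
```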

Behavior

  • Navigation

    • Uses a stable Chrome-like UA and Accept-Language: en-GB,en;q=0.9.
    • When gpc=true, sets Sec-GPC: 1 header and defines navigator.globalPrivacyControl = true.
    • Tracks redirect chain and records the first main-document response headers, even when redirected (e.g., to consent.google.com).
    • Exposes finalUrl in the report.
  • Third‑party detection

    • PSL-aware, schemeful site comparison via tldts (eTLD+1). The top-level site is derived from finalUrl.
  • Cookie analysis

    • Parses Set-Cookie flags and reports issues: SameSite=None without Secure, non‑Secure on HTTPS, third‑party without SameSite=None, and __Secure-/__Host- prefix requirements.
    • Any cookie observed only via document.cookie is considered client‑side and thus not HttpOnly.
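The cookie checks listed above could be expressed roughly as follows. This is an illustrative sketch, not the scanner's implementation: the CookieFlags shape, the cookieIssues function, and the issue strings are all assumptions. The prefix rules follow the usual cookie-prefix requirements (__Secure- needs Secure; __Host- needs Secure, Path=/ and no Domain attribute).

```typescript
// Hypothetical shape for the flags parsed from one Set-Cookie header.
type CookieFlags = {
  name: string;
  secure: boolean;
  sameSite?: "Strict" | "Lax" | "None";
  path?: string;
  hasDomainAttr: boolean; // true when a Domain= attribute was present
  thirdParty: boolean; // true when the cookie's site differs from the top-level site
};

// Returns the issues found for one cookie; an empty array means no findings.
function cookieIssues(c: CookieFlags, pageIsHttps: boolean): string[] {
  const issues: string[] = [];
  if (c.sameSite === "None" && !c.secure) {
    issues.push("SameSite=None without Secure");
  }
  if (pageIsHttps && !c.secure) {
    issues.push("not Secure on HTTPS");
  }
  if (c.thirdParty && c.sameSite !== "None") {
    issues.push("third-party without SameSite=None");
  }
  if (c.name.startsWith("__Secure-") && !c.secure) {
    issues.push("__Secure- prefix requires Secure");
  }
  if (
    c.name.startsWith("__Host-") &&
    (!c.secure || c.path !== "/" || c.hasDomainAttr)
  ) {
    issues.push("__Host- prefix requires Secure, Path=/ and no Domain");
  }
  return issues;
}

console.log(
  cookieIssues(
    { name: "id", secure: false, sameSite: "None", hasDomainAttr: true, thirdParty: true },
    true,
  ),
);
```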

Response shape (excerpt)

{
  "scannedAt": "2025-09-14T16:06:55.643Z",
  "url": "https://www.google.com",
  "finalUrl": "https://consent.google.com/...", // may equal input if no redirect
  "redirectChain": [
    { "from": "https://www.google.com", "to": "https://consent.google.com/...", "status": 302, "location": "..." }
  ],
  "pre": {
    "jsCookies": "...",
    "setCookies": [...], // parsed from Set-Cookie response headers
    "storageCookies": [ // from Playwright context.cookies(); never includes values
      { "name":"_ga","domain":".example.com","secure":true,"httpOnly":false,"sameSite":"Lax","expires":1735689600,"firstParty":true }
    ],
    "classifiedCookies": [ // result of POST /api/v1/classify-cookies (DB-backed)
      { "name":"_ga","domain":".example.com","vendor":"Google Analytics","category":"analytics","retention":"~2 years","flags":{"secure":true,"httpOnly":false,"sameSite":"Lax"},"firstParty":true,"confidence":90,"evidence":["pattern:_ga"] }
    ],
    "thirdParty": ["..."]
  },
  "post": { /* same fields as pre */ },
  "status": "ok",
  "durationMs": 1534,
  "documentHeaders": { "strict-transport-security": "max-age=...", "referrer-policy": "..." },
  "artifacts": { "screenshotUrl": "/artifacts/scan/1694690000-abcd12/screenshot.png", "traceUrl": "/artifacts/scan/1694690000-abcd12/trace.zip" },
  "diff": {
    "newThirdPartyHosts": ["..."],
    "newThirdPartyClassified": { "...": { "category": "analytics", "reasons": ["dataset"] } },
    "newCookieNames": ["_ga"],
    "preConsentViolations": [ // non‑essential cookies detected pre-consent
      { "name":"_fbp","domain":".example.com","category":"advertising","vendor":"Meta" }
    ]
  },
  "summary": {
    "newThirdPartyCount": 0,
    "newThirdPartyByCategory": {},
    "cookies": { "preSetCount": 0, "postSetCount": 0, "newNameCount": 0 }
  },
  "analysis": {
    "headers": { "present": ["HSTS"], "missing": ["CSP", "Referrer-Policy", "X-Content-Type-Options"], "issues": ["Missing Content-Security-Policy"] },
    "cookies": { "issues": ["Cookie _ga: created client‑side (not HttpOnly)"] }
  }
}
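Since the report is meant to be consumed by CI, one possible use is a pass/fail gate on status and diff.preConsentViolations. The sketch below types only the fields it reads from the excerpt above; the names ReportExcerpt and gateReport, and the pass/fail policy itself, are assumptions rather than part of the API.

```typescript
// Subset of the ScanReport excerpt that this gate actually reads.
type Violation = { name: string; domain: string; category: string; vendor?: string };
type ReportExcerpt = {
  status: string;
  finalUrl: string;
  diff: { preConsentViolations: Violation[] };
};

// Hypothetical CI policy: fail on a non-ok scan or on any pre-consent violation.
function gateReport(report: ReportExcerpt): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (report.status !== "ok") {
    reasons.push(`scan status is ${report.status}`);
  }
  for (const v of report.diff.preConsentViolations) {
    reasons.push(`${v.category} cookie ${v.name} (${v.domain}) set before consent`);
  }
  return { pass: reasons.length === 0, reasons };
}

const sample: ReportExcerpt = {
  status: "ok",
  finalUrl: "https://example.com/",
  diff: {
    preConsentViolations: [
      { name: "_fbp", domain: ".example.com", category: "advertising", vendor: "Meta" },
    ],
  },
};
console.log(gateReport(sample));
```

A CI step could exit non-zero when pass is false and print reasons as the failure message.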

Run locally

pnpm -w nx run cmp-scan-api:build
PORT=3006 node dist/apps/cmp/scan-api/src/main.js

# Health
curl -sS http://localhost:3006/health | jq

# Scan (baseline)
curl -sS -X POST http://localhost:3006/scan \
-H 'content-type: application/json' \
-d '{"url":"https://www.google.com","siteKey":"DEV_SITE_KEY","gpc":false,"registry":"http://localhost:3318/api","captureScreenshot":true}' \
| jq '{finalUrl, redirectChain, status, durationMs, artifacts, preCookies: .pre.classifiedCookies[0:3], violations: .diff.preConsentViolations}'

# Scan (GPC=true)
curl -sS -X POST http://localhost:3006/scan \
-H 'content-type: application/json' \
-d '{"url":"https://www.google.com","siteKey":"DEV_SITE_KEY","gpc":true,"registry":"http://localhost:3318/api","captureTrace":true}' \
| jq '{finalUrl, redirectChain, status, durationMs, artifacts, preCookies: .pre.classifiedCookies[0:3], violations: .diff.preConsentViolations}'
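The same calls can be made from a script. The sketch below mirrors the curl examples; buildScanRequest, runScan, and the FetchLike type are hypothetical helpers, and the HTTP client is injected so the snippet does not depend on a particular runtime (on a recent Node.js version, the global fetch can be passed in).

```typescript
// Optional body fields from the POST /scan description.
type ScanOptions = {
  siteKey?: string;
  gpc?: boolean;
  registry?: string;
  captureScreenshot?: boolean;
  captureTrace?: boolean;
};

type HttpRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Minimal structural type for a fetch-compatible function (assumption).
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the request that the curl examples send; trailing slashes on base are trimmed.
function buildScanRequest(base: string, target: string, opts: ScanOptions = {}): HttpRequest {
  return {
    url: `${base.replace(/\/+$/, "")}/scan`,
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ url: target, ...opts }),
  };
}

// Sends the request and returns the parsed JSON report.
async function runScan(fetchImpl: FetchLike, req: HttpRequest): Promise<unknown> {
  const res = await fetchImpl(req.url, {
    method: req.method,
    headers: req.headers,
    body: req.body,
  });
  if (!res.ok) throw new Error(`scan failed with HTTP ${res.status}`);
  return res.json();
}

const req = buildScanRequest("http://localhost:3006", "https://www.google.com", { gpc: true });
console.log(req.url, req.body);
```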

Notes

  • The dev server auto-selects a free port among 3005–3007 if PORT is not set. The Scanner Tool tries these ports by default.
  • CORS is permissive for local docs use; do not expose this service publicly without rate limits/ACLs.
  • Cookie values are never captured; only names/flags/retention and derived classification are emitted.

Security header quick checks

Quickly inspect common security headers for any public URL:

URL=https://example.com
curl -sSI "$URL" | grep -Ei "strict-transport|content-security|referrer-policy|x-content-type|x-frame|permissions-policy"
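The same check can be done programmatically on a set of response headers, producing present/missing lists like those in analysis.headers. This is a sketch under assumptions: the label-to-header mapping is inferred from the labels in the response excerpt, and checkHeaders is a hypothetical helper, not the scanner's code.

```typescript
// Report label paired with the lower-cased header name it refers to.
const HEADER_CHECKS: [string, string][] = [
  ["HSTS", "strict-transport-security"],
  ["CSP", "content-security-policy"],
  ["Referrer-Policy", "referrer-policy"],
  ["X-Content-Type-Options", "x-content-type-options"],
];

// Header names are case-insensitive, so keys are lower-cased before comparing.
function checkHeaders(headers: Record<string, string>): { present: string[]; missing: string[] } {
  const seen = new Set(Object.keys(headers).map((k) => k.toLowerCase()));
  const present: string[] = [];
  const missing: string[] = [];
  for (const [label, header] of HEADER_CHECKS) {
    if (seen.has(header)) present.push(label);
    else missing.push(label);
  }
  return { present, missing };
}

console.log(checkHeaders({ "Strict-Transport-Security": "max-age=31536000" }));
```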

Troubleshooting

  • No finalUrl or empty headers

    • Ensure the page reached a main-document response; the scanner records the first main-frame response headers seen. For highly dynamic sites, try re-running or using a different locale.
  • No third‑party hosts on Google domains under GPC

    • Under Sec-GPC: 1, Google may serve a consent interstitial; finalUrl/redirectChain will reflect that. This is expected.

See also

  • Scanner Tool UI: accessible under /scanner-tool in the docs site. It calls this API directly and displays finalUrl and the redirect chain.
  • Registry classification API: GET /v1/classify?host=… (docs/cmp/registry.md)

Deployment (Kubernetes)

  • ArgoCD application: kubernetes/cmp/scan-api/cmp-scan-api-argo.yaml
    • Tracks Helm chart at charts/cmp-scan-api on branch main
    • Automated prune/self-heal enabled
  • Helm chart: charts/cmp-scan-api
    • values.yaml exposes image, service, ingress, autoscaling, resources
    • Production values: charts/cmp-scan-api/values-prod.yaml
    • Ingress (UAT) is enabled with host cmp-scan-api.uat.digiwedge.com and TLS secret cmp-scan-api-uat-tls

Registry integration

Set the Registry env var CMP_SCAN_API_BASE to the public base of this service so that the Portal’s health check and scan runner can reach it:

# Example (UAT)
CMP_SCAN_API_BASE=https://cmp-scan-api.uat.digiwedge.com

The Registry public aggregator GET /api/health/scan-api uses this base to check /health and /health/ready on the Scan API.
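As an illustration of how the two health URLs follow from the configured base, a minimal sketch is shown below. The helper name scanApiHealthUrls is hypothetical, and the Registry's real aggregator code is not part of this document.

```typescript
// Derives the two health endpoints from a CMP_SCAN_API_BASE-style value.
// Trailing slashes are trimmed so the paths are not doubled.
function scanApiHealthUrls(base: string): { health: string; ready: string } {
  const root = base.replace(/\/+$/, "");
  return { health: `${root}/health`, ready: `${root}/health/ready` };
}

console.log(scanApiHealthUrls("https://cmp-scan-api.uat.digiwedge.com/"));
```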