Datasets & Classifier Sync
Last updated: 2025-09-14
This page documents the external datasets, environment variables, and jobs used to build the CMP classifier knowledge base. The datasets populate the following Prisma models:
- TrackerEntity: tracker/vendor organizations (WhoTracks.me, AdGuard)
- TrackerDomain: domain → canonical category mapping
- DomainEvidence: human‑readable evidence behind a classification
- VendorIAB: IAB Global Vendor List overlay (optional)
Once synced, the Classifier API at /v1/classify uses this data along with optional site/global overrides.
Configuration
- CMP_REGISTRY_DATABASE_URL (required)
  - Postgres URL for the CMP registry database.
  - Example:
    postgresql://cmp:cmp@db.local:5432/cmp?sslmode=require
- Recommended dataset sources (used by default; override if needed)
  - WTM_TRACKERS_URL (default):
    https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json
  - WTM_ENTITIES_URL (default):
    https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
  - ADG_COMPANIES_URL (default):
    https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
  - IAB_GVL_URL (default):
    https://vendorlist.consensu.org/v3/vendor-list.json
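The required/default behaviour described above can be sketched as a small resolver. This is only an illustration: `resolveDatasetConfig`, `DatasetConfig`, and `Env` are hypothetical names, and the real job may load its configuration differently. The default URLs are the ones listed in this section.

```typescript
// Default dataset URLs, as listed in the Configuration section.
const DEFAULT_DATASET_URLS = {
  WTM_TRACKERS_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json",
  WTM_ENTITIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  ADG_COMPANIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  IAB_GVL_URL: "https://vendorlist.consensu.org/v3/vendor-list.json",
} as const;

type DatasetUrlKey = keyof typeof DEFAULT_DATASET_URLS;
type Env = Record<string, string | undefined>;

interface DatasetConfig {
  databaseUrl: string;
  urls: Record<DatasetUrlKey, string>;
}

// Hypothetical helper (not the job's actual loader): fails fast when the
// required database URL is missing and falls back to the default URL for
// any dataset variable that is unset or blank.
function resolveDatasetConfig(env: Env): DatasetConfig {
  const databaseUrl = env["CMP_REGISTRY_DATABASE_URL"]?.trim();
  if (!databaseUrl) {
    throw new Error("CMP_REGISTRY_DATABASE_URL is required");
  }
  const urls = {} as Record<DatasetUrlKey, string>;
  for (const key of Object.keys(DEFAULT_DATASET_URLS) as DatasetUrlKey[]) {
    const override = env[key]?.trim();
    urls[key] = override ? override : DEFAULT_DATASET_URLS[key];
  }
  return { databaseUrl, urls };
}
```

In a Node process this would typically be called as `resolveDatasetConfig(process.env)`.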
How classification works (high level)
- Site/global overrides take precedence (ClassifierOverride).
- If no override, the domain (eTLD+1 aware) is looked up in TrackerDomain.
- Evidence from DomainEvidence is returned in the response for transparency.
The canonical categories are: analytics, advertising, functional, social, uncategorized.
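The precedence above can be illustrated with a minimal in-memory sketch. It is not the code in classifier.service.ts: the `KnowledgeBase` maps stand in for the Prisma models, `classify` and `candidateHosts` are hypothetical names, and the parent-domain walk is a simplification of real eTLD+1 handling (which needs the Public Suffix List). Falling back to uncategorized for unknown hosts, and returning no evidence for overrides, are assumptions.

```typescript
type Category = "analytics" | "advertising" | "functional" | "social" | "uncategorized";

interface ClassifyResult {
  host: string;
  category: Category;
  source: "override" | "dataset" | "none";
  evidence: string[];
}

// Hypothetical in-memory stand-ins for ClassifierOverride, TrackerDomain,
// and DomainEvidence, all keyed by host.
interface KnowledgeBase {
  overrides: Map<string, Category>;
  trackerDomains: Map<string, Category>;
  evidence: Map<string, string[]>;
}

// Candidates from most to least specific, stopping at two labels:
// "a.b.example.com" -> ["a.b.example.com", "b.example.com", "example.com"].
// Simplification: real eTLD+1 detection requires the Public Suffix List.
function candidateHosts(host: string): string[] {
  const normalized = host.trim().toLowerCase().replace(/\.$/, "");
  const labels = normalized.split(".");
  const candidates: string[] = [];
  for (let i = 0; i + 2 <= labels.length; i++) {
    candidates.push(labels.slice(i).join("."));
  }
  return candidates.length > 0 ? candidates : [normalized];
}

function classify(host: string, kb: KnowledgeBase): ClassifyResult {
  const candidates = candidateHosts(host);
  // 1. Overrides take precedence over dataset entries.
  for (const c of candidates) {
    const category = kb.overrides.get(c);
    if (category !== undefined) {
      return { host, category, source: "override", evidence: [] };
    }
  }
  // 2. Otherwise look the domain up in the synced dataset and attach evidence.
  for (const c of candidates) {
    const category = kb.trackerDomains.get(c);
    if (category !== undefined) {
      return { host, category, source: "dataset", evidence: kb.evidence.get(c) ?? [] };
    }
  }
  // 3. Assumption: unknown hosts are reported as uncategorized.
  return { host, category: "uncategorized", source: "none", evidence: [] };
}
```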
Running the sync locally
Prerequisites:
- Ensure CMP_REGISTRY_DATABASE_URL points to a reachable database.
- (Optional) Set dataset URL overrides if you need to pin forks/mirrors.
Commands (choose one):
- All datasets (recommended):
  infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets
- Individually:
  - WTM:
    infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:wtm
  - AdGuard:
    infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:adguard
  - IAB GVL:
    infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:iab-gvl
Verification:
- Prometheus: /metrics exposes cmp_dataset_fetch_success_total{source=…} and cmp_dataset_fetch_timestamp{source=…}.
- Classification:
  curl "http://dev.digiwedge.com:3320/v1/classify?host=www.googletagmanager.com"
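To check the Prometheus counters programmatically, one option is to parse the text returned by /metrics. The sketch below is a simplified reader of the Prometheus text exposition format, not a full parser: it does not unescape label values and would mis-read a label value containing `}`. `parseSamples` and `fetchedSources` are hypothetical helper names, and the `source` label values the job actually emits are not specified on this page.

```typescript
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parse sample lines of the form: metric_name{label="value",...} 123
// Comment lines (# HELP / # TYPE) and blank lines are skipped.
function parseSamples(text: string): Sample[] {
  const samples: Sample[] = [];
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)/;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const m = lineRe.exec(line);
    if (m === null) continue;
    const labels: Record<string, string> = {};
    const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    const labelText = m[2] ?? "";
    let lm: RegExpExecArray | null;
    while ((lm = labelRe.exec(labelText)) !== null) {
      labels[lm[1]] = lm[2];
    }
    samples.push({ name: m[1], labels, value: Number(m[3]) });
  }
  return samples;
}

// Sources whose success counter is above zero, i.e. fetched at least once.
function fetchedSources(metricsText: string): string[] {
  return parseSamples(metricsText)
    .filter((s) => s.name === "cmp_dataset_fetch_success_total" && s.value > 0)
    .map((s) => s.labels["source"] ?? "")
    .filter((source) => source !== "");
}
```

Feeding it the body of a GET request to the metrics endpoint gives a quick list of sources that have synced at least once since the process started.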
Progress output and caching
- The job prints step starts/ends with durations and per‑source progress:
  - WTM: companies=… trackers=… domains=…, then domains X/Y regularly until done.
  - AdGuard: total companies/domains with periodic companies … / … and domains … / ….
  - IAB GVL: total vendors with periodic processed N/… and done.
- If you only see the first WTM totals line and then no updates, you may be running a cached build. Re‑run with:
  pnpm -w nx run cmp-datasets:build --skip-nx-cache
  pnpm -w nx run cmp-registry:build:production --skip-nx-cache
  infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets --tui=false --skip-nx-cache
- To bypass Nx and stream logs directly:
  infisical run --env=dev -- node dist/apps/cmp/registry/src/jobs/datasets.js
Kubernetes: set dataset URLs via Secret
kubectl -n cmp create secret generic cmp-datasets-urls \
--from-literal=WTM_TRACKERS_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json' \
--from-literal=WTM_ENTITIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
--from-literal=ADG_COMPANIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
--from-literal=IAB_GVL_URL='https://vendorlist.consensu.org/v3/vendor-list.json' \
-o yaml --dry-run=client | kubectl apply -f -
Reference that Secret in the datasets CronJob (envFrom.secretRef or explicit env entries).
Kubernetes (nightly job)
Two configurations exist:
- Helm chart: charts/cmp-jobs/templates/cronjob-datasets.yaml
  - CronJob name: cmp-datasets-nightly
  - Schedule: .Values.datasetSchedule
  - Command: node /app/dist/apps/cmp/registry/src/jobs/datasets.js
  - Env from secret: .Values.envSecretName
- Plain manifest: kubernetes/cmp/jobs/datasets-cronjob.yaml
  - Schedule: 0 1 * * * (01:00 daily)
  - Secret keys required: CMP_REGISTRY_DATABASE_URL, WTM_TRACKERS_URL, WTM_ENTITIES_URL, ADG_COMPANIES_URL, IAB_GVL_URL
Data sources and mapping
- WhoTracks.me
  - Trackers: domain, owner, category, prevalence → TrackerDomain, DomainEvidence
  - Entities: owner metadata → TrackerEntity
- AdGuard Companies
  - Company → domains, categories → TrackerEntity, TrackerDomain, DomainEvidence
- IAB GVL (optional)
  - Vendor purposes/legint → VendorIAB
Category mapping is normalized (e.g., “Advertising”, “Ads” → advertising; “Analytics”, “Measurement” → analytics).
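A minimal sketch of this kind of normalization follows. Only the four aliases named above come from this page; the identity entries for functional and social, the case-insensitive matching, and the fallback to uncategorized are assumptions, and `normalizeCategory` is a hypothetical name rather than the fetchers' actual function.

```typescript
type CanonicalCategory =
  | "analytics"
  | "advertising"
  | "functional"
  | "social"
  | "uncategorized";

// Lower-cased source label -> canonical category.
// A Map is used so that inherited object keys such as "constructor"
// can never be mistaken for an alias.
const CATEGORY_ALIASES = new Map<string, CanonicalCategory>([
  ["advertising", "advertising"],
  ["ads", "advertising"],
  ["analytics", "analytics"],
  ["measurement", "analytics"],
  ["functional", "functional"], // assumption: canonical names map to themselves
  ["social", "social"],         // assumption: canonical names map to themselves
]);

// Normalize a raw dataset category; anything unrecognized (including
// missing values) falls back to "uncategorized" in this sketch.
function normalizeCategory(raw: string | null | undefined): CanonicalCategory {
  const key = (raw ?? "").trim().toLowerCase();
  return CATEGORY_ALIASES.get(key) ?? "uncategorized";
}
```

Keeping the aliases in one table makes it easy to review how each upstream label is bucketed when a dataset introduces a new category name.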
Licensing
- AdGuard companiesdb is published under CC‑BY‑SA 4.0. Attribute appropriately and keep share‑alike if publishing merged datasets externally. Internal use for classification is fine.
Troubleshooting
- 400/500 during fetch
  - Check egress to GitHub/Consensu; verify URLs and TLS.
- Empty classifications
  - Ensure the datasets job completed successfully and that it wrote to the same database environment your API uses.
- Duplicates/unique errors
  - Upserts are idempotent. If you see constraint errors, ensure the schema is up to date and that no manually inserted rows violate uniqueness on TrackerDomain.host.
- Prometheus shows no dataset metrics
  - Set METRICS_ENABLED=true and check METRICS_PATH (default /metrics).
References (code)
- Job orchestrator: apps/cmp/registry/src/jobs/datasets.ts
- Fetchers: libs/cmp/datasets/src/*.fetch.ts
- Prisma schema: libs/prisma/cmp-registry-data/prisma/schema.prisma
- Classifier service: apps/cmp/registry/src/app/classifier.service.ts
- Admin overrides: apps/cmp/registry/src/app/admin.classifier.controller.ts