Datasets & Classifier Sync

Last updated: 2025-09-14

This page documents the external datasets, environment variables, and jobs used to build the CMP classifier knowledge base. The datasets populate the following Prisma models:

TrackerEntity: tracker/vendor organizations (WhoTracks.me, AdGuard)
TrackerDomain: domain → canonical category mapping
DomainEvidence: human‑readable evidence behind a classification
VendorIAB: IAB Global Vendor List overlay (optional)

Once synced, the Classifier API at /v1/classify uses this data along with optional site/global overrides.

Configuration

CMP_REGISTRY_DATABASE_URL (required)
- Postgres URL for the CMP registry database.
- Example: postgresql://cmp:cmp@db.local:5432/cmp?sslmode=require
Recommended dataset sources (used by default; override if needed)
- WTM_TRACKERS_URL (default): https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json
- WTM_ENTITIES_URL (default): https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
- ADG_COMPANIES_URL (default): https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
- IAB_GVL_URL (default): https://vendorlist.consensu.org/v3/vendor-list.json
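
The configuration rules above can be sketched in TypeScript as follows. This is a minimal illustration, not the job's actual loader: `resolveDatasetConfig`, `DatasetConfig`, and `DATASET_DEFAULTS` are hypothetical names, and treating a blank override as "unset" is an assumption.

```typescript
// Default dataset URLs exactly as listed on this page.
const DATASET_DEFAULTS = {
  WTM_TRACKERS_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json",
  WTM_ENTITIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  ADG_COMPANIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  IAB_GVL_URL: "https://vendorlist.consensu.org/v3/vendor-list.json",
};

type DatasetUrlKey = keyof typeof DATASET_DEFAULTS;

interface DatasetConfig {
  databaseUrl: string;
  urls: Record<DatasetUrlKey, string>;
}

// Hypothetical helper: resolves the required database URL and applies
// any dataset URL overrides on top of the documented defaults.
function resolveDatasetConfig(
  env: Record<string, string | undefined>
): DatasetConfig {
  const databaseUrl = env["CMP_REGISTRY_DATABASE_URL"];
  if (!databaseUrl) {
    throw new Error("CMP_REGISTRY_DATABASE_URL is required");
  }
  const urls = { ...DATASET_DEFAULTS };
  for (const key of Object.keys(DATASET_DEFAULTS) as DatasetUrlKey[]) {
    const override = env[key];
    // Assumption: an empty or whitespace-only value means "use the default".
    if (override && override.trim() !== "") {
      urls[key] = override.trim();
    }
  }
  return { databaseUrl, urls };
}
```

In a Node process this would typically be called as `resolveDatasetConfig(process.env)`, so a missing database URL fails fast before any dataset is fetched.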

How classification works (high level)

Site/global overrides take precedence (ClassifierOverride).
If no override applies, the domain is looked up in TrackerDomain (the lookup is eTLD+1 aware).
Evidence from DomainEvidence is returned in the response for transparency.

The canonical categories are: analytics, advertising, functional, social, uncategorized.
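
The lookup order described above can be sketched with in-memory maps standing in for the Prisma models. Several details here are assumptions rather than the service's confirmed behavior: site overrides are checked before global ones, `candidateHosts` is a naive stand-in for real eTLD+1 handling (which would normally consult the Public Suffix List), and all type and function names are hypothetical.

```typescript
type Category =
  | "analytics" | "advertising" | "functional" | "social" | "uncategorized";

interface Classification {
  host: string;
  category: Category;
  source: "site-override" | "global-override" | "dataset" | "none";
  evidence: string[];
}

interface ClassifierData {
  siteOverrides: Map<string, Category>;   // key: `${siteId}|${host}`
  globalOverrides: Map<string, Category>; // key: host
  trackerDomains: Map<string, Category>;  // key: host
  domainEvidence: Map<string, string[]>;  // key: host
}

// Naive approximation of eTLD+1 awareness: the full host first, then each
// parent domain down to two labels. It does not know about suffixes such
// as "co.uk".
function candidateHosts(host: string): string[] {
  const labels = host.toLowerCase().split(".").filter((l) => l.length > 0);
  const out: string[] = [];
  for (let i = 0; i + 2 <= labels.length; i++) {
    out.push(labels.slice(i).join("."));
  }
  return out;
}

// Returns the first key that is present in the map, if any.
function firstKey<T>(map: Map<string, T>, keys: string[]): string | undefined {
  for (const k of keys) {
    if (map.has(k)) return k;
  }
  return undefined;
}

function classify(host: string, data: ClassifierData, siteId?: string): Classification {
  const candidates = candidateHosts(host);

  // 1. Overrides win (assumed order: site-specific, then global).
  if (siteId !== undefined) {
    const siteKey = firstKey(data.siteOverrides, candidates.map((h) => siteId + "|" + h));
    if (siteKey !== undefined) {
      const category = data.siteOverrides.get(siteKey) as Category;
      return { host, category, source: "site-override", evidence: [] };
    }
  }
  const globalKey = firstKey(data.globalOverrides, candidates);
  if (globalKey !== undefined) {
    const category = data.globalOverrides.get(globalKey) as Category;
    return { host, category, source: "global-override", evidence: [] };
  }

  // 2. Otherwise use the synced dataset and attach its evidence.
  const datasetKey = firstKey(data.trackerDomains, candidates);
  if (datasetKey !== undefined) {
    const category = data.trackerDomains.get(datasetKey) as Category;
    const evidence = data.domainEvidence.get(datasetKey) || [];
    return { host, category, source: "dataset", evidence };
  }

  // 3. Nothing matched.
  return { host, category: "uncategorized", source: "none", evidence: [] };
}
```

With this shape, a request for a subdomain such as `cdn.widgets.example` resolves through `widgets.example` when only the parent domain is known.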

Running the sync locally

Prerequisites:

Ensure CMP_REGISTRY_DATABASE_URL points to a reachable database.
(Optional) Set dataset URL overrides if you need to pin forks/mirrors.

Commands (choose one):

All datasets (recommended):
- infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets
Individually:
- WTM: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:wtm
- AdGuard: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:adguard
- IAB GVL: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:iab-gvl

Verification:

Prometheus: /metrics exposes cmp_dataset_fetch_success_total{source=…} and cmp_dataset_fetch_timestamp{source=…}.
Classification:
- curl "http://dev.digiwedge.com:3320/v1/classify?host=www.googletagmanager.com"
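
To check the Prometheus output programmatically, the text returned by /metrics can be scanned for the two dataset metrics named above. The sketch below assumes the standard Prometheus text exposition format; `parseDatasetMetrics` and `DatasetMetric` are hypothetical names, and the parser deliberately ignores everything except a `source` label.

```typescript
interface DatasetMetric {
  name: string;
  source: string;
  value: number;
}

// Extracts cmp_dataset_fetch_* samples and their `source` label from
// Prometheus text exposition output. Other metrics and comments are skipped.
function parseDatasetMetrics(text: string): DatasetMetric[] {
  const wanted = [
    "cmp_dataset_fetch_success_total",
    "cmp_dataset_fetch_timestamp",
  ];
  const samplePattern = /^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+(\S+)/;
  const sourcePattern = /(?:^|,)\s*source="([^"]*)"/;
  const out: DatasetMetric[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.charAt(0) === "#") continue; // HELP/TYPE/blank
    const sample = samplePattern.exec(trimmed);
    if (!sample || wanted.indexOf(sample[1]) === -1) continue;
    const source = sourcePattern.exec(sample[2]);
    if (!source) continue;
    out.push({ name: sample[1], source: source[1], value: Number(sample[3]) });
  }
  return out;
}
```

An empty result after a sync suggests that metrics are disabled or that the job did not run; the actual `source` label values depend on the job and are not assumed here.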

Progress output and caching

The job prints step starts/ends with durations and per‑source progress:
- WTM: companies=… trackers=… domains=…, then domains X/Y regularly until done.
- AdGuard: total companies/domains with periodic companies … / … and domains … / ….
- IAB GVL: total vendors with periodic processed N/… and done.
If you only see the first WTM totals line and then no updates, you may be running a cached build. Re‑run with:
- pnpm -w nx run cmp-datasets:build --skip-nx-cache
- pnpm -w nx run cmp-registry:build:production --skip-nx-cache
- infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets --tui=false --skip-nx-cache
To bypass Nx and stream logs directly:
- infisical run --env=dev -- node dist/apps/cmp/registry/src/jobs/datasets.js

Kubernetes: set dataset URLs via Secret

kubectl -n cmp create secret generic cmp-datasets-urls \
  --from-literal=WTM_TRACKERS_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json' \
  --from-literal=WTM_ENTITIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
  --from-literal=ADG_COMPANIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
  --from-literal=IAB_GVL_URL='https://vendorlist.consensu.org/v3/vendor-list.json' \
  -o yaml --dry-run=client | kubectl apply -f -

Reference that Secret in the datasets CronJob (envFrom.secretRef or explicit env entries).

Kubernetes (nightly job)

Two configurations exist:

Helm chart: charts/cmp-jobs/templates/cronjob-datasets.yaml
- CronJob name: cmp-datasets-nightly
- Schedule: .Values.datasetSchedule
- Command: node /app/dist/apps/cmp/registry/src/jobs/datasets.js
- Env from secret: .Values.envSecretName
Plain manifest: kubernetes/cmp/jobs/datasets-cronjob.yaml
- Schedule: 0 1 * * * (01:00 daily)
- Secret keys required: CMP_REGISTRY_DATABASE_URL, WTM_TRACKERS_URL, WTM_ENTITIES_URL, ADG_COMPANIES_URL, IAB_GVL_URL

Data sources and mapping

WhoTracks.me
- Trackers: domain, owner, category, prevalence → TrackerDomain, DomainEvidence
- Entities: owner metadata → TrackerEntity
AdGuard Companies
- Company → domains, categories → TrackerEntity, TrackerDomain, DomainEvidence
IAB GVL (optional)
- Vendor purposes/legint → VendorIAB

Category mapping is normalized (e.g., “Advertising”, “Ads” → advertising; “Analytics”, “Measurement” → analytics).
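
A normalization step of this kind might look like the sketch below. Only the four aliases named above come from this page; mapping the canonical names to themselves, ignoring case and surrounding whitespace, and falling back to uncategorized are assumptions, and `normalizeCategory` is a hypothetical name.

```typescript
type Category =
  | "analytics" | "advertising" | "functional" | "social" | "uncategorized";

// Alias table keyed by lowercased source label. A Map is used so that
// labels like "constructor" cannot hit inherited object properties.
const CATEGORY_ALIASES = new Map<string, Category>([
  ["advertising", "advertising"],
  ["ads", "advertising"],
  ["analytics", "analytics"],
  ["measurement", "analytics"],
  // Assumption: canonical names map to themselves.
  ["functional", "functional"],
  ["social", "social"],
]);

// Maps a raw dataset category label onto one of the canonical categories.
function normalizeCategory(raw: string | null | undefined): Category {
  if (!raw) return "uncategorized";
  const key = raw.trim().toLowerCase();
  return CATEGORY_ALIASES.get(key) || "uncategorized";
}
```

Keeping the aliases in one table means a new upstream label only requires a new entry rather than a change to the lookup logic.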

Licensing

AdGuard companiesdb is published under CC‑BY‑SA 4.0. Attribute appropriately and keep share‑alike if publishing merged datasets externally. Internal use for classification is fine.

Troubleshooting

HTTP 4xx/5xx errors during fetch
- Check egress to GitHub (raw.githubusercontent.com) and vendorlist.consensu.org; verify the dataset URLs and TLS.
Empty classifications
- Ensure the datasets job completed successfully and that CMP_REGISTRY_DATABASE_URL points to the same database your API uses.
Duplicates/unique errors
- Upserts are idempotent. If you see constraint errors, ensure the schema is up to date and that no manually inserted rows violate uniqueness on TrackerDomain.host.
Prometheus shows no dataset metrics
- Set METRICS_ENABLED=true and check METRICS_PATH (default /metrics).
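
For the duplicates case above, the following sketch illustrates why keying upserts on the host keeps re-runs idempotent. It uses an in-memory map instead of Prisma; the row shape, the class name, and the lowercase/trim normalization of the host are all assumptions made for illustration.

```typescript
interface TrackerDomainRow {
  host: string;
  category: string;
  entityName?: string;
}

// In-memory stand-in for a table with a unique constraint on `host`.
class InMemoryTrackerDomains {
  private rows = new Map<string, TrackerDomainRow>();

  // Inserts the row, or replaces the existing row with the same host.
  upsert(row: TrackerDomainRow): "created" | "updated" {
    // Assumption: hosts are compared case-insensitively and trimmed.
    const key = row.host.trim().toLowerCase();
    const existed = this.rows.has(key);
    this.rows.set(key, { ...row, host: key });
    return existed ? "updated" : "created";
  }

  get(host: string): TrackerDomainRow | undefined {
    return this.rows.get(host.trim().toLowerCase());
  }

  count(): number {
    return this.rows.size;
  }
}
```

Running the same batch twice leaves the row count unchanged. A manually inserted row that differs only by case or whitespace would bypass this normalization in a real database, which is one way a uniqueness conflict can arise.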

References (code)

Job orchestrator: apps/cmp/registry/src/jobs/datasets.ts
Fetchers: libs/cmp/datasets/src/*.fetch.ts
Prisma schema: libs/prisma/cmp-registry-data/prisma/schema.prisma
Classifier service: apps/cmp/registry/src/app/classifier.service.ts
Admin overrides: apps/cmp/registry/src/app/admin.classifier.controller.ts
