Datasets & Classifier Sync

Last updated: 2025-09-14

This page documents the external datasets, environment variables, and jobs used to build the CMP classifier knowledge base. The datasets populate the following Prisma models:

  • TrackerEntity: tracker/vendor organizations (WhoTracks.me, AdGuard)
  • TrackerDomain: domain → canonical category mapping
  • DomainEvidence: human‑readable evidence behind a classification
  • VendorIAB: IAB Global Vendor List overlay (optional)

Once synced, the Classifier API at /v1/classify uses this data along with optional site/global overrides.

Configuration

  • CMP_REGISTRY_DATABASE_URL (required)

    • Postgres URL for the CMP registry database.
    • Example: postgresql://cmp:cmp@db.local:5432/cmp?sslmode=require
  • Dataset source URLs (optional; the recommended defaults below are used unless overridden)

    • WTM_TRACKERS_URL: defaults to https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json
    • WTM_ENTITIES_URL: defaults to https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
    • ADG_COMPANIES_URL: defaults to https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json
    • IAB_GVL_URL: defaults to https://vendorlist.consensu.org/v3/vendor-list.json
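
The required/optional split above can be sketched as a small resolver. This is a minimal illustration, not the registry's actual config loader: `resolveDatasetConfig`, `DatasetConfig`, and `DATASET_URL_DEFAULTS` are hypothetical names, and the environment is passed in as a plain object so the sketch does not depend on Node typings.

```typescript
// Default URLs copied from the list above; the variable names are the documented ones.
const DATASET_URL_DEFAULTS = {
  WTM_TRACKERS_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json",
  WTM_ENTITIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  ADG_COMPANIES_URL:
    "https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json",
  IAB_GVL_URL: "https://vendorlist.consensu.org/v3/vendor-list.json",
} as const;

type DatasetUrlKey = keyof typeof DATASET_URL_DEFAULTS;
type Env = Record<string, string | undefined>;

// Hypothetical shape, for illustration only.
interface DatasetConfig {
  databaseUrl: string;
  urls: Record<DatasetUrlKey, string>;
}

// Hypothetical helper: fails fast on the one required variable and
// falls back to the documented default for every unset or blank URL.
function resolveDatasetConfig(env: Env): DatasetConfig {
  const databaseUrl = (env["CMP_REGISTRY_DATABASE_URL"] ?? "").trim();
  if (databaseUrl === "") {
    throw new Error("CMP_REGISTRY_DATABASE_URL is required");
  }

  const urls = {} as Record<DatasetUrlKey, string>;
  for (const key of Object.keys(DATASET_URL_DEFAULTS) as DatasetUrlKey[]) {
    const override = (env[key] ?? "").trim();
    urls[key] = override !== "" ? override : DATASET_URL_DEFAULTS[key];
  }
  return { databaseUrl, urls };
}
```

In a Node process this would be called as `resolveDatasetConfig(process.env)`; a blank override is treated the same as an unset one, which is an assumption of this sketch.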

How classification works (high level)

  1. Site/global overrides take precedence (ClassifierOverride).
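  2. If no override applies, the host is looked up in TrackerDomain. The lookup is eTLD+1 aware, so a subdomain such as www.googletagmanager.com can match an entry for its registrable domain.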
  3. Evidence from DomainEvidence is returned in the response for transparency.

The canonical categories are: analytics, advertising, functional, social, uncategorized.
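
The precedence described above can be sketched with in-memory maps. This is an illustration under stated assumptions, not the code in classifier.service.ts: `classifyHost`, `KnowledgeBase`, and `candidateDomains` are hypothetical names, site versus global override scoping is left out, the fallback to uncategorized for unknown hosts is assumed, and the parent-domain walk is a simplification that stands in for a real Public Suffix List lookup.

```typescript
type Category = "analytics" | "advertising" | "functional" | "social" | "uncategorized";

// In-memory stand-ins for the Prisma models (hypothetical shapes).
interface KnowledgeBase {
  overrides: Map<string, Category>;      // stands in for ClassifierOverride
  trackerDomains: Map<string, Category>; // stands in for TrackerDomain
  evidence: Map<string, string[]>;       // stands in for DomainEvidence
}

interface Classification {
  host: string;
  matchedDomain: string | null;
  category: Category;
  source: "override" | "dataset" | "none";
  evidence: string[];
}

// Simplified eTLD+1 handling: try the full host, then each parent domain
// down to two labels. A real implementation would use the Public Suffix List.
function candidateDomains(host: string): string[] {
  const labels = host.toLowerCase().replace(/\.$/, "").split(".");
  const out: string[] = [];
  for (let i = 0; i + 2 <= labels.length; i++) {
    out.push(labels.slice(i).join("."));
  }
  return out.length > 0 ? out : [labels.join(".")];
}

function firstMatch(
  candidates: string[],
  table: Map<string, Category>,
): { domain: string; category: Category } | null {
  for (const domain of candidates) {
    const category = table.get(domain);
    if (category !== undefined) return { domain, category };
  }
  return null;
}

function classifyHost(host: string, kb: KnowledgeBase): Classification {
  const candidates = candidateDomains(host);

  // Step 1: overrides win over dataset entries.
  const override = firstMatch(candidates, kb.overrides);
  if (override) {
    return { host, matchedDomain: override.domain, category: override.category, source: "override", evidence: [] };
  }

  // Steps 2 and 3: dataset lookup, with evidence attached for transparency.
  const hit = firstMatch(candidates, kb.trackerDomains);
  if (hit) {
    return {
      host,
      matchedDomain: hit.domain,
      category: hit.category,
      source: "dataset",
      evidence: kb.evidence.get(hit.domain) ?? [],
    };
  }

  // Assumed fallback for hosts that match nothing.
  return { host, matchedDomain: null, category: "uncategorized", source: "none", evidence: [] };
}
```

Checking overrides across all candidate domains before consulting the dataset keeps the precedence rule independent of how specific the matching entry is.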

Running the sync locally

Prerequisites:

  • Ensure CMP_REGISTRY_DATABASE_URL points to a reachable database.
  • (Optional) Set dataset URL overrides if you need to pin forks/mirrors.

Commands (choose one):

  • All datasets (recommended):

    • infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets
  • Individually:

    • WTM: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:wtm
    • AdGuard: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:adguard
    • IAB GVL: infisical run --env=dev -- pnpm -w nx run cmp-registry:datasets:iab-gvl

Verification:

  • Prometheus: /metrics exposes cmp_dataset_fetch_success_total{source=…} and cmp_dataset_fetch_timestamp{source=…}.
  • Classification:
    • curl "http://dev.digiwedge.com:3320/v1/classify?host=www.googletagmanager.com"
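
To check the Prometheus counters from a script rather than by eye, the metrics text can be parsed per source. The sketch below is a minimal reader for the Prometheus text exposition format; `parseSourceSamples` is a hypothetical helper, and the source label values used in the usage note are assumptions, since the actual values are not listed on this page.

```typescript
// Hypothetical helper: extracts `<metricName>{source="..."} <value>` samples
// from Prometheus text exposition output, keyed by the source label.
function parseSourceSamples(metricsText: string, metricName: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const rawLine of metricsText.split("\n")) {
    const line = rawLine.trim();
    // Skip blanks and # HELP / # TYPE comment lines.
    if (line === "" || line.charAt(0) === "#") continue;

    // name{labels} value [timestamp]
    const match = /^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)/.exec(line);
    if (!match || match[1] !== metricName) continue;

    const source = /(?:^|,)\s*source="([^"]*)"/.exec(match[2]);
    if (!source) continue;

    samples.set(source[1], Number(match[3]));
  }
  return samples;
}
```

Pass it the response body from /metrics and the metric name, for example `parseSourceSamples(body, "cmp_dataset_fetch_success_total")`; a source that is missing from the result, or whose counter has not increased since the previous run, suggests that fetch did not succeed.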

Progress output and caching

  • The job prints step starts/ends with durations and per‑source progress:
    • WTM: companies=… trackers=… domains=…, then domains X/Y regularly until done.
    • AdGuard: total companies/domains with periodic companies … / … and domains … / ….
    • IAB GVL: total vendors with periodic processed N/… and done.
  • If you only see the first WTM totals line and then no updates, you may be running a cached build. Re‑run with:
    • pnpm -w nx run cmp-datasets:build --skip-nx-cache
    • pnpm -w nx run cmp-registry:build:production --skip-nx-cache
    • infisical run --env=dev -- pnpm -w nx run cmp-registry:jobs:datasets --tui=false --skip-nx-cache
  • To bypass Nx and stream logs directly:
    • infisical run --env=dev -- node dist/apps/cmp/registry/src/jobs/datasets.js

Kubernetes: set dataset URLs via Secret

kubectl -n cmp create secret generic cmp-datasets-urls \
--from-literal=WTM_TRACKERS_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/whotracksme.json' \
--from-literal=WTM_ENTITIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
--from-literal=ADG_COMPANIES_URL='https://raw.githubusercontent.com/AdguardTeam/companiesdb/main/dist/companies.json' \
--from-literal=IAB_GVL_URL='https://vendorlist.consensu.org/v3/vendor-list.json' \
-o yaml --dry-run=client | kubectl apply -f -

Reference that Secret in the datasets CronJob (envFrom.secretRef or explicit env entries).

Kubernetes (nightly job)

Two configurations exist:

  • Helm chart: charts/cmp-jobs/templates/cronjob-datasets.yaml
    • CronJob name: cmp-datasets-nightly
    • Schedule: .Values.datasetSchedule
    • Command: node /app/dist/apps/cmp/registry/src/jobs/datasets.js
    • Env from secret: .Values.envSecretName
  • Plain manifest: kubernetes/cmp/jobs/datasets-cronjob.yaml
    • Schedule: 0 1 * * * (01:00 daily)
    • Secret keys required: CMP_REGISTRY_DATABASE_URL, WTM_TRACKERS_URL, WTM_ENTITIES_URL, ADG_COMPANIES_URL, IAB_GVL_URL

Data sources and mapping

  • WhoTracks.me
    • Trackers: domain, owner, category, prevalence → TrackerDomain, DomainEvidence
    • Entities: owner metadata → TrackerEntity
  • AdGuard Companies
    • Company → domains, categories → TrackerEntity, TrackerDomain, DomainEvidence
  • IAB GVL (optional)
    • Vendor purposes and legitimate-interest purposes → VendorIAB

Category mapping is normalized (e.g., “Advertising”, “Ads” → advertising; “Analytics”, “Measurement” → analytics).
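
A minimal sketch of this normalization follows. Only the four synonyms named above come from this page; the identity entries, the fallback to uncategorized for unknown labels, and the names `CATEGORY_SYNONYMS` and `normalizeCategory` are assumptions for illustration, not the fetchers' actual mapping table.

```typescript
type Category = "analytics" | "advertising" | "functional" | "social" | "uncategorized";

// Lower-cased source label → canonical category.
// A Map is used instead of a plain object so that labels such as
// "constructor" cannot accidentally hit inherited properties.
const CATEGORY_SYNONYMS = new Map<string, Category>([
  ["advertising", "advertising"],
  ["ads", "advertising"],
  ["analytics", "analytics"],
  ["measurement", "analytics"],
  ["functional", "functional"],
  ["social", "social"],
]);

// Hypothetical helper: trims and lower-cases the raw label, then maps it.
// Unknown or missing labels fall back to "uncategorized" (assumed behaviour).
function normalizeCategory(raw: string | null | undefined): Category {
  const key = (raw ?? "").trim().toLowerCase();
  return CATEGORY_SYNONYMS.get(key) ?? "uncategorized";
}
```

For example, `normalizeCategory(" Ads ")` and `normalizeCategory("Advertising")` both yield `advertising`, while an unrecognized label yields `uncategorized`.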

Licensing

  • AdGuard companiesdb is published under CC‑BY‑SA 4.0. Attribute appropriately and keep share‑alike if publishing merged datasets externally. Internal use for classification is fine.

Troubleshooting

  • HTTP 4xx/5xx errors during fetch
    • Check network egress to raw.githubusercontent.com and vendorlist.consensu.org; verify the configured dataset URLs and TLS settings.
  • Empty classifications
    • Ensure the datasets job completed successfully and that its CMP_REGISTRY_DATABASE_URL points to the same database the Classifier API reads from.
  • Duplicates/unique errors
    • Upserts are idempotent, so re-running the job should not create duplicates. If you see constraint errors, ensure the schema is up to date and that no manually inserted rows violate the uniqueness constraint on TrackerDomain.host.
  • Prometheus shows no dataset metrics
    • Set METRICS_ENABLED=true and check METRICS_PATH (default /metrics).

References (code)

  • Job orchestrator: apps/cmp/registry/src/jobs/datasets.ts
  • Fetchers: libs/cmp/datasets/src/*.fetch.ts
  • Prisma schema: libs/prisma/cmp-registry-data/prisma/schema.prisma
  • Classifier service: apps/cmp/registry/src/app/classifier.service.ts
  • Admin overrides: apps/cmp/registry/src/app/admin.classifier.controller.ts