Canonical

Enrichment Pipeline

Orakel's in-house enrichment pipeline — domains, tech stacks, social handles, and ad pixels derived from public web signals.

Updated 2026-04-29

Source: Orakel enrichment pipeline (in-house; derived signals, not a third-party feed)
Data: Verified company domains, detected technologies, ad pixels, social handles
License: Orakel-generated derived data. Source web pages retain their own licenses.
Attribution required: No (Orakel generates this data)
Link: N/A (internal pipeline)
Update cadence: Daily chunks (morning/midday/evening Mon–Sat) plus a weekly seed on Sundays; technology detection runs as its own daily pass

What it is

Orakel enriches company records with signals derived from the public web. The pipeline uses no paid third-party data: no BuiltWith, no Clearbit, no ZoomInfo. The paid-data policy (lib/countries.ts) rules out subscription feeds until a paying customer's revenue covers one.

The pipeline runs in stages:

  1. Candidate generation. Start from Brreg's registered website and expand with heuristics derived from the company name.
  2. DNS + HTTP probes. Resolve each candidate; fetch the homepage.
  3. Signal extraction. Look for the company's org number on the homepage, look up the domain holder in NORID's RDAP (for .no domains), and compare that holder against the registered company name. Crawl /robots.txt and /sitemap.xml for platform hints.
  4. Scoring. A composite confidence score (0–100) combines the signals. High-confidence matches land in confirmed; weaker ones go to ambiguous for LLM review; the rest are rejected.
  5. Technology + ad-pixel detection. Once a domain is confirmed, a separate daily pass fetches the homepage and matches HTML/script signatures for CMS, analytics, chat, marketing, ecommerce, framework, and advertising pixels (Meta, Google Tag Manager, Snap, TikTok, Twitter/X, plus LinkedIn Insight).
  6. Social handles. Extracted from homepage outbound links — LinkedIn company page, Facebook, Instagram, X/Twitter.
  7. SSL SAN scan. A nightly crt.sh scan pulls certificate Subject Alternative Names for confirmed domains, surfacing sibling domains on the same certificate.
  8. Traffic ranking. Weekly download of the Tranco top-1M domain list. Sets Company.trafficRank (1 = highest) for every company whose primaryDomain appears in the list.
  9. Address geocoding. Weekly batch geocodes businessAddress into (businessAddressLat, businessAddressLng) using country-appropriate APIs: Kartverket (ws.geonorge.no, NO), DAWA (api.dataforsyningen.dk, DK), Nominatim (nominatim.openstreetmap.org, SE/FI). Sub-unit locationAddress is also geocoded into (addressLat, addressLng).
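
The scoring step (stage 4) can be pictured with a minimal sketch. The signal names, weights, and thresholds below are assumptions chosen for illustration; the actual values used by the pipeline are not documented here.

```typescript
// Hypothetical signal set for one candidate domain. Field names mirror the
// signals described in stages 2-3 but are not Orakel's actual schema.
interface CandidateSignals {
  dnsResolves: boolean;
  httpOk: boolean;
  orgNumberOnHomepage: boolean;
  rdapHolderMatches: boolean; // only meaningful for .no domains
  parked: boolean;
}

type Verdict = "confirmed" | "ambiguous" | "rejected";

// Assumed weights and thresholds, for illustration only.
const WEIGHTS = { dns: 10, http: 15, orgNumber: 45, rdap: 30 };
const CONFIRM_AT = 70;
const AMBIGUOUS_AT = 35;

// Combine the signals into a 0-100 composite score.
function scoreCandidate(s: CandidateSignals): number {
  // A domain that does not resolve, or is parked, scores zero.
  if (!s.dnsResolves || s.parked) return 0;
  let score = WEIGHTS.dns;
  if (s.httpOk) score += WEIGHTS.http;
  if (s.orgNumberOnHomepage) score += WEIGHTS.orgNumber;
  if (s.rdapHolderMatches) score += WEIGHTS.rdap;
  return Math.min(100, score);
}

// Map a composite score onto the three outcomes named in stage 4.
function classify(score: number): Verdict {
  if (score >= CONFIRM_AT) return "confirmed";
  if (score >= AMBIGUOUS_AT) return "ambiguous";
  return "rejected";
}
```

Under these assumed weights, a candidate that resolves, serves a homepage, and matches the RDAP holder but shows no org number would score 55 and go to ambiguous for LLM review.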

Fields provided

Company (enrichment-populated fields)

| Field | Type | Notes |
| --- | --- | --- |
| primaryDomain | string | The single best-match verified domain for this company |
| enrichedDomains | string[] | All verified domains (includes primaryDomain) |
| brregWebsiteValid | bool | Whether Brreg's registered website survived verification |
| linkedinHandle / facebookHandle / instagramHandle / twitterHandle | string | Social handles extracted from the homepage |
| technologies | JSON | Array of { name, category, confidence, detectedAt }. Categories include cms, analytics, chat, marketing, ecommerce, framework, advertising |
| techSyncedAt | datetime | Last technology scan |
| sslSansSyncedAt | datetime | Last crt.sh SAN scan |
| description | string | Business activity text from the registry. See Fields → Company for per-country sources. |
| descriptionKeywords | string[] | Top-10 keywords extracted from description by Orakel |
| trafficRank / trafficRankUpdatedAt | int / datetime | Tranco top-1M rank, updated weekly |
| businessAddressLat / businessAddressLng | float | Geocoded coordinates; see geocoding note in pipeline step 9 above |
| digitalizationIndex | number (0–1) | Computed at read time from technologies |
| socialIndex | number (0–1) | Computed at read time from social handles |
| sizeClass | string | Computed at read time from employeeCount |
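
As a minimal sketch of working with the technologies field on the client side, the helpers below group entries by category and check for an advertising pixel. The TypeScript types for confidence and detectedAt are assumptions (the field list above only names the keys), and the sample technology names in the usage note are made up.

```typescript
// One entry of Company.technologies. Key names come from the field notes;
// the value types are assumptions.
interface TechnologyEntry {
  name: string;
  category: string;
  confidence: number; // scale not specified in the docs
  detectedAt: string; // assumed to be an ISO 8601 timestamp
}

// Group technology names by their category.
function namesByCategory(techs: TechnologyEntry[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const t of techs) {
    const names = grouped.get(t.category) ?? [];
    names.push(t.name);
    grouped.set(t.category, names);
  }
  return grouped;
}

// True when at least one entry falls in the advertising category.
function hasAdPixel(techs: TechnologyEntry[]): boolean {
  return techs.some((t) => t.category === "advertising");
}
```

For example, a record holding one cms entry and one advertising entry would yield two groups from namesByCategory and true from hasAdPixel.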

SubUnit (enrichment-populated fields)

| Field | Type | Notes |
| --- | --- | --- |
| addressLat / addressLng | float | Geocoded coordinates of locationAddress. Weekly job. |

DomainEnrichment (per-company pipeline state)

| Field | Type | Notes |
| --- | --- | --- |
| status | string | One of pending, processing, confirmed, ambiguous, rejected |
| confidence | int | 0–100 composite score |
| primaryDomain / confirmedDomains | string | Result set |
| stage | int | Current pipeline stage (0–5) |
| priority | int | Process-order weight; higher first |
| candidates | JSON | Per-candidate signal breakdown (DNS result, HTTP status, org-number found, RDAP holder, parked flag, score) |
| brregWebsiteValid / homepageOrgNr / rdapHolder | mixed | Raw signals recorded for auditability |
| llmConfidence / llmReasoning | int / string | Populated only when a candidate went to LLM adjudication |
| linkedinHandle / facebookHandle / instagramHandle / twitterHandle | string | Handles captured during scraping |
| startedAt / completedAt | datetime | Pipeline run timing |
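
When auditing a decision, it can help to pull the strongest entry out of the candidates JSON. The sketch below assumes a shape for one candidate; the key names are hypothetical, inferred from the signal list in the candidates row, and may not match the stored JSON.

```typescript
// Hypothetical shape of one entry in DomainEnrichment.candidates.
// Key names are assumptions, not the documented schema.
interface Candidate {
  domain: string;
  dnsResolved: boolean;
  httpStatus: number | null;
  orgNumberFound: boolean;
  rdapHolder: string | null;
  parked: boolean;
  score: number;
}

// Return the highest-scoring candidate that resolved and is not parked,
// or null when nothing qualifies.
function bestCandidate(candidates: Candidate[]): Candidate | null {
  let best: Candidate | null = null;
  for (const c of candidates) {
    if (!c.dnsResolved || c.parked) continue;
    if (best === null || c.score > best.score) best = c;
  }
  return best;
}
```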

Endpoints that surface this data

  • GET /api/enrichment — browse the DomainEnrichment table directly. Filter by status (defaults to everything except pending) and minConfidence. Returns per-company decisions, candidates, and social handles.
  • GET /api/companies — the main search endpoint exposes primaryDomain, enrichedDomains, technologies, and social handles. Supports hasTechnology=<category-or-name> and hasAnyAdPixel=true filters.
  • GET /api/companies/:orgNumber — same fields on the single-company response.
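
A minimal sketch of building request URLs for these endpoints follows. The base URL is a placeholder, and the accepted values for each parameter are assumed from the descriptions above rather than from a published schema; authentication is not shown.

```typescript
// Placeholder host; substitute the real API base URL.
const BASE_URL = "https://api.example.invalid";

// Build a GET /api/companies URL with the technology filters listed above.
function companiesUrl(opts: { hasTechnology?: string; hasAnyAdPixel?: boolean }): string {
  const url = new URL("/api/companies", BASE_URL);
  if (opts.hasTechnology) url.searchParams.set("hasTechnology", opts.hasTechnology);
  if (opts.hasAnyAdPixel) url.searchParams.set("hasAnyAdPixel", "true");
  return url.toString();
}

// Build a GET /api/enrichment URL filtered by status and minConfidence.
function enrichmentUrl(opts: { status?: string; minConfidence?: number }): string {
  const url = new URL("/api/enrichment", BASE_URL);
  if (opts.status) url.searchParams.set("status", opts.status);
  if (opts.minConfidence !== undefined) {
    url.searchParams.set("minConfidence", String(opts.minConfidence));
  }
  return url.toString();
}
```

For instance, companiesUrl({ hasTechnology: "cms", hasAnyAdPixel: true }) produces a URL ending in /api/companies?hasTechnology=cms&hasAnyAdPixel=true, which can then be passed to fetch.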

Limitations

  • Not every company has a discoverable domain. Many small Norwegian companies don't have a web presence; these land in rejected with low confidence.
  • Technology detection is signature-based. False positives are possible for white-labelled or self-hosted variants that share signatures with a vendor build.
  • Ad pixels are detected as present or absent within the advertising category. The pixel ID itself is captured in the technologies JSON when the signature exposes it, but it is not cross-referenced to the advertiser's ad account.
  • The pipeline is priority-ordered, not real-time. A newly registered company may wait several days before its turn.
  • Geocoding coverage depends on address quality. Companies without a businessAddressStreet are skipped. Nominatim's usage policy caps requests at 1 per second, so geocoding is batched over weeks, not days.
  • trafficRank is only set for companies with a verified primaryDomain. New companies won't appear in the Tranco list until their domain gains traffic.
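
The 1 request per second constraint on Nominatim can be illustrated with a simple sequential throttle. This is a sketch of the general technique, not the pipeline's actual geocoding job; the task callback, the injectable sleep and clock, and the helper names are all hypothetical.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run tasks one at a time, starting each at least minIntervalMs after the
// previous one started. sleep and now are injectable so the loop can be
// exercised without real delays.
async function runThrottled<T, R>(
  items: T[],
  task: (item: T) => Promise<R>,
  minIntervalMs = 1000,
  sleep: Sleep = realSleep,
  now: () => number = Date.now,
): Promise<R[]> {
  const results: R[] = [];
  let lastStart = -Infinity;
  for (const item of items) {
    const wait = lastStart + minIntervalMs - now();
    if (wait > 0) await sleep(wait);
    lastStart = now();
    results.push(await task(item));
  }
  return results;
}

// Lower bound on wall-clock time for a batch: the first request starts
// immediately, and every later one waits a full interval.
function minBatchDurationMs(count: number, minIntervalMs = 1000): number {
  return Math.max(0, count - 1) * minIntervalMs;
}
```

minBatchDurationMs gives a quick lower bound for planning how many addresses fit in one batch window.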

Gotchas

  • Freshness varies per company. Check DomainEnrichment.completedAt and Company.techSyncedAt / sslSansSyncedAt rather than assuming a common timestamp.
  • confirmed is not a permanent verdict — a company that changes domain will stay on the old one until the next pass. Re-run enrichment if you need a re-verification.
  • rejected means the pipeline couldn't find a confident match, not that the company has no website. Brreg's website field may still hold something plausible — brregWebsiteValid records whether that particular URL verified.
  • Standard robots.txt is respected. Domains that return 403 on the homepage are recorded as unfetchable and get no tech scan.
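
The freshness check in the first gotcha could look something like the sketch below. It assumes the three timestamps arrive as ISO 8601 strings and have been merged into one object by the caller (completedAt lives on DomainEnrichment, the other two on Company); the helper name and the age threshold are hypothetical.

```typescript
// Timestamps relevant to freshness, merged from DomainEnrichment and Company.
// String-typed ISO 8601 values are an assumption about the API response.
interface FreshnessFields {
  completedAt?: string | null;
  techSyncedAt?: string | null;
  sslSansSyncedAt?: string | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the names of timestamps that are missing, unparseable, or older
// than maxAgeDays relative to now.
function staleFields(rec: FreshnessFields, maxAgeDays: number, now: Date = new Date()): string[] {
  const keys: (keyof FreshnessFields)[] = ["completedAt", "techSyncedAt", "sslSansSyncedAt"];
  return keys.filter((key) => {
    const value = rec[key];
    if (!value) return true;
    const ts = Date.parse(value);
    return Number.isNaN(ts) || now.getTime() - ts > maxAgeDays * DAY_MS;
  });
}
```

A caller might treat any non-empty result as a cue to inspect that part of the record before relying on it.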