Canonical
Enrichment Pipeline
Orakel's in-house enrichment pipeline — domains, tech stacks, social handles, and ad pixels derived from public web signals.
- Source: Orakel enrichment pipeline (in-house; derived signals, not a third-party feed)
- Data: Verified company domains, detected technologies, ad pixels, social handles
- License: Orakel-generated derived data. Source web pages retain their own licenses.
- Attribution required: No (Orakel generates this data)
- Link: N/A (internal pipeline)
- Update cadence: Daily chunks (morning/midday/evening Mon–Sat) plus a weekly seed on Sundays; technology detection runs as its own daily pass
What it is
Orakel enriches company records with signals derived from the public web. This does not use paid third-party data — no BuiltWith, no Clearbit, no ZoomInfo. The paid-data policy (lib/countries.ts) rules out subscription feeds until a paying customer's revenue covers one.
The pipeline runs in stages:
- Candidate generation. Start from Brreg's registered website and expand with heuristics derived from the company name.
- DNS + HTTP probes. Resolve each candidate; fetch the homepage.
- Signal extraction. Look for the company's org number on the homepage, look up the domain holder in NORID's RDAP (for `.no` domains), and compare against the registered company name. Crawl `/robots.txt` and `/sitemap.xml` for platform hints.
- Scoring. A composite confidence score (0–100) combines the signals. High-confidence matches land in `confirmed`; weaker ones go to `ambiguous` for LLM review; the rest are `rejected`.
- Technology + ad-pixel detection. Once a domain is confirmed, a separate daily pass fetches the homepage and matches HTML/script signatures for CMS, analytics, chat, marketing, ecommerce, framework, and advertising pixels (Meta, Google Tag Manager, Snap, TikTok, Twitter/X, plus LinkedIn Insight).
- Social handles. Extracted from homepage outbound links — LinkedIn company page, Facebook, Instagram, X/Twitter.
- SSL SAN scan. A nightly crt.sh scan pulls certificate Subject Alternative Names for confirmed domains, surfacing sibling domains on the same certificate.
- Traffic ranking. Weekly download of the Tranco top-1M domain list. Sets `Company.trafficRank` (1 = highest) for every company whose `primaryDomain` appears in the list.
- Address geocoding. A weekly batch geocodes `businessAddress` into `(businessAddressLat, businessAddressLng)` using country-appropriate APIs: Kartverket (`ws.geonorge.no`, NO), DAWA (`api.dataforsyningen.dk`, DK), Nominatim (`nominatim.openstreetmap.org`, SE/FI). Sub-unit `locationAddress` is also geocoded into `(addressLat, addressLng)`.
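The scoring stage can be pictured with a minimal TypeScript sketch. The signal field names, the weights, and the 80/50 cut-offs below are illustrative assumptions, not Orakel's actual values; only the 0–100 range and the three outcomes (`confirmed`, `ambiguous`, `rejected`) come from the description above.

```typescript
// Hypothetical signal set for one candidate domain; field names are illustrative.
interface CandidateSignals {
  resolves: boolean;          // DNS lookup succeeded
  httpOk: boolean;            // homepage fetch returned a usable page
  orgNrOnHomepage: boolean;   // the company's org number appears on the homepage
  rdapHolderMatches: boolean; // RDAP holder matches the registered company name
  parked: boolean;            // page looks like a parked domain
}

type Verdict = "confirmed" | "ambiguous" | "rejected";

// Assumed weights; they sum to 100 so the result stays in the 0–100 range.
const WEIGHTS = { resolves: 10, httpOk: 15, orgNrOnHomepage: 45, rdapHolderMatches: 30 };

function scoreCandidate(s: CandidateSignals): number {
  // Assumption: an unresolvable or parked domain carries no evidence at all.
  if (!s.resolves || s.parked) return 0;
  let score = WEIGHTS.resolves;
  if (s.httpOk) score += WEIGHTS.httpOk;
  if (s.orgNrOnHomepage) score += WEIGHTS.orgNrOnHomepage;
  if (s.rdapHolderMatches) score += WEIGHTS.rdapHolderMatches;
  return score;
}

// Assumed cut-offs: 80 and above is confirmed, 50 to 79 goes to LLM review.
function classify(score: number): Verdict {
  if (score >= 80) return "confirmed";
  if (score >= 50) return "ambiguous";
  return "rejected";
}
```

Under these assumed weights, a candidate that resolves, serves a homepage, and matches the RDAP holder but shows no org number scores 55 and would be routed to `ambiguous`.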
Fields provided
Company (enrichment-populated fields)
| Field | Type | Notes |
|---|---|---|
| `primaryDomain` | string | The single best-match verified domain for this company |
| `enrichedDomains` | string[] | All verified domains (includes `primaryDomain`) |
| `brregWebsiteValid` | bool | Whether Brreg's registered website survived verification |
| `linkedinHandle` / `facebookHandle` / `instagramHandle` / `twitterHandle` | string | Social handles extracted from the homepage |
| `technologies` | JSON | Array of `{ name, category, confidence, detectedAt }`. Categories include `cms`, `analytics`, `chat`, `marketing`, `ecommerce`, `framework`, `advertising` |
| `techSyncedAt` | datetime | Last technology scan |
| `sslSansSyncedAt` | datetime | Last crt.sh SAN scan |
| `description` | string | Business activity text from the registry. See Fields → Company for per-country sources. |
| `descriptionKeywords` | string[] | Top-10 keywords extracted from `description` by Orakel |
| `trafficRank` / `trafficRankUpdatedAt` | int / datetime | Tranco top-1M rank, updated weekly |
| `businessAddressLat` / `businessAddressLng` | float | Geocoded coordinates; see geocoding note in pipeline step 9 above |
| `digitalizationIndex` | number 0–1 | Computed at read time from `technologies` |
| `socialIndex` | number 0–1 | Computed at read time from social handles |
| `sizeClass` | string | Computed at read time from `employeeCount` |
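For consumers of the `technologies` JSON, a typed view can make the array easier to work with. The sketch below assumes `confidence` is numeric and `detectedAt` is an ISO-8601 string (the table does not state either), and the two helper functions are hypothetical client-side analogues of the `hasTechnology` and `hasAnyAdPixel` filters, not the server's implementation.

```typescript
// One entry in Company.technologies, following the { name, category, confidence, detectedAt } shape.
interface TechnologyEntry {
  name: string;
  category: string;   // e.g. "cms", "analytics", "advertising"; the documented list may not be exhaustive
  confidence: number; // assumption: numeric
  detectedAt: string; // assumption: ISO-8601 timestamp
}

// Hypothetical helper: true when any entry falls in the "advertising" category.
function hasAnyAdPixel(techs: TechnologyEntry[]): boolean {
  return techs.some((t) => t.category === "advertising");
}

// Hypothetical helper: match on category or on name, case-insensitively.
function hasTechnology(techs: TechnologyEntry[], needle: string): boolean {
  const n = needle.toLowerCase();
  return techs.some((t) => t.category.toLowerCase() === n || t.name.toLowerCase() === n);
}

// Made-up sample data for illustration only.
const sampleTechnologies: TechnologyEntry[] = [
  { name: "WordPress", category: "cms", confidence: 90, detectedAt: "2024-06-01T06:00:00Z" },
  { name: "Meta Pixel", category: "advertising", confidence: 80, detectedAt: "2024-06-01T06:00:00Z" },
];
```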
SubUnit (enrichment-populated fields)
| Field | Type | Notes |
|---|---|---|
| `addressLat` / `addressLng` | float | Geocoded coordinates of `locationAddress`. Weekly job. |
DomainEnrichment (per-company pipeline state)
| Field | Type | Notes |
|---|---|---|
| `status` | string | One of `pending`, `processing`, `confirmed`, `ambiguous`, `rejected` |
| `confidence` | int | 0–100 composite score |
| `primaryDomain` / `confirmedDomains` | string | Result set |
| `stage` | int | Current pipeline stage (0–5) |
| `priority` | int | Process-order weight; higher first |
| `candidates` | JSON | Per-candidate signal breakdown (DNS result, HTTP status, org-number found, RDAP holder, parked flag, score) |
| `brregWebsiteValid` / `homepageOrgNr` / `rdapHolder` | mixed | Raw signals recorded for auditability |
| `llmConfidence` / `llmReasoning` | int / string | Populated only when a candidate went to LLM adjudication |
| `linkedinHandle` / `facebookHandle` / `instagramHandle` / `twitterHandle` | string | Handles captured during scraping |
| `startedAt` / `completedAt` | datetime | Pipeline run timing |
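A client that reads `DomainEnrichment` rows may want to narrow `status` to the five documented values and tell apart records that went through LLM adjudication. The following is a sketch only: the interface covers a subset of the fields, and representing unpopulated values as `null` is an assumption about the wire format.

```typescript
type EnrichmentStatus = "pending" | "processing" | "confirmed" | "ambiguous" | "rejected";

const STATUSES: readonly EnrichmentStatus[] = [
  "pending", "processing", "confirmed", "ambiguous", "rejected",
];

// Partial, hypothetical view of a DomainEnrichment row.
interface DomainEnrichmentSummary {
  status: EnrichmentStatus;
  confidence: number;           // 0–100 composite score
  llmConfidence: number | null; // assumption: null unless a candidate went to LLM adjudication
}

// Type guard for untrusted input such as a query parameter or a raw JSON field.
function isEnrichmentStatus(value: string): value is EnrichmentStatus {
  return (STATUSES as readonly string[]).indexOf(value) !== -1;
}

// True when the LLM-specific fields were populated for this record.
function wasLlmAdjudicated(record: DomainEnrichmentSummary): boolean {
  return record.llmConfidence !== null;
}
```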
Endpoints that surface this data
- `GET /api/enrichment`: browse the `DomainEnrichment` table directly. Filter by `status` (defaults to everything except `pending`) and `minConfidence`. Returns per-company decisions, candidates, and social handles.
- `GET /api/companies`: the main search endpoint exposes `primaryDomain`, `enrichedDomains`, `technologies`, and social handles. Supports `hasTechnology=<category-or-name>` and `hasAnyAdPixel=true` filters.
- `GET /api/companies/:orgNumber`: same fields on the single-company response.
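As an illustration of how the filters combine into request URLs, here is a small TypeScript sketch. The base URL is a placeholder and `buildUrl` is a hypothetical helper; only the paths and parameter names (`status`, `minConfidence`, `hasTechnology`, `hasAnyAdPixel`) are taken from the endpoint descriptions.

```typescript
// Placeholder host; substitute the real API base URL.
const BASE_URL = "https://api.example.com";

type QueryValue = string | number | boolean | undefined;

// Builds a URL from a path and a set of optional query parameters.
// Parameters whose value is undefined are left out.
function buildUrl(path: string, params: { [key: string]: QueryValue }): string {
  const parts: string[] = [];
  for (const key of Object.keys(params)) {
    const value = params[key];
    if (value === undefined) continue;
    parts.push(encodeURIComponent(key) + "=" + encodeURIComponent(String(value)));
  }
  return BASE_URL + path + (parts.length ? "?" + parts.join("&") : "");
}

// Confirmed enrichment decisions at or above an arbitrary example threshold.
const confirmedUrl = buildUrl("/api/enrichment", { status: "confirmed", minConfidence: 80 });

// Companies with at least one detected advertising pixel.
const adPixelUrl = buildUrl("/api/companies", { hasAnyAdPixel: true });

// Companies with any technology in the "cms" category.
const cmsUrl = buildUrl("/api/companies", { hasTechnology: "cms" });
```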
Limitations
- Not every company has a discoverable domain. Many small Norwegian companies don't have a web presence; these land in `rejected` with low confidence.
- Technology detection is signature-based. False positives are possible for white-labelled or self-hosted variants that share signatures with a vendor build.
- Ad pixels are detected as present/absent per category (`advertising`). The pixel ID value itself is captured in the `technologies` JSON when the signature exposes it, but there is no cross-reference back to the advertiser's ad account.
- The pipeline is priority-ordered, not real-time. A newly registered company may wait several days before its turn.
- Geocoding coverage depends on address quality. Companies without a `businessAddressStreet` are skipped. Nominatim's usage policy limits clients to 1 request per second, so geocoding is batched over weeks, not days.
- `trafficRank` is only set for companies with a verified `primaryDomain`. New companies won't appear in the Tranco list until their domain gains traffic.
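The geocoding constraints above can be sketched as a small routing table plus a back-of-the-envelope duration estimate. The function names are hypothetical; the hosts and the 1 request/second Nominatim limit come from this page, and `null` marks services for which no limit is stated here.

```typescript
interface Geocoder {
  name: string;
  host: string;
  minIntervalMs: number | null; // null: no rate limit stated in this document
}

// Country-to-service mapping as listed in the pipeline description.
function geocoderFor(country: string): Geocoder | null {
  switch (country.toUpperCase()) {
    case "NO":
      return { name: "Kartverket", host: "ws.geonorge.no", minIntervalMs: null };
    case "DK":
      return { name: "DAWA", host: "api.dataforsyningen.dk", minIntervalMs: null };
    case "SE":
    case "FI":
      return { name: "Nominatim", host: "nominatim.openstreetmap.org", minIntervalMs: 1000 };
    default:
      return null; // no geocoder listed for this country
  }
}

// Mirrors the rule that companies without a businessAddressStreet are skipped.
function shouldGeocode(company: { businessAddressStreet?: string | null }): boolean {
  const street = company.businessAddressStreet;
  return typeof street === "string" && street.trim().length > 0;
}

// Lower bound on wall-clock hours for a rate-limited batch of sequential requests.
function minBatchHours(addressCount: number, minIntervalMs: number): number {
  return (addressCount * minIntervalMs) / 3600000;
}
```

As a worked example with a made-up volume: one million addresses at one request per second need at least `minBatchHours(1000000, 1000)` ≈ 278 hours, or roughly 11.6 days of uninterrupted requests, which is why a weekly batch job spreads the work over weeks.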
Gotchas
- Freshness varies per company. Check `DomainEnrichment.completedAt` and `Company.techSyncedAt` / `sslSansSyncedAt` rather than assuming a common timestamp.
- `confirmed` is not a permanent verdict: a company that changes domain will stay on the old one until the next pass. Re-run enrichment if you need a re-verification.
- `rejected` means the pipeline couldn't find a confident match, not that the company has no website. Brreg's `website` field may still hold something plausible; `brregWebsiteValid` records whether that particular URL verified.
- Standard `robots.txt` is respected. Domains that return 403 on the homepage are recorded as unfetchable and get no tech scan.
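The per-signal freshness check from the first gotcha could look like the sketch below. It assumes the three timestamps arrive as ISO-8601 strings (or `null` when a pass has never run); `staleSignals` and the age threshold are hypothetical.

```typescript
// Assumption: timestamps are ISO-8601 strings, or null when a pass has not run yet.
interface FreshnessInput {
  completedAt: string | null;     // DomainEnrichment.completedAt
  techSyncedAt: string | null;    // Company.techSyncedAt
  sslSansSyncedAt: string | null; // Company.sslSansSyncedAt
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the names of signals that are missing or older than maxAgeDays.
function staleSignals(input: FreshnessInput, now: Date, maxAgeDays: number): string[] {
  const stale: string[] = [];
  const check = (name: string, timestamp: string | null): void => {
    if (timestamp === null || now.getTime() - new Date(timestamp).getTime() > maxAgeDays * DAY_MS) {
      stale.push(name);
    }
  };
  check("completedAt", input.completedAt);
  check("techSyncedAt", input.techSyncedAt);
  check("sslSansSyncedAt", input.sslSansSyncedAt);
  return stale;
}
```

Checking each timestamp separately matters because domain verification, the technology pass, and the SAN scan run on different schedules, so one recent value says nothing about the other two.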