Tags: security, AI, phishing, threat detection, engineering


2025-12-06 · URLert Security Team

Anatomy of a Domain Risk Engine: Why Regex Isn't Enough (And LLMs Are Too Slow)

Building a URL scanner in 2025 is an exercise in managing trade-offs.

If you rely solely on traditional methods like Regex and Blocklists, you miss sophisticated attacks (False Negatives). If you send every single URL to a massive, general-purpose Large Language Model (LLM), you will go bankrupt: token costs simply don't scale for high-volume scanning.

At URLert, we filter tens of thousands of URLs monthly for developers and security teams. To balance millisecond-level speed with human-level reasoning, we moved away from a monolithic "AI Scanner" and built a Hybrid Intelligence Pipeline.

Here is the engineering deep dive into how we architected a system that escalates threat detection from static analysis to AI agents only when necessary.

The Architecture: The "Escalation Ladder"

We treat URL scanning like a triage unit. Not every URL needs a brain surgeon; some just need a thermometer. Our pipeline runs in three distinct layers, orchestrated asynchronously.
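
Conceptually, the control flow is a short asynchronous pipeline. Here is a minimal sketch in Python with toy stand-ins for the three layers; the real signatures and return values differ, and each layer is sketched in more detail below:

import asyncio

# Toy stand-ins for the three layers described in this post.
async def layer1_static(domain: str) -> dict:
    return {"is_nrd": False, "death_flags": set()}

async def layer2_heuristics(domain: str) -> dict:
    return {"needs_llm": False}

async def layer3_ai_judge(domain: str, context: dict) -> dict:
    return {"is_typosquatting": False}

async def scan(domain: str) -> dict:
    # Layers 1 and 2 are cheap, so they run concurrently.
    static, heuristics = await asyncio.gather(
        layer1_static(domain), layer2_heuristics(domain)
    )
    verdict = {**static, **heuristics}
    # The expensive LLM layer fires only when a cheap layer flags an anomaly.
    if verdict.get("needs_llm"):
        verdict |= await layer3_ai_judge(domain, verdict)
    return verdict

print(asyncio.run(scan("example.com")))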

Layer 1: The Static Net (Speed: <50ms)

The first line of defense is purely deterministic. We don't need an LLM to tell us that a domain is dead or that a certificate is invalid. We run these checks in parallel to keep total request time low.

1. Homoglyph Normalization
Attackers swap in lookalike characters: Cyrillic letters that render like their Latin twins, digits in place of letters (paypa1.com), or a capital "I" for a lowercase "l" (googIe.com). A standard Regex for google.com would miss these, so we implement strict Punycode decoding and Unicode normalization before any analysis happens.
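
A minimal sketch of that normalization, assuming Python's built-in idna codec and a toy confusables table (a production map, e.g. one derived from Unicode TR39, is far larger):

import unicodedata

# Toy confusables table: visually ambiguous characters folded to a canonical form.
CONFUSABLES = {
    "а": "a",  # Cyrillic а
    "е": "e",  # Cyrillic е
    "о": "o",  # Cyrillic о
    "р": "p",  # Cyrillic р
    "1": "l",
    "0": "o",
}

def skeleton(domain: str) -> str:
    """Decode Punycode, normalize Unicode, and fold confusables for comparison."""
    try:
        domain = domain.encode("ascii").decode("idna")  # "xn--..." -> Unicode
    except UnicodeError:
        pass  # already Unicode (or malformed): analyze as-is
    domain = unicodedata.normalize("NFKC", domain).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain)

print(skeleton("paypa1.com"))  # paypal.com
print(skeleton("gооgle.com"))  # google.com (Cyrillic о folded to Latin o)

Comparing skeleton(candidate) against skeleton(brand) catches both Punycode homoglyphs and the digit swaps above.
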
2. Network Reality Checks (RDAP & WHOIS)
Before analyzing content, we analyze infrastructure. We query the registry database (RDAP/WHOIS) to catch "burner" domains before they even load; the checks below are sketched in code after the list.

  • Domain Age & Registration: We immediately flag Newly Registered Domains (NRDs). A domain registered 2 hours ago claiming to be "PayPal Support" is guilty until proven innocent.
  • EPP Status Codes: We check for "Death Flags" in the domain status. Statuses like clientHold or serverHold often mean the registrar has already suspended the domain due to abuse reports, even if the website is technically still resolving.
  • MX Records: If a domain claims to be a major corporate service but has no Mail Exchange (MX) records configured, it's a strong signal of disposable phishing infrastructure.
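
A minimal sketch of those three checks, assuming the public rdap.org redirect service plus the requests and dnspython packages; the 30-day NRD cutoff is illustrative, not our production threshold:

from datetime import datetime, timezone

import dns.exception
import dns.resolver
import requests

def rdap_signals(domain: str) -> dict:
    # rdap.org redirects to the authoritative registry's RDAP server.
    data = requests.get(f"https://rdap.org/domain/{domain}", timeout=5).json()
    created = next(
        (e["eventDate"] for e in data.get("events", [])
         if e.get("eventAction") == "registration"),
        None,
    )
    age_days = None
    if created:
        registered_at = datetime.fromisoformat(created.replace("Z", "+00:00"))
        age_days = (datetime.now(timezone.utc) - registered_at).days
    # RDAP spells EPP's clientHold as "client hold"; normalize before matching.
    status = {s.replace(" ", "").lower() for s in data.get("status", [])}
    return {
        "age_days": age_days,
        "is_nrd": age_days is not None and age_days < 30,  # illustrative cutoff
        "death_flags": status & {"clienthold", "serverhold"},
    }

def has_mx(domain: str) -> bool:
    try:
        return bool(dns.resolver.resolve(domain, "MX"))
    except dns.exception.DNSException:
        return False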

Layer 2: The Heuristic Filter (Speed: ~100ms)

Before we ask an AI for an opinion, we run a series of specialized algorithmic detectors. These catch the "noisy" threats without costing a single LLM token.

1. Mathematical Gibberish Detection
Instead of asking an LLM "Is this random?", we first run a statistical analysis on the domain string:

  • Shannon Entropy: We calculate the randomness of the character distribution.
  • Consonant/Vowel Ratio: We flag unpronounceable strings (e.g., zxcvbnm).

Only if the math indicates "High Entropy" do we escalate to the AI for verification (see Layer 3).
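
A minimal sketch of this statistical gate; both thresholds are illustrative, not our production cutoffs:

import math
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(s: str) -> float:
    # Entropy of the character distribution, in bits per character.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_random(label: str) -> bool:
    letters = [ch for ch in label.lower() if ch.isalpha()]
    if not letters:
        return False
    vowel_ratio = sum(ch in VOWELS for ch in letters) / len(letters)
    return shannon_entropy(label.lower()) > 3.5 or vowel_ratio < 0.15

print(looks_random("zxcvbnm"))  # True: zero vowels
print(looks_random("google"))   # False: pronounceable, low entropy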

2. Algorithmic Typosquatting
We catch "dumb" impersonations purely with code. We maintain a database of major brands (Google, Microsoft, etc.) and check for two patterns, both sketched in code after the list:

  • TLD Squatting: Detecting google.net or google.org when the official domain is .com.
  • Visual Lookalikes: Detecting faceb00k.com or linked1n.com using Levenshtein distance and character substitution maps.
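
A minimal sketch of both checks, with an inline edit-distance routine and a toy brand table; production would use a dedicated Levenshtein library and a much larger database:

BRANDS = {"google": "com", "facebook": "com", "linkedin": "com"}
SUBS = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def typosquat_verdict(domain: str) -> str | None:
    label, _, tld = domain.rpartition(".")
    folded = label.translate(SUBS)  # faceb00k -> facebook
    for brand, official_tld in BRANDS.items():
        if label == brand and tld != official_tld:
            return f"TLD squat on {brand}"
        if label != brand and edit_distance(folded, brand) <= 1:
            return f"visual lookalike of {brand}"
    return None

print(typosquat_verdict("google.net"))    # TLD squat on google
print(typosquat_verdict("linked1n.com"))  # visual lookalike of linkedin
print(typosquat_verdict("google.com"))    # None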

3. Infrastructure Abuse (Shorteners & Public Platforms)
Hackers often hide behind legitimate infrastructure to bypass reputation filters. We treat these as "high-risk carriers" that always require further investigation; the flagging logic is sketched after the list:

  • Public Platforms: We flag user-generated subdomains on platforms like github.io, vercel.app, or canva.com. While the platform root is safe, a subdomain like crypto-login.github.io is flagged for deep scanning.
  • URL Shorteners: We maintain a registry of shortener services (e.g., bit.ly, tinyurl.com). Since these obscure the final destination, we flag them immediately to ensure our advanced capabilities follow the redirect chain to the true payload.
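
A minimal sketch of the carrier check; the platform and shortener lists are toy subsets, and the naive root-domain split would be a Public Suffix List lookup in production:

PUBLIC_PLATFORMS = {"github.io", "vercel.app", "canva.com"}
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}

def carrier_flags(host: str) -> set[str]:
    root = ".".join(host.split(".")[-2:])  # naive eTLD+1; use a PSL in production
    flags = set()
    if root in PUBLIC_PLATFORMS and host != root:
        flags.add("user_subdomain_on_public_platform")  # e.g., crypto-login.github.io
    if root in SHORTENERS:
        flags.add("url_shortener")  # follow the redirect chain before judging
    return flags

print(carrier_flags("crypto-login.github.io"))  # {'user_subdomain_on_public_platform'}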

4. Suspicious Keyword Extraction
We scan subdomains for high-risk keywords like login, secure, verify, or update. A domain like photos.google.com is fine, but login-secure-update.com triggers an immediate alert.
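
A minimal sketch of the keyword scan; here we tokenize the whole hostname on dots and hyphens, and the keyword set is a toy subset:

RISK_KEYWORDS = {"login", "secure", "verify", "update"}

def keyword_hits(domain: str) -> set[str]:
    # "login-secure-update.com" -> {"login", "secure", "update", "com"}
    tokens = set(domain.lower().replace(".", "-").split("-"))
    return RISK_KEYWORDS & tokens

print(keyword_hits("photos.google.com"))        # set()
print(keyword_hits("login-secure-update.com"))  # {'login', 'secure', 'update'}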

Layer 3: The AI Judge (Speed: 1s+)

This is the most expensive layer, so we treat it as our "Supreme Court." We only escalate to the LLM when the static code flags an anomaly that requires context.

1. The Gibberish Verifier
If Layer 2 flags xiaomi.com as "High Entropy," the LLM steps in. It understands that "Xiaomi" is a global brand, not a random DGA (domain generation algorithm) string. We use a strict Anti-Bias System Prompt; the verifier call itself is sketched after the prompt:
ANTI-BIAS INSTRUCTIONS:

  • DO NOT flag non-English domains as gibberish.
  • DO NOT flag technical hostnames (e.g., 'db-prod-01').
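
A minimal sketch of the verifier call; we assume the OpenAI Python SDK with JSON-mode output purely for illustration, and the model name and exact prompt wording are placeholders:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You verify whether a domain label is random (DGA-like) or meaningful.
ANTI-BIAS INSTRUCTIONS:
- DO NOT flag non-English domains as gibberish.
- DO NOT flag technical hostnames (e.g., 'db-prod-01').
Respond with JSON: {"is_gibberish": <bool>, "reasoning": <string>}"""

def verify_gibberish(label: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Domain label: {label}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

print(verify_gibberish("xiaomi"))  # e.g., {"is_gibberish": false, "reasoning": "..."}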

2. The Intent Analyzer (Combosquatting)
A prime example is Combosquatting—combining a brand name with a generic keyword.

  • The Scenario: paypal-secure-login.com vs. paypal-sdk-docs.com.
  • The Problem: A static keyword search sees "paypal" in both. Blocking both breaks legitimate tools.
  • The Solution: We feed the URL parts to our Typosquat Detection Agent.

This agent uses a prompt designed to evaluate intent. It understands that login-verify implies a credential harvest (Suspicious), while cdn, assets, or docs implies infrastructure (Likely Benign).

We also enforce Structured Output. We never ask the LLM for a chatty response; we demand a strict JSON object (a validation sketch follows the example):

{
  "is_typosquatting": true,
  "brand_name": "paypal",
  "technique": "combosquatting",
  "reasoning": "Brand 'paypal' combined with suspicious keyword 'login'."
}
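
On our side, the response is parsed against a schema rather than trusted blindly. A minimal sketch, assuming pydantic v2 (the fallback behavior is illustrative):

from pydantic import BaseModel, ValidationError

class TyposquatVerdict(BaseModel):
    is_typosquatting: bool
    brand_name: str
    technique: str
    reasoning: str

def parse_verdict(raw: str) -> TyposquatVerdict | None:
    try:
        return TyposquatVerdict.model_validate_json(raw)
    except ValidationError:
        return None  # caller re-prompts or falls back to the heuristic score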

Beyond the Domain: When to Go Deeper

This entire pipeline focuses on Domain Risk—evaluating the address of the website. For 90% of automated threats, this is enough.

However, a perfectly legitimate domain can still host malicious content (e.g., a hacked WordPress site). For these cases, we flag the domain as "Suspicious" in our API response and recommend escalating to our Threat Detection (Deep Scan) endpoint.

This advanced scan spins up a headless browser to analyze the actual page content, visual rendering, and JavaScript execution—but that's a topic for our next engineering deep dive.

The Result: Why "Hybrid" Wins

By implementing this Priority Cascade, we achieve the best of both worlds:

  1. Cost Efficiency: 90% of threats are caught by the cheap static layer.
  2. High Accuracy: The remaining 10% of "edge cases" get the full attention of a specialized AI agent.

Security isn't just about using the newest AI model; it's about knowing when not to use it.

If you are a developer building an application that deals with user-generated links, you can test this hybrid pipeline yourself at URLert Developer.
