May 15, 2024 · 6 min read · Updated Apr 05, 2026

The Hybrid Search Engine: Combining Lexical and Semantic Ranks

OSINT relevance is multi-modal. A technical exploration of why keywords fail and how to fuse BM25 with Vector Embeddings for operator-grade retrieval.

Written by
Ben Moataz

Systems Architect, Consultant, and Product Builder

Independent systems architect helping teams turn intelligence, evidence, and automation workflows into reliable products and clearer operating decisions.

Why I'm qualified to write this

This article is grounded in hands-on work on correlation and scoring, including systems such as Viralink, SOVRINT, and TraxinteL.

I write from hands-on work across product systems, evidence pipelines, ranking layers, monitoring surfaces, and automation runtimes that have to stay reliable under operational pressure.

  • Years spent building product systems, automation infrastructure, and operator-facing platforms.
  • Project records and case studies tied directly to the same capability lanes discussed in the writing.
  • A public archive designed to connect essays back to real systems, delivery constraints, and consulting work.

In the world of intelligence operations, search is not a feature—it is the filtering mechanism for reality. If an operator cannot find the signal, the signal does not exist.

For decades, we relied on Lexical Search. We matched keywords, used Boolean operators, and hoped our targets hadn’t used a synonym or a slightly different encoding. Then came the “AI Revolution,” and the industry pivoted hard toward Semantic Search using vector embeddings.

But here is the hard-won truth: In high-fidelity OSINT, either approach on its own is a failure.

If you rely solely on lexical search, you miss the “conceptual” links. If you rely solely on semantic search, you lose the “exact” precision required for identifiers like usernames, hashes, or specific jargon.

To build an operator-grade intelligence platform, you must build a Hybrid Search Engine. This is the retrieval problem behind my correlation and scoring work, and it shows up directly in products like Viralink and TraxinteL. This essay explores the technical architecture of ranking fusion and why multi-modal retrieval is the only way to survive the data deluge.

Diagram showing a query splitting into lexical and semantic retrieval branches that later merge through reciprocal rank fusion.
Hybrid retrieval matters because operators ask two different things at once: "show me this exact identifier" and "show me the surrounding meaning I would have missed with keywords alone."

1. Why Keywords Fail the Operator

Lexical search (commonly implemented using algorithms like BM25) is incredible at finding exact strings. If I search for uuid:7a1b-42c9, I want that exact string. I don’t want a “semantically similar” UUID.
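To make the lexical side concrete, here is a minimal, self-contained BM25 scoring sketch in pure Python. The corpus, tokenization, and parameter values (k1 = 1.5, b = 0.75 are common defaults) are illustrative, not taken from any particular engine; a production stack would use an inverted index rather than scanning every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each pre-tokenized document against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

docs = [
    ["uuid:7a1b-42c9", "observed", "in", "paste"],   # exact identifier present
    ["uuid:9f3c-11aa", "observed", "in", "forum"],   # different identifier
]
print(bm25_scores(["uuid:7a1b-42c9"], docs))
```

Note the behavior the operator wants: the document containing the exact identifier scores above zero, and the "similar looking" UUID scores exactly zero. There is no fuzziness to leak in.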

However, OSINT targets are rarely that cooperative. They use euphemisms. They change their terminology. They communicate in code.

The Problem of Vocabulary Mismatch

Imagine searching for “Money Laundering.” A lexical engine looks for those specific words. But an adversary might be discussing “smurfing,” “layering,” or “structuring.” Unless your keyword list is exhaustive (it never is), the lexical engine returns zero results.

The Problem of Contextual Noise

Keywords thrive on frequency. If a document mentions “Apple” fifty times, it ranks high for “Apple.” But if the operator is looking for “Apple” the company in the context of “supply chain disruption,” a keyword search might return a thousand articles about fruit or iPhone reviews. Lexical search has no concept of intent.


2. The Semantic Promise (and Its Pitfalls)

Semantic search solves the vocabulary problem by representing text as high-dimensional vectors (embeddings). By calculating the cosine similarity between the “Search Vector” and the “Document Vector,” we can find documents that are conceptually related, even if they share zero keywords.

A semantic search for “Financial Obfuscation” will naturally surface documents about “Money Laundering,” “Offshore Accounts,” and “Shell Companies.”
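The similarity computation itself is simple. Below is a sketch using toy 3-dimensional vectors; real embedding models emit hundreds or thousands of dimensions, and the specific values here are invented purely to illustrate the geometry.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" (hypothetical values for illustration only)
query_vec = [0.9, 0.1, 0.3]          # "financial obfuscation"
laundering_vec = [0.85, 0.15, 0.35]  # "money laundering"
fruit_vec = [0.1, 0.9, 0.05]         # "apple orchard yields"

print(cosine_similarity(query_vec, laundering_vec))  # close to 1.0
print(cosine_similarity(query_vec, fruit_vec))       # much lower
```

The two conceptually related phrases share zero keywords, yet their vectors point in nearly the same direction.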

The Precision Problem

Semantic search is “fuzzy” by design. It is brilliant at finding “things like this.” But in intelligence work, we often need “this exact thing.”

If an operator searches for a specific leaked password hash, a semantic engine might return “similar looking hashes” or “documents discussing password security.” For an investigator, this is worse than useless—it is a distraction.

The Density Problem

Standard vector models (like BERT or Ada) are trained on general web text. They often struggle with the hyper-specific, technical, or adversarial language used in specialized OSINT domains (e.g., dark web markets, malware analysis, or military logistics).


3. Architecture of the Hybrid Engine

A Hybrid Search Engine doesn’t choose between Lexical and Semantic; it executes both in parallel and fuses the results.

Step 1: Parallel Retrieval

When a query enters the system (e.g., via our OpenSearch or Pinecone stack), it is branched:

  1. Branch A (Lexical): A BM25 query is run against the inverted index. This captures exact matches, usernames, and specific identifiers.
  2. Branch B (Semantic): The query is converted into a vector by an embedding model and a k-Nearest Neighbors (k-NN) search is run against the vector store. This captures themes, intent, and synonyms.
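The branching above can be sketched with standard-library concurrency. The `bm25_search`, `knn_search`, and `embed` callables are hypothetical stand-ins for real clients (e.g. an OpenSearch index and a vector store); only the fan-out pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search clients; each returns
# document IDs ordered by relevance for its own modality.
def bm25_search(query, top_k=50):
    return ["doc-7", "doc-3", "doc-9"]      # exact-match hits

def knn_search(query_vector, top_k=50):
    return ["doc-3", "doc-12", "doc-7"]     # conceptual hits

def hybrid_retrieve(query, embed):
    """Run the lexical and semantic branches in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical = pool.submit(bm25_search, query)
        semantic = pool.submit(knn_search, embed(query))
        return lexical.result(), semantic.result()

lex, sem = hybrid_retrieve("uuid:7a1b-42c9", embed=lambda q: [0.1, 0.2])
```

Running the branches concurrently means the hybrid query costs roughly as much latency as the slower of the two, not the sum.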

Step 2: Reciprocal Rank Fusion (RRF)

The challenge is that BM25 scores (e.g., 14.5) and Cosine Similarity scores (e.g., 0.89) are not comparable. We cannot just add them together.

Instead, we use Reciprocal Rank Fusion. We don’t look at the scores; we look at the positions.

  • If a document is #1 in Lexical and #100 in Semantic, it gets a high fused score.
  • If a document is #5 in both, it often outranks a document that is #1 in only one of them.

This ensures that “Exact Matches” are preserved while “Conceptual Matches” are promoted.
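The standard RRF formula assigns each document the sum of 1/(k + rank) over every list it appears in, with k = 60 as the commonly cited default. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs by position, not raw score."""
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

lexical = ["doc-A", "doc-B", "doc-C"]   # BM25 order
semantic = ["doc-B", "doc-D", "doc-A"]  # k-NN order
print(reciprocal_rank_fusion([lexical, semantic]))
```

Here doc-B (ranked #2 lexically and #1 semantically) outranks doc-A (#1 lexically but only #3 semantically), which is exactly the "strong in both beats strong in one" behavior described above.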

Corroboration note. The ranking strategy here follows the same foundations used in modern search stacks: BM25 remains the lexical baseline, Reciprocal Rank Fusion solves score incompatibility by combining positions instead of raw values, and OpenSearch's hybrid search stack exposes both the lexical and vector sides of that pipeline.


4. Tuning for the “Operator Mindset”

Building the engine is only 50% of the work. The other 50% is tuning it for the specific needs of an intelligence operator.

Signal Boosting

In our Intelligence Core, we apply Metadata Boosting. We don’t just search the text. We boost results based on:

  • Recency: Newer intelligence is generally more actionable.
  • Source Trust: Documents from “Verified Sensory Workers” rank higher than unvetted forum scrapes.
  • Entity Density: Documents that contain verified entities from our “Probabilistic Graph” (see Post 11) get a significant ranking multiplier.
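One simple way to express these boosts is as multipliers applied after fusion. The field names and multiplier values below are hypothetical, chosen only to illustrate the shape of the function; real weights would be tuned against operator feedback.

```python
from datetime import datetime, timezone

def boosted_score(base_score, doc):
    """Apply illustrative metadata multipliers to a fused relevance score."""
    score = base_score
    age_days = (datetime.now(timezone.utc) - doc["published"]).days
    if age_days <= 30:
        score *= 1.5                                   # recency boost
    if doc.get("source_trust") == "verified":
        score *= 1.3                                   # trusted collection path
    score *= 1.0 + 0.1 * doc.get("verified_entities", 0)  # entity density
    return score

doc = {
    "published": datetime.now(timezone.utc),
    "source_trust": "verified",
    "verified_entities": 3,
}
print(boosted_score(1.0, doc))
```

Multiplicative boosts compose cleanly: a document missing every signal keeps its fused score unchanged rather than being penalized outright.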

The “Zero-Results” Guardrail

In standard search, “Zero Results” is a failure. In OSINT, “Zero Results” is an answer. A hybrid engine must be careful not to “hallucinate” relevance. If the top semantic match has a similarity score below a certain threshold (e.g., < 0.7), the system must be honest and say: “No exact matches found; here are some conceptually distant leads.”
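The guardrail is a threshold check before presentation. A minimal sketch, using the 0.7 cutoff from the text (the exact value is something each deployment has to calibrate):

```python
SIM_THRESHOLD = 0.7  # illustrative cutoff; calibrate per corpus and model

def guarded_results(semantic_hits):
    """Separate confident matches from conceptually distant leads.

    semantic_hits: list of (doc_id, cosine_similarity) pairs,
    sorted by similarity descending.
    """
    confident = [(d, s) for d, s in semantic_hits if s >= SIM_THRESHOLD]
    if confident:
        return {"status": "matches", "results": confident}
    return {
        "status": "no_exact_matches",
        "message": "No exact matches found; "
                   "here are some conceptually distant leads.",
        "results": semantic_hits,
    }

print(guarded_results([("doc-1", 0.55), ("doc-2", 0.48)]))
```

The key design choice is that low-similarity hits are still returned, but labeled honestly, so the operator knows they are leads rather than answers.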

Representative operator query

Consider a sanctions or investigations analyst searching for a handle that keeps mutating between exact aliases and coded descriptions:

  1. The lexical branch still has to surface the exact username, wallet string, or leaked identifier when it exists.
  2. The semantic branch needs to pull in euphemisms, adjacent phrasing, and discussions that never mention the literal string.
  3. Metadata boosts then push up the documents that are recent, come from trusted collection paths, or already touch verified entities in the graph.

Without the lexical branch, the analyst misses the hard identifier. Without the semantic branch, they miss the narrative around it. Without the boosts and guardrails, they drown in plausible but useless matches.


5. The Multi-Modal Future

The future of the Hybrid Engine isn’t just text + vectors. It is Multi-Modal.

In the systems I deploy, we incorporate visual search. We use CLIP-based models to embed images and screenshots (see Post 14).

  • An operator can search for “Logo for X-Group.”
  • The engine finds documents where that logo appears in a screenshot, even if the text doesn’t mention the group by name.

By fusing Text-Lexical, Text-Semantic, and Visual-Semantic ranks, we create a search capability that mimics the associative power of the human brain with the scale of a distributed cluster.
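RRF generalizes naturally from two ranked lists to N modalities, optionally with per-modality weights. This sketch is an assumption about how such a fusion could be wired, not the specific implementation described above; the modality names and weights are illustrative.

```python
def fuse_modalities(rankings, weights=None, k=60):
    """Weighted reciprocal-rank fusion across N modality rankings.

    rankings: dict of modality name -> ordered list of doc IDs.
    weights:  optional per-modality multipliers (illustrative values).
    """
    weights = weights or {name: 1.0 for name in rankings}
    fused = {}
    for name, ranking in rankings.items():
        for pos, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] / (k + pos)
    return [doc for doc, _ in sorted(fused.items(), key=lambda kv: -kv[1])]

order = fuse_modalities({
    "text_lexical": ["doc-1", "doc-2"],
    "text_semantic": ["doc-2", "doc-3"],
    "visual_semantic": ["doc-2", "doc-4"],  # e.g. CLIP screenshot hits
})
print(order)  # doc-2 appears in all three branches, so it leads
```

A document that surfaces in all three branches, such as a screenshot whose logo, caption, and surrounding discussion all match, rises to the top even if no single branch ranked it first.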

Viralink product screenshot used as a shipped reference for retrieval, graph context, and narrative analytics workflows.
Products like Viralink only become useful once retrieval can move between exact entities, narrative context, and graph-level significance without lying to the operator about certainty.

6. Summary: Search as a Strategic Edge

The transition from “Keyword Matching” to “Hybrid Intelligence Retrieval” is the difference between an analyst spending 8 hours digging and an analyst spending 8 minutes identifying.

A Hybrid Search Engine recognizes that intelligence is both Exact and Conceptual. It respects the precision of the string while embracing the depth of the vector.

For the operator, this means a terminal that doesn’t just show them what they asked for, but what they meant to find. It is the bridge between a database of records and a library of insights.
