Chengshuo Dai

In modern AI search and Retrieval-Augmented Generation (RAG) pipelines, finding the most relevant documents quickly and accurately is a hard problem. The standard approach uses Bi-encoders for dense vector retrieval, but because they embed the query and each document separately, they often miss fine-grained relevance signals. A two-stage retrieval pipeline that adds Cross-encoder Reranking is a widely used way to close that gap.

Bi-encoders process the user's query and the documents independently. They map both into a shared vector space, allowing for lightning-fast retrieval using Approximate Nearest Neighbor (ANN) algorithms like HNSW. However, because the query and the document are embedded in isolation, the model cannot analyze the complex interactions between specific words in the query and the document. This often leads to high recall but lower precision, where retrieved documents are topically related but don't directly answer the user's specific question.
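To make the independence of the two embeddings concrete, here is a minimal sketch. It assumes a toy bag-of-words function in place of a trained Bi-encoder and an exhaustive scan in place of an ANN index such as HNSW; the names embed, build_index and retrieve are illustrative and do not come from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs):
    # Documents are embedded once, offline, without seeing any query.
    return [(doc, embed(doc)) for doc in docs]


def retrieve(query, index, k=2):
    # The query is embedded on its own, then compared with the stored vectors.
    # Exhaustive scan for clarity; a real system would query an ANN index.
    q = embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "How to reset a forgotten password",
    "Password policy and complexity requirements",
    "Office opening hours and holidays",
]
index = build_index(docs)
results = retrieve("reset my password", index, k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Note that the second result (the password policy page) is topically related but does not answer the question, which is the precision problem described above.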

Cross-encoders solve this by processing the query and the document together in a single forward pass. For BERT-style models, the input is one concatenated sequence: [CLS] Query [SEP] Document [SEP]. This lets the Transformer's self-attention compute token-level interactions between the query and the document, and the model outputs a single relevance score for the pair, which is typically more accurate than comparing two independently computed embeddings.
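The paired input format can be sketched as follows. This is an illustration only: it assumes whitespace tokenization rather than a real subword tokenizer, and build_pair_input is a hypothetical helper, not a library function. The segment IDs follow the BERT convention of 0 for the first segment and 1 for the second.

```python
def build_pair_input(query: str, document: str):
    """Build a BERT-style paired input and its segment IDs.

    Whitespace tokenization is an assumption made for readability;
    real models use a subword tokenizer.
    """
    q_tokens = query.split()
    d_tokens = document.split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "reset password", "Open settings and choose reset"
)
print(" ".join(tokens))
print(segment_ids)
```

Because both segments sit in one sequence, every query token can attend to every document token, which is what a Bi-encoder cannot do.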

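The trade-off is computational cost. A Cross-encoder needs a separate forward pass for every query-document pair, and none of that work can be precomputed before the query arrives, so running it over millions of documents is prohibitively slow and expensive. A common pattern is therefore a two-stage pipeline: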

1. First-Stage Retrieval: Use a fast Bi-encoder (or traditional BM25 lexical search) to retrieve a broad set of candidate documents (e.g., the top 100).
2. Second-Stage Reranking: Pass these 100 candidates through a powerful Cross-encoder to calculate precise relevance scores and reorder them, ultimately passing only the top 3 to 5 documents to the LLM.
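The two stages above can be sketched as a single function that takes both scorers as parameters. This is a sketch under stated assumptions: term_overlap stands in for the fast first-stage retriever and bigram_overlap stands in for the Cross-encoder, and neither is a trained model. All function names here are hypothetical.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    k_candidates: int = 100,
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    # Stage 1: cheap scoring over the whole corpus, keep the top k candidates.
    candidates = sorted(
        corpus, key=lambda d: first_stage(query, d), reverse=True
    )[:k_candidates]
    # Stage 2: expensive scoring over the candidates only, keep the top n.
    reranked = sorted(
        ((reranker(query, d), d) for d in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return reranked[:top_n]


def term_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared words (assumption)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def bigram_overlap(query: str, doc: str) -> float:
    """Toy reranker: number of shared adjacent word pairs (assumption)."""
    def bigrams(text: str):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


corpus = [
    "password rules: do not reset the router",
    "to reset password open account settings",
    "office opening hours",
    "reset password from the login page",
]
top = two_stage_search(
    "reset password", corpus, term_overlap, bigram_overlap,
    k_candidates=3, top_n=2,
)
for score, doc in top:
    print(score, doc)
```

In this toy run the first stage rates the router document as highly as the two relevant ones because it contains both query words, and the second stage demotes it because the words never appear together. The reranker is called only k_candidates times per query, regardless of corpus size.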

This architecture balances the speed of vector databases with the deeper semantic matching of Cross-encoders. Because the LLM receives fewer but more relevant passages, it can also reduce hallucinations in downstream generation tasks.
