Chengshuo Dai

Text embedding models are critical infrastructure for modern semantic search, Retrieval-Augmented Generation (RAG), and clustering applications. Early techniques like Word2Vec and GloVe revolutionized natural language processing by mapping individual words to dense vectors, but they assigned each word a single static vector and so could not capture context. A word like "bank" had exactly the same representation whether it appeared in "river bank" or "bank account." Transformer-based models such as BERT addressed this by generating contextualized embeddings, in which the representation of a word depends on the surrounding text as well as on the word itself.

However, simply averaging the token embeddings from a pre-trained BERT model does not yield high-quality sentence or document embeddings; the Sentence-BERT paper reports that such averages often perform worse than averaged GloVe vectors. To create models that accurately capture the semantic similarity between entire sentences or paragraphs, researchers turned to contrastive learning. The core objective of contrastive learning is to train a model to map semantically similar texts (positive pairs) close together in the vector space, while pushing semantically dissimilar texts (negative pairs) far apart.

This is typically achieved with the InfoNCE loss, a contrastive objective derived from Noise-Contrastive Estimation (NCE). During training, the model is presented with an anchor text (e.g., a search query), a positive document (e.g., the correct answer), and a set of negative documents (incorrect answers), which in practice are often the other documents in the same training batch. The model computes the cosine similarity between the anchor and every document, divides the scores by a temperature, and applies a softmax cross-entropy loss that treats the positive as the correct class. The loss is small only when the positive pair scores well above every negative pair.
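The loss described above can be sketched for a single anchor as follows. This is a minimal illustration in plain Python, not a training-ready implementation: the function names and the temperature value of 0.05 are assumptions chosen for the example, and a real system would compute the loss over whole batches with a tensor library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor (temperature is an illustrative default).

    Equivalent to softmax cross-entropy over the temperature-scaled
    similarities, with the positive document as the correct class.
    """
    # The positive sits at index 0, followed by every negative.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

The loss approaches zero when the anchor is far more similar to the positive than to any negative, and grows as a negative overtakes the positive.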

A critical factor in the success of contrastive learning is the selection of "hard negatives." If the negative documents are completely unrelated to the anchor (e.g., a query about quantum physics and a negative document about baking a cake), the model learns very little, because the task is too easy. Hard negatives are documents that are topically similar to the anchor but do not actually answer the query (e.g., a document about classical mechanics). Training with hard negatives forces the embedding model to learn fine-grained semantic distinctions, which typically improves its performance on complex retrieval tasks.
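One common way to find hard negatives is to rank candidate documents by similarity to the query and keep the highest-scoring ones that are not labeled as correct. The sketch below assumes the embeddings are already available as plain lists; `mine_hard_negatives`, the document IDs, and the toy vectors are all hypothetical names and data made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_hard_negatives(query_vec, candidates, positive_ids, k=2):
    """Return the ids of the k non-positive documents most similar to the query.

    candidates maps a document id to its embedding; positive_ids holds the
    ids known to answer the query, which must never be used as negatives.
    """
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in candidates.items()
        if doc_id not in positive_ids
    ]
    scored.sort(reverse=True)  # most similar first
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-dimensional embeddings, invented for this example.
docs = {
    "quantum_physics": [1.0, 0.1],      # the labeled positive
    "classical_mechanics": [0.9, 0.4],  # topically close, wrong answer
    "optics": [0.7, 0.7],
    "baking_a_cake": [0.0, 1.0],        # unrelated, an easy negative
}
print(mine_hard_negatives([1.0, 0.0], docs, {"quantum_physics"}))
```

With these toy vectors the unrelated document is ranked last and is not selected, which mirrors the quantum physics example above.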

As embedding models have grown more powerful, their output dimensionality has also increased. Modern models often produce embeddings with 1536, 4096, or even 8192 dimensions. While higher dimensionality generally correlates with better performance, it also significantly increases storage costs and search latency in vector databases. Stored as 32-bit floats, one billion 4096-dimensional vectors occupy roughly 16 TB before any index overhead, which is expensive to hold in RAM.
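To make the storage cost concrete, here is a back-of-the-envelope estimate. It assumes each value is stored as a 4-byte float and ignores index structures, metadata, and replication, so real deployments will need more; `index_size_bytes` is a hypothetical helper written for this example.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of the vectors alone, assuming 4-byte floats by default."""
    return num_vectors * dims * bytes_per_value

# One billion 4096-dimensional float32 vectors.
raw = index_size_bytes(1_000_000_000, 4096)
print(raw / 1e12, "TB")  # 16.384 TB (decimal terabytes)
```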

To address this trade-off between performance and efficiency, researchers developed Matryoshka Representation Learning (MRL). Named after the nested Russian dolls, MRL trains embedding models to encode information at multiple granularities within a single, high-dimensional vector. During training, the loss is applied not only to the full vector but also to a set of nested prefixes of it (e.g., the first 256, 512, and 1024 dimensions), so that each prefix must work as an embedding in its own right. As a result, the most critical, coarse-grained semantic information is concentrated in the leading dimensions, while increasingly fine-grained details are stored in the subsequent dimensions.
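The training objective can be sketched as a weighted sum of an ordinary loss evaluated on nested prefixes of each embedding. This is an illustrative outline rather than the paper's reference implementation: `matryoshka_loss`, its default prefix sizes, and the uniform default weights are assumptions, and `loss_fn` stands for any loss that takes an anchor, a positive, and a list of negatives (for example, a contrastive loss such as InfoNCE).

```python
def matryoshka_loss(loss_fn, anchor, positive, negatives,
                    dims=(256, 512, 1024), weights=None):
    """Weighted sum of loss_fn over nested prefixes of the embeddings.

    dims lists the prefix lengths to supervise; to train the full vector
    as well, its length should be included. weights defaults to 1.0 each.
    """
    if weights is None:
        weights = [1.0] * len(dims)
    total = 0.0
    for d, w in zip(dims, weights):
        # Every prefix is scored as if it were a complete embedding.
        total += w * loss_fn(anchor[:d], positive[:d], [n[:d] for n in negatives])
    return total
```

Because every prefix contributes to the total, the model cannot rely on late dimensions to carry information that short prefixes need.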

This approach allows developers to truncate the embeddings at inference time without retraining the model. For example, an application might use the full 4096-dimensional embeddings for a highly precise, small-scale search task. For a massive-scale retrieval task where speed and storage are paramount, the developer can instead keep only the first 512 dimensions of the same embeddings and discard the rest, re-normalizing the shortened vectors if cosine similarity is used. Because the MRL model was explicitly trained to front-load the most important information, these truncated 512-dimensional vectors retain most of the retrieval accuracy of the full vectors while cutting storage by a factor of eight. The cost of each similarity computation falls by about the same factor, although end-to-end search latency also depends on the index. This flexibility makes Matryoshka embeddings attractive for enterprise deployments.
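Truncation itself is a one-line slice; the only subtlety is that a prefix of a unit-length vector is generally no longer unit length. The sketch below assumes embeddings are plain Python lists and that cosine similarity or dot-product search over normalized vectors is used downstream; `truncate_embedding` is a hypothetical helper, not part of any particular library.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values of vec and rescale them to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(prefix)
    return [x / norm for x in prefix]

# Toy 3-dimensional vector shortened to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

This only preserves accuracy for models trained with a Matryoshka-style objective; slicing the output of an ordinary embedding model this way can discard information arbitrarily.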

References:

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - https://arxiv.org/abs/1908.10084
Matryoshka Representation Learning - https://arxiv.org/abs/2205.13147