Recent advances in Large Language Models (LLMs) have demonstrated their remarkable ability to capture semantic information. We investigate whether different language embedding models learn similar semantic representations despite variations in architecture, training data, and initialization. Previous work explored model similarity through top-k retrieval results and Centered Kernel Alignment (CKA), with mixed results, but for the large language embedding models we focus on there is a gap: more modern similarity-quantification methods from computer vision, such as model stitching, which operationalizes the notion of "similarity" in a way that emphasizes downstream utility, have not been explored. We apply stitching by training linear and nonlinear (MLP) mappings, called "stitches", between embedding spaces, which aim to biject between embeddings of the same datapoints. We define two spaces as connectivity-aligned if stitches achieve low mean squared error, indicating approximate bijectivity.
Our analysis spans 6 embedding datasets (5,000-20,000 documents), 18 models (between 20-30 layers, including both open-source and OpenAI models), and stitches ranging from linear maps to MLPs 7 layers deep, with a focus on linear stitches. We hoped that stitching would recover the similarity between models, aligning with a strong interpretation of the Platonic Representation Hypothesis. However, things appear to be more complicated. Our results suggest that embedding models are not linearly connectivity-aligned: linear stitches do not perform significantly better than mean estimators. A brief foray into MLPs suggests that shallow MLPs do not necessarily work out of the box either, but more work remains to be done on non-linear stitches, since we have not fully explored their potential here. Stitches matter because their success provides an operational, and therefore useful, notion of representational similarity. Our findings buttress the hypothesis that alignment metrics such as CKA are not always informative of behavioral or feature overlap between models.
Modern "embedding" language models convert text into dense vectors called embeddings, which capture semantic meaning in a high-dimensional space, enabling downstream usage such as for semantic search and visualization. We investigate whether different embedding models learn similar semantic representations despite variations in architecture, training data, and initialization. If these representations are indeed (approximately) universal, we could leverage this to efficiently translate between different models' embeddings.
Our primary point of reference is the paper Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems.
We define key terminology and notation used throughout this post:

- **Stitch**: a learned mapping (affine or MLP) from one model's embedding space to another's, trained so that embeddings of the same datapoint line up across the two spaces.
- **Connectivity-aligned**: two embedding spaces are connectivity-aligned if a stitch between them achieves low mean squared error, i.e. the stitch is approximately a bijection. If an affine stitch suffices, we call them linearly connectivity-aligned.
- **Native / target space**: the embedding space a stitch maps from, and the embedding space it maps into, respectively.
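Concretely (our notation, summarizing the setup above), a stitch from model \(A\) to model \(B\) is fit to minimize mean squared error over a shared corpus of \(N\) documents:

\[
f^{\ast} \;=\; \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} \left\lVert f\!\left(Z_A^{(i)}\right) - Z_B^{(i)} \right\rVert_2^2
\]

where \(Z_A^{(i)}\) and \(Z_B^{(i)}\) are the embeddings of document \(i\) under models \(A\) and \(B\), and \(\mathcal{F}\) is the stitch family (affine maps or MLPs of a given depth).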
Our key hypothesis was that embedding models of similar scale (parameters, training data size, etc.) are not only connectivity-aligned, but linearly so. We were also curious as to whether a rotation would be sufficient. The results suggest that they are in fact not linearly connectivity-aligned, and that shallow MLPs require some finesse to achieve connectivity-alignment. This finding has both theoretical implications for understanding how language models represent meaning and practical applications for efficient embedding translation.
Imagine you've just spent weeks processing millions of documents through a language model to create semantic search capabilities. Then, a more powerful model is released – but recomputing all those embeddings would cost thousands of dollars and days of processing time. What if there was a better way?
This research emerged from real-world challenges at MantisAI, where we help organizations understand and visualize large document collections. Our customers frequently needed to switch between different embedding models – sometimes prioritizing accuracy, other times speed or cost. But each switch required reprocessing entire datasets, creating significant computational overhead and compatibility challenges between workspaces.
We spend the majority of this blog post diving into the theory-relevant results rather than the cost savings possible for semantic search systems. However, it merits remembering that there are multiple angles through which this work is useful. In the future we also imagine that deeper theoretical knowledge could drive improvements in ML algorithms, making them more interpretable, robust, efficient, or performant.
There is ample evidence supporting the idea that neural network representations may be aligned to some degree. Some notable observations include:
At the same time, representational similarity tools are becoming mature enough to explore such questions more deeply and empirically, and are percolating across both the machine learning and neuroscience communities.
Our investigation into embedding model similarity followed a systematic experimental approach spanning multiple model architectures, embedding spaces, and datasets. The methodology consists of five main components: model selection, dataset selection, stitch architecture design, evaluation framework, and embedding parameter configuration.
MODEL_NAMES = [
# MODEL_NAME EMBEDDING DIMENSION
"WhereIsAI/UAE-Large-V1", # 1024
"BAAI/bge-base-en-v1.5", # 768
"BAAI/bge-large-en-v1.5", # 1024
"BAAI/bge-small-en-v1.5", # 384
"intfloat/e5-base-v2", # 768
"intfloat/e5-large-v2", # 1024
"intfloat/e5-small-v2", # 384
"thenlper/gte-base", # 768
"thenlper/gte-large", # 1024
"thenlper/gte-small", # 384
"sentence-transformers/gtr-t5-base", # 768
"sentence-transformers/gtr-t5-large", # 768
"mixedbread-ai/mxbai-embed-large-v1", # 1024
"sentence-transformers/sentence-t5-base", # 768
"sentence-transformers/sentence-t5-large", # 768
"openai/text-embedding-3-large", # 3072
"openai/text-embedding-3-small", # 1536
]
Model Family | Variant | Architecture | Dimension | Parameters | Training Data |
---|---|---|---|---|---|
BAAI BGE | large-v1.5 | DeBERTa-V3 | 1024 | 335M | 330M+ text pairs |
BAAI BGE | base-v1.5 | DeBERTa-V3 | 768 | 110M | 330M+ text pairs |
BAAI BGE | small-v1.5 | DeBERTa-V3 | 384 | 33M | 330M+ text pairs |
E5 | large-v2 | DeBERTa-V3 | 1024 | 335M | CCNet + web data |
E5 | base-v2 | DeBERTa-V3 | 768 | 110M | CCNet + web data |
E5 | small-v2 | DeBERTa-V3 | 384 | 33M | CCNet + web data |
GTE | large | DeBERTa-V3 | 1024 | 335M | MS MARCO + public datasets |
GTE | base | DeBERTa-V3 | 768 | 110M | MS MARCO + public datasets |
GTE | small | DeBERTa-V3 | 384 | 33M | MS MARCO + public datasets |
T5-based | gtr-t5-large | T5 encoder | 768 | 770M | C4 + MS MARCO |
T5-based | gtr-t5-base | T5 encoder | 768 | 110M | C4 + MS MARCO |
T5-based | sentence-t5-large | T5 encoder | 768 | 770M | C4 + NLI datasets |
T5-based | sentence-t5-base | T5 encoder | 768 | 220M | C4 + NLI datasets |
UAE | large-v1 | RoBERTa | 1024 | 355M | Adversarial training |
MXBai | embed-large-v1 | DeBERTa-V3 | 1024 | 335M | 700M+ pairs contrastive training, 30M+ fine tuning |
OpenAI | text-embedding-3-large | Proprietary | 3072 | - | Not public |
OpenAI | text-embedding-3-small | Proprietary | 1536 | - | Not public |
DATASETS = [
"arguana", # Around 10K Short documents
"fiqa", # Around 50K, shortened to 20K
"scidocs", # Around 25K, shortened to 20K
"nfcorpus", # Around 5K
"hotpotqa", # Over 100K, shortened to 20K
"trec-covid", # At least 20K, shortened to 20K
]
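These dataset names correspond to corpora from the BEIR retrieval benchmark. As a rough sketch (not the project's actual loading code), here is how one could fetch a corpus with the `beir` package and cap it at the document counts noted in the comments above; the output directory, the 20,000-document cap as a default, and the deterministic subsetting are assumptions:

```python
# Hypothetical sketch: download a BEIR corpus and cap it at 20K documents.
# The `beir` package and its GenericDataLoader are real; the cap and paths
# mirror the comments in DATASETS above but are otherwise assumptions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

def load_capped_corpus(dataset: str, max_docs: int = 20_000, out_dir: str = "datasets"):
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    data_path = util.download_and_unzip(url, out_dir)
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    # Keep a deterministic subset so every embedding model sees the same documents.
    doc_ids = sorted(corpus)[:max_docs]
    return {doc_id: corpus[doc_id] for doc_id in doc_ids}, queries, qrels

corpus, queries, qrels = load_capped_corpus("arguana")
```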
You can read more about the datasets below:
We implemented stitch functions using Ordinary Least Squares (OLS) to find a best-fitting affine function transforming the source embeddings into the target embeddings. We also trained MLPs ranging in depth from zero nonlinearities (i.e. linear) to 6 nonlinearities (7 layers). Each MLP had the same hidden width throughout: the larger of the input and output dimensions.
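As a minimal sketch of these two stitch families (our own re-implementation of the description above, not the project's exact code; the ridge option reflects the regularization mentioned later):

```python
# Minimal sketch of the two stitch families described above (assumed, not the exact code).
import numpy as np
import torch.nn as nn

def fit_affine_stitch(Z_src: np.ndarray, Z_tgt: np.ndarray, ridge: float = 0.0):
    """Closed-form least-squares fit of Z_tgt ~= Z_src @ W + b."""
    X = np.hstack([Z_src, np.ones((Z_src.shape[0], 1))])   # append a bias column
    reg = ridge * np.eye(X.shape[1])
    reg[-1, -1] = 0.0                                       # do not penalize the bias
    coef = np.linalg.solve(X.T @ X + reg, X.T @ Z_tgt)      # shape (d_src + 1, d_tgt)
    return coef[:-1], coef[-1]                              # W, b

def make_mlp_stitch(d_src: int, d_tgt: int, n_nonlinearities: int) -> nn.Module:
    """MLP with constant hidden width = max(d_src, d_tgt); 0 nonlinearities == linear."""
    width = max(d_src, d_tgt)
    if n_nonlinearities == 0:
        return nn.Linear(d_src, d_tgt)
    layers = [nn.Linear(d_src, width), nn.ReLU()]
    for _ in range(n_nonlinearities - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, d_tgt))                  # 6 nonlinearities -> 7 linear layers
    return nn.Sequential(*layers)
```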
We considered a range of different metrics to analyze the relationship between embedding spaces, settling on a mix drawn from our reference paper, statistics on the stitching models that bridge two embedding spaces, and some visualizations to better understand exactly what these stitches are doing.
We used the parameters shown below for our embedding models. The datasets are composed of documents and sample queries (for reproducible top-K search-result analysis). We embed each separately, appending a prefix as shown below, as in the prior work. Unlike the prior work, we use an OpenAI text splitter with a fixed model. This allows us to ensure that we compare the exact same text's embeddings across models; not doing so is something we suspect is a subtle bug in Beyond Benchmarks, which we reported to the authors. For the open-source models, each model's `encode` function is used to produce embeddings.
VECTOR_SEARCH_SENTENCE_DEFAULT_CHUNK_SIZE=256
VECTOR_SEARCH_DISTANCE_FUNCTION="cosine"
VECTOR_SEARCH_NORMALIZE_EMBEDDINGS="true"
VECTOR_SEARCH_CHUNK_PREFIX="passage: "
VECTOR_SEARCH_QUERY_PREFIX="query: "
VECTOR_SEARCH_TEXT_SPLITTER_CHUNK_OVERLAP=25
BATCH_SIZE=64
CHUNK_SIZE=256
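As a hedged illustration of how these parameters fit together, the sketch below chunks a document with a tiktoken-based splitter (our guess at the "OpenAI text splitter with a fixed model") and embeds the prefixed chunks with normalization. The specific libraries, the `gpt-3.5-turbo` tokenizer choice, and the example model are assumptions, not the pipeline's actual code:

```python
# Hedged sketch of how the parameters above might be applied: tiktoken-based
# chunking followed by prefixed, normalized embedding. Library choices and the
# tokenizer model are assumptions.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

document_text = "..."  # one raw document from the corpus (placeholder)

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo",   # the "fixed model"; assumed, not stated above
    chunk_size=256,               # VECTOR_SEARCH_SENTENCE_DEFAULT_CHUNK_SIZE
    chunk_overlap=25,             # VECTOR_SEARCH_TEXT_SPLITTER_CHUNK_OVERLAP
)

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunks = splitter.split_text(document_text)
embeddings = model.encode(
    ["passage: " + chunk for chunk in chunks],   # VECTOR_SEARCH_CHUNK_PREFIX
    batch_size=64,                               # BATCH_SIZE
    normalize_embeddings=True,                   # VECTOR_SEARCH_NORMALIZE_EMBEDDINGS
)
```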
We conducted extensive experiments across different model scales and architectures. Here are our key findings:
Firstly, we reproduce some of the results from the original paper. In line with the metrics presented in Beyond Benchmarks, we examined the pairwise similarity between our various embedding models. Above you can see our CKA (Centered Kernel Alignment) matrix. CKA is a common metric for measuring the similarity between representations. To compute it, you start with a text dataset whose documents are fed through the embedding models \(A\) and \(B\) to produce sample embeddings \(Z_A\) and \(Z_B\). These embeddings are then compared in a specific way: first they are mean-centered, then their kernel (Gram) tables are respectively computed; next, these kernels are flattened and their correlation (normalized inner product) is computed. Scores closer to \(1\) indicate more similar representations, and scores closer to \(0\) less similar ones. In our matrix we also introduced a control "embedding" model, a (ArguAna length \(\times\) \(768\) embedding dimension) standard Gaussian matrix, which is not included in the reference paper.
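For concreteness, here is a minimal sketch of linear CKA as just described (center, form Gram tables, take the normalized inner product of their flattenings). This is our own restatement rather than the project's evaluation code, and the random-control shape is illustrative:

```python
# Minimal linear-CKA sketch following the description above.
import numpy as np

def linear_cka(Z_a: np.ndarray, Z_b: np.ndarray) -> float:
    """CKA between (n_docs, d_a) and (n_docs, d_b) embedding matrices."""
    Z_a = Z_a - Z_a.mean(axis=0)             # mean-center each space
    Z_b = Z_b - Z_b.mean(axis=0)
    K_a = Z_a @ Z_a.T                        # kernel (Gram) tables, shape (n, n)
    K_b = Z_b @ Z_b.T
    # Normalized inner product of the flattened kernels.
    return float((K_a * K_b).sum() / (np.linalg.norm(K_a) * np.linalg.norm(K_b)))

# Control in the spirit of the matrix above: a random "embedding" of standard normals
# (illustrative shape; the real control is ArguAna length x 768).
rng = np.random.default_rng(0)
Z_rand = rng.standard_normal((1000, 768))
```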
As one would hope, the random embedding model is extremely dissimilar from the real embedding models, with similarity scores of only \(0.10\) to \(0.16\), while the strongest similarity goes all the way up to \(1.00\) (approximately complete similarity) between mxbai-embed-large-v1 and UAE-Large-V1, models from different companies that both have embedding dimension \(1024\). Just looking at these preliminary similarity scores, we would hypothesize that these two models are connectivity-aligned.
The lowest similarity we observe is \(0.72\), between bge-small-en-v1.5 and e5-large-v2, of dimensions \(384\) and \(1024\) respectively; still quite a high representational similarity. While we hypothesized that most models would be linearly connectivity-aligned, we also suspected that it would be easier to linearly map from larger models to smaller models. As will become visible later, this appears to be qualitatively true, which is unsurprising, though we do not provide a statistical analysis.
We present the results of our stitching experiments in the table below: the raw mean squared errors (MSE) for each stitch, as well as the MAE, R squared, and absolute variation. Each entry in the table corresponds to a pair of models. Values are reported on a logarithmic (base 10) scale, since the MSEs are small.
Note: The axis labels are not in the same order as in the CKA matrix.
At a Distance
The plots above show the log mean errors in stitching from embeddings of the source model (x-axis) to embeddings of the target model (y-axis) using our OLS-derived affine function.
Overall, the affine stitches performed fairly well, ranging from a log MSE of \(-5.63\) (0.35% MSE) to \(-3.16\) (4.24% MSE).
The top four performing stitches were:
All of these models are of dimension \(1024\), corroborating the observation from the original paper that dimension is sometimes correlated with "similarity".
The bottom four performing stitches were all unsurprisingly stitched from bge-small-en-v1.5, one of our two smallest models with dimension \(384\). We had hypothesized that this would be the case. The only decently performing stitch with this native space was bge-small-en-v1.5 to gte-small, another model of dimension \(384\). This too is unsurprising.
However, what caught us off guard is that stitches from gte-small to other embedding spaces performed on par with stitches from far larger models, including OpenAI's text-embedding-3-small, a model nearly 4 times its size. If text-embedding-3-small were storing a more expressive representation of semantic features than gte-small, then why did affine stitches perform similarly when operating from their embeddings? As we observe later, these embedding spaces do not seem to be genuinely linearly connectivity-aligned, so it is likely that the stitches are learning some simple baseline strategy that does not depend too heavily on the additional nuance encodable by large OpenAI models.
Native Space Performance
To investigate, we needed a better metric than MSE and MAE, since those are not very interpretable on their own: they tell us in absolute terms how geometrically distant two high-dimensional vectors are, but it is hard to use that to infer anything practical. Below we plot R squared (and MAE), where R squared tells us what percentage of the variance present in the distribution of target embeddings is explained by the stitched predictions. We provide controls with a mean estimator and a random Gaussian. Clearly, all the linear stitches perform better than the random Gaussian, but they barely improve upon the mean estimator.
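To make the baseline comparison concrete, here is a rough sketch of how a stitch can be scored against a train-set mean estimator. The use of scikit-learn's `r2_score` with variance-weighted aggregation is an assumption; our plots may aggregate per-dimension scores differently:

```python
# Sketch of the baseline comparison described above: a stitch only "explains"
# the target space if it beats predicting the train-set mean.
import numpy as np
from sklearn.metrics import r2_score

def evaluate_stitch(predict, Z_src_test, Z_tgt_train, Z_tgt_test):
    pred_stitch = predict(Z_src_test)                                     # stitched embeddings
    pred_mean = np.tile(Z_tgt_train.mean(axis=0), (len(Z_tgt_test), 1))   # mean-estimator control
    return {
        "r2_stitch": r2_score(Z_tgt_test, pred_stitch, multioutput="variance_weighted"),
        "r2_mean": r2_score(Z_tgt_test, pred_mean, multioutput="variance_weighted"),
        "mse_stitch": float(((pred_stitch - Z_tgt_test) ** 2).mean()),
        "mae_stitch": float(np.abs(pred_stitch - Z_tgt_test).mean()),
    }
```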
Target Space Performance
The fact that R squared is so abysmally low (and sometimes even negative) suggests that these affine transformations are actually not good in absolute terms: simply predicting the mean (of the train set) performs effectively as well. We trained stitches both with gradient descent (twice) and with ordinary-least-squares-based models, and tuned a ridge-regression parameter to reduce overfitting. In each case, the mean estimator was just as good as the affine stitches. This result strongly suggests that the affine stitches are not reconstructing target embeddings. However, their training loss curves (visible in the appendix) plateaued, suggesting there may not be much more mileage to gain. In the table below we present results for larger neural networks (MLPs) trained on the same objective via gradient descent; they do not perform significantly better.
The pattern in which larger models' embeddings map more easily onto smaller ones, but not vice versa, persists. However, these MLPs do not greatly improve upon the affine stitches. We did not have time for an exhaustive hyperparameter sweep, so it is possible that MLPs may yet bear fruit, but they are unlikely to work on the first try.
A natural question to ask at this point is whether these models may simply be learning to use their bias to become mean estimators. Below we plot the spectra of the affine and MLP matrices to check. While the MLP matrices have a fairly flat spectrum, suggesting something rotation-like, the affine matrices show a relatively large boost on the largest singular values; possibly they are catching on to a direction of high variance in the dataset, a question for future research. Regardless, it should also be noted that the mean bias norm was very low, around 0.07, whereas the norm of the mean of many of the embedding datasets was larger, often surpassing 1 (though not 20). This means the affine stitches are most likely not mean estimators.
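For reference, this is roughly the diagnostic described above, sketched for an affine stitch with weight matrix `W` and bias `b` (our own illustration, not the analysis code): a flat singular-value spectrum suggests a rotation-like map, and a small bias norm relative to the target-space mean norm argues against a mean estimator.

```python
# Sketch of the diagnostics above: spectrum of the stitch weights, plus a
# comparison of the learned bias against the target embeddings' mean.
import numpy as np

def stitch_diagnostics(W: np.ndarray, b: np.ndarray, Z_tgt: np.ndarray):
    singular_values = np.linalg.svd(W, compute_uv=False)   # flat spectrum ~ rotation-like
    return {
        "top_singular_values": singular_values[:5],
        "spectrum_flatness": float(singular_values.min() / singular_values.max()),
        "bias_norm": float(np.linalg.norm(b)),                            # ~0.07 in our runs
        "target_mean_norm": float(np.linalg.norm(Z_tgt.mean(axis=0))),    # often > 1
    }
```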
We set out to check (1) whether different embedding models' embeddings are linearly mappable, (2) what degree of complexity is needed for a good mapping between embeddings, and (3) which types of models are easier to linearly connect. Along the way we explored whether the linear mappings could be rotations and whether they might be learning mean estimators. We find that the answer to (1) is likely no, and that (2) requires more work to ascertain, but seems likely to require more than 2 layers and at least one nonlinearity, and probably far more at reasonable widths. Our answer to (3) is that it is easier to go from larger to smaller models, and we found that in the linear case these mappings are usually not rotations. The spectra of our mappings revealed some unexpected behavior and open up new questions for research, such as whether the dataset's diversity may impact stitch performance.
Overall, we believe that different embedding models' embeddings are not linearly connectivity-aligned, and posit that while more research is needed on MLPs, they are unlikely to work out of the box and will require some tuning.
The key implication of this work is that for transformer-based embedding language models (like the ones we use, as one would find in semantic search), the embeddings are not linearly connectivity-aligned. Therefore, if the Platonic Representation Hypothesis (PRH) and Linear Representation Hypothesis (LRH) are true, then either the models we use are too small (or trained on too little data), or they store features non-linearly, or they store them linearly but in a way that is not linearly mappable (e.g. they might use some form of sparse code that packs more information into fewer dimensions, or the spatial relationships between related concepts might be permuted relative to other such embedding models'). Unfortunately, if we want more evidence for the PRH and LRH, we will need to try harder than this. Another implication is that cheap embedding translation for data visualization or semantic search is unlikely to work.
There are some limitations to our work. One key limitation is that we train using an MSE objective and evaluate using R squared and similar metrics; it is possible that these are simply not indicative of downstream tasks. With more time, we would explore in detail how linear (and non-linear) stitching affects semantic-search rankings, since that is a real-world use case. We would also consider alternative objectives (not MSE) and invest more time into tuning small non-linear stitches. Another important limitation is that, due to our computational constraints and in the interest of consistency with past work, we did not train on truly large or truly diverse datasets. It is possible that the affine mappings could have far outperformed mean estimators if we had merged all the datasets in their entirety to create a large, much more diverse dataset on which to train the stitches. As it stands, it is possible that the amount or type of data simply was not sufficient. Lastly, it should be noted that our results are for embedding language models and may not always hold for language models or deep neural networks at large.
To show that our models were trained sufficiently, observe the loss plots below:
@article{culp_hernandez_embed_stitch_2024,
  title   = {LEAD: Linear Embedding Alignment across Deep Neural Network Language Models' Representations},
  author  = {Culp, Gatlen* and Hernandez, Adriano*},
  note    = {* Equal Contribution},
  journal = {MIT Deep Learning Blogs},
  year    = {2024},
  month   = {dec},
  url     = {https://gatlenculp.github.io/embedding_translation/},
}
Our code will eventually be made public here: https://github.com/GatlenCulp/embedding_translation