Is “Embeddings” killing Embeddings?

Does a word matter when describing a technology?

Steve Jones
6 min read · Feb 22, 2024

I’m seeing a really interesting pushback recently from people not wanting to use embeddings, and I think part of the problem is the name. So first off, what are embeddings? Well, welcome to the world of “why have a simple explanation when a complicated one will do?”.

Well, let’s see what OpenAI say:

OpenAI’s text embeddings measure the relatedness of text strings.

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

Nice and clear? If not, let’s get the smart folks at the business advisory company McKinsey to explain it:

Embeddings are made by assigning each item from incoming data to a dense vector in a high-dimensional space. Since close vectors are similar by construction, embeddings can be used to find similar items or to understand the context or intent of the data.

So there you go, it’s about measuring the distance between two vectors to help understand how related they are. This is “Embedding”, which of course is related to “embed”, which the dictionary defines as “fix (an object) firmly and deeply in a surrounding mass”.

Clear?
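To make those definitions concrete before we pull them apart, here is a toy sketch. The vectors and values are entirely invented for illustration (real embeddings have hundreds or thousands of dimensions), but the mechanics really are this small:

```python
# Toy illustration of "small distances suggest high relatedness".
# These three-dimensional vectors are made up for the example; real
# embedding vectors have hundreds or thousands of dimensions.
import math

cat     = [0.90, 0.10, 0.20]  # pretend embedding of "cat"
kitten  = [0.85, 0.15, 0.25]  # pretend embedding of "kitten"
invoice = [0.10, 0.90, 0.70]  # pretend embedding of "invoice"

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(cat, kitten))   # small distance: closely related
print(distance(cat, invoice))  # large distance: not related
```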

Maths isn’t business English

Now in maths they take that English meaning and say that an embedding is a structure that is contained within another structure. So from a pure mathematical and data science perspective, embeddings are structures contained within some source information, but only in the mathematical sense.

And I think this is the problem. When people avoid using “embeddings”, it’s because the language makes it hard to understand why on earth you should worry about it. “Prompt Engineering” is something everyone understands: it’s messing around with the prompt to get a better answer. People understand what “Fine-Tuning” of a model is, and why they want to do it. In neither of these cases do people always understand specifically what is being done, but they do understand the concept based on their understanding of those terms in other contexts. “Embedding”, though, sounds like something you are putting “inside” the prompt, and because it’s a stream of numbers people react negatively, as the numbers (of course) mean absolutely nothing to them. It gets more complex and non-intuitive when we save these embeddings in a vector database away from the model, so these “embeddings” aren’t actually embedded in what people see as the “normal” flow of working with something.

If I “embed” something into a request, the mental model is that it is within the request. Saying that something is embedded within something else makes it sound like a subset or a specific item within the thing: “the knife was embedded in his chest”. If you were then asked “do you want the doctor to just deal with the embedded item?”, well no, doctor, I’d really like you to focus on the CHEST rather than the thing embedded within it.

Embeddings are a pre-processor, a compressor, a translator

While it is mathematically correct to say that the vectors created are structures embedded within the source material, they are not embedded in a way that a human understands. When we look for something embedded in a sentence, we look at the sentence.

So a simple way to think of embeddings is as a pre-processor: something you call before calling the AI, to translate your source material into “the language of the model”, producing lists of numbers that can be compared to determine their similarity. So instead of talking about using embeddings, you are translating the input into computer language, then doing a comparison to determine how similar two things are. You aren’t “embedding” anything; you’re translating and comparing.
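Here is what that “translate, then compare” flow can look like in practice. A minimal sketch, assuming OpenAI’s Python client with an API key in the environment; the model name, texts, and helper names are illustrative choices, not a recommendation:

```python
# A minimal sketch of "translate, then compare", assuming the OpenAI
# Python client (pip install openai) and OPENAI_API_KEY set in the
# environment. Model name and texts are illustrative.
from openai import OpenAI

client = OpenAI()

def translate(text: str) -> list[float]:
    """'Source translation': turn text into the model's internal language."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: closer to 1.0 means more closely related."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

question = translate("How do I reset my password?")
docs = {
    "Resetting your login credentials": translate("Resetting your login credentials"),
    "Quarterly sales figures": translate("Quarterly sales figures"),
}

# The document whose translation is most similar to the question wins.
best = max(docs, key=lambda title: similarity(question, docs[title]))
print(best)  # expect: "Resetting your login credentials"
```

Notice that nothing here is “embedded” in anything: one function translates, the other compares.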

So what to call them?

So what should we call embeddings? Should we use some other word to describe this area or should we learn from Fine-Tuning and Prompt Engineering and instead describe what we are doing?

What we are doing is source translation for the purposes of the model. Yes, it returns something as a vector that we can store; yes, we can compare those vectors; and yes, those vectors represent alternative structures. We aren’t changing the work that needs to be done at all; we are providing a simple way to explain what we are doing and why, without having to go into the messy details.

“First we do source translation to create a definition of the data in a way that the model really understands, then we can use that to quickly calculate how similar certain requests are”

This isn’t an appeal from me for every data scientist out there to drop the term embeddings; it’s about how we should be explaining these things to non-technical people, or even technical but not purist mathematical ones. We cannot expect people to accept a mathematical term as sensible when that term doesn’t actually match their everyday understanding of the word.

A happy listener is a happy customer

I’ve often said that my job is to make exciting technologies boring. A big part of that is making people trust that I know what I’m doing, by explaining the core parts in language that they understand, so they then trust me to handle the details that they don’t need to understand.

It is really easy with technologies, and particularly AI, to drift into “sounding clever”, assuming context from your personal experience or, worst of all, thinking that someone really needs to understand something because you understand it really well. That isn’t the case. When a surgeon explains surgery to you, they do so at a level you understand, because if they used the detailed language and the complexities that they are trained in, you’d probably scream out in fear.

This doesn’t mean spouting nonsense bullshit; it means translating the challenge into someone else’s context, thinking about the listener’s perspective rather than your own experience and knowledge. I had a call the other week where I assumed someone had more knowledge of a topic than they had. Fifteen minutes into the call I realized that what I was saying was going miles over their head. That was not their fault, it was mine.

Visuals work

The other thing that works is using visuals to explain to people what embeddings and vectors are. The TensorFlow Embedding Projector is an awesome tool for this (thanks to Marek Sowa for the pointer): instead of random numbers, the embeddings become a 3D image of the space. This helps turn the abstract into the concrete.

[Animation: a cloud of values (in this case handwritten digits) with a group of them highlighted to demonstrate similarity]

So this is what embeddings are: they’re those dots in that cloud, and the “distance” is literally that, the distance within that cloud. It’s a pretty way to visualize the result, and a really nice way to explain how embeddings turn your data set into something that the computer can understand better. This is “showing the user the way the computer thinks”.
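If you want that same cloud effect in your own explanations without the Projector, a rough sketch is to squash your vectors down to three dimensions and plot them. This assumes scikit-learn and matplotlib, and uses random vectors as a stand-in for embeddings you have already computed:

```python
# Sketch of the Projector idea: reduce high-dimensional vectors to 3D
# and plot the cloud. The random vectors stand in for real embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 128))  # stand-in for 200 embeddings of 128 dimensions

points = PCA(n_components=3).fit_transform(vectors)  # 128 dimensions down to 3

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=8)
ax.set_title("The embedding cloud: nearby dots are related items")
plt.show()
```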

Embeddings are a “you” problem, source translation is theirs

So to conclude: while “Embeddings” is the word that “every decent data scientist knows”, it isn’t actually a great way to explain what you are doing, so don’t be surprised if you get pushback from people who know less than you about the right approach to use. Instead, step back for a moment and think about what you are actually aiming to do and how that would make sense from their perspective. I like using source translation because it’s an easy description:

We turn the source information, the prompt, into the internal language of the model, which then means we can work out things like similarities and matching much more quickly than having the model do that every time

But that might not work for you. Above all, think about the listener and how you explain it to them in a way that makes sense in their world. We’ve been lucky with Prompt Engineering and Fine-Tuning, but as AI becomes more widely adopted and more complex, we’ve got to simplify how it is explained. That doesn’t mean you won’t be doing embeddings, just that you are explaining it to people who don’t care.

And yes, this does mean you aren’t allowed to use the phrase “Markov Chains”.


My job is to make exciting technology dull, because dull means it works. All opinions my own.