
Embeddings: A Bridge From LLMs to Your Data

With the ubiquity of media coverage around “AI everything” and companies racing to add shiny new “AI-driven” features, a working understanding of the fundamental technologies driving the hype will help you leverage the good parts and add value to your project or organization. Many people are curious about how they can pair their own data with an LLM, and the technique commonly used to achieve this is known as Retrieval Augmented Generation (RAG). Embeddings are what make this technique possible in a fairly easy-to-understand and manageable way, but as an added bonus you can also use them for semantic search, similarity search, and text classification - traditionally complex problems now made easy!

In this article, we will focus on embeddings within the realm of large language models, and how to leverage some of their properties to easily solve problems that previously required expert knowledge across numerous subjects. By the end you will have some basic knowledge and references to working examples that you can build on for your own exploration.

Why Embeddings?

With relatively little effort, you can feed in arbitrary pieces of text and get back outputs you can use to do useful things with those inputs. This description from the OpenAI Embeddings documentation is a good summary of the types of operations that embeddings make easy:

  • Search: results are ranked by relevance to a query string
  • Clustering: text strings are grouped by similarity
  • Recommendations: items with related text strings are recommended
  • Anomaly detection: outliers with little relatedness are identified
  • Diversity measurement: similarity distributions are analyzed
  • Classification: text strings are classified by their most similar label

A compact description of an embedding is: take a text string, run it through a transformer model, and get a number out the other end. Here, the “number” we get is in the form of a vector - an array of floating point values. The reason this is important is that these vectors can be compared and grouped according to how similar they are - that is, how close together their vector points are. If we create an embedding for “dogs like to chase balls” and one for “canine pets enjoy frisbee, catch, and hikes”, they are likely to have similar vectors despite sharing few literal words. Both phrases would likely be close matches to a query for “dog activities”. So embeddings let us group things together that have similar semantic or topical meaning.

An analogy could be packing a big suitcase for a long trip: your inputs are various articles of clothing and toiletries, and you might strategically group your items together by a blend of function and article type (sleep stuff, fancy dinner clothes, etc.). An embedding model captures relationships between the semantic meaning of the inputs, and we can operate on those relationships using linear algebra to tell us how related a set of vectors is within the model. By looking at how far apart vectors are, or where they cluster together, we can infer semantic similarity and perform classification.
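To make this concrete, here is a minimal sketch of one common comparison, cosine similarity, using NumPy. The tiny vectors below are made up purely for illustration - real embeddings typically have hundreds of dimensions:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = pointing in the same direction (very similar), ~0.0 = unrelated
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stand-ins for real embedding vectors
dog_chase_balls = [0.9, 0.1, 0.3]
canine_frisbee = [0.8, 0.2, 0.4]
stock_report = [0.1, 0.9, -0.5]

print(cosine_similarity(dog_chase_balls, canine_frisbee))  # high - similar meaning
print(cosine_similarity(dog_chase_balls, stock_report))    # low - unrelated meaning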

For a more thorough discussion of the technical parts, see Simon Willison’s blog post: https://simonwillison.net/2023/Oct/23/embeddings/#what-are-embeddings

Options for Generating Embeddings

You have identified a set of data you want to explore further and enhance with LLM capabilities: Django models, your S3 bucket of text documents, your archive of JSON files, etc. To get started generating embeddings from your data, you can either use an API or you can download a model and use a library capable of interfacing with the model.

Using an API

There are several API options available, and while they may require payment, it will only cost a few cents for small data loads. A search for “embedding API” should give plenty of options, but here are links to a few popular services: OpenAI, Anthropic, Gemini, Mistral

With an API you’ll typically get access to the highest performing models available and for many types of tasks the results will be good to exceptional with relatively little prompt engineering required. The trade-offs will come in cost per request (tokens used), execution speed, and model capabilities.
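As a rough sketch of what an API request looks like, here is an example using OpenAI’s Python client (assumptions: the openai package is installed, OPENAI_API_KEY is set in your environment, and text-embedding-3-small is the model you want to use - other providers have similar but not identical clients):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["dogs like to chase balls", "canine pets enjoy frisbee, catch, and hikes"],
)

# One embedding (a list of floats) comes back per input string
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))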

Using Models Locally

You can also download an embedding model to generate embeddings on your own server or computer. If you want to dive right into an example, you can follow the steps in another of Simon Willison’s posts, which illustrates using his llm library and plugins to download a model and generate embeddings.

Ollama is a popular choice for downloading and running large language models on your computer, and it supports using embedding models.
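For a rough idea of what that looks like, here is a sketch that calls Ollama’s local REST API with the requests library (assumptions: Ollama is running on its default port and you have already pulled an embedding model such as nomic-embed-text):

import requests

# Ollama serves an embeddings endpoint on localhost by default
response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "dogs like to chase balls"},
)

vector = response.json()["embedding"]
print(len(vector))  # dimensionality depends on the model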

If you’re doing some exploration and/or have a small set of data you want to explore, you can use the SentenceTransformers library to download and use embedding models with just a few lines of code:

# Load a small general-purpose embedding model (downloaded on first use)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
zen = [
    'Explicit is better than implicit.',
    'Simple is better than complex.',
]
# encode() returns one embedding vector per input string
embeddings = model.encode(zen)
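The encode() call returns an array with one embedding vector per input string, ready to be compared, clustered, or stored.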

Which Models Should You Use?

This (probably incomplete) list shows numerous embedding models that you can choose from, so how does one know which to select? If you’re just getting started, “any available” is probably a fine answer. Example code throughout the SentenceTransformers site and numerous other tutorials will give good suggestions for general-use models. You can also look at leaderboards for inspiration. Here is the Massive Text Embedding Benchmark (MTEB) leaderboard hosted by Hugging Face. You can click on the models in the table to get more information on each one and example code for using them.

There are different kinds of embedding models with different specialties, which may be relevant to your work. Some are better at multilingual data, some are trained more on Q&A-style data, and some may be optimized for certain types of texts. CLIP is a neat model that can embed both images and text, leading to some interesting applications of similarity search and clustering.

Vector Databases

Once you’ve generated your embeddings, you’ll need somewhere to store the vectors, and as your data size grows, you’ll want a storage engine that can efficiently store and compare them. To fill this need, a class of storage known as the vector database has rapidly gained popularity. A vector DB isn’t a requirement, though - you can still use a traditional RDBMS, for example. In the worst case you’ll have to do brute-force computations to make the comparison operations work, but many engines already have extensions available: there is pgvector for PostgreSQL, and even a solution for the mighty SQLite exists in sqlite-vec. You may be able to use embeddings in your existing infrastructure without having to manage another service.

In this rapidly growing and ever-changing field, there are already numerous options available for dedicated vector databases. A couple of open source options are Chroma and Qdrant, while options like Pinecone exist to provide a hosted, performant service. It’s worth noting that many of these options are competing to be more “full-stack” solutions in this space and consequently might have their own embedding capabilities built in.
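To illustrate, here is a minimal sketch using Chroma’s in-memory Python client (assumptions: the chromadb package is installed; because no embedding function is specified, Chroma falls back to its default embedding model to embed both the documents and the query):

import chromadb

client = chromadb.Client()  # in-memory instance, nothing is persisted

collection = client.get_or_create_collection("zen")

# Chroma embeds these documents with its built-in default embedding function
collection.add(
    ids=["1", "2"],
    documents=["Explicit is better than implicit.", "Simple is better than complex."],
)

# The query text is embedded the same way, then compared against the stored vectors
results = collection.query(query_texts=["keep it simple"], n_results=1)
print(results["documents"])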

Other Considerations

At Lincoln Loop, we’ve had the opportunity to build and ship LLM-driven solutions in both academic and publication settings, and it is exciting to see what this technology makes possible. There are hints of new platforms and libraries coming that will help commoditize some aspects of building these new solutions, but in the meantime, prepare to build a lot of your own scaffolding. Like many projects, going from working prototype to production still requires building solid UIs, well-tested business logic, and a regular old web site that happens to use LLMs for some novel features. As you get deeper into the implementation, you might find that there are big differences between the models with respect to speed, cost, and accuracy, and you may start developing dedicated systems to evaluate LLM performance. You’ll find yourself spending time considering one or more of:

  • Granularity of splitting content for embedding (chunking)
  • Making attributions back to the original source available in LLM outputs
  • Model output quality vs. cost
  • Data privacy
  • Preventing hallucinations or leaking system data
  • Structuring multi-step interactions and conditional conversations

In future articles we’ll cover these topics in more detail and share some tips learned from our travels.


About the author

Brian Luft

A Full-Stack developer by virtue of 15 years working with various types of companies and technologies; which is another way of saying he has no special skills. With a little sister unwilling to play D&D, he …
