What “Embeddings” Are (and What They Are Not)
An embedding is a numerical representation of a piece of content—most often text, but also images, audio, or code—designed so that “meaningfully similar” items end up close to each other in a high-dimensional vector space. In practice, an embedding is a list of numbers (a vector), typically 384, 768, or 1536 floating-point values long. No single value is interpretable on its own; the meaning is distributed across the entire vector.
Embeddings are not summaries. They do not necessarily preserve the original wording, and you cannot reliably reconstruct the original text from an embedding. Instead, embeddings preserve relationships: if two texts talk about the same concept, their vectors tend to be close; if they talk about different concepts, their vectors tend to be far apart.
Embeddings are also not “truth.” They capture patterns learned from data and can reflect biases, domain gaps, or ambiguity. Two texts can be close because they share a topic, style, or common co-occurrence patterns—even if one is incorrect. Treat embeddings as a powerful similarity tool, not a fact-checker.
Why Embeddings Enable Semantic Similarity
Traditional keyword search matches exact words. Semantic similarity aims to match meaning: “How do I reset my password?” should match “I forgot my login credentials” even if they share few words. Embeddings make this possible because the model has learned to place related concepts near each other in vector space.
Think of the embedding space as a map where distance corresponds to semantic relatedness. “Dog” and “puppy” land near each other; “dog” and “quantum tunneling” land far apart. The map is not 2D; it may have hundreds or thousands of dimensions, which allows it to encode many subtle relationships at once (topic, intent, sentiment, domain, etc.).
Intuition: Directions and Neighborhoods
In an embedding space, you can imagine “directions” that correspond to latent features. For example, one direction might roughly correlate with “customer support intent,” another with “programming,” another with “medical.” These are not explicit labels; they are emergent patterns. A text’s embedding is a point whose position reflects many such features simultaneously.
Semantic similarity then becomes a geometric problem: given two vectors, compute how close they are. If close, treat them as semantically similar.
Common Similarity Measures (Cosine, Dot Product, Euclidean)
To compare embeddings, you use a similarity (or distance) function. The most common are cosine similarity, dot product, and Euclidean distance.
Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is widely used because many embedding models are trained so that direction matters more than length.
cos_sim(a, b) = (a · b) / (||a|| * ||b||)

Values range from -1 to 1 (often 0 to 1 in practice for many embedding spaces). Higher means more similar.
Dot Product
The dot product is similar to cosine similarity but includes magnitude. Some systems normalize vectors to unit length so dot product and cosine similarity become equivalent.
dot(a, b) = a · b

Euclidean Distance
Euclidean distance measures straight-line distance between points. It can work well, but cosine similarity is often preferred for text embeddings because it is less sensitive to vector length.
euclid(a, b) = ||a - b||

Which Should You Use?
- If your embedding vectors are normalized (unit length), cosine similarity and dot product rank results the same.
- If you are using a vector database, follow its recommended metric for the embedding model you chose.
- When in doubt for text: cosine similarity is a safe default.
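To make the formulas concrete, here is a minimal sketch of the three measures in Python with NumPy; the vectors a and b stand in for embeddings produced by whatever model you use.

import numpy as np

# Cosine similarity: angle between the vectors, ignoring magnitude.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dot product: like cosine similarity, but magnitude counts.
def dot_product(a, b):
    return float(np.dot(a, b))

# Euclidean distance: straight-line distance between the two points.
def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# On unit-length (normalized) vectors, cosine similarity and dot product
# produce identical rankings.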
What Embeddings Are Used For
1) Semantic Search
You embed documents (or document chunks) and embed the user query. Then you retrieve the nearest document vectors to the query vector. This finds conceptually relevant results even when wording differs.
2) Retrieval-Augmented Generation (RAG)
Embeddings are commonly used to retrieve relevant passages that are then provided to a language model as reference material. The embedding step is the “retrieval” part: it selects which text snippets are most relevant to the user’s question.
3) Clustering and Topic Discovery
If you embed many items (support tickets, product reviews, meeting notes), you can cluster vectors to discover common themes. This is useful for analytics and triage.
4) Deduplication and Near-Duplicate Detection
Embeddings can detect paraphrases and near-duplicates. For example, two bug reports that describe the same issue in different wording can be flagged as duplicates.
5) Recommendation and Matching
You can embed user profiles and items (jobs, products, articles) and recommend items whose embeddings are close to the user’s embedding.
Step-by-Step: Building a Simple Semantic Search Pipeline
This section outlines a practical workflow you can implement in many languages and tooling stacks. The core idea is always the same: embed, store, query, rank.
Step 1: Prepare Your Documents
Start with a set of documents you want to search: FAQs, internal policies, product manuals, or notes. Clean obvious noise (broken encoding, repeated headers/footers) because repeated boilerplate can dominate similarity.
- Remove irrelevant navigation text (e.g., “Privacy Policy | Terms | Contact”).
- Keep meaningful headings; they often help retrieval.
- Decide what metadata you need: document ID, title, URL, date, product version, access permissions.
Step 2: Chunk the Documents for Retrieval
Semantic search usually works best on smaller passages rather than entire long documents. If passages are too long, they may mix multiple topics and dilute similarity. If too short, they may lack context.
- Common chunk sizes: a few paragraphs or a few hundred words.
- Prefer chunk boundaries at natural breaks (headings, paragraphs) rather than arbitrary cuts.
- Store metadata linking each chunk back to its source document and section.
Even though you are not doing keyword search, chunking still matters because retrieval returns chunks. The goal is: each chunk should be “about one thing.”
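As a rough illustration of “natural breaks first, size limit second,” here is a sketch of a paragraph-based chunker; the max_words threshold and the blank-line splitting rule are assumptions you would tune for your own documents.

# Illustrative chunker: split on blank lines (paragraph boundaries) and
# pack paragraphs into chunks of at most max_words words, so each chunk
# stays small and topically coherent.
def chunk_by_paragraphs(text, max_words=300):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and current_len + words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks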
Step 3: Generate Embeddings for Each Chunk
Use an embedding model to convert each chunk into a vector. Store the vector alongside the chunk text and metadata.
# Pseudocode (language-agnostic) for indexing documents into a vector store
chunks = chunk_documents(docs)
for chunk in chunks:
    vector = embed(chunk.text)
    vector_store.upsert(id=chunk.id, vector=vector, metadata=chunk.metadata, text=chunk.text)

Practical tips:
- Batch embedding requests for speed and cost efficiency.
- Cache embeddings so you don’t recompute them unnecessarily.
- Use the same embedding model for documents and queries.
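A minimal caching sketch, assuming an embed_fn callable provided by whatever embedding client you use; the in-memory dict is for illustration only (a real system would persist the cache).

import hashlib

# Cache keyed on content hash plus model identifier, so a chunk is only
# re-embedded when its text or the embedding model changes.
embedding_cache = {}

def cached_embed(text, model_name, embed_fn):
    key = (model_name, hashlib.sha256(text.encode("utf-8")).hexdigest())
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(text)
    return embedding_cache[key]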
Step 4: Embed the User Query
When a user searches, embed their query using the same model.
query_vector = embed(user_query)

Step 5: Retrieve Nearest Neighbors
Ask your vector store for the top k nearest chunks to the query vector.
results = vector_store.search(vector=query_vector, top_k=10, filter=optional_metadata_filter)

Many systems support metadata filters, such as “product_version = v3”, “language = en”, or “user has access = true.” Filtering is often essential in real applications.
Step 6: Re-rank (Optional but Often Valuable)
Nearest-neighbor retrieval is fast, but the top results can still include “almost relevant” chunks. A common improvement is re-ranking: take the top 20–50 retrieved chunks and score them with a more precise model (often a cross-encoder or a specialized re-ranker) that reads the query and chunk together.
candidates = vector_store.search(query_vector, top_k=30)
ranked = rerank(user_query, [c.text for c in candidates])
final = ranked[:10]

Re-ranking improves precision, especially when your corpus contains many similar items.
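One common way to implement the rerank step is with a cross-encoder. The sketch below uses the sentence-transformers library; the specific model name is an example, not a requirement.

from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: reads query and chunk text together and
# returns a relevance score per pair, then sorts by that score.
def rerank(query, texts, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    model = CrossEncoder(model_name)
    scores = model.predict([(query, text) for text in texts])
    return [text for _, text in sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)]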
Step 7: Present Results (or Feed Them Into Another System)
For semantic search, you display the top chunks with titles and links. For RAG, you pass the top chunks as reference context to a language model. In both cases, keep track of which chunks were used so you can audit behavior and improve indexing.
Practical Example: Why Semantic Similarity Beats Keywords
Imagine a knowledge base with a chunk that says: “To change your account password, go to Settings → Security and select ‘Update Password’.” A user searches: “I can’t remember my password, how do I make a new one?” Keyword search might miss it if it overweights “remember” and “new,” while embeddings will likely place these texts close because they share intent: password reset/change.
Another example: a chunk says “Request a refund within 30 days of purchase.” A user asks “Can I get my money back after buying?” Embeddings often connect “refund” with “money back” even though the exact term differs.
Design Choices That Strongly Affect Embedding Quality
Chunk Content and Noise
Embeddings reflect whatever text you feed them. If every chunk begins with the same boilerplate (“This article explains…”), that repeated text can reduce distinctiveness. Remove or minimize repeated templates.
Titles and Headings
Including a section title at the top of a chunk can improve retrieval because titles often contain high-signal keywords. A practical pattern is to embed: “Document Title — Section Heading — Paragraph text” as a single chunk string.
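For example, a hypothetical helper that assembles the string to embed (the separator and field names are up to you):

# Prepend title and heading so the embedded string carries high-signal context.
def build_chunk_text(doc_title, section_heading, paragraph_text):
    return f"{doc_title} — {section_heading} — {paragraph_text}"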
Domain-Specific Vocabulary
General embedding models may not perfectly capture specialized jargon (legal clauses, medical terms, internal product codenames). If your domain is specialized, evaluate retrieval quality with real queries. Sometimes you can improve results by:
- Adding short glossaries or expansions in the text (e.g., “SSO (single sign-on)”).
- Ensuring acronyms appear alongside their expanded form at least once.
- Using a domain-tuned embedding model if available.
Language and Multilingual Content
Some embedding models are multilingual, meaning they place semantically equivalent sentences from different languages near each other. If you need cross-language search (Spanish query retrieving English docs), choose a multilingual embedding model and test with bilingual query sets.
Understanding “Nearest Neighbors” and Vector Databases
When you have thousands to millions of vectors, comparing a query vector to every stored vector can be slow. Vector databases and libraries use approximate nearest neighbor (ANN) algorithms to retrieve close vectors quickly. “Approximate” means you trade a tiny amount of recall for big speed gains.
Key ideas you will encounter:
- Indexing: building a structure that makes similarity search fast.
- Top-k search: retrieve the k most similar vectors.
- Filters: restrict search to a subset (e.g., a specific customer’s documents).
- Hybrid search: combine keyword constraints with vector similarity (useful when exact terms matter, like part numbers).
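To see what an ANN index is speeding up, here is the exact (brute-force) version of top-k search with a simple metadata filter; on large collections an ANN index returns approximately the same neighbors far faster. The item layout is an illustrative assumption.

import numpy as np

# Brute-force top-k search; assumes query and stored vectors are
# unit-normalized, so dot product equals cosine similarity.
def top_k_search(query_vector, items, k=10, metadata_filter=None):
    # items: list of dicts like {"id": ..., "vector": np.ndarray, "metadata": {...}}
    candidates = [it for it in items if metadata_filter is None or metadata_filter(it["metadata"])]
    scores = [float(np.dot(query_vector, it["vector"])) for it in candidates]
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[i]["id"], scores[i]) for i in order]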
Step-by-Step: Evaluating Semantic Similarity in Your Own Data
Embeddings can feel “magical” until you measure them. A lightweight evaluation process helps you catch failure modes early.
Step 1: Create a Small Test Set of Queries
Collect 30–100 realistic queries from users (or write them based on support logs). For each query, identify the correct chunk(s) manually. This becomes your ground truth.
Step 2: Run Retrieval and Record Metrics
For each query, retrieve top-k results and check whether a correct chunk appears in the top 1, top 3, top 5, etc.
- Recall@k: how often the correct answer appears in the top k.
- MRR (mean reciprocal rank): rewards putting the correct result near the top.
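A minimal sketch of both metrics, assuming that for each query you know the set of relevant chunk IDs and have the ranked list of retrieved IDs:

# Recall@k: 1 if any relevant chunk appears in the top k results, else 0;
# average this over all queries.
def recall_at_k(retrieved_ids, relevant_ids, k):
    return 1.0 if any(r in relevant_ids for r in retrieved_ids[:k]) else 0.0

# MRR: average of 1 / (rank of the first relevant result); a query with
# no relevant result in the list contributes 0.
def mean_reciprocal_rank(all_retrieved, all_relevant):
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        for rank, r in enumerate(retrieved_ids, start=1):
            if r in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)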
Step 3: Inspect Failures
When retrieval fails, categorize why:
- Chunk too broad or too narrow.
- Important terms missing (e.g., acronym not present).
- Query ambiguous (needs clarification).
- Corpus missing the answer (no chunk contains the needed info).
- Model mismatch (embedding model not good for the domain/language).
Step 4: Iterate on Chunking and Metadata
Many retrieval improvements come from data preparation rather than changing models. Try:
- Splitting long chunks that cover multiple topics.
- Adding headings/titles into the embedded text.
- Removing repeated boilerplate.
- Adding metadata filters (e.g., product version) to prevent irrelevant matches.
Common Failure Modes and How to Mitigate Them
1) “Topic Similar” but Not “Answer Similar”
A query like “How do I cancel my subscription?” might retrieve chunks about billing in general, not cancellation steps. This happens because embeddings capture topical similarity. Mitigations:
- Use re-ranking to improve precision.
- Chunk by task/procedure so cancellation steps are isolated.
- Add structured metadata like “intent = cancellation” if you can label content.
2) Negation and Subtle Constraints
Embeddings can struggle with fine-grained distinctions like “supported” vs “not supported,” or “works offline” vs “doesn’t work offline.” Mitigations:
- Keep critical constraints close to the relevant statement in the same chunk.
- Use re-ranking, which reads the full text and can better handle negation.
- In high-stakes domains, add rule-based checks or require explicit citations.
3) Numbers, IDs, and Exact Matches
Embeddings are not optimized for exact string matching. Queries like “error code 0x80070005” or “part number AB-1234” may not retrieve reliably with pure semantic similarity. Mitigations:
- Use hybrid search: keyword match for codes + vector similarity for meaning.
- Store normalized forms of IDs in metadata and filter on them.
- Include error codes verbatim in chunks that discuss them.
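One simple hybrid pattern: require an exact match on any codes extracted from the query, then rank the surviving chunks by vector similarity. The regular expression and the item layout below are illustrative assumptions.

import re
import numpy as np

# Example pattern for hex error codes ("0x80070005") and part numbers ("AB-1234").
CODE_PATTERN = re.compile(r"\b(?:0x[0-9A-Fa-f]+|[A-Z]{2}-\d{3,})\b")

def hybrid_search(query, query_vector, items, k=10):
    codes = set(CODE_PATTERN.findall(query))
    # Keep only chunks that contain every code mentioned in the query.
    candidates = [it for it in items if all(code in it["text"] for code in codes)]
    # Rank survivors by vector similarity (vectors assumed unit-normalized).
    scored = [(float(np.dot(query_vector, it["vector"])), it) for it in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:k]]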
4) Overly Generic Queries
Queries like “Help me” or “It doesn’t work” carry little semantic content, so retrieval will be noisy. Mitigations:
- Ask a clarifying question before retrieval (e.g., “What product and what error message?”).
- Use UI prompts that encourage users to include details.
Embeddings Beyond Text: Cross-Modal Similarity
Some embedding models map different modalities into a shared space, such as images and text. In such a system, an image of a “red running shoe” and the text “red running shoe” can be close in vector space. This enables:
- Text-to-image search (“show me photos of cracked screens”).
- Image-to-text retrieval (find product listings similar to a photo).
- Deduplication of images by visual similarity.
The same core idea applies: represent items as vectors, then retrieve nearest neighbors.
Implementation Notes: Storage, Updates, and Versioning
Embedding Storage
Store each chunk’s vector, raw text, and metadata. Keep an embedding model identifier and version in metadata so you can re-embed later without confusion.
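A per-chunk record might look like this (field names and values are illustrative, not tied to any particular vector database):

chunk_record = {
    "id": "doc-42#section-3#chunk-1",
    "text": "Reset your password from Settings → Security.",
    "vector": [0.013, -0.087, -0.024],  # truncated; real vectors have hundreds of values
    "metadata": {
        "doc_id": "doc-42",
        "title": "Account security",
        "url": "https://example.com/docs/account-security",
        "embedding_model": "example-embedding-model",
        "embedding_model_version": "2024-01",
    },
}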
Updating Content
When documents change, update the affected chunks and their embeddings. If you change chunking rules, you may need to rebuild the index.
Model Changes
Switching embedding models changes the geometry of the space. You generally cannot mix vectors from different embedding models in the same index and expect meaningful similarity. Plan for re-embedding and re-indexing when upgrading.
Hands-On Mini Walkthrough: Similarity Search on a Tiny Corpus
To make the mechanics concrete, consider a tiny corpus of four chunks:
- Chunk A: “Reset your password from Settings → Security.”
- Chunk B: “Update billing information and view invoices.”
- Chunk C: “Cancel your subscription from the Billing page.”
- Chunk D: “Troubleshoot login issues when you can’t sign in.”
You embed all four chunks and store them. Now a user query arrives: “I forgot my password and can’t log in.” You embed the query and compute similarity to each chunk. A typical outcome is that Chunk A and Chunk D score highest, because the query combines password and login trouble. If you only want one result, you might return Chunk A. If you want to support multi-step help, you might return both A and D and present them as “Password reset” and “Login troubleshooting.”
Now consider the query: “How do I stop being charged?” Keyword search might match “charged” poorly if the docs say “cancel subscription.” Embeddings often connect “stop being charged” with “cancel subscription” and retrieve Chunk C.
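Here is the same walkthrough as a runnable sketch using the sentence-transformers library; the model name is an example, and exact scores vary by model, but the ranking pattern described above is what you would typically observe.

from sentence_transformers import SentenceTransformer, util

# Example model; any sentence embedding model works the same way here.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = {
    "A": "Reset your password from Settings → Security.",
    "B": "Update billing information and view invoices.",
    "C": "Cancel your subscription from the Billing page.",
    "D": "Troubleshoot login issues when you can’t sign in.",
}

chunk_ids = list(corpus)
chunk_vectors = model.encode([corpus[cid] for cid in chunk_ids], normalize_embeddings=True)

def rank_chunks(query):
    query_vector = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_vector, chunk_vectors)[0]
    return sorted(zip(chunk_ids, scores.tolist()), key=lambda pair: pair[1], reverse=True)

print(rank_chunks("I forgot my password and can’t log in"))  # A and D typically score highest
print(rank_chunks("How do I stop being charged?"))           # C typically scores highest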
Practical Checklist for Using Embeddings Effectively
- Use the same embedding model for indexing and querying.
- Chunk so each chunk covers one coherent topic or task.
- Remove repeated boilerplate that appears in every chunk.
- Include titles/headings in the embedded text when helpful.
- Use metadata filters to prevent irrelevant matches.
- Consider re-ranking for higher precision.
- Evaluate with real queries and track Recall@k and MRR.
- Use hybrid search when exact codes/IDs matter.