What “Embeddings” Are (and What They Are Not)
An embedding is a numerical representation of a piece of content—most often text, but also images, audio, or code—designed so that “meaningfully similar” items end up close to each other in a high-dimensional vector space. In practice, an embedding is a list of numbers (a vector), typically 384, 768, or 1536 floating-point values long. No single value is interpretable on its own; the meaning is distributed across the entire vector.
Embeddings are not summaries. They do not necessarily preserve the original wording, and you cannot reliably reconstruct the original text from an embedding. Instead, embeddings preserve relationships: if two texts talk about the same concept, their vectors tend to be close; if they talk about different concepts, their vectors tend to be far apart.
Embeddings are also not “truth.” They capture patterns learned from data and can reflect biases, domain gaps, or ambiguity. Two texts can be close because they share a topic, style, or common co-occurrence patterns—even if one is incorrect. Treat embeddings as a powerful similarity tool, not a fact-checker.
Why Embeddings Enable Semantic Similarity
Traditional keyword search matches exact words. Semantic similarity aims to match meaning: “How do I reset my password?” should match “I forgot my login credentials” even if they share few words. Embeddings make this possible because the model has learned to place related concepts near each other in vector space.
Think of the embedding space as a map where distance corresponds to semantic relatedness. “Dog” and “puppy” land near each other; “dog” and “quantum tunneling” land far apart. The map is not 2D; it may have hundreds or thousands of dimensions, which allows it to encode many subtle relationships at once (topic, intent, sentiment, domain, etc.).
Intuition: Directions and Neighborhoods
In an embedding space, you can imagine “directions” that correspond to latent features. For example, one direction might roughly correlate with “customer support intent,” another with “programming,” another with “medical.” These are not explicit labels; they are emergent patterns. A text’s embedding is a point whose position reflects many such features simultaneously.
Semantic similarity then becomes a geometric problem: given two vectors, compute how close they are. If close, treat them as semantically similar.
Common Similarity Measures (Cosine, Dot Product, Euclidean)
To compare embeddings, you use a similarity (or distance) function. The most common are cosine similarity, dot product, and Euclidean distance.
Cosine Similarity
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It is widely used because many embedding models are trained so that direction matters more than length.
cos_sim(a, b) = (a · b) / (||a|| * ||b||)

Values range from -1 to 1 (often 0 to 1 in practice for many embedding spaces). Higher means more similar.
Dot Product
The dot product is similar to cosine similarity but includes magnitude. Some systems normalize vectors to unit length so dot product and cosine similarity become equivalent.
dot(a, b) = a · b

Euclidean Distance
Euclidean distance measures straight-line distance between points. It can work well, but cosine similarity is often preferred for text embeddings because it is less sensitive to vector length.
euclid(a, b) = ||a - b||

Which Should You Use?
- If your embedding vectors are normalized (unit length), cosine similarity and dot product rank results the same.
- If you are using a vector database, follow its recommended metric for the embedding model you chose.
- When in doubt for text: cosine similarity is a safe default.
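To make the formulas concrete, here is a minimal sketch of the three measures in Python with NumPy; the vectors a and b stand in for embeddings produced by whatever model you use.

import numpy as np

# Cosine similarity: angle between the vectors, ignoring magnitude.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dot product: like cosine similarity, but magnitude counts.
def dot_product(a, b):
    return float(np.dot(a, b))

# Euclidean distance: straight-line distance between the two points.
def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# On unit-length (normalized) vectors, cosine similarity and dot product
# produce identical rankings.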
What Embeddings Are Used For
1) Semantic Search
You embed documents (or document chunks) and embed the user query. Then you retrieve the nearest document vectors to the query vector. This finds conceptually relevant results even when wording differs.
2) Retrieval-Augmented Generation (RAG)
Embeddings are commonly used to retrieve relevant passages that are then provided to a language model as reference material. The embedding step is the “retrieval” part: it selects which text snippets are most relevant to the user’s question.
3) Clustering and Topic Discovery
If you embed many items (support tickets, product reviews, meeting notes), you can cluster vectors to discover common themes. This is useful for analytics and triage.
4) Deduplication and Near-Duplicate Detection
Embeddings can detect paraphrases and near-duplicates. For example, two bug reports that describe the same issue in different wording can be flagged as duplicates.
5) Recommendation and Matching
You can embed user profiles and items (jobs, products, articles) and recommend items whose embeddings are close to the user’s embedding.
Step-by-Step: Building a Simple Semantic Search Pipeline
This section outlines a practical workflow you can implement in many languages and tooling stacks. The core idea is always the same: embed, store, query, rank.
Step 1: Prepare Your Documents
Start with a set of documents you want to search: FAQs, internal policies, product manuals, or notes. Clean obvious noise (broken encoding, repeated headers/footers) because repeated boilerplate can dominate similarity.
- Remove irrelevant navigation text (e.g., “Privacy Policy | Terms | Contact”).
- Keep meaningful headings; they often help retrieval.
- Decide what metadata you need: document ID, title, URL, date, product version, access permissions.
Step 2: Chunk the Documents for Retrieval
Semantic search usually works best on smaller passages rather than entire long documents. If passages are too long, they may mix multiple topics and dilute similarity. If too short, they may lack context.
- Common chunk sizes: a few paragraphs or a few hundred words.
- Prefer chunk boundaries at natural breaks (headings, paragraphs) rather than arbitrary cuts.
- Store metadata linking each chunk back to its source document and section.
Even though you are not doing keyword search, chunking still matters because retrieval returns chunks. The goal is: each chunk should be “about one thing.”
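As a rough illustration of “natural breaks first, size limit second,” here is a sketch of a paragraph-based chunker; the max_words threshold and the blank-line splitting rule are assumptions you would tune for your own documents.

# Illustrative chunker: split on blank lines (paragraph boundaries) and
# pack paragraphs into chunks of at most max_words words, so each chunk
# stays small and topically coherent.
def chunk_by_paragraphs(text, max_words=300):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and current_len + words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks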
Step 3: Generate Embeddings for Each Chunk
Use an embedding model to convert each chunk into a vector. Store the vector alongside the chunk text and metadata.
# Pseudocode (language-agnostic) for indexing documents into a vector store
chunks = chunk_documents(docs)
for chunk in chunks:
    vector = embed(chunk.text)
    vector_store.upsert(id=chunk.id, vector=vector, metadata=chunk.metadata, text=chunk.text)

Practical tips:
- Batch embedding requests for speed and cost efficiency.
- Cache embeddings so you don’t recompute them unnecessarily.
- Use the same embedding model for documents and queries.
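A minimal caching sketch, assuming an embed_fn callable provided by whatever embedding client you use; the in-memory dict is for illustration only (a real system would persist the cache).

import hashlib

# Cache keyed on content hash plus model identifier, so a chunk is only
# re-embedded when its text or the embedding model changes.
embedding_cache = {}

def cached_embed(text, model_name, embed_fn):
    key = (model_name, hashlib.sha256(text.encode("utf-8")).hexdigest())
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(text)
    return embedding_cache[key]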
Step 4: Embed the User Query
When a user searches, embed their query using the same model.
query_vector = embed(user_query)

Step 5: Retrieve Nearest Neighbors
Ask your vector store for the top k nearest chunks to the query vector.
results = vector_store.search(vector=query_vector, top_k=10, filter=optional_metadata_filter)

Many systems support metadata filters, such as “product_version = v3”, “language = en”, or “user has access = true.” Filtering is often essential in real applications.
Step 6: Re-rank (Optional but Often Valuable)
Nearest-neighbor retrieval is fast, but the top results can still include “almost relevant” chunks. A common improvement is re-ranking: take the top 20–50 retrieved chunks and score them with a more precise model (often a cross-encoder or a specialized re-ranker) that reads the query and chunk together.
candidates = vector_store.search(query_vector, top_k=30)
ranked = rerank(user_query, [c.text for c in candidates])
final = ranked[:10]

Re-ranking improves precision, especially when your corpus contains many similar items.
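One common way to implement the rerank step is with a cross-encoder. The sketch below uses the sentence-transformers library; the specific model name is an example, not a requirement.

from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: reads query and chunk text together and
# returns a relevance score per pair, then sorts by that score.
def rerank(query, texts, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    model = CrossEncoder(model_name)
    scores = model.predict([(query, text) for text in texts])
    return [text for _, text in sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)]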
Step 7: Present Results (or Feed Them Into Another System)
For semantic search, you display the top chunks with titles and links. For RAG, you pass the top chunks as reference context to a language model. In both cases, keep track of which chunks were used so you can audit behavior and improve indexing.
Practical Example: Why Semantic Similarity Beats Keywords
Imagine a knowledge base with a chunk that says: “To change your account password, go to Settings → Security and select ‘Update Password’.” A user searches: “I can’t remember my password, how do I make a new one?” Keyword search might miss it if it overweights “remember” and “new,” while embeddings will likely place these texts close because they share intent: password reset/change.
Another example: a chunk says “Request a refund within 30 days of purchase.” A user asks “Can I get my money back after buying?” Embeddings often connect “refund” with “money back” even though the exact term differs.
Design Choices That Strongly Affect Embedding Quality
Chunk Content and Noise
Embeddings reflect whatever text you feed them. If every chunk begins with the same boilerplate (“This article explains…”), that repeated text can reduce distinctiveness. Remove or minimize repeated templates.
Titles and Headings
Including a section title at the top of a chunk can improve retrieval because titles often contain high-signal keywords. A practical pattern is to embed: “Document Title — Section Heading — Paragraph text” as a single chunk string.
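For example, a hypothetical helper that assembles the string to embed (the separator and field names are up to you):

# Prepend title and heading so the embedded string carries high-signal context.
def build_chunk_text(doc_title, section_heading, paragraph_text):
    return f"{doc_title} — {section_heading} — {paragraph_text}"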
Domain-Specific Vocabulary
General embedding models may not perfectly capture specialized jargon (legal clauses, medical terms, internal product codenames). If your domain is specialized, evaluate retrieval quality with real queries. Sometimes you can improve results by:
- Adding short glossaries or expansions in the text (e.g., “SSO (single sign-on)”).
- Ensuring acronyms appear alongside their expanded form at least once.
- Using a domain-tuned embedding model if available.
Language and Multilingual Content
Some embedding models are multilingual, meaning they place semantically equivalent sentences from different languages near each other. If you need cross-language search (Spanish query retrieving English docs), choose a multilingual embedding model and test with bilingual query sets.
Understanding “Nearest Neighbors” and Vector Databases
When you have thousands to millions of vectors, comparing a query vector to every stored vector can be slow. Vector databases and libraries use approximate nearest neighbor (ANN) algorithms to retrieve close vectors quickly. “Approximate” means you trade a tiny amount of recall for big speed gains.
Key ideas you will encounter:
- Indexing: building a structure that makes similarity search fast.
- Top-k search: retrieve the k most similar vectors.
- Filters: restrict search to a subset (e.g., a specific customer’s documents).
- Hybrid search: combine keyword constraints with vector similarity (useful when exact terms matter, like part numbers).
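To see what an ANN index is speeding up, here is the exact (brute-force) version of top-k search with a simple metadata filter; on large collections an ANN index returns approximately the same neighbors far faster. The item layout is an illustrative assumption.

import numpy as np

# Brute-force top-k search; assumes query and stored vectors are
# unit-normalized, so dot product equals cosine similarity.
def top_k_search(query_vector, items, k=10, metadata_filter=None):
    # items: list of dicts like {"id": ..., "vector": np.ndarray, "metadata": {...}}
    candidates = [it for it in items if metadata_filter is None or metadata_filter(it["metadata"])]
    scores = [float(np.dot(query_vector, it["vector"])) for it in candidates]
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[i]["id"], scores[i]) for i in order]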
Step-by-Step: Evaluating Semantic Similarity in Your Own Data
Embeddings can feel “magical” until you measure them. A lightweight evaluation process helps you catch failure modes early.
Step 1: Create a Small Test Set of Queries
Collect 30–100 realistic queries from users (or write them based on support logs). For each query, identify the correct chunk(s) manually. This becomes your ground truth.
Step 2: Run Retrieval and Record Metrics
For each query, retrieve top-k results and check whether a correct chunk appears in the top 1, top 3, top 5, etc.
- Recall@k: how often the correct answer appears in the top k.
- MRR (mean reciprocal rank): rewards putting the correct result near the top.
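A minimal sketch of both metrics, assuming that for each query you know the set of relevant chunk IDs and have the ranked list of retrieved IDs:

# Recall@k: 1 if any relevant chunk appears in the top k results, else 0;
# average this over all queries.
def recall_at_k(retrieved_ids, relevant_ids, k):
    return 1.0 if any(r in relevant_ids for r in retrieved_ids[:k]) else 0.0

# MRR: average of 1 / (rank of the first relevant result); a query with
# no relevant result in the list contributes 0.
def mean_reciprocal_rank(all_retrieved, all_relevant):
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        for rank, r in enumerate(retrieved_ids, start=1):
            if r in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)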
Step 3: Inspect Failures
When retrieval fails, categorize why:
- Chunk too broad or too narrow.
- Important terms missing (e.g., acronym not present).
- Query ambiguous (needs clarification).
- Corpus missing the answer (no chunk contains the needed info).
- Model mismatch (embedding model not good for the domain/language).
Step 4: Iterate on Chunking and Metadata
Many retrieval improvements come from data preparation rather than changing models. Try:
- Splitting long chunks that cover multiple topics.
- Adding headings/titles into the embedded text.
- Removing repeated boilerplate.
- Adding metadata filters (e.g., product version) to prevent irrelevant matches.
Common Failure Modes and How to Mitigate Them
1) “Topic Similar” but Not “Answer Similar”
A query like “How do I cancel my subscription?” might retrieve chunks about billing in general, not cancellation steps. This happens because embeddings capture topical similarity. Mitigations:
- Use re-ranking to improve precision.
- Chunk by task/procedure so cancellation steps are isolated.
- Add structured metadata like “intent = cancellation” if you can label content.
2) Negation and Subtle Constraints
Embeddings can struggle with fine-grained distinctions like “supported” vs “not supported,” or “works offline” vs “doesn’t work offline.” Mitigations:
- Keep critical constraints close to the relevant statement in the same chunk.
- Use re-ranking, which reads the full text and can better handle negation.
- In high-stakes domains, add rule-based checks or require explicit citations.
3) Numbers, IDs, and Exact Matches
Embeddings are not optimized for exact string matching. Queries like “error code 0x80070005” or “part number AB-1234” may not retrieve reliably with pure semantic similarity. Mitigations:
- Use hybrid search: keyword match for codes + vector similarity for meaning.
- Store normalized forms of IDs in metadata and filter on them.
- Include error codes verbatim in chunks that discuss them.
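One simple hybrid pattern: require an exact match on any codes extracted from the query, then rank the surviving chunks by vector similarity. The regular expression and the item layout below are illustrative assumptions.

import re
import numpy as np

# Example pattern for hex error codes ("0x80070005") and part numbers ("AB-1234").
CODE_PATTERN = re.compile(r"\b(?:0x[0-9A-Fa-f]+|[A-Z]{2}-\d{3,})\b")

def hybrid_search(query, query_vector, items, k=10):
    codes = set(CODE_PATTERN.findall(query))
    # Keep only chunks that contain every code mentioned in the query.
    candidates = [it for it in items if all(code in it["text"] for code in codes)]
    # Rank survivors by vector similarity (vectors assumed unit-normalized).
    scored = [(float(np.dot(query_vector, it["vector"])), it) for it in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:k]]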
4) Overly Generic Queries
Queries like “Help me” or “It doesn’t work” carry little semantic content, so retrieval will be noisy. Mitigations:
- Ask a clarifying question before retrieval (e.g., “What product and what error message?”).
- Use UI prompts that encourage users to include details.
Embeddings Beyond Text: Cross-Modal Similarity
Some embedding models map different modalities into a shared space, such as images and text. In such a system, an image of a “red running shoe” and the text “red running shoe” can be close in vector space. This enables:
- Text-to-image search (“show me photos of cracked screens”).
- Image-to-text retrieval (find product listings similar to a photo).
- Deduplication of images by visual similarity.
The same core idea applies: represent items as vectors, then retrieve nearest neighbors.
Implementation Notes: Storage, Updates, and Versioning
Embedding Storage
Store each chunk’s vector, raw text, and metadata. Keep an embedding model identifier and version in metadata so you can re-embed later without confusion.
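A per-chunk record might look like this (field names and values are illustrative, not tied to any particular vector database):

chunk_record = {
    "id": "doc-42#section-3#chunk-1",
    "text": "Reset your password from Settings → Security.",
    "vector": [0.013, -0.087, -0.024],  # truncated; real vectors have hundreds of values
    "metadata": {
        "doc_id": "doc-42",
        "title": "Account security",
        "url": "https://example.com/docs/account-security",
        "embedding_model": "example-embedding-model",
        "embedding_model_version": "2024-01",
    },
}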
Updating Content
When documents change, update the affected chunks and their embeddings. If you change chunking rules, you may need to rebuild the index.
Model Changes
Switching embedding models changes the geometry of the space. You generally cannot mix vectors from different embedding models in the same index and expect meaningful similarity. Plan for re-embedding and re-indexing when upgrading.
Hands-On Mini Walkthrough: Similarity Search on a Tiny Corpus
To make the mechanics concrete, consider a tiny corpus of four chunks:
- Chunk A: “Reset your password from Settings → Security.”
- Chunk B: “Update billing information and view invoices.”
- Chunk C: “Cancel your subscription from the Billing page.”
- Chunk D: “Troubleshoot login issues when you can’t sign in.”
You embed all four chunks and store them. Now a user query arrives: “I forgot my password and can’t log in.” You embed the query and compute similarity to each chunk. A typical outcome is that Chunk A and Chunk D score highest, because the query combines password and login trouble. If you only want one result, you might return Chunk A. If you want to support multi-step help, you might return both A and D and present them as “Password reset” and “Login troubleshooting.”
Now consider the query: “How do I stop being charged?” Keyword search might match “charged” poorly if the docs say “cancel subscription.” Embeddings often connect “stop being charged” with “cancel subscription” and retrieve Chunk C.
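Here is the same walkthrough as a runnable sketch using the sentence-transformers library; the model name is an example, and exact scores vary by model, but the ranking pattern described above is what you would typically observe.

from sentence_transformers import SentenceTransformer, util

# Example model; any sentence embedding model works the same way here.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = {
    "A": "Reset your password from Settings → Security.",
    "B": "Update billing information and view invoices.",
    "C": "Cancel your subscription from the Billing page.",
    "D": "Troubleshoot login issues when you can’t sign in.",
}

chunk_ids = list(corpus)
chunk_vectors = model.encode([corpus[cid] for cid in chunk_ids], normalize_embeddings=True)

def rank_chunks(query):
    query_vector = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_vector, chunk_vectors)[0]
    return sorted(zip(chunk_ids, scores.tolist()), key=lambda pair: pair[1], reverse=True)

print(rank_chunks("I forgot my password and can’t log in"))  # A and D typically score highest
print(rank_chunks("How do I stop being charged?"))           # C typically scores highest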
Practical Checklist for Using Embeddings Effectively
- Use the same embedding model for indexing and querying.
- Chunk so each chunk covers one coherent topic or task.
- Remove repeated boilerplate that appears in every chunk.
- Include titles/headings in the embedded text when helpful.
- Use metadata filters to prevent irrelevant matches.
- Consider re-ranking for higher precision.
- Evaluate with real queries and track Recall@k and MRR.
- Use hybrid search when exact codes/IDs matter.