Why Models Don’t Read “Characters” or “Words”
When you type a sentence into an LLM, it does not process the input as a continuous stream of characters the way you visually read it, and it usually does not treat it as a simple list of words either. Instead, the text is converted into tokens: small pieces of text represented as numbers. The model’s internal computations operate on these token IDs.
This token-based view matters because it affects cost, speed, maximum input size, how prompts should be structured, and why some strings (like code, URLs, or rare names) behave differently than common prose. Understanding tokens also explains why text is “chunked” into manageable pieces for both training and practical usage.
What a Token Is (Conceptually)
A token is a unit of text chosen by a tokenizer. A tokenizer is a deterministic procedure that maps text to a sequence of token IDs and back again. Tokens are often “subword” pieces rather than full words. For example, a common English word might be one token, while a rare word might be split into multiple tokens. Punctuation and spaces can also be part of tokens, depending on the tokenizer.
It is helpful to think of tokens as a compromise between characters and words:
- Characters are too small: sequences become very long, making processing expensive.
- Words are too large: the vocabulary would be enormous (including every inflection, misspelling, and rare proper noun), and unknown words would be a constant problem.
- Subword tokens balance both: common words can be single tokens, and uncommon words can be composed from smaller pieces.
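A minimal sketch of the text-to-IDs mapping, in Python, assuming the tiktoken library and its cl100k_base encoding (one specific tokenizer; your model’s tokenizer may split the same text differently):

import tiktoken

# Load one specific tokenizer; other models use other vocabularies,
# so the exact IDs and splits here are illustrative only.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits uncommon words into subword pieces."
ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]   # each ID mapped back to its text piece

print(ids)     # one integer per token
print(pieces)  # common words tend to stay whole; rare words break into pieces
assert enc.decode(ids) == text            # the mapping is reversible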
Tokens Are Not Universal
Different models can use different tokenizers and vocabularies. The same sentence may become a different number of tokens across models. Even within one model family, versions can differ. This is why “token counts” are always model-specific.
How Tokenization Typically Works (High-Level)
Most modern LLM tokenizers are based on subword segmentation methods such as Byte Pair Encoding (BPE) or related algorithms. You do not need the full math to use them, but you should understand the behavior they produce.
Intuition: Frequent Pieces Become Single Tokens
During tokenizer construction, the algorithm looks at a large text corpus and learns which character sequences occur frequently. Frequent sequences (like “the”, “ing”, “http”, “://”, “.com”, or even a leading space followed by a common word) are likely to become single tokens. Less frequent sequences are broken into smaller parts.
As a result:
- Common words in a language often map to one token.
- Common suffixes/prefixes may be tokens (e.g., “un”, “re”, “-ing”).
- Rare names may split into multiple tokens.
- Numbers, emojis, and mixed scripts can tokenize in surprising ways.
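The merge intuition behind BPE can be shown with a toy sketch: count adjacent symbol pairs across a tiny corpus and fuse the most frequent pair, repeating a few times. Real tokenizers are trained on enormous corpora, usually operate on bytes, and differ in many details; this is illustration only.

from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters; after a few merges, frequent pieces like "in" + "g" -> "ing" emerge.
corpus = [list(w) for w in ["sing", "singing", "ring", "ringing", "king"]]
for _ in range(4):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)  # frequent sequences have fused into larger subword units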
Whitespace and Punctuation Can Be Part of Tokens
Many tokenizers treat a leading space as meaningful, so “ cat” and “cat” might be different tokens. This is one reason prompts sometimes behave differently depending on spacing and formatting. Punctuation can also attach to tokens (e.g., “word,” might be tokenized differently than “word”).
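You can see this directly with a tokenizer library (tiktoken is assumed here as a stand-in for whatever tokenizer your model actually uses):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A leading space usually changes the token IDs, even though the visible word is the same.
print(enc.encode("cat"))    # e.g. one ID for "cat"
print(enc.encode(" cat"))   # a different ID for " cat" (the space is part of the token)
print(enc.encode("word,"))  # punctuation may attach or split, depending on the vocabulary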
Why LLMs Use Tokens at All
1) Tokens Map Cleanly to Numbers
Neural networks operate on numbers. Tokenization provides a stable mapping from text to integers. Each token ID is then converted into a vector representation (an embedding) that the model can process.
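A minimal sketch of the ID-to-vector lookup, using a small random table; real models learn these vectors during training and use far larger vocabularies and dimensions:

import numpy as np

vocab_size, embed_dim = 50_000, 8          # toy sizes; real models are much larger
embedding_table = np.random.randn(vocab_size, embed_dim)

token_ids = [464, 3797, 7731]              # hypothetical IDs produced by a tokenizer
vectors = embedding_table[token_ids]       # one row (vector) per token ID
print(vectors.shape)                       # (3, 8): the sequence the model actually processes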
2) Tokens Reduce Sequence Length
If the model processed characters, even short paragraphs would become thousands of steps. Tokens compress text so the model can handle longer contexts within a fixed compute budget.
3) Tokens Provide a Manageable Vocabulary
A word-level vocabulary would be huge and brittle. Token vocabularies are large but manageable, and they can represent arbitrary text by combining tokens.
Token Count: Why It Matters in Practice
Token count affects several practical constraints:
- Context window limits: Models can only consider a limited number of tokens at once (input plus output). If you exceed the limit, you must shorten the text or chunk it.
- Latency: More tokens generally mean more computation and slower responses.
- Cost: Many APIs price by tokens processed (input and output).
- Prompt reliability: Very long prompts increase the chance that important instructions are diluted or pushed out of the context window.
Approximate Rules of Thumb (Not Guarantees)
In English prose, a rough estimate is that 1 token is often around 3–4 characters of text, or about 0.75 words on average. But this varies widely. Code, JSON, long identifiers, and languages with different writing systems can produce very different token-to-character ratios.
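The characters-divided-by-four heuristic as a tiny helper, useful only for ballpark planning; measure with the real tokenizer whenever you are close to a limit:

def estimate_tokens(text: str) -> int:
    """Very rough token estimate for English prose (~4 characters per token).
    Code, URLs, and non-English text can deviate a lot; treat this as a ballpark only."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The cat sat on the mat."))  # ~6, in the same ballpark as the word count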
Examples: How Text Can Break into Tokens
Because tokenization is model-specific, the exact splits below are illustrative rather than authoritative. The goal is to build intuition about what tends to happen.
Example 1: Common Words
Sentence: “The cat sat on the mat.”
Likely behavior: many words become single tokens, punctuation may be attached, and spaces may be included in tokens. The overall token count is usually close to the number of words.
Example 2: Rare Proper Nouns
Sentence: “I met X Æ A-12 yesterday.”
Likely behavior: unusual characters and rare sequences split into multiple tokens. The name may become many tokens even though it looks short.
Example 3: Code and Identifiers
Text: “parseHTTPRequestHeadersAndValidateTLSConfig()”
Likely behavior: camelCase and long identifiers often split into multiple tokens (e.g., “parse”, “HTTP”, “Request”, “Headers”, “And”, “Validate”, “TLS”, “Config”, “()” may be separate pieces). Code can be token-expensive compared to plain language.
Example 4: URLs
Text: “https://example.com/path/to/resource?id=123&sort=asc”
Likely behavior: some common URL fragments become tokens, but many punctuation-heavy segments split, increasing token count.
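Counting the four examples with one concrete tokenizer (tiktoken’s cl100k_base, an assumption; your model’s counts will differ) makes the pattern visible:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "The cat sat on the mat.",
    "I met X Æ A-12 yesterday.",
    "parseHTTPRequestHeadersAndValidateTLSConfig()",
    "https://example.com/path/to/resource?id=123&sort=asc",
]
for s in samples:
    ids = enc.encode(s)
    # Rare names, long identifiers, and URLs tend to need far more tokens per character.
    print(f"{len(ids):>3} tokens | {len(s):>3} chars | {s}")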
What “Chunking” Means and Why It Exists
Chunking is the practice of splitting a large body of text into smaller segments (chunks) that can be processed separately. Chunking is used in multiple places:
- During training: large corpora are broken into sequences of a fixed maximum length.
- During inference (your usage): long documents must be split to fit within the model’s context window.
- During retrieval workflows: documents are chunked so that relevant pieces can be found and provided to the model.
Chunking is not just a convenience; it is required by how transformer-based models handle context. They operate on sequences of tokens up to a fixed maximum length (the context window). If your text is longer, something must give: you either truncate, summarize, or split into chunks and process them iteratively.
Chunking vs. Tokenization
Tokenization converts text into tokens. Chunking groups tokens into segments. You can chunk by characters or words, but the most reliable approach is to chunk by tokens because the model’s limits are token-based.
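A minimal token-based chunker, again assuming tiktoken; the point is simply that the limit is enforced in tokens, not characters:

import tiktoken

def chunk_by_tokens(text: str, chunk_size: int, encoding_name: str = "cl100k_base"):
    """Split text into chunks of at most chunk_size tokens (no overlap)."""
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(text)
    # A token boundary can occasionally fall inside a multi-byte character;
    # in practice, prefer splitting at paragraph boundaries first (see below).
    return [enc.decode(ids[i:i + chunk_size]) for i in range(0, len(ids), chunk_size)]

document = "..."  # your long document text goes here
chunks = chunk_by_tokens(document, chunk_size=500)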
Step-by-Step: Estimating Token Budget and Planning Chunks
This workflow helps you avoid context overflows and design prompts that scale to long text.
Step 1: Identify the Model’s Context Window
Find the maximum tokens the model can handle in a single request. Remember that this includes both input tokens and output tokens. If you want a long answer, you must reserve room for output.
Step 2: Decide How Much Output You Need
For example, if the model supports 16,000 tokens total and you want up to 1,000 tokens of output, you should aim to keep the input under about 15,000 tokens. In practice, keep a safety margin because system messages, tool definitions, or formatting overhead can add tokens.
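The same arithmetic as a small helper, with a safety margin for system messages and formatting overhead (the 500-token margin is an arbitrary assumption; adjust it to your setup):

def input_budget(context_window: int, max_output: int, safety_margin: int = 500) -> int:
    """Tokens left for your actual input after reserving output room and a margin."""
    return context_window - max_output - safety_margin

print(input_budget(16_000, 1_000))  # 14500: the example's input room, minus the margin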
Step 3: Measure or Approximate Input Token Count
The best method is to use the tokenizer for your target model to count tokens. If you cannot, use a rough estimate (e.g., English text: characters/4). Treat estimates conservatively; if you are near the limit, measure with the actual tokenizer.
Step 4: Choose a Chunk Size in Tokens
Pick a chunk size that leaves room for instructions and output. Common choices are 300–1,500 tokens per chunk, depending on the task. Smaller chunks improve retrieval precision and reduce the chance of mixing unrelated topics, but too-small chunks can lose context.
Step 5: Add Overlap (When Needed)
Overlap means repeating some tokens from the end of one chunk at the start of the next (e.g., 10–20% overlap). Overlap helps preserve continuity for tasks like summarization, entity tracking, or extracting arguments that span boundaries.
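Extending the token chunker from earlier with overlap is mostly a matter of the step size: the window advances by chunk_size minus overlap each time.

import tiktoken

def chunk_with_overlap(text: str, chunk_size: int, overlap: int,
                       encoding_name: str = "cl100k_base"):
    """Token chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    assert 0 <= overlap < chunk_size
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(text)
    step = chunk_size - overlap  # how far the window advances per chunk
    return [enc.decode(ids[i:i + chunk_size]) for i in range(0, len(ids), step)]

# e.g. 1,000-token chunks with 150 tokens (15%) of overlap:
# chunks = chunk_with_overlap(document, chunk_size=1000, overlap=150)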
Step 6: Process Chunks with a Stable Instruction Template
Use consistent instructions so outputs are comparable and can be merged. For example: “Extract key claims and supporting evidence. Output JSON with fields: claim, evidence, location.”
Step 7: Merge Results
After processing all chunks, combine outputs. For summaries, you can do a second pass: summarize the summaries. For extraction, deduplicate and reconcile conflicts.
Step-by-Step: Chunking a Document for Summarization
Suppose you have a 40-page report that exceeds the context window. Here is a practical approach.
1) Preprocess the Text
- Remove boilerplate that does not matter (e.g., repeated headers/footers).
- Preserve section headings because they provide structure.
- Normalize whitespace to avoid accidental token inflation from messy formatting.
2) Chunk by Sections First, Then by Token Limit
Prefer semantic boundaries (headings, paragraphs) over arbitrary cuts. If a section is too large, split it further. This reduces the chance that a chunk contains unrelated topics.
3) Use Overlap at Paragraph Boundaries
Instead of overlapping raw tokens mid-sentence, overlap whole paragraphs when possible. This keeps the repeated content coherent.
4) Summarize Each Chunk with Constraints
Ask for a structured summary with a fixed maximum length and required fields. Example fields: “main points”, “numbers/metrics”, “risks”, “open questions”.
5) Create a Second-Level Summary
Combine chunk summaries and ask the model to produce a higher-level summary. Because the second-level input is already compressed, it usually fits in the context window.
Chunk prompt template (example):
You are summarizing a report chunk.
- Keep it under 200 tokens.
- Extract: (1) key points, (2) important numbers, (3) decisions, (4) unresolved questions.
Return JSON with keys: key_points, numbers, decisions, open_questions.
CHUNK:
{chunk_text}
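A sketch of the overall loop around that template. The call_model function is hypothetical and stands in for whichever API you use; it takes a prompt string and returns the model’s text.

import json

CHUNK_PROMPT = """You are summarizing a report chunk.
- Keep it under 200 tokens.
- Extract: (1) key points, (2) important numbers, (3) decisions, (4) unresolved questions.
Return JSON with keys: key_points, numbers, decisions, open_questions.
CHUNK:
{chunk_text}"""

def summarize_report(chunks, call_model):
    """Map step: summarize each chunk. Reduce step: summarize the summaries."""
    chunk_summaries = []
    for chunk in chunks:
        raw = call_model(CHUNK_PROMPT.format(chunk_text=chunk))
        chunk_summaries.append(json.loads(raw))  # assumes the model returned valid JSON

    # Second pass: the chunk summaries are already compressed, so they usually fit.
    combined = json.dumps(chunk_summaries, ensure_ascii=False)
    return call_model("Write a one-page summary of these chunk summaries:\n" + combined)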
Chunking for Search and Retrieval (Why Smaller Pieces Help)
When you want an LLM to answer questions about a large knowledge base, you often retrieve relevant text snippets and provide them as context. Chunking is crucial because retrieval systems typically work better when documents are split into focused passages.
If chunks are too large:
- Retrieval may return a chunk that contains the answer but also lots of irrelevant text, wasting tokens.
- The model may miss the key sentence because it is buried.
If chunks are too small:
- Retrieval may return fragments that lack necessary context, causing ambiguity.
- Important definitions or constraints may be separated from the statements that use them.
A practical compromise is to chunk by paragraphs or small groups of paragraphs, with modest overlap, and to store metadata like section titles so retrieved chunks remain interpretable.
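A sketch of that compromise: group paragraphs (split on blank lines, an assumption about your formatting) and carry the section title along as metadata.

def paragraph_chunks(section_title: str, section_text: str, group_size: int = 2):
    """Group paragraphs into small chunks and attach the section title as metadata."""
    paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
    chunks = []
    for i in range(0, len(paragraphs), group_size):
        chunks.append({
            "section": section_title,  # keeps the retrieved chunk interpretable
            "text": "\n\n".join(paragraphs[i:i + group_size]),
        })
    return chunks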
Common Pitfalls and How to Avoid Them
Pitfall 1: Chunking by Character Count Only
Character-based chunking can accidentally exceed token limits because tokenization is not proportional to characters in a stable way (especially for code, tables, or non-English text). If you must chunk by characters, use conservative limits and test.
Pitfall 2: Cutting in the Middle of Structured Data
Splitting JSON, CSV rows, or code blocks mid-structure can make the chunk hard to interpret. Prefer chunk boundaries that align with structure: whole objects, whole functions, whole table rows.
Pitfall 3: Losing Definitions at Boundaries
If a term is defined in one chunk and used in the next, the second chunk may become confusing. Use overlap or ensure that definitions are repeated where needed (for example, include a “glossary chunk” that is always provided).
Pitfall 4: Overlapping Too Much
Overlap improves continuity but increases token usage and can cause duplicated extracted items. If you overlap heavily, add deduplication rules in your merge step (e.g., deduplicate by normalized text or by IDs).
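A minimal deduplication rule for the merge step: normalize each extracted item and keep the first occurrence. Real pipelines might also match on IDs or fuzzy similarity.

def deduplicate(items):
    """Drop items whose normalized text has already been seen."""
    seen, unique = set(), []
    for item in items:
        key = " ".join(item.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

print(deduplicate(["Revenue grew 12%", "revenue grew 12%", "Costs fell"]))
# ['Revenue grew 12%', 'Costs fell']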
Pitfall 5: Ignoring Instruction Overhead
Your prompt instructions, formatting requirements, and any included schemas consume tokens too. A “chunk size” of 1,000 tokens is not the same as “1,000 tokens of document text” if you also include 300 tokens of instructions and metadata.
Practical Heuristics for Choosing Chunk Size
- Q&A over documents: 200–500 tokens per chunk often works well, with 20–60 tokens overlap.
- Summarization: 800–1,500 tokens per chunk can be efficient, with overlap aligned to paragraph boundaries.
- Code understanding: chunk by function/class boundaries; token counts can spike, so measure.
- Legal/policy text: chunk by numbered clauses/sections; keep references intact.
Why Token Boundaries Affect Model Behavior
Because the model predicts the next token, not the next character, the “grain” of tokens influences what is easy or hard to predict. Common sequences that are single tokens are easier to handle than rare sequences split into many tokens. This can show up as:
- Spelling of rare names: a rare name split into many tokens may be more error-prone.
- Exact string matching: tasks requiring exact reproduction (IDs, hashes) can be difficult because the model is not designed as a copying machine; tokenization can make the sequence longer and more fragile.
- Formatting sensitivity: small changes in whitespace can change tokenization, which can slightly change model behavior.
Step-by-Step: Designing Prompts with Token Constraints in Mind
1) Put the Most Important Instructions First
If you are near the context limit, later instructions are more likely to be truncated or ignored. Place critical constraints and output format requirements early.
2) Use Compact, Structured Output
Requesting verbose prose increases output tokens. If you need to process many chunks, ask for concise bullet points or JSON fields with strict length limits.
3) Avoid Redundant Context
Do not paste the same long background into every request if you can summarize it once and reuse a shorter version. Repetition consumes tokens quickly.
4) Include Only Relevant Excerpts
When answering a question about a document, provide the smallest set of chunks that contain the necessary evidence. This improves both cost and answer focus.
5) Reserve Output Headroom
If you want a detailed answer, ensure the input is small enough to allow the model to generate it. Otherwise, the response may be cut off or forced to be brief.
Token-aware prompt skeleton (example):
[Critical rules: output format, constraints]
[Task definition]
[Only the necessary context chunks]
[Question]
Tokenization and Multilingual Text
Tokenization behavior varies significantly across languages. Languages with clear word boundaries (like English) often tokenize efficiently. Languages without spaces between words, or with large character sets, may produce different token counts. Mixed-language text, transliteration, and domain-specific jargon can also increase token counts. The practical takeaway is to measure token usage on representative samples of your real data rather than relying on English-based rules of thumb.
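Measuring on representative samples takes only a few lines (tiktoken assumed again; the ratios you see depend entirely on your model’s tokenizer):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english": "The quarterly report shows steady growth.",
    "german":  "Der Quartalsbericht zeigt ein stetiges Wachstum.",
    "code":    "def parse_headers(req): return {k.lower(): v for k, v in req.items()}",
    # Replace these with real samples from your own documents and languages.
}
for name, text in samples.items():
    ratio = len(text) / len(enc.encode(text))
    print(f"{name}: {ratio:.1f} characters per token")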
Checklist: When You Should Chunk
- The input document is longer than the model’s context window.
- You need to process many documents and want predictable token usage per request.
- You are building retrieval-based Q&A and want precise, relevant snippets.
- You are extracting structured data and want to isolate topics to reduce confusion.