What “context window” means in practice
A large language model does not “remember” a conversation the way a person does. During a chat, it generates each new response by looking at a limited slice of text called the context window. The context window is the maximum amount of information the model can consider at one time: the current user message plus whatever prior messages, instructions, and other text are still included in the active context.
Think of the context window as the model’s working scratchpad for this session. If something is inside that scratchpad, the model can use it to answer consistently. If it falls outside (because the conversation got too long or because it was never provided), the model cannot directly access it and may guess, contradict earlier details, or ask for clarification.
In a typical chat setup, the system assembles a “prompt” behind the scenes that includes: the system instructions (rules and role), developer instructions (if any), the recent conversation turns, and sometimes extra retrieved documents. The model then predicts the next text based on that assembled prompt. The key limitation is that the assembled prompt cannot exceed the model’s context window.
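To make this concrete, here is a minimal sketch in Python of how an application might assemble the prompt. Everything here is illustrative: build_prompt, the bracketed role labels, and the argument names are invented for this example, not any particular vendor's API.

# A sketch of how a chat application might assemble the prompt each turn.
def build_prompt(system_instructions, developer_instructions,
                 conversation_turns, retrieved_documents=None):
    """Concatenate everything the model is allowed to see this turn."""
    parts = [f"[system] {system_instructions}"]
    if developer_instructions:
        parts.append(f"[developer] {developer_instructions}")
    for doc in retrieved_documents or []:
        parts.append(f"[document] {doc}")
    for role, text in conversation_turns:  # e.g. ("user", "..."), ("assistant", "...")
        parts.append(f"[{role}] {text}")
    return "\n".join(parts)

In a real system the assembled prompt would also be measured in tokens and checked against the window limit before being sent.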
Why context windows matter
- Consistency: Names, constraints, preferences, and decisions must remain in the active context to be reliably followed.
- Long tasks: Multi-step projects (writing, coding, planning) can drift if earlier requirements are no longer visible.
- Document work: Summarizing, extracting, or comparing long documents requires careful chunking and/or retrieval so the relevant parts are present when needed.
- Cost and latency: More context generally means more computation. Even when a model supports a large window, sending unnecessary text can slow responses and increase cost.
In-session memory: what it is (and what it is not)
In-session memory is the practical effect of keeping important information inside the current context window across turns. It is not a separate storage system inside the model; it is simply the continued inclusion of prior text in the prompt that the model sees each time it generates a response.
This is why a model can appear to “remember” what you said five messages ago: those messages are still being sent back to the model as part of the conversation context. If the chat becomes long enough, older messages may be truncated or summarized by the application to stay within the context window. When that happens, the model’s apparent memory changes accordingly.
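A small sketch of a chat loop shows why this works. It reuses build_prompt from the earlier sketch; client.generate is a placeholder for whatever text-generation call your application actually uses.

SYSTEM = "You are a helpful assistant."
transcript = []  # the running conversation, kept by the application

def chat(client, user_message):
    """One turn: append the user message, then re-send the whole history."""
    transcript.append(("user", user_message))
    prompt = build_prompt(SYSTEM, None, transcript)  # prior turns go back in
    reply = client.generate(prompt)                  # hypothetical API call
    transcript.append(("assistant", reply))
    return reply

The model "remembers" what you said five messages ago only because the transcript list still contains it and it is re-sent every time.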
Common misconceptions
- “The model stores my conversation internally.” In a standard setup, it does not. It only sees what is provided in the current prompt.
- “If it forgot, it’s being careless.” Often it is a context management issue: the relevant detail is no longer in the active context.
- “A bigger context window means perfect long-term memory.” A larger window helps, but it does not guarantee perfect recall or reasoning. It only increases how much text can be considered at once.
How context gets filled (and why it overflows)
As you chat, each turn adds more text. The application typically keeps a running transcript and sends some portion of it back to the model each time. When the transcript grows beyond the allowed context window, something must give. Different systems handle this differently, but common strategies include:
- Truncation: Drop the oldest messages first, keeping the most recent turns (a minimal sketch follows this list).
- Summarization: Replace older portions with a shorter summary to preserve key facts while saving space.
- Selective retention: Keep only messages marked as important (e.g., pinned instructions, requirements, decisions).
- Retrieval augmentation: Store conversation notes externally and re-inject only the relevant parts when needed.
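As one example, here is what the simplest of these strategies, truncation, might look like. count_tokens stands in for a real tokenizer, and the budget is whatever the application reserves for history; both are assumptions of this sketch.

def truncate_oldest(turns, budget, count_tokens):
    """Drop the oldest turns until the transcript fits the token budget."""
    kept = list(turns)
    while kept and sum(count_tokens(text) for _, text in kept) > budget:
        kept.pop(0)  # the oldest message is the first to go
    return kept

# Example with a rough characters-per-token heuristic:
# truncate_oldest(transcript, budget=8000, count_tokens=lambda t: len(t) // 4)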
Overflow is not only about long conversations. It can also happen when you paste large documents, include verbose logs, or ask for multiple outputs at once (for example, “write 10 pages plus a detailed table plus code”). The model must fit both the input and the output within system limits, so extremely large outputs can also force the system to reduce how much input context it includes.
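A rough budget check makes this concrete. The window size below is illustrative; check the documented limit for whichever model you use.

CONTEXT_WINDOW = 128_000  # illustrative; actual limits are model-specific

def fits(input_tokens, max_output_tokens):
    """Both the prompt and the reply must share the same window."""
    return input_tokens + max_output_tokens <= CONTEXT_WINDOW

# fits(120_000, 20_000) -> False: reserving a very large output forces
# the system to shrink the input context it includes.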
Practical signs you are hitting context limits
- The model contradicts earlier constraints (tone, format, scope) that it previously followed.
- It forgets names, definitions, or decisions made earlier in the session.
- It re-asks questions that were already answered.
- It starts producing generic responses, as if it lost the specifics of your situation.
- When you reference an earlier section (“as you said above”), it responds as if it cannot see it.
These signs do not always mean context overflow; they can also come from ambiguity or the model making an incorrect inference. But if the conversation is long or the prompt is large, context management is a prime suspect.
Working effectively within a context window
Good results often come from treating the context window as a scarce resource. The goal is to keep the most important information inside it, in a form the model can use reliably.
Technique 1: Start with a compact “working spec”
For any multi-step task, create a short specification that you can keep reusing. This spec should include only the constraints that truly matter. Place it near the top of the conversation (or re-send it when needed) so it remains visible.
Example working spec for a writing task:
Working spec (keep following unless I update it):
1) Audience: non-technical managers.
2) Style: concise, bullet-friendly.
3) Must include: examples and step-by-step instructions.
4) Avoid: hype, long intros, and conclusions.
5) Output format: HTML body only with h2/h3/p/ul/li/pre/code.

This kind of spec reduces drift because it is short, explicit, and easy to re-inject. It also helps you notice when the model deviates: you can point to a specific rule.
Technique 2: Use “state summaries” at checkpoints
When a task spans many turns (planning a course, designing an app, drafting a long report), periodically create a checkpoint summary that captures decisions and open questions. The summary becomes the new memory anchor.
Checkpoint summary template:
Checkpoint summary (current state):
- Goal: ...
- Audience/constraints: ...
- Decisions made: ...
- Definitions: ...
- Open questions: ...
- Next step: ...

Ask the model to produce this summary, then you can paste it back later (or keep it in your own notes) to restore context quickly.
Technique 3: Prefer references over repetition for long materials
If you are working with a long document, avoid repeatedly pasting the entire text. Instead, structure your workflow so the model only sees the relevant excerpts at each step. For example:
- First, ask the model to propose an outline of what to extract.
- Then, provide section A and ask for extraction into a table.
- Then, provide section B and ask for the same table schema.
- Finally, provide the filled tables (which are compact) and ask for synthesis.
This approach keeps the context focused and reduces the chance that important details are pushed out by irrelevant text.
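Scripted, the workflow looks like this. ask is a placeholder for whatever chat call you use, and the schema and section names are illustrative.

def staged_extraction(ask, sections):
    """Each step sees one excerpt; only compact results reach the final step."""
    schema = ask("Propose a table schema for extracting the key points.")
    tables = []
    for name, text in sections:  # e.g. [("Section A", "..."), ("Section B", "...")]
        tables.append(ask(f"Using this schema:\n{schema}\n\nExtract from {name}:\n{text}"))
    # The filled tables are compact, so they all fit in the synthesis prompt.
    return ask("Synthesize these tables into findings:\n\n" + "\n\n".join(tables))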
Step-by-step: keeping a long conversation coherent
The following workflow is a practical way to manage in-session memory for a project that takes many turns (for example, drafting a policy, building a lesson plan, or iterating on code and documentation).
Step 1: Create a “pinned” instruction block
Write a short block that includes: the goal, non-negotiable constraints, and the output format. Keep it under a few dozen lines.
Pinned instructions:
- Goal: Draft a customer support playbook for a SaaS product.
- Must cover: escalation rules, tone guidelines, and response templates.
- Constraints: no legal advice; keep it practical; use short sections.
- Output: HTML body only using h2/h3/p/ul/li/pre/code.
- When unsure: ask a clarifying question instead of guessing.

Step 2: Maintain a “facts and decisions” ledger
As you make decisions (product details, definitions, naming conventions), store them in a compact ledger. This is the content most worth preserving in the context window.
Facts & decisions ledger: - Product name: AcmeDesk. - Support tiers: Basic (email), Pro (chat+email), Enterprise (24/7). - SLA: Enterprise first response within 30 minutes. - Tone: friendly, direct, no slang. - Escalation: billing issues to Finance queue; outages to On-Call.When the conversation gets long, re-send only the pinned instructions and the ledger, rather than the entire transcript.
Step 3: Work in small, verifiable increments
Ask for one section at a time, then validate it. This reduces the need to keep large amounts of draft text in context. For example:
- “Write the escalation rules section only.”
- “Now write tone guidelines only.”
- “Now generate three response templates for billing, outage, and feature request.”
After each section, you can store the finalized version outside the chat (your document editor) and only keep a short summary in the chat context.
Step 4: Use a rolling summary to prevent drift
Every few turns, ask the model to produce a rolling summary that includes: what has been completed, what remains, and any constraints that must be preserved. Then instruct it to use that summary as the authoritative memory going forward.
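In a hand-driven chat you can simply send the request shown below; if your session is scripted, the roll-up can happen automatically. A minimal sketch, again assuming a placeholder ask call as in earlier sketches:

def roll_up(ask, transcript, min_turns=8, keep_last=4):
    """Collapse older turns into a model-written summary; keep recent turns."""
    if len(transcript) < min_turns:
        return transcript
    old, recent = transcript[:-keep_last], transcript[-keep_last:]
    summary = ask("Summarize in 12 bullets max: constraints, key decisions, "
                  "and which sections are finalized:\n" +
                  "\n".join(f"[{role}] {text}" for role, text in old))
    return [("assistant", f"Rolling summary (authoritative): {summary}")] + recent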
Rolling summary request:
Summarize the current playbook in 12 bullets max: include constraints, key decisions, and what sections are finalized. This summary will be used as the memory for the next steps.

Step 5: When you notice forgetting, “re-anchor” explicitly
If the model contradicts earlier requirements, do not rely on “you already know.” Instead, re-anchor the key constraints by pasting the pinned block and ledger again, then restate the immediate task.
Example re-anchor message:
Re-anchor: Please follow these pinned instructions and ledger (pasted below).
Task: rewrite the outage template to match the tone rules and include the Enterprise SLA (30 minutes).
[paste pinned instructions]
[paste ledger]

How to write prompts that survive long sessions
In long sessions, prompts should be designed for durability: short, structured, and explicit about what matters. A few practical patterns help.
Pattern: “Role + task + constraints + output schema”
You are a technical editor.
Task: rewrite the following section for clarity.
Constraints: keep meaning; keep examples; do not add new claims.
Output: HTML with h3 and p only.
Text: ...

This pattern reduces the need to rely on earlier context. If the chat truncates older messages, the model still has what it needs.
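Because the pattern is so regular, you can generate it rather than retype it; a small helper, with all names invented for this sketch:

def durable_prompt(role, task, constraints, output_schema, text):
    """Build a self-contained prompt that survives truncation of older turns."""
    return (f"You are {role}.\n"
            f"Task: {task}\n"
            f"Constraints: {'; '.join(constraints)}\n"
            f"Output: {output_schema}\n"
            f"Text: {text}")

# durable_prompt("a technical editor", "rewrite the following section for clarity",
#                ["keep meaning", "keep examples", "do not add new claims"],
#                "HTML with h3 and p only", "...")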
Pattern: “Use this as the source of truth”
When you provide a ledger or a spec, explicitly label it as authoritative. This helps the model prioritize it over conflicting text elsewhere in the context.
Source of truth (authoritative):
- Pricing tiers: ...
- Definitions: ...
- Forbidden topics: ...

Pattern: “Ask before assuming”
Long sessions often accumulate ambiguity. Give the model permission to pause and ask a question rather than inventing details.
If any requirement is missing or conflicts with the ledger, ask up to 3 clarifying questions before writing.

Document workflows: fitting long sources into limited context
Many real tasks involve sources that are longer than the context window: contracts, research reports, transcripts, or codebases. In those cases, you need a workflow that does not depend on the model seeing everything at once.
Workflow A: Extract → normalize → synthesize
- Extract: Provide one chunk of the source and ask for specific fields (dates, claims, requirements) in a structured format.
- Normalize: Combine extracted fields into a consistent schema (tables, bullet lists, JSON-like blocks).
- Synthesize: Provide the normalized data (which is compact) and ask for insights, comparisons, or a narrative. (A sketch of the full pipeline follows this list.)
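Here is the whole pipeline as one sketch. ask is a placeholder chat call, and the fields and column names match the prompt examples that follow.

def analyze_long_source(ask, chunks):
    """Extract from each chunk, normalize into one table, then synthesize."""
    extracted = [ask("From the text below, extract: (1) obligations, "
                     f"(2) deadlines, (3) exceptions.\nText:\n{chunk}")
                 for chunk in chunks]                                  # extract
    table = ask("Convert these extracted bullets into a table with columns: "
                "Item, Type, Owner, Due date, Notes:\n\n" +
                "\n\n".join(extracted))                                # normalize
    return ask(f"Using only this table, compare and summarize:\n{table}")  # synthesize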
Example extraction prompt:
From the text below, extract: (1) obligations, (2) deadlines, (3) exceptions. Output as a bullet list under each heading.
Text: ...

Example normalization prompt:
Convert these extracted bullets into a table with columns: Item, Type (obligation/deadline/exception), Owner, Due date, Notes.

Workflow B: Section indexing
If you control the source document, add stable section IDs (e.g., “Section 3.2 Payment Terms”). Then you can work by reference: “Use Section 3.2 and Section 5.1 only.” Even if you must paste text, the IDs help you keep track of what has been processed and reduce confusion.
In-session memory vs. external memory
In-session memory is limited to what fits in the context window. For longer projects, teams often add external memory: notes stored outside the model and re-injected as needed. This can be as simple as a human-maintained “project brief” file, or as advanced as an automated system that retrieves relevant notes based on the current question.
From a user perspective, the key idea is: if something must persist across many turns, it should exist in a compact, reusable form that can be pasted back into the chat at any time.
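Automated retrieval can be very simple. The toy lookup below re-injects notes by keyword overlap; real systems often use embeddings, but the idea is the same. All names and note contents are invented for this sketch.

notes = {
    "pricing": "Tiers: Basic, Pro, Enterprise. Enterprise SLA: 30 minutes.",
    "tone": "Friendly, direct, no slang.",
}

def relevant_notes(question):
    """Return only the notes whose topic appears in the question."""
    q = question.lower()
    return [text for topic, text in notes.items() if topic in q]

# relevant_notes("What tone should the pricing page use?")
# -> both notes are selected and pasted into the next prompt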
Practical external memory artifacts you can maintain
- Project brief: goal, audience, constraints, definitions.
- Decision log: dated list of decisions and rationale.
- Style guide: tone, formatting rules, forbidden phrases.
- Data dictionary: definitions of fields and allowed values.
These artifacts are not about making the model smarter; they are about making the session resilient to context loss.
Quality control when memory is fragile
When important details can fall out of context, build lightweight checks into your workflow.
Ask for a constraint checklist before final output
Instead of asking the model to “just write,” ask it to first list the constraints it will follow, based on the pinned instructions and ledger. If it misses something, you can correct it before it generates a long response.
Before writing, list the constraints you will follow (max 8 bullets). Then write the section.

Ask it to cite where each key fact came from (within the session)
You can request that the model tie key facts to the ledger items or to the excerpt you provided. This reduces silent drift.
For each numeric or policy claim, reference the ledger bullet it came from (e.g., “Ledger: SLA 30 minutes”).

Use “diff-style” edits to avoid re-sending everything
If you have a large draft, do not paste it repeatedly. Instead, ask for targeted edits: “Replace paragraph 2 with a shorter version,” or “Rewrite only the bullet list under ‘Escalation’.” This keeps the context smaller and reduces overflow risk.
Designing tasks around the context window
Some tasks naturally fit within a context window; others should be decomposed. A useful rule: if you cannot summarize the requirements and the necessary source material in a page or two of text, you should assume the model will need a staged workflow.
Examples of tasks that fit well
- Drafting a short email with a few constraints.
- Generating a checklist from a short policy excerpt.
- Refactoring a small function with provided code and tests.
Examples of tasks that need decomposition
- Summarizing a long report while preserving all key details.
- Comparing multiple long documents for inconsistencies.
- Maintaining a coherent narrative across many chapters, unless you keep an outline and style guide outside the chat.
Decomposition is not a workaround; it is a normal way to collaborate with a system that has a bounded working context.