Chapter 6: ChatCompletionContext - Remembering the Conversation

In Chapter 5: ChatCompletionClient, we learned how agents talk to Large Language Models (LLMs) using a ChatCompletionClient. We saw that we need to send a list of messages (the conversation history) to the LLM so it knows the context.

But conversations can get very long! Imagine talking on the phone for an hour. Can you remember every single word that was said? Probably not. You remember the main points, the beginning, and what was said most recently. LLMs have a similar limitation – they can only pay attention to a certain amount of text at once (called the “context window”).

If we send the entire history of a very long chat, it might exceed that context window, causing errors, slower responses, or higher costs (since many LLMs charge based on the amount of text processed).

So, how do we smartly choose which parts of the conversation history to send? This is the problem that ChatCompletionContext solves.

Motivation: Keeping LLM Conversations Focused

Let’s say we have a helpful assistant agent chatting with a user:

  1. User: “Hi! Can you tell me about AutoGen?”
  2. Assistant: “Sure! AutoGen is a framework…” (provides details)
  3. User: “Thanks! Now, can you draft an email to my team about our upcoming meeting?”
  4. Assistant: “Okay, what’s the meeting about?”
  5. User: “It’s about the project planning for Q3.”
  6. Assistant: (Needs to draft the email)

When the Assistant needs to draft the email (step 6), does it need the exact text from step 2 about what AutoGen is? Probably not. It definitely needs the instructions from step 3 and the topic from step 5. Maybe the initial greeting isn’t super important either.

ChatCompletionContext acts like a smart transcript editor. Before sending the history to the LLM via the ChatCompletionClient, it reviews the full conversation log and prepares a shorter, focused version containing only the messages it thinks are most relevant for the LLM’s next response.

Key Concepts: Managing the Chat History

  1. The Full Transcript Holder: A ChatCompletionContext object holds the complete list of messages (LLMMessage objects like SystemMessage, UserMessage, AssistantMessage from Chapter 5) that have occurred in a specific conversation thread. You add new messages using its add_message method.

  2. The Smart View Generator (get_messages): The core job of ChatCompletionContext is done by its get_messages method. When called, it looks at the full transcript it holds, but returns only a subset of those messages based on its specific strategy. This subset is what you’ll actually send to the ChatCompletionClient.

  3. Different Strategies for Remembering: Because different situations require different focus, AutoGen Core provides several ChatCompletionContext implementations (strategies):

    • UnboundedChatCompletionContext: The simplest (and sometimes riskiest!). It doesn’t edit anything; get_messages just returns the entire history. Good for short chats, but risks exceeding the context window in long ones.
    • BufferedChatCompletionContext: Like remembering only the last few things someone said. It keeps the most recent N messages (where N is the buffer_size you set). Good for focusing on recent interactions.
    • HeadAndTailChatCompletionContext: Tries to get the best of both worlds. It keeps the first few messages (the “head”, maybe containing initial instructions) and the last few messages (the “tail”, the recent context). It skips the messages in the middle.
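
All three expose the same add_message / get_messages interface; they differ only in their constructor parameters. As a quick reference, here is a sketch of how each is created, using the same parameters as the examples below:

# File: create_contexts.py
from autogen_core.model_context import (
    UnboundedChatCompletionContext,
    BufferedChatCompletionContext,
    HeadAndTailChatCompletionContext,
)

# Keeps the entire history
unbounded = UnboundedChatCompletionContext()

# Keeps only the 3 most recent messages
buffered = BufferedChatCompletionContext(buffer_size=3)

# Keeps the first message and the last 2 messages
head_tail = HeadAndTailChatCompletionContext(head_size=1, tail_size=2)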

Use Case Example: Chatting with Different Memory Strategies

Let’s simulate adding messages to different context managers and see what get_messages returns.

Step 1: Define some messages

# File: define_chat_messages.py
from autogen_core.models import (
    SystemMessage, UserMessage, AssistantMessage, LLMMessage
)
from typing import List

# The initial instruction for the assistant
system_msg = SystemMessage(content="You are a helpful assistant.")

# A sequence of user/assistant turns
chat_sequence: List[LLMMessage] = [
    UserMessage(content="What is AutoGen?", source="User"),
    AssistantMessage(content="AutoGen is a multi-agent framework...", source="Agent"),
    UserMessage(content="What can it do?", source="User"),
    AssistantMessage(content="It can build complex LLM apps.", source="Agent"),
    UserMessage(content="Thanks!", source="User")
]

# Combine system message and the chat sequence
full_history: List[LLMMessage] = [system_msg] + chat_sequence

print(f"Total messages in full history: {len(full_history)}")
# Output: Total messages in full history: 6

We have a full history of 6 messages (1 system + 5 chat turns).

Step 2: Use UnboundedChatCompletionContext

This context keeps everything.

# File: use_unbounded_context.py
import asyncio
from define_chat_messages import full_history
from autogen_core.model_context import UnboundedChatCompletionContext

async def main():
    # Create context and add all messages
    context = UnboundedChatCompletionContext()
    for msg in full_history:
        await context.add_message(msg)

    # Get the messages to send to the LLM
    messages_for_llm = await context.get_messages()

    print(f"--- Unbounded Context ({len(messages_for_llm)} messages) ---")
    for i, msg in enumerate(messages_for_llm):
        print(f"{i+1}. [{msg.type}]: {msg.content[:30]}...")

# asyncio.run(main())  # Uncomment to run

Expected Output (Unbounded):

--- Unbounded Context (6 messages) ---
1. [SystemMessage]: You are a helpful assistant....
2. [UserMessage]: What is AutoGen?...
3. [AssistantMessage]: AutoGen is a multi-agent fram...
4. [UserMessage]: What can it do?...
5. [AssistantMessage]: It can build complex LLM apps...
6. [UserMessage]: Thanks!...

It returns all 6 messages, exactly as added.

Step 3: Use BufferedChatCompletionContext

Let’s keep only the last 3 messages.

# File: use_buffered_context.py
import asyncio
from define_chat_messages import full_history
from autogen_core.model_context import BufferedChatCompletionContext

async def main():
    # Keep only the last 3 messages
    context = BufferedChatCompletionContext(buffer_size=3)
    for msg in full_history:
        await context.add_message(msg)

    messages_for_llm = await context.get_messages()

    print(f"--- Buffered Context (buffer=3, {len(messages_for_llm)} messages) ---")
    for i, msg in enumerate(messages_for_llm):
        print(f"{i+1}. [{msg.type}]: {msg.content[:30]}...")

# asyncio.run(main())  # Uncomment to run

Expected Output (Buffered):

--- Buffered Context (buffer=3, 3 messages) ---
1. [UserMessage]: What can it do?...
2. [AssistantMessage]: It can build complex LLM apps...
3. [UserMessage]: Thanks!...

It returns only the last 3 messages from the full history; the system message and the first user/assistant exchange are dropped. Notice that even the system instruction is lost with this strategy.

Step 4: Use HeadAndTailChatCompletionContext

Let’s keep the first message (head=1) and the last two messages (tail=2).

# File: use_head_tail_context.py
import asyncio
from define_chat_messages import full_history
from autogen_core.model_context import HeadAndTailChatCompletionContext

async def main():
    # Keep first 1 and last 2 messages
    context = HeadAndTailChatCompletionContext(head_size=1, tail_size=2)
    for msg in full_history:
        await context.add_message(msg)

    messages_for_llm = await context.get_messages()

    print(f"--- Head & Tail Context (h=1, t=2, {len(messages_for_llm)} messages) ---")
    for i, msg in enumerate(messages_for_llm):
        print(f"{i+1}. [{msg.type}]: {msg.content[:30]}...")

# asyncio.run(main())  # Uncomment to run

Expected Output (Head & Tail):

--- Head & Tail Context (h=1, t=2, 4 messages) ---
1. [SystemMessage]: You are a helpful assistant....
2. [UserMessage]: Skipped 3 messages....
3. [AssistantMessage]: It can build complex LLM apps...
4. [UserMessage]: Thanks!...

It keeps the very first message (SystemMessage), then inserts a placeholder telling the LLM that some messages were skipped, and finally includes the last two messages. This preserves the initial instruction and the most recent context.

Which one to choose? It depends on your agent’s task!

  • Simple Q&A? Buffered might be fine.
  • Following complex initial instructions? HeadAndTail or even Unbounded (if short) might be better.

Under the Hood: How Context is Managed

The core idea is defined by the ChatCompletionContext abstract base class.

Conceptual Flow:

sequenceDiagram
    participant Agent as Agent Logic
    participant Context as ChatCompletionContext
    participant FullHistory as Internal Message List

    Agent->>+Context: add_message(newMessage)
    Context->>+FullHistory: Append newMessage to list
    FullHistory-->>-Context: List updated
    Context-->>-Agent: Done

    Agent->>+Context: get_messages()
    Context->>+FullHistory: Read the full list
    FullHistory-->>-Context: Return full list
    Context->>Context: Apply Strategy (e.g., slice list for Buffered/HeadTail)
    Context-->>-Agent: Return selected list of messages

  1. Adding: When add_message(message) is called, the context simply appends the message to its internal list (self._messages).
  2. Getting: When get_messages() is called:
    • The context accesses its internal self._messages list.
    • The specific implementation (Unbounded, Buffered, HeadAndTail) applies its logic to select which messages to return.
    • It returns the selected list.

Code Glimpse:

  • Base Class (_chat_completion_context.py): Defines the structure and common methods.

    # From: model_context/_chat_completion_context.py (Simplified)
    from abc import ABC, abstractmethod
    from typing import List
    from ..models import LLMMessage
    
    class ChatCompletionContext(ABC):
        component_type = "chat_completion_context" # Identifies this as a component type
    
        def __init__(self, initial_messages: List[LLMMessage] | None = None) -> None:
            # Holds the COMPLETE history
            self._messages: List[LLMMessage] = initial_messages or []
    
        async def add_message(self, message: LLMMessage) -> None:
            """Add a message to the full context."""
            self._messages.append(message)
    
        @abstractmethod
        async def get_messages(self) -> List[LLMMessage]:
            """Get the subset of messages based on the strategy."""
            # Each subclass MUST implement this logic
            ...
    
        # Other methods like clear(), save_state(), load_state() exist too
    

    The base class handles storing messages; subclasses define how to retrieve them.

  • Unbounded (_unbounded_chat_completion_context.py): The simplest implementation.

    # From: model_context/_unbounded_chat_completion_context.py (Simplified)
    from typing import List
    from ._chat_completion_context import ChatCompletionContext
    from ..models import LLMMessage
    
    class UnboundedChatCompletionContext(ChatCompletionContext):
        async def get_messages(self) -> List[LLMMessage]:
            """Returns all messages."""
            return self._messages # Just return the whole internal list
    
  • Buffered (_buffered_chat_completion_context.py): Uses slicing to get the end of the list.

    # From: model_context/_buffered_chat_completion_context.py (Simplified)
    from typing import List
    from ._chat_completion_context import ChatCompletionContext
    from ..models import LLMMessage, FunctionExecutionResultMessage
    
    class BufferedChatCompletionContext(ChatCompletionContext):
        def __init__(self, buffer_size: int, initial_messages: List[LLMMessage] | None = None) -> None:
            super().__init__(initial_messages)
            self._buffer_size = buffer_size
    
        async def get_messages(self) -> List[LLMMessage]:
            """Get at most `buffer_size` recent messages."""
            # Slice the list to get the last 'buffer_size' items
            messages = self._messages[-self._buffer_size :]
            # Special case: a function result message can't come first; it must follow its tool call
            if messages and isinstance(messages[0], FunctionExecutionResultMessage):
                messages = messages[1:]
            return messages
    
  • Head and Tail (_head_and_tail_chat_completion_context.py): Combines slices from the beginning and end.

    # From: model_context/_head_and_tail_chat_completion_context.py (Simplified)
    from typing import List
    from ._chat_completion_context import ChatCompletionContext
    from ..models import LLMMessage, UserMessage
    
    class HeadAndTailChatCompletionContext(ChatCompletionContext):
        def __init__(self, head_size: int, tail_size: int, initial_messages: List[LLMMessage] | None = None) -> None:
            super().__init__(initial_messages)
            self._head_size = head_size
            self._tail_size = tail_size
    
        async def get_messages(self) -> List[LLMMessage]:
            head = self._messages[: self._head_size] # First 'head_size' items
            tail = self._messages[-self._tail_size :] # Last 'tail_size' items
            num_skipped = len(self._messages) - len(head) - len(tail)
    
            if num_skipped <= 0:  # Head and tail already cover the whole history
                return self._messages
            else: # If messages were skipped
                placeholder = [UserMessage(content=f"Skipped {num_skipped} messages.", source="System")]
                # Combine head + placeholder + tail
                return head + placeholder + tail
    

    These implementations provide different ways to manage the context window effectively.
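
Since get_messages is the only abstract method, you can also write your own strategy by subclassing ChatCompletionContext. Below is a minimal sketch of a hypothetical custom context (the class name and filtering rule are illustrative, not part of AutoGen Core) that keeps only system instructions and user turns:

# File: custom_context_sketch.py (hypothetical example)
from typing import List
from autogen_core.model_context import ChatCompletionContext
from autogen_core.models import LLMMessage, SystemMessage, UserMessage

class UserMessagesOnlyContext(ChatCompletionContext):
    """Keeps system instructions and user turns; drops assistant replies."""

    async def get_messages(self) -> List[LLMMessage]:
        # Filter the full internal history down to the message types we keep
        return [
            msg for msg in self._messages
            if isinstance(msg, (SystemMessage, UserMessage))
        ]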

Putting it Together with ChatCompletionClient

How does an agent use ChatCompletionContext with the ChatCompletionClient from Chapter 5?

  1. An agent has an instance of a ChatCompletionContext (e.g., BufferedChatCompletionContext) to store its conversation history.
  2. When the agent receives a new message (e.g., a UserMessage), it calls await context.add_message(new_user_message).
  3. To prepare for calling the LLM, the agent calls messages_to_send = await context.get_messages(). This gets the strategically selected subset of the history.
  4. The agent then passes this list to the ChatCompletionClient: response = await llm_client.create(messages=messages_to_send, ...).
  5. When the LLM replies (e.g., with an AssistantMessage), the agent adds it back to the context: await context.add_message(llm_response_message).

This loop ensures that the history is continuously updated and intelligently trimmed before each call to the LLM.
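
Here is a minimal sketch of that loop, assuming llm_client is any ChatCompletionClient from Chapter 5 and that its create() result exposes the reply text as .content (a plain string for simple chats; exact fields can vary by client):

# File: context_with_client.py (sketch)
from autogen_core.models import UserMessage, AssistantMessage
from autogen_core.model_context import BufferedChatCompletionContext

async def handle_user_turn(llm_client, context, user_text: str) -> str:
    # 1. Record the incoming user message in the full history
    await context.add_message(UserMessage(content=user_text, source="User"))

    # 2. Ask the context for the strategically selected subset
    messages_to_send = await context.get_messages()

    # 3. Send only that subset to the LLM
    result = await llm_client.create(messages=messages_to_send)

    # 4. Record the reply so future turns can see it
    await context.add_message(
        AssistantMessage(content=result.content, source="Agent")
    )
    return result.content

# Usage sketch:
#   context = BufferedChatCompletionContext(buffer_size=10)
#   reply = await handle_user_turn(llm_client, context, "Hi!")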

Next Steps

You’ve learned how ChatCompletionContext helps manage the conversation history sent to LLMs, preventing context window overflows and keeping the interaction focused using different strategies (Unbounded, Buffered, HeadAndTail).

This context management is a specific form of memory. Agents might need to remember things beyond just the chat history. How do they store general information, state, or knowledge over time?

  • Chapter 7: Memory: Explore the broader concept of Memory in AutoGen Core, which provides more general ways for agents to store and retrieve information.
  • Chapter 8: Component: Understand how ChatCompletionContext fits into the general Component model, allowing configuration and integration within the AutoGen system.

Generated by AI Codebase Knowledge Builder