Chapter 7: AgentType - Handling More Than Just Text
Welcome back! In the previous chapters, especially when discussing Tools and the PythonExecutor, we saw how agents can perform actions and generate results. So far, we’ve mostly focused on text-based tasks and results.
But what happens when an agent needs to work with images, audio, or other types of data? For example:
- An agent uses a tool to generate an image based on a description.
- An agent uses a tool to transcribe an audio file into text.
- An agent receives an image as input and needs to describe it.
How does the SmolaAgents framework handle these different kinds of data consistently? How does it make sure an image generated by a tool is displayed correctly in your notebook, or saved properly in the agent’s Memory?
This is where the AgentType concept comes in!
The Problem: Shipping Different Kinds of Cargo
Imagine you run a shipping company. Most of the time, you ship standard boxes (like text). But sometimes, customers need to ship different things:
- Fresh produce that needs a refrigerated container (like audio data).
- Large machinery that needs a flatbed truck (like image data).
You can’t just stuff the fresh produce into a standard box – it would spoil! And the machinery won’t even fit. You need specialized containers designed for specific types of cargo.
Similarly, our agents need a way to handle data beyond simple text strings. Passing raw Python objects around directly (like a raw PIL.Image object for images) can cause problems:
- How do you display it? A raw image object doesn’t automatically show up as a picture in a Jupyter notebook.
- How do you save it? How do you store an image or audio clip in the agent’s text-based Memory log? You can’t just put the raw image data there.
- How do you pass it around? How does the framework ensure different components (tools, agent core, memory) know how to handle these different data types consistently?
The Solution: Specialized Data Containers (AgentType)
SmolaAgents introduces special “data containers” to solve this problem. These are custom data types that inherit from a base AgentType class:
- AgentText: For handling plain text. It behaves just like a standard Python string.
- AgentImage: For handling images (usually as PIL.Image objects).
- AgentAudio: For handling audio data (often as torch.Tensor or file paths).
Think of these as the specialized shipping containers:
- AgentText is like the standard shipping box.
- AgentImage is like a container designed to safely transport and display pictures.
- AgentAudio is like a container designed to safely transport and play audio clips.
These AgentType objects wrap the actual data (the string, the image object, the audio data) but add extra capabilities.
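To make the "wrap the data, add capabilities" idea concrete, here is a minimal sketch of a text container. It assumes that subclassing str is enough to get string-like behavior. The class name TextContainer is hypothetical and is not the library's AgentText; it only exposes the two methods this chapter discusses, to_raw() and to_string().

```python
class TextContainer(str):
    """Hypothetical stand-in for a text container; not the real AgentText."""

    def to_raw(self) -> str:
        # Hand back a plain built-in str, dropping the wrapper type.
        return str(self)

    def to_string(self) -> str:
        # Text is already loggable, so the string form is the content itself.
        return str(self)


wrapped = TextContainer("hello agent")

# Because it subclasses str, normal string operations keep working.
print(wrapped.upper())           # HELLO AGENT
print(wrapped == "hello agent")  # True
print(isinstance(wrapped, str))  # True
print(type(wrapped.to_raw()))    # <class 'str'>
```

The point of the design is that code expecting a string never has to know a container is involved, while framework code can still ask for to_raw() or to_string() explicitly.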
Why Use AgentType? (The Benefits)
Using these specialized containers gives us several advantages:
- Consistent Handling: The SmolaAgents framework knows how to recognize and work with AgentType objects, regardless of whether they contain text, images, or audio.
- Smart Display: Objects like AgentImage and AgentAudio know how to display themselves correctly in environments like Jupyter notebooks or Gradio interfaces. For example, an AgentImage will automatically render as an image, not just print <PIL.Image.Image ...>.
- Proper Serialization: They know how to convert themselves into a string representation suitable for logging or storing in Memory.
  - AgentText simply returns its string content.
  - AgentImage automatically saves the image to a temporary file and returns the path to that file when converted to a string (to_string() method). This path can be safely logged.
  - AgentAudio does something similar for audio data, saving it to a temporary .wav file.
- Clear Communication: Tools can clearly state what type of output they produce (e.g., output_type="image"), and the framework ensures the output is wrapped correctly.
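The "save to a temporary file, return the path" idea behind serialization can be sketched with the standard library alone. The AudioContainer class below is hypothetical and is not the library's AgentAudio. It assumes the audio is a short list of 16-bit mono PCM samples, which keeps the example free of torch or other audio dependencies.

```python
import os
import struct
import tempfile
import uuid
import wave


class AudioContainer:
    """Hypothetical audio wrapper; a sketch, not the real AgentAudio."""

    def __init__(self, samples, sample_rate=16_000):
        self._samples = list(samples)  # 16-bit PCM integers, mono
        self._sample_rate = sample_rate
        self._path = None              # filled in on the first to_string() call

    def to_raw(self):
        return self._samples

    def to_string(self) -> str:
        # Reuse the file if the audio was already written once.
        if self._path is not None:
            return self._path
        directory = tempfile.mkdtemp()
        self._path = os.path.join(directory, f"{uuid.uuid4()}.wav")
        with wave.open(self._path, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
            wav_file.setframerate(self._sample_rate)
            frames = struct.pack(f"<{len(self._samples)}h", *self._samples)
            wav_file.writeframes(frames)
        return self._path


clip = AudioContainer([0, 1000, -1000, 0])
log_entry = f"Tool returned audio: {clip.to_string()}"
print(log_entry)  # a short path string, safe to store in a text log
```

Caching the path means repeated logging of the same object does not create a new temporary file each time.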
How is AgentType Used? (Mostly Automatic!)
The best part is that you often don’t need to manually create or handle these AgentType objects. The framework does the heavy lifting.
Scenario 1: A Tool Returning an Image
Imagine you have a tool that generates images using a library like diffusers.
    # --- File: image_tool.py ---
    from smolagents import Tool
    from PIL import Image

    # Assume 'diffusion_pipeline' is a pre-loaded image generation model
    # from diffusers import DiffusionPipeline
    # diffusion_pipeline = DiffusionPipeline.from_pretrained(...)

    class ImageGeneratorTool(Tool):
        name: str = "image_generator"
        description: str = "Generates an image based on a text prompt."
        inputs: dict = {
            "prompt": {
                "type": "string",
                "description": "The text description for the image."
            }
        }
        # Tell the framework this tool outputs an image!
        output_type: str = "image"  # <--- Crucial Hint!

        def forward(self, prompt: str) -> Image.Image:
            """Generates the image using a diffusion model."""
            print(f"--- ImageGeneratorTool generating image for: '{prompt}' ---")
            # image = diffusion_pipeline(prompt).images[0]  # Actual generation
            # For simplicity, let's create a dummy blank image
            image = Image.new('RGB', (60, 30), color='red')
            print("--- Tool returning a PIL Image object ---")
            return image

    # --- How the framework uses it (conceptual) ---
    image_tool = ImageGeneratorTool()
    prompt = "A red rectangle"
    raw_output = image_tool(prompt=prompt)  # Calls forward(), gets a PIL.Image object

    # Framework automatically wraps the output because output_type="image"
    # Uses handle_agent_output_types(raw_output, output_type="image")
    from smolagents.agent_types import handle_agent_output_types
    wrapped_output = handle_agent_output_types(raw_output, output_type="image")

    print(f"Raw output type: {type(raw_output)}")
    print(f"Wrapped output type: {type(wrapped_output)}")

    # When storing in memory or logging, the framework calls to_string()
    output_string = wrapped_output.to_string()
    print(f"String representation for logs: {output_string}")

    # Expected Output (path will vary):
    # --- ImageGeneratorTool generating image for: 'A red rectangle' ---
    # --- Tool returning a PIL Image object ---
    # Raw output type: <class 'PIL.Image.Image'>
    # Wrapped output type: <class 'smolagents.agent_types.AgentImage'>
    # String representation for logs: /tmp/tmpxxxxxx/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.png
Explanation:
- We define ImageGeneratorTool and crucially set output_type="image".
- The forward method does its work and returns a standard PIL.Image.Image object.
- When the agent framework receives this output, it checks the tool’s output_type. Since it’s "image", it automatically uses the handle_agent_output_types function (or similar internal logic) to wrap the PIL.Image.Image object inside an AgentImage container.
- If this AgentImage needs to be logged or stored in Memory, the framework calls its to_string() method, which saves the image to a temporary file and returns the file path.
Scenario 2: Passing an AgentType to a Tool
What if an AgentImage object (maybe retrieved from memory or state) needs to be passed into another tool, perhaps one that analyzes images?
    # --- File: image_analyzer_tool.py ---
    from smolagents import Tool
    from PIL import Image
    from smolagents.agent_types import AgentImage, handle_agent_input_types

    class ImageAnalyzerTool(Tool):
        name: str = "image_analyzer"
        description: str = "Analyzes an image and returns its dimensions."
        inputs: dict = {
            "input_image": {
                "type": "image",  # Expects an image type
                "description": "The image to analyze."
            }
        }
        output_type: str = "string"

        def forward(self, input_image: Image.Image) -> str:
            """Analyzes the image."""
            # IMPORTANT: input_image here is ALREADY the raw PIL.Image object!
            print(f"--- ImageAnalyzerTool received image of type: {type(input_image)} ---")
            width, height = input_image.size
            return f"Image dimensions are {width}x{height}."

    # --- How the framework uses it (conceptual) ---
    analyzer_tool = ImageAnalyzerTool()

    # Let's pretend 'agent_image_object' is an AgentImage retrieved from memory
    # (It wraps a red PIL.Image.Image object like the one from Scenario 1)
    agent_image_object = AgentImage(Image.new('RGB', (60, 30), color='red'))
    print(f"Input object type: {type(agent_image_object)}")

    # Framework automatically unwraps the input before calling 'forward'
    # Uses handle_agent_input_types(input_image=agent_image_object)
    # args_tuple, kwargs_dict = handle_agent_input_types(input_image=agent_image_object)
    # result = analyzer_tool.forward(**kwargs_dict)  # Simplified conceptual call

    # Simulate the unwrapping and call:
    raw_image = agent_image_object.to_raw()  # Get the underlying PIL Image
    result = analyzer_tool.forward(input_image=raw_image)
    print(f"Analysis result: {result}")

    # Expected Output:
    # Input object type: <class 'smolagents.agent_types.AgentImage'>
    # --- ImageAnalyzerTool received image of type: <class 'PIL.Image.Image'> ---
    # Analysis result: Image dimensions are 60x30.
Explanation:
- ImageAnalyzerTool defines its input input_image as type "image". Its forward method expects a standard PIL.Image.Image.
- We have an AgentImage object (maybe from a previous step).
- When the framework prepares to call analyzer_tool.forward, it sees that the input agent_image_object is an AgentType. It uses handle_agent_input_types (or similar logic) to automatically call the .to_raw() method on agent_image_object.
- This to_raw() method extracts the underlying PIL.Image.Image object.
- The framework passes this raw image object to the forward method. The tool developer doesn’t need to worry about unwrapping the AgentType inside their tool logic.
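Taken together, the wrapping from Scenario 1 and the unwrapping from Scenario 2 bracket every tool call. The sketch below shows that sequence in one place using only the standard library. All names here (Container, CONTAINER_FOR, call_tool, and the two toy tools) are hypothetical stand-ins, and the "image" is a plain dict so that no imaging library is required.

```python
class Container:
    """Hypothetical base container exposing the to_raw() hook."""

    def __init__(self, value):
        self._value = value

    def to_raw(self):
        return self._value


class TextContainer(Container):
    pass


class ImageContainer(Container):
    pass


# Maps a tool's declared output_type to the container used for wrapping.
CONTAINER_FOR = {"string": TextContainer, "image": ImageContainer}


def call_tool(forward, output_type, **kwargs):
    """Unwrap container inputs, run the tool, then wrap its output."""
    raw_kwargs = {
        key: value.to_raw() if isinstance(value, Container) else value
        for key, value in kwargs.items()
    }
    result = forward(**raw_kwargs)
    wrapper = CONTAINER_FOR.get(output_type)
    return wrapper(result) if wrapper is not None else result


def make_image(prompt):
    # Stand-in "image": a dict with a size, so no imaging library is needed.
    return {"prompt": prompt, "size": (60, 30)}


def analyze_image(input_image):
    # Receives the raw dict, never the container.
    width, height = input_image["size"]
    return f"Image dimensions are {width}x{height}."


image = call_tool(make_image, "image", prompt="A red rectangle")
report = call_tool(analyze_image, "string", input_image=image)

print(type(image).__name__)  # ImageContainer
print(report.to_raw())       # Image dimensions are 60x30.
```

Both toy tools are written purely against raw values, which is the property the two scenarios rely on: containers exist between tool calls, not inside them.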
Under the Hood: A Peek at the Code
Let’s look at simplified versions of the AgentType classes and helper functions from agent_types.py.
- Base AgentType Class:
      # --- File: agent_types.py (Simplified AgentType) ---
      import logging

      logger = logging.getLogger(__name__)

      class AgentType:
          """Abstract base class for custom agent data types."""

          def __init__(self, value):
              # Stores the actual data (string, PIL Image, etc.)
              self._value = value

          def __str__(self):
              # Default string conversion uses the to_string method
              return self.to_string()

          def to_raw(self):
              """Returns the underlying raw Python object."""
              logger.error("to_raw() called on base AgentType!")
              return self._value

          def to_string(self) -> str:
              """Returns a string representation suitable for logging/memory."""
              logger.error("to_string() called on base AgentType!")
              return str(self._value)

          # Other potential common methods...
  - It holds the original _value.
  - Defines the basic methods to_raw and to_string that subclasses will implement properly.
- AgentImage Implementation:
      # --- File: agent_types.py (Simplified AgentImage) ---
      import PIL.Image
      import os
      import tempfile
      import uuid
      from io import BytesIO

      # The real class also inherits from PIL.Image.Image; this sketch keeps a
      # single base class and delegates a few methods explicitly instead.
      class AgentImage(AgentType):
          """Handles image data, behaving like a PIL.Image."""

          def __init__(self, value):
              # value can be PIL.Image, path string, bytes, etc.
              AgentType.__init__(self, value)  # Store original value form
              self._raw_image = None  # To store the loaded PIL Image
              self._path = None  # To store the path if saved to temp file

              # Logic to load image from different input types (simplified)
              if isinstance(value, PIL.Image.Image):
                  self._raw_image = value
              elif isinstance(value, (str, os.PathLike)):
                  # We might load it lazily later in to_raw()
                  self._path = str(value)  # Assume it's already a path
                  # In reality, it loads here if path exists
              elif isinstance(value, bytes):
                  self._raw_image = PIL.Image.open(BytesIO(value))
              # ... (handle tensors, etc.) ...
              else:
                  raise TypeError(f"Unsupported type for AgentImage: {type(value)}")

          def to_raw(self) -> PIL.Image.Image:
              """Returns the raw PIL.Image.Image object."""
              if self._raw_image is None:
                  # Lazy loading if initialized with a path
                  if self._path and os.path.exists(self._path):
                      self._raw_image = PIL.Image.open(self._path)
                  else:
                      # Handle error or create placeholder
                      raise ValueError("Cannot get raw image data.")
              return self._raw_image

          def to_string(self) -> str:
              """Saves image to temp file (if needed) and returns the path."""
              if self._path and os.path.exists(self._path):
                  # Already have a path (e.g., loaded from file initially)
                  return self._path

              # Need to save the raw image data to a temp file
              raw_img = self.to_raw()  # Ensure image is loaded
              directory = tempfile.mkdtemp()
              # Generate a unique filename
              self._path = os.path.join(directory, str(uuid.uuid4()) + ".png")
              raw_img.save(self._path, format="png")
              print(f"--- AgentImage saved to temp file: {self._path} ---")
              return self._path

          def _ipython_display_(self):
              """Special method for display in Jupyter/IPython."""
              from IPython.display import display
              display(self.to_raw())  # Display the raw PIL image

          # We can also make AgentImage behave like PIL.Image by delegating methods
          # (e.g., using __getattr__ or explicit wrappers)
          @property
          def size(self):
              return self.to_raw().size

          def save(self, *args, **kwargs):
              self.to_raw().save(*args, **kwargs)

          # ... other PIL.Image methods ...
  - It can be initialized with various image sources (PIL object, path, bytes).
  - to_raw() ensures a PIL Image object is returned, loading from disk if necessary.
  - to_string() saves the image to a temporary PNG file if it doesn’t already have a path, and returns that path.
  - _ipython_display_ allows Jupyter notebooks to automatically display the image.
  - It can delegate common image methods (like .size, .save) to the underlying raw image.
- Helper Functions (Conceptual):
      # --- File: agent_types.py / agents.py (Simplified Helpers) ---
      from typing import Any, Optional

      # Mapping from type name string to AgentType class
      _AGENT_TYPE_MAPPING = {"string": AgentText, "image": AgentImage, "audio": AgentAudio}

      def handle_agent_output_types(output: Any, output_type: Optional[str] = None) -> Any:
          """Wraps raw output into an AgentType if needed."""
          if output_type in _AGENT_TYPE_MAPPING:
              # If the tool explicitly defines output type (e.g., "image")
              wrapper_class = _AGENT_TYPE_MAPPING[output_type]
              return wrapper_class(output)
          else:
              # If no type defined, try to guess based on Python type (optional)
              if isinstance(output, str):
                  return AgentText(output)
              if isinstance(output, PIL.Image.Image):
                  return AgentImage(output)
              # ... add checks for audio tensors etc. ...
              # Otherwise, return the output as is
              return output

      def handle_agent_input_types(*args, **kwargs) -> tuple[tuple, dict]:
          """Unwraps AgentType inputs into raw types before passing to a tool."""
          processed_args = []
          for arg in args:
              # If it's an AgentType instance, call to_raw(), otherwise keep as is
              processed_args.append(arg.to_raw() if isinstance(arg, AgentType) else arg)
          processed_kwargs = {}
          for key, value in kwargs.items():
              processed_kwargs[key] = value.to_raw() if isinstance(value, AgentType) else value
          return tuple(processed_args), processed_kwargs
  - handle_agent_output_types checks the tool’s output_type or the actual Python type of the output and wraps it in the corresponding AgentType class (e.g., AgentImage).
  - handle_agent_input_types iterates through arguments, checks if any are AgentType instances, and calls .to_raw() on them to get the underlying data before the tool’s forward method is called.
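One detail from the simplified AgentImage above deserves a closer look: its comment mentions delegating methods with __getattr__ instead of writing each wrapper by hand. The sketch below shows that technique on its own. DelegatingWrapper and FakeImage are hypothetical names used for illustration, and FakeImage stands in for a real image object so the example has no dependencies.

```python
class DelegatingWrapper:
    """Hypothetical wrapper that forwards unknown attributes to the wrapped object."""

    def __init__(self, value):
        self._value = value

    def to_raw(self):
        return self._value

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so _value and
        # to_raw are found first and never reach this method.
        if name.startswith("_"):
            # Avoid recursing on private names (e.g. before _value is set).
            raise AttributeError(name)
        return getattr(self._value, name)


class FakeImage:
    """Tiny stand-in for an image object."""

    def __init__(self, width, height):
        self.size = (width, height)

    def describe(self):
        return f"{self.size[0]}x{self.size[1]} image"


wrapped = DelegatingWrapper(FakeImage(60, 30))
print(wrapped.size)        # (60, 30)
print(wrapped.describe())  # 60x30 image
```

Explicit wrappers such as the size property are easier to read and document, while __getattr__ covers the whole surface of the wrapped object at once. Which one fits better depends on how much of that surface callers are expected to use.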
Conclusion
The AgentType classes (AgentText, AgentImage, AgentAudio) provide a crucial layer for handling diverse data types within the SmolaAgents framework. They act as specialized containers that ensure non-text data can be consistently processed, displayed correctly (especially in notebooks), and serialized appropriately for logging and memory.
You’ve learned:
- Why standard Python types aren’t always enough for agent inputs/outputs.
- The “specialized shipping container” analogy for AgentType.
- The benefits: consistent handling, smart display, and proper serialization (like saving images/audio to temp files).
- How the framework automatically wraps tool outputs (handle_agent_output_types) and unwraps tool inputs (handle_agent_input_types).
- What simplified versions of AgentImage and the helper functions look like in code.
By using AgentType, SmolaAgents makes it much easier to build agents that can work seamlessly with multi-modal data like images and audio, without you having to manually handle the complexities of display and serialization in most cases.
Now that we understand how agents handle different data types, how can we keep track of everything the agent is doing, monitor its performance, and debug issues?
Next Chapter: Chapter 8: AgentLogger & Monitor - Observing Your Agent in Action.
Generated by AI Codebase Knowledge Builder