Chapter 2: The System Prompt - Setting the Rules for Your AI Assistant
In Chapter 1: The Agent, we met the `Agent`, our project manager for automating browser tasks. We saw it consults a Large Language Model (LLM) – the “planner” – to decide the next steps based on the current state of the webpage. But how does the Agent tell the LLM how it should think, behave, and respond? Just giving it the task isn’t enough!
Imagine hiring a new assistant. You wouldn’t just say, “Organize my files!” You’d give them specific instructions: “Please sort the files alphabetically by client name, put them in the blue folders, and give me a summary list when you’re done.” Without these rules, the assistant might do something completely different!
The System Prompt solves this exact problem for our LLM. It’s the set of core instructions and rules we give the LLM at the very beginning, telling it exactly how to act as a browser automation assistant and, crucially, how to format its responses so the `Agent` can understand them.
What is the System Prompt? The AI’s Rulebook
Think of the System Prompt like the AI assistant’s fundamental operating manual, its “Prime Directive,” or the rules of a board game. It defines:
- Persona: “You are an AI agent designed to automate browser tasks.”
- Goal: “Your goal is to accomplish the ultimate task…”
- Input: How to understand the information it receives about the webpage (DOM Representation).
- Capabilities: What actions it can take (Action Controller & Registry).
- Limitations: What it shouldn’t do (e.g., hallucinate actions).
- Response Format: The exact structure (JSON format) its thoughts and planned actions must follow.
Without this rulebook, the LLM might just chat casually, give vague suggestions, or produce output in a format the `Agent` code can’t parse. The System Prompt ensures the LLM behaves like the specialized tool we need.
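To make this more concrete, here is a hypothetical sketch (not a quote from the real file) of how such a rulebook might begin, combining the pieces above:

```text
You are an AI agent designed to automate browser tasks. Your goal is to
accomplish the ultimate task following these rules:
- You will receive a numbered list of interactive elements from the current page.
- Only interact with elements that appear in that list; never invent actions.
- ALWAYS respond with valid JSON in the exact format described below.
```

The real rulebook, which we peek at later in this chapter, is far more detailed.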
Why is the Response Format So Important?
This is a critical point. The `Agent` code isn’t a human reading the LLM’s response. It’s a program expecting data in a very specific structure. The System Prompt tells the LLM to always respond in a JSON format that looks something like this (simplified):
```json
{
  "current_state": {
    "evaluation_previous_goal": "Success - Found the search bar.",
    "memory": "On google.com main page. Need to search for cats.",
    "next_goal": "Type 'cute cat pictures' into the search bar."
  },
  "action": [
    {
      "input_text": {
        "index": 5,              // The index of the search bar element
        "text": "cute cat pictures"
      }
    },
    {
      "press_keys": {
        "keys": "Enter"          // Press the Enter key
      }
    }
  ]
}
```
The `Agent` can easily read this JSON:
- It understands the LLM’s thoughts (`current_state`).
- It sees the exact `action` list the LLM wants to perform.
- It passes these actions (like `input_text` or `press_keys`) to the Action Controller & Registry to execute them in the browser.
If the LLM responded with just “Okay, I’ll type ‘cute cat pictures’ into the search bar and press Enter,” the `Agent` wouldn’t know which element index corresponds to the search bar or exactly which actions to call. The strict JSON format is essential for automation.
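To see what this buys us in practice, here is a minimal sketch (not the library’s actual parsing code) of how a program could consume such a response; `dispatch_action` is a hypothetical stand-in for the Action Controller & Registry:

```python
import json

# A structured LLM response, as in the simplified example above
llm_response = '''
{
  "current_state": {
    "evaluation_previous_goal": "Success - Found the search bar.",
    "memory": "On google.com main page. Need to search for cats.",
    "next_goal": "Type 'cute cat pictures' into the search bar."
  },
  "action": [
    {"input_text": {"index": 5, "text": "cute cat pictures"}},
    {"press_keys": {"keys": "Enter"}}
  ]
}
'''

def dispatch_action(name: str, params: dict) -> None:
    """Hypothetical stand-in for the Action Controller & Registry."""
    print(f"Executing {name} with {params}")

data = json.loads(llm_response)  # Fails loudly if the LLM broke the format
print("Next goal:", data["current_state"]["next_goal"])
for action in data["action"]:
    for name, params in action.items():  # Each dict holds exactly one action
        dispatch_action(name, params)
```

A free-text reply would leave a program with nothing to loop over; JSON gives it exact keys, element indices, and parameters.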
A Peek Inside the Rulebook (`system_prompt.md`)
The actual instructions live in a text file within the Browser Use library: `browser_use/agent/system_prompt.md`. It’s quite detailed, but here’s a tiny snippet focusing on the response format rule:
```text
# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
   {"current_state": {"evaluation_previous_goal": "...", "memory": "...", "next_goal": "..."},
    "action": [{"one_action_name": {...}}, ...]}
2. ACTIONS: You can specify multiple actions in the list... Use maximum {max_actions} actions...
```
(This is heavily simplified! The real file has many more rules about element interaction, error handling, task completion, etc.)
This file clearly defines the JSON structure (`current_state` and `action`) and other crucial behaviors required from the LLM.
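As a side note, a structure this regular is easy to validate in code. The sketch below models it with Pydantic purely for illustration; it is not necessarily how the library defines its own output schema:

```python
from pydantic import BaseModel

class CurrentState(BaseModel):
    evaluation_previous_goal: str
    memory: str
    next_goal: str

class LLMResponse(BaseModel):
    current_state: CurrentState
    # Each entry maps one action name to its parameters,
    # e.g. {"input_text": {"index": 5, "text": "cats"}}
    action: list[dict[str, dict]]

parsed = LLMResponse.model_validate_json(
    '{"current_state": {"evaluation_previous_goal": "Success",'
    ' "memory": "On google.com", "next_goal": "Search"},'
    ' "action": [{"input_text": {"index": 5, "text": "cats"}}]}'
)
print(parsed.current_state.next_goal)  # -> Search
```

If the LLM strays from the format, validation raises a descriptive error instead of silently misbehaving.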
How the Agent Uses the System Prompt
The `Agent` uses a helper class called `SystemPrompt` (found in `agent/prompts.py`) to manage these rules. Here’s the flow:
- Loading: When you create an `Agent`, it internally creates a `SystemPrompt` object. This object reads the rules from the `system_prompt.md` file.
- Formatting: The `SystemPrompt` object formats these rules into a special `SystemMessage` object that LLMs understand as foundational instructions.
- Conversation Start: This `SystemMessage` is given to the Message Manager, which keeps track of the conversation history with the LLM. The `SystemMessage` becomes the very first message, setting the context for all future interactions in that session.
Think of it like starting a meeting: the first thing you do is state the agenda and rules (System Prompt), and then the discussion (LLM interaction) follows based on that foundation.
Let’s look at a simplified view of the `SystemPrompt` class loading the rules:
```python
# --- File: agent/prompts.py (Simplified) ---
import importlib.resources  # Helps find files within the installed library

from langchain_core.messages import SystemMessage  # Special message type for LLMs

class SystemPrompt:
    def __init__(self, action_description: str, max_actions_per_step: int = 10):
        # We ignore these details for now
        self.default_action_description = action_description
        self.max_actions_per_step = max_actions_per_step
        self._load_prompt_template()  # <--- Loads the rules file

    def _load_prompt_template(self) -> None:
        """Load the prompt rules from the system_prompt.md file."""
        try:
            # Finds the 'system_prompt.md' file inside the browser_use package
            filepath = importlib.resources.files('browser_use.agent').joinpath('system_prompt.md')
            with filepath.open('r') as f:
                self.prompt_template = f.read()  # Read the text content
            print("System Prompt template loaded successfully!")
        except Exception as e:
            print(f"Error loading system prompt: {e}")
            self.prompt_template = "Error: Could not load prompt."  # Fallback

    def get_system_message(self) -> SystemMessage:
        """Format the loaded rules into a message for the LLM."""
        # Replace placeholders like {max_actions} with actual values
        prompt = self.prompt_template.format(max_actions=self.max_actions_per_step)
        # Wrap the final rules text in a SystemMessage object
        return SystemMessage(content=prompt)


# --- How it plugs into Agent creation (Conceptual) ---
# from browser_use import Agent, SystemPrompt
# from my_llm_setup import my_llm  # Your LLM
# ... other setup ...

# When you create an Agent:
# agent = Agent(
#     task="Find cat pictures",
#     llm=my_llm,
#     browser_context=...,
#     controller=...,
#     # The Agent's __init__ method does something like this internally:
#     # system_prompt_obj = SystemPrompt(action_description="...", max_actions_per_step=10)
#     # system_message_for_llm = system_prompt_obj.get_system_message()
#     # This system_message_for_llm is then passed to the Message Manager.
# )
```
This code shows how the `SystemPrompt` class finds and reads the `system_prompt.md` file and prepares the instructions as a `SystemMessage` ready for the LLM conversation.
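If you want to inspect the rules yourself, you can use the class directly. A minimal sketch, assuming the simplified signature shown above and an installed browser-use package; the `action_description` string here is just a placeholder:

```python
from browser_use.agent.prompts import SystemPrompt

sp = SystemPrompt(
    action_description="(description of available actions goes here)",  # placeholder
    max_actions_per_step=10,
)
system_message = sp.get_system_message()
print(system_message.content[:500])  # Peek at the first 500 characters of the rulebook
```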
Under the Hood: Initialization and Conversation Flow
Let’s visualize how the System Prompt fits into the Agent’s setup and interaction loop:
```mermaid
sequenceDiagram
    participant User
    participant Agent_Init as Agent Initialization
    participant SP as SystemPrompt Class
    participant MM as Message Manager
    participant Agent_Run as Agent Run Loop
    participant LLM

    User->>Agent_Init: Create Agent(task, llm, ...)
    Note over Agent_Init: Agent needs the rules!
    Agent_Init->>SP: Create SystemPrompt(...)
    SP->>SP: _load_prompt_template() reads system_prompt.md
    SP-->>Agent_Init: SystemPrompt instance
    Agent_Init->>SP: get_system_message()
    SP-->>Agent_Init: system_message (The Formatted Rules)
    Note over Agent_Init: Pass rules to conversation manager
    Agent_Init->>MM: Initialize MessageManager(task, system_message)
    MM->>MM: Store system_message as message #1
    MM-->>Agent_Init: MessageManager instance ready
    Agent_Init-->>User: Agent created and ready
    User->>Agent_Run: agent.run() starts the task
    Note over Agent_Run: Agent needs context for LLM
    Agent_Run->>MM: get_messages()
    MM-->>Agent_Run: [system_message, user_message(state), ...]
    Note over Agent_Run: Send rules + current state to LLM
    Agent_Run->>LLM: Ask for next action (Input includes rules)
    LLM-->>Agent_Run: JSON response (LLM followed rules!)
    Agent_Run->>MM: add_model_output(...)
    Note over Agent_Run: Loop continues...
```
Internally, the `Agent`’s initialization code (`__init__` in `agent/service.py`) explicitly creates the `SystemPrompt` and passes its output to the `MessageManager`:
```python
# --- File: agent/service.py (Simplified Agent __init__) ---
# ... other imports ...
from browser_use.agent.prompts import SystemPrompt  # Import the class
from browser_use.agent.message_manager.service import MessageManager, MessageManagerSettings

class Agent:
    def __init__(
        self,
        task: str,
        llm: BaseChatModel,
        browser_context: BrowserContext,
        controller: Controller,
        system_prompt_class: Type[SystemPrompt] = SystemPrompt,  # Allows customizing the prompt class
        max_actions_per_step: int = 10,
        # ... other parameters ...
        **kwargs,
    ):
        self.task = task
        self.llm = llm
        # ... store other components ...

        # Get the list of available actions from the controller
        self.available_actions = controller.registry.get_prompt_description()

        # 1. Create the SystemPrompt instance using the provided class
        system_prompt_instance = system_prompt_class(
            action_description=self.available_actions,
            max_actions_per_step=max_actions_per_step,
        )

        # 2. Get the formatted SystemMessage (the rules)
        system_message = system_prompt_instance.get_system_message()

        # 3. Initialize the Message Manager with the task and the rules
        self._message_manager = MessageManager(
            task=self.task,
            system_message=system_message,  # <--- Pass the rules here!
            settings=MessageManagerSettings(...),
            # ... other message manager setup ...
        )
        # ... rest of initialization ...
        logger.info("Agent initialized with System Prompt.")
```
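The `system_prompt_class` parameter above is an extension point: you can subclass `SystemPrompt` to add your own rules. Here is a sketch under the simplified classes from this chapter (the real library’s extension hooks may differ):

```python
from langchain_core.messages import SystemMessage
from browser_use.agent.prompts import SystemPrompt

class CautiousSystemPrompt(SystemPrompt):
    """Standard rulebook plus one extra house rule."""

    def get_system_message(self) -> SystemMessage:
        base = super().get_system_message()
        extra_rule = "\nEXTRA RULE: Never submit forms on checkout pages."
        return SystemMessage(content=base.content + extra_rule)

# Then pass the class (not an instance) when creating the Agent:
# agent = Agent(task=..., llm=..., browser_context=..., controller=...,
#               system_prompt_class=CautiousSystemPrompt)
```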
When the `Agent` runs its loop (`agent.run()` calls `agent.step()`), it asks the `MessageManager` for the current conversation history (`self._message_manager.get_messages()`). The `MessageManager` always ensures that the `SystemMessage` (containing the rules) is the very first item in that history list sent to the LLM.
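Conceptually, that ordering guarantee is easy to picture. This toy sketch is not the real `MessageManager`, just an illustration of the invariant:

```python
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage

class ToyMessageManager:
    """Toy illustration: the system message is pinned to slot 0."""

    def __init__(self, task: str, system_message: SystemMessage):
        self._history: list[BaseMessage] = [system_message]  # The rules come first
        self._history.append(HumanMessage(content=f"Your ultimate task is: {task}"))

    def add_message(self, message: BaseMessage) -> None:
        self._history.append(message)  # Everything else goes after the rules

    def get_messages(self) -> list[BaseMessage]:
        return list(self._history)  # Index 0 is always the SystemMessage

mm = ToyMessageManager("Find cat pictures", SystemMessage(content="You are an AI agent..."))
print(type(mm.get_messages()[0]).__name__)  # -> SystemMessage
```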
Conclusion
The System Prompt is the essential rulebook that governs the LLM’s behavior within the Browser Use framework. It tells the LLM how to interpret the browser state, what actions it can take, and, most importantly, dictates the exact JSON format for its responses. This structured communication is key to enabling the `Agent` to reliably understand the LLM’s plan and execute browser automation tasks.
Without a clear System Prompt, the LLM would be like an untrained assistant – potentially intelligent, but unable to follow the specific procedures needed for the job.
Now that we understand how the `Agent` gets its fundamental instructions, how does it actually perceive the webpage it’s supposed to interact with? In the next chapter, we’ll explore the component responsible for representing the browser’s state: the BrowserContext.