Chapter 4: DOM Representation - Mapping the Webpage

In the previous chapter, we learned about the BrowserContext, the Agent’s private workspace for browsing. We saw that the Agent uses browser_context.get_state() to get a snapshot of the current webpage. But how does the Agent actually understand the content of that snapshot?

Imagine you’re looking at the Google homepage. You instantly recognize the logo, the search bar, and the buttons. But a computer program just sees a wall of code (HTML). How can our Agent figure out: “This rectangular box is the search bar I need to type into,” or “This specific image link is the first result I should click”?

This is the problem solved by DOM Representation.

What Problem Does DOM Representation Solve?

Webpages are built using HTML (HyperText Markup Language), which describes the structure and content. Your browser reads this HTML and creates an internal, structured representation called the Document Object Model (DOM). It’s like the browser builds a detailed blueprint or an outline from the HTML instructions.

However, this raw DOM blueprint is incredibly complex and contains lots of information irrelevant to our Agent’s task. The Agent doesn’t need to know about every single tiny visual detail; it needs a simplified map focused on what’s important for interaction:

  1. What elements are on the page? (buttons, links, input fields, text)
  2. Are they visible to a user? (Hidden elements shouldn’t be interacted with)
  3. Are they interactive? (Can you click it? Can you type in it?)
  4. How can the Agent refer to them? (We need a simple way to say “click this button”)

DOM Representation solves the problem of translating the complex, raw DOM blueprint into a simplified, structured map that highlights the interactive “landmarks” and pathways the Agent can use.

Meet DomService: The Map Maker

The component responsible for creating this map is the DomService. Think of it as a cartographer specializing in webpages.

When the Agent (via the BrowserContext) asks for the current state of the page, the BrowserContext employs the DomService to analyze the page’s live DOM.

Here’s what the DomService does:

  1. Examines the Live Page: It looks at the current structure rendered in the browser tab, not just the initial HTML source code (because JavaScript can change the page after it loads).
  2. Identifies Elements: It finds all the meaningful elements like buttons, links, input fields, and text blocks.
  3. Checks Properties: For each element, it determines crucial properties:
    • Visibility: Is it actually displayed on the screen?
    • Interactivity: Is it something a user can click, type into, or otherwise interact with?
    • Position: Where is it located (roughly)?
  4. Assigns Interaction Indices: This is key! For elements deemed interactive and visible, DomService assigns a unique number, called a highlight_index (like [5], [12], etc.). This gives the Agent and the LLM a simple, unambiguous way to refer to specific elements.
  5. Builds a Structured Tree: It organizes this information into a simplified tree structure (element_tree) that reflects the page layout but is much easier to process than the full DOM.
  6. Creates an Index Map: It generates a selector_map, which is like an index in a book, mapping each highlight_index directly to its corresponding element node in the tree.

The final output is a DOMState object containing the simplified element_tree and the handy selector_map. This DOMState is then included in the BrowserState that BrowserContext.get_state() returns to the Agent.
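
To make steps 3, 4, and 6 above concrete, here is a minimal sketch in plain Python. It does not use the library: the Node class, its field names, and the choice to start numbering at 0 are assumptions made for illustration only. It walks a toy tree, gives each visible, interactive node a highlight_index, and records it in a selector_map.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a mapped page element (not the library's class)."""
    tag_name: str
    is_visible: bool = True
    is_interactive: bool = False
    highlight_index: Optional[int] = None
    children: list["Node"] = field(default_factory=list)


def assign_indices(root: Node) -> dict[int, Node]:
    """Walk the tree depth-first and number every visible, interactive node."""
    selector_map: dict[int, Node] = {}

    def visit(node: Node) -> None:
        if node.is_visible and node.is_interactive:
            # Assumption: indices start at 0 and follow document order.
            node.highlight_index = len(selector_map)
            selector_map[node.highlight_index] = node
        for child in node.children:
            visit(child)

    visit(root)
    return selector_map


# Toy page: one hidden button that should receive no index.
page = Node("body", children=[
    Node("div", children=[
        Node("input", is_interactive=True),
        Node("button", is_interactive=True),
        Node("button", is_interactive=True, is_visible=False),
    ]),
    Node("a", is_interactive=True),
])

selector_map = assign_indices(page)
for index, node in selector_map.items():
    print(f"[{index}] <{node.tag_name}>")
```

The hidden button stays in the tree but never enters the selector_map, so the LLM cannot be offered an element that a user could not see.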

The Output: DOMState - The Agent’s Map

The DOMState object produced by DomService has two main parts:

  1. element_tree: This is the root of our simplified map, represented as a DOMElementNode object (defined in dom/views.py). Each node in the tree can be either an element (DOMElementNode) or a piece of text (DOMTextNode). DOMElementNodes contain information like the tag name (<button>, <input>), attributes (aria-label="Search"), visibility, interactivity, and importantly, the highlight_index if applicable. The tree structure helps understand the page layout (e.g., this button is inside that section).

    Conceptual Example Tree:

    <body> [no index]
     |-- <div> [no index]
     |    |-- <input aria-label="Search"> [highlight_index: 5]
     |    +-- <button> [highlight_index: 6]
     |         +-- "Google Search" (TextNode)
     +-- <a href="/images"> [highlight_index: 7]
          +-- "Images" (TextNode)
    
  2. selector_map: This is a Python dictionary that acts as a quick lookup. It maps the integer highlight_index directly to the corresponding DOMElementNode object in the element_tree.

    Conceptual Example Map:

    {
        5: <DOMElementNode tag_name='input', attributes={'aria-label':'Search'}, ...>,
        6: <DOMElementNode tag_name='button', ...>,
        7: <DOMElementNode tag_name='a', attributes={'href':'/images'}, ...>
    }
    

This selector_map is incredibly useful because when the LLM decides “click element 5”, the Agent can instantly find the correct DOMElementNode using selector_map[5] and tell the Action Controller & Registry exactly which element to interact with.
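
The lookup itself is just a dictionary access, but the index may be stale if the page changed after the LLM saw it. The sketch below shows one way to guard that lookup. ElementNode and resolve_index are hypothetical names used only for this example, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    """Hypothetical, trimmed-down element record for illustration."""
    tag_name: str
    attributes: dict[str, str] = field(default_factory=dict)


def resolve_index(selector_map: dict[int, ElementNode], index: int) -> ElementNode:
    """Return the node for an LLM-chosen index, or raise a descriptive error."""
    try:
        return selector_map[index]
    except KeyError:
        known = sorted(selector_map)
        raise LookupError(
            f"No element with index {index}; known indices: {known}"
        ) from None


selector_map = {
    5: ElementNode("input", {"aria-label": "Search"}),
    6: ElementNode("button"),
    7: ElementNode("a", {"href": "/images"}),
}

node = resolve_index(selector_map, 5)
print(node.tag_name, node.attributes)

try:
    resolve_index(selector_map, 12)
except LookupError as err:
    print(err)
```

Listing the known indices in the error message gives whoever handles the failure (a developer, or the LLM on its next step) something concrete to correct against.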

How the Agent Uses the Map

The Agent takes the DOMState (usually simplifying the element_tree further into a text representation) and includes it in the information sent to the LLM. Remember the JSON response format from Chapter 2? The LLM uses the highlight_index from this map to specify actions:

// LLM might receive a simplified text view like:
// "[5]<input aria-label='Search'>\n[6]<button>Google Search</button>\n[7]<a>Images</a>"

// And respond with:
{
  "current_state": {
    "evaluation_previous_goal": "...",
    "memory": "On Google homepage, need to search for cats.",
    "next_goal": "Type 'cute cats' into the search bar [5]."
  },
  "action": [
    {
      "input_text": {
        "index": 5, // <-- Uses the highlight_index from the DOM map!
        "text": "cute cats"
      }
    }
    // ... maybe press Enter action ...
  ]
}
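
On the receiving side, the reply has to be parsed and its indices checked against the current map before anything is executed. The following is a rough sketch using only the standard library. The helper names extract_actions and validate_indices are made up for this example, and the reply text mirrors the response above with the explanatory comments removed so that it is valid JSON.

```python
import json

# Reply text shaped like the response above (annotations removed).
llm_reply = """
{
  "current_state": {
    "evaluation_previous_goal": "...",
    "memory": "On Google homepage, need to search for cats.",
    "next_goal": "Type 'cute cats' into the search bar [5]."
  },
  "action": [
    {"input_text": {"index": 5, "text": "cute cats"}}
  ]
}
"""


def extract_actions(reply_text: str) -> list[tuple[str, dict]]:
    """Return (action_name, parameters) pairs from a JSON reply."""
    reply = json.loads(reply_text)
    actions = []
    for entry in reply.get("action", []):
        for name, params in entry.items():
            actions.append((name, params))
    return actions


def validate_indices(actions: list[tuple[str, dict]], known_indices: set[int]) -> list[int]:
    """Return every referenced index that is missing from the current map."""
    return [
        params["index"]
        for _, params in actions
        if "index" in params and params["index"] not in known_indices
    ]


actions = extract_actions(llm_reply)
print(actions)
print(validate_indices(actions, known_indices={5, 6, 7}))
```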

Code Example: Seeing the Map

We don’t usually interact with DomService directly. Instead, we get its output via the BrowserContext. Let’s revisit the example from Chapter 3 and see where the DOM representation fits:

import asyncio
from browser_use import Browser, BrowserConfig, BrowserContext, BrowserContextConfig

async def main():
    browser_config = BrowserConfig(headless=False)
    browser = Browser(config=browser_config)
    context_config = BrowserContextConfig()

    async with browser.new_context(config=context_config) as browser_context:
        # Navigate to a page (e.g., Google)
        await browser_context.navigate_to("https://www.google.com")

        print("Getting current page state...")
        # This call uses DomService internally to generate the DOM representation
        current_state = await browser_context.get_state()

        print(f"\nCurrent Page URL: {current_state.url}")
        print(f"Current Page Title: {current_state.title}")

        # Accessing the DOM Representation parts within the BrowserState
        print("\n--- DOM Representation Details ---")
        # The element_tree is the root node of our simplified DOM map
        if current_state.element_tree:
            print(f"Root element tag of simplified tree: <{current_state.element_tree.tag_name}>")
        else:
            print("Element tree is empty.")

        # The selector_map provides direct access to interactive elements by index
        if current_state.selector_map:
            print(f"Number of interactive elements found: {len(current_state.selector_map)}")

            # Let's try to find the element the LLM might call [5] (often the search bar)
            example_index = 5 # Note: Indices can change depending on the page!
            if example_index in current_state.selector_map:
                element_node = current_state.selector_map[example_index]
                print(f"Element [{example_index}]: Tag=<{element_node.tag_name}>, Attributes={element_node.attributes}")
                # The Agent uses this node reference to perform actions
            else:
                print(f"Element [{example_index}] not found in the selector map for this page state.")
        else:
            print("No interactive elements found (selector map is empty).")

        # The Agent would typically convert element_tree into a compact text format
        # (using methods like element_tree.clickable_elements_to_string())
        # to send to the LLM along with the task instructions.

    print("\nBrowserContext closed.")
    await browser.close()
    print("Browser closed.")

# Run the asynchronous code
asyncio.run(main())

What happens here?

  1. We set up the Browser and BrowserContext.
  2. We navigate to Google.
  3. browser_context.get_state() is called. Internally, this triggers the DomService.
  4. DomService analyzes the Google page, finds interactive elements (like the search bar, buttons), assigns them highlight_index numbers, and builds the element_tree and selector_map.
  5. This DOMState (containing the tree and map) is packaged into the BrowserState object returned by get_state().
  6. Our code then accesses current_state.element_tree and current_state.selector_map to peek at the map created by DomService.
  7. We demonstrate looking up an element using its potential index (selector_map[5]).
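
The comment in the example mentions converting the element_tree into a compact text format for the LLM. As a rough illustration of that idea, the sketch below renders a toy tree into lines like the "[5]<input ...>" view shown earlier. It is not the library’s clickable_elements_to_string() implementation: the classes and the render function are simplified assumptions, and the exact output format of the real method may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextNode:
    """Hypothetical text leaf."""
    text: str


@dataclass
class ElementNode:
    """Hypothetical element node holding only what the renderer needs."""
    tag_name: str
    attributes: dict[str, str] = field(default_factory=dict)
    highlight_index: Optional[int] = None
    children: list = field(default_factory=list)


def collect_text(node) -> str:
    """Concatenate all text found beneath a node."""
    if isinstance(node, TextNode):
        return node.text
    return " ".join(filter(None, (collect_text(child) for child in node.children)))


def render(node, lines: Optional[list[str]] = None) -> list[str]:
    """Emit one line per indexed element, in document order."""
    if lines is None:
        lines = []
    if isinstance(node, ElementNode):
        if node.highlight_index is not None:
            attrs = "".join(f" {key}='{value}'" for key, value in node.attributes.items())
            text = collect_text(node)
            lines.append(f"[{node.highlight_index}]<{node.tag_name}{attrs}>{text}</{node.tag_name}>")
        for child in node.children:
            render(child, lines)
    return lines


# The conceptual example tree from earlier in this chapter.
page = ElementNode("body", children=[
    ElementNode("div", children=[
        ElementNode("input", {"aria-label": "Search"}, highlight_index=5),
        ElementNode("button", highlight_index=6, children=[TextNode("Google Search")]),
    ]),
    ElementNode("a", {"href": "/images"}, highlight_index=7, children=[TextNode("Images")]),
])

lines = render(page)
print("\n".join(lines))
```

Elements without an index (here body and div) produce no line of their own, which is what keeps the text view short compared with the full DOM.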

How It Works Under the Hood: DomService in Action

Let’s trace the flow when BrowserContext.get_state() is called:

sequenceDiagram
    participant Agent
    participant BC as BrowserContext
    participant DomService
    participant PlaywrightPage as Browser Page (JS Env)
    participant buildDomTree_js as buildDomTree.js

    Agent->>BC: get_state()
    Note over BC: Needs to analyze the page content
    BC->>DomService: get_clickable_elements(...)
    Note over DomService: Needs to run analysis script in browser
    DomService->>PlaywrightPage: evaluate(js_code='buildDomTree.js', args={...})
    Note over PlaywrightPage: Execute JavaScript code
    PlaywrightPage->>buildDomTree_js: Run analysis function
    Note over buildDomTree_js: Analyzes live DOM, finds visible & interactive elements, assigns highlight_index
    buildDomTree_js-->>PlaywrightPage: Return structured data (nodes, indices, map)
    PlaywrightPage-->>DomService: Return JS execution result (JSON-like data)
    Note over DomService: Process the raw data from JS
    DomService->>DomService: _construct_dom_tree(result)
    Note over DomService: Builds Python DOMElementNode tree and selector_map
    DomService-->>BC: Return DOMState (element_tree, selector_map)
    Note over BC: Combine DOMState with URL, title, screenshot etc.
    BC->>BC: Create BrowserState object
    BC-->>Agent: Return BrowserState (containing DOM map)

Key Code Points:

  1. BrowserContext calls DomService: Inside browser/context.py, the _update_state method (called by get_state) initializes and uses the DomService:

    # --- File: browser/context.py (Simplified _update_state) ---
    import logging
    
    from browser_use.dom.service import DomService # Import the service
    from browser_use.browser.views import BrowserState
    
    logger = logging.getLogger(__name__) # Used by the error handler below
    
    class BrowserContext:
        # ... other methods ...
        async def _update_state(self) -> BrowserState:
            page = await self.get_current_page() # Get the active Playwright page object
            # ... error handling ...
            try:
                # 1. Create DomService instance for the current page
                dom_service = DomService(page)
    
                # 2. Call DomService to get the DOM map (DOMState)
                content_info = await dom_service.get_clickable_elements(
                    highlight_elements=self.config.highlight_elements,
                    viewport_expansion=self.config.viewport_expansion,
                    # ... other options ...
                )
    
                # 3. Get other info (screenshot, URL, title etc.)
                screenshot_b64 = await self.take_screenshot()
                url = page.url
                title = await page.title()
                # ... gather more state ...
    
                # 4. Package everything into BrowserState
                browser_state = BrowserState(
                    element_tree=content_info.element_tree, # <--- From DomService
                    selector_map=content_info.selector_map, # <--- From DomService
                    url=url,
                    title=title,
                    screenshot=screenshot_b64,
                    # ... other state info ...
                )
                return browser_state
            except Exception as e:
                logger.error(f'Failed to update state: {str(e)}')
                raise # Or handle error
    
  2. DomService runs JavaScript: Inside dom/service.py, the _build_dom_tree method executes the JavaScript code stored in buildDomTree.js within the browser page’s context.

    # --- File: dom/service.py (Simplified _build_dom_tree) ---
    import logging
    from importlib import resources
    # ... other imports ...
    
    logger = logging.getLogger(__name__)
    
    class DomService:
        def __init__(self, page: 'Page'):
            self.page = page
            # Load the JavaScript code from the file when DomService is created
            self.js_code = resources.read_text('browser_use.dom', 'buildDomTree.js')
            # ...
    
        async def _build_dom_tree(
            self, highlight_elements: bool, focus_element: int, viewport_expansion: int
        ) -> tuple[DOMElementNode, SelectorMap]:
    
            # Prepare arguments for the JavaScript function
            args = {
                'doHighlightElements': highlight_elements,
                'focusHighlightIndex': focus_element,
                'viewportExpansion': viewport_expansion,
                'debugMode': logger.getEffectiveLevel() == logging.DEBUG,
            }
    
            try:
                # Execute the JavaScript code in the browser page!
                # The JS code analyzes the live DOM and returns a structured result.
                eval_page = await self.page.evaluate(self.js_code, args)
            except Exception as e:
                logger.error('Error evaluating JavaScript: %s', e)
                raise
    
            # ... (optional debug logging) ...
    
            # Parse the result from JavaScript into Python objects
            return await self._construct_dom_tree(eval_page)
    
        async def _construct_dom_tree(self, eval_page: dict) -> tuple[DOMElementNode, SelectorMap]:
            # ... (logic to parse js_node_map from eval_page) ...
            # ... (loops through nodes, creates DOMElementNode/DOMTextNode objects) ...
            # ... (builds the tree structure by linking parents/children) ...
            # ... (populates the selector_map dictionary) ...
            # This uses the structures defined in dom/views.py
            # ...
            root_node = ... # Parsed root DOMElementNode
            selector_map = ... # Populated dictionary {index: DOMElementNode}
            return root_node, selector_map
        # ... other methods like get_clickable_elements ...
    
  3. buildDomTree.js (Conceptual): This JavaScript file (located at dom/buildDomTree.js in the library) is the core map-making logic that runs inside the browser. It traverses the live DOM, checks element visibility and interactivity using browser APIs (like element.getBoundingClientRect(), window.getComputedStyle(), document.elementFromPoint()), assigns the highlight_index, and packages the results into a structured format that the Python DomService can understand. We don’t need to understand the JS code itself, just its purpose.

  4. Python Data Structures (DOMElementNode, DOMTextNode): The results from the JavaScript are parsed into Python objects defined in dom/views.py. These dataclasses (DOMElementNode, DOMTextNode) hold the information about each mapped element or text segment.
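
To illustrate the parsing step, here is a small self-contained sketch that turns a flat, JSON-like node map into linked Python objects plus a selector_map. The payload shape (rootId, map, tagName, highlightIndex, and the TEXT_NODE marker) is an assumption chosen for this example, and the classes are simplified stand-ins rather than the definitions in dom/views.py.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(eq=False)
class TextNode:
    """Hypothetical text leaf; parent is excluded from repr to avoid cycles."""
    text: str
    parent: Optional["ElementNode"] = field(default=None, repr=False)


@dataclass(eq=False)
class ElementNode:
    """Hypothetical element node with links to its parent and children."""
    tag_name: str
    attributes: dict[str, str] = field(default_factory=dict)
    highlight_index: Optional[int] = None
    children: list[Union["ElementNode", TextNode]] = field(default_factory=list)
    parent: Optional["ElementNode"] = field(default=None, repr=False)


def construct_tree(payload: dict) -> tuple[ElementNode, dict[int, ElementNode]]:
    """Build linked nodes from a flat id-to-node map and collect indexed elements."""
    node_map = payload["map"]
    selector_map: dict[int, ElementNode] = {}

    def build(node_id: str, parent: Optional[ElementNode]):
        data = node_map[node_id]
        if data.get("type") == "TEXT_NODE":
            return TextNode(text=data["text"], parent=parent)
        node = ElementNode(
            tag_name=data["tagName"],
            attributes=dict(data.get("attributes", {})),
            highlight_index=data.get("highlightIndex"),
            parent=parent,
        )
        if node.highlight_index is not None:
            selector_map[node.highlight_index] = node
        node.children = [build(child_id, node) for child_id in data.get("children", [])]
        return node

    root = build(payload["rootId"], None)
    return root, selector_map


# Assumed payload shape, mirroring the conceptual example tree from this chapter.
raw = {
    "rootId": "0",
    "map": {
        "0": {"tagName": "body", "children": ["1", "4"]},
        "1": {"tagName": "div", "children": ["2", "3"]},
        "2": {"tagName": "input", "attributes": {"aria-label": "Search"}, "highlightIndex": 5},
        "3": {"tagName": "button", "highlightIndex": 6, "children": ["5"]},
        "4": {"tagName": "a", "attributes": {"href": "/images"}, "highlightIndex": 7, "children": ["6"]},
        "5": {"type": "TEXT_NODE", "text": "Google Search"},
        "6": {"type": "TEXT_NODE", "text": "Images"},
    },
}

root, selector_map = construct_tree(raw)
print(root.tag_name, sorted(selector_map))
print(selector_map[6].children[0].text)
```

Because the selector_map holds references to the same objects that sit in the tree, a node found by index still knows its parent and children, so the surrounding layout remains reachable from any indexed element.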

Conclusion

DOM Representation, primarily handled by the DomService, is crucial for bridging the gap between the complex reality of a webpage (the DOM) and the Agent/LLM’s need for a simplified, actionable understanding. By creating a structured element_tree and an indexed selector_map, it provides a clear map of interactive landmarks on the page, identified by simple highlight_index numbers.

This map allows the LLM to make specific plans like “type into element [5]” or “click element [12]”, which the Agent can then reliably translate into concrete actions.

Now that we understand how the Agent sees the page, how does it actually perform those actions like clicking or typing? In the next chapter, we’ll explore the component responsible for executing the LLM’s plan: the Action Controller & Registry.

Next Chapter: Action Controller & Registry

