Chapter 3: Giving Instructions - CrawlerRunConfig

In Chapter 2: Meet the General Manager - AsyncWebCrawler, we met AsyncWebCrawler, the central coordinator for our web crawling tasks, and saw how to tell it what URL to crawl using the arun method.

But what if we want to tell the crawler how to crawl that URL? Maybe we want it to take a screenshot of the page? Or process only a specific section? Or ignore the cache and fetch the very latest version?

Passing all these different instructions individually every time we call arun could get complicated and messy.

# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )

That’s where CrawlerRunConfig comes in!

What Problem Does CrawlerRunConfig Solve?

Think of CrawlerRunConfig as the Instruction Manual for a specific crawl job. Instead of giving the AsyncWebCrawler manager lots of separate instructions each time, you bundle them all neatly into a single CrawlerRunConfig object.

This object tells the AsyncWebCrawler exactly how to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.

What is CrawlerRunConfig?

CrawlerRunConfig is a configuration class that holds all the settings for a single crawl operation initiated by AsyncWebCrawler.arun() or arun_many().

It allows you to customize various aspects of the crawl, such as:

  • Taking Screenshots: Should the crawler capture an image of the page? (screenshot)
  • Waiting: How long should the crawler wait for the page or specific elements to load? (page_timeout, wait_for)
  • Focusing Content: Should the crawler only process a specific part of the page? (css_selector)
  • Extracting Data: Should the crawler use a specific method to pull out structured data? (ExtractionStrategy)
  • Caching: How should the crawler interact with previously saved results? (CacheMode)
  • And much more! (like handling JavaScript, filtering links, etc.)
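
For example, a config that bundles a few of these options might look like the following. This is a minimal sketch using the parameter names from the list above; check your installed crawl4ai version for the exact set of supported options:

from crawl4ai import CrawlerRunConfig

# Bundle several instructions into one object
config = CrawlerRunConfig(
    screenshot=True,                # capture an image of the page
    css_selector="#main-content",   # focus on one part of the page
    wait_for="#data-table"          # wait for this element before proceeding
)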

Using CrawlerRunConfig

Let’s see how to use it. Remember our basic crawl from Chapter 2?

# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Likely No
            # We'll learn about CacheMode later; per the class definition shown
            # at the end of this chapter, it defaults to BYPASS (fetch fresh)
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())

Now, let’s say for this specific crawl, we want to bypass the cache (fetch fresh) and also take a screenshot.

We create a CrawlerRunConfig instance and pass it to arun:

# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # 1. Import the config class
from crawl4ai import CacheMode        # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, # Don't use the cache, fetch fresh
            screenshot=True              # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Should be Yes
            # result.screenshot holds the screenshot data (base64-encoded), not a file path
            if result.screenshot:
                print(f"Screenshot captured ({len(result.screenshot)} chars of base64 data)")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())

Explanation:

  1. Import: We import CrawlerRunConfig and CacheMode.
  2. Create Config: We create an instance: my_instructions = CrawlerRunConfig(...). We set cache_mode to CacheMode.BYPASS and screenshot to True. All other settings remain at their defaults.
  3. Pass Config: We pass this my_instructions object to crawler.arun using the config= parameter.

Now, when AsyncWebCrawler runs this job, it will look inside my_instructions and follow those specific settings for this run only.
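
Because the config is a plain object, you can reuse the same "instruction manual" across many URLs. For instance, arun_many() (mentioned earlier) accepts the same config= parameter. A minimal sketch, reusing my_instructions from the example above:

# A sketch: reuse one config for a whole batch of URLs.
# Runs inside the same `async with AsyncWebCrawler() as crawler:` block.
urls = ["https://httpbin.org/html", "https://example.com"]
results = await crawler.arun_many(urls=urls, config=my_instructions)
for result in results:
    print(result.url, "->", "OK" if result.success else result.error_message)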

Some Common CrawlerRunConfig Parameters

CrawlerRunConfig has many options, but here are a few common ones you might use:

  • cache_mode: Controls caching behavior.
    • CacheMode.ENABLED: Use the cache if available, otherwise fetch and save.
    • CacheMode.BYPASS (the default, per the class definition below): Always fetch fresh, ignoring any cached version (but still save the new result).
    • CacheMode.DISABLED: Never read from or write to the cache.
    • (More details in Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode)
  • screenshot (bool): If True, captures a screenshot of the fully rendered page. The screenshot data (base64-encoded) will be in CrawlResult.screenshot. Default: False.
  • pdf (bool): If True, generates a PDF of the page. The PDF data will be in CrawlResult.pdf (see the sketch after this list). Default: False.
  • css_selector (str): If provided (e.g., "#main-content" or ".article-body"), the crawler will try to extract only the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: None (process the whole page).
  • wait_for (str): A CSS selector (e.g., "#data-loaded-indicator"). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: None.
  • page_timeout (int): Maximum time in milliseconds to wait for page navigation or certain operations. Default: 60000 (60 seconds).
  • extraction_strategy: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: None. (See Chapter 6: Getting Specific Data - ExtractionStrategy)
  • scraping_strategy: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: WebScrapingStrategy(). (See Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy)
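
As a quick illustration of two options we haven't used yet, here is a config that disables caching entirely and renders the page to PDF. A minimal sketch:

from crawl4ai import CrawlerRunConfig, CacheMode

# A sketch: never touch the cache, and also generate a PDF of the page
pdf_config = CrawlerRunConfig(
    cache_mode=CacheMode.DISABLED,  # no cache reads or writes for this run
    pdf=True                        # result.pdf will hold the PDF data
)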

Let’s try combining a few: focus on a specific part of the page and set an overall page timeout.

# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example page has an <h1> heading inside its <body>.
    url_to_crawl = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: only grab the <h1> tag; give the page up to 10s to load
        specific_config = CrawlerRunConfig(
            css_selector="h1", # Only grab content inside <h1> tags
            page_timeout=10000 # Set page timeout to 10 seconds
            # We could also add wait_for="h1" if needed for dynamic loading
        )

        result = await crawler.arun(url=url_to_crawl, config=specific_config)

        if result.success:
            print("\nSuccess! Focused crawl completed.")
            # The markdown should now ONLY contain the H1 content
            print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())

This time, the result.markdown should only contain the text from the <h1> tag on that page, because we used css_selector="h1" in our CrawlerRunConfig.
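
If the page built its content with JavaScript instead, we would add wait_for so the crawler doesn't grab the HTML too early. A hedged sketch (the URL and the #results selector are hypothetical placeholders, since httpbin's page is static):

# A sketch: wait for a dynamically inserted element before processing.
# Runs inside an async function, with `crawler` from `async with AsyncWebCrawler() as crawler:`.
dynamic_config = CrawlerRunConfig(
    css_selector="#results",  # hypothetical container filled in by JavaScript
    wait_for="#results",      # don't proceed until it appears
    page_timeout=30000        # give the page up to 30 seconds
)
result = await crawler.arun(url="https://example.com/dynamic", config=dynamic_config)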

How AsyncWebCrawler Uses the Config (Under the Hood)

You don’t need to know the exact internal code, but it helps to understand the flow. When you call crawler.arun(url, config=my_config), the AsyncWebCrawler essentially does this:

  1. Receives the url and the my_config object.
  2. Before fetching, it checks my_config.cache_mode to see if it should look in the cache first.
  3. If fetching is needed, it passes my_config to the underlying AsyncCrawlerStrategy.
  4. The strategy uses settings from my_config like page_timeout, wait_for, and whether to take a screenshot.
  5. After getting the raw HTML, AsyncWebCrawler uses the my_config.scraping_strategy and my_config.css_selector to process the content.
  6. If my_config.extraction_strategy is set, it uses that to extract structured data.
  7. Finally, it bundles everything into a CrawlResult and returns it.

Here’s a simplified view:

sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Config as CrawlerRunConfig
    participant Fetcher as AsyncCrawlerStrategy
    participant Processor as Scraping/Extraction

    User->>AWC: arun(url, config=my_config)
    AWC->>Config: Check my_config.cache_mode
    alt Need to Fetch
        AWC->>Fetcher: crawl(url, config=my_config)
        Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
        Fetcher-->>AWC: Raw Response (HTML, screenshot?)
        AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
        Processor-->>AWC: Processed Data
    else Use Cache
        AWC->>AWC: Retrieve from Cache
    end
    AWC-->>User: Return CrawlResult

The CrawlerRunConfig acts as a messenger carrying your specific instructions throughout the crawling process.

Inside the crawl4ai library, in the file async_configs.py, you’ll find the definition of the CrawlerRunConfig class. It looks something like this (simplified):

# Simplified from crawl4ai/async_configs.py

from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...

class CrawlerRunConfig:
    """
    Configuration class for controlling how the crawler runs each crawl operation.
    """
    def __init__(
        self,
        # Caching
        cache_mode: CacheMode = CacheMode.BYPASS, # Default behavior if not specified

        # Content Selection / Waiting
        css_selector: str = None,
        wait_for: str = None,
        page_timeout: int = 60000, # 60 seconds

        # Media
        screenshot: bool = False,
        pdf: bool = False,

        # Processing Strategies
        scraping_strategy: ContentScrapingStrategy = None, # Defaults internally if None
        extraction_strategy: ExtractionStrategy = None,

        # ... many other parameters omitted for clarity ...
        **kwargs # Allows for flexibility
    ):
        self.cache_mode = cache_mode
        self.css_selector = css_selector
        self.wait_for = wait_for
        self.page_timeout = page_timeout
        self.screenshot = screenshot
        self.pdf = pdf
        # Assign scraping strategy, ensuring a default if None is provided
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
        self.extraction_strategy = extraction_strategy
        # ... initialize other attributes ...

    # Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
    # ...

The key idea is that it’s a class designed to hold various settings together. When you create an instance CrawlerRunConfig(...), you’re essentially creating an object that stores your choices for these parameters.
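
Those helper methods can be handy too. For example, clone() (if your version provides it, as hinted in the snippet) returns a copy of a config with selected fields overridden, so you can keep one "base" instruction manual and derive variants from it. A sketch, assuming clone(**kwargs) works as described:

from crawl4ai import CrawlerRunConfig, CacheMode

# A sketch: derive per-job variants from one base config via clone()
base_config = CrawlerRunConfig(page_timeout=60000, screenshot=False)

# Same settings, but take a screenshot for this one job
screenshot_config = base_config.clone(screenshot=True)

# Same settings, but always fetch fresh
fresh_config = base_config.clone(cache_mode=CacheMode.BYPASS)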

Conclusion

You’ve learned about CrawlerRunConfig, the “Instruction Manual” for individual crawl jobs in Crawl4AI!

  • It solves the problem of passing many settings individually to AsyncWebCrawler.
  • You create an instance of CrawlerRunConfig and set the parameters you want to customize (like cache_mode, screenshot, css_selector, wait_for).
  • You pass this config object to crawler.arun(url, config=your_config).
  • This makes your code cleaner and gives you fine-grained control over how each crawl is performed.

Now that we know how to fetch content (AsyncCrawlerStrategy), manage the overall process (AsyncWebCrawler), and give specific instructions (CrawlerRunConfig), let’s look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.

Next: Let’s explore Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy.
