Chapter 8: Exploring Websites - DeepCrawlStrategy
In Chapter 7: Understanding the Results - CrawlResult, we saw the final report (a `CrawlResult`) that Crawl4AI gives us after processing a single URL. This report contains cleaned content, links, metadata, and maybe even extracted data.
But what if you want to explore a website beyond just the first page? Imagine you land on a blog’s homepage. You don’t just want the homepage content; you want to automatically discover and crawl all the individual blog posts linked from it. How can you tell Crawl4AI to act like an explorer, following links and venturing deeper into the website?
What Problem Does `DeepCrawlStrategy` Solve?
Think of the `AsyncWebCrawler.arun()` method we’ve used so far as visiting just the entrance hall of a vast library. You get information about that specific hall, but you don’t automatically explore the adjoining rooms or different floors.
What if you want to systematically explore the library? You need a plan:
- Do you explore room by room on the current floor before going upstairs? (Level by level)
- Do you pick one wing and explore all its rooms down to the very end before exploring another wing? (Go deep first)
- Do you have a map highlighting potentially interesting sections and prioritize visiting those first? (Prioritize promising paths)
`DeepCrawlStrategy` provides this exploration plan. It defines the logic for how Crawl4AI should discover and crawl new URLs starting from the initial one(s) by following the links it finds on each page. It turns the crawler from a single-page visitor into a website explorer.
What is `DeepCrawlStrategy`?
`DeepCrawlStrategy` is a concept (a blueprint) in Crawl4AI that represents the method or logic used to navigate and crawl multiple pages by following links. It tells the crawler which links to follow and in what order to visit them.
If a deep crawl is requested, it essentially takes over when you call `arun()`, managing a queue or list of URLs to visit and coordinating the crawling of those URLs, potentially up to a certain depth or number of pages.
Different Exploration Plans: The Strategies
Crawl4AI provides several concrete exploration plans (implementations) for `DeepCrawlStrategy`:
- `BFSDeepCrawlStrategy` (Level-by-Level Explorer):
  - Analogy: Like ripples spreading in a pond.
  - How it works: It first crawls the starting URL (Level 0). Then it crawls all the valid links found on that page (Level 1). Then it crawls all the valid links found on those pages (Level 2), and so on. It explores the website layer by layer.
  - Good for: Finding the shortest path to all reachable pages, and quickly getting a broad overview of the pages near the start page.
- `DFSDeepCrawlStrategy` (Deep Path Explorer):
  - Analogy: Like exploring one specific corridor in a maze all the way to the end before backtracking and trying another corridor.
  - How it works: It starts at the initial URL, follows one link, then follows a link from that page, and keeps going deeper down one path as far as possible (or until a specified depth limit). Only when it hits a dead end or the limit does it backtrack and try another path.
  - Good for: Exploring specific branches of a website thoroughly, and potentially reaching deep pages faster than BFS (if the target lies down a specific path).
- `BestFirstCrawlingStrategy` (Priority Explorer):
  - Analogy: Like using a treasure map where some paths are marked as more promising than others.
  - How it works: This strategy uses a scoring system. It looks at all the discovered (but not yet visited) links and assigns a score to each one based on how “promising” it seems (e.g., does the URL contain relevant keywords? Is it from a trusted domain?). It then crawls the link with the best score first, regardless of its depth.
  - Good for: Focusing the crawl on the most relevant or important pages first; especially useful when you can’t crawl the entire site and need to prioritize.
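To make the difference concrete, here is a minimal, hedged sketch of how the three strategies might be instantiated side by side. Only `max_depth` appears in this chapter’s own example; the import location and any other constructor details are assumptions that may differ in your Crawl4AI version.

```python
# Hedged sketch: three exploration plans for the same crawl.
# Import locations may vary by version (e.g. a crawl4ai.deep_crawling submodule).
from crawl4ai import (
    BFSDeepCrawlStrategy,       # level-by-level explorer
    DFSDeepCrawlStrategy,       # deep-path explorer
    BestFirstCrawlingStrategy,  # priority explorer (works with a scorer)
)

# Ripples in a pond: finish all of depth 1 before touching depth 2.
bfs = BFSDeepCrawlStrategy(max_depth=2)

# One corridor at a time: follow a single path down before backtracking.
dfs = DFSDeepCrawlStrategy(max_depth=4)

# Treasure map: always crawl the most promising discovered link next.
best_first = BestFirstCrawlingStrategy(max_depth=2)
```

Whichever plan you pick, it plugs into the same `deep_crawl_strategy` slot of `CrawlerRunConfig`, as shown later in this chapter.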
Guiding the Explorer: Filters and Scorers
Deep crawl strategies often work together with:
- Filters: Rules that decide if a discovered link should even be considered for crawling. Examples:
  - `DomainFilter`: Only follow links within the starting website’s domain.
  - `URLPatternFilter`: Only follow links matching a specific pattern (e.g., `/blog/posts/...`).
  - `ContentTypeFilter`: Avoid following links to non-HTML content like PDFs or images.
- Scorers: (Used mainly by `BestFirstCrawlingStrategy`) Rules that assign a score to a potential link to help prioritize it. Examples:
  - `KeywordRelevanceScorer`: Scores links higher if the URL contains certain keywords.
  - `PathDepthScorer`: Might score links differently based on how deep they are.
These act like instructions for the explorer: “Only explore rooms on this floor (filter),” “Ignore corridors marked ‘Staff Only’ (filter),” or “Check rooms marked with a star first (scorer).”
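Below is a hedged sketch of how filters and a scorer might be wired into a priority-driven crawl. The `FilterChain` wrapper appears in the code glimpse later in this chapter; the specific constructor arguments (`allowed_domains`, `patterns`, `allowed_types`, `keywords`, `weight`) and import locations are assumptions based on typical Crawl4AI usage and may differ in your version.

```python
# Hedged sketch: guiding a BestFirstCrawlingStrategy with filters and a scorer.
# Parameter names and import paths are assumptions; check your version's API.
from crawl4ai import (
    BestFirstCrawlingStrategy,
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
    KeywordRelevanceScorer,
)

# Filters: which discovered links are even allowed to be crawled.
filters = FilterChain([
    DomainFilter(allowed_domains=["example.com"]),    # stay on one site
    URLPatternFilter(patterns=["*/blog/*"]),          # only blog-looking URLs
    ContentTypeFilter(allowed_types=["text/html"]),   # skip PDFs, images, etc.
])

# Scorer: which of the allowed links look most promising and go first.
scorer = KeywordRelevanceScorer(keywords=["python", "crawling"], weight=1.0)

explorer = BestFirstCrawlingStrategy(
    max_depth=3,
    filter_chain=filters,
    url_scorer=scorer,
)
```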
How to Use a DeepCrawlStrategy
You enable deep crawling by adding a `DeepCrawlStrategy` instance to your `CrawlerRunConfig`. Let’s try exploring a website layer by layer using `BFSDeepCrawlStrategy`, going only one level deep from the start page.
```python
# chapter8_example_1.py
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    BFSDeepCrawlStrategy,  # 1. Import the desired strategy
    FilterChain,           # Groups filters for the strategy
    DomainFilter,          # A filter to stay on the same site
)

async def main():
    # 2. Create an instance of the strategy
    #    - max_depth=1: crawl the start URL (depth 0) plus the links found on it (depth 1)
    #    - filter_chain: only follow links that stay on the start URL's domain
    bfs_explorer = BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([DomainFilter(allowed_domains=["httpbin.org"])]),
    )
    print(f"Strategy: BFS, Max Depth: {bfs_explorer.max_depth}")

    # 3. Create CrawlerRunConfig and set the deep_crawl_strategy.
    #    Also set stream=True to get results as they come in.
    run_config = CrawlerRunConfig(
        deep_crawl_strategy=bfs_explorer,
        stream=True  # Get results one by one using async for
    )

    # 4. Run the crawl - arun now handles the deep crawl!
    async with AsyncWebCrawler() as crawler:
        start_url = "https://httpbin.org/links/10/0"  # A page with several internal links
        print(f"\nStarting deep crawl from: {start_url}...")

        crawl_results_generator = await crawler.arun(url=start_url, config=run_config)

        crawled_count = 0
        # Iterate over the results as they are yielded
        async for result in crawl_results_generator:
            crawled_count += 1
            status = "✅" if result.success else "❌"
            depth = result.metadata.get("depth", "N/A")
            url_short = result.url.split('/')[-1]  # Show the last part of the URL
            print(f" {status} Crawled: {url_short:<6} (Depth: {depth})")

        print(f"\nFinished deep crawl. Total pages processed: {crawled_count}")
        # Expect the start URL (depth 0) plus each unique page it links to (depth 1)

if __name__ == "__main__":
    asyncio.run(main())
```
Explanation:
- Import: We import `AsyncWebCrawler`, `CrawlerRunConfig`, `BFSDeepCrawlStrategy`, plus `FilterChain` and `DomainFilter` for link filtering.
- Instantiate Strategy: We create a `BFSDeepCrawlStrategy`.
  - `max_depth=1`: We tell it to crawl the starting URL (depth 0) and any valid links it finds on that page (depth 1), but not to go any further.
  - `filter_chain=FilterChain([DomainFilter(...)])`: We wrap a `DomainFilter` in a `FilterChain`. This tells the strategy to only consider following links that point to the same domain as the `start_url`. Links to external sites will be ignored.
- Configure Run: We create a `CrawlerRunConfig` and pass our `bfs_explorer` instance to the `deep_crawl_strategy` parameter. We also set `stream=True` so we can process results as soon as they are ready, rather than waiting for the entire crawl to finish.
- Crawl: We call `await crawler.arun(url=start_url, config=run_config)`. Because the config contains a `deep_crawl_strategy`, `arun` doesn’t just crawl the single `start_url`. Instead, it activates the deep crawl logic defined by `BFSDeepCrawlStrategy`.
- Process Results: Since we used `stream=True`, the return value is an asynchronous generator. We use `async for result in crawl_results_generator:` to loop through the `CrawlResult` objects as they are produced by the deep crawl. For each result, we print its status and depth.
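If you would rather wait for the whole exploration and receive everything at once, leave streaming off. Based on the decorator and strategy code shown later in this chapter, the non-streaming call is expected to return a list of `CrawlResult` objects; here is a minimal sketch reusing the `bfs_explorer` from above:

```python
# Hedged sketch: the same deep crawl, collected as a list instead of streamed.
batch_config = CrawlerRunConfig(
    deep_crawl_strategy=bfs_explorer,
    stream=False,  # wait until the whole deep crawl finishes
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(url="https://httpbin.org/links/10/0", config=batch_config)
    print(f"Deep crawl returned {len(results)} results")
    for result in results:
        print(result.url, result.metadata.get("depth"))
```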
You’ll see output showing the crawl starting, then processing the initial page (`links/10/0` at depth 0), followed by the pages it links to (each at depth 1).
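With the print format used in the script, the streamed output might look roughly like this (illustrative only; the exact URLs and counts depend on the target page):

```text
 ✅ Crawled: 0      (Depth: 0)
 ✅ Crawled: 1      (Depth: 1)
 ✅ Crawled: 2      (Depth: 1)
 ...
```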
How It Works (Under the Hood)
How does simply putting a strategy in the config change `arun`’s behavior? It involves a bit of Python magic called a decorator.
- Decorator: When you create an `AsyncWebCrawler`, its `arun` method is automatically wrapped by a `DeepCrawlDecorator`.
- Check Config: When you call `await crawler.arun(url=..., config=...)`, this decorator checks if `config.deep_crawl_strategy` is set.
- Delegate or Run Original:
  - If a strategy is set, the decorator doesn’t run the original single-page crawl logic. Instead, it calls the `arun` method of your chosen `DeepCrawlStrategy` instance (e.g., `bfs_explorer.arun(...)`), passing it the `crawler` itself, the `start_url`, and the `config`.
  - If no strategy is set, the decorator simply calls the original `arun` logic to crawl the single page.
- Strategy Takes Over: The `DeepCrawlStrategy`’s `arun` method now manages the crawl.
  - It maintains a list or queue of URLs to visit (e.g., `current_level` in BFS, a stack in DFS, a priority queue in BestFirst).
  - It repeatedly takes batches of URLs from its list/queue.
  - For each batch, it calls `crawler.arun_many(urls=batch_urls, config=batch_config)` (with deep crawling disabled in `batch_config` to avoid infinite loops!).
  - As results come back from `arun_many`, the strategy processes them:
    - It yields the `CrawlResult` if running in stream mode.
    - It extracts links using its `link_discovery` method.
    - `link_discovery` uses `can_process_url` (which applies filters) to validate links.
    - Valid new links are added to the list/queue for future crawling.
  - This continues until the list/queue is empty, the max depth/pages limit is reached, or it’s cancelled.
```mermaid
sequenceDiagram
    participant User
    participant Decorator as DeepCrawlDecorator
    participant Strategy as DeepCrawlStrategy (e.g., BFS)
    participant AWC as AsyncWebCrawler

    User->>Decorator: arun(start_url, config_with_strategy)
    Decorator->>Strategy: arun(start_url, crawler=AWC, config)
    Note over Strategy: Initialize queue/level with start_url
    loop Until Queue Empty or Limits Reached
        Strategy->>Strategy: Get next batch of URLs from queue
        Note over Strategy: Create batch_config (deep_crawl=None)
        Strategy->>AWC: arun_many(batch_urls, config=batch_config)
        AWC-->>Strategy: batch_results (List/Stream of CrawlResult)
        loop For each result in batch_results
            Strategy->>Strategy: Process result (yield if streaming)
            Strategy->>Strategy: Discover links (apply filters)
            Strategy->>Strategy: Add valid new links to queue
        end
    end
    Strategy-->>Decorator: Final result (List or Generator)
    Decorator-->>User: Final result
```
Code Glimpse
Let’s peek at the simplified structure:
1. The Decorator (`deep_crawling/base_strategy.py`)
```python
# Simplified from deep_crawling/base_strategy.py
from contextvars import ContextVar
from functools import wraps
# ... other imports

class DeepCrawlDecorator:
    deep_crawl_active = ContextVar("deep_crawl_active", default=False)

    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler

    def __call__(self, original_arun):
        @wraps(original_arun)
        async def wrapped_arun(url: str, config: CrawlerRunConfig = None, **kwargs):
            # Is a strategy present AND are we not already inside a deep crawl?
            if config and config.deep_crawl_strategy and not self.deep_crawl_active.get():
                # Mark that we are starting a deep crawl
                token = self.deep_crawl_active.set(True)
                try:
                    # Call the STRATEGY's arun method instead of the original
                    strategy_result = await config.deep_crawl_strategy.arun(
                        crawler=self.crawler,
                        start_url=url,
                        config=config
                    )
                    # Handle streaming if needed
                    if config.stream:
                        # Return an async generator that resets the context var on exit
                        async def result_wrapper():
                            try:
                                async for result in strategy_result:
                                    yield result
                            finally:
                                self.deep_crawl_active.reset(token)
                        return result_wrapper()
                    else:
                        return strategy_result  # Return the list of results directly
                finally:
                    # Reset the context var if not streaming (the wrapper handles it otherwise)
                    if not config.stream:
                        self.deep_crawl_active.reset(token)
            else:
                # No strategy, or already deep crawling: call the original single-page arun
                return await original_arun(url, config=config, **kwargs)
        return wrapped_arun
```
2. The Strategy Blueprint (`deep_crawling/base_strategy.py`)
```python
# Simplified from deep_crawling/base_strategy.py
from abc import ABC, abstractmethod
# ... other imports

class DeepCrawlStrategy(ABC):

    @abstractmethod
    async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
        # Implementation for non-streaming mode
        pass

    @abstractmethod
    async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
        # Implementation for streaming mode
        pass

    async def arun(self, start_url, crawler, config) -> RunManyReturn:
        # Decides whether to call _arun_batch or _arun_stream
        if config.stream:
            return self._arun_stream(start_url, crawler, config)
        else:
            return await self._arun_batch(start_url, crawler, config)

    @abstractmethod
    async def can_process_url(self, url: str, depth: int) -> bool:
        # Applies filters to decide if a URL is valid to crawl
        pass

    @abstractmethod
    async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
        # Extracts, validates, and prepares links for the next step
        pass

    @abstractmethod
    async def shutdown(self):
        # Cleanup logic
        pass
```
3. Example: BFS Implementation (`deep_crawling/bfs_strategy.py`)
```python
# Simplified from deep_crawling/bfs_strategy.py
# ... imports ...
from .base_strategy import DeepCrawlStrategy  # Import the base class

class BFSDeepCrawlStrategy(DeepCrawlStrategy):
    def __init__(self, max_depth, filter_chain=None, url_scorer=None, **kwargs):
        self.max_depth = max_depth
        self.filter_chain = filter_chain or FilterChain()  # Use a default (empty) chain if none
        self.url_scorer = url_scorer
        # ... other init ...
        self._pages_crawled = 0

    async def can_process_url(self, url: str, depth: int) -> bool:
        # ... (validation logic using self.filter_chain) ...
        is_valid = True  # Placeholder
        if depth != 0 and not await self.filter_chain.apply(url):
            is_valid = False
        return is_valid

    async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
        # ... (logic to get links from result.links) ...
        links = result.links.get("internal", [])  # Example: only internal links
        for link_data in links:
            url = link_data.get("href")
            if url and url not in visited:
                if await self.can_process_url(url, current_depth + 1):
                    # Check scoring, max_pages limit, etc.
                    depths[url] = current_depth + 1
                    next_level.append((url, source_url))  # Add (url, parent) tuple

    async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
        visited = set()
        current_level = [(start_url, None)]  # List of (url, parent_url)
        depths = {start_url: 0}
        all_results = []

        while current_level:  # While there are pages in the current level
            next_level = []
            urls_in_level = [url for url, parent in current_level]
            visited.update(urls_in_level)

            # Create config for this batch (no deep crawl recursion)
            batch_config = config.clone(deep_crawl_strategy=None, stream=False)

            # Crawl all URLs in the current level
            batch_results = await crawler.arun_many(urls=urls_in_level, config=batch_config)

            for result in batch_results:
                # Add metadata (depth, parent)
                depth = depths.get(result.url, 0)
                result.metadata = result.metadata or {}
                result.metadata["depth"] = depth
                # ... find parent ...
                all_results.append(result)

                # Discover links for the *next* level
                if result.success:
                    await self.link_discovery(result, result.url, depth, visited, next_level, depths)

            current_level = next_level  # Move to the next level

        return all_results

    async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
        # Similar logic to _arun_batch, but uses 'yield result'
        # and processes results as they come from the arun_many stream
        visited = set()
        current_level = [(start_url, None)]  # List of (url, parent_url)
        depths = {start_url: 0}

        while current_level:
            next_level = []
            urls_in_level = [url for url, parent in current_level]
            visited.update(urls_in_level)

            # Use stream=True for arun_many
            batch_config = config.clone(deep_crawl_strategy=None, stream=True)
            batch_results_gen = await crawler.arun_many(urls=urls_in_level, config=batch_config)

            async for result in batch_results_gen:
                # Add metadata
                depth = depths.get(result.url, 0)
                result.metadata = result.metadata or {}
                result.metadata["depth"] = depth
                # ... find parent ...
                yield result  # Yield the result immediately

                # Discover links for the next level
                if result.success:
                    await self.link_discovery(result, result.url, depth, visited, next_level, depths)

            current_level = next_level

    # ... shutdown method ...
```
Conclusion
You’ve learned about `DeepCrawlStrategy`, the component that turns Crawl4AI into a website explorer!
- It solves the problem of crawling beyond a single starting page by following links.
- It defines the exploration plan:
  - `BFSDeepCrawlStrategy`: Level by level.
  - `DFSDeepCrawlStrategy`: Deep paths first.
  - `BestFirstCrawlingStrategy`: Prioritized by score.
- Filters and Scorers help guide the exploration.
- You enable it by setting `deep_crawl_strategy` in the `CrawlerRunConfig`.
- A decorator mechanism intercepts `arun` calls to activate the strategy.
- The strategy manages the queue of URLs and uses `crawler.arun_many` to crawl them in batches.
Deep crawling allows you to gather information from multiple related pages automatically. But how does Crawl4AI avoid re-fetching the same page over and over again, especially during these deeper crawls? The answer lies in caching.
Next: Let’s explore how Crawl4AI smartly caches results with Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode.
Generated by AI Codebase Knowledge Builder