Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
In the previous chapter, Chapter 8: Exploring Websites - DeepCrawlStrategy, we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This can be slow and might unnecessarily put a load on the website you’re crawling. Wouldn’t it be smarter to remember the result from the first time and just reuse it?
What Problem Does Caching Solve?
Imagine you need to download a large instruction manual (a webpage) from the internet.
- Without Caching: Every single time you need the manual, you download the entire file again. This takes time and uses bandwidth every time.
- With Caching: The first time you download it, you save a copy on your computer (the “cache”). The next time you need it, you first check your local copy. If it’s there, you use it instantly! You only download it again if you specifically want the absolute latest version or if your local copy is missing.
Caching in Crawl4AI works the same way. It’s a mechanism to store the results of crawling a webpage locally (in a database file). When asked to crawl a URL again, Crawl4AI can check its cache first. If a valid result is already stored, it can return that saved result almost instantly, saving time and resources.
Introducing CacheMode and CacheContext
Crawl4AI uses two key concepts to manage this caching behavior:
CacheMode (The Cache Policy):
- Think of this like setting the rules for how you interact with your saved instruction manuals.
- It's an instruction you give the crawler for a specific run, telling it how to use the cache.
- Analogy: Should you always use your saved copy if you have one? (ENABLED) Should you ignore your saved copies and always download a fresh one? (BYPASS) Should you never save any copies? (DISABLED) Should you save new copies but never reuse old ones? (WRITE_ONLY)
- CacheMode lets you choose the caching behavior that best fits your needs for a particular task.
CacheContext (The Decision Maker):
- This is an internal helper that Crawl4AI uses during a crawl. You don't usually interact with it directly.
- It looks at the CacheMode you provided (the policy) and the type of URL being processed.
- Analogy: Imagine a librarian who checks the library's borrowing rules (CacheMode) and the type of item you're requesting (e.g., a reference book that can't be checked out, like raw: HTML, which isn't cached). Based on these, the librarian (CacheContext) decides if you can borrow an existing copy (read from cache) or if a new copy should be added to the library (write to cache).
- It helps the main AsyncWebCrawler make the right decision about reading from or writing to the cache for each specific URL, based on the active policy.
Setting the Cache Policy: Using CacheMode
You control the caching behavior by setting the cache_mode parameter within the CrawlerRunConfig object that you pass to crawler.arun() or crawler.arun_many().
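For instance, the same config object can drive a single arun() call or a batch arun_many() call. The sketch below illustrates this; the second URL is a placeholder, the filename is hypothetical, and the assumption that arun_many() returns a list of results by default may vary between versions.

# sketch_cache_policy.py (hypothetical filename)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # One cache policy, reused for a single crawl and a batch crawl
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    urls = ["https://httpbin.org/html", "https://example.com"]  # placeholder URLs

    async with AsyncWebCrawler() as crawler:
        single = await crawler.arun(url=urls[0], config=config)
        print(f"Single crawl ok? {single.success}")

        # arun_many applies the same cache policy to every URL in the batch
        results = await crawler.arun_many(urls=urls, config=config)
        print(f"Batch results: {[r.success for r in results]}")

if __name__ == "__main__":
    asyncio.run(main())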
Let's explore the most common CacheMode options:
1. CacheMode.ENABLED (The Default Behavior - If not specified)
- Policy: “Use the cache if a valid result exists. If not, fetch the page, save the result to the cache, and then return it.”
- This is the standard, balanced approach. It saves time on repeated crawls but ensures you get the content eventually.
- Note: In recent versions, the default when cache_mode is left completely unspecified may be CacheMode.BYPASS. Always check the documentation or explicitly set the mode for clarity. In this tutorial we set it explicitly.
# chapter9_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        # Explicitly set the mode to ENABLED
        config_enabled = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
        print(f"Running with CacheMode: {config_enabled.cache_mode.name}")

        # First run: Fetches, caches, and returns result
        print("First run (ENABLED)...")
        result1 = await crawler.arun(url=url, config=config_enabled)
        print(f"Got result 1? {'Yes' if result1.success else 'No'}")

        # Second run: Finds result in cache and returns it instantly
        print("Second run (ENABLED)...")
        result2 = await crawler.arun(url=url, config=config_enabled)
        print(f"Got result 2? {'Yes' if result2.success else 'No'}")
        # This second run should be much faster!

if __name__ == "__main__":
    asyncio.run(main())
Explanation:
- We create a CrawlerRunConfig with cache_mode=CacheMode.ENABLED.
- The first arun call fetches the page from the web and saves the result in the cache.
- The second arun call (for the same URL, with the same cache-relevant config) finds the saved result in the cache and returns it immediately, skipping the web fetch (timed in the sketch below).
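To verify the speedup yourself, you can time both runs, just as the BYPASS example below does. This is a hedged variant of example 1; the filename is hypothetical, and the expectation that the second run is near-instant depends on your network and machine.

# chapter9_example_1b.py (hypothetical variant of example 1 with timing)
import asyncio
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

    async with AsyncWebCrawler() as crawler:
        start = time.perf_counter()
        await crawler.arun(url=url, config=config)  # cold: fetched from the web, then cached
        print(f"First run took  {time.perf_counter() - start:.2f}s")

        start = time.perf_counter()
        await crawler.arun(url=url, config=config)  # warm: served from the local cache
        print(f"Second run took {time.perf_counter() - start:.2f}s (expected to be much faster)")

if __name__ == "__main__":
    asyncio.run(main())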
2. CacheMode.BYPASS
- Policy: “Ignore any existing saved copy. Always fetch a fresh copy from the web. After fetching, save this new result to the cache (overwriting any old one).”
- Useful when you always need the absolute latest version of the page, but you still want to update the cache for potential future use with CacheMode.ENABLED.
# chapter9_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
import time

async def main():
    url = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        # Set the mode to BYPASS
        config_bypass = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        print(f"Running with CacheMode: {config_bypass.cache_mode.name}")

        # First run: Fetches, caches, and returns result
        print("First run (BYPASS)...")
        start_time = time.perf_counter()
        result1 = await crawler.arun(url=url, config=config_bypass)
        duration1 = time.perf_counter() - start_time
        print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")

        # Second run: Ignores cache, fetches again, updates cache, returns result
        print("Second run (BYPASS)...")
        start_time = time.perf_counter()
        result2 = await crawler.arun(url=url, config=config_bypass)
        duration2 = time.perf_counter() - start_time
        print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
        # Both runs should take a similar amount of time (fetching time)

if __name__ == "__main__":
    asyncio.run(main())
Explanation:
- We set cache_mode=CacheMode.BYPASS.
- Both the first and second arun calls fetch the page directly from the web, ignoring any previously cached result. They will still write the newly fetched result to the cache. Notice that both runs take roughly the same amount of time (the network fetch time).
3. CacheMode.DISABLED
- Policy: “Completely ignore the cache. Never read from it, never write to it.”
- Useful when you don’t want Crawl4AI to interact with the cache files at all, perhaps for debugging or if you have storage constraints.
# chapter9_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
import time

async def main():
    url = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        # Set the mode to DISABLED
        config_disabled = CrawlerRunConfig(cache_mode=CacheMode.DISABLED)
        print(f"Running with CacheMode: {config_disabled.cache_mode.name}")

        # First run: Fetches, returns result (does NOT cache)
        print("First run (DISABLED)...")
        start_time = time.perf_counter()
        result1 = await crawler.arun(url=url, config=config_disabled)
        duration1 = time.perf_counter() - start_time
        print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")

        # Second run: Fetches again, returns result (does NOT cache)
        print("Second run (DISABLED)...")
        start_time = time.perf_counter()
        result2 = await crawler.arun(url=url, config=config_disabled)
        duration2 = time.perf_counter() - start_time
        print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
        # Both runs fetch fresh, and nothing is ever saved to the cache.

if __name__ == "__main__":
    asyncio.run(main())
Explanation:
- We set cache_mode=CacheMode.DISABLED.
- Both arun calls fetch fresh content from the web. Crucially, neither run reads from nor writes to the cache database.
Other Modes (READ_ONLY, WRITE_ONLY):
- CacheMode.READ_ONLY: Only uses existing cached results. If a result isn't in the cache, it fails or returns an empty result rather than fetching the page. Never saves anything new.
- CacheMode.WRITE_ONLY: Never reads from the cache (always fetches fresh). It only writes the newly fetched result to the cache. (A combined sketch of both modes follows below.)
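These two modes pair naturally: warm the cache in one pass, then serve strictly from it in the next. The sketch below illustrates this under that assumption; the filename is hypothetical, and the exact behavior of READ_ONLY on a cache miss (an error versus an unsuccessful or empty result) may vary by version.

# sketch_read_write_only.py (hypothetical)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        # Pass 1: WRITE_ONLY - always fetch fresh, store the result, never read the cache
        warmup = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)
        await crawler.arun(url=url, config=warmup)

        # Pass 2: READ_ONLY - serve only what is already cached, never fetch or write
        read_only = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)
        result = await crawler.arun(url=url, config=read_only)
        print(f"Served from cache? {'Yes' if result and result.success else 'No'}")

if __name__ == "__main__":
    asyncio.run(main())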
How Caching Works Internally
When you call crawler.arun(url="...", config=...):
- Create Context: The AsyncWebCrawler creates a CacheContext instance using the url and the config.cache_mode.
- Check Read: It asks the CacheContext, "Should I read from the cache?" (cache_context.should_read()).
- Try Reading: If should_read() is True, it asks the database manager (AsyncDatabaseManager) to look for a cached result for the url.
- Cache Hit?
  - If a valid cached result is found: The AsyncWebCrawler returns this cached CrawlResult immediately. Done!
  - If no cached result is found (or if should_read() was False): Proceed to fetching.
- Fetch: The AsyncWebCrawler calls the appropriate AsyncCrawlerStrategy to fetch the content from the web.
- Process: It processes the fetched HTML (scraping, filtering, extracting) to create a new CrawlResult.
- Check Write: It asks the CacheContext, "Should I write this result to the cache?" (cache_context.should_write()).
- Write Cache: If should_write() is True, it tells the database manager to save the new CrawlResult into the cache database.
- Return: The AsyncWebCrawler returns the newly created CrawlResult.
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Ctx as CacheContext
participant DB as DatabaseManager
participant Fetcher as AsyncCrawlerStrategy
User->>AWC: arun(url, config)
AWC->>Ctx: Create CacheContext(url, config.cache_mode)
AWC->>Ctx: should_read()?
alt Cache Read Allowed
Ctx-->>AWC: Yes
AWC->>DB: aget_cached_url(url)
DB-->>AWC: Cached Result (or None)
alt Cache Hit & Valid
AWC-->>User: Return Cached CrawlResult
else Cache Miss or Invalid
AWC->>AWC: Proceed to Fetch
end
else Cache Read Not Allowed
Ctx-->>AWC: No
AWC->>AWC: Proceed to Fetch
end
Note over AWC: Fetching Required
AWC->>Fetcher: crawl(url, config)
Fetcher-->>AWC: Raw Response
AWC->>AWC: Process HTML -> New CrawlResult
AWC->>Ctx: should_write()?
alt Cache Write Allowed
Ctx-->>AWC: Yes
AWC->>DB: acache_url(New CrawlResult)
DB-->>AWC: OK
else Cache Write Not Allowed
Ctx-->>AWC: No
end
AWC-->>User: Return New CrawlResult
Code Glimpse
Let’s look at simplified code snippets.
Inside async_webcrawler.py (where arun uses caching):
# Simplified from crawl4ai/async_webcrawler.py
from .cache_context import CacheContext, CacheMode
from .async_database import async_db_manager
from .models import CrawlResult
# ... other imports

class AsyncWebCrawler:
    # ... (init, other methods) ...

    async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
        # ... (ensure config exists, set defaults) ...
        if config.cache_mode is None:
            config.cache_mode = CacheMode.ENABLED  # Example default

        # 1. Create CacheContext
        cache_context = CacheContext(url, config.cache_mode)

        cached_result = None
        # 2. Check if cache read is allowed
        if cache_context.should_read():
            # 3. Try reading from database
            cached_result = await async_db_manager.aget_cached_url(url)

        # 4. If cache hit and valid, return it
        if cached_result and self._is_cache_valid(cached_result, config):
            self.logger.info("Cache hit for: %s", url)  # Example log
            return cached_result  # Return early

        # 5. Fetch fresh content (if no cache hit or read disabled)
        async_response = await self.crawler_strategy.crawl(url, config=config)
        html = async_response.html  # ... and other data ...

        # 6. Process the HTML to get a new CrawlResult
        crawl_result = await self.aprocess_html(
            url=url, html=html, config=config,  # ... other params ...
        )

        # 7. Check if cache write is allowed
        if cache_context.should_write():
            # 8. Write the new result to the database
            await async_db_manager.acache_url(crawl_result)

        # 9. Return the new result
        return crawl_result

    def _is_cache_valid(self, cached_result: CrawlResult, config: CrawlerRunConfig) -> bool:
        # Internal logic to check if cached result meets current needs
        # (e.g., was screenshot requested now but not cached?)
        if config.screenshot and not cached_result.screenshot: return False
        if config.pdf and not cached_result.pdf: return False
        # ... other checks ...
        return True
Inside cache_context.py (defining the concepts):
# Simplified from crawl4ai/cache_context.py
from enum import Enum

class CacheMode(Enum):
    """Defines the caching behavior for web crawling operations."""
    ENABLED = "enabled"        # Read and Write
    DISABLED = "disabled"      # No Read, No Write
    READ_ONLY = "read_only"    # Read Only, No Write
    WRITE_ONLY = "write_only"  # Write Only, No Read
    BYPASS = "bypass"          # No Read, Write Only (similar to WRITE_ONLY but explicit intention)

class CacheContext:
    """Encapsulates cache-related decisions and URL handling."""

    def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
        self.url = url
        self.cache_mode = cache_mode
        self.always_bypass = always_bypass  # Usually False
        # Determine if URL type is cacheable (e.g., not 'raw:')
        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
        # ... other URL type checks ...

    def should_read(self) -> bool:
        """Determines if cache should be read based on context."""
        if self.always_bypass or not self.is_cacheable:
            return False
        # Allow read if mode is ENABLED or READ_ONLY
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]

    def should_write(self) -> bool:
        """Determines if cache should be written based on context."""
        if self.always_bypass or not self.is_cacheable:
            return False
        # Allow write if mode is ENABLED, WRITE_ONLY, or BYPASS
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY, CacheMode.BYPASS]

    @property
    def display_url(self) -> str:
        """Returns the URL in display format."""
        return self.url if not self.url.startswith("raw:") else "Raw HTML"

# Helper for backward compatibility (may be removed later)
def _legacy_to_cache_mode(...) -> CacheMode:
    # ... logic to convert old boolean flags ...
    pass
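Tying this back to the librarian analogy, the short sketch below asks a CacheContext for its read/write decisions under each policy, and for a non-cacheable raw: URL. It assumes the class is importable from the crawl4ai.cache_context module shown above; the filename is hypothetical.

# sketch_cache_context.py (hypothetical)
from crawl4ai.cache_context import CacheContext, CacheMode  # module path taken from the snippet above

for mode in (CacheMode.ENABLED, CacheMode.BYPASS, CacheMode.DISABLED,
             CacheMode.READ_ONLY, CacheMode.WRITE_ONLY):
    ctx = CacheContext("https://httpbin.org/html", mode)
    print(f"{mode.name:<11} read={str(ctx.should_read()):<5} write={ctx.should_write()}")

# raw: HTML is not cacheable, so both answers are False no matter the mode
raw_ctx = CacheContext("raw:<html><body>Hello</body></html>", CacheMode.ENABLED)
print(f"raw: HTML   read={raw_ctx.should_read()} write={raw_ctx.should_write()}")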
Conclusion
You’ve learned how Crawl4AI uses caching to avoid redundant work and speed up repeated crawls!
- Caching stores results locally to reuse them later.
- CacheMode is the policy you set in CrawlerRunConfig to control how the cache is used (ENABLED, BYPASS, DISABLED, etc.).
- CacheContext is an internal helper that makes decisions based on the CacheMode and the URL type.
- Using the cache effectively (especially CacheMode.ENABLED) can significantly speed up your crawling tasks, particularly during development or when dealing with many URLs, including deep crawls.
We've seen how Crawl4AI can crawl single pages, lists of pages (arun_many), and even explore websites (DeepCrawlStrategy). But how does arun_many or a deep crawl manage running potentially hundreds or thousands of individual crawl tasks efficiently, without overwhelming your system or the target website?
Next: Let’s explore the component responsible for managing concurrent tasks: Chapter 10: Orchestrating the Crawl - BaseDispatcher.
Generated by AI Codebase Knowledge Builder