Chapter 7: Understanding the Results - CrawlResult
In the previous chapter, Chapter 6: Getting Specific Data - ExtractionStrategy, we learned how to teach Crawl4AI to act like an analyst, extracting specific, structured data points from a webpage using an ExtractionStrategy. We’ve seen how Crawl4AI can fetch pages, clean them, filter them, and even extract precise information.
But after all that work, where does the gathered information go? When you ask the AsyncWebCrawler to crawl a URL using arun(), what do you actually get back?
What Problem Does CrawlResult Solve?
Imagine you sent a research assistant to the library (a website) with a set of instructions: “Find this book (URL), make a clean copy of the relevant chapter (clean HTML/Markdown), list all the cited references (links), take photos of the illustrations (media), find the author and publication date (metadata), and maybe extract specific quotes (structured data).”
When the assistant returns, they wouldn’t just hand you a single piece of paper. They’d likely give you a folder containing everything you asked for: the clean copy, the list of references, the photos, the metadata notes, and the extracted quotes, all neatly organized. They might also include a note if they encountered any problems (errors).
CrawlResult is exactly this final report folder or delivery package. It’s a single object that neatly contains all the information Crawl4AI gathered and processed for a specific URL during a crawl operation. Instead of getting lots of separate pieces of data back, you get one convenient container.
What is CrawlResult?
CrawlResult is a Python object (specifically, a Pydantic model, which is like a super-powered dictionary) that acts as a data container. It holds the result of a single crawl task performed by AsyncWebCrawler.arun(), or one of the results from arun_many().
Think of it as a toolbox filled with different tools and information related to the crawled page.
Key Information Stored in CrawlResult:
- url (string): The original URL that was requested.
- success (boolean): Did the crawl complete without critical errors? True if successful, False otherwise. Always check this first!
- html (string): The raw, original HTML source code fetched from the page.
- cleaned_html (string): The HTML after initial cleaning by the ContentScrapingStrategy (e.g., scripts and styles removed).
- markdown (object): An object containing different Markdown representations of the content.
  - markdown.raw_markdown: Basic Markdown generated from cleaned_html.
  - markdown.fit_markdown: Markdown generated only from content deemed relevant by a RelevantContentFilter (if one was used). Might be empty if no filter was applied.
  - (Other fields like markdown_with_citations might exist.)
- extracted_content (string): If you used an ExtractionStrategy, this holds the extracted structured data, usually formatted as a JSON string. None if no extraction was performed or nothing was found.
- metadata (dictionary): Information extracted from the page’s metadata tags, such as the page title (metadata['title']), description, keywords, etc.
- links (object): Contains lists of links found on the page.
  - links.internal: List of links pointing to the same website.
  - links.external: List of links pointing to other websites.
- media (object): Contains lists of media items found.
  - media.images: List of images (<img> tags).
  - media.videos: List of videos (<video> tags).
  - (Other media types might be included.)
- screenshot (string): If you requested a screenshot (screenshot=True in CrawlerRunConfig), this holds the file path to the saved image. None otherwise.
- pdf (bytes): If you requested a PDF (pdf=True in CrawlerRunConfig), this holds the PDF data as bytes. None otherwise. (Note: in earlier versions this may have been a file path; now it is typically bytes.)
- error_message (string): If success is False, this field usually contains details about what went wrong.
- status_code (integer): The HTTP status code received from the server (e.g., 200 for OK, 404 for Not Found).
- response_headers (dictionary): The HTTP response headers sent by the server.
- redirected_url (string): If the original URL redirected, this shows the final URL the crawler landed on.
Accessing the CrawlResult
You get a CrawlResult object back every time you await a call to crawler.arun():
# chapter7_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult  # CrawlResult imported for the type hint below
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
print(f"Crawling {url}...")
# The 'arun' method returns a CrawlResult object
result: CrawlResult = await crawler.arun(url=url) # Type hint optional
print("Crawl finished!")
# Now 'result' holds all the information
print(f"Result object type: {type(result)}")
if __name__ == "__main__":
asyncio.run(main())
Explanation:
- We call crawler.arun(url=url).
- The await keyword pauses execution until the crawl is complete.
- The value returned by arun is assigned to the result variable.
- This result variable is our CrawlResult object.
If you use crawler.arun_many(), it returns a list where each item is a CrawlResult object for one of the requested URLs (or an async generator of CrawlResult objects if stream=True).
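For example, a minimal sketch of the non-streaming case (the URLs are just the sample pages used elsewhere in this chapter; the only assumption is that arun_many accepts the list via its urls parameter):
# Sketch: handling the list of CrawlResult objects returned by arun_many()
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = ["https://httpbin.org/html", "https://httpbin.org/links/10/0"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)  # non-streaming mode returns a list
        for result in results:  # each item is a CrawlResult
            if result.success:
                print(f"✅ {result.url}: {len(result.html)} characters of HTML")
            else:
                print(f"❌ {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())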
Exploring the Attributes: Using the Toolbox
Once you have the result object, you can access its attributes using dot notation (e.g., result.success, result.markdown).
1. Checking for Success (Most Important!)
Before you try to use any data, always check if the crawl was successful:
# chapter7_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult # Import CrawlResult for type hint
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html" # A working URL
# url = "https://httpbin.org/status/404" # Try this URL to see failure
result: CrawlResult = await crawler.arun(url=url)
# --- ALWAYS CHECK 'success' FIRST! ---
if result.success:
print(f"✅ Successfully crawled: {result.url}")
# Now it's safe to access other attributes
print(f" Page Title: {result.metadata.get('title', 'N/A')}")
else:
print(f"❌ Failed to crawl: {result.url}")
print(f" Error: {result.error_message}")
print(f" Status Code: {result.status_code}")
if __name__ == "__main__":
asyncio.run(main())
Explanation:
- We use an if result.success: block.
- If True, we proceed to access other data like result.metadata.
- If False, we print the result.error_message and result.status_code to understand why it failed.
2. Accessing Content (HTML, Markdown)
# chapter7_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
result: CrawlResult = await crawler.arun(url=url)
if result.success:
print("--- Content ---")
# Print the first 150 chars of raw HTML
print(f"Raw HTML snippet: {result.html[:150]}...")
# Access the raw markdown
if result.markdown: # Check if markdown object exists
print(f"Markdown snippet: {result.markdown.raw_markdown[:150]}...")
else:
print("Markdown not generated.")
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
Explanation:
- We access result.html for the original HTML.
- We access result.markdown.raw_markdown for the main Markdown content. Note the two dots: result.markdown gives the MarkdownGenerationResult object, and .raw_markdown accesses the specific string within it. We also check if result.markdown: first, in case Markdown generation failed for some reason.
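If you configured a RelevantContentFilter, the filtered text lives in result.markdown.fit_markdown. A minimal sketch, meant to slot into the success branch of the example above (fit_markdown may simply be empty if no filter was configured):
# Sketch: checking the filtered Markdown variant (continues the example above)
if result.markdown and result.markdown.fit_markdown:
    print(f"Filtered Markdown snippet: {result.markdown.fit_markdown[:150]}...")
else:
    print("fit_markdown is empty (no RelevantContentFilter applied or nothing deemed relevant).")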
3. Getting Metadata, Links, and Media
# chapter7_example_4.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/links/10/0" # A page with links
result: CrawlResult = await crawler.arun(url=url)
if result.success:
print("--- Metadata & Links ---")
print(f"Title: {result.metadata.get('title', 'N/A')}")
print(f"Found {len(result.links.internal)} internal links.")
print(f"Found {len(result.links.external)} external links.")
if result.links.internal:
print(f" First internal link text: '{result.links.internal[0].text}'")
# Similarly access result.media.images etc.
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
Explanation:
- result.metadata is a dictionary; use .get() for safe access.
- result.links and result.media are objects containing lists (internal, external, images, etc.). We can check their lengths (len()) and access individual items by index (e.g., [0]).
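result.media follows the same shape as result.links. Here is a short sketch that lists the first few images, meant to slot into the success branch of the example above; it assumes image entries expose a src attribute, mirroring how the link example reads .text:
# Sketch: listing images found on the page (continues the example above)
print(f"Found {len(result.media.images)} images.")
for image in result.media.images[:3]:
    # Assumption: each image entry exposes 'src', mirroring how link entries
    # expose 'text' and 'href' in the example above.
    print(f"  Image source: {image.src}")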
4. Checking for Extracted Data, Screenshots, PDFs
# chapter7_example_5.py
import asyncio
import json
from crawl4ai import (
AsyncWebCrawler, CrawlResult, CrawlerRunConfig,
JsonCssExtractionStrategy # Example extractor
)
async def main():
# Define a simple extraction strategy (from Chapter 6)
schema = {"baseSelector": "body", "fields": [{"name": "heading", "selector": "h1", "type": "text"}]}
extractor = JsonCssExtractionStrategy(schema=schema)
# Configure the run to extract and take a screenshot
config = CrawlerRunConfig(
extraction_strategy=extractor,
screenshot=True
)
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
result: CrawlResult = await crawler.arun(url=url, config=config)
if result.success:
print("--- Extracted Data & Media ---")
# Check if structured data was extracted
if result.extracted_content:
print("Extracted Data found:")
data = json.loads(result.extracted_content) # Parse the JSON string
print(json.dumps(data, indent=2))
else:
print("No structured data extracted.")
# Check if a screenshot was taken
if result.screenshot:
print(f"Screenshot saved to: {result.screenshot}")
else:
print("Screenshot not taken.")
# Check for PDF (would be bytes if requested and successful)
if result.pdf:
print(f"PDF data captured ({len(result.pdf)} bytes).")
else:
print("PDF not generated.")
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
Explanation:
- We check that result.extracted_content is not None or empty before trying to parse it as JSON.
- We check that result.screenshot is not None to see whether a screenshot file path was recorded.
- We check that result.pdf is not None to see whether the PDF data (bytes) was captured.
How is CrawlResult Created? (Under the Hood)
You don’t interact with the CrawlResult constructor directly. The AsyncWebCrawler creates it for you at the very end of the arun process, typically inside its internal aprocess_html method (or just before returning, if the result comes from the cache).
Here’s a simplified sequence (a rough code sketch follows the list):
- Fetch: AsyncWebCrawler calls the AsyncCrawlerStrategy to get the raw html, status_code, response_headers, etc.
- Scrape: It passes the html to the ContentScrapingStrategy to get cleaned_html, links, media, and metadata.
- Markdown: It generates Markdown using the configured generator, possibly involving a RelevantContentFilter, resulting in a MarkdownGenerationResult object.
- Extract (Optional): If an ExtractionStrategy is configured, it runs on the appropriate content (HTML or Markdown) to produce extracted_content.
- Screenshot/PDF (Optional): If requested, the fetching strategy captures the screenshot path or pdf data.
- Package: AsyncWebCrawler gathers all these pieces (url, html, cleaned_html, the markdown object, links, media, metadata, extracted_content, screenshot, pdf, the success status, error_message, etc.).
- Instantiate: It creates the CrawlResult object, passing all the gathered data into its constructor.
- Return: It returns this fully populated CrawlResult object to your code.
Code Glimpse (models.py)
CrawlResult is defined in the crawl4ai/models.py file. It uses Pydantic, a library that helps define data structures with type hints and validation. Here’s a simplified view:
# Simplified from crawl4ai/models.py
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Any
# Other related models (simplified)
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
fit_markdown: Optional[str] = None
# ... other markdown fields ...
class Links(BaseModel):
internal: List[Dict] = []
external: List[Dict] = []
class Media(BaseModel):
images: List[Dict] = []
videos: List[Dict] = []
# The main CrawlResult model
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Media = Media() # Use the Media model
links: Links = Links() # Use the Links model
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
# Uses a private attribute and property for markdown for compatibility
_markdown: Optional[MarkdownGenerationResult] = None # Actual storage
extracted_content: Optional[str] = None # JSON string
metadata: Optional[Dict[str, Any]] = None
error_message: Optional[str] = None
status_code: Optional[int] = None
response_headers: Optional[Dict[str, str]] = None
redirected_url: Optional[str] = None
# ... other fields like session_id, ssl_certificate ...
# Custom property to access markdown data
@property
def markdown(self) -> Optional[MarkdownGenerationResult]:
return self._markdown
# Configuration for Pydantic
class Config:
arbitrary_types_allowed = True
# Custom init and model_dump might exist for backward compatibility handling
# ... (omitted for simplicity) ...
Explanation:
- It’s defined as a class CrawlResult(BaseModel):.
- Each attribute (like url, html, success) is defined with a type hint (like str, bool, Optional[str]). Optional[str] means the field can be a string or None.
- Some attributes are themselves complex objects defined by other Pydantic models (like media: Media, links: Links).
- The markdown field uses a common pattern (a property wrapping a private attribute) to provide the MarkdownGenerationResult object while maintaining some backward compatibility. You access it simply as result.markdown.
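The property-over-a-private-attribute pattern mentioned above is plain Python/Pydantic. Here is a minimal, standalone sketch (the class and field names are invented for illustration and unrelated to Crawl4AI’s actual model) showing how it works:
# Sketch: the "property wrapping a private attribute" pattern in Pydantic
from typing import Optional
from pydantic import BaseModel, PrivateAttr

class Report(BaseModel):
    title: str
    _summary: Optional[str] = PrivateAttr(default=None)  # not part of the public schema

    def attach_summary(self, text: str) -> None:
        self._summary = text

    @property
    def summary(self) -> Optional[str]:
        # Callers read report.summary just like a normal field
        return self._summary

report = Report(title="Example")
report.attach_summary("Two short paragraphs.")
print(report.summary)  # -> "Two short paragraphs."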
Conclusion
You’ve now met the CrawlResult object – the final, comprehensive report delivered by Crawl4AI after processing a URL.
- It acts as a container holding all gathered information (HTML, Markdown, metadata, links, media, extracted data, errors, etc.).
- It’s the return value of AsyncWebCrawler.arun(), and each item returned by arun_many().
- The most crucial attribute is success (boolean), which you should always check first.
- You can easily access all the different pieces of information using dot notation (e.g., result.metadata['title'], result.markdown.raw_markdown, result.links.external).
Understanding the CrawlResult object is key to effectively using the information Crawl4AI provides.
So far, we’ve focused on crawling single pages or lists of specific URLs. But what if you want to start at one page and automatically discover and crawl linked pages, exploring a website more deeply?
Next: Let’s explore how to perform multi-page crawls with Chapter 8: Exploring Websites - DeepCrawlStrategy.