PlaywrightScraperAgent¶

The PlaywrightScraperAgent is a web scraping agent that uses the Playwright framework to automate browser interactions and extract content from web pages.

Purpose¶

This agent provides automated web scraping within a Rustic AI guild, retrieving web content for further processing. It handles requests to scrape one or more URLs, stores each page in the requested format, and returns links to the stored content.

When to Use¶

Use the PlaywrightScraperAgent when your application needs to:

Extract content from web pages
Store web content for later analysis
Convert HTML content to more processable formats like Markdown
Get web content for LLM processing or other AI tasks

Dependencies¶

The PlaywrightScraperAgent requires:

filesystem (Guild-level dependency): A file system implementation for storing scraped content

Message Types¶

Input Messages¶

WebScrapingRequest¶

A request to scrape web pages.

class WebScrapingRequest(BaseModel):
    id: str  # ID of the request
    links: List[MediaLink]  # URLs to scrape
    output_format: ScrapingOutputFormat = ScrapingOutputFormat.TEXT_HTML  # Output format
    transformer_options: JsonDict = {}  # Options for transforming the content

The ScrapingOutputFormat enum supports:

TEXT_HTML: Returns the content as HTML
MARKDOWN: Converts the HTML to Markdown before returning
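To give a feel for what the MARKDOWN option implies, here is a minimal sketch that converts HTML headings and paragraphs to Markdown using only the Python standard library. It is not the agent's actual transformer: `MiniMarkdownParser` and `html_to_markdown` are hypothetical names, and the way `transformer_options` influences the real conversion is not covered.

```python
from html.parser import HTMLParser

# Markdown prefixes for the heading tags this sketch understands.
HEADING_TAGS = {f"h{level}": "#" * level for level in range(1, 7)}


class MiniMarkdownParser(HTMLParser):
    """Hypothetical converter: handles only headings and paragraphs."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self._prefix = None  # prefix of the block being read, or None if outside one
        self._text = []      # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._prefix, self._text = HEADING_TAGS[tag] + " ", []
        elif tag == "p":
            self._prefix, self._text = "", []

    def handle_data(self, data):
        # Text outside headings and paragraphs is ignored.
        if self._prefix is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prefix is not None and (tag in HEADING_TAGS or tag == "p"):
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.blocks.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html: str) -> str:
    """Return the headings and paragraphs of `html` as Markdown blocks."""
    parser = MiniMarkdownParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Inline markup such as bold text is flattened to plain text here; a production converter would also need to handle links, lists, tables, and code.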

Output Messages¶

MediaLink¶

For each URL successfully scraped, a MediaLink message is emitted that points to the stored content:

class MediaLink(BaseModel):
    url: str  # Path to the scraped file
    name: str  # Filename
    metadata: Dict  # Metadata including original URL, title, etc.
    on_filesystem: bool  # Always True for scraped content
    mimetype: str  # Content type
    encoding: str  # Always "utf-8"

WebScrapingCompleted¶

Sent when all requested URLs have been processed:

class WebScrapingCompleted(BaseModel):
    id: str  # ID of the original request
    links: List[MediaLink]  # List of all successfully scraped URLs

ErrorMessage¶

Sent when scraping fails for a specific URL:

class ErrorMessage(BaseModel):
    agent_type: str
    error_type: str
    error_message: str

Behavior¶

The agent launches a headless Chrome browser using Playwright
For each URL in the request:
    Navigates to the URL
    Checks for successful HTTP status (200)
    Extracts the page content
    Transforms the content (to Markdown if requested)
    Stores the content to the filesystem
    Emits a MediaLink message for the stored content
After processing all URLs, it emits a WebScrapingCompleted message

The scraped content is saved in the scraped_data/ directory with filenames based on a hash of the content.
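The flow described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's implementation: `fetch_with_playwright`, `content_filename`, and `scrape_all` are hypothetical names, SHA-256 is an assumed hash choice, the sketch writes to a local directory rather than the guild filesystem dependency, and it launches a browser per URL for simplicity. Transformation and message emission are omitted.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def fetch_with_playwright(url: str) -> Tuple[int, str]:
    """Fetch a page in headless Chromium and return (status, html)."""
    # Imported lazily so the rest of the sketch runs without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            response = page.goto(url)
            # goto() can return None (for example on about:blank); treat as failure.
            status = response.status if response is not None else 0
            return status, page.content()
        finally:
            browser.close()


def content_filename(content: str, extension: str) -> str:
    """Name a file after a hash of its content (SHA-256 is an assumption)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{digest}.{extension}"


def scrape_all(
    urls: Iterable[str],
    base_dir: str,
    fetch: Callable[[str], Tuple[int, str]] = fetch_with_playwright,
) -> Tuple[List[Path], List[str]]:
    """Store each page that returns HTTP 200; collect an error string otherwise."""
    out_dir = Path(base_dir) / "scraped_data"
    out_dir.mkdir(parents=True, exist_ok=True)
    stored, errors = [], []
    for url in urls:
        status, html = fetch(url)
        if status != 200:
            errors.append(f"{url}: HTTP {status}")
            continue
        path = out_dir / content_filename(html, "html")
        path.write_text(html, encoding="utf-8")
        stored.append(path)
    return stored, errors
```

Passing the fetch function as a parameter keeps the storage logic testable without a browser: a stub that returns a fixed status and HTML string can stand in for Playwright.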

Sample Usage¶

from rustic_ai.core.guild.builders import AgentBuilder
from rustic_ai.core.guild.agent_ext.depends.dependency_resolver import DependencySpec
from rustic_ai.playwright.agent import PlaywrightScraperAgent

# Define a file system dependency
filesystem = DependencySpec(
    class_name="rustic_ai.core.guild.agent_ext.depends.filesystem.FileSystemResolver",
    properties={
        "path_base": "/tmp",
        "protocol": "file",
        "storage_options": {
            "auto_mkdir": True,
        },
    },
)

# Create the agent spec
playwright_agent_spec = (
    AgentBuilder(PlaywrightScraperAgent)
    .set_id("web_scraper")
    .set_name("Web Scraper")
    .set_description("Scrapes web content using Playwright")
    .build_spec()
)

# Add the dependency and agent spec to an existing guild builder (guild_builder) before launching
guild_builder.add_dependency("filesystem", filesystem)
guild_builder.add_agent_spec(playwright_agent_spec)

Example Request¶

from rustic_ai.core.agents.commons.media import MediaLink
from rustic_ai.playwright.agent import WebScrapingRequest, ScrapingOutputFormat

# Create a request to scrape two URLs and convert to Markdown
request = WebScrapingRequest(
    id="scrape-request-1",  # Required request ID; this value is an arbitrary example
    links=[
        MediaLink(url="https://example.com"),
        MediaLink(url="https://rustic.ai/docs"),
    ],
    output_format=ScrapingOutputFormat.MARKDOWN,
)

# Send to the agent via the messaging system (client is assumed to be an existing guild client)
client.publish("default_topic", request)

Notes and Limitations¶

The agent requires a working installation of Playwright and its browser binaries (installable with the playwright install command)
For security, run the agent with appropriate sandboxing, since it loads untrusted web content
Consider rate limiting and respecting robots.txt for production use
Large pages may consume significant memory
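
The robots.txt and rate-limiting advice above could be applied before URLs are placed in a WebScrapingRequest. The following is a minimal sketch using only the Python standard library; `build_robots_checker` and `MinIntervalLimiter` are hypothetical helpers, and nothing here is built into the agent.

```python
import time
from typing import Callable
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Return a function reporting whether `user_agent` may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


class MinIntervalLimiter:
    """Hypothetical limiter: enforces a minimum delay between calls."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

In use, the robots.txt text would be downloaded once per host, disallowed URLs filtered out, and `wait()` called before each request is submitted.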