
Web Scraper

Extract content from websites without external APIs or credentials.

Provider: Built-in
Authentication: None required
Category: Web Scraping
Credit Cost: 1 credit per request

Overview

Web Scraper tools provide simple, fast webpage content extraction without requiring API keys or external services. They are well suited to quick scraping tasks, documentation extraction, and content aggregation.

For advanced scraping with JavaScript rendering and AI extraction, see Firecrawl.

Available Tools

Scrape Webpage

Extract comprehensive structured data from a webpage including HTML, links, images, and metadata.

Tool ID: web_scraper_Scrape_Webpage
Credit Cost: 1 credit

Parameters:

  • url (string, required): URL of the webpage to scrape
  • include_links (boolean, optional): Include all links found on the page
    • Default: true
  • include_images (boolean, optional): Include all images found on the page
    • Default: true

Response:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "html": "<!DOCTYPE html><html>...</html>",
  "markdown": "# Example Domain\n\nThis domain is for use...",
  "links": {
    "internal": ["/about", "/contact"],
    "external": ["https://www.iana.org/domains/example"]
  },
  "media": {
    "images": [
      {
        "src": "https://example.com/logo.png",
        "alt": "Company Logo"
      }
    ]
  },
  "metadata": {
    "description": "Example Domain",
    "keywords": "example, domain",
    "author": "IANA"
  }
}

Example Usage:

# Python
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://example.com",
        "include_links": True,
        "include_images": True
    }
)

print(f"Title: {response['title']}")
print(f"Links found: {len(response['links']['internal']) + len(response['links']['external'])}")
print(f"Images found: {len(response['media']['images'])}")

// TypeScript
const response = await client.callTool({
  name: "web_scraper_Scrape_Webpage",
  arguments: {
    url: "https://example.com/article"
  }
});

// Get clean markdown content
const content = response.markdown;

Use Cases:

  • Extract blog post content
  • Gather links from documentation
  • Collect image URLs from galleries
  • Parse article metadata
  • Build sitemaps

Get Website Markdown

Extract clean markdown content from a webpage.

Tool ID: web_scraper_Get_Website_Markdown
Credit Cost: 1 credit

Parameters:

  • url (string, required): URL of the webpage to extract

Response:

{
  "success": true,
  "url": "https://example.com/article",
  "title": "How to Use MCP Servers",
  "markdown": "# How to Use MCP Servers\n\nMCP (Model Context Protocol) is...\n\n## Getting Started\n\n1. Install the MCP server\n2. Configure your IDE..."
}

Example Usage:

# Python - Extract article content
response = client.call_tool(
    name="web_scraper_Get_Website_Markdown",
    arguments={"url": "https://blog.example.com/post-123"}
)

# Save to file
with open("article.md", "w") as f:
    f.write(response["markdown"])

// TypeScript - Quick content extraction
const response = await client.callTool({
  name: "web_scraper_Get_Website_Markdown",
  arguments: {
    url: "https://docs.example.com/guide"
  }
});

// Use markdown directly
console.log(response.markdown);

Use Cases:

  • Convert web pages to markdown for documentation
  • Extract blog posts for analysis
  • Save articles for offline reading
  • Build knowledge bases from web content
  • Feed content to AI for summarization

Summarize Webpage

Get an AI-generated summary of a webpage's content.

Tool ID: web_scraper_Summarize_Webpage
Credit Cost: 1 credit (base) + LLM token costs

Parameters:

  • url (string, required): URL of the webpage to summarize
  • max_length (integer, optional): Maximum summary length in words
    • Default: 500
  • model (string, optional): LLM model to use for summarization
    • Default: "anthropic/claude-sonnet-4-5"
    • Options:
      • "openai/gpt-5"
      • "openai/gpt-5-mini"
      • "openai/gpt-5-nano"
      • "anthropic/claude-sonnet-4-5"
      • "anthropic/claude-haiku-4-5"
      • "google/gemini-2.5-pro"
      • "google/gemini-2.5-flash"
  • use_thinking (boolean, optional): Enable extended thinking (Anthropic only)
    • Default: true
  • thinking_budget (integer, optional): Maximum thinking tokens
    • Default: 10000

Response:

{
  "success": true,
  "url": "https://example.com/article",
  "summary": "This article explains how to set up MCP servers for AI applications. It covers installation, configuration, and common use cases. Key points include connecting to IDEs, managing credentials, and best practices for server management.",
  "word_count": 42
}

Example Usage:

# Python - Summarize article
response = client.call_tool(
    name="web_scraper_Summarize_Webpage",
    arguments={
        "url": "https://news.example.com/long-article",
        "max_length": 200
    }
)

print(f"Summary ({response['word_count']} words):")
print(response["summary"])

// TypeScript - Quick summary with specific model
const response = await client.callTool({
  name: "web_scraper_Summarize_Webpage",
  arguments: {
    url: "https://research.example.com/paper",
    max_length: 300,
    model: "anthropic/claude-haiku-4-5"
  }
});

Use Cases:

  • Research paper summaries
  • News article digests
  • Documentation overviews
  • Content curation
  • Quick content analysis

Credentials:

  • Optional: OpenAI, Anthropic, or Google credentials
  • If not provided, platform credentials are used
  • When using user credentials: only base cost charged
  • When using platform credentials: base cost + LLM token usage charged

Common Patterns

Documentation Aggregation

# Scrape multiple documentation pages
docs_urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/examples"
]

all_content = []
for url in docs_urls:
    response = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )
    all_content.append({
        "title": response["title"],
        "content": response["markdown"],
        "url": url
    })

# Combine into single document
combined = "\n\n---\n\n".join([
    f"# {doc['title']}\n\n{doc['content']}"
    for doc in all_content
])
Link Validation

# Find all links on a page and check them
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://example.com",
        "include_links": True
    }
)

all_links = response["links"]["internal"] + response["links"]["external"]

# Check each link
for link in all_links:
    check = client.call_tool(
        name="HTTPS_Call",
        arguments={
            "method": "HEAD",
            "url": link,
            "timeout": 5
        }
    )

    if check["status_code"] != 200:
        print(f"Broken link: {link}")

Content Summarization Pipeline

# Scrape, analyze, and store summaries
def process_article(url):
    # Get full content
    content = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )

    # Generate summary
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 150
        }
    )

    # Count words
    word_count = client.call_tool(
        name="text_Count_Words",
        arguments={"text": content["markdown"]}
    )

    # Return record for storage in a database
    return {
        "url": url,
        "title": content["title"],
        "summary": summary["summary"],
        "word_count": word_count["word_count"],
        "full_content": content["markdown"]
    }

Image Collection

# Collect all images from a page
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://gallery.example.com",
        "include_images": True
    }
)

images = response["media"]["images"]

# Filter for likely high-resolution variants by URL
high_res_images = [
    img for img in images
    if "high-res" in img.get("src", "") or "large" in img.get("src", "")
]

print(f"Found {len(high_res_images)} high-resolution images")

Comparison: Web Scraper vs Firecrawl

| Feature        | Web Scraper               | Firecrawl                            |
|----------------|---------------------------|--------------------------------------|
| Cost           | 1 credit                  | 3 credits                            |
| Authentication | None                      | API Key required                     |
| JavaScript     | Static HTML only          | Full JavaScript rendering            |
| Speed          | Very fast (< 1s)          | Slower (5-15s)                       |
| AI Extraction  | No                        | Yes, with custom schemas             |
| Crawling       | Single page               | Multi-page crawling                  |
| Best For       | Simple pages, blogs, docs | SPAs, dynamic content, complex sites |

Use Web Scraper when:

  • Page is static HTML
  • Need fast results
  • Don't want to manage API keys
  • Scraping simple content

Use Firecrawl when:

  • Page requires JavaScript
  • Need structured data extraction
  • Crawling multiple pages
  • Working with complex modern websites
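The choice above can be sketched as a tiny helper. This is illustrative only; the flag names are ours, not part of either tool's API:

```python
def pick_scraper(requires_js: bool = False,
                 needs_structured_extraction: bool = False,
                 multi_page: bool = False) -> str:
    """Suggest a tool based on the comparison table above."""
    if requires_js or needs_structured_extraction or multi_page:
        return "firecrawl"
    return "web_scraper"

pick_scraper()                  # static blog post -> "web_scraper"
pick_scraper(requires_js=True)  # single-page app  -> "firecrawl"
```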

Best Practices

Performance

  • Web scraper is very fast (typically < 1 second)
  • Cache results to avoid repeated requests
  • Batch multiple URLs efficiently
  • No rate limiting on Reeva side
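Caching can be as simple as memoizing on the URL. A minimal sketch follows; the fetch body is a placeholder (the real call would go through `client.call_tool`, which is not available here), and the counter exists only to show the cache working:

```python
from functools import lru_cache

call_count = 0  # demonstrates that repeat URLs don't trigger new requests

@lru_cache(maxsize=128)
def get_markdown(url: str) -> str:
    """Fetch markdown for a URL, cached by URL.

    Placeholder body; in practice this would be:
    client.call_tool(name="web_scraper_Get_Website_Markdown",
                     arguments={"url": url})["markdown"]
    """
    global call_count
    call_count += 1
    return f"# Page at {url}"

first = get_markdown("https://example.com")
second = get_markdown("https://example.com")  # served from cache
```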

Content Quality

  • Works best with static HTML pages
  • May miss content loaded by JavaScript
  • Clean, semantic HTML produces better markdown
  • Metadata extraction depends on page structure

Ethics and Legality

  • Respect robots.txt
  • Check website's terms of service
  • Don't overload servers with requests
  • Add delays between bulk scraping
  • Attribute content to original source
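The first and fourth points can be combined in a small pre-flight check using the standard library's `urllib.robotparser`. This is a sketch: in practice you would load the live file with `rp.set_url(...)` and `rp.read()`; the rules are supplied inline here so the example runs offline:

```python
import time
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules (inline here; normally fetched from the site)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

urls = ["https://example.com/blog", "https://example.com/private/admin"]
allowed = []
for url in urls:
    if not rp.can_fetch("*", url):
        continue  # respect the site's rules
    allowed.append(url)
    # ... the scrape call would go here ...
    time.sleep(1)  # pause between requests so we don't overload the server

print(allowed)
```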

Troubleshooting

"Failed to scrape" Error

Cause: Unable to fetch or parse webpage

Solutions:

  • Verify URL is correct and accessible
  • Check if page requires JavaScript (use Firecrawl instead)
  • Ensure page is not behind authentication
  • Try the page in a browser first

Missing Content

Cause: Content loaded dynamically by JavaScript

Solutions:

  • Use Firecrawl for JavaScript-heavy sites
  • Check if page has a print view or API
  • Look for alternative data sources

Slow Scraping

Cause: Large pages or slow website response

Solutions:

  • This is expected for large pages
  • Consider using Get_Website_Markdown for faster, lighter extraction
  • Check website's server response time
  • Try during off-peak hours

Markdown Formatting Issues

Cause: Complex HTML structure

Solutions:

  • Markdown conversion works best with semantic HTML
  • Some complex layouts may not convert perfectly
  • Use the full HTML response for precise parsing
  • Consider post-processing the markdown
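As a starting point for post-processing, a small pure-Python pass handles the most common conversion artifacts (trailing whitespace, runs of blank lines). This is our own sketch, not part of the tool:

```python
import re

def tidy_markdown(md: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    md = "\n".join(line.rstrip() for line in md.splitlines())
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip() + "\n"

raw = "# Title   \n\n\n\nSome text.  \n"
print(tidy_markdown(raw))  # "# Title\n\nSome text.\n"
```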

Integration Examples

Example 1: Blog Content Aggregator

# Aggregate blog posts and store in Notion
from datetime import datetime

blog_urls = get_blog_urls()

for url in blog_urls:
    # Extract content
    content = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )

    # Summarize
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 200
        }
    )

    # Create Notion page
    notion.call_tool(
        name="notion_create_page",
        arguments={
            "title": content["title"],
            "content": content["markdown"],
            "properties": {
                "Summary": summary["summary"],
                "Source URL": url,
                "Scraped Date": datetime.now().isoformat()
            }
        }
    )

Example 2: Documentation Mirror

# Create local mirror of documentation
from urllib.parse import urljoin

def mirror_docs(base_url):
    # Get main page
    main = client.call_tool(
        name="web_scraper_Scrape_Webpage",
        arguments={"url": base_url}
    )

    # Extract all internal links
    internal_links = main["links"]["internal"]

    # Scrape each page
    for link in internal_links:
        full_url = urljoin(base_url, link)

        page = client.call_tool(
            name="web_scraper_Get_Website_Markdown",
            arguments={"url": full_url}
        )

        # Save locally
        filename = link.strip("/").replace("/", "_") + ".md"
        save_file(filename, page["markdown"])

Example 3: News Digest

# Create daily news digest
news_sites = [
    "https://news1.com/tech",
    "https://news2.com/ai",
    "https://news3.com/startups"
]

digest = []
for url in news_sites:
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 100
        }
    )

    digest.append(f"### {url}\n{summary['summary']}\n")

# Send digest
send_email(
    subject="Daily Tech News Digest",
    body="\n\n".join(digest)
)

See Also