
Web Scraper

Extract content from websites without external APIs or credentials.

Provider: Built-in
Authentication: None required
Category: Web Scraping
Credit Cost: 1 credit per request

Overview

Web Scraper tools provide simple, fast webpage content extraction without requiring API keys or external services. They are well suited to quick scraping tasks, documentation extraction, and content aggregation.

For advanced scraping with JavaScript rendering and AI extraction, see Firecrawl.

Available Tools

Scrape Webpage

Extract comprehensive structured data from a webpage including HTML, links, images, and metadata.

Tool ID: web_scraper_Scrape_Webpage
Credit Cost: 1 credit

Parameters:

  • url (string, required): URL of the webpage to scrape
  • include_links (boolean, optional): Include all links found on the page
    • Default: true
  • include_images (boolean, optional): Include all images found on the page
    • Default: true

Response:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "html": "<!DOCTYPE html><html>...</html>",
  "markdown": "# Example Domain\n\nThis domain is for use...",
  "links": {
    "internal": ["/about", "/contact"],
    "external": ["https://www.iana.org/domains/example"]
  },
  "media": {
    "images": [
      {
        "src": "https://example.com/logo.png",
        "alt": "Company Logo"
      }
    ]
  },
  "metadata": {
    "description": "Example Domain",
    "keywords": "example, domain",
    "author": "IANA"
  }
}

Example Usage:

# Python
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://example.com",
        "include_links": True,
        "include_images": True
    }
)

print(f"Title: {response['title']}")
print(f"Links found: {len(response['links']['internal']) + len(response['links']['external'])}")
print(f"Images found: {len(response['media']['images'])}")

// TypeScript
const response = await client.callTool({
  name: "web_scraper_Scrape_Webpage",
  arguments: {
    url: "https://example.com/article"
  }
});

// Get clean markdown content
const content = response.markdown;

Use Cases:

  • Extract blog post content
  • Gather links from documentation
  • Collect image URLs from galleries
  • Parse article metadata
  • Build sitemaps

Get Website Markdown

Extract clean markdown content from a webpage.

Tool ID: web_scraper_Get_Website_Markdown
Credit Cost: 1 credit

Parameters:

  • url (string, required): URL of the webpage to extract

Response:

{
  "success": true,
  "url": "https://example.com/article",
  "title": "How to Use MCP Servers",
  "markdown": "# How to Use MCP Servers\n\nMCP (Model Context Protocol) is...\n\n## Getting Started\n\n1. Install the MCP server\n2. Configure your IDE..."
}

Example Usage:

# Python - Extract article content
response = client.call_tool(
    name="web_scraper_Get_Website_Markdown",
    arguments={"url": "https://blog.example.com/post-123"}
)

# Save to file
with open("article.md", "w") as f:
    f.write(response["markdown"])

// TypeScript - Quick content extraction
const response = await client.callTool({
  name: "web_scraper_Get_Website_Markdown",
  arguments: {
    url: "https://docs.example.com/guide"
  }
});

// Use markdown directly
console.log(response.markdown);

Use Cases:

  • Convert web pages to markdown for documentation
  • Extract blog posts for analysis
  • Save articles for offline reading
  • Build knowledge bases from web content
  • Feed content to AI for summarization

Summarize Webpage

Get an AI-generated summary of a webpage's content.

Tool ID: web_scraper_Summarize_Webpage
Credit Cost: 1 credit (base) + LLM token costs

Parameters:

  • url (string, required): URL of the webpage to summarize
  • max_length (integer, optional): Maximum summary length in words
    • Default: 500
  • model (string, optional): LLM model to use for summarization
    • Default: "anthropic/claude-sonnet-4-5"
    • Options:
      • "openai/gpt-5"
      • "openai/gpt-5-mini"
      • "openai/gpt-5-nano"
      • "anthropic/claude-sonnet-4-5"
      • "anthropic/claude-haiku-4-5"
      • "google/gemini-2.5-pro"
      • "google/gemini-2.5-flash"
  • use_thinking (boolean, optional): Enable extended thinking (Anthropic only)
    • Default: true
  • thinking_budget (integer, optional): Maximum thinking tokens
    • Default: 10000

Response:

{
  "success": true,
  "url": "https://example.com/article",
  "summary": "This article explains how to set up MCP servers for AI applications. It covers installation, configuration, and common use cases. Key points include connecting to IDEs, managing credentials, and best practices for server management.",
  "word_count": 42
}

Example Usage:

# Python - Summarize article
response = client.call_tool(
    name="web_scraper_Summarize_Webpage",
    arguments={
        "url": "https://news.example.com/long-article",
        "max_length": 200
    }
)

print(f"Summary ({response['word_count']} words):")
print(response["summary"])

// TypeScript - Quick summary with specific model
const response = await client.callTool({
  name: "web_scraper_Summarize_Webpage",
  arguments: {
    url: "https://research.example.com/paper",
    max_length: 300,
    model: "anthropic/claude-haiku-4-5"
  }
});

Use Cases:

  • Research paper summaries
  • News article digests
  • Documentation overviews
  • Content curation
  • Quick content analysis

Credentials:

  • Optional: OpenAI, Anthropic, or Google credentials
  • If not provided, platform credentials are used
  • When using user credentials: only base cost charged
  • When using platform credentials: base cost + LLM token usage charged

Common Patterns

Documentation Aggregation

# Scrape multiple documentation pages
docs_urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/examples"
]

all_content = []
for url in docs_urls:
    response = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )
    all_content.append({
        "title": response["title"],
        "content": response["markdown"],
        "url": url
    })

# Combine into single document
combined = "\n\n---\n\n".join([
    f"# {doc['title']}\n\n{doc['content']}"
    for doc in all_content
])
Link Validation

# Find all links on a page and check them
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://example.com",
        "include_links": True
    }
)

all_links = response["links"]["internal"] + response["links"]["external"]

# Check each link
for link in all_links:
    check = client.call_tool(
        name="HTTPS_Call",
        arguments={
            "method": "HEAD",
            "url": link,
            "timeout": 5
        }
    )

    if check["status_code"] != 200:
        print(f"Broken link: {link}")

Content Summarization Pipeline

# Scrape, analyze, and store summaries
def process_article(url):
    # Get full content
    content = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )

    # Generate summary
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 150
        }
    )

    # Count words
    word_count = client.call_tool(
        name="text_Count_Words",
        arguments={"text": content["markdown"]}
    )

    # Return record for storage in a database
    return {
        "url": url,
        "title": content["title"],
        "summary": summary["summary"],
        "word_count": word_count["word_count"],
        "full_content": content["markdown"]
    }

Image Collection

# Collect all images from a page
response = client.call_tool(
    name="web_scraper_Scrape_Webpage",
    arguments={
        "url": "https://gallery.example.com",
        "include_images": True
    }
)

images = response["media"]["images"]

# Filter for likely high-resolution variants by URL
high_res_images = [
    img for img in images
    if "high-res" in img.get("src", "") or "large" in img.get("src", "")
]

print(f"Found {len(high_res_images)} high-resolution images")

Comparison: Web Scraper vs Firecrawl

| Feature        | Web Scraper               | Firecrawl                            |
|----------------|---------------------------|--------------------------------------|
| Cost           | 1 credit                  | 3 credits                            |
| Authentication | None                      | API Key required                     |
| JavaScript     | Static HTML only          | Full JavaScript rendering            |
| Speed          | Very fast (< 1s)          | Slower (5-15s)                       |
| AI Extraction  | No                        | Yes, with custom schemas             |
| Crawling       | Single page               | Multi-page crawling                  |
| Best For       | Simple pages, blogs, docs | SPAs, dynamic content, complex sites |

Use Web Scraper when:

  • Page is static HTML
  • Need fast results
  • Don't want to manage API keys
  • Scraping simple content

Use Firecrawl when:

  • Page requires JavaScript
  • Need structured data extraction
  • Crawling multiple pages
  • Working with complex modern websites
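The choice above can be sketched as a tiny helper. This is illustrative only; the flag names are ours, not part of either tool's API:

```python
def pick_scraper(requires_js: bool = False,
                 needs_structured_extraction: bool = False,
                 multi_page: bool = False) -> str:
    """Suggest a tool based on the comparison table above."""
    if requires_js or needs_structured_extraction or multi_page:
        return "firecrawl"
    return "web_scraper"

pick_scraper()                  # static blog post -> "web_scraper"
pick_scraper(requires_js=True)  # single-page app  -> "firecrawl"
```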

Best Practices

Performance

  • Web scraper is very fast (typically < 1 second)
  • Cache results to avoid repeated requests
  • Batch multiple URLs efficiently
  • No rate limiting on Reeva side
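Caching can be as simple as memoizing on the URL. A minimal sketch follows; the fetch body is a placeholder (the real call would go through `client.call_tool`, which is not available here), and the counter exists only to show the cache working:

```python
from functools import lru_cache

call_count = 0  # demonstrates that repeat URLs don't trigger new requests

@lru_cache(maxsize=128)
def get_markdown(url: str) -> str:
    """Fetch markdown for a URL, cached by URL.

    Placeholder body; in practice this would be:
    client.call_tool(name="web_scraper_Get_Website_Markdown",
                     arguments={"url": url})["markdown"]
    """
    global call_count
    call_count += 1
    return f"# Page at {url}"

first = get_markdown("https://example.com")
second = get_markdown("https://example.com")  # served from cache
```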

Content Quality

  • Works best with static HTML pages
  • May miss content loaded by JavaScript
  • Clean, semantic HTML produces better markdown
  • Metadata extraction depends on page structure

Ethics and Legality

  • Respect robots.txt
  • Check website's terms of service
  • Don't overload servers with requests
  • Add delays between bulk scraping
  • Attribute content to original source
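The first and fourth points can be combined in a small pre-flight check using the standard library's `urllib.robotparser`. This is a sketch: in practice you would load the live file with `rp.set_url(...)` and `rp.read()`; the rules are supplied inline here so the example runs offline:

```python
import time
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules (inline here; normally fetched from the site)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

urls = ["https://example.com/blog", "https://example.com/private/admin"]
allowed = []
for url in urls:
    if not rp.can_fetch("*", url):
        continue  # respect the site's rules
    allowed.append(url)
    # ... the scrape call would go here ...
    time.sleep(1)  # pause between requests so we don't overload the server

print(allowed)
```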

Troubleshooting

"Failed to scrape" Error

Cause: Unable to fetch or parse webpage

Solutions:

  • Verify URL is correct and accessible
  • Check if page requires JavaScript (use Firecrawl instead)
  • Ensure page is not behind authentication
  • Try the page in a browser first

Missing Content

Cause: Content loaded dynamically by JavaScript

Solutions:

  • Use Firecrawl for JavaScript-heavy sites
  • Check if page has a print view or API
  • Look for alternative data sources

Slow Scraping

Cause: Large pages or slow website response

Solutions:

  • This is expected for large pages
  • Consider using Get_Website_Markdown for faster, lighter extraction
  • Check website's server response time
  • Try during off-peak hours

Markdown Formatting Issues

Cause: Complex HTML structure

Solutions:

  • Markdown conversion works best with semantic HTML
  • Some complex layouts may not convert perfectly
  • Use the full HTML response for precise parsing
  • Consider post-processing the markdown
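As a starting point for post-processing, a small pure-Python pass handles the most common conversion artifacts (trailing whitespace, runs of blank lines). This is our own sketch, not part of the tool:

```python
import re

def tidy_markdown(md: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    md = "\n".join(line.rstrip() for line in md.splitlines())
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip() + "\n"

raw = "# Title   \n\n\n\nSome text.  \n"
print(tidy_markdown(raw))  # "# Title\n\nSome text.\n"
```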

Integration Examples

Example 1: Blog Content Aggregator

# Aggregate blog posts and store in Notion
from datetime import datetime

blog_urls = get_blog_urls()

for url in blog_urls:
    # Extract content
    content = client.call_tool(
        name="web_scraper_Get_Website_Markdown",
        arguments={"url": url}
    )

    # Summarize
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 200
        }
    )

    # Create Notion page
    notion.call_tool(
        name="notion_create_page",
        arguments={
            "title": content["title"],
            "content": content["markdown"],
            "properties": {
                "Summary": summary["summary"],
                "Source URL": url,
                "Scraped Date": datetime.now().isoformat()
            }
        }
    )

Example 2: Documentation Mirror

# Create local mirror of documentation
from urllib.parse import urljoin

def mirror_docs(base_url):
    # Get main page
    main = client.call_tool(
        name="web_scraper_Scrape_Webpage",
        arguments={"url": base_url}
    )

    # Extract all internal links
    internal_links = main["links"]["internal"]

    # Scrape each page
    for link in internal_links:
        full_url = urljoin(base_url, link)

        page = client.call_tool(
            name="web_scraper_Get_Website_Markdown",
            arguments={"url": full_url}
        )

        # Save locally
        filename = link.strip("/").replace("/", "_") + ".md"
        save_file(filename, page["markdown"])

Example 3: News Digest

# Create daily news digest
news_sites = [
    "https://news1.com/tech",
    "https://news2.com/ai",
    "https://news3.com/startups"
]

digest = []
for url in news_sites:
    summary = client.call_tool(
        name="web_scraper_Summarize_Webpage",
        arguments={
            "url": url,
            "max_length": 100
        }
    )

    digest.append(f"### {url}\n{summary['summary']}\n")

# Send digest
send_email(
    subject="Daily Tech News Digest",
    body="\n\n".join(digest)
)

See Also