Web Scraper
Extract content from websites without external APIs or credentials.
Provider: Built-in
Authentication: None required
Category: Web Scraping
Credit Cost: 1 credit per request
Overview
Web Scraper tools provide simple, fast webpage content extraction without requiring API keys or external services. Perfect for quick scraping tasks, documentation extraction, and content aggregation.
For advanced scraping with JavaScript rendering and AI extraction, see Firecrawl.
Available Tools
Scrape Webpage
Extract comprehensive structured data from a webpage including HTML, links, images, and metadata.
Tool ID: web_scraper_Scrape_Webpage
Credit Cost: 1 credit
Parameters:
url (string, required): URL of the webpage to scrape
include_links (boolean, optional): Include all links found on the page. Default: true
include_images (boolean, optional): Include all images found on the page. Default: true
Response:
{
"success": true,
"url": "https://example.com",
"title": "Example Domain",
"html": "<!DOCTYPE html><html>...</html>",
"markdown": "# Example Domain\n\nThis domain is for use...",
"links": {
"internal": ["/about", "/contact"],
"external": ["https://www.iana.org/domains/example"]
},
"media": {
"images": [
{
"src": "https://example.com/logo.png",
"alt": "Company Logo"
}
]
},
"metadata": {
"description": "Example Domain",
"keywords": "example, domain",
"author": "IANA"
}
}
Example Usage:
# Python
response = client.call_tool(
name="web_scraper_Scrape_Webpage",
arguments={
"url": "https://example.com",
"include_links": True,
"include_images": True
}
)
print(f"Title: {response['title']}")
print(f"Links found: {len(response['links']['internal']) + len(response['links']['external'])}")
print(f"Images found: {len(response['media']['images'])}")
// TypeScript
const response = await client.callTool({
name: "web_scraper_Scrape_Webpage",
arguments: {
url: "https://example.com/article"
}
});
// Get clean markdown content
const content = response.markdown;
Use Cases:
- Extract blog post content
- Gather links from documentation
- Collect image URLs from galleries
- Parse article metadata
- Build sitemaps
Get Website Markdown
Extract clean markdown content from any webpage.
Tool ID: web_scraper_Get_Website_Markdown
Credit Cost: 1 credit
Parameters:
url (string, required): URL of the webpage to extract
Response:
{
"success": true,
"url": "https://example.com/article",
"title": "How to Use MCP Servers",
"markdown": "# How to Use MCP Servers\n\nMCP (Model Context Protocol) is...\n\n## Getting Started\n\n1. Install the MCP server\n2. Configure your IDE..."
}
Example Usage:
# Python - Extract article content
response = client.call_tool(
name="web_scraper_Get_Website_Markdown",
arguments={"url": "https://blog.example.com/post-123"}
)
# Save to file
with open("article.md", "w") as f:
f.write(response["markdown"])
// TypeScript - Quick content extraction
const response = await client.callTool({
name: "web_scraper_Get_Website_Markdown",
arguments: {
url: "https://docs.example.com/guide"
}
});
// Use markdown directly
console.log(response.markdown);
Use Cases:
- Convert web pages to markdown for documentation
- Extract blog posts for analysis
- Save articles for offline reading
- Build knowledge bases from web content
- Feed content to AI for summarization
Summarize Webpage
Get an AI-generated summary of a webpage's content.
Tool ID: web_scraper_Summarize_Webpage
Credit Cost: 1 credit (base) + LLM token costs
Parameters:
url (string, required): URL of the webpage to summarize
max_length (integer, optional): Maximum summary length in words. Default: 500
model (string, optional): LLM model to use for summarization. Default: "anthropic/claude-sonnet-4-5". Options: "openai/gpt-5", "openai/gpt-5-mini", "openai/gpt-5-nano", "anthropic/claude-sonnet-4-5", "anthropic/claude-haiku-4-5", "google/gemini-2.5-pro", "google/gemini-2.5-flash"
use_thinking (boolean, optional): Enable extended thinking (Anthropic models only). Default: true
thinking_budget (integer, optional): Maximum thinking tokens. Default: 10000
Response:
{
"success": true,
"url": "https://example.com/article",
"summary": "This article explains how to set up MCP servers for AI applications. It covers installation, configuration, and common use cases. Key points include connecting to IDEs, managing credentials, and best practices for server management.",
"word_count": 42
}
Example Usage:
# Python - Summarize article
response = client.call_tool(
name="web_scraper_Summarize_Webpage",
arguments={
"url": "https://news.example.com/long-article",
"max_length": 200
}
)
print(f"Summary ({response['word_count']} words):")
print(response["summary"])
// TypeScript - Quick summary with specific model
const response = await client.callTool({
name: "web_scraper_Summarize_Webpage",
arguments: {
url: "https://research.example.com/paper",
max_length: 300,
model: "anthropic/claude-haiku-4-5"
}
});
Use Cases:
- Research paper summaries
- News article digests
- Documentation overviews
- Content curation
- Quick content analysis
Credentials:
- Optional: OpenAI, Anthropic, or Google credentials
- If not provided, platform credentials are used
- When using user credentials: only base cost charged
- When using platform credentials: base cost + LLM token usage charged
Common Patterns
Documentation Aggregation
# Scrape multiple documentation pages
docs_urls = [
"https://docs.example.com/getting-started",
"https://docs.example.com/api-reference",
"https://docs.example.com/examples"
]
all_content = []
for url in docs_urls:
response = client.call_tool(
name="web_scraper_Get_Website_Markdown",
arguments={"url": url}
)
all_content.append({
"title": response["title"],
"content": response["markdown"],
"url": url
})
# Combine into single document
combined = "\n\n---\n\n".join([
f"# {doc['title']}\n\n{doc['content']}"
for doc in all_content
])
Link Discovery and Validation
# Find all links on a page and check them
response = client.call_tool(
name="web_scraper_Scrape_Webpage",
arguments={
"url": "https://example.com",
"include_links": True
}
)
all_links = response["links"]["internal"] + response["links"]["external"]
# Check each link
for link in all_links:
check = client.call_tool(
name="HTTPS_Call",
arguments={
"method": "HEAD",
"url": link,
"timeout": 5
}
)
if check["status_code"] >= 400:
print(f"Broken link: {link}")
Content Summarization Pipeline
# Scrape, analyze, and store summaries
def process_article(url):
# Get full content
content = client.call_tool(
name="web_scraper_Get_Website_Markdown",
arguments={"url": url}
)
# Generate summary
summary = client.call_tool(
name="web_scraper_Summarize_Webpage",
arguments={
"url": url,
"max_length": 150
}
)
# Count words
word_count = client.call_tool(
name="text_Count_Words",
arguments={"text": content["markdown"]}
)
# Store in database
return {
"url": url,
"title": content["title"],
"summary": summary["summary"],
"word_count": word_count["word_count"],
"full_content": content["markdown"]
}
Image Collection
# Collect all images from a page
response = client.call_tool(
name="web_scraper_Scrape_Webpage",
arguments={
"url": "https://gallery.example.com",
"include_images": True
}
)
images = response["media"]["images"]
# Filter and download
high_res_images = [
img for img in images
if "high-res" in img.get("src", "") or "large" in img.get("src", "")
]
print(f"Found {len(high_res_images)} high-resolution images")
Comparison: Web Scraper vs Firecrawl
| Feature | Web Scraper | Firecrawl |
|---|---|---|
| Cost | 1 credit | 3 credits |
| Authentication | None | API Key required |
| JavaScript | Static HTML only | Full JavaScript rendering |
| Speed | Very fast (< 1s) | Slower (5-15s) |
| AI Extraction | No | Yes, with custom schemas |
| Crawling | Single page | Multi-page crawling |
| Best For | Simple pages, blogs, docs | SPAs, dynamic content, complex sites |
Use Web Scraper when:
- Page is static HTML
- Need fast results
- Don't want to manage API keys
- Scraping simple content
Use Firecrawl when:
- Page requires JavaScript
- Need structured data extraction
- Crawling multiple pages
- Working with complex modern websites
Best Practices
Performance
- Web scraper is very fast (typically < 1 second)
- Cache results to avoid repeated requests
- Batch multiple URLs efficiently
- No rate limiting on Reeva side
Content Quality
- Works best with static HTML pages
- May miss content loaded by JavaScript
- Clean, semantic HTML produces better markdown
- Metadata extraction depends on page structure
Ethics and Legality
- Respect robots.txt
- Check website's terms of service
- Don't overload servers with requests
- Add delays between bulk scraping
- Attribute content to original source
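The "don't overload servers" and "add delays" points can be combined into a per-domain throttle. This is a sketch, not part of the tool API; `fetch` is whatever function wraps your `client.call_tool` invocation:

```python
import time
from urllib.parse import urlparse

def scrape_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, waiting at least `delay` seconds
    between consecutive requests to the same domain."""
    last_hit = {}  # domain -> monotonic timestamp of last request
    results = []
    for url in urls:
        domain = urlparse(url).netloc
        wait = delay - (time.monotonic() - last_hit.get(domain, float("-inf")))
        if wait > 0:
            time.sleep(wait)
        last_hit[domain] = time.monotonic()
        results.append(fetch(url))
    return results
```

Because the delay is tracked per domain, interleaving URLs from different sites costs no extra wall-clock time.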
Troubleshooting
"Failed to scrape" Error
Cause: Unable to fetch or parse webpage
Solutions:
- Verify URL is correct and accessible
- Check if page requires JavaScript (use Firecrawl instead)
- Ensure page is not behind authentication
- Try the page in a browser first
Missing Content
Cause: Content loaded dynamically by JavaScript
Solutions:
- Use Firecrawl for JavaScript-heavy sites
- Check if page has a print view or API
- Look for alternative data sources
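One way to decide automatically whether to fall back to Firecrawl is to compare the raw HTML size with the amount of text actually extracted. The thresholds below are illustrative assumptions, not documented limits:

```python
def looks_js_rendered(html: str, markdown: str,
                      min_html: int = 5000, max_text: int = 200) -> bool:
    """Heuristic: a large HTML payload yielding almost no extracted text
    usually means content is loaded client-side by JavaScript.
    Thresholds are guesses; tune them for your pages."""
    return len(html) >= min_html and len(markdown.strip()) <= max_text
```

If this returns true for a `Scrape_Webpage` response, retry the URL with Firecrawl.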
Slow Scraping
Cause: Large pages or slow website response
Solutions:
- This is expected for large pages
- Consider using Get_Website_Markdown for faster, lighter extraction
- Check website's server response time
- Try during off-peak hours
Markdown Formatting Issues
Cause: Complex HTML structure
Solutions:
- Markdown conversion works best with semantic HTML
- Some complex layouts may not convert perfectly
- Use the full HTML response for precise parsing
- Consider post-processing the markdown
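The post-processing step mentioned above can be as simple as normalizing whitespace; this minimal sketch trims trailing spaces and collapses runs of blank lines left over from complex layouts:

```python
import re

def tidy_markdown(md: str) -> str:
    """Light post-processing for scraped markdown:
    strips trailing spaces and collapses runs of blank lines."""
    md = "\n".join(line.rstrip() for line in md.splitlines())
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip() + "\n"
```

For layout problems this can't fix, parse the `html` field from `Scrape_Webpage` directly instead.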
Integration Examples
Example 1: Blog Content Aggregator
# Aggregate blog posts and store in Notion
from datetime import datetime
blog_urls = get_blog_urls()
for url in blog_urls:
# Extract content
content = client.call_tool(
name="web_scraper_Get_Website_Markdown",
arguments={"url": url}
)
# Summarize
summary = client.call_tool(
name="web_scraper_Summarize_Webpage",
arguments={
"url": url,
"max_length": 200
}
)
# Create Notion page
client.call_tool(
name="notion_create_page",
arguments={
"title": content["title"],
"content": content["markdown"],
"properties": {
"Summary": summary["summary"],
"Source URL": url,
"Scraped Date": datetime.now().isoformat()
}
}
)
Example 2: Documentation Mirror
# Create local mirror of documentation
from urllib.parse import urljoin
def mirror_docs(base_url):
# Get main page
main = client.call_tool(
name="web_scraper_Scrape_Webpage",
arguments={"url": base_url}
)
# Extract all internal links
internal_links = main["links"]["internal"]
# Scrape each page
for link in internal_links:
full_url = urljoin(base_url, link)
page = client.call_tool(
name="web_scraper_Get_Website_Markdown",
arguments={"url": full_url}
)
# Save locally
filename = link.strip("/").replace("/", "_") + ".md"
save_file(filename, page["markdown"])
Example 3: News Digest
# Create daily news digest
news_sites = [
"https://news1.com/tech",
"https://news2.com/ai",
"https://news3.com/startups"
]
digest = []
for url in news_sites:
summary = client.call_tool(
name="web_scraper_Summarize_Webpage",
arguments={
"url": url,
"max_length": 100
}
)
digest.append(f"### {url}\n{summary['summary']}\n")
# Send digest
send_email(
subject="Daily Tech News Digest",
body="\n\n".join(digest)
)
Related Tools
- Firecrawl - Advanced scraping with JavaScript support
- Web Search - Find URLs to scrape
- Text Tools - Process scraped content
- HTTPS Tool - Make custom HTTP requests
- Notion - Store scraped content
See Also
- Creating Custom Tools - Pre-configure scraping parameters
- Tool Testing - Test scraping in playground
- All Tools - Complete tool catalog