OpenClaw for Web Scraping: Tools, Skills, and What Actually Works

Web scraping is one of the top use cases our customers have for Klaus. It’s often the biggest value-add, one that teams can’t get from other chat or OpenClaw products.

OpenClaw gives agents several ways to read the web, but the docs don’t tell you when each one works and when it falls over. I’ve watched hundreds of agents try to scrape websites, and the pattern is always the same: the agent picks the wrong tool, burns tokens retrying, and comes back with nothing.

We ship all three scraping approaches pre-configured, and I see which ones our customers actually use. Here’s what works.

Three Ways OpenClaw Agents Read the Web

OpenClaw agents have three approaches for pulling data from websites, and choosing the right one is the difference between clean results and a pile of empty responses.

  1. web_fetch: the built-in HTTP tool. Fast, free, no JavaScript execution.
  2. Browser automation: full Chromium via Chrome DevTools Protocol. Handles JavaScript, expensive and fragile.
  3. Scraping APIs (Firecrawl, Olostep, Exa): remote services that handle rendering, anti-bot measures, and structured extraction.

Here’s when to use each one:

| Site type | web_fetch | Browser automation | Scraping API |
| --- | --- | --- | --- |
| Static HTML (docs, blogs, press releases) | Best choice | Overkill | Unnecessary |
| JS-rendered SPA (most modern SaaS sites) | Returns empty | Works | Works |
| Login-protected pages | Fails | Works (fragile) | Depends on the service |
| Anti-bot protected (LinkedIn, Amazon) | Fails | Risky | Best choice |
| Bulk extraction (100+ pages) | Slow | Very slow | Best choice |

The default tool your agent reaches for is web_fetch, and for a lot of sites that’s fine. The problems start when it isn’t.
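The decision table above can be sketched as a small lookup a skill might run before the agent fetches anything. The category and tool names here are illustrative, not actual OpenClaw identifiers:

```python
# Sketch of the decision table as a helper an agent-side skill could call
# before fetching. Site categories and tool names are illustrative.

def pick_scraping_tool(site_type: str) -> str:
    """Map a site category to the cheapest tool likely to succeed."""
    table = {
        "static_html": "web_fetch",      # docs, blogs, press releases
        "js_spa": "scraping_api",        # web_fetch would return an empty shell
        "login_protected": "browser",    # fragile, but often the only option
        "anti_bot": "scraping_api",      # local browsers get fingerprinted
        "bulk": "scraping_api",          # 100+ pages: parallel API calls win
    }
    return table.get(site_type, "web_fetch")  # default to the free tool

print(pick_scraping_tool("js_spa"))  # scraping_api
```

The point of the default branch is the same as the prose: start with the free tool, and only escalate when the site category demands it.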

When web_fetch Is Enough

web_fetch makes a plain HTTP GET request, extracts readable text, and returns markdown. Results get cached for 15 minutes. No API keys, no cost, no setup.

It works well for government sites, documentation pages, blog posts, Wikipedia, static job boards, and press releases. Anything where the HTML contains the actual content.

It fails on single-page applications, sites behind Cloudflare challenges, and anything that requires JavaScript to render. The tool does not execute JavaScript. If the server sends a shell HTML page that JavaScript fills in, web_fetch returns that shell.

OpenClaw has a three-tier extraction fallback: it tries Readability first (local HTML extraction), then Firecrawl if you’ve configured an API key, then basic HTML cleanup. If you’re on Klaus, the Firecrawl fallback is already configured.

Practical signal: if web_fetch returns very little text from a page you know has content, the page is JS-rendered. Switch approaches. Don’t let your agent retry the same fetch five times hoping for a different result.
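That signal can be made mechanical. Here is a rough heuristic you could run on whatever a fetch returns; the thresholds and regex-based stripping are illustrative, not how OpenClaw's Readability tier actually works:

```python
# Hedged heuristic for the "very little text" signal: compare extracted
# visible text to the raw HTML size. Thresholds are illustrative.

import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a fetched page is a JS shell rather than real content."""
    # Crudely strip scripts, styles, then all tags; real extractors
    # use Readability for this step.
    no_scripts = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    text = re.sub(r"\s+", " ", text).strip()
    # A large HTML payload with almost no visible text is the classic SPA shell.
    return len(html) > 2000 and len(text) < min_text_chars
```

If this returns True, switch approaches instead of retrying the fetch.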

One more thing, this time about web_fetch’s sibling tool: OpenClaw’s web_search supports nine search providers (Brave, DuckDuckGo, Exa, Firecrawl, Gemini, Grok, Kimi, Perplexity, Tavily). web_search returns titles, URLs, and short snippets. It’s useful for finding pages, but it doesn’t give you full content; your agent still needs to fetch each page separately. That round trip (search, then fetch each result) adds up in tokens and time. Scraping APIs like Firecrawl and Exa can combine search and content retrieval in a single call.

When You Need Browser Automation (and When You Don’t)

OpenClaw’s managed browser uses Chromium via the Chrome DevTools Protocol. Your agent sees the page, clicks buttons, fills forms, and reads JavaScript-rendered content.

It sounds like the answer to everything, but it isn’t.

Browser automation works well for navigating complex settings pages, filling multi-step forms, and clicking through UIs that don’t have APIs. I wrote about this before: the sweet spot is navigating intricate interfaces, not extracting data at scale.

Where it breaks down: sites with bot detection analyze your browser fingerprint and IP address, so the obstacle isn’t just CAPTCHAs. LinkedIn and Amazon actively block automation. Agents still handle complex form validation, autocomplete, and reactive inputs surprisingly badly. And rate-limited sites will block you after a few requests.

There’s also the token cost. Browser automation means your agent is reading and reasoning about full rendered pages, screenshots, and DOM elements. A single browser scraping session can consume more tokens than dozens of API calls that return clean, structured data. For a lead list of 50 companies, browser automation turns a 5-minute job into an hour of agent time.

The compute cost matters too. Browser automation eats far more RAM than web_fetch. On a 2GB Starter instance, running Chromium can spike memory and crash the gateway. If you’re doing browser automation regularly, you want at least the Plus plan.

The rule I keep coming back to: if an API can do the job, always prefer it. That applies to scraping too. Browser automation is the tool of last resort, not the first thing you reach for.

Scraping APIs: What Most People Should Actually Use

For most business scraping (lead enrichment, competitor monitoring, directory extraction), a dedicated scraping API handles the hard parts: JavaScript rendering, anti-bot measures, rate limiting, and structured output. You don’t run a browser. You don’t manage sessions. You send a URL and get data back.

Three tools cover most use cases, and on Klaus they’re all available through Orthogonal — an integration layer that gives your agent access to hundreds of APIs without managing separate accounts or API keys.

Firecrawl

Firecrawl turns any URL into clean markdown or structured JSON. The Scrape endpoint handles single pages. The Search endpoint combines web search with full page content in one call, so your agent doesn’t need separate search and fetch steps. The Interact endpoint lets your agent click buttons and fill forms on remote pages without running a local browser.
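For reference, here is a minimal sketch of a Scrape call over plain HTTP, assuming Firecrawl’s v1 REST endpoint and a FIRECRAWL_API_KEY environment variable. On Klaus, Orthogonal manages the key, so you normally wouldn’t build this request yourself:

```python
# Sketch of a Firecrawl Scrape call, assuming the v1 REST endpoint.
# Builds the request without sending it.

import json
import os
import urllib.request

def build_scrape_request(url: str) -> urllib.request.Request:
    payload = {"url": url, "formats": ["markdown"]}
    return urllib.request.Request(
        "https://api.firecrawl.dev/v1/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('FIRECRAWL_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("https://example.com/pricing")
# urllib.request.urlopen(req) would return JSON containing the page as markdown.
```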

Pricing: free tier gives you 500 lifetime credits. Hobby plan is $16/month for 3,000 credits (1 credit per page scrape, 2 credits per minute of browser time).

On Klaus, Firecrawl is pre-configured via Orthogonal. Your agent can use it immediately.

Best for: general-purpose page scraping, crawling entire sites, extracting content from JavaScript-heavy pages.

Olostep

Olostep is a web data API with AI-powered structured extraction. Define a JSON schema or describe what you want in natural language, and it returns clean, structured data regardless of how the HTML is organized. The AI understands page content semantically, so you can ask for fields like “company name, revenue, employee count” and it finds them even when the page layout changes.
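As a concrete example, the schema side of such a request might look like the JSON Schema below. The field names are illustrative, and the exact Olostep request that would carry this schema is not shown:

```python
# Illustrative JSON Schema for structured extraction. The fields mirror the
# "company name, revenue, employee count" example above.

import json

company_schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "revenue": {"type": "string", "description": "Annual revenue, as stated on the page"},
        "employee_count": {"type": "integer"},
    },
    "required": ["company_name"],
}

print(json.dumps(company_schema, indent=2))
```

The extraction service fills these fields semantically, so the same schema works across pages with completely different HTML.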

Best for: pulling specific fields from varied page layouts. Think extracting company profiles from different directory sites, or pulling product specs from competitor pages that all structure their HTML differently.

Pricing: free tier gives you 500 requests/month. Starter plan is $9/month for 5,000 requests. Also available through Orthogonal, no separate API key needed.

Exa

Exa is an AI-native search engine that returns full page content, not just snippets. When your agent needs to discover and read pages in one step, Exa handles both. Search costs $7 per 1,000 requests, with a free tier of 1,000 requests per month.

Best for: research workflows where you need to find companies matching criteria, pull their pages, and extract data in a single flow. Your agent searches, reads, and structures without juggling separate tools.

Pre-configured on Klaus through Orthogonal, same as the others.

When to Use Which

| Need | Best tool | Why |
| --- | --- | --- |
| Scrape a specific URL into markdown | Firecrawl | Handles JS, returns clean output |
| Extract structured data from varied layouts | Olostep | AI-powered field extraction |
| Find and read pages matching criteria | Exa | Search + full content in one call |
| Crawl an entire site | Firecrawl | Crawl endpoint handles pagination |
| Quick scrape of a static page | web_fetch | Free, no API needed |

When Not to Scrape at All

The best scraping strategy is often not scraping. Many data sources have APIs that are faster, more reliable, and less likely to break.

Apollo and Hunter.io provide enriched lead data without scraping LinkedIn or company websites. Coresignal gives you structured data on companies and professionals. Google Workspace APIs handle calendar, email, and document data. All of these are available via Orthogonal on Klaus, or directly through OpenClaw skills on ClawHub.

Before your agent writes a scraping workflow, it should check: does this site have a public API? The answer is yes more often than you’d think.

One more thing worth noting: the EU AI Act enters full enforcement in August 2026. Respect robots.txt. Don’t scrape personal data without a lawful basis under GDPR. AI-powered scraping that collects personally identifiable information still needs to comply with data protection laws, even if the data is publicly available.

A Real Scraping Workflow: Building a Prospect List

Here’s how a realistic workflow looks when you combine multiple tools. Say you need a list of SaaS companies in a specific vertical with their pricing and team size.

Step 1: Discover companies with Exa. Tell your agent: “Find B2B SaaS companies in the HR tech space with fewer than 200 employees.” Exa returns URLs and full page content, so your agent already has initial company descriptions and homepage data to work with.

Step 2: Scrape pricing pages with Firecrawl. Most pricing pages are JavaScript-rendered. Firecrawl handles the rendering and returns structured content. Your agent can extract tier names, prices, and feature lists without parsing raw HTML.

Step 3: Pull team data with Olostep or web_fetch. About pages and team pages are often static HTML. Try web_fetch first since it’s free. If it returns empty, fall back to Olostep with a schema like “team size, key people, locations.” Olostep handles the variation between sites where one company lists their team on /about, another on /team, and a third buries it in the footer.

Step 4: Enrich with Apollo. Get contact info for decision-makers at each company. No scraping needed. Apollo has structured contact data that would take hours to scrape from individual company websites and LinkedIn profiles.

Step 5: Output. Your agent writes a structured spreadsheet, pipes data into your CRM, or formats a research report. The whole workflow runs as a single conversation: “Build me a list of HR tech companies with under 200 employees, their pricing tiers, team size, and the VP of Sales at each one.”

This workflow uses four different tools because each data source needs a different approach. The key insight is that your agent doesn’t need to figure this out from scratch every time. On Klaus, all of these tools are pre-configured through Orthogonal, and you can save the workflow as a skill so it runs the same way next time.
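Sketched as code, the five steps collapse into one pipeline. Every helper below is a hypothetical stub standing in for the real tool call; none of them are actual SDK functions:

```python
# The five workflow steps as one pipeline. All helpers are hypothetical
# stubs standing in for real tool calls.

def exa_search(query):              # Step 1: discover companies (stub)
    return [{"name": "Acme HR", "url": "https://acme.example"}]

def firecrawl_scrape(url):          # Step 2: JS-rendered pricing page (stub)
    return {"tiers": ["Free", "Pro"]}

def web_fetch(url):                 # Step 3a: free first attempt (stub: empty shell)
    return ""

def olostep_extract(url, fields):   # Step 3b: structured fallback (stub)
    return {"team size": 42}

def apollo_enrich(name):            # Step 4: contact enrichment (stub)
    return [{"title": "VP of Sales"}]

def build_prospect_list(query):
    companies = exa_search(query)
    for c in companies:
        c["pricing"] = firecrawl_scrape(c["url"] + "/pricing")
        team = web_fetch(c["url"] + "/about")
        if not team:                # empty result → JS-rendered, fall back
            team = olostep_extract(c["url"], fields=["team size", "key people"])
        c["team"] = team
        c["contacts"] = apollo_enrich(c["name"])
    return companies                # Step 5: hand off to a sheet or CRM

prospects = build_prospect_list("HR tech SaaS under 200 employees")
```

The cheap-first fallback in Step 3 is the same rule as earlier in the article: try web_fetch, and only pay for a scraping API when it comes back empty.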

Frequently Asked Questions

What is the best OpenClaw skill for web scraping?

It depends on the site. Firecrawl for general-purpose scraping and JS-heavy pages. Exa for research workflows where you need to find and read pages together. Olostep for structured extraction from varied layouts. Install from ClawHub: clawhub install firecrawl.

Can OpenClaw scrape JavaScript-rendered websites?

Not with web_fetch alone. It makes a plain HTTP GET request without JavaScript execution. For JS-rendered sites, use browser automation or a scraping API like Firecrawl. On Klaus, the Firecrawl fallback is pre-configured in web_fetch’s extraction chain.

Is web scraping with OpenClaw legal?

Scraping publicly available, non-personal data is generally legal. Respect robots.txt. Don’t collect personal data without a lawful basis. The EU AI Act (full enforcement August 2026) requires respecting machine-readable opt-out signals. When in doubt, use an API instead.

How much does web scraping cost with OpenClaw?

web_fetch is free and built-in. Firecrawl starts at $16/month for 3,000 pages. Exa offers 1,000 free searches per month. On Klaus, Orthogonal credits cover initial usage on all plans.

Why does my OpenClaw agent return empty results when scraping?

The page is likely JavaScript-rendered, and web_fetch cannot execute JavaScript. Try the same URL with Firecrawl or browser automation. If the page loads content dynamically after the initial HTML, web_fetch will only see the shell.

Key Takeaways

  • OpenClaw agents have three web scraping approaches: web_fetch (free, static sites only), browser automation (handles JavaScript, expensive and fragile), and scraping APIs (best for most business use cases).
  • web_fetch is your first choice for static HTML pages. If it returns empty or very little text, the page is JavaScript-rendered, and you need a different approach.
  • Browser automation is the tool of last resort, not the first thing to reach for. It eats RAM, breaks on anti-bot sites, and costs more in tokens than scraping APIs.
  • For most business scraping, use dedicated APIs: Firecrawl for general scraping, Olostep for structured extraction, Exa for research workflows.
  • Check for a public API before scraping. Apollo, Hunter.io, and Coresignal often have the data you need without scraping.
  • On Klaus, all three scraping approaches plus Orthogonal integrations come pre-configured. No API keys to manage.
  • Respect robots.txt and GDPR. The EU AI Act enters full enforcement August 2026.

Want to skip the API key setup? Sign up at klausai.com and start scraping with Firecrawl, Exa, and Olostep pre-configured via Orthogonal.

For more on recommended skills, see Best OpenClaw Skills for Business. For a broader view of what customers build, see OpenClaw Use Cases: 10 Ways Our Customers Actually Use It.
