Web scraping at scale is one of the hardest problems in web engineering. A script that works for 10 pages breaks at 1,000. A system that handles 1,000 pages crumbles at 100,000. The challenges are not just technical. They are architectural, operational, and economic.
This guide covers everything you need to build a production scraping system that handles millions of pages reliably.
Why Scraping Is Hard
Scraping a single page is easy. You fetch the HTML, parse it, extract data. Anyone can do it in 20 lines of code.
Scraping at scale is hard for five reasons:
Dynamic content. Most modern websites render content with JavaScript. A simple HTTP GET returns an empty shell. You need a real browser to render the page, execute JavaScript, and wait for dynamic content to load. This means running headless browsers, which consume 200-500MB of RAM each.
Anti-bot detection. Websites use fingerprinting (checking browser properties for inconsistencies), behavioral analysis (detecting inhuman browsing patterns), CAPTCHAs (requiring human verification), and rate limiting (blocking IPs that make too many requests). These systems are sophisticated and constantly evolving.
Scale challenges. Each headless browser instance needs significant RAM and CPU. Running 100 concurrent browsers requires careful resource management. Browser instances leak memory, crash, and accumulate state that needs cleanup.
Content structure changes. Websites change their HTML structure without notice. A selector that worked yesterday returns nothing today. Your scraping system needs to handle these changes gracefully.
Reliability. At scale, everything that can fail will fail. Network errors, timeouts, partial page loads, unexpected redirects, cookie walls, login prompts, and rate limit responses. Your system needs to handle all of these.
Architecture for Scale
A production scraping system has five components:
URL queue. A prioritized queue of URLs to scrape. This can be Redis, PostgreSQL, or a managed queue service. URLs enter the queue from a seed list, sitemap parser, or link extractor. The queue tracks each URL's state (pending, in-progress, completed, failed) and handles retries.
Browser pool. A pool of browser sessions managed by BrowseFleet. Instead of running local browsers, you create cloud sessions on demand and release them when done. BrowseFleet handles browser lifecycle, stealth, and isolation.
Worker processes. Workers pull URLs from the queue, acquire a browser session, scrape the page, and push extracted data to a storage layer. Workers should be stateless so you can scale them horizontally.
Data pipeline. Raw scraped data goes through validation, transformation, and deduplication before being stored. This prevents garbage data from entering your database.
Monitoring. Dashboards tracking success rates, error rates, latency, and cost. Alerts when success rates drop below thresholds.
Here is a simplified implementation:
```typescript
import { BrowseFleet } from 'browsefleet';
import { Queue } from './queue';
import { DataStore } from './store';

const bf = new BrowseFleet({ apiKey: 'bf_...' });
const queue = new Queue('scrape-urls');
const store = new DataStore();

const CONCURRENCY = 20;
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function worker() {
  while (true) {
    const url = await queue.dequeue();
    if (!url) {
      // Queue is empty; wait before polling again
      await sleep(1000);
      continue;
    }
    try {
      // Use a quick action for simple pages
      const { markdown, html } = await bf.scrape(url, {
        stealth: 'full',
        timeout: 30000,
      });
      // extractData is your site-specific parser
      const data = extractData(html);
      await store.save(url, data);
      await queue.complete(url);
    } catch (error) {
      await queue.retry(url, error);
    }
  }
}

// Run workers in parallel
await Promise.all(
  Array.from({ length: CONCURRENCY }, () => worker())
);
```

Choosing Your Approach: Quick Actions vs Sessions
BrowseFleet offers two approaches to scraping, and choosing the right one matters for both cost and reliability.
Quick actions (bf.scrape, bf.screenshot, bf.pdf) are single API calls that handle the entire browser lifecycle. You give it a URL, it launches a browser, navigates to the page, extracts content, and returns the result. Quick actions are ideal for simple, single-page scraping where you do not need to interact with the page.
Sessions give you a persistent browser that you control via Puppeteer or Playwright. You create a session, connect your automation library, and have full control over navigation, clicks, form filling, and multi-page workflows. Sessions are necessary when you need to log in, paginate, or interact with the page.
The rule of thumb: use quick actions for simple extraction, sessions for complex workflows. Quick actions are cheaper because the browser lifecycle is optimized. Sessions give you more control but cost more because the browser stays running.
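This rule of thumb can be encoded as a small routing helper. The sketch below is illustrative: the `ScrapeTask` shape and its flags are assumptions about your own task metadata, not part of the BrowseFleet API.

```typescript
// Sketch: route a scraping task to a quick action or a session.
// The ScrapeTask shape and its flags are illustrative, not a
// BrowseFleet type.
interface ScrapeTask {
  url: string;
  requiresLogin?: boolean;
  requiresPagination?: boolean;
  requiresInteraction?: boolean;
}

function chooseApproach(task: ScrapeTask): 'quick-action' | 'session' {
  // Any workflow that touches the page beyond a single load needs a session
  if (task.requiresLogin || task.requiresPagination || task.requiresInteraction) {
    return 'session';
  }
  // Single-page extraction: the cheaper quick action is enough
  return 'quick-action';
}
```

A call like `chooseApproach({ url: '...', requiresLogin: true })` routes to a session; a plain URL routes to the cheaper quick action.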
Handling Anti-Bot Protections
Anti-bot systems detect scrapers through several signals:
Browser fingerprinting. Real browsers have consistent properties: navigator.webdriver is false, WebGL renders correctly, and the User-Agent matches the actual browser version. Headless browsers have telltale inconsistencies. BrowseFleet's stealth mode patches known fingerprint leaks.
Behavioral analysis. Real humans scroll, pause, move the mouse, and browse at varying speeds. Bots navigate instantly and interact at inhuman speeds. When using sessions for interactive scraping, add realistic delays and avoid perfectly consistent timing.
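One simple way to avoid perfectly consistent timing is to jitter every pause between interactions. A minimal sketch; the 500-2500ms bounds are arbitrary examples, not recommendations:

```typescript
// Sketch: a jittered pause to break up perfectly regular bot timing.
// The default 500-2500ms range is an arbitrary example.
function jitteredDelayMs(minMs = 500, maxMs = 2500): number {
  return minMs + Math.random() * (maxMs - minMs);
}

const pause = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Between interactions in a session:
// await pause(jitteredDelayMs());
```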
IP reputation. Data center IPs are flagged by most anti-bot services. Residential and mobile proxies are less suspicious. BrowseFleet's per-session proxy support lets you rotate through residential proxies.
CAPTCHAs. When fingerprinting and behavioral analysis are not conclusive, sites present CAPTCHAs. BrowseFleet solves reCAPTCHA, hCaptcha, and Turnstile automatically when CAPTCHA solving is enabled.
```typescript
const session = await bf.sessions.create({
  stealth: 'full',
  captchaSolving: true,
  proxy: 'socks5://user:pass@residential-proxy:1080',
});
```

Proxy Strategy
At scale, proxy management is critical. Here are the strategies that work:
Datacenter proxies are cheap ($1-5/GB) but easily detected. Use them for sites that do not have aggressive anti-bot measures.
Residential proxies use real consumer IP addresses and are much harder to detect. They cost more ($5-15/GB) but are necessary for sites with strong anti-bot systems. Rotate IPs per session to avoid patterns.
ISP proxies are datacenter IPs registered with ISPs, offering a middle ground between cost and detection resistance.
Per-session rotation. Assign each BrowseFleet session a different proxy to prevent IP correlation between requests. BrowseFleet supports this natively:
```typescript
const proxies = ['socks5://proxy1:1080', 'socks5://proxy2:1080', ...];
let proxyIndex = 0;

const session = await bf.sessions.create({
  stealth: 'full',
  proxy: proxies[proxyIndex++ % proxies.length],
});
```

Error Handling and Retries
Every failure mode needs a specific response:
Timeout errors. The page did not load within the time limit. Retry with a longer timeout, or flag the URL for investigation if it fails repeatedly.
HTTP errors (4xx, 5xx). 403 usually means bot detection, so retry with a different proxy and fresh session. 429 means rate limiting, so back off and retry later. 5xx means the server is struggling, so retry with exponential backoff.
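These responses can be encoded as a small classifier. A sketch, assuming your error objects expose an HTTP status code; the action names are illustrative:

```typescript
type RetryAction = 'rotate-proxy' | 'backoff' | 'retry' | 'give-up';

// Sketch: map an HTTP status to a retry strategy. Extracting the
// status from your error type is application-specific.
function classifyHttpError(status: number): RetryAction {
  if (status === 403) return 'rotate-proxy'; // likely bot detection
  if (status === 429) return 'backoff';      // rate limited
  if (status === 404) return 'give-up';      // page is gone; retrying is pointless
  if (status >= 500) return 'backoff';       // server struggling
  return 'retry';
}
```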
CAPTCHA failures. CAPTCHA solving can fail. Retry with a fresh session. If CAPTCHAs persist, the site may require a higher-quality proxy.
Content validation failures. The page loaded but the expected data is missing. This could mean the page structure changed, the content is behind a login wall, or the page is a soft block. Log the full HTML for investigation.
Implement a retry budget. Each URL gets 3-5 attempts before being moved to a dead letter queue for manual investigation.
```typescript
async function scrapeWithRetry(url: string, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const { markdown } = await bf.scrape(url, {
        stealth: 'full',
        timeout: 30000,
      });
      const data = extractData(markdown);
      if (validateData(data)) return data;
      throw new Error('Validation failed');
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff: 1s, 2s, 4s, ...
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
}
```

Rate Limiting and Politeness
Aggressive scraping can harm the target site and get you permanently banned. Implement these safeguards:
Per-domain rate limiting. Never exceed 1 request per second to a single domain unless you know the site can handle more. BrowseFleet's concurrent sessions make it easy to scrape multiple domains in parallel while respecting per-domain limits.
robots.txt compliance. Check robots.txt before scraping. While robots.txt is not legally binding in most jurisdictions, respecting it is good practice and reduces the chance of being blocked.
Off-peak scraping. Schedule heavy scraping during the target site's off-peak hours to minimize impact on their infrastructure.
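The per-domain limit can be enforced with a small in-memory limiter. A sketch, assuming a single worker process; distributed workers would need a shared store such as Redis:

```typescript
// Sketch: enforce a minimum interval between requests to each domain.
// Single-process only; multiple workers need a shared limiter.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class DomainRateLimiter {
  private nextAllowed = new Map<string, number>();

  constructor(private minIntervalMs = 1000) {}

  async acquire(url: string): Promise<void> {
    const domain = new URL(url).hostname;
    const now = Date.now();
    const readyAt = this.nextAllowed.get(domain) ?? now;
    // Reserve the next slot before waiting so concurrent callers queue up
    this.nextAllowed.set(domain, Math.max(readyAt, now) + this.minIntervalMs);
    if (readyAt > now) await sleep(readyAt - now);
  }
}
```

Call `await limiter.acquire(url)` before each scrape; requests to different domains proceed in parallel while requests to the same domain are spaced out.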
Data Quality and Validation
Raw scraped data is noisy. Build validation into your pipeline:
Schema validation. Define expected data shapes and reject results that do not match. If you expect a price field, verify it is a valid number.
Deduplication. The same content can appear at multiple URLs. Hash the extracted data and skip duplicates.
Freshness tracking. Track when each piece of data was last scraped. Stale data needs re-scraping.
Change detection. Compare new data with previous scrapes. Large unexpected changes may indicate a scraping error rather than a real change.
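The deduplication step can be sketched with a content hash over the extracted fields. The record shape is illustrative, and at scale the seen-set would live in your database rather than in memory:

```typescript
import { createHash } from 'node:crypto';

// Sketch: hash extracted data so identical content reached via
// different URLs is stored only once. Works for flat records.
function contentHash(data: Record<string, unknown>): string {
  // Sort keys so field order does not change the hash
  const canonical = JSON.stringify(data, Object.keys(data).sort());
  return createHash('sha256').update(canonical).digest('hex');
}

const seen = new Set<string>();

function isDuplicate(data: Record<string, unknown>): boolean {
  const hash = contentHash(data);
  if (seen.has(hash)) return true;
  seen.add(hash);
  return false;
}
```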
Monitoring Your Scraping Pipeline
Monitor these metrics:
Success rate. Percentage of URLs that return valid data. Target above 95%. A drop below 90% indicates a systemic issue.
Error distribution. Break down failures by type (timeout, HTTP error, validation failure, CAPTCHA). This tells you where to focus optimization.
Latency. Average and p99 time per scrape. Increasing latency often predicts upcoming failures.
Cost per page. Track browser-hours and API calls per successfully scraped page. Optimize the most expensive patterns.
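A sketch of computing these metrics from per-URL result records; the `ScrapeResult` shape is an illustrative assumption, not part of any API:

```typescript
// Sketch: aggregate per-URL scrape results into the metrics above.
interface ScrapeResult {
  ok: boolean;
  errorType?: 'timeout' | 'http' | 'validation' | 'captcha';
  latencyMs: number;
}

function successRate(results: ScrapeResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.ok).length / results.length;
}

function errorDistribution(results: ScrapeResult[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of results) {
    if (!r.ok && r.errorType) {
      counts.set(r.errorType, (counts.get(r.errorType) ?? 0) + 1);
    }
  }
  return counts;
}

function p99Latency(results: ScrapeResult[]): number {
  const sorted = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  // Nearest-rank 99th percentile
  const idx = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.99) - 1);
  return sorted[idx] ?? 0;
}
```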
Cost Optimization
Scraping at scale can be expensive. Here are the biggest wins:
Use quick actions for simple pages. Quick actions are cheaper than sessions because the browser lifecycle is optimized.
Minimize session duration. With sessions, close them as soon as you are done. Do not keep sessions idle.
Cache aggressively. If a page has not changed (check via HTTP headers or content hash), do not re-scrape it.
Use the right concurrency. More concurrent sessions means faster scraping but higher costs. Find the concurrency level that balances speed and budget.
Self-host for high volume. If you are scraping millions of pages per month, self-hosting BrowseFleet on your own infrastructure can reduce costs by 80% or more compared to managed cloud browser APIs.
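The caching advice above can be sketched as a freshness check before re-scraping. The cached record shape and TTL are illustrative assumptions:

```typescript
// Sketch: skip a re-scrape when the cached copy is still fresh,
// and skip re-processing when the content hash has not changed.
interface CachedPage {
  contentHash: string;
  scrapedAt: number; // epoch ms
}

function shouldRescrape(
  cached: CachedPage | undefined,
  ttlMs: number,
  now = Date.now(),
): boolean {
  if (!cached) return true;               // never scraped
  return now - cached.scrapedAt >= ttlMs; // stale past TTL
}

function hasChanged(cached: CachedPage, newHash: string): boolean {
  // Identical hash: skip the expensive extraction and storage
  return cached.contentHash !== newHash;
}
```

Where the site supports it, HTTP validators (ETag, Last-Modified) let you make this decision without fetching the body at all.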