Firecrawl

Web scraping and crawling API that converts websites to LLM-ready data

$ npx docs2skills add firecrawl-web-scraper

What this skill does

Firecrawl transforms web pages into clean, structured data optimized for AI applications. Unlike traditional scrapers that struggle with JavaScript-heavy sites, Firecrawl handles dynamic content, waits for JavaScript rendering, bypasses anti-bot mechanisms, and converts pages into markdown or structured JSON ready for LLM consumption.

The service excels at crawling entire websites, following all accessible links, and extracting data at scale. It provides intelligent waiting mechanisms for content to load, selective caching, and interactive capabilities like clicking buttons or filling forms before scraping. This makes it ideal for building AI agents that need current web context, training datasets from modern websites, or powering chatbots with real-time information.

Firecrawl fits into the AI data pipeline as a reliable preprocessing layer, handling the complexity of modern web scraping while delivering token-efficient, clean data that requires minimal post-processing for LLM applications.

Prerequisites

  • Firecrawl API key (free tier: 500 credits)
  • Node.js 16+ for JavaScript SDK
  • Python 3.7+ for Python SDK
  • HTTP client for direct API access
  • Understanding of web scraping ethics and robots.txt compliance

Quick start

npm install @mendable/firecrawl-js

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: "your-api-key" });

// Scrape a single page
const scrapeResult = await app.scrapeUrl("https://example.com", {
  formats: ["markdown", "html"]
});

console.log(scrapeResult.markdown);

// Crawl an entire website
const crawlResult = await app.crawlUrl("https://example.com", {
  limit: 100,
  scrapeOptions: {
    formats: ["markdown"]
  }
});

console.log(crawlResult.data);

Core concepts

Scraping vs Crawling: Scraping extracts data from a single URL, while crawling follows links to scrape multiple pages across a website. Crawling respects robots.txt and provides comprehensive site coverage.

Format Options: Firecrawl returns data in multiple formats - markdown (LLM-optimized), HTML (full structure), structured data (extracted via schemas), and screenshots. Markdown is the primary format for AI applications as it's clean and token-efficient.

Smart Waiting: The service intelligently waits for JavaScript content to load, handling SPAs, lazy-loaded content, and dynamic elements. This eliminates the need for manual delay configuration.

Actions and Interactions: Before scraping, you can perform actions like clicking buttons, filling forms, scrolling, or navigating. This enables scraping of content behind interactions or authentication walls.

Caching and Rate Management: Firecrawl handles proxy rotation, rate limiting, and provides selective caching to avoid re-scraping unchanged content, making it suitable for production applications.
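
Even with server-side rate management, bursty client workloads can still see transient failures, so wrapping calls in a retry with exponential backoff is a common pattern. The sketch below is generic (the `fn` argument stands in for any Firecrawl call; the retry counts and delays are illustrative, not SDK behavior):

```javascript
// Retry a failing async call with exponential backoff (sketch).
// Each failed attempt doubles the wait: baseDelayMs, 2x, 4x, ...
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: rethrow
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage sketch: const result = await withRetry(() => app.scrapeUrl(url));
```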

Key API surface

Method                              Purpose
scrapeUrl(url, options)             Scrape a single page with format options
crawlUrl(url, options)              Crawl an entire website, following links
crawlUrlAndWait(url, options)       Synchronous crawl; waits for completion
checkCrawlStatus(jobId)             Check the status of an async crawl job
cancelCrawl(jobId)                  Cancel a running crawl job
search(query, options)              Search the web and return scraped results
batchScrapeUrls(urls, options)      Scrape multiple URLs in parallel
extractStructuredData(url, schema)  Extract data using a JSON schema
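
The async crawl methods pair naturally with a polling loop around checkCrawlStatus. A minimal sketch follows; the status checker is injected so the loop works with any client, and the `status` field with "completed"/"failed" values is an assumption about the response shape, not a confirmed SDK contract:

```javascript
// Poll an async crawl until it reaches a terminal state (sketch).
// `check` would be something like () => app.checkCrawlStatus(jobId).
async function pollUntilDone(check, { intervalMs = 5000, maxPolls = 120 } = {}) {
  for (let i = 0; i < maxPolls; i++) {
    const result = await check();
    if (result.status === "completed") return result;
    if (result.status === "failed") throw new Error("Crawl failed");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Crawl did not finish within the polling budget");
}
```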

Common patterns

AI Training Data Collection:

const crawlResult = await app.crawlUrl("https://docs.example.com", {
  limit: 1000,
  scrapeOptions: {
    formats: ["markdown"],
    onlyMainContent: true
  },
  excludePaths: ["/api/", "/images/"]
});

Real-time Agent Context:

const pageData = await app.scrapeUrl("https://news.example.com", {
  formats: ["markdown"],
  includeTags: ["article", "main"],
  waitFor: 3000
});

Interactive Scraping:

const result = await app.scrapeUrl("https://app.example.com", {
  actions: [
    { type: "click", selector: "#load-more" },
    { type: "wait", milliseconds: 2000 },
    { type: "scroll", coordinate: { x: 0, y: 500 } }
  ]
});

Structured Data Extraction:

const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    description: { type: "string" }
  }
};

const extracted = await app.extractStructuredData("https://product.com", schema);
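
Extraction results may not always match the schema exactly, so a lightweight client-side check before using the data can catch surprises. The helper below is hypothetical (not part of the Firecrawl SDK); real projects might use a full validator like Ajv instead:

```javascript
// Minimal structural check of data against a JSON-schema-like object
// (hypothetical helper, not an SDK function). Missing properties are
// treated as optional; only present values are type-checked.
function matchesSchema(data, schema) {
  if (schema.type === "object") {
    if (typeof data !== "object" || data === null) return false;
    return Object.entries(schema.properties ?? {}).every(
      ([key, sub]) => data[key] === undefined || matchesSchema(data[key], sub)
    );
  }
  return typeof data === schema.type; // "string" | "number" | "boolean"
}

const productSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
  },
};
```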

Configuration

Option              Default              Purpose
formats             ["markdown"]         Output formats: markdown, html, rawHtml, screenshot
onlyMainContent     false                Extract only the main content, skipping nav/footer
includeTags         []                   HTML tags to include in extraction
excludeTags         ["script", "style"]  HTML tags to exclude
waitFor             0                    Milliseconds to wait before scraping
timeout             30000                Request timeout in milliseconds
allowBackwardLinks  false                Follow links to parent directories
allowExternalLinks  false                Follow links to external domains
limit               10000                Maximum number of pages to crawl
maxDepth            null                 Maximum crawl depth from the starting URL
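
A reusable options object collecting the overrides that matter most for LLM pipelines can keep configuration consistent across calls. Option names are taken from the table above; the specific values are illustrative choices, not recommended defaults:

```javascript
// Illustrative scrape options for LLM-oriented extraction.
const llmScrapeOptions = {
  formats: ["markdown"],                 // markdown is the most token-efficient
  onlyMainContent: true,                 // override: skip nav/footer chrome
  excludeTags: ["script", "style", "nav", "footer"],
  waitFor: 2000,                         // give SPAs time to render
  timeout: 30000,                        // default request timeout
};

// Usage sketch: await app.scrapeUrl(url, llmScrapeOptions);
```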

Best practices

Optimize for LLM consumption by using onlyMainContent: true and markdown format to reduce token usage and improve context quality.

Handle rate limits gracefully by setting appropriate delays and using the async crawl methods for large jobs:

const job = await app.crawlUrl(url, { limit: 5000 });
// Poll job status instead of blocking

Use structured extraction for consistent data formats when building datasets:

const schema = { type: "object", properties: { /* your schema */ }};
await app.extractStructuredData(url, schema);

Implement proper error handling for failed scrapes and respect HTTP status codes:

try {
  const result = await app.scrapeUrl(url);
  if (!result.success) {
    console.error("Scrape failed:", result.error);
  }
} catch (error) {
  // Handle network/API errors
}

Cache strategically by tracking crawl jobs and avoiding re-crawling unchanged content within reasonable timeframes.

Filter paths efficiently using includePaths and excludePaths to focus on relevant content and avoid binary files or irrelevant sections.
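
The same include/exclude logic is easy to apply client-side when post-filtering crawl results. The predicate below assumes simple prefix matching, which may differ from Firecrawl's actual path-matching rules:

```javascript
// Decide whether a URL path is in scope, mirroring includePaths /
// excludePaths semantics (assumed prefix matching; excludes win).
function shouldCrawl(path, { includePaths = [], excludePaths = [] } = {}) {
  if (excludePaths.some((prefix) => path.startsWith(prefix))) return false;
  if (includePaths.length === 0) return true; // no allowlist: everything in
  return includePaths.some((prefix) => path.startsWith(prefix));
}
```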

Gotchas and common mistakes

Credits are consumed even for failed requests when using advanced features. Monitor your usage through the dashboard.

Social media platforms are not supported: Firecrawl focuses on business websites, documentation, and other public web content.

Robots.txt is respected by default - if crawling returns fewer pages than expected, check the target site's robots.txt file. Firecrawl identifies as 'FirecrawlAgent'.

JavaScript rendering adds latency - while Firecrawl handles JS automatically, complex SPAs may need explicit waitFor values or actions to ensure full content loading.

Crawl depth can explode quickly - a site with many cross-links can consume credits rapidly. Use maxDepth and limit parameters to control scope.

Format combinations affect performance - requesting multiple formats (markdown + html + screenshot) increases processing time and credit usage.

Actions are billed regardless of outcome - interactive scraping with actions consumes credits even if the scrape ultimately fails.

Rate limiting is per-account - multiple concurrent crawls share the same rate limits. Stagger large crawling jobs.

Async crawls require polling - crawlUrl() returns immediately with a job ID. Use crawlUrlAndWait() for synchronous behavior or implement status polling.

External links are blocked by default - enable allowExternalLinks: true only if you need to follow links outside the starting domain.