# firecrawl

Web scraping API that turns websites into LLM-ready data

```sh
$ npx docs2skills add firecrawl-web-scraper
```
## What this skill does
Firecrawl transforms web pages into clean, structured data optimized for AI applications. Unlike traditional scrapers that struggle with JavaScript-heavy sites, Firecrawl handles dynamic content, waits for JS rendering, bypasses anti-bot mechanisms, and converts content into markdown or structured JSON perfect for LLM consumption.
The service excels at crawling entire websites, following all accessible links, and extracting data at scale. It provides intelligent waiting mechanisms for content to load, selective caching, and interactive capabilities like clicking buttons or filling forms before scraping. This makes it ideal for building AI agents that need current web context, training datasets from modern websites, or powering chatbots with real-time information.
Firecrawl fits into the AI data pipeline as a reliable preprocessing layer, handling the complexity of modern web scraping while delivering token-efficient, clean data that requires minimal post-processing for LLM applications.
## Prerequisites
- Firecrawl API key (free tier: 500 credits)
- Node.js 16+ for JavaScript SDK
- Python 3.7+ for Python SDK
- HTTP client for direct API access
- Understanding of web scraping ethics and robots.txt compliance
## Quick start

```sh
npm install @mendable/firecrawl-js
```

```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: "your-api-key" });

// Scrape a single page
const scrapeResult = await app.scrapeUrl("https://example.com", {
  formats: ["markdown", "html"]
});
console.log(scrapeResult.markdown);

// Crawl an entire website
const crawlResult = await app.crawlUrl("https://example.com", {
  limit: 100,
  scrapeOptions: {
    formats: ["markdown"]
  }
});
console.log(crawlResult.data);
```
## Core concepts
**Scraping vs Crawling:** Scraping extracts data from a single URL, while crawling follows links to scrape multiple pages across a website. Crawling respects robots.txt and provides comprehensive site coverage.

**Format Options:** Firecrawl returns data in multiple formats: markdown (LLM-optimized), HTML (full structure), structured data (extracted via schemas), and screenshots. Markdown is the primary format for AI applications because it is clean and token-efficient.

**Smart Waiting:** The service intelligently waits for JavaScript content to load, handling SPAs, lazy-loaded content, and dynamic elements. This eliminates the need for manual delay configuration.

**Actions and Interactions:** Before scraping, you can perform actions such as clicking buttons, filling forms, scrolling, or navigating. This enables scraping of content behind interactions or authentication walls.

**Caching and Rate Management:** Firecrawl handles proxy rotation and rate limiting, and provides selective caching to avoid re-scraping unchanged content, making it suitable for production applications.
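The same caching idea can be applied on the client side when you manage your own scrape schedule. A minimal sketch of a TTL cache keyed by URL; this helper is illustrative only, not part of the Firecrawl SDK, and the injected clock exists purely to make the expiry logic easy to test:

```javascript
// Minimal TTL cache for scrape results, keyed by URL.
// `now` is injected so the expiry logic is deterministic in tests.
function createScrapeCache(ttlMs, now = () => Date.now()) {
  const entries = new Map(); // url -> { storedAt, result }

  return {
    // Return a cached result if it is still fresh, otherwise null.
    get(url) {
      const entry = entries.get(url);
      if (!entry) return null;
      if (now() - entry.storedAt > ttlMs) {
        entries.delete(url); // stale: evict and force a re-scrape
        return null;
      }
      return entry.result;
    },
    // Store a fresh result for a URL.
    set(url, result) {
      entries.set(url, { storedAt: now(), result });
    },
  };
}
```

In practice you would consult the cache before calling `app.scrapeUrl(...)` and store the returned markdown afterwards.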
## Key API surface

| Method | Purpose |
|---|---|
| `scrapeUrl(url, options)` | Scrape a single page with format options |
| `crawlUrl(url, options)` | Crawl an entire website, following links |
| `crawlUrlAndWait(url, options)` | Synchronous crawl; waits for completion |
| `checkCrawlStatus(jobId)` | Check the status of an async crawl job |
| `cancelCrawl(jobId)` | Cancel a running crawl job |
| `search(query, options)` | Search the web and return scraped results |
| `batchScrapeUrls(urls, options)` | Scrape multiple URLs in parallel |
| `extractStructuredData(url, schema)` | Extract data using a JSON schema |
## Common patterns

**AI Training Data Collection:**

```javascript
const crawlResult = await app.crawlUrl("https://docs.example.com", {
  limit: 1000,
  scrapeOptions: {
    formats: ["markdown"],
    onlyMainContent: true
  },
  excludePaths: ["/api/", "/images/"]
});
```

**Real-time Agent Context:**

```javascript
const pageData = await app.scrapeUrl("https://news.example.com", {
  formats: ["markdown"],
  includeTags: ["article", "main"],
  waitFor: 3000
});
```

**Interactive Scraping:**

```javascript
const result = await app.scrapeUrl("https://app.example.com", {
  actions: [
    { type: "click", selector: "#load-more" },
    { type: "wait", milliseconds: 2000 },
    { type: "scroll", coordinate: { x: 0, y: 500 } }
  ]
});
```

**Structured Data Extraction:**

```javascript
const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
    description: { type: "string" }
  }
};

const extracted = await app.extractStructuredData("https://product.com", schema);
```
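Extraction can succeed partially, so it is worth sanity-checking the returned object against the same schema locally. The sketch below is a deliberately minimal type check for flat schemas, not a full JSON Schema validator; for real datasets, use a library such as Ajv:

```javascript
// Check that an extracted object matches the top-level property types of a
// simple JSON schema (flat objects with string/number/boolean leaves only).
function matchesSchema(data, schema) {
  if (schema.type !== "object" || typeof data !== "object" || data === null) {
    return false;
  }
  return Object.entries(schema.properties).every(
    ([key, prop]) => typeof data[key] === prop.type
  );
}
```

A record that fails the check can be re-extracted or dropped before it contaminates a dataset.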
## Configuration

| Option | Default | Purpose |
|---|---|---|
| `formats` | `["markdown"]` | Output formats: markdown, html, rawHtml, screenshot |
| `onlyMainContent` | `false` | Extract only main content; skip nav/footer |
| `includeTags` | `[]` | HTML tags to include in extraction |
| `excludeTags` | `["script", "style"]` | HTML tags to exclude |
| `waitFor` | `0` | Milliseconds to wait before scraping |
| `timeout` | `30000` | Request timeout in milliseconds |
| `allowBackwardLinks` | `false` | Follow links to parent directories |
| `allowExternalLinks` | `false` | Follow links to external domains |
| `limit` | `10000` | Maximum number of pages to crawl |
| `maxDepth` | `null` | Maximum crawl depth from the starting URL |
## Best practices

Optimize for LLM consumption by using `onlyMainContent: true` and the markdown format to reduce token usage and improve context quality.

Handle rate limits gracefully by setting appropriate delays and using the async crawl methods for large jobs:

```javascript
const job = await app.crawlUrl(url, { limit: 5000 });
// Poll job status instead of blocking
```
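That polling can be sketched as a simple bounded loop. The client below is a stub with a `checkCrawlStatus`-shaped method; the status strings (`"scraping"`, `"completed"`, `"failed"`) are assumptions about the job payload, and real SDK calls return promises that would be awaited with a delay between attempts:

```javascript
// Poll a crawl job until it completes or the attempt budget runs out.
// Shown synchronously against a stub client for clarity; real SDK calls
// are async and should sleep between polls.
function pollCrawl(client, jobId, maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = client.checkCrawlStatus(jobId);
    if (status.status === "completed") return status; // data is ready
    if (status.status === "failed") {
      throw new Error(`Crawl ${jobId} failed`);
    }
    // any other status: job is still running, keep polling
  }
  throw new Error(`Crawl ${jobId} did not finish within ${maxAttempts} polls`);
}
```

Bounding the attempts matters: an abandoned job should surface as an error rather than poll forever.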
Use structured extraction for consistent data formats when building datasets:

```javascript
const schema = { type: "object", properties: { /* your schema */ } };
await app.extractStructuredData(url, schema);
```
Implement proper error handling for failed scrapes and respect HTTP status codes:

```javascript
try {
  const result = await app.scrapeUrl(url);
  if (!result.success) {
    console.error("Scrape failed:", result.error);
  }
} catch (error) {
  // Handle network/API errors
}
```
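Transient failures are common at scale, so a small retry wrapper complements the handler above. A sketch with exponential backoff; `attemptFn` stands in for any scrape call, and the injected `sleep` is a test convenience (real code would `await` an async delay):

```javascript
// Retry a function with exponentially growing backoff delays.
// `sleep` receives the delay in ms; injected so the schedule is observable.
function retryWithBackoff(attemptFn, { retries = 3, baseDelayMs = 500, sleep = () => {} } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return attemptFn();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        sleep(baseDelayMs * 2 ** attempt); // 500 ms, 1 s, 2 s, ...
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```

Backoff keeps retries from hammering a rate-limited endpoint, which would otherwise make the failures worse.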
Cache strategically by tracking crawl jobs and avoiding re-crawling unchanged content within reasonable timeframes.
Filter paths efficiently using `includePaths` and `excludePaths` to focus on relevant content and avoid binary files or irrelevant sections.
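The same include/exclude logic is also handy client-side when post-filtering crawl results. A sketch using simple substring matching; Firecrawl applies its own pattern matching server-side, so this local helper is illustrative only:

```javascript
// Decide whether a URL's path should be kept, mirroring the spirit of
// includePaths/excludePaths: excludes win, then includes (if any) must match.
function shouldKeepPath(url, { includePaths = [], excludePaths = [] } = {}) {
  const path = new URL(url).pathname;
  if (excludePaths.some((p) => path.includes(p))) return false;
  if (includePaths.length === 0) return true; // no include filter: keep all
  return includePaths.some((p) => path.includes(p));
}
```

Applying a filter like this before embedding or storage keeps asset and API routes out of the dataset even if the crawl picked them up.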
## Gotchas and common mistakes
- Credits are consumed even for failed requests when using advanced features. Monitor your usage through the dashboard.
- Social media platforms are not supported: Firecrawl focuses on business websites, documentation, and public web content, not social platforms.
- Robots.txt is respected by default. If crawling returns fewer pages than expected, check the target site's robots.txt file; Firecrawl identifies as 'FirecrawlAgent'.
- JavaScript rendering adds latency. While Firecrawl handles JS automatically, complex SPAs may need explicit `waitFor` values or actions to ensure full content loading.
- Crawl depth can explode quickly: a site with many cross-links can consume credits rapidly. Use the `maxDepth` and `limit` parameters to control scope.
- Format combinations affect performance: requesting multiple formats (markdown + html + screenshot) increases processing time and credit usage.
- Actions are billed regardless of outcome: interactive scraping with actions consumes credits even if the scrape ultimately fails.
- Rate limiting is per-account: multiple concurrent crawls share the same rate limits, so stagger large crawling jobs.
- Async crawls require polling: `crawlUrl()` returns immediately with a job ID. Use `crawlUrlAndWait()` for synchronous behavior or implement status polling.
- External links are blocked by default: enable `allowExternalLinks: true` only if you need to follow links outside the starting domain.
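To build intuition for the depth gotcha above: with an average of b crawlable links per page, a crawl to depth d can reach roughly 1 + b + b² + … + b^d pages. A quick back-of-envelope estimator (the branching factor is something you eyeball from the site, not a Firecrawl metric):

```javascript
// Rough upper bound on pages reached by a crawl: sum of b^k for k = 0..maxDepth.
function estimateCrawlPages(branchingFactor, maxDepth) {
  let total = 0;
  for (let depth = 0; depth <= maxDepth; depth++) {
    total += branchingFactor ** depth;
  }
  return total;
}
```

With 20 links per page, depth 3 already yields up to 8,421 pages and depth 4 up to 168,421, which is why `maxDepth` and `limit` belong in every large crawl config.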