The stupidly simple Firecrawl replacement you can build today

Everyone building AI agents ends up at Firecrawl. It's the obvious choice for web crawling infrastructure. Then you check the pricing for production scale and realise you're about to haemorrhage money. Here's how I replaced it with a few lines of Go.

I'm not against managed crawling services like Firecrawl or Zyte. They solve real problems. But treating them as your default for every scrape job is expensive and unnecessary. Save them for the genuinely difficult sites. For everything else, you can do better.

The self-hosted trap

Firecrawl deserves credit for offering a self-hosted option. Many IaaS companies lock you into their cloud pricing completely, so having the choice to run it yourself is valuable.

However, the self-hosted version requires Docker, Playwright, RabbitMQ, Postgres, and Redis just to get started. You're running five different services to achieve something fundamentally simple. Sure, it's cheaper than their cloud offering, but the architecture feels bloated for what most projects actually need.

What I actually needed

Three requirements drove how I built the crawler for this project:

  • Lightweight and portable, with minimal local dependencies
  • Recursive crawling across an entire site
  • Output results as Markdown documents

I've tried various approaches over the years: Scrapy, Selenium, Puppeteer, curl and wget. For this use case, Go was the obvious choice.

The performance characteristics of Go are amazing. I'm crawling several million sites a month from a single Mac Mini M2 with 8GB of memory. The hardware never breaks a sweat.

The actual crawling logic

The core functionality of crawling a URL and converting the HTML response to Markdown is really easy in Go:

resp, err := http.Get(targetURL)
if err != nil {
    return
}
body, _ := io.ReadAll(resp.Body)                // read the raw HTML
markdown, _ := md.NewConverter("", true, nil).ConvertString(string(body)) // convert HTML to Markdown

The heavy lifting of converting HTML to markdown comes from html-to-markdown by Johannes Kaufmann. That's it. One single external dependency doing one thing well.
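
Those three arguments to NewConverter look cryptic, so for the record (going from the v1 API as I remember it): the first is the page's domain, used to turn relative links and image URLs into absolute ones, the second enables the CommonMark plugins, and the third takes optional settings. If you want relative links resolved, pass the host you're crawling:

converter := md.NewConverter("egil.biz", true, nil) // or base.Host once the URL is parsed
markdown, err := converter.ConvertString(string(body))
if err != nil {
    return
}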

The recursive logic is slightly more complicated, but it's still easy to follow even for non-Go developers looking at the code for the first time:

base, _ := url.Parse(targetURL)                       // base URL for resolving relative links
body, _ := io.ReadAll(resp.Body)
doc, _ := html.Parse(strings.NewReader(string(body)))
var findLinks func(*html.Node)
findLinks = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {  // <a> element
        for _, a := range n.Attr {
            if a.Key == "href" {
                link, _ := base.Parse(a.Val)          // resolve relative URLs against the base
                if link != nil && link.Host == base.Host { // only follow links on the same site
                    crawl(link.String(), visited)
                }
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling { // depth-first walk of the DOM
        findLinks(c)
    }
}
findLinks(doc)
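
One refinement worth adding, which I've left out of the snippet to keep it short: strip URL fragments before recursing, so /docs and /docs#install aren't treated as two different pages. A small helper like this (the name is my own) does it; call crawl(withoutFragment(link), visited) instead of crawl(link.String(), visited):

// Hypothetical helper: strip the #fragment so in-page anchors
// don't count as separate pages in the visited map.
func withoutFragment(u *url.URL) string {
    clean := *u        // copy so the caller's URL stays untouched
    clean.Fragment = ""
    return clean.String()
}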

The simplicity is deliberate. No frameworks, no abstractions that hide what's actually happening. When something breaks (and it will), you can understand exactly what went wrong because you're looking at straightforward Go code making HTTP requests and parsing HTML.

Making it run anywhere

Wrap this in a main function with proper imports and you've got a complete program:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
    "strings"

    md "github.com/JohannesKaufmann/html-to-markdown"
    "golang.org/x/net/html"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: crawler <url>")
        os.Exit(1)
    }
    crawl(os.Args[1], make(map[string]bool))
}

func crawl(targetURL string, visited map[string]bool) {
    if visited[targetURL] { // skip anything we've already fetched
        return
    }
    visited[targetURL] = true

    base, _ := url.Parse(targetURL)
    resp, err := http.Get(targetURL)
    if err != nil {
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return
    }

    body, _ := io.ReadAll(resp.Body)

    // Convert the page to Markdown and print it.
    markdown, _ := md.NewConverter("", true, nil).ConvertString(string(body))
    fmt.Printf("\n=== %s ===\n%s\n", targetURL, markdown)

    // Parse the HTML again and recurse into every same-host link.
    doc, _ := html.Parse(strings.NewReader(string(body)))
    var findLinks func(*html.Node)
    findLinks = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    link, _ := base.Parse(a.Val)
                    if link != nil && link.Host == base.Host {
                        crawl(link.String(), visited)
                    }
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            findLinks(c)
        }
    }
    findLinks(doc)
}
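
As written, the program prints everything to stdout. If you want each page saved as its own Markdown file, which is what the third requirement actually asks for, a couple of extra lines in crawl after the conversion will do; the filename scheme here is just one option:

name := strings.ReplaceAll(base.Host+base.Path, "/", "_") + ".md"
if err := os.WriteFile(name, []byte(markdown), 0o644); err != nil {
    return
}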

Build it:

go mod init crawler
go get github.com/JohannesKaufmann/html-to-markdown
go get golang.org/x/net/html
go build crawler.go

Copy the binary to any machine and run it like this:

./crawler https://egil.biz

Want to run ten crawlers in parallel on different machines? Copy the binary ten times and run them. No coordination required if you're crawling different domains.
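
Those machines don't even need Go installed. Cross-compile from wherever you develop; the target platforms and output names below are just examples:

GOOS=linux GOARCH=amd64 go build -o crawler-linux crawler.go
GOOS=darwin GOARCH=arm64 go build -o crawler-mac crawler.go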

Adding JavaScript support when you need it

Some sites require JavaScript rendering. Headless Chrome integrates cleanly with Go through chromedp, which drives Chrome over its DevTools Protocol. Chrome and Go both come out of Google, and the bindings are mature and actively maintained.

Replace the http.Get call with chromedp.Run:

chromedp.Run(ctx,
    chromedp.Navigate(targetURL),
    chromedp.Sleep(2*time.Second),
    chromedp.OuterHTML("html", &htmlContent),
)
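
For context, the surrounding setup looks roughly like this; it's a minimal sketch that needs the context, time and github.com/chromedp/chromedp imports, and the two-second sleep is a blunt way to wait for client-side rendering that you may want to replace with something smarter:

ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()

var htmlContent string
err := chromedp.Run(ctx,
    chromedp.Navigate(targetURL),
    chromedp.Sleep(2*time.Second),            // crude wait for JS to render
    chromedp.OuterHTML("html", &htmlContent), // grab the rendered DOM
)
if err != nil {
    return
}
// htmlContent then feeds the same Markdown conversion as before.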

This reduces portability, since the host needs headless Chrome installed. But it's still vastly simpler than running the full Firecrawl stack. Most sites that care about being agent-friendly don't need JavaScript rendering.

This approach covers most crawling needs. In most cases Firecrawl is overkill and you're essentially paying for infrastructure you don't need. For the genuinely hard sites where you do need heavyweight crawling infrastructure, I would offload that to Zyte, which has been building exactly this kind of thing for well over a decade.

The difference in economics is stark. Self-hosted crawling costs you server time and bandwidth. Managed services charge per request, so the bill grows in lockstep with your crawl volume.

This is just the foundation

What I've shown here isn't production-ready. It's the base I've built on for the AI crawlers running in Agentable. The real implementation needs proper error handling, retry logic with exponential backoff, rate limiting, respect for robots.txt, and graceful failure modes.
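
To make that concrete, here's a sketch of just one of those pieces, retry with exponential backoff, written to slot into the crawler above (the function name is mine; it needs the fmt, net/http and time imports):

// getWithRetry retries failed or non-200 requests, doubling the wait
// between attempts: 1s, 2s, 4s, ...
func getWithRetry(u string, attempts int) (*http.Response, error) {
    for i := 0; i < attempts; i++ {
        resp, err := http.Get(u)
        if err == nil && resp.StatusCode == http.StatusOK {
            return resp, nil
        }
        if resp != nil {
            resp.Body.Close()
        }
        time.Sleep(time.Duration(1<<i) * time.Second)
    }
    return nil, fmt.Errorf("giving up on %s after %d attempts", u, attempts)
}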

But that's the point. Start with something simple that works, then add complexity only where you actually need it. Most web crawling problems just require a bit of thoughtful code and some practical error handling.

I keep iterating on this setup as my AI agents research more and more of the internet. The foundation stays simple while the sophistication grows where it needs to.

Need help building AI crawlers?

Egil Fujikawa Nes

I help companies build crawlers that scale. Whether you need web scraping for AI training data, competitive intelligence, or automated research, I can design a solution that fits your actual requirements.