Reader: The Web Scraping Engine Built for AI Agents
If you've ever tried to feed web data to an AI agent, you know the pain. Raw HTML is messy, full of navigation junk, ads, and scripts. Cleaning it up for an LLM is a chore. What if you could get just the actual content—the article text, the product description, the core data—in a clean, structured format, automatically?
That's exactly what Reader does. It's a new open-source web scraping engine designed from the ground up for production AI agents. It doesn't just fetch HTML; it intelligently extracts the primary readable content and strips away everything else, delivering exactly what your agent needs to process.
What It Does
Reader is a specialized web scraping tool with one primary job: to turn a URL into clean, usable text content. You give it a URL, and it returns a simplified JSON object containing the page's title and its main content, all boiled down to plain text. It handles the parsing, cleaning, and noise removal so you don't have to.
Think of it as a focused, single-purpose API that sits between your agent and the chaotic web, ensuring the agent only gets the signal, not the noise.
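To make the shape of that output concrete, here is a minimal Python sketch of how an agent might consume it. The key names "title" and "content" are assumptions chosen for illustration, as are the helper names; check the project's documentation for the actual response schema.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedPage:
    """Title and main text of a page, as returned by a Reader-style service."""
    title: str
    content: str


def parse_reader_response(raw: str) -> ParsedPage:
    """Decode a JSON response body into a ParsedPage.

    The keys "title" and "content" are assumed names, not confirmed
    against Reader's real response schema.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return ParsedPage(
        title=str(data.get("title", "")).strip(),
        content=str(data.get("content", "")).strip(),
    )


def to_prompt_context(page: ParsedPage, max_chars: int = 4000) -> str:
    """Format the page as plain text for a prompt, truncating long content."""
    return f"Title: {page.title}\n\n{page.content[:max_chars]}"
```

Truncating by character count is only a rough stand-in for a token budget; a real agent would likely count tokens with its model's tokenizer.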
Why It's Cool
Reader's appeal is its narrow scope. It's not trying to be a general-purpose scraper for every use case. It's built for one consumer: an AI agent.
- Content-Dedicated Parsing: It uses a combination of heuristics and parsing strategies (like Mozilla's Readability) to identify the core article or content block on a page. This means your AI isn't wasting tokens analyzing "Related Articles" sidebars or cookie consent banners.
- Clean Text Output: It returns plain text. This is perfect for stuffing into an LLM context window or for further processing. No HTML tags, minimal formatting cruft—just the words that matter.
- Production-Ready Mindset: The project is built with deployment in mind. It's a self-contained service (with a Dockerfile provided) that you can run, scale, and integrate into your own agent pipelines. It's a reliable component, not just a script.
- Developer Experience: It's straightforward. A single POST request to /parse with a url gives you back exactly what you need. This reduces cognitive overhead when you're building more complex systems.
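The single-request workflow from the last bullet might look like the following Python sketch, using only the standard library. Several details are assumptions rather than facts from the project: that the request body is JSON with a "url" key, that the response is JSON, and the base URL (including the port) used in the usage example. The function names are hypothetical.

```python
import json
import urllib.request


def build_parse_request(base_url: str, target_url: str) -> urllib.request.Request:
    """Build a POST request to the /parse endpoint.

    Assumption: the service accepts a JSON body of the form {"url": ...}.
    """
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_parsed_page(base_url: str, target_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the response, assumed to be JSON."""
    request = build_parse_request(base_url, target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a Reader instance running, usage would be along the lines of `fetch_parsed_page("http://localhost:3000", "https://example.com/article")`, where the host and port are placeholders for wherever your instance listens. Keeping request construction separate from the network call makes the client easy to unit test without a live service.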
How to Try It
Getting started with Reader is straightforward. You can run it locally in a couple of minutes.
First, clone the repository:
git clone https://github.com/vakra-dev/reader
cd reader
The easiest way to run it is using Docker Compose:
docker-comp