Screenshot API for AI Agents

Updated Jun 19, 2026

AI agents read the web, but they don't see it. An agent can parse HTML, find a button by its selector, check the text, and report that everything looks fine. Meanwhile the button is invisible to the user, hidden behind a broken z-index. The DOM says one thing, the rendered page shows another. As long as the agent has no eyes, it's working blind.

This is the core gap with AI agents that interact with websites: they operate on text, not pixels. An LLM can parse markup, read JSON responses, and navigate APIs, but it cannot see what a page actually looks like when rendered in a browser. And for a growing number of tasks (UI verification, content moderation, competitor monitoring, accessibility auditing) seeing the page is the entire point.

A screenshot API fills that gap. You give the agent a tool that takes a URL and returns an image. The vision model does the rest.

What the agent sees without screenshots

When an AI agent fetches a web page, it gets one of two things: raw HTML (which includes layout markup, scripts, and inline styles that don't tell you what the page looks like) or a text extraction (which strips all visual context entirely). Neither representation captures overlapping elements, broken images, misaligned grids, color contrast issues, or the cookie banner covering half the viewport.

Vision-capable models like GPT-4o, Claude, and Gemini can analyze images with surprising accuracy. They can tell you whether a call-to-action button is visible, whether the text is readable against its background, and whether the layout collapsed on mobile. But they need the image first. That's where a screenshot API fits into the agent's tool chain: capture what the browser renders, hand the image to the vision model, let it reason about what it sees.

The capture step: one POST request, one image back

The simplest integration is a direct API call. The agent decides it needs to see a page, sends a POST request, and gets an image back. Here's the capture step:

curl -X POST https://screenshotrun.com/api/v1/screenshots \
  -H "Authorization: Bearer sr_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/checkout",
    "width": 1280,
    "height": 800,
    "format": "png",
    "block_cookies": true,
    "wait_for_selector": ".checkout-form",
    "delay": 2
  }'

The response includes a screenshot ID. Poll the status endpoint until the capture completes, then download the image URL. In Python (where most AI agent frameworks live), the full capture-and-download flow looks like this:

import requests
import time
import base64  # used by the analyze_* helpers further down

API_KEY = "sr_live_your_key_here"
BASE_URL = "https://screenshotrun.com/api/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def capture_screenshot(url, **kwargs):
    """Capture a screenshot and return the image as bytes."""
    payload = {"url": url, "format": "png", "block_cookies": True, **kwargs}
    resp = requests.post(
        f"{BASE_URL}/screenshots", headers=HEADERS, json=payload, timeout=30
    )
    resp.raise_for_status()  # surface auth and validation errors early
    screenshot_id = resp.json()["data"]["id"]

    # Poll once per second, for up to 30 attempts, until the capture finishes
    for _ in range(30):
        status_resp = requests.get(
            f"{BASE_URL}/screenshots/{screenshot_id}", headers=HEADERS, timeout=30
        )
        status_resp.raise_for_status()
        status = status_resp.json()["data"]
        if status["status"] == "completed":
            image_resp = requests.get(status["url"], timeout=30)
            image_resp.raise_for_status()
            return image_resp.content
        time.sleep(1)

    raise TimeoutError(f"Screenshot {screenshot_id} did not complete in time")

That capture_screenshot function is the tool you hand to your agent. It takes a URL, returns image bytes. Everything else (browser management, rendering, blocking cookie banners, waiting for selectors) happens on the API side.
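To make "hand it to your agent" concrete, here is a minimal sketch of a tool definition plus a dispatcher. It assumes the Anthropic Messages API tool format (name, description, input_schema); the tool name, the choice of exposed parameters, and the run_tool helper are illustrative and not part of the screenshot API.

```python
# Hypothetical tool definition in the Anthropic Messages API "tools" format.
# The parameter names mirror the request fields used in the curl example.
SCREENSHOT_TOOL = {
    "name": "capture_screenshot",
    "description": "Capture a rendered screenshot of a web page and return it as an image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Page to capture"},
            "width": {"type": "integer", "description": "Viewport width in pixels"},
            "height": {"type": "integer", "description": "Viewport height in pixels"},
            "wait_for_selector": {"type": "string", "description": "CSS selector to wait for"},
        },
        "required": ["url"],
    },
}


def run_tool(name, tool_input, capture):
    """Dispatch a tool call from the model to the capture function."""
    if name != SCREENSHOT_TOOL["name"]:
        raise ValueError(f"Unknown tool: {name}")
    allowed = SCREENSHOT_TOOL["input_schema"]["properties"]
    # Drop any arguments the schema does not declare before calling out.
    kwargs = {k: v for k, v in tool_input.items() if k in allowed and k != "url"}
    return capture(tool_input["url"], **kwargs)
```

Passing the capture function in as an argument keeps the dispatcher easy to test with a stub, and lets you swap in a different capture backend later.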

Sending the screenshot to Claude, GPT-4o, and Gemini

Once you have the image bytes, every major vision model accepts them in roughly the same format: a base64-encoded image in the message payload. Here's how the same screenshot flows to three different models.

Claude (Anthropic API):

import anthropic

def analyze_with_claude(image_bytes, prompt):
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(image_bytes).decode(),
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return message.content[0].text

GPT-4o (OpenAI API):

from openai import OpenAI

def analyze_with_gpt4o(image_bytes, prompt):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64.b64encode(image_bytes).decode()}",
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.choices[0].message.content

Gemini (Google AI API):

import google.generativeai as genai
from PIL import Image
import io

def analyze_with_gemini(image_bytes, prompt):
    genai.configure()
    model = genai.GenerativeModel("gemini-2.0-flash")
    image = Image.open(io.BytesIO(image_bytes))
    response = model.generate_content([prompt, image])
    return response.text

The usage pattern is identical across all three. Capture the screenshot once, then pass the same image bytes to whichever model your agent uses:

screenshot = capture_screenshot(
    "https://example.com/checkout",
    width=1280,
    height=800,
    wait_for_selector=".checkout-form",
)

prompt = "Is the 'Complete Purchase' button visible and clickable? Are there any UI elements overlapping it?"

# Pick your model
result = analyze_with_claude(screenshot, prompt)
# result = analyze_with_gpt4o(screenshot, prompt)
# result = analyze_with_gemini(screenshot, prompt)

Most screenshot API docs stop at "here's how to capture an image" and leave the AI integration as an exercise. But the capture is the easy part. The interesting work starts when the vision model analyzes what the browser rendered.

The MCP server path: zero-code screenshot access

If you're building with Claude Desktop or any MCP-compatible client, there's a faster path that skips the REST API entirely. The MCP server for AI agents exposes screenshot capture as a tool that Claude can call directly. No Python wrapper, no polling logic. You add the server to your config and the model gains the ability to screenshot any URL on its own.

The setup is a single JSON block in your claude_desktop_config.json:

{
  "mcpServers": {
    "screenshotrun": {
      "command": "npx",
      "args": ["-y", "screenshotrun-mcp"],
      "env": {
        "SCREENSHOTRUN_API_KEY": "sr_live_your_key_here"
      }
    }
  }
}

After restarting Claude Desktop, the model can take screenshots in conversation. Ask "show me what example.com looks like on mobile" and it calls the tool with the right viewport parameters, captures the page, and displays the result inline. Ask "is there a cookie banner blocking the content?" and it captures, inspects, and answers. The agent loop (decide to screenshot, capture, analyze, respond) happens without you writing a single line of integration code.

This is particularly useful for ad-hoc investigation tasks where you don't know upfront which pages you'll need to inspect. A QA session, a competitor audit, a support ticket where a customer reports "the page looks broken." Instead of building a dedicated tool, you hand the model a general-purpose screenshot capability and let it decide when to use it.

Getting clean captures for vision models

A screenshot that confuses the vision model is worse than no screenshot at all. If half the image is a cookie consent dialog, the model spends its attention on GDPR legalese instead of the actual page content. If the capture fires before a single-page application finishes rendering, the model sees a blank white rectangle or a loading spinner and reports "the page appears to be empty," which isn't useful.

Three parameters handle the most common problems:

block_cookies: true strips consent banners before capture. About 40% of European sites show a dialog that covers a significant portion of the viewport. For vision analysis, you want the model looking at the page, not the banner. I covered the mechanics in the block cookie banners feature page.

wait_for_selector pauses the capture until a specific DOM element exists. Modern dashboards and SPAs load data from APIs after the initial page load. The HTML shell arrives instantly, but the actual content (charts, tables, product listings) appears seconds later. If your agent is monitoring a React dashboard, "wait_for_selector": ".dashboard-loaded" ensures the capture happens after the data renders. For more on this problem, the guide on screenshotting single-page applications walks through the common failure modes.

width and height control the viewport. This matters more than it seems, because viewport size directly affects token consumption. A full-page PNG of a long landing page captured at a 1920-pixel-wide viewport can be 3-5 MB. Vision models tokenize images based on their pixel dimensions, so larger images generally cost more tokens, up to whatever size the provider downscales to. For most agent tasks, 1280x800 gives you enough detail without burning through your token budget. If you need to check mobile layouts specifically, custom viewport and device emulation lets you set 375x812 for an iPhone-like view.
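The three parameters can be bundled into a small payload builder. This is a sketch only: the preset names and the build_capture_payload helper are my own, and the field names are the ones used in the request examples earlier in this article.

```python
# Viewport presets using the sizes mentioned above; the preset names are illustrative.
VIEWPORTS = {
    "desktop": {"width": 1280, "height": 800},
    "mobile": {"width": 375, "height": 812},
}


def build_capture_payload(url, preset="desktop", wait_for_selector=None):
    """Build a request body for a clean, vision-friendly capture."""
    if preset not in VIEWPORTS:
        raise ValueError(f"Unknown preset: {preset}")
    payload = {"url": url, "format": "png", "block_cookies": True, **VIEWPORTS[preset]}
    if wait_for_selector:
        # Only include the selector when the caller knows what marks "loaded".
        payload["wait_for_selector"] = wait_for_selector
    return payload
```

For a React dashboard on mobile, that would be build_capture_payload("https://example.com/dashboard", preset="mobile", wait_for_selector=".dashboard-loaded").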

Token cost: how image size affects your API bill

Every image you send to a vision model gets tokenized. The exact formula varies by provider, but the pattern is consistent: larger images consume more tokens, and tokens cost money. This adds up fast in agentic workflows where the model might take 10-20 screenshots in a single session.

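Some practical numbers. Anthropic's documented estimate for Claude is roughly (width × height) / 750 tokens, so a 1280x800 image costs about 1,365 tokens. Larger images are downscaled before tokenizing (the documented limits are a 1,568-pixel long edge and about 1,600 tokens per image), so an oversized capture costs you upload size and lost detail rather than unlimited tokens. GPT-4o in high-detail mode uses a tile-based system: the image is scaled to fit within 2048x2048, then scaled down so its shortest side is 768 pixels, then split into 512x512 tiles at 170 tokens each plus a flat 85. A 1280x800 image works out to 6 tiles, or 1,105 tokens, and a 2560x1600 image is scaled down to the same 6 tiles. A tall full-page capture only stays legible if you split it into several viewport-sized images, and three of those already pass 4,000 tokens on Claude. These formulas change between model versions, so check the current provider docs before budgeting.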
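If you want to estimate before you send, the published formulas are easy to put into code. The sketch below assumes the rules as documented by Anthropic and OpenAI at the time of writing (Claude: pixels / 750 after downscaling to a 1,568-pixel long edge, capped near 1,600 tokens; GPT-4o high detail: 85 base tokens plus 170 per 512-pixel tile). Treat the results as approximations and the function names as illustrative.

```python
import math


def claude_image_tokens(width, height, max_long_edge=1568, token_cap=1600):
    """Approximate Claude image tokens: (width * height) / 750 after downscaling."""
    scale = min(1.0, max_long_edge / max(width, height))
    tokens = (width * scale) * (height * scale) / 750
    # Images above the cap are downscaled further, so the estimate is clamped.
    return round(min(tokens, token_cap))


def gpt4o_image_tokens(width, height):
    """Approximate GPT-4o high-detail tokens: 85 base + 170 per 512px tile."""
    # Step 1: fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

With these formulas, a 1280x800 capture comes to about 1,365 tokens for Claude and 1,105 for GPT-4o.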

Three levers to control this:

Capture at the smallest viewport that still shows what you need. 1280x800 covers desktop layouts. Don't default to 1920x1080 if you're just checking whether a button exists.
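Use JPEG or WebP format instead of PNG. Vision models handle compressed formats fine, and a JPEG at quality 80 is typically 60-70% smaller than the equivalent PNG. Token counts depend on pixel dimensions rather than file size, so this lever saves request size and upload time, not tokens: smaller file, fewer bytes to base64-encode, same visual fidelity for analysis tasks.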
Skip full_page: true unless you actually need the entire scrollable page. A full-page screenshot capture of a landing page with 8 sections produces a tall, narrow image that consumes tokens for content the model may not need. Capture the viewport only, and if the agent needs to see below the fold, let it take a second screenshot with a scroll offset.
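For the "second screenshot with a scroll offset" idea, the only logic on your side is deciding which offsets to request. A minimal sketch, assuming you know (or can estimate) the page height; plan_scroll_offsets is a hypothetical helper, and how the offset is passed to the API is not shown here.

```python
def plan_scroll_offsets(page_height, viewport_height=800, overlap=100):
    """Return the scroll offsets (in px) needed to cover a page in viewport-sized slices."""
    if viewport_height <= overlap:
        raise ValueError("viewport_height must be larger than overlap")
    step = viewport_height - overlap
    offsets = [0]
    # Keep stepping until the last slice reaches the bottom of the page.
    while offsets[-1] + viewport_height < page_height:
        offsets.append(offsets[-1] + step)
    return offsets
```

The overlap keeps elements that straddle a slice boundary visible in at least one capture. An agent can stop early once it has found what it was looking for, which is where the token savings come from.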

This isn't about being cheap. It's about building agents that can run hundreds of iterations without the API bill becoming the bottleneck. A monitoring agent that checks 50 competitor pages daily at 4,000 tokens per image burns 200,000 vision tokens per day. At 1,600 tokens per image with a smaller viewport, that's 80,000. Over a month, the difference is meaningful.
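The arithmetic behind that comparison, using the paragraph's own illustrative figures and a 30-day month (the function name is mine):

```python
def vision_tokens(pages_per_day, tokens_per_image, days=1):
    """Total vision tokens for a monitoring agent that captures each page once a day."""
    return pages_per_day * tokens_per_image * days


daily_large = vision_tokens(50, 4000)             # 200,000 tokens per day
daily_small = vision_tokens(50, 1600)             # 80,000 tokens per day
monthly_large = vision_tokens(50, 4000, days=30)  # 6,000,000 tokens per month
monthly_small = vision_tokens(50, 1600, days=30)  # 2,400,000 tokens per month
monthly_saving = monthly_large - monthly_small    # 3,600,000 tokens per month
```

Multiply the saving by your provider's input token price to turn it into a budget line.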

Why not just run Playwright yourself?

Fair question. If you're already building an AI agent in Python, Playwright has a Python package. You could spin up a browser, navigate to the URL, take a screenshot, and feed it to the model without any external API. I wrote a full breakdown in the screenshot API vs Puppeteer/Playwright comparison, but here's the short version for the AI agent use case specifically.
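For reference, the do-it-yourself version is short. This sketch uses Playwright's synchronous Python API and assumes the playwright package and its Chromium build are installed; capture_locally is an illustrative name.

```python
def capture_locally(url, width=1280, height=800):
    """Capture a viewport screenshot with a local headless Chromium via Playwright."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page(viewport={"width": width, "height": height})
            page.goto(url, wait_until="networkidle")
            return page.screenshot(type="png")  # returns the PNG as bytes
        finally:
            browser.close()
```

The function itself is not the hard part. What it leaves out (cookie banners, bot detection, proxies, memory limits) is.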

Local Playwright works fine in development. It breaks in production in ways that are annoying to debug and expensive to maintain. Headless Chromium consumes 200-500 MB of RAM per instance. Some sites detect headless browsers and serve different content or block the request entirely. Cookie consent banners require site-specific logic to dismiss. Proxy rotation for geo-targeted captures needs infrastructure. And if your agent runs in a serverless environment (Lambda, Cloud Functions, a containerized worker), bundling Chromium adds 300+ MB to your deployment package and lengthens cold starts.

An API call is 200 bytes of JSON. The browser runs somewhere else. Anti-bot handling, cookie blocking, waiting for pages to fully load, and fixing blank images caused by lazy loading are all problems you're paying someone else to solve. For an AI agent that needs screenshots as one tool among many, that tradeoff makes sense. For a dedicated screenshot pipeline where you need full control over the browser, self-hosting might be the better call.

Agent workflow patterns that work well

I've seen three patterns emerge as AI agents start using screenshots as a regular tool.

Monitoring is the most common. An agent runs on a schedule, captures a set of pages daily (or hourly), and feeds each screenshot to a vision model with a specific question: "Has the pricing changed?" "Is there a new banner?" "Does the layout look broken?" The model's response gets logged or triggers an alert. This works well for competitor tracking, uptime verification, and compliance checks. It pairs naturally with visual regression testing workflows where you're comparing the current capture against a baseline.
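A minimal sketch of the monitoring loop, assuming capture and analyze are callables shaped like capture_screenshot and the analyze_with_* helpers from earlier. The Check class, the YES/NO prompt convention, and run_monitoring are illustrative choices, and scheduling is left to cron or whatever you already use.

```python
from dataclasses import dataclass


@dataclass
class Check:
    url: str
    question: str


def run_monitoring(checks, capture, analyze, alert_keyword="YES"):
    """Capture each page, ask the model one question, and collect the answers that should alert."""
    alerts = []
    for check in checks:
        image = capture(check.url)
        # Ask for a constrained answer so the result is easy to act on.
        answer = analyze(image, f"{check.question} Answer YES or NO, then explain briefly.")
        if answer.strip().upper().startswith(alert_keyword):
            alerts.append((check.url, answer))
    return alerts
```

Constraining the first word of the answer is a pragmatic way to turn free-form model output into a trigger. Log the full answer as well, since the explanation is what a human will want to read when the alert fires.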

Post-deploy QA is the second pattern. An agent navigates a checklist of pages after each release and verifies visual correctness. "Is the hero image loading?" "Is the CTA button visible above the fold?" "Does the mobile layout show the hamburger menu?" This is the hidden-button problem from the opening paragraph, done properly. The agent doesn't just check the DOM. It looks at the rendered page and confirms that what's supposed to be visible is actually visible.
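The QA variant differs mainly in shape: several questions per page, one capture reused for all of them, and a pass/fail result your CI can read. A sketch under the same assumptions about capture and analyze; the PASS/FAIL prompt convention and run_qa_checklist are hypothetical.

```python
def run_qa_checklist(checklist, capture, analyze):
    """Run yes/no visual checks per page.

    checklist maps each URL to a list of questions.
    Returns {(url, question): True if the model answered PASS}.
    """
    results = {}
    for url, questions in checklist.items():
        image = capture(url)  # one capture per page, reused for every question
        for question in questions:
            answer = analyze(image, f"{question} Reply with PASS or FAIL only.")
            results[(url, question)] = answer.strip().upper().startswith("PASS")
    return results
```

Anything that is not an explicit PASS counts as a failure, which is the safer default for a release gate.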

Content extraction rounds out the three. Some sites render pricing tables as canvas elements or SVG graphics that don't appear in the HTML. Others use anti-scraping techniques that block HTTP requests but allow normal browser rendering. A screenshot plus vision model extracts the information visually, the same way a human would. You can also convert HTML to image for generating visual snapshots of your own content, then feed those to the model for verification.

For building automated link previews, the screenshot-plus-vision approach also lets you generate descriptive alt text or summaries alongside the thumbnail.

What screenshots can't do for your agent

I should be honest about what screenshots don't solve for AI agents.

Screenshots are static. They show one moment in time, one viewport, one scroll position. If the bug only appears during a hover interaction, an animation sequence, or after scrolling to a specific element, a single screenshot won't catch it. You can work around this with multiple captures at different scroll positions, or by using the css parameter to force hover states, but it's a workaround, not a solution. Claude's computer use feature handles interactive flows better, and a screenshot API complements that by providing clean, high-quality captures for the static verification steps.

Vision model accuracy varies. GPT-4o and Claude are good at identifying major layout issues, reading text from images, and spotting obvious visual problems. They're less reliable at catching subtle issues like a 2-pixel misalignment, a slightly wrong shade of brand blue, or a font weight that's 400 instead of 500. For pixel-perfect verification, you still need deterministic tooling (pixel-diff libraries, not probabilistic models).
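For contrast, this is what the deterministic side looks like at its simplest. The sketch assumes both screenshots have already been decoded into equal-length sequences of (r, g, b) tuples with an imaging library of your choice; diff_ratio is an illustrative helper, not a replacement for a real pixel-diff tool.

```python
def diff_ratio(baseline, current, tolerance=0):
    """Fraction of pixels where any channel differs by more than `tolerance`.

    Both images are equal-length sequences of (r, g, b) tuples.
    """
    if len(baseline) != len(current):
        raise ValueError("Images must have the same number of pixels")
    if not baseline:
        return 0.0
    changed = sum(
        1
        for a, b in zip(baseline, current)
        if any(abs(x - y) > tolerance for x, y in zip(a, b))
    )
    return changed / len(baseline)
```

A check like this flags the slightly-wrong shade of blue every time, but it cannot tell you what changed or whether it matters. That is the part the vision model is good at.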

And latency matters for agent loops. A screenshot capture takes 3-5 seconds. Sending the image to a vision model takes another 2-5 seconds for the response. If your agent needs to take 20 screenshots in sequence, that's roughly 2-3 minutes of wall-clock time (20 × 5-10 seconds) just for the visual checks. Parallelizing the captures helps, but vision model calls are typically sequential within a conversation. Build your agent workflows with this latency budget in mind.
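Parallelizing the capture half is straightforward with a thread pool, since each capture spends most of its time waiting on the network. A sketch, assuming capture is a blocking callable like capture_screenshot; capture_many and the default of five workers are illustrative, and your plan's rate limits may call for a lower number.

```python
from concurrent.futures import ThreadPoolExecutor


def capture_many(urls, capture, max_workers=5):
    """Capture several pages concurrently; returns {url: image bytes or the exception raised}."""
    def safe_capture(url):
        try:
            return capture(url)
        except Exception as exc:  # keep one failure from sinking the whole batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with urls.
        results = list(pool.map(safe_capture, urls))
    return dict(zip(urls, results))
```

This shortens the capture phase only. The vision calls that follow still run at their own pace.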

Give your AI agents eyes on the web

Get your API key — 200 free screenshots/month

The gap between what AI agents can read and what they can see is closing fast. Vision models are already good enough for most practical verification tasks, and they're improving every quarter. The missing piece has been a clean, reliable way to get screenshots into the agent's context. A REST API call or an MCP server tool fills that gap without the infrastructure headache of self-hosted browsers. If you're building agents that interact with the web, giving them the ability to actually look at it changes what they can do.

Frequently asked questions

How do I give an AI agent the ability to see web pages?

Capture a screenshot by sending a POST request to the API with the target URL and parameters (viewport size, format, wait conditions). The API returns an image. Base64-encode the image and include it in the vision model's message payload. Claude, GPT-4o, and Gemini all accept base64 images. Your agent calls the capture function whenever it needs to see a page, then reasons about the image using the vision model.

What is an MCP server for screenshots?

An MCP (Model Context Protocol) server exposes screenshot capture as a tool that Claude Desktop can call directly, without writing API integration code. You add the server configuration to your claude_desktop_config.json file, and Claude gains the ability to screenshot any URL during a conversation. The model decides when to take a screenshot, calls the tool, and analyzes the result within the chat interface.

How do I reduce vision token costs for screenshots?

Smaller images consume fewer tokens, up to the size at which the provider downscales them. Capturing at 1280x800 instead of 1920x1080 roughly halves the pixel count; how much of that shows up as token savings depends on the model's image formula. JPEG or WebP instead of PNG shrinks the upload but does not change the token count. Avoid full_page: true unless you need the entire scrollable page. These optimizations add up quickly when an agent processes dozens or hundreds of screenshots per session.

Why use a screenshot API instead of running Playwright myself?

A screenshot API eliminates the need to bundle and manage headless Chromium (200-500 MB RAM per instance), handle anti-bot detection, dismiss cookie consent banners, rotate proxies for geo-targeted captures, and manage browser crashes. An API call is a 200-byte JSON request. This is especially important for agents running in serverless environments where bundling Chromium adds 300+ MB to deployment size.

Which agent workflows benefit from screenshots?

Three common patterns: monitoring agents that capture pages on a schedule and check for visual changes (pricing updates, layout breakage, new banners); QA agents that verify UI correctness after deploys by looking at rendered pages instead of just parsing HTML; and content extraction agents that use vision models to read information from pages that resist traditional scraping.