Two Tools for Seeing What Your Agent Built

Rod Bland · 6 min read

AI agents can write code, but they can’t see what it looks like. That’s a problem when you’re shipping web apps. I built two tools to close this gap: screenshot.py captures web pages through a headless browser, and grab.py captures whatever’s on my physical screen. Together they give my agents (and me) eyes on everything.

Both tools are Python 3.10+ command-line scripts. screenshot.py uses Playwright (headless Chromium — no display needed, works on servers and CI). grab.py uses mss (requires a physical or virtual display). Both are open source in the agent-screenshot repo.

The screenshot tool

screenshot.py takes automated screenshots of any URL using headless Chromium and saves the result as a JPEG.

Basic usage

python screenshot.py https://example.com

python screenshot.py https://example.com --mobile

python screenshot.py https://example.com --dismiss-popups --wait-until load --wait 3000

The --dismiss-popups flag auto-closes cookie banners, geo-redirect modals, and email signup overlays. Without it, half your screenshots are obscured by “Subscribe to our newsletter!” popups.
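A plausible way to implement this, sketched here as an illustration (the selector list and function name are my assumptions, not the tool's actual code): try a list of selectors commonly used by consent and signup overlays, and click whichever ones are present.

```python
# Hypothetical sketch of popup dismissal: attempt each known selector with a
# short timeout and ignore the ones that don't match on this page.
COMMON_POPUP_SELECTORS = [
    "#onetrust-accept-btn-handler",        # OneTrust cookie banner
    "button[aria-label='Close']",          # generic modal close button
    "[class*='cookie'] button",            # assorted cookie banners
    "[id*='newsletter'] [aria-label='Close']",
]

def dismiss_popups(page, selectors=COMMON_POPUP_SELECTORS):
    """Click each matching overlay element; return how many were dismissed."""
    dismissed = 0
    for selector in selectors:
        try:
            # Short timeout: most selectors won't match most pages.
            page.click(selector, timeout=500)
            dismissed += 1
        except Exception:
            pass  # selector not present; move on
    return dismissed
```

The short per-selector timeout matters: with Playwright's default 30-second timeout, a list of selectors that mostly don't match would stall the whole capture.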

Full-page tiling for Vision models

This is the feature that makes the tool genuinely useful for AI-assisted development. When you pass --full-page, the tool captures the entire scrollable page and then tiles it into 1072x1072 pixel chunks:

python screenshot.py https://example.com --full-page

Why 1072x1072? Vision models (Claude, GPT-4o, Gemini) have a maximum effective resolution. Feed a 1072x15000 pixel full-page screenshot and the model can’t process the detail — text becomes unreadable, UI elements blur together. But split that same page into a sequence of 1072x1072 tiles and the model can read every label, every button, every table cell.

The tool handles this automatically. A long page might produce 6-8 tiles. Each one is saved as a separate file, and the agent processes them in sequence. The result is that your AI agent can “see” an entire web page in full detail, not just a viewport-sized slice of it.

There’s also a safety valve: --max-height defaults to 15000 pixels. Pages longer than that get truncated before tiling, preventing a runaway infinite-scroll page from producing hundreds of tiles.
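The tiling geometry itself is simple arithmetic. Here's a sketch of how the crop boxes could be computed (function and constant names are mine, not the tool's), including the --max-height clamp:

```python
# Hypothetical sketch of full-page tiling: clamp page height to max_height,
# then cover the page with 1072x1072 crop boxes. Edge tiles are clipped to
# the page bounds rather than padded.
TILE = 1072

def tile_boxes(page_width, page_height, tile=TILE, max_height=15000):
    """Return (left, top, right, bottom) crop boxes covering the page."""
    height = min(page_height, max_height)  # the --max-height safety valve
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, page_width, tile):
            boxes.append((left, top,
                          min(left + tile, page_width),
                          min(top + tile, height)))
    return boxes
```

For a 1072-wide, 7000-pixel-tall page this yields 7 tiles; a 20000-pixel page is clamped to 15000 first, so the tile count stays bounded no matter how long the page scrolls.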

The mandatory verification step

In my workflow, screenshots aren't optional. Every code change that affects the UI gets screenshotted before it's reported as done. My agents are instructed to take a screenshot after any visual change and actually look at the result.

This catches real bugs. An agent might change a CSS class and report “done” without realising it broke the layout on a different section of the page. The screenshot makes that visible. I’ve lost count of the times a post-change screenshot revealed a problem that the agent would have missed if it had just read the code.

Quick reference

Flag              Purpose
--full-page       Tile entire page into 1072x1072 chunks
--mobile          Mobile viewport (375x812)
--dismiss-popups  Auto-close cookie/popup overlays
--selector        Screenshot a specific CSS element
--wait N          Wait N ms after page load
--quality N       JPEG quality 1-100 (default 85)
--out /path       Output directory
--header K=V      Custom HTTP header (repeatable)

The grab tool

grab.py solves a different problem. Where screenshot.py renders web pages through a headless browser, grab.py captures the actual physical desktop screen — whatever’s on the monitor right now.

Requires a display. grab.py needs X11 or Wayland (Linux), a desktop session (macOS/Windows), or WSLg. It won’t work on headless servers or CI runners without a display.

How it works

The tool uses Python’s mss library to capture the display, with 14 region presets for common screen layouts:

python grab.py              # Full screen
python grab.py left         # Left half
python grab.py right        # Right half
python grab.py top-left     # Top-left quadrant

The output is a single JPEG, sized to keep the file manageable (typically 200-800KB).
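The region presets boil down to crop boxes expressed as fractions of the monitor. A sketch of a few of them (the fraction table and function are my illustration, not grab.py's actual code):

```python
# Hypothetical sketch of region presets: each name maps to a crop box given
# as (left, top, right, bottom) fractions of the monitor, converted to pixels.
PRESETS = {
    "full":         (0.0, 0.0, 1.0, 1.0),
    "left":         (0.0, 0.0, 0.5, 1.0),
    "right":        (0.5, 0.0, 1.0, 1.0),
    "top-left":     (0.0, 0.0, 0.5, 0.5),
    "top-right":    (0.5, 0.0, 1.0, 0.5),
    "bottom-left":  (0.0, 0.5, 0.5, 1.0),
    "bottom-right": (0.5, 0.5, 1.0, 1.0),
}

def region_box(preset, monitor_width, monitor_height):
    """Convert a preset name to a pixel crop box (left, top, right, bottom)."""
    left, top, right, bottom = PRESETS[preset]
    return (int(left * monitor_width), int(top * monitor_height),
            int(right * monitor_width), int(bottom * monitor_height))
```

On a 1920x1080 monitor, "left" becomes the box (0, 0, 960, 1080), which Pillow can crop directly from the full capture.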

Why this exists

Sometimes an agent needs to see something that isn’t a web page. A design mockup in Figma. A spreadsheet in Google Sheets. An error dialog in a desktop app. Anything on the screen that’s easier to show than describe.

The workflow: arrange what you want the agent to see on your monitor, run grab.py, and the agent reads the captured image. It preserves visual context that text can’t capture — layout, colours, spatial relationships between elements.

Under the hood, mss captures the screen in about 3 lines. grab.py adds region cropping via Pillow and wraps it in a command-line interface with JPEG quality control.


Get both tools

Both tools are open source and packaged together: agent-screenshot on GitHub

You’ll need Python 3.10+ and git.

Install

git clone https://github.com/rodbland2021/agent-screenshot.git
cd agent-screenshot
pip install -r requirements.txt
playwright install chromium

Verify it works:

python screenshot.py https://example.com
# Should print a file path like /tmp/screenshots/example-com_1234567890.jpg

Add to your agent’s workflow

The real value is making screenshots automatic — part of every UI change, not an afterthought.

Claude Code — add to your CLAUDE.md:

## Screenshots
After any visual change, verify with a screenshot:
python /path/to/screenshot.py <url> --full-page
Read the screenshots to check for visual regressions before reporting done.

Cursor — same instruction in .cursorrules. Aider — add to .aider.conf.yml. OpenClaw — add to your agent’s system prompt or AGENTS.md.

The pattern is the same regardless of tool: tell your agent to screenshot after visual changes and check the result before reporting done.

Requirements

  • Python 3.10+ on Linux, macOS, or Windows (including WSL)
  • screenshot.py: Playwright + Chromium. Works on any platform including headless servers, Docker, and CI.
  • grab.py: mss + Pillow. Requires a physical or virtual display (X11, Wayland, macOS desktop, or Windows).

Two tools, two purposes

  • screenshot.py — renders a URL through a headless browser. Works anytime, no display needed. Used for automated verification of web apps.
  • grab.py — captures the physical display. Requires a monitor. Used for sharing visual context that isn’t a web page.

They complement each other. screenshot.py is part of the automated pipeline — every UI change triggers one. grab.py is for ad-hoc situations where you need an agent to see what you see. Between the two, the agents aren’t blind anymore.