Anthropic's Computer Use capability applied to back-office RPA. A Dockerized Ubuntu 22.04 VM (Xvfb + VNC + Chrome + LibreOffice) is the sandbox. Each turn: capture screenshot → Claude with the computer_20250124 tool decides the next action → xdotool executes the click / type / key / scroll → verify → repeat. A safety layer blocks rm -rf, sudo, and network egress outside a whitelist. 20-task eval suite + OSWorld subset.
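A minimal sketch of one turn of that loop, assuming the anthropic Python SDK, scrot for capture, and a DISPLAY already exported inside the container; the model id, display size, and helper names are illustrative assumptions, and the real loop would also return a tool_result to the model and route every action through the safety layer first.

```python
import base64
import subprocess

import anthropic

client = anthropic.Anthropic()

def screenshot_b64() -> str:
    """Capture the Xvfb display (DISPLAY assumed set, e.g. :1) and return base64 PNG."""
    subprocess.run(["scrot", "-o", "/tmp/screen.png"], check=True)
    with open("/tmp/screen.png", "rb") as f:
        return base64.b64encode(f.read()).decode()

def execute(action: dict) -> None:
    """Translate a computer-use tool action into an xdotool call."""
    if action["action"] == "left_click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"], check=True)
    elif action["action"] == "type":
        subprocess.run(["xdotool", "type", "--delay", "50", action["text"]], check=True)
    elif action["action"] == "key":
        subprocess.run(["xdotool", "key", action["text"]], check=True)

def turn(task: str) -> None:
    """One screenshot → decide → act turn."""
    response = client.beta.messages.create(
        model="claude-3-7-sonnet-20250219",   # assumed model id
        max_tokens=1024,
        betas=["computer-use-2025-01-24"],
        tools=[{
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/png", "data": screenshot_b64()}},
            ],
        }],
    )
    for block in response.content:
        if block.type == "tool_use":
            execute(block.input)   # the safety layer would intercept here before execution
```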
SAP modules from 2008, hospital scheduling systems, insurance back-office tools. Integrators charge $200K to wire each one. A vision-driven agent that operates the actual UI — like a human contractor — is 100× cheaper if you can make it reliable.
UiPath, Blue Prism, etc. record XPath / element selectors. A button moves 20px; the bot breaks. A modal appears unexpectedly; the bot freezes. Maintenance is permanent.
And they can't reason. "Find the report from last quarter" requires understanding what "last quarter" means in the context of today's date and the report names visible on screen.
Each turn the model sees a fresh screenshot — same as a human would. It doesn't matter if the button moved, the layout reflowed, or a modal appeared: the model adapts.
LangGraph keeps the screenshot history in state. The Verifier node asks "did that step do what I expected?" after every action. Failed verifications trigger replanning.
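A sketch of that plan → act → verify loop as a LangGraph state machine; the node bodies are stubs, and how screenshots are stored and how the Verifier prompt is phrased are assumptions rather than the project's exact code.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

class AgentState(TypedDict):
    task: str
    screenshots: Annotated[list[str], operator.add]  # base64 history, appended every turn
    last_action: dict
    verified: bool

def plan(state: AgentState) -> dict:
    # Ask Claude for the next action given the task and the screenshot history.
    return {"last_action": {"action": "left_click", "coordinate": [640, 400]}}

def act(state: AgentState) -> dict:
    # Execute last_action via xdotool, then append a fresh screenshot to state.
    return {"screenshots": ["<new base64 screenshot>"]}

def verify(state: AgentState) -> dict:
    # Ask the model "did that step do what I expected?" by comparing
    # the pre- and post-action screenshots.
    return {"verified": True}

def route(state: AgentState) -> str:
    # Failed verification loops back to planning; success ends the run.
    return "done" if state["verified"] else "replan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_node("verify", verify)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_edge("act", "verify")
graph.add_conditional_edges("verify", route, {"replan": "plan", "done": END})
app = graph.compile()
```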
The safety layer intercepts actions before they hit the VM: command parsing blocks destructive commands and paths, network egress is restricted to the LLM API and localhost, and 47 sandbox-escape scenarios are pre-tested.
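A hedged sketch of that pre-execution filter; the specific patterns, the host whitelist, and the function names are illustrative assumptions, not the full rule set.

```python
import re
from urllib.parse import urlparse

BLOCKED_COMMANDS = [
    r"\brm\s+-[a-z]*r[a-z]*f",   # rm -rf and variants
    r"\bsudo\b",
    r"\bmkfs\b",
    r"\bdd\s+if=",
]
ALLOWED_HOSTS = {"api.anthropic.com", "localhost", "127.0.0.1"}

def allow_command(cmd: str) -> bool:
    """Reject shell commands matching any destructive pattern."""
    return not any(re.search(p, cmd) for p in BLOCKED_COMMANDS)

def allow_url(url: str) -> bool:
    """Reject network egress to anything outside the whitelist."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```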
VM container with Xvfb + VNC, Python deps with Computer Use SDK, safety filter verification (47 attack scenarios), 20-task custom eval, OSWorld office subset, Next.js dashboard with live VNC streaming.
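For a sense of what one entry in the 20-task eval might look like, here is an illustrative shape; the field names, paths, and checker are assumptions about the format, not the actual suite.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class EvalTask:
    name: str
    instruction: str               # natural-language goal given to the agent
    setup_cmd: str                 # shell command run inside the VM before the task
    check: Callable[[], bool]      # programmatic success check after the run
    max_steps: int = 25

def _report_renamed() -> bool:
    # Hypothetical success condition: the agent renamed last quarter's report.
    return Path("/home/agent/reports/q3_2024.ods").exists()

TASKS = [
    EvalTask(
        name="rename_quarterly_report",
        instruction="Find the report from last quarter and rename it to q3_2024.ods",
        setup_cmd="cp /fixtures/report_2024-09.ods /home/agent/reports/",
        check=_report_renamed,
    ),
]
```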
Live VNC mockup on the left, agent reasoning log on the right. Each step shows the cursor moving, clicking through menus, typing the filename, verifying the file was saved by opening a terminal and running head. The dashboard above tracks step count and accumulated cost.