P05 · Computer Use · VM · Multimodal

An agent that opens apps, fills forms, exports CSVs — on systems that never had an API.

Anthropic's Computer Use applied to back-office RPA. A Dockerized Ubuntu 22.04 VM (Xvfb + VNC + Chrome + LibreOffice) is the sandbox. Each turn: capture a screenshot → Claude, with the computer_20250124 tool, decides the next action → xdotool executes the click / type / key / scroll → verify → repeat. A safety layer blocks rm -rf, sudo, and network egress outside the allowlist. Evaluation: a 20-task custom suite plus an OSWorld subset.
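A minimal sketch of the "xdotool executes" step, assuming the model's actions arrive as dicts shaped like the Computer Use tool input (`action`, `coordinate`, `text`, `scroll_direction`, `scroll_amount`). The function name `build_xdotool_argv` and the 12 ms typing delay are illustrative choices, not part of this project. The function only builds the argument list; running it (for example with `subprocess.run`) is left to the caller so the safety layer can inspect it first.

```python
from typing import Any

# X11 button numbers used by xdotool: 4/5 are wheel up/down, 6/7 wheel left/right.
_SCROLL_BUTTON = {"up": "4", "down": "5", "left": "6", "right": "7"}


def build_xdotool_argv(action: dict[str, Any]) -> list[str]:
    """Translate one tool action into an xdotool argument list (nothing is executed)."""
    kind = action["action"]
    if kind == "left_click":
        x, y = action["coordinate"]
        return ["xdotool", "mousemove", str(x), str(y), "click", "1"]
    if kind == "type":
        # "--" ends option parsing so text that starts with "-" is typed literally.
        return ["xdotool", "type", "--delay", "12", "--", action["text"]]
    if kind == "key":
        # Key combos use xdotool names, e.g. "ctrl+s" or "Return".
        return ["xdotool", "key", action["text"]]
    if kind == "scroll":
        x, y = action["coordinate"]
        button = _SCROLL_BUTTON[action["scroll_direction"]]
        repeat = str(action.get("scroll_amount", 1))
        return ["xdotool", "mousemove", str(x), str(y),
                "click", "--repeat", repeat, button]
    raise ValueError(f"unsupported action: {kind!r}")
```

Returning a list rather than a shell string avoids quoting problems when the typed text contains spaces or shell metacharacters.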

Status: Planned · README only · phase 2 · weeks 13–15
Environment: Docker · Ubuntu 22.04 · Xvfb + VNC + xdotool
Benchmarks: OSWorld · WebArena · + 20 custom tasks
Target success: ≥ 75% overall · vs 38% Computer Use baseline
01 · The problem

Most enterprise software wasn't built with an API.

SAP modules from 2008, hospital scheduling systems, insurance back-office tools. Integrators charge $200K to wire each one. A vision-driven agent that operates the actual UI — like a human contractor — is 100× cheaper if you can make it reliable.

Why traditional RPA fails

Selector-based bots break on every UI tweak.

UiPath, Blue Prism, etc. record XPath / element selectors. A button moves 20px, the bot breaks. A modal appears unexpectedly, the bot freezes. Maintenance is permanent.

And they can't reason. "Find the report from last quarter" requires understanding what "last quarter" means in the context of today's date and the report names visible on screen.

What Computer Use changes

Vision over the actual screen, reasoning at every step.

Each turn the model sees a fresh screenshot — same as a human would. It doesn't matter if the button moved, the layout reflowed, or a modal appeared: the model adapts.

LangGraph keeps the screenshot history in state. The Verifier node asks "did that step do what I expected?" after every action. Failed verifications trigger replanning.
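The capture → decide → act → verify cycle can be sketched without any framework. This is a plain-Python stand-in, not LangGraph code: the four callables, the `LoopState` fields, and the history length of 5 are assumptions made for illustration; only the 50-step cap comes from the system diagram.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_STEPS = 50    # step cap from the system diagram
HISTORY_LEN = 5   # assumed size of the rolling screenshot window


@dataclass
class LoopState:
    task: str
    screenshots: deque = field(default_factory=lambda: deque(maxlen=HISTORY_LEN))
    steps: int = 0
    replans: int = 0
    last_step_failed: bool = False
    done: bool = False


def run_loop(
    task: str,
    capture: Callable[[], bytes],
    decide: Callable[[LoopState], Optional[str]],
    act: Callable[[str], None],
    verify: Callable[[str, bytes], bool],
) -> LoopState:
    """Run until the model reports completion or the step cap is reached."""
    state = LoopState(task=task)
    state.screenshots.append(capture())
    while state.steps < MAX_STEPS:
        action = decide(state)          # None stands for "task_complete"
        if action is None:
            state.done = True
            break
        act(action)
        state.steps += 1
        state.screenshots.append(capture())
        # A failed check is recorded so the next decide() call can replan.
        state.last_step_failed = not verify(action, state.screenshots[-1])
        if state.last_step_failed:
            state.replans += 1
    return state
```

The `deque(maxlen=...)` keeps only the most recent screenshots, which bounds both memory and the number of images sent to the model per turn.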

The safety layer intercepts actions before they reach the VM: command parsing blocks destructive commands and sensitive paths, network egress is restricted to the LLM API and localhost, and 47 sandbox-escape scenarios are pre-tested.
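One way to picture the command-parsing part is a small deny-list matcher. The rule names and regular expressions below are assumptions for illustration; they cover the blocked items named on this page but are not an exhaustive policy (for example, `rm -r -f` with split flags would slip past the first rule).

```python
import re
from typing import Optional

# (rule name, compiled pattern). Illustrative patterns only.
BLOCK_RULES = [
    ("recursive force delete",
     re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.IGNORECASE)),
    ("privilege escalation", re.compile(r"\bsudo\b")),
    ("world-writable chmod", re.compile(r"\bchmod\s+(-\w+\s+)*777\b")),
    ("sensitive path", re.compile(r"/etc/(passwd|shadow)\b|\.ssh\b")),
    ("network pipe to shell",
     re.compile(r"\b(curl|wget)\b[^|;&]*\|\s*(sudo\s+)?(ba|z|da)?sh\b")),
]


def check_command(command: str) -> Optional[str]:
    """Return the name of the first rule that blocks the command, or None if it passes."""
    for name, pattern in BLOCK_RULES:
        if pattern.search(command):
            return name
    return None
```

Returning the rule name rather than a bare boolean lets the audit log record why an action was denied.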

02 · System diagram

Capture, reason, act, verify — bounded by safety.

// Computer Use loop · screenshot → Claude → xdotool → verify → repeat
Task (NL): "Extract Q3 sales as CSV"
Planner: Claude → initial step plan
LangGraph loop · max 50 steps: Capture (scrot @ Xvfb) → Vision + Reason (Claude · computer_use) → Act (xdotool) → Verify (Claude check) → continue or task_complete
Safety Layer · intercepts before VM
blocked: rm -rf · sudo · /etc/passwd · /etc/shadow · ~/.ssh · curl | sh · wget | bash
network egress: anthropic.com + localhost only
Final Validator: Claude checks whether the output matches the original task description
Audit Log: all actions · all screenshots · decisions · LangSmith trace_id
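The audit log in the diagram could be written as one JSON line per loop step. This is a sketch under assumed field names (`ts`, `reasoning`, `blocked_by_safety`, `screenshot_sha256`); only the idea of recording actions, screenshots, decisions, and a trace id comes from the diagram.

```python
import hashlib
import json
import time
from typing import Any


def audit_record(
    trace_id: str,
    step: int,
    action: dict[str, Any],
    reasoning: str,
    screenshot_png: bytes,
    blocked: bool = False,
) -> str:
    """Serialise one loop step as a single JSON line."""
    entry = {
        "ts": time.time(),
        "trace_id": trace_id,
        "step": step,
        "action": action,
        "reasoning": reasoning,
        "blocked_by_safety": blocked,
        # Store a digest here; the PNG itself would be saved separately on disk.
        "screenshot_sha256": hashlib.sha256(screenshot_png).hexdigest(),
    }
    return json.dumps(entry, sort_keys=True)
```

Hashing the screenshot keeps each log line small while still letting a reviewer match a log entry to the saved image.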
03 · Demo 1 of 2 · Install & benchmark

Six steps from docker compose up to a benchmark + dashboard.

VM container with Xvfb + VNC, Python deps with Computer Use SDK, safety filter verification (47 attack scenarios), 20-task custom eval, OSWorld office subset, Next.js dashboard with live VNC streaming.

Demo 01
VM bring-up · safety verify · 20-task eval
6 steps · 58s · docker + xdotool + Claude computer_use
04 · Demo 2 of 2 · Agent at the desktop

Watch the agent extract Q3 sales from a legacy ERP — 15 steps, ~80 seconds.

Live VNC mockup on the left, agent reasoning log on the right. Each step shows the cursor moving, clicking through menus, typing the filename, verifying the file was saved by opening a terminal and running head. The dashboard above tracks step count and accumulated cost.
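The "verify the file was saved" step can also be checked programmatically rather than by reading terminal output from a screenshot. A minimal sketch, assuming the export is a UTF-8 CSV with a header row; the function name and the column names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def verify_csv_export(path: str, required_columns: list[str], min_rows: int = 1) -> tuple[bool, str]:
    """Check that the exported CSV exists, has the expected header, and contains data."""
    p = Path(path)
    if not p.is_file():
        return False, "file not found"
    with p.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [c for c in required_columns if c not in header]
        if missing:
            return False, f"missing columns: {missing}"
        rows = sum(1 for _ in reader)   # count data rows after the header
    if rows < min_rows:
        return False, f"only {rows} data rows"
    return True, f"{rows} data rows"
```

A call such as `verify_csv_export("/home/agent/q3_sales.csv", ["region", "revenue"])` returns a pass/fail flag plus a short reason that can go straight into the reasoning log.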

Demo 02
Task: extract Q3 sales as CSV from legacy ERP
15 actions · 80s · click → type → key → verify → done
05 · Stack

Sandbox stack + Computer Use SDK.

Stack — pinned

VM & display: Ubuntu 22.04 · Xvfb · noVNC / websockify · xdotool · scrot
Agent: anthropic 0.39.0 · Computer Use 20250124 · LangGraph 0.2.45
Safety: command pattern matcher · iptables egress filter · read-only mount checks
Frontend: Next.js 14 · noVNC client · @xterm/xterm

Safety boundaries (47 scenarios pre-tested)

blocked
rm -rf /, sudo, chmod 777, anything piped from network to shell.
blocked
Reads of /etc/passwd, /etc/shadow, ~/.ssh/. Requests that Claude frames as a "demonstration" are denied as well.
blocked
Network egress to anything except the LLM API and localhost. The VM can't exfiltrate user data.
allowed
All local file I/O within /home/agent/, all UI automation via xdotool, all reads from whitelisted paths.
contained
VM cannot escape its container. Resource limits prevent fork bombs. Crash → cold restart, no state leak.
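The path rules above (allow /home/agent/, deny ~/.ssh and the /etc credential files) can be sketched as a normalise-then-check function. Assumptions: the agent's home is /home/agent, the deny list wins over the allow rule, and no extra whitelisted read paths are configured. `posixpath.normpath` collapses `..` segments but does not resolve symlinks, so a real check would need to run against resolved paths inside the VM.

```python
import posixpath

AGENT_HOME = "/home/agent"
# Deny rules take precedence, so ~/.ssh is blocked even though it sits under the home dir.
DENIED_PREFIXES = ["/etc/passwd", "/etc/shadow", AGENT_HOME + "/.ssh"]


def is_path_allowed(raw: str, home: str = AGENT_HOME) -> bool:
    """Return True only for paths that normalise to somewhere inside the agent's home."""
    if raw == "~" or raw.startswith("~/"):
        path = home + raw[1:]
    elif not raw.startswith("/"):
        path = posixpath.join(home, raw)   # treat relative paths as relative to home
    else:
        path = raw
    path = posixpath.normpath(path)        # collapses "a/../b" style traversal
    for denied in DENIED_PREFIXES:
        if path == denied or path.startswith(denied + "/"):
            return False
    return path == home or path.startswith(home + "/")
```

Comparing against `home + "/"` rather than the bare prefix keeps a sibling directory such as /home/agentx from passing the check.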
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. Dockerized Ubuntu 22.04 + Xvfb + VNC environment defined in docker-compose.yml
  2. xdotool action wrappers (click / type / key / scroll) in src/actions/
  3. Screenshot capture via scrot wired in src/perception/
  4. LangGraph loop with rolling screenshot history (src/agents/orchestrator.py)
  5. Action executor with safety filters: rm -rf, sudo, financial txns, mass deletes are blocked
  6. 20 custom-task eval set with ground-truth success criteria (data/eval/tasks.jsonl)
  7. OSWorld subset comparison report in docs/osworld_comparison.md
  8. Animated screenshot-replay demo in /projects/05-computer-use.html
  9. Five recorded screenshot-trace samples in data/recordings/
  10. Safety layer documented in docs/safety_policy.md (full blocked-action list)
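Scoring the eval set against the ≥ 75% target reduces to a success rate over per-task results. A minimal sketch, assuming one JSON object per line with a boolean `success` field; the record layout and the `task_id` values in the usage note are hypothetical, since the schema of data/eval/tasks.jsonl is not specified here.

```python
import json

TARGET_SUCCESS = 0.75   # overall target stated on this page


def success_rate(results_jsonl: str) -> float:
    """Fraction of tasks whose result record has success == true."""
    results = [json.loads(line) for line in results_jsonl.splitlines() if line.strip()]
    if not results:
        raise ValueError("no results to score")
    return sum(1 for r in results if r["success"]) / len(results)


def meets_target(rate: float, target: float = TARGET_SUCCESS) -> bool:
    """True when the measured rate reaches the target."""
    return rate >= target
```

For example, four result lines with `task_id` values t01 to t04, three of them successful, give a rate of 0.75, which just meets the target.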