Anthropic's Computer Use capability applied to back-office RPA. A Dockerized Ubuntu 22.04 VM (Xvfb + VNC + Chrome + LibreOffice) is the sandbox. Each turn: capture screenshot → Claude with the computer_20250124 tool decides the next action → xdotool executes the click / type / key / scroll → verify → repeat. A safety layer blocks rm -rf, sudo, and network egress outside a whitelist. 20-task eval suite + OSWorld subset.
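A minimal sketch of one turn of that loop, assuming the anthropic Python SDK, scrot for capture, and a DISPLAY already exported inside the container; the model id, display size, and helper names are illustrative assumptions, and the real loop would also return a tool_result to the model and route every action through the safety layer first.

```python
import base64
import subprocess

import anthropic

client = anthropic.Anthropic()

def screenshot_b64() -> str:
    """Capture the Xvfb display (DISPLAY assumed set, e.g. :1) and return base64 PNG."""
    subprocess.run(["scrot", "-o", "/tmp/screen.png"], check=True)
    with open("/tmp/screen.png", "rb") as f:
        return base64.b64encode(f.read()).decode()

def execute(action: dict) -> None:
    """Translate a computer-use tool action into an xdotool call."""
    if action["action"] == "left_click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"], check=True)
    elif action["action"] == "type":
        subprocess.run(["xdotool", "type", "--delay", "50", action["text"]], check=True)
    elif action["action"] == "key":
        subprocess.run(["xdotool", "key", action["text"]], check=True)

def turn(task: str) -> None:
    """One screenshot → decide → act turn."""
    response = client.beta.messages.create(
        model="claude-3-7-sonnet-20250219",   # assumed model id
        max_tokens=1024,
        betas=["computer-use-2025-01-24"],
        tools=[{
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/png", "data": screenshot_b64()}},
            ],
        }],
    )
    for block in response.content:
        if block.type == "tool_use":
            execute(block.input)   # the safety layer would intercept here before execution
```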
SAP modules from 2008, hospital scheduling systems, insurance back-office tools. Integrators charge $200K to wire each one. A vision-driven agent that operates the actual UI — like a human contractor — is 100× cheaper if you can make it reliable.
UiPath, Blue Prism, etc. record XPath / element selectors. A button moves 20px; the bot breaks. A modal appears unexpectedly; the bot freezes. Maintenance is permanent.
And they can't reason. "Find the report from last quarter" requires understanding what "last quarter" means in the context of today's date and the report names visible on screen.
Each turn the model sees a fresh screenshot — same as a human would. It doesn't matter if the button moved, the layout reflowed, or a modal appeared: the model adapts.
LangGraph keeps the screenshot history in state. The Verifier node asks "did that step do what I expected?" after every action. Failed verifications trigger replanning.
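A sketch of that plan → act → verify loop as a LangGraph state machine; the node bodies are stubs, and how screenshots are stored and how the Verifier prompt is phrased are assumptions rather than the project's exact code.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

class AgentState(TypedDict):
    task: str
    screenshots: Annotated[list[str], operator.add]  # base64 history, appended every turn
    last_action: dict
    verified: bool

def plan(state: AgentState) -> dict:
    # Ask Claude for the next action given the task and the screenshot history.
    return {"last_action": {"action": "left_click", "coordinate": [640, 400]}}

def act(state: AgentState) -> dict:
    # Execute last_action via xdotool, then append a fresh screenshot to state.
    return {"screenshots": ["<new base64 screenshot>"]}

def verify(state: AgentState) -> dict:
    # Ask the model "did that step do what I expected?" by comparing
    # the pre- and post-action screenshots.
    return {"verified": True}

def route(state: AgentState) -> str:
    # Failed verification loops back to planning; success ends the run.
    return "done" if state["verified"] else "replan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_node("verify", verify)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_edge("act", "verify")
graph.add_conditional_edges("verify", route, {"replan": "plan", "done": END})
app = graph.compile()
```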
The safety layer intercepts actions before they hit the VM: command parsing blocks destructive commands and paths, network egress is restricted to the LLM API and localhost, and 47 sandbox-escape scenarios are pre-tested.
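A hedged sketch of that pre-execution filter; the specific patterns, the host whitelist, and the function names are illustrative assumptions, not the full rule set.

```python
import re
from urllib.parse import urlparse

BLOCKED_COMMANDS = [
    r"\brm\s+-[a-z]*r[a-z]*f",   # rm -rf and variants
    r"\bsudo\b",
    r"\bmkfs\b",
    r"\bdd\s+if=",
]
ALLOWED_HOSTS = {"api.anthropic.com", "localhost", "127.0.0.1"}

def allow_command(cmd: str) -> bool:
    """Reject shell commands matching any destructive pattern."""
    return not any(re.search(p, cmd) for p in BLOCKED_COMMANDS)

def allow_url(url: str) -> bool:
    """Reject network egress to anything outside the whitelist."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```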
VM container with Xvfb + VNC, Python deps with Computer Use SDK, safety filter verification (47 attack scenarios), 20-task custom eval, OSWorld office subset, Next.js dashboard with live VNC streaming.
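For a sense of what one entry in the 20-task eval might look like, here is an illustrative shape; the field names, paths, and checker are assumptions about the format, not the actual suite.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class EvalTask:
    name: str
    instruction: str               # natural-language goal given to the agent
    setup_cmd: str                 # shell command run inside the VM before the task
    check: Callable[[], bool]      # programmatic success check after the run
    max_steps: int = 25

def _report_renamed() -> bool:
    # Hypothetical success condition: the agent renamed last quarter's report.
    return Path("/home/agent/reports/q3_2024.ods").exists()

TASKS = [
    EvalTask(
        name="rename_quarterly_report",
        instruction="Find the report from last quarter and rename it to q3_2024.ods",
        setup_cmd="cp /fixtures/report_2024-09.ods /home/agent/reports/",
        check=_report_renamed,
    ),
]
```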
Live VNC mockup on the left, agent reasoning log on the right. Each step shows the cursor moving, clicking through menus, typing the filename, verifying the file was saved by opening a terminal and running head. The dashboard above tracks step count and accumulated cost.