
Commit f3280e8

PrateekJannu and claude committed
feat(e2e+docs): Playwright web E2E green, security scan, core docs
- e2e: playwright config orchestrating mock-coasty + backend (test key, in-memory DB) + built web app; 3 passing web flows: full delegate->watch->approve->complete with live screen frames + cost summary; workflow builder->validate->run->approve->output; server-side BUDGET_EXCEEDED enforcement. Runtime security watcher asserts no sk-coasty-*/whsec_* in any browser request.
- desktop spec ready (runs once apps/desktop lands)
- backend: REST events polling fallback for mobile (/events.json?after=)
- mock: unauthenticated /health for orchestration probes; tolerant empty-JSON parser
- web: vite host pinned to 127.0.0.1 (IPv6 localhost broke readiness probes)
- docs: README (10-min offline quickstart + cost warnings), ARCHITECTURE, SECURITY (threat table), DEPLOYMENT, COOKBOOK (10 recipes)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 parent 6118fcd commit f3280e8

51 files changed

Lines changed: 4480 additions & 8 deletions


ARCHITECTURE.md

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
1+
# ARCHITECTURE
2+
3+
How open-cowork is put together, and why. Companion docs: `DECISIONS.md`
4+
(choices + trade-offs), `SECURITY.md` (trust boundaries), `PLAN.md` (original
5+
build plan and package contracts).
6+
7+
## The two hard truths the design hangs on
8+
9+
**1. Local vs remote execution.** Only the desktop app can capture and control
10+
the user's *own* screen; web and mobile drive a Coasty cloud machine instead.
11+
This is modeled as one `Executor` interface with three implementations behind a
12+
single shared agent loop — the rest of the product never cares which screen it
13+
is driving.
14+
15+
**2. The API key never touches a client.** All clients speak to the
16+
open-cowork backend, which is the only holder of `COASTY_API_KEY` and of every
17+
per-run `webhook_secret`. Clients hold short-lived session tokens.
18+
19+
## Component map
20+
21+
```text
22+
┌────────────────────────────────────────────┐
23+
│ Coasty API │
24+
│ /predict /sessions /runs /workflows │
25+
│ /machines · SSE events · HMAC webhooks │
26+
└────────▲───────────────────────┬───────────┘
27+
│ X-API-Key (backend only)│ webhooks (HMAC)
28+
┌──────────────┐ ┌────────┴───────────────────────▼───────────┐
29+
│ apps/desktop │ IPC │ apps/backend (Fastify) │
30+
│ Electron ├──────►│ auth (bearer tokens) · Coasty proxy │
31+
│ main proc: │ local │ estimates + confirmCostCents + budget caps │
32+
│ LocalRun- │ runs │ Ingestor: Coasty SSE → events table → bus │
33+
│ Manager │ mirror│ webhook receiver (verify before mutate) │
34+
│ + Local- │ │ SQLite (node:sqlite) · SSE fan-out │
35+
│ Executor │ └────────▲───────────────▲────────────────────┘
36+
└──────▲───────┘ │ REST + SSE │ REST + polling
37+
│ hosts │ │
38+
┌──────┴───────┐ ┌───────┴──────┐ ┌──────┴───────┐
39+
│ webview: │ │ apps/web │ │ apps/mobile │
40+
│ the same SPA │ │ Vite+React │ │ Expo / RN │
41+
└──────────────┘ └──────────────┘ └──────────────┘
42+
shared: packages/core · packages/executor · packages/ui
43+
```
44+
45+
## packages/core — the framework-agnostic heart
46+
47+
Zero runtime dependencies, isomorphic (Node + browser): injectable `fetch`,
48+
Web Crypto for HMAC, injectable clocks/sleeps for deterministic tests.
49+
50+
- **`CoastyClient`** — typed methods for every documented endpoint. Transport
51+
policy: timeouts compose with caller signals; retries use exponential backoff
52+
with full jitter and honor `Retry-After`; **GET/DELETE retry by default, POST
53+
retries only when an `Idempotency-Key` was provided** (a retried unkeyed POST
54+
could double-bill). Errors map to `CoastyApiError` carrying `code`,
55+
`request_id`, and code-specific extras.
56+
- **`runAgentLoop`** — screenshot → predict → execute → repeat until
57+
done/fail/cap/abort. Takes an `AgentScreen` (what executors implement) and a
58+
`PredictStepFn`, so predictions can come from a raw Coasty session *or* the
59+
backend proxy. Emits structured events; tolerates up to 3 consecutive
60+
action-execution failures; cooperative cancellation via `AbortSignal`.
61+
- **Workflow DSL** — validator enforcing every documented limit (≤200 steps,
62+
≤8 nesting, ≤16 parallel branches, retry 1–20, no approvals inside parallel,
63+
reserved `save_as` names), the 13-op condition evaluator, `{{path}}`
64+
templating, and a deterministic executor with `budget_cents` /
65+
`max_iterations` / `deadline_seconds` guards — used for builder feedback,
66+
dry-run estimates, and cross-checking the server.
67+
- **Cost estimator** — mirrors the documented pricing table exactly (including
68+
the strict HD boundary: 1280×720 is *not* HD) and computes run/workflow
69+
worst-case estimates the backend uses for the confirmation handshake.
70+
- **Webhook HMAC** — sign/verify `t=<unix>,v1=<hex>` over `"<t>.<body>"`,
71+
constant-time byte comparison, ±300s tolerance both directions, multiple
72+
`v1` entries accepted (rotation).
73+
- **SSE** — a spec-correct parser plus reconnecting event streams that resume
74+
via `Last-Event-ID` with overlap de-duplication.
75+
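The transport policy described for `CoastyClient` can be sketched as two small pure functions. This is an illustration only: `isRetryable`, `backoffDelayMs`, the default `baseMs`/`capMs` values, and the choice not to retry verbs other than GET/DELETE/POST are assumptions, not the package's actual API.

```ts
// Hypothetical helper names; not taken from packages/core.
type RetryPolicyInput = { method: string; hasIdempotencyKey: boolean };

/** GET/DELETE retry by default; POST retries only with an Idempotency-Key. */
function isRetryable({ method, hasIdempotencyKey }: RetryPolicyInput): boolean {
  const verb = method.toUpperCase();
  if (verb === 'GET' || verb === 'DELETE') return true;
  if (verb === 'POST') return hasIdempotencyKey; // an unkeyed POST could double-bill
  return false; // assumption: other verbs are never retried
}

type BackoffOptions = {
  baseMs?: number;            // assumed default, not documented
  capMs?: number;             // assumed default, not documented
  retryAfterSeconds?: number; // parsed from a Retry-After header, if present
  random?: () => number;      // injectable for deterministic tests
};

/** Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)). */
function backoffDelayMs(attempt: number, opts: BackoffOptions = {}): number {
  const { baseMs = 200, capMs = 10000, retryAfterSeconds, random = Math.random } = opts;
  // Assumption: a server-provided Retry-After takes precedence over the computed delay.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Injecting `random` mirrors the injectable clocks/sleeps mentioned above: with `random: () => 0.5`, attempt 3 yields 800 ms, and large attempt numbers settle at half of `capMs`.
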
76+
## packages/executor — one loop, three screens
77+
78+
```ts
79+
interface Executor extends AgentScreen {
80+
kind: 'local' | 'remote-machine' | 'browser';
81+
screenshot(): Promise<{ base64: string; width: number; height: number }>;
82+
execute(action: CuaAction): Promise<void>;
83+
dimensions(): Promise<{ width: number; height: number }>;
84+
dispose(): Promise<void>;
85+
}
86+
```
87+
88+
- **RemoteMachineExecutor** maps canonical actions onto the documented machine
89+
endpoints (`GET /machines/{id}/screenshot`, `POST /machines/{id}/actions`)
90+
through an injected transport — `CoastyClient` on the backend, a thin proxy
91+
client elsewhere. `wait` sleeps locally; `raw` code execution is refused by
92+
policy on every target.
93+
- **LocalExecutor** wraps a `NativeBridge` and solves the #1 documented
94+
pitfall — coordinate scaling — by mapping model-space (screenshot pixels) to
95+
input-space (real screen pixels) on every action.
96+
- **Bridges**: Windows is the reference — a persistent PowerShell daemon
97+
(`System.Drawing` capture + `user32` SendInput-family input) speaking
98+
JSON-lines over stdio, started via `-EncodedCommand`; zero native npm
99+
modules, so installs never compile anything. macOS (`screencapture`/
100+
`cliclick`/`osascript`) and Linux (`import`/`xdotool`) are best-effort
101+
equivalents behind the same interface.
102+
- Actions are normalized first (`normalizeAction`) because the upstream docs'
103+
reference table and examples disagree on some param shapes — both are
104+
accepted, one canonical shape is executed.
105+
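The coordinate-scaling step that `LocalExecutor` performs can be sketched as follows. The function name `toInputSpace`, the rounding, and the clamping to screen bounds are assumptions made for illustration; the real mapping may differ (for example on multi-monitor or HiDPI setups).

```ts
type Size = { width: number; height: number };
type Point = { x: number; y: number };

/**
 * Map a point from model-space (screenshot pixels) to input-space
 * (real screen pixels). Hypothetical helper, not the executor's real code.
 */
function toInputSpace(point: Point, screenshot: Size, screen: Size): Point {
  const scaleX = screen.width / screenshot.width;
  const scaleY = screen.height / screenshot.height;
  const x = Math.round(point.x * scaleX);
  const y = Math.round(point.y * scaleY);
  // Assumption: clamp so a click never lands outside the physical screen.
  return {
    x: Math.min(screen.width - 1, Math.max(0, x)),
    y: Math.min(screen.height - 1, Math.max(0, y)),
  };
}
```

For a 1280×720 screenshot of a 2560×1440 display, a model click at (640, 360) maps to (1280, 720) on the real screen.
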
106+
## apps/backend — proxy, custodian, fan-out
107+
108+
- **Auth**: `POST /api/auth/login {email}` issues an opaque random token
109+
(stored hashed, 7-day expiry). Single-tenant demo auth by design
110+
(`DECISIONS.md` D6); every table already carries `user_id`.
111+
- **Spend safety — the confirmCostCents handshake.** Billable routes compute
112+
the relevant number server-side (run worst case = `maxSteps × perStep`;
113+
machines = first-hour rate; workflows = the budget cap itself) and reject
114+
unless the client echoes it exactly (`409 ESTIMATE_CHANGED` with the expected
115+
value). Budgets are then enforced again: runs whose worst case exceeds the
116+
user's cap are refused with a suggested `maxSteps`; workflow runs pass
117+
`budget_cents` so *Coasty* halts them at the cap (`GUARD_EXCEEDED`); wallet
118+
pre-flight checks surface 402s before anything starts.
119+
- **Event pipeline.** Creating a run starts an **Ingestor** subscription to
120+
Coasty's SSE stream (resuming from the last stored seq). Events are mirrored
121+
into the `events` table **keeping the upstream `seq`**, applied to run state,
122+
and published on an in-process bus. Client SSE routes replay from SQLite
123+
(`Last-Event-ID`), then attach to the bus — with gap-filling if live events
124+
race the replay. The same table serves cloud runs, local runs, workflow
125+
runs, and per-user notification feeds (`stream_kind` + `stream_id`).
126+
- **Webhooks as reconciliation.** `POST /webhooks/coasty` verifies the HMAC
127+
against the per-run secret (looked up by the payload's run id) over the
128+
exact raw bytes before *any* state change; stale/tampered/unknown deliveries
129+
get 401. Verified events update run state and post to the owner's
130+
notification stream — so terminal transitions arrive even if an SSE
131+
subscription dropped. `GET /api/runs/:id` additionally reconciles
132+
non-terminal runs against Coasty on read.
133+
- **Local runs.** The desktop app mirrors its LocalExecutor loop through
134+
`POST /api/local-runs(/:id/events)`, so a run on your laptop is supervisable
135+
from your phone exactly like a cloud run — same timeline route, same
136+
approval notifications.
137+
- **Persistence**: `node:sqlite` behind a repository class (`db.ts`); events
138+
have `(stream_kind, stream_id, seq)` primary keys so ingestion is idempotent
139+
and replay is a range scan. Postgres is a contained swap (`DEPLOYMENT.md`).
140+
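A minimal sketch of the confirmCostCents handshake for runs, under stated assumptions: the function name `checkRunSpend`, the result shape, and the way `suggestedMaxSteps` is derived (largest step count that fits the cap) are hypothetical. Only the `409 ESTIMATE_CHANGED` response, the `maxSteps × perStep` worst case, and the `BUDGET_EXCEEDED` code come from this commit.

```ts
type SpendDecision =
  | { ok: true; expectedCents: number }
  | { ok: false; code: 'ESTIMATE_CHANGED'; status: 409; expectedCents: number }
  | { ok: false; code: 'BUDGET_EXCEEDED'; expectedCents: number; suggestedMaxSteps: number };

type SpendInput = {
  maxSteps: number;
  perStepCents: number;
  confirmCostCents: number | undefined; // value echoed by the client
  userBudgetCents: number;
};

/** Hypothetical server-side check: estimate echo first, budget cap second. */
function checkRunSpend(input: SpendInput): SpendDecision {
  const expectedCents = input.maxSteps * input.perStepCents; // run worst case
  if (input.confirmCostCents !== expectedCents) {
    return { ok: false, code: 'ESTIMATE_CHANGED', status: 409, expectedCents };
  }
  if (expectedCents > input.userBudgetCents) {
    // Assumption: suggest the largest step count that still fits under the cap.
    const suggestedMaxSteps = Math.floor(input.userBudgetCents / input.perStepCents);
    return { ok: false, code: 'BUDGET_EXCEEDED', expectedCents, suggestedMaxSteps };
  }
  return { ok: true, expectedCents };
}
```

With `maxSteps: 40` and `perStepCents: 5`, the client must echo exactly 200; a stale echo of 150 is rejected with the expected value attached so the UI can re-confirm.
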
141+
## Realtime model (end to end)
142+
143+
```text
144+
Coasty SSE ──► Ingestor ──► events table (durable, seq) ──► bus ──► client SSE
145+
Coasty webhooks ──► HMAC verify ──► state + notification stream ──► bus ──► feeds
146+
desktop local loop ──► POST /api/local-runs/:id/events ──► same table/bus
147+
```
148+
149+
Every hop resumes: the Ingestor reconnects to Coasty with `Last-Event-ID`;
150+
clients reconnect to the backend the same way; mobile polls
151+
`/api/runs/:id/events.json?after=N` (React Native fetch lacks streaming).
152+
Nothing is lost or duplicated because the durable seq is the single cursor.
153+
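The "single cursor" idea can be illustrated with a small consumer-side helper. `SeqCursor` is a hypothetical name, and this sketch assumes `seq` values are strictly increasing integers per stream; it shows only overlap de-duplication, not the backend's gap-filling.

```ts
type StreamEvent = { seq: number; type: string };

/** Tracks the last applied seq and drops anything at or below it. */
class SeqCursor {
  private last: number;

  constructor(lastEventId = 0) {
    this.last = lastEventId;
  }

  /** Returns only events not seen before, in seq order, and advances the cursor. */
  accept(events: StreamEvent[]): StreamEvent[] {
    const fresh: StreamEvent[] = [];
    for (const event of [...events].sort((a, b) => a.seq - b.seq)) {
      if (event.seq <= this.last) continue; // overlap from a replay or reconnect
      fresh.push(event);
      this.last = event.seq;
    }
    return fresh;
  }

  /** Value to send as Last-Event-ID, or as `after=N` when polling. */
  get position(): number {
    return this.last;
  }
}
```

A polling client would pass `cursor.position` as the `after` query parameter on each request, while an SSE client would send it as `Last-Event-ID` on reconnect; both paths feed the same `accept` call.
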
154+
## apps/desktop — local control, safely
155+
156+
Electron with `contextIsolation: true`, `nodeIntegration: false`. The renderer
157+
is the same SPA as the web app; a small preload exposes
158+
`window.cowork = { platform, backendUrl, startLocalRun, cancelLocalRun }`.
159+
`LocalRunManager` (main process) runs core's `runAgentLoop` with
160+
`LocalExecutor`, gets predictions through the backend's `/api/proxy/sessions`
161+
(key stays server-side), and mirrors events to `/api/local-runs` in batches.
162+
The E2E suite deliberately never starts a local run (it would seize the real
163+
mouse); that path is covered by unit tests plus an opt-in native capture smoke
164+
test (`COWORK_NATIVE_SMOKE=1`).
165+
166+
## apps/web + packages/ui
167+
168+
One Vite SPA serves browsers and the desktop webview. `packages/ui` is a
169+
dependency-free design system (dark-first tokens, accessible primitives,
170+
domain components like `EventTimeline`, `ScreenView`, `ApprovalBar`,
171+
`WorkflowStepTree`); apps map API DTOs into its presentational props. The live
172+
screen view polls machine screenshots every 2s while a run is active — frames
173+
are cross-platform and cheap (`DECISIONS.md` A3).
174+
175+
## apps/mobile
176+
177+
Expo/React Native with zero extra native deps; every screen is
178+
react-native-web-compatible, which is how the same UI is verified in CI
179+
(D7). Timelines poll the REST fallback; approvals hit the same resume routes.
180+
181+
## tools/mock-coasty
182+
183+
A faithful offline twin of the documented API: key kinds + billing headers,
184+
the full error catalog, exact pricing math, run/workflow steppers with the
185+
documented state machine, durable SSE with replay, HMAC-signed webhook
186+
delivery, sandbox machines with generated-PNG screenshots. Deliberately does
187+
**not** import `core` (D9) so contract bugs can't hide; behavior triggers in
188+
task text (`NEEDS_HUMAN`, `MUST_FAIL`, `RUN_LONG`, `MOCK_DONE`) make every
189+
lifecycle deterministic for tests and demos.
190+
191+
## Data model (SQLite)
192+
193+
```text
194+
users(id, email, budget_cents, created_at)
195+
sessions(token_hash PK, user_id, expires_at) -- tokens stored hashed
196+
runs(id, user_id, kind coasty|local, coasty_run_id, machine_id, task, status,
197+
cua_version, max_steps, budget_cents, cost_cents, steps_completed,
198+
result_json, error_json, awaiting_human_reason, webhook_secret, …)
199+
workflow_runs(id, user_id, coasty_workflow_run_id, workflow_id, status,
200+
budget_cents, spent_cents, awaiting_step_id, webhook_secret, …)
201+
events(stream_kind, stream_id, seq, type, data_json, created_at,
202+
PRIMARY KEY (stream_kind, stream_id, seq)) -- the realtime spine
203+
```

COOKBOOK.md

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,160 @@
1+
# COOKBOOK
2+
3+
Recipes for common open-cowork jobs. All of them work against the mock server
4+
(`pnpm dev:mock`) with zero spend — add ` NEEDS_HUMAN `, ` MUST_FAIL `, or
5+
` RUN_LONG ` to any task text to script the mock's behavior deterministically.
6+
7+
---
8+
9+
## 1. Run a task on a cloud machine and approve it from your phone
10+
11+
1. **Laptop (web)**: Machines → *Provision machine* (Linux, confirm the
12+
$0.05/hr rate) → Delegate → "Download the invoices NEEDS_HUMAN and file
13+
them" → confirm the worst-case cost → the run starts.
14+
2. **Phone (mobile app)**: sign in with the same email. The Runs tab shows the
15+
run; when it pauses, the *"A run needs your approval"* banner appears.
16+
3. Open it → read the pause reason and the event timeline → add a note →
17+
**Approve**. The laptop's live view resumes within a second (SSE), and the
18+
run finishes with a cost summary on both devices.
19+
20+
Why it works: the backend mirrors every event into a durable per-run stream
21+
and pushes an `awaiting_human` notification onto your per-user feed; both
22+
clients are just subscribers (see `ARCHITECTURE.md` → Realtime model).
23+
24+
## 2. Let the agent drive *your own* computer (desktop)
25+
26+
1. Run the stack (`pnpm dev:mock` + `dev:backend` + `dev:web`), then
27+
`pnpm dev:desktop`.
28+
2. In the desktop window, the Delegate screen's machine list has **"This
29+
computer (local screen)"** as the first target. Pick it, describe the task,
30+
and read the confirmation — it means it: the agent moves your real mouse.
31+
3. Watch the timeline; cancel anytime from the desktop, the web app, or your
32+
phone (local runs are mirrored like cloud runs).
33+
34+
Notes: Windows needs nothing extra (PowerShell bridge). macOS: grant Screen
35+
Recording + Accessibility. Tip: keep tasks narrow and use approval-style
36+
wording; the loop aborts after 3 consecutive failed actions.
37+
38+
## 3. Build a workflow with a human gate (and a hard budget)
39+
40+
Workflows → *New workflow*. The prefilled template is exactly this recipe:
41+
42+
```json
43+
{
44+
"steps": [
45+
{ "id": "fetch", "type": "task", "task": "Open order {{inputs.order_id}} and read the invoice total", "save_as": "invoice" },
46+
{ "id": "check", "type": "assert", "condition": { "op": "truthy", "value": "{{invoice.passed}}" }, "message": "Could not read the invoice" },
47+
{ "id": "gate", "type": "human_approval", "message": "Approve publishing the result?" },
48+
{ "id": "ok", "type": "succeed", "output": { "total": "{{invoice.result}}" } }
49+
]
50+
}
51+
```
52+
53+
*Validate + estimate* gives instant local validation (every documented DSL
54+
limit) plus typical/worst-case cost. Save, then *Run workflow*: pick a
55+
machine, set the **budget cap** — that number is the `budget_cents` guard
56+
Coasty enforces server-side; breaching it stops the run with
57+
`GUARD_EXCEEDED`. Approve the gate when it pauses; the output panel shows the
58+
templated result.
59+
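To make the `{{path}}` templating in the recipe concrete, here is a minimal string-only resolver. It is a sketch, not the DSL's real implementation: `renderTemplate` and `lookup` are hypothetical names, missing paths render as an empty string by assumption, and the real engine may preserve non-string types (for example a boolean for `{{invoice.passed}}`).

```ts
type Scope = Record<string, unknown>;

/** Walk a dotted path such as "inputs.order_id" through nested objects. */
function lookup(scope: Scope, path: string): unknown {
  return path.split('.').reduce<unknown>(
    (node, key) =>
      node !== null && typeof node === 'object' ? (node as Scope)[key] : undefined,
    scope,
  );
}

/** Replace every {{path}} placeholder with the stringified value from scope. */
function renderTemplate(template: string, scope: Scope): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match, path: string) => {
    const value = lookup(scope, path);
    // Assumption: unknown paths render as '' rather than raising an error.
    return value === undefined ? '' : String(value);
  });
}
```

Rendering the recipe's first task with `{ inputs: { order_id: 'A-42' } }` produces "Open order A-42 and read the invoice total"; results bound by `save_as` would be added to the same scope under their names.
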
60+
## 4. Retry flaky steps automatically
61+
62+
Wrap the fragile part in a `retry` and assert on the bound result:
63+
64+
```json
65+
{ "id": "r", "type": "retry", "max_attempts": 3, "body": [
66+
{ "id": "submit", "type": "task", "task": "Submit the expense form", "save_as": "out" },
67+
{ "id": "verify", "type": "assert", "condition": { "op": "truthy", "value": "{{out.passed}}" } }
68+
]}
69+
```
70+
71+
(Mock tip: a task containing `MUST_FAIL_ONCE` fails the first attempt and
72+
succeeds on the second — handy for demos.)
73+
74+
## 5. Fan out across parallel branches
75+
76+
```json
77+
{ "id": "p", "type": "parallel", "branches": [
78+
[ { "id": "a", "type": "task", "task": "Export the sales report", "save_as": "sales" } ],
79+
[ { "id": "b", "type": "task", "task": "Export the support report", "save_as": "support" } ]
80+
]},
81+
{ "id": "both", "type": "assert", "condition": { "op": "and", "conditions": [
82+
{ "op": "truthy", "value": "{{sales.passed}}" },
83+
{ "op": "truthy", "value": "{{support.passed}}" }
84+
]}}
85+
```
86+
87+
Branches run concurrently and bind results under their `save_as` names.
88+
Remember the documented limits: ≤16 branches, and no
89+
`human_approval`/`succeed`/`fail` inside a branch.
90+
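The two limits above can be checked locally before saving. The following is a rough sketch of such a check; `validateParallel` and the `Step` shape are assumptions for illustration, and applying the ban to steps nested deeper inside a branch (for example within a `retry` body) is this sketch's interpretation rather than documented behavior.

```ts
type Step = { id: string; type: string; branches?: Step[][]; body?: Step[] };

const MAX_PARALLEL_BRANCHES = 16;
const FORBIDDEN_IN_BRANCH = new Set(['human_approval', 'succeed', 'fail']);

/** Returns human-readable problems for one `parallel` step (empty when valid). */
function validateParallel(step: Step): string[] {
  const errors: string[] = [];
  const branches = step.branches ?? [];
  if (branches.length > MAX_PARALLEL_BRANCHES) {
    errors.push(`${step.id}: ${branches.length} branches exceeds ${MAX_PARALLEL_BRANCHES}`);
  }
  const visit = (steps: Step[]): void => {
    for (const child of steps) {
      if (FORBIDDEN_IN_BRANCH.has(child.type)) {
        errors.push(`${child.id}: "${child.type}" is not allowed inside a parallel branch`);
      }
      if (child.body) visit(child.body); // assumption: nested bodies are checked too
      for (const nested of child.branches ?? []) visit(nested);
    }
  };
  for (const branch of branches) visit(branch);
  return errors;
}
```

The recipe's two-branch example passes with no errors; moving a `human_approval` gate into either branch yields one error naming that step.
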
91+
## 6. Bound machine spend with TTLs
92+
93+
When provisioning, set *Auto-terminate after N minutes* — that is the
94+
documented `ttl_minutes` (5 min–7 days): the VM terminates itself and all
95+
billing stops, even if everyone forgets it. Stopped machines bill $0.01/hr
96+
(storage); terminate to reach $0.
97+
98+
## 7. Use the agent loop programmatically (no UI)
99+
100+
```ts
101+
import { CoastyClient, runAgentLoop } from '@open-cowork/core';
102+
import { RemoteMachineExecutor } from '@open-cowork/executor';
103+
104+
const coasty = new CoastyClient({
105+
baseUrl: process.env.COASTY_BASE_URL!, // mock or real — same code
106+
apiKey: process.env.COASTY_API_KEY!, // server-side only!
107+
});
108+
const { machine } = await coasty.createMachine({ display_name: 'script-vm' });
109+
const session = await coasty.createSession({ screen_width: 1280, screen_height: 720 });
110+
const executor = new RemoteMachineExecutor({ machineId: machine.id, transport: coasty });
111+
112+
const outcome = await runAgentLoop({
113+
screen: executor,
114+
task: 'Open the settings page and enable dark mode',
115+
maxSteps: 15,
116+
predictStep: (input) =>
117+
coasty.sessionPredict(session.session_id, {
118+
screenshot: input.screenshotB64,
119+
instruction: input.instruction,
120+
}),
121+
});
122+
console.log(outcome.status, outcome.stepsUsed, `${outcome.totalCostCents}`);
123+
await coasty.deleteSession(session.session_id); // free the concurrency slot
124+
```
125+
126+
## 8. Verify your webhook receiver with signed test vectors
127+
128+
```ts
129+
import { signWebhookPayload, verifyWebhookSignature } from '@open-cowork/core';
130+
131+
const body = JSON.stringify({ event: 'run.succeeded', run: { id: 'run_1' } });
132+
const header = await signWebhookPayload({ secret: 'whsec_yours', body });
133+
const verdict = await verifyWebhookSignature({ body, header, secret: 'whsec_yours' });
134+
// verdict.valid === true; tamper with `body` and it flips with reason 'bad_signature'
135+
```
136+
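For readers writing their own receiver, the check behind `verifyWebhookSignature` can be approximated as below. This is a sketch under assumptions: it uses Node's `node:crypto` instead of the Web Crypto calls that `packages/core` is described as using, HMAC-SHA256 is assumed as the digest, and `sign`, `verify`, and the `malformed`/`stale` reason strings are hypothetical (only `bad_signature` appears in this cookbook).

```ts
import { createHmac, timingSafeEqual } from 'node:crypto';

const TOLERANCE_SECONDS = 300; // ±300s, per the architecture notes

type Verdict =
  | { valid: true }
  | { valid: false; reason: 'malformed' | 'stale' | 'bad_signature' };

/** Hex HMAC over "<t>.<body>". SHA-256 is an assumption of this sketch. */
function sign(secret: string, t: number, body: string): string {
  return createHmac('sha256', secret).update(`${t}.${body}`).digest('hex');
}

/** Verify a header of the form "t=<unix>,v1=<hex>[,v1=<hex>...]". */
function verify(header: string, body: string, secret: string, nowSeconds: number): Verdict {
  const parts = header.split(',').map((part) => part.trim().split('='));
  const t = Number(parts.find(([key]) => key === 't')?.[1]);
  const candidates = parts.filter(([key]) => key === 'v1').map(([, value]) => value ?? '');
  if (!Number.isInteger(t) || candidates.length === 0) {
    return { valid: false, reason: 'malformed' };
  }
  if (Math.abs(nowSeconds - t) > TOLERANCE_SECONDS) {
    return { valid: false, reason: 'stale' };
  }
  const expected = Buffer.from(sign(secret, t, body), 'hex');
  // Any matching v1 entry is accepted, which allows secret rotation.
  const matched = candidates.some((hex) => {
    const got = Buffer.from(hex, 'hex');
    return got.length === expected.length && timingSafeEqual(got, expected);
  });
  return matched ? { valid: true } : { valid: false, reason: 'bad_signature' };
}
```

The body passed to `verify` must be the exact raw bytes received; re-serializing parsed JSON before verification would change the signed string and fail the check.
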
137+
## 9. Estimate costs before committing to anything
138+
139+
```ts
140+
import { runEstimateCents, workflowEstimateCents, formatCents } from '@open-cowork/core';
141+
142+
runEstimateCents({ cuaVersion: 'v3', maxSteps: 40 }); // { perStep: 5, min: 5, max: 200 }
143+
workflowEstimateCents(definition); // { typicalCents, worstCaseCents }
144+
formatCents(125); // "$1.25"
145+
```
146+
147+
The backend exposes the same math at `POST /api/estimate` — the UI's numbers
148+
and the enforcement numbers can never drift apart.
149+
150+
## 10. Script the mock for demos and tests
151+
152+
```ts
153+
import { createMockCoasty } from '@open-cowork/mock-coasty';
154+
155+
const { app, state } = createMockCoasty({ tickMs: 50, walletCents: 200 });
156+
await app.listen({ port: 4010 });
157+
// task text drives behavior: NEEDS_HUMAN pauses, MUST_FAIL fails verification,
158+
// RUN_LONG takes 20 steps, MOCK_DONE finishes a predict immediately.
159+
// state.webhookDeliveries / state.events let tests assert everything.
160+
```
