|
| 1 | +# SUMMARY |
| 2 | + |
| 3 | +What was built, how it is verified, the platform status matrix, coverage |
| 4 | +numbers, and every deviation from the brief / live Coasty docs. Built |
| 5 | +2026-06-11 against the live docs snapshot (`https://coasty.ai/docs/llms.txt`, |
| 6 | +fetched the same day). |
| 7 | + |
| 8 | +## What was built |
| 9 | + |
| 10 | +A complete, working implementation of the brief: a cross-platform agentic |
| 11 | +coworker on the Coasty Computer Use API, as a pnpm + Turborepo monorepo |
| 12 | +(TypeScript `strict` everywhere, zero native npm modules): |
| 13 | + |
| 14 | +- **`packages/core`** — typed client for every documented Coasty endpoint |
| 15 | + (timeouts, Retry-After-aware backoff with full jitter, POSTs retried only |
| 16 | + with an `Idempotency-Key`, reconnecting SSE streams), the shared agent loop, |
| 17 | + the full workflow-DSL validator/evaluator (13 ops, templating, guards), the |
| 18 | + cost estimator mirroring the documented pricing table, and isomorphic |
| 19 | + webhook HMAC sign/verify. Zero runtime deps. |
| 20 | +- **`packages/executor`** — the `Executor` interface + |
| 21 | + `RemoteMachineExecutor` (cloud VMs), `BrowserExecutor` (Playwright), and |
| 22 | + `LocalExecutor` with native OS bridges (Windows reference implementation: a |
| 23 | + persistent PowerShell daemon — verified live on real hardware via the |
| 24 | + opt-in capture smoke test; macOS/Linux best-effort). Model→input coordinate |
| 25 | + scaling handled; `raw` code execution refused everywhere by policy. |
| 26 | +- **`tools/mock-coasty`** — a faithful offline mock of the entire API: key |
| 27 | + kinds + billing headers, the full error catalog, exact pricing math, the |
| 28 | + run state machine with per-step billing, durable SSE with `Last-Event-ID` |
| 29 | + replay, HMAC-signed webhooks, a workflow interpreter, sandbox machines with |
| 30 | + generated-PNG screenshots. Every test and demo runs against it — **no test |
| 31 | + can ever spend money**. |
| 32 | +- **`apps/backend`** — Fastify + `node:sqlite`: bearer-token auth, the Coasty |
| 33 | + proxy (sole key holder), HMAC-verified webhook receiver, durable event |
| 34 | + mirroring + SSE fan-out with replay, server-side estimates with the |
| 35 | + `confirmCostCents` handshake and budget caps, local-run mirroring for the |
| 36 | + desktop, per-user notification feed. |
| 37 | +- **`packages/ui` + `apps/web`** — dark-first design system (20 accessible |
| 38 | + components) and the SPA: delegate-with-cost-confirm, live run view (SSE |
| 39 | + timeline + screen frames + approvals), workflow builder with instant |
| 40 | + validation + estimates, machines + wallet, settings. |
| 41 | +- **`apps/desktop`** — Electron shell (contextIsolation, no Node in the |
| 42 | + renderer) hosting the same SPA; `LocalRunManager` runs the agent loop on the |
| 43 | + user's own screen through the backend inference proxy and mirrors events so |
| 44 | + any device can supervise. |
| 45 | +- **`apps/mobile`** — Expo/React Native companion: runs, live machine frames, |
| 46 | + approvals with notes, workflow approvals, machines, wallet; in-app |
| 47 | + awaiting-approval banner; Maestro flows included. |
| 48 | +- **Docs**: README (≤10-min offline quickstart), ARCHITECTURE, SECURITY, |
| 49 | + DECISIONS, DEPLOYMENT, COOKBOOK, CONTRIBUTING, per-app READMEs. **CI**: |
| 50 | + GitHub Actions (ubuntu + windows matrix: lint/format/typecheck/unit/ |
| 51 | + integration/security-scan on push; E2E with xvfb on PRs; non-blocking audit). |
| 52 | + |
| 53 | +## Verification status |
| 54 | + |
| 55 | +`pnpm test`, `pnpm typecheck`, `pnpm lint`, `pnpm format`, |
| 56 | +`pnpm security:scan` — **all green, fully offline** (18/18 turbo tasks across |
| 57 | +9 packages). E2E (Playwright, against mock + real backend + built SPA): |
| 58 | +**web 3/3, desktop 1/1 — green** on Windows 11. |
| 59 | + |
| 60 | +| Suite | Tests | Notes | |
| 61 | +| --- | --- | --- | |
| 62 | +| core (unit) | 166 + live-smoke gate | loop, DSL, cost table, HMAC vectors (valid/tampered/stale/future/malformed/rotation), retry, SSE parser, client incl. SSE-reconnect Last-Event-ID | |
| 63 | +| executor (unit) | 31 | fake-daemon protocol, DPI scaling, action mapping; +1 opt-in native capture smoke (passed on real hardware) | |
| 64 | +| mock-coasty | 56 | pricing math incl. HD boundary, run state machine, SSE drop→reconnect (no dupes/gaps), signed webhooks verified by hand-rolled HMAC, workflow guards/approvals, machines | |
| 65 | +| backend (integration) | 22 | real HTTP vs in-process mock: lifecycle, awaiting_human→resume, webhook tamper/stale/unknown → 401, SSE replay+reconnect, BUDGET_EXCEEDED / ESTIMATE_CHANGED / 402 paths, local runs, allowlisted actions | |
| 66 | +| ui (RTL) | 107 | all 20 components: roles/names, loading/error/empty, keyboard interactions | |
| 67 | +| web (RTL) | 19 | login, delegate→confirm-cost→create, budget-error surfacing, empty/error states, event mapping | |
| 68 | +| desktop (unit) | 8 | LocalRunManager happy path/cancel/failure/batching vs fake executor + scripted backend; build smoke | |
| 69 | +| mobile (RTL via react-native-web) | 33 | all 5 screens incl. cursor-polled timeline, approval flow, banners | |
| 70 | +| **E2E web** | 3 | full journey: login→provision→delegate→confirm $1.25→live timeline+frames→approve with note→succeeded+cost summary; workflow build→validate→run→approve→output; server-side budget refusal. Plus a runtime watcher asserting **no request ever contains key/secret material** | |
| 71 | +| **E2E desktop** | 1 | Electron boots, secure bridge present, no Node leak in renderer, login works, "This computer (local screen)" target + local-control warning | |
| 72 | +| **Total** | **≈446** | | |
| 73 | + |
| 74 | +Coverage (v8, lines): core **94.1%**, ui **99.9%**, mobile **98.4%**, |
| 75 | +mock-coasty **84.2%**, backend **83.5%**, executor **64.7%** (the embedded |
| 76 | +PowerShell daemon string and the Unix bridges, which cannot be tested on CI, dominate the |
| 77 | +uncovered lines), desktop **63.4%** (Electron main/preload are E2E-covered |
| 78 | +instead), web **25.5% by unit tests** — the pages are primarily covered by the |
| 79 | +three full-journey E2E flows. |
| 80 | + |
| 81 | +## Platform status matrix |
| 82 | + |
| 83 | +| Capability | Desktop (Electron) | Web | Mobile (Expo) | |
| 84 | +| --- | --- | --- | --- | |
| 85 | +| Local screen control | ✅ LocalExecutor + PowerShell bridge (capture verified on real hardware; input path unit-tested + gated) | ❌ by design → cloud machine | ❌ by design → cloud machine | |
| 86 | | Cloud-machine control + live view | ✅ (same SPA) | ✅ E2E-verified | ✅ frames polled every 2s (component-tested) | |
| 87 | +| Task chat + run dashboard | ✅ | ✅ E2E | ✅ | |
| 88 | +| Workflow builder | ✅ full | ✅ full, E2E | ✅ view + approve | |
| 89 | +| Approvals / human takeover | ✅ | ✅ E2E | ✅ approve/reject + note | |
| 90 | +| Cost / wallet view | ✅ | ✅ E2E | ✅ | |
| 91 | +| Verified how | unit + Playwright `_electron` | unit + Playwright | unit via react-native-web; Maestro flows shipped (emulator required) | |
| 92 | + |
| 93 | +## Spend-safety guarantees (tested) |
| 94 | + |
| 95 | +Estimate shown → `confirmCostCents` must echo the server's number → per-user |
| 96 | +budget cap must cover the worst case (else 422 with a suggested `maxSteps`) → |
| 97 | +wallet pre-flight → Coasty-side `budget_cents` / `max_steps` / `ttl_minutes` |
| 98 | +guards. Test keys and the mock bill $0; the live-smoke suite refuses non-sandbox keys. |
| 99 | + |
| 100 | +## Deviations from the brief (rationale in DECISIONS.md) |
| 101 | + |
| 102 | +1. **Electron instead of Tauri** (D1) — no Rust toolchain on the dev machine; |
| 103 | + the brief's fallback. Native access isolated behind `NativeBridge` for a |
| 104 | + future Tauri port. |
| 105 | +2. **`node:sqlite` instead of Postgres + Prisma** (D4) — offline tests + |
| 106 | + <10-min newcomer setup; repository layer makes Postgres a contained swap. |
| 107 | +3. **Vite SPA instead of Next.js** (D3) — same bundle serves web + desktop. |
| 108 | +4. **Mobile E2E via react-native-web + shipped Maestro flows** (D7) — no |
| 109 | + emulator on the build machine; the same screens are E2E-testable in Chromium. |
| 110 | +5. **OS push stubbed; in-app notifications real** (D8). |
| 111 | +6. **Contract testing approach**: instead of a standalone schema suite, the |
| 112 | + contract is pinned three ways — core's client tests assert exact outbound |
| 113 | + paths/headers/bodies for all 43 endpoints; mock-coasty (built independently |
| 114 | + of core, D9) asserts documented field names/status codes/pricing; backend |
| 115 | + integration runs the real client against the mock end-to-end. |
| 116 | +7. **Schedules & Triggers API not implemented** — documented but outside the |
| 117 | + product surface of the brief (runs/workflows/machines cover the scope). |
| 118 | + |
| 119 | +## Drift between the brief and the live docs (docs were followed) |
| 120 | + |
| 121 | +- Run resume body is `{note}`; **workflow** resume is `{approved, note}` — the |
| 122 | + brief implied `{approved}` for runs. |
| 123 | +- Idempotency is an `Idempotency-Key` **header**, not a body field. |
| 124 | +- `cua_version` values are `v1 | v3 | v4` (no v2; v4 needs professional tier). |
| 125 | +- The docs' Reference action table and its code examples disagree on params |
| 126 | + (`wait` `{ms}` vs `{seconds}`; `key_press` `{key}` vs `{keys}`; `scroll` |
| 127 | + `{direction,amount}` vs `{clicks}`; `drag` `{from_x…}` vs `{x1…}`) — core |
| 128 | + accepts both shapes and canonicalizes (`normalizeAction`); the mock emits |
| 129 | + the Reference shape. |
| 130 | +- HD surcharge boundary is strict (`>1280` or `>720`; exactly 1280×720 is SD) |
| 131 | + — encoded in the cost estimator and its boundary tests. |
| 132 | +- The webhook replay window (5 min) is documented for trigger webhooks; we |
| 133 | + apply the same ±300s tolerance to run webhooks (defense-in-depth). |
| 134 | + |
| 135 | +## Known limitations / next steps |
| 136 | + |
| 137 | +- Demo single-tenant auth (D6): put a real identity provider in front before |
| 138 | + public deployment (see `SECURITY.md`). |
| 139 | +- macOS/Linux native bridges are structured + typed but untested on real |
| 140 | + hardware (no such hardware in this environment); Windows is the reference. |
| 141 | +- Live-screen view is screenshot frames (1–2s), not VNC video (A3). |
| 142 | +- Optional live sandbox smoke (`COWORK_RUN_LIVE=1` + `sk-coasty-test-*`) |
| 143 | + exercises free/sandbox endpoints only; it was not run during this build |
| 144 | + (offline-first policy) and skips cleanly when unset. |
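
Two of the rules in the drift section are easy to get wrong at the boundary: the strict HD surcharge test (exactly 1280×720 is SD) and the ±300s webhook timestamp tolerance. A minimal sketch of both follows. The function and constant names are hypothetical and are not taken from `packages/core`, and treating a timestamp exactly 300s away as still acceptable is an assumption, since the summary does not say whether the window is inclusive.

```typescript
// Hypothetical names for illustration only; the real cost estimator and
// webhook verifier live in packages/core and may be shaped differently.
const SD_MAX_WIDTH = 1280;
const SD_MAX_HEIGHT = 720;
const REPLAY_TOLERANCE_SECONDS = 300;

/**
 * HD surcharge applies only when a dimension strictly exceeds the SD bounds,
 * so exactly 1280x720 is still billed as SD.
 */
function isHdResolution(width: number, height: number): boolean {
  return width > SD_MAX_WIDTH || height > SD_MAX_HEIGHT;
}

/**
 * Accepts a webhook timestamp only if it is within the tolerance of "now",
 * in either direction (stale or future). Both values are Unix seconds.
 * Assumption: the boundary (exactly 300s away) is accepted.
 */
function isWithinReplayWindow(
  timestampSeconds: number,
  nowSeconds: number,
): boolean {
  return Math.abs(nowSeconds - timestampSeconds) <= REPLAY_TOLERANCE_SECONDS;
}
```

Keeping both checks as pure functions of their inputs (with `nowSeconds` passed in rather than read from the clock) makes the boundary cases straightforward to pin in unit tests.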