Testing
Use the smallest test type that gives strong confidence.
Test types
- unit: pure logic and contracts (parsing, effects, schemas)
- integration: real server/lifecycle/tool wiring with fake provider model calls
- visual: stable TUI rendering and interaction snapshots
- performance: trend detection for latency regressions, not correctness
Unit test boundary
- *.test.ts and *.test.tsx should avoid filesystem writes, subprocesses, and network calls.
- If a test needs real fs/process/network behavior, use *.int.test.ts instead.
- Prefer mocks for UI/layout-focused unit tests.
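
One way to stay inside the unit boundary is to keep the logic pure and pass any I/O in as a parameter. The sketch below is only an illustration: parseConfig, loadConfig, the Config shape, and the fake file table are hypothetical names, not code from this repo.

```typescript
// Hypothetical example: every name here is illustrative, not from this repo.
type ReadFile = (path: string) => string;

interface Config {
  model: string;
  maxSteps: number;
}

// Pure logic under test: parsing is separated from file access.
function parseConfig(raw: string): Config {
  const data = JSON.parse(raw) as Partial<Config>;
  if (typeof data.model !== "string" || data.model.length === 0) {
    throw new Error("config.model must be a non-empty string");
  }
  // Fall back to an assumed default when maxSteps is absent.
  const maxSteps = typeof data.maxSteps === "number" ? data.maxSteps : 10;
  return { model: data.model, maxSteps };
}

// The only function that would touch the filesystem takes the reader as a parameter.
function loadConfig(path: string, readFile: ReadFile): Config {
  return parseConfig(readFile(path));
}

// In a *.test.ts file, an in-memory fake stands in for the real filesystem.
const fakeFiles: Record<string, string> = {
  "/workspace/config.json": '{"model":"fake-model"}',
};

const fakeRead: ReadFile = (path) => {
  const content = fakeFiles[path];
  if (content === undefined) throw new Error(`no such file: ${path}`);
  return content;
};

const config = loadConfig("/workspace/config.json", fakeRead);
```

A test that needs the real reader (for example, one that checks file permissions) would belong in an *.int.test.ts file instead.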
Integration test boundary
- Tool integration tests must dispatch through toolsForAgent({ workspace }) and call tools.<name>.execute(), not the underlying function directly. This exercises budget checks, hooks, caching, and call logging, the same path production uses.
- Effect integration tests must wire handlers via attachLifecycleEffectHandlers(ctx, session) and verify behavior through debug events, not call effect.run() directly.
- Direct function calls (e.g., editFile(), runShellCommand()) belong in unit tests when testing the function contract itself. Integration tests test wiring.
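
The difference between calling a function and calling it through its wrapper can be sketched as follows. This is a simplified stand-in, not the real toolsForAgent: makeTools, the budget object, and the log format are assumptions made for illustration, and the tools are synchronous here only for brevity.

```typescript
// Hypothetical stand-ins; the real toolsForAgent() and its wrappers are not shown here.
type ToolFn = (input: string) => string;

interface Tool {
  execute: ToolFn;
}

interface CallLogEntry {
  tool: string;
  input: string;
}

// Underlying function: the contract a unit test would call directly.
function echoUpper(input: string): string {
  return input.toUpperCase();
}

// Simplified wrapper that adds a budget check and call logging around each tool.
function makeTools(
  fns: Record<string, ToolFn>,
  budget: { remaining: number },
  log: CallLogEntry[],
): Record<string, Tool> {
  const tools: Record<string, Tool> = {};
  for (const [name, fn] of Object.entries(fns)) {
    tools[name] = {
      execute: (input) => {
        if (budget.remaining <= 0) {
          throw new Error(`budget exhausted before ${name}`);
        }
        budget.remaining -= 1;
        log.push({ tool: name, input });
        return fn(input);
      },
    };
  }
  return tools;
}

// Integration-style check: go through execute() so the wiring is exercised.
function runWiringCheck(): { output: string; log: CallLogEntry[]; blocked: boolean } {
  const log: CallLogEntry[] = [];
  const tools = makeTools({ echoUpper }, { remaining: 1 }, log);
  const output = tools.echoUpper.execute("hi");
  let blocked = false;
  try {
    tools.echoUpper.execute("again");
  } catch {
    blocked = true; // the second call exceeds the budget of 1
  }
  return { output, log, blocked };
}

const wiringResult = runWiringCheck();
```

Calling echoUpper("hi") directly would return the same string but would never touch the budget or the log, which is why a direct call cannot stand in for a wiring test.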
Test suites
*.test-suite.ts files define reusable assertions for store interfaces. They export a function that an *.int.test.ts file calls with a specific backend, so the same contract runs against every implementation.
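
A minimal sketch of that pattern, assuming a small key-value store contract: KeyValueStore, runStoreContract, MapStore, and ListStore are hypothetical names, and plain throwing checks stand in for the test runner's assertions so the example is self-contained.

```typescript
// Hypothetical store interface and backends, for illustration only.
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  delete(key: string): boolean;
}

function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

// What a *.test-suite.ts file might export: a contract parameterized by a factory.
function runStoreContract(label: string, makeStore: () => KeyValueStore): string[] {
  const passed: string[] = [];

  const empty = makeStore();
  check(empty.get("missing") === undefined, `${label}: missing key should be undefined`);
  passed.push("missing key is undefined");

  const store = makeStore();
  store.set("a", "1");
  store.set("a", "2");
  check(store.get("a") === "2", `${label}: set should overwrite`);
  passed.push("set overwrites");

  check(store.delete("a") === true, `${label}: delete should report removal`);
  check(store.get("a") === undefined, `${label}: deleted key should be gone`);
  check(store.delete("a") === false, `${label}: second delete should report nothing removed`);
  passed.push("delete removes and reports");

  return passed;
}

// Backend 1: backed by a Map.
class MapStore implements KeyValueStore {
  private data = new Map<string, string>();
  get(key: string) { return this.data.get(key); }
  set(key: string, value: string) { this.data.set(key, value); }
  delete(key: string) { return this.data.delete(key); }
}

// Backend 2: backed by an array of pairs.
class ListStore implements KeyValueStore {
  private entries: Array<[string, string]> = [];
  get(key: string) { return this.entries.find(([k]) => k === key)?.[1]; }
  set(key: string, value: string) {
    this.delete(key);
    this.entries.push([key, value]);
  }
  delete(key: string) {
    const before = this.entries.length;
    this.entries = this.entries.filter(([k]) => k !== key);
    return this.entries.length < before;
  }
}

// What an *.int.test.ts file might do: run the same contract per backend.
const mapResults = runStoreContract("MapStore", () => new MapStore());
const listResults = runStoreContract("ListStore", () => new ListStore());
```

Passing a factory rather than an instance gives each group of checks a fresh store, so one backend's leftover state cannot mask a failure.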
Commands
- Full baseline: bun run verify
- All tests: bun test
- Unit only: bun run test:unit
- Integration only: bun run test:int
- Visual only: bun run test:tui
- Perf baseline: bun run test:perf
- Behavior harness: bun run behavior:run --model anthropic/claude-sonnet-4-6
- Coverage report (unit tests only): bun run test:coverage
Behavior harness
- use scripts/run-behavior.ts for small real-model tuning tasks across bounded temporary workspaces
- keep scenarios explicit, small, and manually inspectable; this harness is for behavioral comparison, not automatic scoring
- prefer a few stable scenarios over many overlapping ones
Perf policy
- keep scenarios deterministic and free (fake provider only)
- use multiple runs and compare median/p95 over time
- fail on meaningful regressions with a median threshold
- add scenarios only when they represent a real user-critical path
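
The median/p95 comparison in the policy above might look like the sketch below. The helper names, the nearest-rank percentile method, and the 1.2 ratio threshold are all assumptions for illustration; the project's actual threshold is not stated here.

```typescript
// Hypothetical helpers; the names and the default threshold are assumptions.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

interface Verdict {
  medianRegressed: boolean; // gates the failure
  tailRegressed: boolean;   // informational: a tail shift the median may hide
}

function compareRuns(baseline: number[], current: number[], maxRatio = 1.2): Verdict {
  return {
    medianRegressed: median(current) > median(baseline) * maxRatio,
    tailRegressed: percentile(current, 95) > percentile(baseline, 95) * maxRatio,
  };
}

// Made-up timings in milliseconds.
const baselineRuns = [100, 102, 98, 101, 99];
const slowerRuns = [130, 128, 133, 131, 129]; // everything got slower
const tailRuns = [100, 101, 99, 102, 400];    // one slow outlier only

const slowerVerdict = compareRuns(baselineRuns, slowerRuns);
const tailVerdict = compareRuns(baselineRuns, tailRuns);
```

With these made-up numbers, the uniformly slower runs trip the median check, while the single 400 ms outlier shows up only in the p95 comparison.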
CI perf artifact
- CI uploads perf-baseline.json as the perf-baseline artifact
- read scenarios.<id>.summary.medianMs as the primary regression signal
- use p95Ms to detect tail-latency regressions that median may hide
- use scenarios.<id>.runs for per-run debugging and outlier checks
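
Reading two downloaded artifacts side by side could be sketched as follows. Only the scenarios.<id>.summary.medianMs and scenarios.<id>.runs paths come from the notes above; placing p95Ms under summary, treating runs as an array of millisecond durations, the scenario ids, and both helper functions are assumptions.

```typescript
// Assumed artifact shape: p95Ms under summary and runs as number[] are guesses.
interface ScenarioReport {
  summary: { medianMs: number; p95Ms: number };
  runs: number[];
}

interface PerfBaseline {
  scenarios: Record<string, ScenarioReport>;
}

// Hypothetical helper: ids whose median grew past the allowed ratio.
function findMedianRegressions(
  previous: PerfBaseline,
  current: PerfBaseline,
  maxRatio = 1.2,
): string[] {
  const regressed: string[] = [];
  for (const [id, report] of Object.entries(current.scenarios)) {
    const before = previous.scenarios[id];
    if (before === undefined) continue; // new scenario, nothing to compare against
    if (report.summary.medianMs > before.summary.medianMs * maxRatio) {
      regressed.push(id);
    }
  }
  return regressed;
}

// Hypothetical helper: runs far above the scenario's own median.
function outlierRuns(report: ScenarioReport, factor = 2): number[] {
  return report.runs.filter((ms) => ms > report.summary.medianMs * factor);
}

// Made-up data standing in for two parsed perf-baseline.json files.
const previousBaseline: PerfBaseline = {
  scenarios: {
    startup: { summary: { medianMs: 100, p95Ms: 110 }, runs: [98, 100, 110] },
    prompt: { summary: { medianMs: 50, p95Ms: 55 }, runs: [49, 50, 55] },
  },
};

const currentBaseline: PerfBaseline = {
  scenarios: {
    startup: { summary: { medianMs: 130, p95Ms: 140 }, runs: [128, 130, 140] },
    prompt: { summary: { medianMs: 51, p95Ms: 160 }, runs: [50, 51, 160] },
  },
};

const regressedIds = findMedianRegressions(previousBaseline, currentBaseline);
const promptOutliers = outlierRuns(currentBaseline.scenarios.prompt);
```

In this made-up data the startup scenario regresses on median, while prompt keeps a stable median but has one slow run that only the per-run list and p95Ms reveal.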