Skip to main content

Testing

Use the smallest test type that gives strong confidence.

Test types

  • unit: pure logic and contracts (parsing, effects, schemas)
  • integration: real server/lifecycle/tool wiring with fake provider model calls
  • visual: stable TUI rendering and interaction snapshots
  • performance: trend detection for latency regressions, not correctness

Unit test boundary

  • *.test.ts and *.test.tsx should avoid filesystem writes, subprocesses, and network calls.
  • If a test needs real fs/process/network behavior, use *.int.test.ts instead.
  • Prefer mocks for UI/layout-focused unit tests.

Integration test boundary

  • Tool integration tests must dispatch through toolsForAgent({ workspace }) and call tools.<name>.execute(), not the underlying function directly. This exercises budget checks, hooks, caching, and call logging — the same path production uses.
  • Effect integration tests must wire handlers via attachLifecycleEffectHandlers(ctx, session) and verify behavior through debug events, not call effect.run() directly.
  • Direct function calls (e.g., editFile(), runShellCommand()) belong in unit tests when testing the function contract itself. Integration tests test wiring.

Test suites

*.test-suite.ts files define reusable assertions for store interfaces. They export a function that an *.int.test.ts file calls with a specific backend, so the same contract runs against every implementation.

Commands

  • Full baseline: bun run verify
  • All tests: bun test
  • Unit only: bun run test:unit
  • Integration only: bun run test:int
  • Visual only: bun run test:tui
  • Perf baseline: bun run test:perf
  • Behavior harness: bun run behavior:run --model anthropic/claude-sonnet-4-6
  • Coverage report (unit tests only): bun run test:coverage

Behavior harness

  • use scripts/run-behavior.ts for small real-model tuning tasks across bounded temporary workspaces
  • keep scenarios explicit, small, and manually inspectable; this harness is for behavioral comparison, not automatic scoring
  • prefer a few stable scenarios over many overlapping ones

Perf policy

  • keep scenarios deterministic and free (fake provider only)
  • use multiple runs and compare median/p95 over time
  • fail on meaningful regressions with a median threshold
  • add scenarios only when they represent a real user-critical path

CI perf artifact

  • CI uploads perf-baseline.json as the perf-baseline artifact
  • read scenarios.<id>.summary.medianMs as the primary regression signal
  • use p95Ms to detect tail-latency regressions that median may hide
  • use scenarios.<id>.runs for per-run debugging and outlier checks