fix: stop server startup from auto-failing in-flight workflow runs (#1216) #1231

PR
PR description

Summary

  • Problem: Every archon serve startup unconditionally flipped all running workflow rows to failed via failOrphanedRuns() at packages/server/src/index.ts:213. This killed CLI workflows actively executing in another process. Reproducer: start a workflow in one terminal, start the server in another while it's still running — the workflow's status flips to failed mid-execution and the CLI exits non-zero even though every node completed successfully. Filed as #1216, discovered during PR #1217 smoke testing.
  • Why it matters: Every server restart silently corrupted in-flight workflow state. Users with CI/cron-driven server restarts could lose long-running workflows without an actionable signal. The dag-executor's defensive between-layer check was the only thing preventing partial corruption — but that protection means valuable work (completed nodes, accumulated cost, generated artifacts) gets recorded with status=failed.
  • What changed: Backend removes the failOrphanedRuns() call from server startup (matches the CLI precedent at packages/cli/src/cli.ts:256-258). UI gets a numeric count badge on the Dashboard nav (replacing a binary pulse dot) and AlertDialog confirmations for destructive workflow-run actions (replacing 5 window.confirm() callsites).
  • What did not change (scope boundary): The failOrphanedRuns() function itself in packages/core/src/db/workflows.ts:911 is preserved — it's still used by archon workflow cleanup (the explicit user-driven path). Codex provider behavior unchanged. No DB migration. No new dependencies. No timer-based heuristic introduced anywhere — per the new CLAUDE.md principle.

UX Journey

Before

Terminal A                               Server (Terminal B)              UI
──────────                               ───────────────────              ──
archon workflow run e2e-claude-smoke
  ├─ creates run row (status=running)
  └─ executes nodes…
                                          archon serve  ──┐
                                          ├─ failOrphanedRuns()
                                          │  UPDATE remote_agent_workflow_runs
                                          │  SET status='failed'
                                          │  WHERE status='running'  ❌ kills A's row
                                          └─ binds port, ready
  ├─ next node finishes
  └─ between-layer status check
     sees status='failed'
     ↓
  ❌ Workflow failed:                                                    Dashboard:
     "Workflow did not complete                                           pulse dot
     successfully" (exit 1)                                               (binary signal)

After

Terminal A                               Server (Terminal B)              UI
──────────                               ───────────────────              ──
archon workflow run e2e-claude-smoke
  ├─ creates run row (status=running)
  └─ executes nodes…
                                          archon serve  ──┐
                                          *no failOrphanedRuns() call*
                                          └─ binds port, ready
  ├─ all nodes complete
  └─ between-layer status check
     sees status='running'
     ↓
  ✓ Workflow completed                                                    Dashboard nav:
    successfully (exit 0)                                                  [📊 Dashboard 1]
                                                                          (numeric count
                                                                           badge, hidden if 0)

User sees an unfamiliar "running" workflow on the dashboard?
   → clicks the workflow card → AlertDialog → "Cancel workflow" → confirmed → row marked cancelled
   (no system-driven heuristic; user owns the decision)

Architecture Diagram

Before

                 ┌──────────────────────────────┐
                 │  packages/server/src/index.ts│
                 │   startServer()              │
                 │   ├─ DB connect              │
                 │   ├─ failOrphanedRuns()  ──> │ ❌ mutates ALL `running` rows
                 │   │                          │    regardless of process owner
                 │   └─ bind port               │
                 └──────────────────────────────┘
                                    │
                                    ▼
                 ┌──────────────────────────────────────┐
                 │  packages/core/src/db/workflows.ts    │
                 │  failOrphanedRuns()                   │
                 │  UPDATE … SET status='failed'         │
                 │  WHERE status='running'  (no scope)   │
                 └──────────────────────────────────────┘

After

                 ┌──────────────────────────────┐
                 │  packages/server/src/index.ts│ [~]
                 │   startServer()              │
                 │   ├─ DB connect              │
                 │   ├─ // explanatory comment   │
                 │   │    *no autonomous mutation*│
                 │   └─ bind port               │
                 └──────────────────────────────┘

                 ┌──────────────────────────────────────┐
                 │  packages/core/src/db/workflows.ts    │ (unchanged)
                 │  failOrphanedRuns()                   │
                 │  Now only called by                   │
                 │  `archon workflow cleanup` (explicit) │
                 └──────────────────────────────────────┘

                 ┌────────────────────────────────────────────────────┐
                 │  packages/web/src/components/layout/TopNav.tsx [~] │
                 │   Dashboard nav: pulse-dot ──> count badge          │
                 │   reads /api/dashboard/runs counts.running          │
                 └────────────────────────────────────────────────────┘

                 ┌────────────────────────────────────────────────────┐
                 │  packages/web/src/components/dashboard/             │
                 │   ConfirmRunActionDialog.tsx  [+]                   │
                 │   shadcn AlertDialog wrapper, mirrors               │
                 │   sidebar/ProjectSelector codebase-delete pattern   │
                 │                                                      │
                 │   WorkflowRunCard.tsx  [~]                          │
                 │   4× window.confirm() ──> ConfirmRunActionDialog    │
                 │                                                      │
                 │   WorkflowHistoryTable.tsx  [~]                     │
                 │   1× window.confirm() ──> ConfirmRunActionDialog    │
                 └────────────────────────────────────────────────────┘

Connection inventory:

From To Status Notes
server/index.ts createWorkflowStore() removed no longer needed at startup; archon workflow cleanup retains the link
server/index.ts failOrphanedRuns() removed the offending invocation
TopNav.tsx listDashboardRuns new replaces listWorkflowRuns; reads counts.running
TopNav.tsx listWorkflowRuns removed superseded by listDashboardRuns query
WorkflowRunCard.tsx ConfirmRunActionDialog new 4 callsites (Reject, Abandon, Cancel, Delete)
WorkflowHistoryTable.tsx ConfirmRunActionDialog new 1 callsite (Delete)
ConfirmRunActionDialog.tsx shadcn AlertDialog primitives new mirrors sidebar/ProjectSelector.tsx:142–165 pattern
WorkflowRunCard.tsx window.confirm removed 4 callsites
WorkflowHistoryTable.tsx window.confirm removed 1 callsite

Label Snapshot

  • Risk: risk: low (removal of an autonomous mutation; UI changes are additive replacements of an existing pattern)
  • Size: size: M (6 files, +197 / −72; bulk in WorkflowRunCard's 4 dialog conversions)
  • Scope: core, server, web
  • Module: server:index, web:dashboard, web:layout

Change Metadata

  • Change type: bug (primary: fixes #1216) + refactor (secondary: dialog UX)
  • Primary scope: server

Linked Issue

  • Closes #1216

Validation Evidence (required)

bun run type-check     # 10/10 packages: Exited with code 0
bun run lint           # 0 errors, 0 warnings
bun run format:check   # All matched files use Prettier code style
bun run test           # one pre-existing failure on dev (cleanup-service.test.ts —
                       # `runScheduledCleanup > continues processing after error on
                       # one environment`); verified to also fail on origin/dev
                       # without this branch's changes. NOT introduced by this PR.

End-to-end reproducer (the bug fix verification):

# Terminal A
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run cli workflow run e2e-claude-smoke --no-worktree

# Terminal B (during A's execution)
ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 bun run dev:server

Result with this PR applied:

  • Terminal A → Workflow completed successfully. (exit 0) ✓
  • Server log → zero orphan / fail_orphans / orphaned_workflow_runs_failed events ✓
  • DB → run row ends with status='completed', not failed

Without this PR (verified before the fix): Terminal A exits 1 with "Workflow failed", server log emits db.orphaned_workflow_runs_failed { count: 1 } — exactly the run that was in flight.

Regression sweep:

grep -n 'window\.confirm' \
  packages/web/src/components/dashboard/WorkflowRunCard.tsx \
  packages/web/src/components/dashboard/WorkflowHistoryTable.tsx
# zero matches
grep -nE "failOrphanedRuns\(\)" packages/server/src/index.ts
# zero matches

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes (no API contract change; failOrphanedRuns() retained for explicit cleanup)
  • Config/env changes? No
  • Database migration needed? No

Behavioral change for operators: Server restarts no longer auto-mark running workflow rows as failed. Truly orphaned rows from a crashed server now persist as running until cleaned up via archon workflow cleanup or per-row Cancel/Abandon in the dashboard. The Dashboard nav count badge surfaces the count.

Human Verification (required)

  • Verified scenarios:
    • End-to-end reproducer (above) — bug confirmed fixed
    • bun run dev:server starts cleanly with no orphan-related log events
    • CLI workflow completes cleanly even with concurrent server start
    • Type-check + lint + format all green across all 10 packages
  • Edge cases checked:
    • The failOrphanedRuns() function is preserved and still callable by the explicit archon workflow cleanup path
    • The unused createWorkflowStore import in server/index.ts was also removed (caught by TS noUnusedLocals)
    • The ConfirmRunActionDialog does NOT swallow promise rejections from onConfirm — errors propagate to the parent's runAction helper which already displays them via actionError state
  • What was not verified:
    • UI manual interaction with the new AlertDialogs (no browser available in this environment) — the AlertDialog primitive and the mirrored ProjectSelector pattern are both production-tested elsewhere; the change is essentially a render-shape swap
    • The dashboard nav badge update timing (relies on existing 10s polling; should appear within 10s of a workflow start)
    • No new component tests added — the web package has no React component test infrastructure (bun test only covers src/lib/ and src/stores/); adding @testing-library/react would be significant scope creep matching no existing pattern. Type-check + lint + manual UI verification + the backend reproducer are the verification levels in this PR.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Server startup; CLI workflows running concurrently with server restarts; web UI Dashboard nav + workflow run cards + history table
  • Potential unintended effects:
    • Truly orphaned running rows from crashed servers will accumulate in the DB until explicit cleanup. The count badge surfaces them; users can click into the dashboard and Cancel per row. This is the intended trade-off per CLAUDE.md "No Autonomous Lifecycle Mutation Across Process Boundaries".
    • listDashboardRuns({ status: 'running', limit: 1 }) in TopNav adds one query per 10s where listWorkflowRuns was previously called. Same frequency, slightly heavier endpoint (returns enriched run + counts vs raw run array). The limit: 1 keeps the runs payload trivially small; we only consume counts.running.
  • Guardrails / monitoring for early detection:
    • The dashboard nav count badge is the primary visibility signal — operators see it grow if orphans accumulate
    • archon workflow status CLI command continues to work and lists running rows
    • Existing db.orphaned_workflow_runs_failed log event is now only emitted by the explicit cleanup path, so its presence post-merge is a useful signal that someone ran cleanup intentionally

Rollback Plan (required)

  • Fast rollback command/path: git revert 7a00e047 on dev. One commit, atomic. No DB changes to reverse.
  • Feature flags or config toggles: None.
  • Observable failure symptoms:
    • If for some reason the dashboard nav badge fails to render: existing pulse-dot pattern is restored by the revert
    • If the AlertDialogs misbehave: revert restores window.confirm (worse UX but functional)
    • If the bug fix introduces an unforeseen regression: very unlikely (the change is a removal of an unconditional mutation), but revert is safe and restores the prior behavior including the bug

Risks and Mitigations

  • Risk: Operators who relied on server restarts to "tidy up" stuck workflows will need to use archon workflow cleanup or the dashboard explicitly. Some may not realize the behavior changed.
    • Mitigation: CHANGELOG entry under [Unreleased] documents the change. The Dashboard nav count badge surfaces stuck workflows visibly. Server log on startup no longer emits the misleading db.orphaned_workflow_runs_failed event, removing a false-positive signal.
  • Risk: A pause in dialog-confirmation polish leaves the codebase with two destructive-confirm patterns (AlertDialog in some places, window.confirm elsewhere — ProjectSelector pattern, this PR's pattern, and any I missed).
    • Mitigation: This PR replaces all window.confirm in the touched files (WorkflowRunCard.tsx, WorkflowHistoryTable.tsx). Other components in packages/web/ may still use window.confirm and should be reviewed in a follow-up sweep — out of scope here.

Summary by CodeRabbit

  • New Features

    • Custom confirmation dialogs replace browser prompts for destructive workflow actions (Abandon, Cancel, Delete, Reject), providing contextual titles and descriptions.
  • Changed

    • Server no longer auto-marks actively-running workflows as failed on startup; stuck runs must be handled by users via dashboard actions or the CLI.
    • Dashboard tab shows running workflow count as a numeric badge with aria-labels.
  • Documentation

    • Guides updated to reflect orphaned-run, resume, and cleanup behavior.
CUT
cutter bot commented just now

Cutter Summary

Coverage limit — The archon stack adapter exposes no cutter:seed-apply seed() implementation, so seeded workflow_runs are not replayable via the pipeline. The before-side will see an empty dashboard (no cards) and correctly mark these steps as absent — that asymmetry is itself evidence the after-side surfaces the new UI.