Use SIGINT-first ladder in `run-tests` cleanup

The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test- cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 14:46:16 -04:00 · 2026-04-23 14:46:16 -04:00 · 70d58c4bd2
parent 1af2121057
commit 70d58c4bd2
1 changed files with 26 additions and 10 deletions
--- a/.claude/skills/run-tests/SKILL.md
+++ b/.claude/skills/run-tests/SKILL.md
@ -249,22 +249,38 @@ ls -la /tmp/registry@*.sock 2>/dev/null \
  surface PIDs + cmdlines to the user, offer cleanup:

  ```sh
-  # 1. kill test zombies scoped to THIS repo's python only
-  #    (don't pkill by bare pattern — that'd nuke legit
-  #    long-running tractor apps like piker)
-  pkill -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
-  sleep 0.3
-  pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" 2>/dev/null
+  # 1. GRACEFUL FIRST (tractor is structured concurrent — it
+  #    catches SIGINT as an OS-cancel in `_trio_main` and
+  #    cascades Portal.cancel_actor via IPC to every descendant.
+  #    So always try SIGINT first with a bounded timeout; only
+  #    escalate to SIGKILL if graceful cleanup doesn't complete).
+  pkill -INT -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"

-  # 2. if a test zombie holds :1616 specifically and doesn't
+  # 2. bounded wait for graceful teardown (usually sub-second).
+  #    Loop until the processes exit, or timeout. Keep the
+  #    bound tight — hung/abrupt-killed descendants usually
+  #    hang forever, so don't wait more than a few seconds.
+  for i in $(seq 1 10); do
+    pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null || break
+    sleep 0.3
+  done
+
+  # 3. ESCALATE TO SIGKILL only if graceful didn't finish.
+  if pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null; then
+    echo 'graceful teardown timed out — escalating to SIGKILL'
+    pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
+  fi
+
+  # 4. if a test zombie holds :1616 specifically and doesn't
  #    match the above pattern, find its PID the hard way:
  ss -tlnp 2>/dev/null | grep ':1616'   # prints `users:(("<name>",pid=NNNN,...))`
-  # then: kill <NNNN>
+  # then (same SIGINT-first ladder):
+  # kill -INT <NNNN>; sleep 1; kill -9 <NNNN> 2>/dev/null

-  # 3. remove stale UDS sockets
+  # 5. remove stale UDS sockets
  rm -f /tmp/registry@*.sock

-  # 4. re-verify
+  # 6. re-verify
  ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 now free'
  ```