Use SIGINT-first ladder in `run-tests` cleanup
The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test- cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-codesubint_forkserver_backend
parent
1af2121057
commit
70d58c4bd2
|
|
@ -249,22 +249,38 @@ ls -la /tmp/registry@*.sock 2>/dev/null \
|
|||
surface PIDs + cmdlines to the user, offer cleanup:
|
||||
|
||||
```sh
|
||||
# 1. kill test zombies scoped to THIS repo's python only
|
||||
# (don't pkill by bare pattern — that'd nuke legit
|
||||
# long-running tractor apps like piker)
|
||||
pkill -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
|
||||
sleep 0.3
|
||||
pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" 2>/dev/null
|
||||
# 1. GRACEFUL FIRST (tractor is structured concurrent — it
|
||||
# catches SIGINT as an OS-cancel in `_trio_main` and
|
||||
# cascades Portal.cancel_actor via IPC to every descendant.
|
||||
# So always try SIGINT first with a bounded timeout; only
|
||||
# escalate to SIGKILL if graceful cleanup doesn't complete).
|
||||
pkill -INT -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
|
||||
|
||||
# 2. if a test zombie holds :1616 specifically and doesn't
|
||||
# 2. bounded wait for graceful teardown (usually sub-second).
|
||||
# Loop until the processes exit, or timeout. Keep the
|
||||
# bound tight — hung/abrupt-killed descendants usually
|
||||
# hang forever, so don't wait more than a few seconds.
|
||||
for i in $(seq 1 10); do
|
||||
pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null || break
|
||||
sleep 0.3
|
||||
done
|
||||
|
||||
# 3. ESCALATE TO SIGKILL only if graceful didn't finish.
|
||||
if pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null; then
|
||||
echo 'graceful teardown timed out — escalating to SIGKILL'
|
||||
pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
|
||||
fi
|
||||
|
||||
# 4. if a test zombie holds :1616 specifically and doesn't
|
||||
# match the above pattern, find its PID the hard way:
|
||||
ss -tlnp 2>/dev/null | grep ':1616' # prints `users:(("<name>",pid=NNNN,...))`
|
||||
# then: kill <NNNN>
|
||||
# then (same SIGINT-first ladder):
|
||||
# kill -INT <NNNN>; sleep 1; kill -9 <NNNN> 2>/dev/null
|
||||
|
||||
# 3. remove stale UDS sockets
|
||||
# 5. remove stale UDS sockets
|
||||
rm -f /tmp/registry@*.sock
|
||||
|
||||
# 4. re-verify
|
||||
# 6. re-verify
|
||||
ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 now free'
|
||||
```
|
||||
|
||||
|
|
|
|||
Loading…
Reference in New Issue