Here's the problem. The per-key locking doesn't fix this race because the **lock lifetime is decoupled from `run_ctx`'s lifetime**. Trace through what happens:
**`maybe_open_context`'s `finally` block** (lines 468-495):
1. `users[ctx_key] -= 1` → 0
2. `no_more_users.set()` → wakes `run_ctx` (but it doesn't run yet, it only becomes ready)
3. `locks.pop()` → the per-key lock is gone, so the next caller creates a fresh one
The core issue: `no_more_users.set()` (step 2) and `locks.pop()` (step 3) happen in the **exiting caller's task**, but the actual resource cleanup (`values.pop`, acm `__aexit__`, `resources.pop`) happens in the **`run_ctx` task** inside `service_tn`. There's no synchronization between "lock removed, new callers welcome" and "run_ctx has actually finished tearing down". The per-`ctx_key` lock change just made the lock more granular — it didn't close this gap.
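To make the window concrete, here is a minimal asyncio analogue (the real code runs `run_ctx` under a trio/anyio service nursery; every name below is a stand-in for illustration, not the actual implementation):

```python
import asyncio

users: dict = {}
locks: dict = {}
resources: dict = {}
teardown_log: list = []

async def run_ctx(key, no_more_users):
    # stands in for the service-nursery task holding the acm open
    await no_more_users.wait()
    await asyncio.sleep(0)          # "woken" != "run": teardown happens later
    resources.pop(key, None)        # acm __aexit__ / resources.pop() land here
    teardown_log.append("teardown")

async def exiting_caller(key, no_more_users):
    users[key] -= 1
    if users[key] == 0:
        no_more_users.set()         # step 2: run_ctx merely becomes ready
        locks.pop(key, None)        # step 3: new callers may take a fresh lock

async def new_caller(key):
    async with locks.setdefault(key, asyncio.Lock()):
        # the race: the stale entry is still visible under a *new* lock
        teardown_log.append(f"stale={key in resources}")

async def main():
    key, no_more_users = "k", asyncio.Event()
    users[key], resources[key] = 1, "r"
    t = asyncio.create_task(run_ctx(key, no_more_users))
    await exiting_caller(key, no_more_users)
    await new_caller(key)           # wins the race against run_ctx's teardown
    await t
    return teardown_log

print(asyncio.run(main()))          # ['stale=True', 'teardown']
```

The new caller observes the stale `resources` entry before the teardown body ever executes, which is exactly the gap the per-key lock cannot close.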
The fix needs to ensure that a new caller can't acquire a lock for `ctx_key` until `run_ctx` has fully exited (including `resources.pop()`). Some options:
- **don't pop the lock in `maybe_open_context`** — let `run_ctx` release/pop it after `resources.pop()`
- **wait on a "teardown complete" event** in the cache-miss path when stale `resources` exist
- **pop `resources` before the acm `__aexit__`** (move `resources.pop()` into the inner finally alongside `values.pop()`)
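A sketch of the second option, again as a self-contained asyncio analogue with invented names (`teardown_done`, `open_context` are hypothetical, not the real API): the cache-miss path parks on a per-key "teardown complete" event until `run_ctx` signals that `resources.pop()` has actually happened:

```python
import asyncio

resources: dict = {}
teardown_done: dict = {}
order: list = []

async def run_ctx(key, no_more_users):
    await no_more_users.wait()
    await asyncio.sleep(0)          # teardown still runs late, as before
    resources.pop(key, None)
    order.append("teardown")
    teardown_done[key].set()        # teardown complete: new callers may proceed

async def open_context(key):
    # cache-miss path: a stale entry means a teardown is in flight -> wait it out
    if key in resources and key in teardown_done:
        await teardown_done[key].wait()
    resources[key] = "fresh"
    order.append("opened")

async def main():
    key, no_more_users = "k", asyncio.Event()
    resources[key] = "stale"
    teardown_done[key] = asyncio.Event()
    t = asyncio.create_task(run_ctx(key, no_more_users))
    no_more_users.set()             # the exiting caller's last-user signal
    await open_context(key)         # now blocks until teardown has finished
    await t
    return order

print(asyncio.run(main()))          # ['teardown', 'opened']
```

With the event in place the new caller can no longer observe the stale entry: "opened" is forced to order strictly after "teardown", regardless of how the scheduler interleaves the tasks.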