Claude Mastery
#22 · Friday Edition
⌨️ CLI POWER MOVE
🔥 New · 🔧 Try It
v2.1.119 — /config Finally Persists, --from-pr Speaks GitLab/Bitbucket, --print Honors Agent Frontmatter

v2.1.119 shipped Apr 23. Four of its changes affect how you configure Claude Code, not just what buttons are on the screen.

/config now persists to disk. Theme, editor mode, verbose, and the other /config toggles previously lived only in the current session — restart Claude Code and you were back to defaults. They now write through to ~/.claude/settings.json with project → local → policy override precedence. If you've been re-toggling /config set verbose true every morning, stop.
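If the persisted toggles follow the flat key style of other settings.json entries, a toggle should read back as a plain JSON value after restart. A minimal sketch; the key names ("theme", "verbose") are assumptions, not a documented schema:

```shell
# Hypothetical shape of the persisted toggles; inspect your real
# ~/.claude/settings.json rather than trusting these key names.
f=$(mktemp)
cat > "$f" <<'EOF'
{
  "theme": "dark",
  "verbose": true
}
EOF
jq -r '.verbose' "$f"   # → true
```

The same jq call against your real settings file is a quick way to confirm a /config change survived a restart.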

--from-pr speaks four Git hosts. Until today it only accepted github.com URLs. It now accepts GitLab merge-request URLs, Bitbucket pull-request URLs, and GitHub Enterprise PR URLs. If your shop runs self-hosted GitLab, you can now run claude --from-pr https://gitlab.internal/team/repo/-/merge_requests/123 without scripting around the Git forge.
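The accepted URL shapes differ per forge. As a sketch, the patterns below are assumptions modeled on each host's conventions, not Claude Code's actual matcher:

```shell
# Assumed URL shapes per forge; GitHub Enterprise uses the github.com path
# layout on a custom host, so it falls through to the pull/ default here.
classify_pr_url() {
  case "$1" in
    *github.com/*/pull/*)             echo github ;;
    */-/merge_requests/*)             echo gitlab ;;     # gitlab.com or self-hosted
    *bitbucket.org/*/pull-requests/*) echo bitbucket ;;
    */pull/*)                         echo github-enterprise ;;
    *)                                echo unknown ;;
  esac
}

classify_pr_url "https://gitlab.internal/team/repo/-/merge_requests/123"   # → gitlab
```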

--print (headless) now honors agent frontmatter tools: and disallowedTools:. This closes a subtle parity gap: an agent with tools: [Read, Grep] in its frontmatter got the narrowed set interactively but all tools when run headlessly via claude -p --agent. That's fixed. --agent now also honors the agent's permissionMode for built-in agents. Audit your cron agents — some may have been running with wider tool access than their frontmatter implied.
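One way to run that audit, assuming your agents are markdown files with YAML frontmatter (the directory layout and file contents here are illustrative):

```shell
# Agents that DO declare a tools: narrowing are the ones whose headless
# behavior just changed: before v2.1.119 they still got every tool under
# claude -p; now the narrowed set applies there too.
dir=$(mktemp -d)
cat > "$dir/triage.md" <<'EOF'
---
name: triage
tools: [Read, Grep]
---
Summarize failing tests without editing files.
EOF
cat > "$dir/deploy.md" <<'EOF'
---
name: deploy
---
Run the release pipeline.
EOF

grep -l '^tools:' "$dir"/*.md   # lists triage.md, the agent to re-check
```

Point the grep at wherever your cron agents actually live.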

Two more worth noting: PostToolUse and PostToolUseFailure hooks gain a duration_ms field (covered in Agent Architecture below). PowerShell tool commands can now be auto-approved in permission mode, matching Bash.

🧭 OPERATOR THINKING
⚠️ Breaking · 🔬 Deep Dive
The April 23 Postmortem — Three Silent Regressions, Seven Weeks, One Honest Writeup

Yesterday Anthropic's engineering blog published something unusual: a postmortem admitting Claude Code quality genuinely regressed for seven weeks. Three unrelated bugs stacked. Users weren't imagining it.

Bug 1 (Mar 4 → Apr 7): Default reasoning effort was dropped from high to medium to fix UI freezes from extended-thinking latency. Sonnet 4.6 and Opus 4.6 users got a less intelligent model by default. Reverted Apr 7; Opus 4.7's current default is xhigh, the other Claudes are back to high.

Bug 2 (Mar 26 → v2.1.101): A prompt-caching optimization was supposed to clear Claude's older thinking blocks from idle sessions (>1 hour) exactly once. A bug made it fire every turn. Symptom: Claude seemed forgetful and repetitive, made odd tool choices, and depleted your usage limits inexplicably. Fix shipped in v2.1.101. Opus 4.7's Code Review tool reportedly detected this bug during retrospective analysis; Opus 4.6's couldn't.

Bug 3 (Apr 16 → v2.1.116): A "keep text between tool calls ≤25 words" system-prompt directive shaved 3% off intelligence evaluations. It affected Sonnet 4.6, Opus 4.6, and Opus 4.7 simultaneously. Reverted as part of v2.1.116 on Apr 20.

Mitigation: Anthropic is resetting usage limits for all subscribers.

Operator takeaway. Simon Willison notes he runs ~11 long-idle sessions and prompts more in stale sessions than fresh ones — Bug 2 disproportionately punished that workflow, silently. You cannot rely on vendor self-detection: Opus 4.6's code review missed the cache bug for nearly a month. If your agents are burning tokens faster than usual, the next "invisible" regression should be detectable locally. The Agent Architecture topic below shows how. Audit version history for the bug-fix versions (v2.1.101, v2.1.116) — if you pinned Claude Code between Mar 26 and Apr 20, you ran a degraded binary.
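A version number is a reasonable proxy for the date window. This sketch compares against the v2.1.116 revert with sort -V; tighten it per-bug (v2.1.101 for the cache fix) as needed:

```shell
# Version-number proxy for the Mar 26 - Apr 20 window: anything that sorts
# strictly before the 2.1.116 revert is suspect.
predates() {  # true if $1 sorts strictly before $2 under version ordering
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

if predates "2.1.110" "2.1.116"; then
  echo "2.1.110: inside the degraded window"
else
  echo "2.1.110: clear"
fi
```

Feed it the output of claude --version on any pinned machine.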

🏗️ AGENT ARCHITECTURE
🔥 New · 🔧 Try It
PostToolUse Now Has duration_ms — Use It to Catch Your Own Silent Regressions

v2.1.119's most quietly useful change: PostToolUse and PostToolUseFailure hook inputs now include duration_ms — wall-clock tool execution time, excluding permission prompts and PreToolUse hook runtime. (The official hooks reference at code.claude.com/docs/en/hooks hasn't been updated yet; the release notes are authoritative.)

Why this matters, one day after the postmortem: the March 26 cache bug manifested as abnormal token burn and slow turns across thousands of users for nearly a month before anyone traced it. You don't have to wait for Anthropic's retrospective next time. If every PostToolUse hook call writes {timestamp, tool_name, duration_ms} to a local SQLite file, a p50/p95 latency drift on a specific tool (Read, Bash, WebFetch) becomes a one-query visualization. Sudden rise in Bash duration? Your sandbox wrapper changed. Sudden rise in WebFetch? Upstream slowness or a caching regression.

The hook payload already carried tool_name, tool_input, tool_response, tool_use_id, session_id, transcript_path, cwd, and permission_mode. duration_ms closes the loop — now you have the tool's input, its output, and its latency, per turn, across every session, with zero API cost.
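Turning one payload into a log row is a single jq pass; the field values below are illustrative, not captured output:

```shell
# Illustrative payload; in a real hook this JSON arrives on stdin.
payload='{"session_id":"abc123","tool_name":"Read","tool_use_id":"toolu_01","duration_ms":42,"cwd":"/tmp","permission_mode":"default"}'

# Tool name and latency, tab-separated, ready for .import or a log file:
echo "$payload" | jq -r '[.tool_name, .duration_ms] | @tsv'
```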

Three patterns worth setting up:

  1. Rolling p95 per tool per day — detects gradual regressions (the exact signature of the clear_thinking cache bug, which slowly starved older sessions).
  2. Breach alert — fire a notification if duration_ms exceeds your historical p99 by 3x on any tool.
  3. Cross-session correlation — when a regression hits, session_id + transcript_path lets you pull the exact conversation that tripped it.
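Pattern 2 can be sketched end-to-end against a throwaway database; the synthetic rows stand in for real hook data, and cume_dist (SQLite ≥ 3.25) approximates the percentile:

```shell
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE tool_calls (ts INTEGER, session_id TEXT, tool_name TEXT,
                         duration_ms INTEGER, success INTEGER);
-- 100 historical Bash calls at 50-59 ms, plus one 900 ms call just now
WITH RECURSIVE seq(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM seq WHERE i < 100)
INSERT INTO tool_calls
SELECT strftime('%s','now') - i*60, 's1', 'Bash', 50 + i % 10, 1 FROM seq;
INSERT INTO tool_calls VALUES (strftime('%s','now'), 's1', 'Bash', 900, 1);
SQL

result=$(sqlite3 "$db" <<'SQL'
-- Historical p99 per tool from calls older than an hour
WITH hist AS (
  SELECT tool_name, duration_ms,
         CUME_DIST() OVER (PARTITION BY tool_name ORDER BY duration_ms) AS cd
  FROM tool_calls
  WHERE ts < strftime('%s','now','-1 hour')
),
p99 AS (
  SELECT tool_name, MIN(duration_ms) AS p99_ms FROM hist
  WHERE cd >= 0.99 GROUP BY tool_name
)
SELECT t.tool_name, t.duration_ms, p.p99_ms
FROM tool_calls t JOIN p99 p USING (tool_name)
WHERE t.ts >= strftime('%s','now','-1 hour')
  AND t.duration_ms > 3 * p.p99_ms;
SQL
)
echo "$result"
```

Here the single 900 ms call breaches 3x the 59 ms historical p99, so it is the only row returned.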

The Practice Lab below wires this up in about 15 minutes.

🌐 ECOSYSTEM INTEL
🔥 New · 🌿 Evergreen
AgentBox SDK — Drop Claude Code Into E2B, Modal, Daytona, Vercel, or Local Docker

TwillAI/agentbox-sdk is a Node SDK (TypeScript, MIT, 148 stars, 17 releases, v0.1.701) that runs Claude Code — or Codex or OpenCode — inside a sandbox, talking to your code over WebSocket/HTTP. Unlike the CLI wrappers that shell out and lose interactive features, AgentBox launches the agent as a server process so approval flows, tool-use control, skill loading, and MCP servers all keep working.

Five sandbox providers share one interface: local-docker for your laptop, e2b for micro-VMs, modal for cloud containers, daytona for cloud dev environments, and vercel for ephemeral cloud VMs. Three agent providers share a second interface: claude-code, opencode, codex. You can swap either axis without changing the calling code.

Why it matters. Yesterday we covered Broccoli — a self-hosted Linear→PR pipeline. AgentBox is the lower layer: the "where does the agent actually run" primitive. If you want parallel agent branches, ephemeral CI agents, or safe execution of untrusted PR code, this is the SDK surface to build on. npm install agentbox-sdk, Node ≥ 20.

TwillAI is the YC S25 team behind the Twill cloud-agent pipeline (covered earlier in the series). This SDK is what powers their own product, open-sourced — so it's battle-tested on their internal workload, not a weekend project. Treat it as a starting point for your own sandbox orchestration, not a finished product: 14 forks and a July-2025 last-release date suggest active but focused development.

🔬 PRACTICE LAB
🔧 Try It
Wire PostToolUse duration_ms to SQLite and Alert When a Tool Starts Drifting

Build the agent-side telemetry that would have caught the cache bug on day two.

Prerequisites: Claude Code ≥ v2.1.119 (claude --version). SQLite CLI (sqlite3 --version). A shell with jq. 15 minutes.

What you'll do: Log every tool call's duration to ~/.claude/tool-latency.db, then run a one-liner that flags any tool whose last-hour average latency exceeds its 7-day baseline average by 3x.

Steps:

  1. Create the database schema:

```bash
sqlite3 ~/.claude/tool-latency.db <<'SQL'
CREATE TABLE IF NOT EXISTS tool_calls (
  ts INTEGER NOT NULL,
  session_id TEXT NOT NULL,
  tool_name TEXT NOT NULL,
  duration_ms INTEGER NOT NULL,
  success INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tool_ts ON tool_calls(tool_name, ts);
SQL
```
  2. Add the hook to ~/.claude/settings.json (or wherever your global hooks live). Both PostToolUse and PostToolUseFailure emit duration_ms; now|floor keeps the timestamp integral for the INTEGER ts column:

```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "*",
      "hooks": [{
        "type": "command",
        "command": "jq -r '[now|floor, .session_id, .tool_name, .duration_ms, 1] | @tsv' | sqlite3 ~/.claude/tool-latency.db '.mode tabs' '.import /dev/stdin tool_calls'"
      }]
    }],
    "PostToolUseFailure": [{
      "matcher": "*",
      "hooks": [{
        "type": "command",
        "command": "jq -r '[now|floor, .session_id, .tool_name, .duration_ms, 0] | @tsv' | sqlite3 ~/.claude/tool-latency.db '.mode tabs' '.import /dev/stdin tool_calls'"
      }]
    }]
  }
}
```
  3. Exercise a few tools. Open a Claude Code session and run a handful of prompts that use Read, Grep, Bash, and WebFetch. Exit.
  4. Run the drift query. Save it as ~/bin/claude-latency-drift.sh and mark it executable (chmod +x):

```bash
#!/bin/bash
# Averages, not true percentiles: SQLite has no built-in percentile function.
sqlite3 ~/.claude/tool-latency.db <<'SQL'
WITH recent AS (
  SELECT tool_name,
         COUNT(*) AS n,
         CAST(AVG(duration_ms) AS INTEGER) AS avg_ms
  FROM tool_calls
  WHERE ts > strftime('%s','now','-1 hour')
  GROUP BY tool_name
),
baseline AS (
  SELECT tool_name,
         CAST(AVG(duration_ms) AS INTEGER) AS baseline_ms
  FROM tool_calls
  WHERE ts > strftime('%s','now','-7 days')
    AND ts < strftime('%s','now','-1 hour')
  GROUP BY tool_name
)
SELECT r.tool_name,
       r.n AS calls_last_hour,
       r.avg_ms AS recent_avg_ms,
       b.baseline_ms,
       ROUND(1.0 * r.avg_ms / NULLIF(b.baseline_ms, 0), 2) AS ratio
FROM recent r
JOIN baseline b USING (tool_name)
WHERE 1.0 * r.avg_ms / NULLIF(b.baseline_ms, 0) > 3.0
ORDER BY ratio DESC;
SQL
```

Expected outcome: After a few days of Claude Code use, the script returns an empty result on quiet days and one row per tool whose last-hour average latency has spiked 3x above its 7-day baseline.

Verify: Run sqlite3 ~/.claude/tool-latency.db "SELECT tool_name, COUNT(*) FROM tool_calls GROUP BY tool_name;" — you should see one row per tool you used, with non-zero counts. The success column distinguishes PostToolUse (1) from PostToolUseFailure (0).

Why this is worth 15 minutes. The March 26 cache bug showed up in user-visible token burn, but not in any public metric anyone could query. With this hook in place, the next time something upstream regresses, you see it locally — same day, same tool. Add a cron job to run the drift query every morning and you've built the monitoring layer Anthropic's postmortem implicitly asked users to build for themselves.
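The cron wiring is one line; the schedule and log path here are arbitrary choices:

```shell
# 08:00 daily; breaches append to a log you can tail or alert on.
cronline='0 8 * * * $HOME/bin/claude-latency-drift.sh >> $HOME/.claude/latency-drift.log 2>&1'
echo "$cronline"
# To install, append it to the existing crontab:
#   ( crontab -l 2>/dev/null; printf '%s\n' "$cronline" ) | crontab -
```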