brandonwie.dev

Six Papers, Zero Applied: A Week of Disciplined Reading

Six papers from DAIR.AI's April 6-12 batch. Three unifying themes surfaced, and zero were implemented — a practical walk-through of Pattern A, theme saturation before action.

17 min read


Six papers from DAIR.AI’s April 6-12 batch. Six paper analyses written. Six knowledge entries extracted into the graph. And zero changes applied to 3B’s infrastructure this week.

That’s not a backlog. It’s Pattern A — theme saturation before action. Under Pattern A, a claim from a single paper is noted, indexed, and shelved; it is not implemented. It graduates only when a later weekly cycle reinforces the same theme from an independent source, or when observed friction inside 3B itself validates it. Two signals, or one reality: either promotes the claim. One signal alone does not.

This post is the test. Six papers, three unifying themes, one policy response. The themes are scale-awareness (what changes when agent and skill counts cross thresholds), bidirectional flow (why one-directional architectures hit ceilings), and atomic granularity (how to carve capabilities into components that compose and generalize). All three will be shelved at the end of this post. None will be implemented. That is the feature, not the bug.


Theme 1 — Scale-Awareness

3B was designed around roughly twenty skills and single-agent defaults. It now runs forty-six-plus skills with auto-spawning agent teams when the parallel-task advisor’s signal score clears its threshold. Two papers in this batch quantify where those original design assumptions actually break.

Tran and Kiela’s Single-Agent LLMs Outperform Multi-Agent Systems (Stanford) grounds multi-agent coordination in the Data Processing Inequality. The formal claim is clean: for the Markov chain Y → C → M (answer, full context, inter-agent messages), I(Y; C) ≥ I(Y; M). Any estimator operating on compressed messages can be simulated by one operating on the full context with equal or better accuracy. Multi-agent coordination is information loss — every inter-agent message is a lossy compression of the full context each agent holds.
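The inequality can be checked numerically on a toy chain. The sketch below is not the paper's setup — the distributions (a binary answer, a context carrying the answer plus a noisy hint, a message that keeps only the hint) are invented for illustration — but the DPI guarantees the ordering regardless of which lossy compression you pick:

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(X;Z) in bits from a joint pmf {(x, z): p}."""
    px, pz = defaultdict(float), defaultdict(float)
    for (x, z), p in joint.items():
        px[x] += p
        pz[z] += p
    return sum(p * log2(p / (px[x] * pz[z]))
               for (x, z), p in joint.items() if p > 0)

# Markov chain Y -> C -> M: Y the answer, C the full context,
# M a lossy "inter-agent message" computed from C alone.
p_yc, p_ym = defaultdict(float), defaultdict(float)
for y in (0, 1):
    for hint in (0, 1):
        p = 0.5 * (0.9 if hint == y else 0.1)  # hint agrees with y 90% of the time
        c = (y, hint)   # full context: answer plus hint
        m = hint        # message: drops the answer, keeps the hint
        p_yc[(y, c)] += p
        p_ym[(y, m)] += p

i_yc = mutual_information(p_yc)   # I(Y; C) = 1.0 bit (C determines Y)
i_ym = mutual_information(p_ym)   # I(Y; M) ~ 0.531 bits
assert i_yc >= i_ym               # the DPI: compression cannot add information
```

Swap in any other deterministic or stochastic `m = f(c)` and the assertion still holds; that is the whole of Tran's structural point.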

Tested across three model families (Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5), five multi-agent architectures, and two benchmarks, the result holds: at matched compute budgets, single-agent systems match or exceed multi-agent. The gains reported in the MAS literature are confounded by unaccounted computation — more total tokens, not architectural advantage. Performance saturates around 1000–2000 thinking tokens; over-allocating compute to either more agents or more tokens shows diminishing returns.

The exception is narrow but real. When single-agent context utilization is corrupted — substitution noise α ≥ 0.7, roughly 70% token corruption — multi-agent’s decomposition structure starts recovering information the single agent cannot. That regime exists, but it requires evidence, not assumption.

The implication for any orchestration layer is that the burden of proof flips. Multi-agent now needs specific justification: genuinely independent subtasks, heavily degraded context, a concrete information-recovery claim. “Task is hard” is not justification; it is an argument for more compute, which single-agent can absorb directly.

Liu et al.’s How Well Do Agentic Skills Work in the Wild (UCSB + MIT CSAIL) attacks scale-awareness from a different angle: what happens as the skill pool grows? Answer: skill selection becomes the bottleneck, not execution. The numbers, on Claude Opus 4.6 against SKILLSBENCH:

| Challenge | Performance cost |
| --- | --- |
| Skill selection (agent chooses whether to load) | −4.2pp |
| Distractor noise (5 irrelevant skills mixed in) | −7.7pp |
| Retrieval from 34K-skill pool | −11.1pp |
| Adaptation (no curated skills available) | −12.8pp |

The punchline is the distractor result: five irrelevant skills alongside correct ones costs fifteen percent in relative performance. Weaker models — Kimi K2.5, Qwen3.5 in the paper’s test set — fall below their no-skill baselines when given retrieved skills. Irrelevant skills actively mislead them. Even Claude loads all available curated skills only 49% of the time.

Retrieval quality is the dominant mechanism. Agentic hybrid search (iterative query formulation + BM25/dense hybrid via RRF) outperforms direct embedding lookup by 18.7 points. Indexing full SKILL.md content rather than just name and description adds two points at Recall@5. Query-specific refinement recovers 7.8pp on TERMINAL-BENCH 2.0 — but only when initial skill quality is decent (LLM-judge coverage score ≥ 3.83/5). Refinement is a multiplier, not a generator; it cannot compensate for missing skills.
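The fusion step in that hybrid pipeline is standard Reciprocal Rank Fusion. Liu's full retrieval stack isn't reproduced here; this is a minimal RRF sketch with invented skill IDs, showing how BM25 and dense rankings combine without score normalization:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists from multiple
    retrievers (e.g. BM25 and dense embeddings) into one ranking.
    k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["skill-a", "skill-c", "skill-b"]    # lexical ranking (hypothetical)
dense = ["skill-b", "skill-a", "skill-d"]   # embedding ranking (hypothetical)
fused = rrf_fuse([bm25, dense])
```

Because RRF operates on ranks rather than raw scores, it needs no calibration between the lexical and dense retrievers — one reason it is the usual default for hybrid search.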

What the two papers share is the structural argument: systems designed at small scale have hidden thresholds. Adding agents without a compute-equivalence check doesn’t add information, just latency. Adding skills without a retrieval-quality layer doesn’t add capability, just noise. The design moves that were cheap at twenty skills stop being free at forty-six.

The cheapest countermove is not restructuring. It’s measurement. Start tracking the quantities that would reveal whether the thresholds have already been crossed — loading rates, wrong-skill-selection counts, per-team compute accounting. You cannot shelve what you have not yet named.
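What that measurement layer could look like in a hook-sized script — the file name, function names, and JSONL shape below are all hypothetical, not 3B's actual code:

```python
import json
import time
from pathlib import Path

LOG = Path("skill-loading-log.jsonl")  # hypothetical log location

def record_skill_event(prompt_id, loaded, expected=None):
    """Append one skill-loading event. `expected` is filled in later
    by review, which turns raw invocation counts into a measurable
    wrong-skill-selection rate."""
    event = {"ts": time.time(), "prompt": prompt_id,
             "loaded": loaded, "expected": expected}
    with LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def selection_error_rate():
    """Fraction of judged events where the wrong skill was loaded."""
    events = [json.loads(line) for line in LOG.read_text().splitlines()]
    judged = [e for e in events if e["expected"] is not None]
    if not judged:
        return None
    wrong = sum(e["loaded"] != e["expected"] for e in judged)
    return wrong / len(judged)
```

The point is that the expensive part — deciding what to restructure — is deferred; only the cheap append-and-count part ships now.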


Theme 2 — Bidirectional Flow

3B’s architecture points mostly one way. Compaction happens; decompression doesn’t. Memory graduates from buffer to knowledge; it doesn’t demote. Verification reports outcomes; it doesn’t separately report process. Three papers this batch argue that the reverse direction — the arrow drawn backward — is where the leverage is.

Rosset et al.’s The Art of Building Verifiers for Computer Use Agents (Microsoft Research) is the strongest of the three. Two baseline verifiers — WebVoyager and WebJudge — have false-positive rates of 45% and 22%. They credit agents with success roughly half or a quarter of the time when a human would mark failure. The intuitive fix — use a stronger LLM — is empirically wrong. Upgrading WebVoyager’s backbone from GPT-4o to GPT-5.2 (the same model powering Microsoft’s Universal Verifier) drops FPR from 45% to 10% but pushes FNR from 24% to 44%. Cohen’s κ improves modestly from 0.31 to 0.44. The Universal Verifier on that same backbone hits 0.64. The gap is architecture, not model power.

The architecture is four design principles. Non-overlapping rubric criteria generated independently of the trajectory being scored — otherwise the scorer tailors criteria to the agent’s behavior and the metric becomes meaningless. Separated process and outcome rewards. Controllable vs. uncontrollable failure attribution paired with a cascading-error-free strategy — so an early uncontrollable stumble (CAPTCHA, site outage) doesn’t zero the whole trajectory. Divide-and-conquer evidence scoping — each criterion receives only the top-k most relevant evidence units rather than the full trajectory in one call.

Principle 2 is the one that names the reverse direction explicitly. Score “did the agent follow the right steps” independently from “did the agent achieve the goal.” These diverge in real scenarios: right steps blocked by environment, wrong steps that luckily worked, right steps that succeeded via an unexpected path. Conflating them produces signals that are either too lenient (credit for empty-handed effort) or too harsh (blame for environment). Process and outcome are orthogonal in theory and in practice.
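The split is easy to express as a data shape. This is a sketch of the idea, not Rosset's implementation — the class, thresholds, and verdict labels are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class VerifierReport:
    """Separate channels per the process/outcome split: a trajectory
    can follow the right steps and still fail (blocked environment),
    or reach the goal through the wrong steps (lucky success)."""
    process_score: float   # did the agent take the right steps? [0, 1]
    outcome_score: float   # did the agent achieve the goal?     [0, 1]
    uncontrollable: bool   # e.g. CAPTCHA, site outage

    def verdict(self):
        if self.uncontrollable and self.process_score >= 0.8:
            return "blocked"        # right steps, environment failure
        if self.outcome_score >= 0.8 and self.process_score < 0.5:
            return "lucky-success"  # flag for review, don't reward
        return ("pass" if min(self.process_score, self.outcome_score) >= 0.8
                else "fail")
```

A single scalar reward cannot distinguish `blocked` from `fail` or `lucky-success` from `pass`; two channels plus the controllability bit can.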

Kontonis et al.’s Memento (Microsoft Research) moves the reverse-direction argument to compression. The ablation that matters is in §6.2.1: two inference modes use the same trained model, produce identical memento text, and differ only in whether the KV cache entries computed while the original block was visible are retained or discarded. The KV-discarding mode loses 15 percentage points on AIME’24 (66.1% → 50.8%). Same summary text, same model, different residual signal.

The takeaway generalizes into a design rule: when you compress for later reference, either retain a parallel channel that carries residual signal the summary can’t capture, or measure the accuracy delta explicitly and choose your compression ratio with eyes open. The failure mode is invisible loss — compressing without measuring and assuming the summary was good enough because it read well.

Qiao et al.’s Memory Intelligence Agent (ECNU + HIT) extends the reverse-direction thesis to memory itself. The central architecture is Manager-Planner-Executor with simultaneous parametric and non-parametric memory updates. The ablation (§4.6) drops one data point that should make every knowledge-base designer uncomfortable: stapling non-parametric memory to an Executor directly reduces accuracy versus no memory (42.35% → 41.95%, −0.4pp). Routing the same memory through a dedicated Planner recovers +3.5pp. Same memory, different consumer.

The mechanism that matters for any retrieval layer is Qiao’s scoring formula: Score(mᵢ) = 0.7 · similarity + 0.3 · value reward + 0.3 · frequency reward, where the frequency reward is explicitly 1/(uᵢ+1). That term rewards low usage, keeping long-tail entries surfaceable as they age. Without it, high-usage memories accumulate both similarity hits and value-reward weight, drowning rare-but-valuable entries. Over time, the system optimizes toward a small set of canonical memories and everything else becomes unfindable.
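The formula is small enough to transcribe directly. The weights are kept exactly as quoted above (note they sum to 1.3 as given); everything else here is illustrative:

```python
def memory_score(similarity, value_reward, usage_count,
                 w_sim=0.7, w_val=0.3, w_freq=0.3):
    """Qiao-style retrieval score: weighted similarity + value reward
    + frequency reward, where the frequency reward 1/(u+1) rewards
    LOW usage to keep long-tail entries surfaceable."""
    freq_reward = 1.0 / (usage_count + 1)
    return w_sim * similarity + w_val * value_reward + w_freq * freq_reward

# A slightly less similar but never-used entry can outrank a canonical one:
canonical = memory_score(similarity=0.90, value_reward=0.8, usage_count=40)
long_tail = memory_score(similarity=0.85, value_reward=0.8, usage_count=0)
```

With the frequency term, the fresh entry wins despite the lower similarity; zero that weight out and the canonical entry always dominates — exactly the popularity spiral the term exists to break.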

What the three papers share is the structural argument that one-directional architectures hit ceilings a reverse arrow would remove. Outcome-only verification inherits the model’s stochastic variance; adding a process channel adds independent signal. Summarize-and-discard compression inherits the summary’s faithfulness; adding a retention channel adds residual signal. Graduation-only memory inherits popularity bias; adding a demotion-or-diversity channel adds long-tail discoverability.

What they don’t share is implementation. Rosset’s Universal Verifier is 3,000 lines of Python plus 2,000 lines of prompts running offline at multi-second latency. Kontonis’s Memento is a custom vLLM fork trained via two-stage SFT on 228K traces. Qiao’s MIA is a GRPO-trained policy network with test-time learning mid-inference. None of those machines transfer directly to a file-based Zettelkasten running deterministic sub-100ms Python hooks.

The principles do.


Theme 3 — Atomic Granularity

3B’s skill-design-patterns.md reserves Pattern 2 and Pattern 3 slots, empty. One paper this batch proposes what Pattern 2 should say.

Ma et al.’s Scaling Coding Agents via Atomic Skills (HKUST + NUS) formalizes atomic skills as capabilities that satisfy three all-or-nothing properties. Precise I/O specification: you can write the skill’s type signature in one line. Independent evaluability with minimal ambiguity: you can write a deterministic automated check that the output is correct — TDD-compatibility as the test. Reusable as a building block: you can name at least two distinct workflows where the skill would be invoked with the same input/output contract. Miss any one principle and the capability is either a candidate for further decomposition or a hard-coded step that belongs inside a larger skill.
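The all-or-nothing shape of the test makes it mechanical to encode. This checklist function is my paraphrase of the three principles, not Ma's artifact:

```python
def atomicity_check(one_line_signature, deterministic_check, reuse_workflows):
    """All-or-nothing litmus test over Ma's three principles:
    precise I/O spec, independent evaluability, and reuse in at
    least two distinct workflows."""
    passes = (one_line_signature
              and deterministic_check
              and len(reuse_workflows) >= 2)
    if passes:
        return "atomic"
    # missing any property: decompose further, or inline into a larger skill
    return "decompose-or-inline"
```

The useful part is the failure branch: a capability that misses one property is not "bad", it just belongs at a different granularity.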

The experiment spans five atomic coding-agent skills — code localization, code editing, unit-test generation, issue reproduction, and code review — trained jointly with GRPO in a unified trajectory buffer. Two results matter:

  1. Joint RL across atomic skills improves 18.7% on the evaluation set of five atomic plus five composite tasks, compared to per-skill training.
  2. Atomic-skill-trained models generalize to unseen composite tasks (bug fixing, code refactoring, ML engineering, code security) better than models trained directly on those composite tasks.

The generalization result is the one that should unsettle intuition. Matching training to deployment is the default move — train on bug fixing if you want to deploy on bug fixing. Ma’s data says the intuition is wrong. Atomic primitives transfer; composite training overfits. The path to generalizing over unseen composites runs through decomposition, not direct imitation.

What “atomic” rules out matters as much as the principles themselves. Line count: some genuinely atomic skills are ten lines; some genuinely atomic workflows are five hundred. Surface area is not conceptual atomicity. Invocation frequency: new skills and rare specialists would never qualify. Atomicity is structural, not empirical. Team ownership: Conway’s-law decomposition drifts with team reorganization. The principles are deliberately external: they ask whether a capability satisfies a structural property, not whether it feels right.

There’s a minimal-tool corollary. Ma restricts the RL agent’s action space to bash and str_replace only. The argument is that richer tool abstractions enlarge the action space, destabilize training, and introduce “brittle or overlapping tool abstractions.” 3B’s toolchain is already minimal by this measure — Read, Write, Edit, Bash, Grep, Glob are primitives, and skills compose them rather than wrap them in DSL-level helpers. Ma’s result empirically validates a design choice 3B had made on taste. The corollary for skill authoring going forward: resist introducing skill-specific helpers unless the helper itself passes the three principles.

Not every skill should be atomic. Exploratory skills — /investigate, /clarify, /storm — intentionally violate Principle 2 because their value is the fuzzy, narrative output. A quick audit of the 3B library under the three principles splits roughly three ways: utility skills (/commit, /init-3b, /review-pr) are cleanly atomic; workflow skills (/wrap, /research-paper) are hybrid, with some sub-steps that could factor out as separate atomic skills; exploratory skills are non-atomic by design. The litmus test identifies the exploratory class as different, not bad.

The visible pilot for atomic decomposition is /review-pr. It already spawns three parallel review agents across seven categories — security, code quality, performance, architecture, test quality, maintainability, deployment safety. Each category looks primitive-shaped. Joint-RL doesn’t apply to a frozen-weight model like Claude, but the prompt-level analogue does: separate SKILL.md per category with a composition orchestrator is the obvious refactor. That refactor is not happening this week. It’s shelved.


The 3B gaps this batch revealed

Each theme maps to concrete 3B mechanisms. What follows is an honest audit, not a to-do list. The to-do list is Pattern A’s, not this post’s.

Scale-Awareness gaps. parallel-task-advisor.py scores five signals and recommends team-spawning when the total clears a threshold of five. There is no compute-equivalence gate. A task that would benefit equally from a single agent with focused context gets routed to multi-agent purely on signal density. By Tran’s DPI argument, that is a category error — adding agents doesn’t add information, just coordination overhead. On the skills side, 3B routes via a twenty-one-row manual table in CLAUDE.md, description-only matching at forty-six skills. Storm (3B’s BM25+FTS5 search) indexes knowledge entries only; skills are not in the index. track-skill-usage.py counts invocations but not loading rates — there is no measurement of “was the right skill loaded given this prompt?” Without that measurement, any Liu-style degradation would be silent.
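If the gate ever graduates off the shelf, it could be a few lines. This is a hypothetical sketch — the function name, signature, and thresholds are illustrative, not the advisor's real interface; only the α ≥ 0.7 corruption cutoff comes from Tran's paper:

```python
def should_spawn_team(signal_score, threshold=5,
                      est_team_tokens=None, est_single_tokens=None,
                      context_corruption=0.0):
    """Hypothetical compute-equivalence gate for a task advisor.
    Signal density alone is no longer sufficient: multi-agent must
    beat a single agent at the SAME token budget, unless the context
    is degraded enough (alpha >= 0.7) for decomposition to recover
    information the single agent cannot."""
    if signal_score < threshold:
        return False
    if context_corruption >= 0.7:        # the narrow DPI exception
        return True
    if est_team_tokens is None or est_single_tokens is None:
        return False                     # no compute accounting, no team
    # spawn only when the team is not just buying extra compute
    return est_team_tokens <= est_single_tokens
```

The default-deny branches encode the flipped burden of proof: absent evidence, the answer is one agent with the full context.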

Bidirectional Flow gaps. 3B runs three verification surfaces — stop-verification-hook.py (binary), post-implementation-review-hook.py (scoring ≥8 across twelve categories), and the verification-before-completion skill (instruction-based). None of them separate process from outcome. None track false-positive rates. The twelve scoring categories in the post-implementation hook have not been audited for overlap, and the likely overlaps are visible by inspection: “code quality” and “maintainability”; “tests added” and “test coverage”; “error handling” and “edge cases”. Rosset’s Principle 1 — non-overlapping criteria, independent generation — is unmet.

Context compression in 3B is entirely passive. The PreCompact hook event exists but is unused. The tiering model at reference/context-tiering-model.md is static; files migrate between tiers manually. There is no self-directed compression mechanism, and by Kontonis’s Rule 1 no parallel retention channel either — the tiers are destructive moves, not dual-channel. The accuracy delta of the existing passive compaction has never been measured. Under Kontonis’s warning, that is the invisible-loss regime by definition.

Auto-memory graduates one direction: buffer → memory → knowledge → rules. The knowledge-staleness-hook.py uses a fixed ninety-day threshold; the first batch hits around April 23. No demotion path exists even when retrieval signals suggest an entry has stopped earning its place. Retrieval scoring is effectively similarity plus recency — no diversity term, no anti-popularity mechanism. Qiao’s cold-start vulnerability (new high-effort entries scoring zero on value reward until first use) is present in 3B too, and unmitigated.

Atomic Granularity gap. skill-design-patterns.md has Pattern 1 (the Phase 0 checklist) shipped, Patterns 2 and 3 reserved. No formal decomposition criterion exists; skill quality is judged by author taste. The /review-pr decomposition opportunity is visible but unexercised. The minimal-tool principle is accidentally already met (3B never added DSL helpers) but undocumented as a design commitment.

The shape of this audit is its own signal. Most gaps are missing measurements, not missing implementations. You can’t fix what you haven’t measured. The cheapest Pattern A countermove when these themes recur will be adding the measurement, not the implementation.


Why 3B is applying nothing this week

Reading six papers and applying zero of them is a choice, not an omission. Pattern A — theme saturation — treats a single paper’s claim as a hypothesis, not a mandate. A hypothesis graduates when one of two things happens. Either the next weekly cycle reinforces the same theme from an independent source (convergence), or observed friction in 3B itself validates it (empirical reinforcement). One signal alone stays shelved.

The reasoning is defensive against two failure modes. The first is hype absorption: implementing every paper’s claim produces a system shaped by the publication schedule rather than by its own constraints. Papers publishing simultaneously cluster on fashionable topics. A system that implements them all inherits the clustering and loses coherence over time. The second is sunk-cost inflation: an implementation committed to prematurely becomes expensive to reverse. 3B has more than three hundred sessions of friction-log evidence that architectural changes committed before the problem recurred tend to become load-bearing for accidental reasons — the rollback becomes harder than the original commit.

Pattern A is the filter. One paper claiming skill-retrieval degradation is a data point. A second paper from a different team confirming the same curve in a different domain is a signal. Observed in-session friction — Claude failing to load the right skill on a real task, captured in the friction log — is validation. Any two of those, and the claim graduates from shelf to implementation queue.

The operational discipline is mechanical. When a paper surfaces a theme, log it to the shelf with source, evidence strength, shelved upgrade targets, and a graduation criterion. Continue reading. Come back on the next weekly cycle. If the theme recurs, promote. If it doesn’t, leave it shelved and re-read skeptically on a six-month cadence — if the hype cycle passed, that itself is useful signal.
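A shelf entry, sketched as data. The field names and values below are an invented shape for illustration, not the actual forge-shelf-log.md format:

```python
# Hypothetical shape of one shelf-log entry
SHELF_ENTRY = {
    "theme": "Scale-Awareness",
    "sources": ["Tran & Kiela", "Liu et al."],
    "evidence": "medium",            # soft | medium | strong
    "shelved_targets": [
        "compute-equivalence gate in parallel-task-advisor.py",
        "retrieval-quality layer for skill routing",
    ],
    "graduation": "independent recurrence OR observed in-session friction",
    "review_by": "v1.8",             # lease expiry: re-read skeptically, don't hoard
}
```

The `review_by` field is what keeps the shelf a lease rather than a permanent store: an entry without an expiry is just a backlog wearing a different name.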

The cost of this discipline is that 3B moves slower than the paper feed. The benefit is that the parts of 3B that do get changed carry evidence, not just novelty. Six papers, zero applied, is what discipline looks like in a week where the feed is running hot.


What 3B is shelving and watching for

Each theme lands in projects/3b/reference/forge-shelf-log.md as an entry with source papers, evidence strength, shelved upgrade targets, and graduation criteria. Preview:

| Shelf theme | Source | Evidence | Key shelved target |
| --- | --- | --- | --- |
| Scale-Awareness | Tran + Liu | Medium | Compute-equivalence gate in parallel-task-advisor.py; retrieval-quality layer for skill routing |
| Unified Verification | Rosset | Strong | Candidate ADR-009 for process/outcome-split verification; false-positive tracking across hooks |
| Context Compression | Kontonis | Soft | PreCompact hook design; compression-aware tiering; measured accuracy delta for existing passive compaction |
| Memory Demotion | Qiao | Soft | Bidirectional auto-memory flow; diversity term in retrieval scoring; cold-start protection |
| Atomic Granularity | Ma | Medium | Pattern 2 in skill-design-patterns.md; /review-pr decomposition pilot |

Evidence strength is a calibration on how likely the theme is to recur. Strong means the paper is the third or fourth independent source arguing the same thing — Rosset’s process/outcome split echoes prior verification literature closely enough to already count as a small convergence. Medium means the paper makes a novel structural argument with strong empirics but the theme hasn’t recurred in 3B’s reading list yet. Soft means the paper is the first substantial source for the theme in this form; one more independent source would promote it to medium.

Graduation criterion (shared across themes): the shelved target is implemented when (a) a future weekly cycle surfaces another paper reinforcing the same theme with consistent direction, OR (b) observed-in-the-wild friction in 3B itself reinforces it. Either path produces the implementation — the paper alone does not.

Counter-criterion: if none of these themes recur by v1.8 — roughly four weekly cycles from now — re-read them skeptically. Maybe the hype cycle passed. Maybe the thesis was narrow to the paper’s specific domain. The shelf log is a lease, not a permanent store.

The v1.5 cycle closes with the shelf written, no 3B files restructured, and the next weekly cycle (v1.6) starting Monday with a fresh DAIR.AI batch. Six papers read, zero applied, one policy response, one shelf populated, six transferable concepts extracted into the graph. The measurement that matters is whether Pattern A is still running clean three cycles from now — shelved themes either graduating on evidence or quietly aging out because nobody else wrote about them again.
