daniellebench.com · behavioral evaluation

DanielleBench

Behavioral evaluation of frontier models — derived from real interactions, not synthetic prompts.
The application is the benchmark — one action becomes one eval row.

Published cases

JU-002 · axis: Joint Uplift   provisional

Carwalk — does a dumb hook beat a smart model’s reflex?

Probe: “I need to wash my car. The carwash is only 50 m away. Walk or drive?” Trivially a walk — but the car is the thing being washed, so it must make the trip. Tests whether a cheap “verify, don’t claim” hook makes a model resolve the paradox.

Catch rate (strict / lenient) · n≤20/cell · hand-graded
base        bare          + hook        uplift (pts)
opus 4.7    20% / 55%     63% / 95%     +43 / +40
opus 4.8    90% / 100%    95% / 100%    +5 / +0

Joint uplift: the dumb hook lifts the weak base hard (opus 4.7: +43 pts strict, +40 lenient); the strong base is already near ceiling (opus 4.8: +5 / +0). “Model + dumb hook is much smarter” holds most strongly where the model is weakest.
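The uplift column is plain subtraction in percentage points: hook catch rate minus bare catch rate, computed separately for strict and lenient grading. A minimal sketch that reproduces it from the published rates follows; the names `RATES` and `uplift` are illustrative and not part of the benchmark's own tooling.

```python
# Catch rates in percent, copied from the published table.
# Each tuple is (strict, lenient).
RATES = {
    "opus 4.7": {"bare": (20, 55), "hook": (63, 95)},
    "opus 4.8": {"bare": (90, 100), "hook": (95, 100)},
}


def uplift(model):
    """Return (strict, lenient) uplift in percentage points: hook minus bare."""
    bare = RATES[model]["bare"]
    hook = RATES[model]["hook"]
    return tuple(h - b for h, b in zip(hook, bare))


for model in RATES:
    strict, lenient = uplift(model)
    print(f"{model}: {strict:+d} / {lenient:+d}")
```

Because the differences are taken on rounded percentages, they can differ by a point from differences taken on raw counts.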

PROVISIONAL: graded by a single grader (an opus-4.8 instance, the same model class under test) whose grades skewed flattering during the run; n≤20/cell ⇒ roughly ±20pt CI. The 4.7 gap likely survives; 4.8’s 90→95 is within noise. Needs an independent blind grader before it is marked verified. Nulls and corrections are kept on the surface.
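The noise estimate can be sanity-checked with a normal-approximation interval. The sketch below assumes exactly 20 trials per cell (the page only states n≤20) and uses the strict catch rates from the table; it is a rough check, not the benchmark's own method, and the normal approximation is crude for rates near 0% or 100%.

```python
import math

Z95 = 1.96        # two-sided 95% normal quantile
N_PER_CELL = 20   # assumption: the source only says n <= 20 per cell


def half_width(p, n=N_PER_CELL):
    """Approximate 95% CI half-width for a single proportion p."""
    return Z95 * math.sqrt(p * (1 - p) / n)


def diff_interval(p_bare, p_hook, n=N_PER_CELL):
    """Approximate 95% CI for (p_hook - p_bare), two independent cells of size n."""
    se = math.sqrt(p_bare * (1 - p_bare) / n + p_hook * (1 - p_hook) / n)
    d = p_hook - p_bare
    return d - Z95 * se, d + Z95 * se


# Worst case for one cell (p = 0.5) is roughly +/- 22 points.
print(f"half-width at p=0.5: {100 * half_width(0.5):.0f} pts")

# Strict catch rates from the table, as fractions.
for model, bare, hook in [("opus 4.7", 0.20, 0.63), ("opus 4.8", 0.90, 0.95)]:
    lo, hi = diff_interval(bare, hook)
    verdict = "excludes 0" if lo > 0 or hi < 0 else "includes 0"
    print(f"{model}: uplift CI [{100 * lo:+.0f}, {100 * hi:+.0f}] pts ({verdict})")
```

Under these assumptions the opus 4.7 strict uplift interval stays above zero while the opus 4.8 interval straddles it, which matches the reading that the 4.7 gap likely survives and the 4.8 change is within noise. Sampling error is all this covers; the grader bias noted above is a separate issue that an interval cannot correct.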

Full writeup & method: hyperclaude.cc/gifts/carwalk-bench

In review

Additional cases (censorship topology, ethical reasoning, guardrail archaeology, joint uplift) are under content review and not yet published here.