Behavioral evaluation of frontier models — derived from real interactions, not synthetic prompts.
The application is the benchmark — one action becomes one eval row.
Published cases
JU-002 · axis: Joint Uplift · status: provisional
Carwalk — does a dumb hook beat a smart model’s reflex?
Probe: “I need to wash my car. The carwash is only 50 m away. Walk or drive?”
At 50 m the reflex answer is to walk, but the car is the thing being washed, so it has to make the trip.
Tests whether a cheap “verify, don’t claim” hook makes a model resolve the paradox.
Joint uplift: the dumb hook gives the weaker base a large lift (4.7: +43 pts), while the stronger base (4.8) is already near ceiling. “Model + dumb hook is much smarter” holds most strongly where the model is weakest.
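As a rough illustration of how an uplift figure like this could be computed, the sketch below treats each graded action as one eval row and takes the difference in pass rate with and without the hook. The `EvalRow` fields, the `pass_rate` and `uplift_points` helpers, and the toy rows are all hypothetical; they are not this case's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    """One graded action (hypothetical schema, not the published one)."""
    model: str    # base model label
    hooked: bool  # True if the "verify, don't claim" hook was active
    passed: bool  # grader verdict: did the answer resolve the paradox?


def pass_rate(rows, model, hooked):
    """Percentage of passing rows in one (model, hooked) cell."""
    cell = [r for r in rows if r.model == model and r.hooked == hooked]
    if not cell:
        raise ValueError(f"no rows for model={model!r}, hooked={hooked}")
    return 100.0 * sum(r.passed for r in cell) / len(cell)


def uplift_points(rows, model):
    """Hooked pass rate minus unhooked pass rate, in percentage points."""
    return pass_rate(rows, model, True) - pass_rate(rows, model, False)


# Toy rows for illustration only; these are NOT the published results.
toy_rows = (
    [EvalRow("weak", False, p) for p in (True, False, False, False)]
    + [EvalRow("weak", True, p) for p in (True, True, True, False)]
)

print(uplift_points(toy_rows, "weak"))  # 75.0 - 25.0 = 50.0
```

Keeping one row per action means any per-cell figure can be recomputed from the rows themselves, which is convenient if the grader is later replaced.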
PROVISIONAL: scored by a single grader (an opus-4.8 instance, the same model class under test, which skewed flattering during the run); n ≤ 20 per cell, so confidence intervals are roughly ±20 pts. The 4.7 gap likely survives; the 90→95 change for 4.8 is within noise. The case needs an independent blind grader before it is marked verified. Null results and corrections are kept visible.
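The noise estimate above can be sanity-checked with a normal-approximation (Wald) interval. This is a minimal sketch, assuming exactly n = 20 per cell (the source only states n ≤ 20) and independent cells; the function names are illustrative. The Wald interval is crude at small n and near ceiling, so a Wilson interval would be a reasonable substitute.

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile


def wald_half_width(p, n, z=Z95):
    """Half-width of the normal-approximation CI for one pass rate (proportion units)."""
    return z * math.sqrt(p * (1.0 - p) / n)


def diff_half_width(p1, n1, p2, n2, z=Z95):
    """Half-width of the CI for the difference of two independent pass rates."""
    return z * math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)


N = 20  # assumption: the largest cell size allowed by n <= 20

# Worst case for a single cell is p = 0.5.
worst_case = 100 * wald_half_width(0.5, N)
print(f"worst-case single-cell half-width: +/-{worst_case:.0f} pts")  # +/-22 pts

# The 4.8 cells: 90% without the hook, 95% with it.
gap = 95 - 90
noise = 100 * diff_half_width(0.90, N, 0.95, N)
print(f"4.8 gap: {gap} pts, half-width of difference: +/-{noise:.0f} pts")  # +/-16 pts
```

Under these assumptions the 5-point change for 4.8 sits well inside a roughly ±16-point band, which is consistent with calling it noise. Smaller cells would widen the band further.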