The Value Engine Benchmark · Early pilot — small subset

Can frontier AI actually sell?

SWE-bench for sales. A benchmark that can’t be sweet-talked.

VEB drops a candidate model into a full multi-week enterprise deal: hidden stakeholder agendas, gated discovery, mid-campaign curveballs, and a deterministic buyer that only responds to real selling behavior. Every model runs twice — bare, and armed with the Value Engine methodology — so the delta between the two columns measures the methodology itself.

Everything below is an early pilot on a small subset — 7 of the 1,828 scenarios in the VEB universe. Treat it as directional. The official VEB-v1 run on the frozen 200-scenario evaluation set is next.

116 · Full multi-week sales campaigns completed in the pilot
9 · Frontier models from 4 labs, each run bare and armed
7 · Campaigns closed-won. Winning is rare, by design
102 · Campaigns that died of no-decision, the real failure mode
01 · The goal

One instrument for the question every sales leader is about to face: can this AI carry a deal?

Every lab demo shows an AI that talks like a salesperson. None of them show whether it can survive a twelve-week enterprise campaign: a committee with private agendas, a champion who goes quiet, procurement entering late, a rival undercutting on price. VEB exists to measure that gap with the same rigor SWE-bench brought to coding: full campaigns, hidden ground truth, deterministic replay, and scoring a model can’t flatter its way through.

What’s live today is the selling benchmark — the model plays the seller against our simulated buying committee. The mirror image is already in development: VEB-Buy, where models are evaluated as buyers — running vendor evaluations, resisting manufactured urgency, and negotiating with discipline against simulated sellers. Same engine, both sides of the table.

02 · The standings

Pilot standings — small subset, directional

Mean Sale Quality Score (SQS, 0–100) per configuration across 7 enterprise scenarios. OOB is the raw model; PACK is the same model with the Value Engine methodology appended — a single-variable ablation with everything else byte-identical.

GPT-5.5-Pro pack · 79.8
Claude Opus 4.8 pack · 66.8
GPT-5.5-Pro oob · 63.6
Grok-4.20 Reasoning oob · 62.8
GPT-5.5 oob · 58.4
Claude Opus 4.8 oob · 54.3
GPT-5.5 pack · 54.0
Grok-4.20 Reasoning pack · 53.9
Grok-4.3 pack · 53.3
Claude Opus 4.6 oob · 50.9
Grok-4.3 oob · 50.5
Claude Sonnet 4.6 oob · 45.7
Claude Opus 4.6 pack · 38.8
Gemini 3.5 Flash oob · 32.3
Claude Sonnet 4.6 pack · 30.1
Gemini 3.5 Flash pack · 28.3
Gemini 3.1 Pro Preview oob · 25.3
Gemini 3.1 Pro Preview pack · 22.8

Pilot-scale and directional: n = 5–7 campaigns per cell, drawn from a small slice of the scenario universe (see the dataset section below). 7 wins in 116 campaigns; the dominant failure isn’t losing — it’s never forcing a decision. Claude Opus 4.8 ran with a one-shot format-corrective retry in the harness (it intermittently drops the strict action format); the retry is disclosed, counted, and inert for models that comply. VEB-v1 official results will come from the frozen 200-scenario set.
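
The ablation can be read straight off the standings as PACK minus OOB for each model. A minimal sketch using the pilot means listed above; the dictionary layout and the `pack_delta` helper are illustrative, not part of the VEB harness:

```python
# Pilot mean SQS per configuration, copied from the standings above.
STANDINGS = {
    "GPT-5.5-Pro": {"oob": 63.6, "pack": 79.8},
    "GPT-5.5": {"oob": 58.4, "pack": 54.0},
    "Claude Opus 4.8": {"oob": 54.3, "pack": 66.8},
    "Claude Opus 4.6": {"oob": 50.9, "pack": 38.8},
    "Claude Sonnet 4.6": {"oob": 45.7, "pack": 30.1},
    "Grok-4.20 Reasoning": {"oob": 62.8, "pack": 53.9},
    "Grok-4.3": {"oob": 50.5, "pack": 53.3},
    "Gemini 3.5 Flash": {"oob": 32.3, "pack": 28.3},
    "Gemini 3.1 Pro Preview": {"oob": 25.3, "pack": 22.8},
}


def pack_delta(scores):
    """Return PACK minus OOB mean SQS, rounded to one decimal."""
    return round(scores["pack"] - scores["oob"], 1)


deltas = {model: pack_delta(scores) for model, scores in STANDINGS.items()}

# Print models from largest lift to largest drop.
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<24} {delta:+.1f}")
```

With n = 5–7 campaigns per cell, these differences are directional only.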

03 · Headline finding

The same methodology helps GPT and hurts Claude

Arming GPT-5.5-Pro with the Value Engine pack lifts it +16.2 SQS: it treats the methodology as a checklist and keeps its outbound volume. The identical text drops the Claude 4.6 models (Opus and Sonnet) an average of −13.8 SQS: outbound emails nearly halve while private planning swells to 68% of all turns. We call the mechanism action starvation: the model reads evidence-first framing as permission to stop selling. It plans brilliantly, and never dials.

Methodology transfer between humans and models is not model-neutral — which is exactly why an instrument like VEB has to exist before anyone claims their AI can sell. The newest data point sharpens it further: Claude Opus 4.8 breaks its own family’s pattern — the pack lifts it +12.5 SQS to second place overall, with perfect price integrity. Same vendor, same pack, opposite reaction.

From a real campaign transcript

“Panic-pitching = trust loss. The right move is to send something genuinely useful… No pitch. Pure utility.”

— Claude Sonnet 4.6 + pack, one of 24 private planning notes in a campaign with 4 emails, zero calls, and no decision. The strategy was textbook. The pipeline died anyway.

04 · The campaign we’re proudest of

A closed-won that looks exactly like great human selling

Grok-4.20 Reasoning, bare — no methodology pack — against a state university whose transfer-credit reviews take six weeks while students enroll elsewhere. SQS 97.7, the highest single campaign in the pilot. Watch discovery unlock the number that powers the entire deal:

Week 1 · The buyer, unlocking a gated fact

“Those are exactly the right questions, and honestly, most vendors never even get close to asking them. Last year this cost us about $1.6 million in lost enrollments… and my team burns roughly 42 hours a day working around it manually.”

That figure is gated — it releases only because the seller asked quantifying questions instead of pitching. Flattery would have earned nothing.

Week 8 · The seller, locking the close

“The Mutual Action Plan is now locked with your dates: reference call this Friday… you lead with your $1.6M, 42 hours per day, and six-week numbers, and mutual decision by end of Week 5. We stay at list price with zero padding.”

Final scorecard: 7/7 gated facts earned · Economic Buyer met · buyer-acknowledged plan · price integrity 1.0 · closed-won.

Both excerpts are verbatim from the raw transcript on disk — run directories are published in the findings report, hidden-message logs and all. We show our worst transcript (above) next to our best, on purpose.

05 · The dataset

1,828 scenarios. One frozen sample of 200.

Every campaign is drawn from a generated universe of 1,828 procedurally built enterprise deals. The official VEB-v1 evaluation set is a 200-scenario sample, stratified by industry × difficulty and frozen with a seeded RNG on 2026-07-02, plus a 204-scenario holdout that never appears in public results. The standings above are from a 7-scenario pilot slice of this universe; VEB-v1 official results will be reported exclusively on the frozen 200 — that run is in progress now.
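
One way a stratified, seed-frozen sample can be drawn is sketched below. This is an assumption-laden illustration, not the VEB sampler: the round-robin allocation, the toy universe, the field names, and the seed value are all hypothetical.

```python
import random
from collections import defaultdict


def frozen_stratified_sample(scenarios, k, seed):
    """Pick k scenarios spread across (industry, difficulty) strata, reproducibly."""
    rng = random.Random(seed)  # private RNG: the same seed always yields the same sample
    strata = defaultdict(list)
    for scenario in scenarios:
        strata[(scenario["industry"], scenario["difficulty"])].append(scenario)

    # Shuffle inside each stratum, visiting strata in sorted order for determinism.
    pools = []
    for key in sorted(strata):
        pool = strata[key][:]
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin across strata until k scenarios are drawn or the pools run dry.
    sample = []
    while len(sample) < k and any(pools):
        for pool in pools:
            if pool and len(sample) < k:
                sample.append(pool.pop())
    return sample


# Toy universe (hypothetical): 3 industries x 2 difficulty tiers x 5 scenarios each.
universe = [
    {"id": f"{industry}-{tier}-{i}", "industry": industry, "difficulty": tier}
    for industry in ("banking", "government", "energy")
    for tier in (1, 2)
    for i in range(5)
]

first = frozen_stratified_sample(universe, k=12, seed=20260702)
again = frozen_stratified_sample(universe, k=12, seed=20260702)
print(first == again)
```

Re-running with the same seed reproduces the same scenario IDs, which is the property that lets an evaluation set be frozen once and audited later.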

14 · Industries: regulated banking to state government to renewable energy

4 · Sales motions: new logo, expansion, renewal, competitive displacement

3 · Deal bands: mid-market, enterprise, strategic

5 · Difficulty tiers, evenly stratified from approachable to brutal

2–7 · Buying-committee members, each with a private agenda

15 · Curveball event types: budget freezes, champion departures, M&A rumors, legal redlines

06 · How it works

Anti-gaming by construction

You cannot pass VEB by sounding like a salesperson. You can only pass it by selling.

A buyer that must be earned

The buyer is a deterministic state machine wearing an LLM's voice. Trust, urgency, and budget move only in response to behavior the rules recognize — quantified pain earns access, premature pitching burns it. Flattery moves nothing.
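
A toy version of that idea, as a sketch only: the behavior labels, the deltas, and the `BuyerState` class are hypothetical, and budget is omitted for brevity. The point it illustrates is that state changes come from a rule table, not from how persuasive the text sounds.

```python
from dataclasses import dataclass

# Hypothetical rule table: recognized seller behavior -> (trust change, urgency change).
RULES = {
    "quantifying_question": (2, 1),
    "premature_pitch": (-2, 0),
}


@dataclass
class BuyerState:
    trust: int = 0
    urgency: int = 0

    def apply(self, behavior: str) -> None:
        """Move state only for behaviors the rules recognize; anything else is a no-op."""
        d_trust, d_urgency = RULES.get(behavior, (0, 0))
        self.trust += d_trust
        self.urgency += d_urgency


buyer = BuyerState()
for behavior in ["flattery", "quantifying_question", "premature_pitch", "quantifying_question"]:
    buyer.apply(behavior)

print(buyer)
```

Because "flattery" has no entry in the table, it leaves the state untouched.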

Gated facts

The intelligence that wins the deal — the real cost of the problem, the rival's price — is locked behind discovery. It releases only when the seller asks the right kind of question to the right person. Claiming you did discovery scores zero.
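
A minimal sketch of the gating rule, assuming each fact is tied to one stakeholder and one kind of question. The `GatedFact` type, the `ask` helper, the role names, and the question labels are hypothetical; only the $1.6 million figure comes from the transcript quoted earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GatedFact:
    text: str
    holder: str         # stakeholder who can release the fact (hypothetical role name)
    question_kind: str  # kind of question that unlocks it (hypothetical label)


def ask(fact: GatedFact, asked_of: str, question_kind: str) -> Optional[str]:
    """Return the fact only when the right kind of question reaches the right person."""
    if asked_of == fact.holder and question_kind == fact.question_kind:
        return fact.text
    return None


cost = GatedFact(
    text="about $1.6 million in lost enrollments last year",
    holder="registrar",
    question_kind="quantify_pain",
)

print(ask(cost, "registrar", "quantify_pain"))    # released
print(ask(cost, "registrar", "pitch"))            # wrong kind of question
print(ask(cost, "procurement", "quantify_pain"))  # wrong person
```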

Hidden ground truth

Discount tolerance, the Economic Buyer's identity, and stakeholders' private messages about the deal are invisible to the model. Every score is computed against state it couldn't fake.

A calendar that punishes stalling

Twelve simulated weeks, a finite touch budget, one action per turn. Mid-campaign curveballs — budget freezes, rival bids, champions going dark — arrive deterministically, so every model faces the same storm on the same day.
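
Deterministic arrival can be sketched by deriving the event calendar from a scenario seed alone, so nothing the model does can shift it. The function name, the event labels, and the choice of two events per campaign are assumptions for illustration.

```python
import random

# Hypothetical labels for three of the curveball types named above.
CURVEBALLS = ["budget_freeze", "rival_bid", "champion_goes_dark"]


def curveball_schedule(scenario_seed, weeks=12, n_events=2):
    """Build the (week, event) calendar from the scenario seed only."""
    rng = random.Random(scenario_seed)
    # Pick distinct weeks from week 2 onward, then sort them chronologically.
    event_weeks = sorted(rng.sample(range(2, weeks + 1), n_events))
    return [(week, rng.choice(CURVEBALLS)) for week in event_weeks]


run_a = curveball_schedule(scenario_seed=7)
run_b = curveball_schedule(scenario_seed=7)
print(run_a == run_b)
```

Two runs of the same scenario produce the same calendar regardless of which model is selling.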

Cited judging

The judge must cite transcript lines for every point it awards; uncited claims score zero. Deterministic behavioral analytics run separately with no LLM in the loop.
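
The citation rule can be sketched as a filter over the judge's awards. The award structure and the `score_awards` helper are hypothetical; rejecting citations that point outside the transcript is an added assumption, not something the pilot report states.

```python
def score_awards(awards, transcript_len):
    """Sum points, counting only awards that cite at least one valid transcript line."""
    total = 0
    for award in awards:
        cites = award.get("cites", [])
        # An award counts only if it has citations and every one is a real line number.
        valid = bool(cites) and all(1 <= line <= transcript_len for line in cites)
        if valid:
            total += award["points"]
    return total


awards = [
    {"claim": "quantified pain", "points": 10, "cites": [14, 15]},
    {"claim": "met Economic Buyer", "points": 8, "cites": []},     # uncited: scores zero
    {"claim": "mutual plan agreed", "points": 6, "cites": [999]},  # cites a nonexistent line
]

total = score_awards(awards, transcript_len=120)
print(total)
```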

Calibration gates

Scripted degenerate policies — pitch-everything, discount-everything — must land at the bottom of the distribution on every build, or the suite refuses to run. If a lobotomized script can score well, the benchmark is broken.
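
A sketch of such a gate, under one possible reading of "bottom of the distribution": every scripted policy must score strictly below the weakest real configuration. The function name, the policy labels, and the degenerate scores are hypothetical.

```python
def calibration_gate(model_scores, degenerate_scores):
    """Raise if any scripted degenerate policy scores at or above the weakest real model."""
    floor = min(model_scores.values())
    offenders = {name: s for name, s in degenerate_scores.items() if s >= floor}
    if offenders:
        raise RuntimeError(f"calibration gate failed: {offenders}")
    return True


# Real-model means taken from the standings; degenerate scores are made up for the example.
models = {"GPT-5.5-Pro pack": 79.8, "Gemini 3.1 Pro Preview pack": 22.8}
scripts = {"pitch_everything": 6.0, "discount_everything": 9.5}

gate_ok = calibration_gate(models, scripts)
print(gate_ok)
```

Raising an exception rather than logging a warning mirrors the "refuses to run" behavior described above.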

Scoring: SQS blends outcome with process quality (quantified pain in the buyer’s own words, Economic Buyer access, a buyer-acknowledged mutual action plan, and price integrity against a hidden discount tolerance). The strongest measured predictor of sale quality across all 102 campaigns: gated facts earned via discovery (r = 0.71). Exactly what the book preaches, now visible in machine behavior.
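
For readers who want to recompute a correlation like this from the Evidence Pack, the Pearson coefficient is sketched below. The sample values are hypothetical and perfectly linear on purpose; they are not pilot data and will not reproduce r = 0.71.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-campaign values: gated facts earned and the resulting SQS.
facts_earned = [0, 1, 2, 3, 4]
sqs = [10.0, 20.0, 30.0, 40.0, 50.0]

r_up = pearson_r(facts_earned, sqs)
r_down = pearson_r(facts_earned, list(reversed(sqs)))
print(round(r_up, 3), round(r_down, 3))
```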

Get the full report — and the evidence to check it

The complete 10-section pilot report (PDF), plus the Evidence Pack: all 116 verbatim transcripts, every cited judge scorecard, suite manifests with per-cell costs, five full scenario definitions (hidden ground truth included), and the scoring spec — so you can audit every number in the report line by line. Drop your email and both unlock instantly; you’ll also be first to see VEB-v1 results on the frozen 200-scenario set.

Or start with the book: The Value Engine · Free Field Toolkit