Live Production Data
The Gap Is Real — And We Have Proof
Our first real production run — unplanned, unstyled — showed exactly what the research paper describes. Three metrics at theoretical maximum. One below Tier 1. No certification issued.
The Science
Five Dimensions of Reliability
Derived from the IEC 61508 and ISO 26262 functional safety standards. All five must be met simultaneously: the weakest dimension determines the tier.
How well does the agent recover when something breaks? Measures recovery quality on a 4-level expert-calibrated rubric — not just whether it recovered, but how intelligently.
Does the agent reach its goal in the minimum necessary steps? Penalises wasted actions and constraint violations (γ = 0.20 per violation, calibrated on a 200-episode corpus).
When the agent recovers from a fault, was it guided by memory of a past similar situation — or random trial-and-error? Intentional recoveries maintain 89% success under novel faults; stochastic ones collapse to 34%.
Can you follow the reasoning? GPT-4o acts as a calibrated judge (Pearson r = 0.89, κ = 0.82) rating execution trace coherence on a 1–5 scale. The only metric that catches subtle reasoning degradation.
Does the agent stay reliable over thousands of runs, or does performance drift? Combines the variance of PEI and IRS with the slope of the failure trend. The early-warning system for agents that are slowly degrading.
An agent scoring Tier 3 on four metrics but Tier 1 on IRS receives Tier 1 certification only. High aggregate reliability cannot conceal a specific deficit — the same principle used in aircraft and automotive safety standards.
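The weakest-link rule can be sketched in a few lines of Python. This is an illustration only, not the HB-Eval SDK: the function name `certified_tier`, the placeholder keys `dimension_a` and `dimension_b`, and the use of plain integers for tiers are assumptions made for the example.

```python
from typing import Mapping


def certified_tier(metric_tiers: Mapping[str, int]) -> int:
    """Return the overall tier under a weakest-link rule.

    Hypothetical helper: the overall tier is the lowest tier reached
    on any single dimension, so a high average cannot hide a deficit.
    """
    if not metric_tiers:
        raise ValueError("at least one metric tier is required")
    return min(metric_tiers.values())


# Tier 3 on four dimensions, Tier 1 on IRS.
# "dimension_a" and "dimension_b" are placeholder names for this sketch.
tiers = {"PEI": 3, "TI": 3, "dimension_a": 3, "dimension_b": 3, "IRS": 1}
overall = certified_tier(tiers)
print(f"Certified tier: {overall}")
```

Taking the minimum rather than the mean is the design choice that matters: one weak dimension caps the result however strong the others are.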
Beyond Scoring
A memory governed by reliability — and explanations grounded in it
Most systems remember everything and explain with confident guesses. HB-Eval OS does the opposite: it remembers only what proved reliable, and it explains only what it can ground in a certified record.
Layer 1 · EDM
Quality-governed memory
Every run is judged before it is remembered. A trajectory is consolidated into certified memory only when it clears PEI ≥ 0.80 and TI ≥ 4.0 at the same time. Everything else is discarded as noise — so the store never fills with low-quality runs.
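A minimal sketch of that admission gate, assuming each run carries a PEI and a TI score. The thresholds come from the text above; the `Trajectory` class and the function names are hypothetical and do not represent the production implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List

PEI_MIN = 0.80  # threshold stated in the text
TI_MIN = 4.0    # threshold stated in the text


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one evaluated run."""
    run_id: str
    pei: float
    ti: float


def should_consolidate(t: Trajectory) -> bool:
    # Both conditions must hold at the same time.
    return t.pei >= PEI_MIN and t.ti >= TI_MIN


def consolidate(runs: Iterable[Trajectory]) -> List[Trajectory]:
    """Keep only runs that clear both thresholds; drop the rest."""
    return [t for t in runs if should_consolidate(t)]


runs = [
    Trajectory("run-1", pei=0.92, ti=4.5),  # clears both: kept
    Trajectory("run-2", pei=0.95, ti=3.8),  # TI too low: dropped
    Trajectory("run-3", pei=0.70, ti=4.9),  # PEI too low: dropped
]
kept = consolidate(runs)
print([t.run_id for t in kept])
```

A run that excels on one score but misses the other is still discarded, which is what keeps low-quality trajectories out of the store.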
Layer 2 · HCI-EDM
Performance-grounded explanations
Ask why a verdict stands, and the system answers from the record — citing a specific certified episode and its real metrics. When no precedent is similar enough, it does not invent a rationale: it honestly defers to human review.
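The cite-or-defer behaviour can be illustrated with a short sketch. Everything specific here is an assumption made for the example: the feature-vector representation of an episode, cosine similarity as the measure, the 0.90 cutoff, and all class and function names. The source does not say how similarity is computed.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CertifiedEpisode:
    """Hypothetical certified-memory entry with its recorded metrics."""
    episode_id: str
    features: Tuple[float, ...]
    pei: float
    ti: float


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def explain(query: Sequence[float],
            memory: Sequence[CertifiedEpisode],
            min_similarity: float = 0.90) -> str:
    """Cite the closest certified episode, or defer if none is close enough."""
    best: Optional[CertifiedEpisode] = None
    best_sim = -1.0
    for episode in memory:
        sim = cosine(query, episode.features)
        if sim > best_sim:
            best, best_sim = episode, sim
    if best is None or best_sim < min_similarity:
        return "No sufficiently similar certified episode; deferring to human review."
    return (f"Grounded in episode {best.episode_id} "
            f"(PEI={best.pei:.2f}, TI={best.ti:.1f}, similarity={best_sim:.2f}).")


memory = [CertifiedEpisode("ep-1", (1.0, 0.0), pei=0.90, ti=4.5)]
print(explain((1.0, 0.0), memory))  # close precedent: cites ep-1
print(explain((0.0, 1.0), memory))  # no close precedent: defers
```

The explicit deferral branch is the point of the sketch: when nothing in memory is similar enough, the function returns a deferral message instead of composing a rationale.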
Both layers are live in production and documented against the peer-reviewed research they implement (EDM & HCI-EDM).
Get Started
From zero to certified in five steps
Start evaluating today — free
500 evaluations per month, full SDK access, live dashboard. No credit card required. Agent Passport and HB-Certified badges unlock with Pro.