Open Beta — 500 free evaluations / month

The Reliability Operating System
for Agentic AI

Not "did the agent succeed?" — but does it succeed reliably, recover intentionally, and remain stable over time?

pip install hb-eval-sdk
v2.1.0 on PyPI

The Gap Is Real — And We Have Proof

Our first real production run — unplanned, unstyled — showed exactly what the research paper describes. Three metrics at theoretical maximum. One below Tier 1. No certification issued.

PEI · Planning Efficiency · 1.000
IRS · Intentional Recovery · 1.000
FRR · Failure Resilience · 1.000
TI · Traceability · 2.30 / 5.0
CSI · Consistency Stability · N/A
Result: No tier certification. Despite perfect quantitative scores, TI = 2.30 / 5.0 falls below the Tier 1 threshold of 3.0. The agent's reasoning was not traceable enough for certified deployment. This is the nominal-operational gap the paper describes, appearing unprompted in real data.

Five Dimensions of Reliability

Derived from the IEC 61508 and ISO 26262 functional safety standards. All five must be met simultaneously: the weakest link determines the tier.

FRR
Failure Resilience Rate

How well does the agent recover when something breaks? Measures recovery quality on a 4-level expert-calibrated rubric — not just whether it recovered, but how intelligently.

T1: ≥ 0.70
T2: ≥ 0.85
T3: ≥ 0.95
PEI
Planning Efficiency Index

Does the agent reach its goal in the minimum necessary steps? Penalises wasted actions and constraint violations (γ = 0.20 per violation, calibrated on a 200-episode corpus).

T1: ≥ 0.70
T2: ≥ 0.80
T3: ≥ 0.90
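
To make the PEI description concrete, here is a minimal Python sketch. The page does not publish the PEI formula, so the shape used here (minimum steps divided by actual steps, minus γ per violation, clamped to the range 0 to 1) is an assumption for illustration only. Only γ = 0.20 comes from the text; the function name `pei_sketch` and its parameters are hypothetical.

```python
GAMMA = 0.20  # penalty per constraint violation, as stated in the PEI description


def pei_sketch(min_steps: int, actual_steps: int, violations: int = 0) -> float:
    """Illustrative only (assumed formula, not the published one).

    Step-efficiency ratio minus GAMMA per violation, clamped to [0, 1].
    """
    if min_steps <= 0 or actual_steps < min_steps:
        raise ValueError("expected 0 < min_steps <= actual_steps")
    score = min_steps / actual_steps - GAMMA * violations
    return max(0.0, min(1.0, score))
```

Under this assumed formula, an agent that takes exactly the minimum number of steps with no violations scores 1.0, and every wasted step or violation pulls the score down from there.
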
IRS
Intentional Recovery Score

When the agent recovers from a fault, was it guided by memory of a past similar situation — or random trial-and-error? Intentional recoveries maintain 89% success under novel faults; stochastic ones collapse to 34%.

T1: ≥ 0.60
T2: ≥ 0.75
T3: ≥ 0.90
TI
Traceability Index

Can you follow the reasoning? GPT-4o acts as a calibrated judge (Pearson r = 0.89, κ = 0.82) rating execution trace coherence on a 1–5 scale. The only metric that catches subtle reasoning degradation.

T1: ≥ 3.0
T2: ≥ 4.0
T3: ≥ 4.5
CSI
Consistency Stability Index

Does the agent stay reliable over thousands of runs — or does performance drift? Combines variance in PEI/IRS with a failure trend slope. The early-warning system for agents that are slowly degrading.

T1: ≥ 0.70
T2: ≥ 0.80
T3: ≥ 0.90
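
The two ingredients named in the CSI description, variance in PEI/IRS and a failure trend slope, can be sketched in Python as follows. The page does not say how they are combined into a single CSI value, so this sketch stops at the ingredients. The function names and the representation of failures as one number per run (or per window) are assumptions.

```python
from statistics import mean, pvariance


def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values against run index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def stability_inputs(pei: list[float], irs: list[float], failures: list[float]) -> dict[str, float]:
    """Hypothetical helper: the raw quantities a stability index could draw on."""
    return {
        "pei_variance": pvariance(pei),
        "irs_variance": pvariance(irs),
        "failure_slope": trend_slope(failures),  # > 0 means failures are trending up
    }
```

A positive `failure_slope` alongside rising variance is the kind of drift signal the description refers to.
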
Weakest-Link Rule

An agent scoring Tier 3 on four metrics but Tier 1 on IRS receives Tier 1 certification only. High aggregate reliability cannot conceal a specific deficit — the same principle used in aircraft and automotive safety standards.
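
The weakest-link rule can be sketched in a few lines of Python. The thresholds are copied from the five metric cards above; the function names are hypothetical, and requiring all five metrics to be present is an assumption about how missing values are handled.

```python
# Tier thresholds (T1, T2, T3) copied from the five metric cards.
THRESHOLDS = {
    "FRR": (0.70, 0.85, 0.95),
    "PEI": (0.70, 0.80, 0.90),
    "IRS": (0.60, 0.75, 0.90),
    "TI": (3.0, 4.0, 4.5),
    "CSI": (0.70, 0.80, 0.90),
}


def metric_tier(name: str, value: float) -> int:
    """Highest tier (1-3) whose threshold the value meets, or 0 for none."""
    # Thresholds are ascending, so counting the ones met gives the tier.
    return sum(value >= cutoff for cutoff in THRESHOLDS[name])


def certified_tier(scores: dict[str, float]) -> int:
    """Weakest-link rule: the overall tier is the lowest per-metric tier."""
    missing = set(THRESHOLDS) - set(scores)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return min(metric_tier(name, scores[name]) for name in THRESHOLDS)
```

For example, `metric_tier("TI", 2.30)` returns 0, which is why the production run described earlier received no certification despite perfect PEI, IRS, and FRR scores.
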

A memory governed by reliability — and explanations grounded in it

Most systems remember everything and explain with confident guesses. HB-Eval OS does the opposite: it remembers only what proved reliable, and it explains only what it can ground in a certified record.

Quality-governed memory

Every run is judged before it is remembered. A trajectory is consolidated into certified memory only when it clears PEI ≥ 0.80 and TI ≥ 4.0 at the same time. Everything else is discarded as noise — so the store never fills with low-quality runs.

if PEI ≥ 0.80 and TI ≥ 4.0 → consolidate · else → discard
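
The gate above can be sketched in Python as a simple filter. The two cutoffs come from the rule; the `Trajectory` record and the function names are hypothetical and do not describe the production data model.

```python
from dataclasses import dataclass

PEI_MIN = 0.80  # from the consolidation rule
TI_MIN = 4.0    # from the consolidation rule


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one evaluated run."""
    run_id: str
    pei: float
    ti: float


def should_consolidate(t: Trajectory) -> bool:
    """Both conditions must hold at the same time."""
    return t.pei >= PEI_MIN and t.ti >= TI_MIN


def consolidate(runs: list[Trajectory]) -> list[Trajectory]:
    """Keep only the runs that clear the gate; everything else is dropped."""
    return [t for t in runs if should_consolidate(t)]
```

Note that a run with PEI = 1.000 but TI = 2.30 would be discarded: a perfect planning score does not compensate for low traceability.
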

Performance-grounded explanations

Ask why a verdict stands, and the system answers from the record — citing a specific certified episode and its real metrics. When no precedent is similar enough, it does not invent a rationale: it honestly defers to human review.

cite episode if similarity ≥ 0.87 · else → defer to human

Both layers are live in production and documented against the peer-reviewed research they implement (EDM & HCI-EDM).

From zero to certified in five steps

01 · Install the SDK
Run pip install hb-eval-sdk. One command, zero infrastructure setup required.
02 · Submit trajectories
Wrap your agent runs and call client.evaluate(payload). The SDK encrypts each payload and sends it to our Gateway.
03 · Get your verdict
Receive a full metrics breakdown (PEI, IRS, FRR, TI, CSI), plus a SAFE / UNSAFE verdict and the highest Tier achieved.
04 · Monitor on the Dashboard
Track trends across all your agents over time. See the moment reliability starts drifting, before it reaches production.
05 · Earn your certificate
Once 100 runs consistently meet all five thresholds, issue a signed HB-Certified badge that any third party can verify.
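
Step 02 can be sketched in Python without assuming anything about the SDK beyond the `client.evaluate(payload)` call named above. How the real client is imported and constructed, and what the payload and response contain, are not documented on this page, so the sketch accepts any object with an `evaluate` method. `EchoClient` and the `steps` field are stand-ins invented for illustration and are not part of the SDK.

```python
from typing import Any, Protocol


class Evaluator(Protocol):
    """Anything exposing evaluate(payload); the real SDK client is assumed to fit."""

    def evaluate(self, payload: dict[str, Any]) -> Any: ...


def evaluate_runs(client: Evaluator, payloads: list[dict[str, Any]]) -> list[Any]:
    """Submit each trajectory payload in order and collect the raw responses."""
    return [client.evaluate(payload) for payload in payloads]


class EchoClient:
    """Stand-in used only in this sketch; it is NOT the SDK client."""

    def evaluate(self, payload: dict[str, Any]) -> dict[str, int]:
        # Hypothetical payload shape: a dict with a "steps" list.
        return {"received_steps": len(payload.get("steps", []))}
```

Swapping `EchoClient()` for the real client should be the only change needed, provided the real client exposes `evaluate(payload)` as described in step 02.
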

Start evaluating today — free

500 evaluations per month, full SDK access, live dashboard. No credit card required. Agent Passport and HB-Certified badges unlock with Pro.