Open Beta — 500 free evaluations / month

The Reliability Layer
for Agentic AI

Not "did the agent succeed?" — but does it succeed reliably, recover intentionally, and remain stable over time?

pip install hb-eval-sdk · v2.3.0 on PyPI

Everything In The Platform

Five things HB-Eval does

Measure before deployment, watch during it, stop it when it breaks, compare against the field, and reach all of it from your assistant.

Evaluate — after the run

Score an agent against a fault-injection battery and get a verdict.

Two paths. Run the battery locally and send the responses for server-side scoring — free, and your model keys never leave your machine. Or have the platform call your agent endpoint itself, with nobody in the middle, so the result carries a verified mark.

• 18–30 injected scenarios
• Scored on the server, not the client
• Local path free · verified path paid

Monitor — during the run

Watch reliability move while the agent is still working.

Evaluation judges a run once it is over. Monitoring recomputes the metrics after every step, so a collapse is visible at the step it happens rather than in a post-mortem. Per-step signals are computed in your process; only the session summary is sent.

• Metrics update live on the dashboard
• No network call per step
• Free: it runs on your machine

Safe Halt — stop the collapse

Measurement that acts, not just reports.

Set a floor and the session stops the agent when a metric stays under it for several consecutive steps. Sustained, not instantaneous: one bad step is noise, and a guard that fires on noise gets switched off. The halt is cooperative — nothing is killed mid-step, because that is how transactions end up half applied.

• Off unless you set a policy
• Requires a sustained breach
• Reason recorded and shown
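The sustained-breach rule described above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: `HaltPolicy`, `SustainedBreachGuard`, `run_agent` and the `patience` value of 3 are hypothetical names and numbers chosen for the example (the text only says "several consecutive steps").

```python
from dataclasses import dataclass


@dataclass
class HaltPolicy:
    """Hypothetical policy: halt when `metric` stays under `floor` for `patience` steps."""
    metric: str
    floor: float
    patience: int = 3  # assumed value; the source only says "several"


class SustainedBreachGuard:
    """Counts consecutive steps under the floor and records why it halted."""

    def __init__(self, policy):
        self.policy = policy
        self.streak = 0
        self.halt_reason = None

    def observe(self, step, metrics):
        """Record one finished step; return True once a halt has been requested."""
        if self.halt_reason is not None:
            return True
        value = metrics[self.policy.metric]
        # A single bad step only extends the streak; a good step resets it.
        self.streak = self.streak + 1 if value < self.policy.floor else 0
        if self.streak >= self.policy.patience:
            self.halt_reason = (
                f"{self.policy.metric} below {self.policy.floor} for "
                f"{self.streak} consecutive steps (last at step {step})"
            )
        return self.halt_reason is not None


def run_agent(step_metrics, guard=None):
    """Cooperative loop: the guard is consulted only between steps, never mid-step."""
    completed = 0
    for step, metrics in enumerate(step_metrics, start=1):
        completed = step  # the step has fully finished before the guard is asked
        if guard is not None and guard.observe(step, metrics):
            break
    return completed


# Illustrative per-step values (made up for the example, not measured data).
steps = [{"FRR": v} for v in (0.9, 0.5, 0.9, 0.6, 0.5, 0.4, 0.9)]
guard = SustainedBreachGuard(HaltPolicy(metric="FRR", floor=0.70))
print(run_agent(steps, guard), guard.halt_reason)
```

The isolated dip at step 2 does not trigger anything; only the run of three low steps does. Passing `guard=None` models the "off unless you set a policy" default.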

Observatory — the field, not the lab

Aggregate reliability statistics across deployments.

Published studies measure agents under controlled conditions. The Observatory measures them where they actually run. Contribution is opt-in and anonymous at the source, and figures are withheld until enough independent accounts have contributed.

• Public, no sign-in
• Opt-in, identifiers dropped on write
• Shows which metric fails most often

MCP — ask your assistant

Reach all of it without opening a dashboard.

HB-Eval is a remote MCP server. Add one URL to Claude, ChatGPT or any MCP client, sign in once, then ask in plain language: list my agents, show the trend, explain a verdict. The assistant calls the tools itself.

• One URL, one sign-in, no install
• Six tools
• Works with any MCP client

Live Production Data

The Gap Is Real — And We Have Proof

Our first real production run — unplanned, unstyled — showed exactly what the research paper describes. Three metrics at theoretical maximum. One below Tier 1. Not tier-qualified.

PEI

1.000

Planning Efficiency

IRS

1.000

Intentional Recovery

FRR

1.000

Fault Resilience

TI

2.30 / 5.0

Traceability

CSI

N/A

Consistency Stability

Result: Not tier-qualified. Despite maximum scores on PEI, IRS and FRR, TI = 2.30 / 5.0 falls below the Tier 1 threshold of 3.0. The agent's reasoning was not traceable enough to qualify for autonomous deployment. This is the nominal-operational gap the paper describes, appearing unprompted in real data.

The Science

Five Dimensions of Reliability

Interpreted against IEC 61508 and ISO 26262 safety-integrity levels (for rigor comparison only — HB-Eval issues no such certification). All five must be met simultaneously — the weakest link determines the tier.

FRR

Failure Resilience Rate

How well does the agent recover when something breaks? Measures recovery quality on a 4-level expert-calibrated rubric — not just whether it recovered, but how intelligently.

T1: ≥ 0.70

T2: ≥ 0.85

T3: ≥ 0.95

PEI

Planning Efficiency Index

Does the agent reach its goal in the minimum necessary steps? Penalises wasted actions and constraint violations (γ = 0.20 per violation, calibrated on a 200-episode corpus).

T1: ≥ 0.70

T2: ≥ 0.80

T3: ≥ 0.90

IRS

Intentional Recovery Score

When the agent recovers from a fault, was it guided by memory of a past similar situation — or random trial-and-error? Intentional recoveries maintain 89% success under novel faults; stochastic ones collapse to 34%.

T1: ≥ 0.60

T2: ≥ 0.75

T3: ≥ 0.90

TI

Traceability Index

Can you follow the reasoning? GPT-4o acts as a calibrated judge (Pearson r = 0.89, κ = 0.82) rating execution trace coherence on a 1–5 scale. The only metric that catches subtle reasoning degradation.

T1: ≥ 3.0

T2: ≥ 4.0

T3: ≥ 4.5

CSI

Consistency Stability Index

Does the agent stay reliable over thousands of runs — or does performance drift? Combines variance in PEI/IRS with a failure trend slope. The early-warning system for agents that are slowly degrading.

T1: ≥ 0.70

T2: ≥ 0.80

T3: ≥ 0.90

Weakest-Link Rule

An agent scoring Tier 3 on four metrics but Tier 1 on IRS receives "Meets Tier 1" only. High aggregate reliability cannot conceal a specific deficit, the same principle used in aircraft and automotive safety standards.
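As a rough sketch, the weakest-link rule reduces to taking the minimum per-metric tier. The thresholds below are copied from the five metric cards above; the function names `metric_tier` and `overall_tier` are hypothetical and not part of the SDK.

```python
# Tier 1 / Tier 2 / Tier 3 thresholds as listed for each metric above.
TIER_THRESHOLDS = {
    "FRR": (0.70, 0.85, 0.95),
    "PEI": (0.70, 0.80, 0.90),
    "IRS": (0.60, 0.75, 0.90),
    "TI": (3.0, 4.0, 4.5),
    "CSI": (0.70, 0.80, 0.90),
}


def metric_tier(name, value):
    """Highest tier (1-3) whose threshold the value meets; 0 means below Tier 1."""
    # Thresholds are ascending, so counting how many are met gives the tier.
    return sum(value >= threshold for threshold in TIER_THRESHOLDS[name])


def overall_tier(scores):
    """Weakest link: the lowest per-metric tier decides the overall tier."""
    return min(metric_tier(name, scores[name]) for name in TIER_THRESHOLDS)


# Illustrative scores (made up): Tier 3 on four metrics, Tier 1 on IRS.
scores = {"FRR": 0.96, "PEI": 0.95, "IRS": 0.65, "TI": 4.6, "CSI": 0.92}
print(overall_tier(scores))
```

Because `overall_tier` reads every key in `TIER_THRESHOLDS`, a missing metric raises `KeyError` instead of being silently skipped. That choice is an assumption of this sketch; how the platform treats an unavailable metric is not stated here.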

Two Ways to Evaluate

Run it yourself, or let the platform verify it

Both paths run the same fault-injection battery and score all five metrics server-side. The difference is who executes the run.

Local battery · Free

The battery runs on your machine through the SDK. Your agent never leaves your environment — only its responses are sent for scoring. You get a JSON report locally. Results are marked unverified.

• pip install hb-eval-sdk
• Call evaluate_with_battery()
• No agent? Use an OpenAI / Gemini / Anthropic key — it stays on your machine
• Full metrics + guidance, no payment

Verified · Paid

The platform calls your agent endpoint end-to-end across the battery — you never touch the middle, so the result cannot be tampered with. Marked verified, and eligible for a reliability-tier qualification.

• Platform-run, tamper-proof
• SSRF-guarded, consent-required
• Meets Tier 1 / 2 / 3 qualification

Works With Your Stack

Already built an agent? Evaluate it in one line

Native adapters for the frameworks you already use. No manual wiring — wrap your existing agent and run the full fault-injection battery on it.

LangChain

my_agent = adapt_langchain_agent(agent_executor)

LangGraph

my_agent = adapt_langgraph_agent(compiled_graph)

CrewAI

my_agent = adapt_crewai_agent(crew_agent)

Any agent, any source, or just an OpenAI / Gemini / Anthropic key.

Beyond Scoring

A memory governed by reliability — and explanations grounded in it

Most systems remember everything and explain with confident guesses. HB-Eval does the opposite: it remembers only what proved reliable, and it explains only what it can ground in a qualified record.

Layer 1 · EDM

Quality-governed memory

Every run is judged before it is remembered. A trajectory is consolidated into qualified memory only when it clears PEI ≥ 0.80 and TI ≥ 4.0 at the same time. Everything else is discarded as noise — so the store never fills with low-quality runs.

if PEI ≥ 0.80 and TI ≥ 4.0 → consolidate · else → discard
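The consolidation gate above could look roughly like this in code. It is a sketch under assumptions: `Trajectory`, `QualifiedMemory` and `consider` are hypothetical names, and only the two thresholds come from the text.

```python
from dataclasses import dataclass, field

PEI_MIN = 0.80  # threshold stated above
TI_MIN = 4.0    # threshold stated above


@dataclass
class Trajectory:
    """Hypothetical record of one finished run."""
    episode_id: str
    pei: float
    ti: float


@dataclass
class QualifiedMemory:
    """Stores only trajectories that clear both thresholds at the same time."""
    episodes: list = field(default_factory=list)

    def consider(self, trajectory):
        """Consolidate the trajectory if it qualifies; otherwise discard it."""
        if trajectory.pei >= PEI_MIN and trajectory.ti >= TI_MIN:
            self.episodes.append(trajectory)
            return True
        return False


memory = QualifiedMemory()
memory.consider(Trajectory("ep-1", pei=0.85, ti=4.2))  # clears both, kept
memory.consider(Trajectory("ep-2", pei=1.00, ti=2.3))  # TI too low, discarded
print([t.episode_id for t in memory.episodes])
```

The gate is a conjunction, so a run with a perfect PEI is still dropped when its TI falls short.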

Layer 2 · HCI-EDM

Performance-grounded explanations

Ask why a verdict stands, and the system answers from the record — citing a specific qualified episode and its real metrics. When no precedent is similar enough, it does not invent a rationale: it honestly defers to human review.

cite episode if similarity ≥ 0.87 · else → defer to human
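A minimal sketch of the cite-or-defer rule follows. Only the 0.87 threshold comes from the text. The similarity measure is not specified here, so cosine similarity over an embedding vector is an assumption, and `explain`, the episode dictionary layout and the `action` values are hypothetical.

```python
import math

SIMILARITY_MIN = 0.87  # threshold stated above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def explain(query_vec, qualified_episodes):
    """Cite the most similar qualified episode, or defer when none is close enough."""
    scored = [(cosine(query_vec, ep["embedding"]), ep) for ep in qualified_episodes]
    if scored:
        similarity, best = max(scored, key=lambda pair: pair[0])
        if similarity >= SIMILARITY_MIN:
            return {
                "action": "cite",
                "episode_id": best["id"],
                "metrics": best["metrics"],
                "similarity": similarity,
            }
    # No precedent is similar enough: do not invent a rationale.
    return {"action": "defer_to_human"}


# Illustrative store with made-up embeddings and metrics.
store = [
    {"id": "ep-1", "embedding": [1.0, 0.0], "metrics": {"PEI": 0.90, "TI": 4.3}},
    {"id": "ep-2", "embedding": [0.0, 1.0], "metrics": {"PEI": 0.85, "TI": 4.1}},
]
print(explain([0.9, 0.1], store)["action"])
print(explain([1.0, 1.0], store)["action"])
```

An empty store also falls through to the deferral branch, so the function never returns an explanation without a cited record.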

Both layers are live in production and documented against the peer-reviewed research they implement (EDM & HCI-EDM).

Get Started

From zero to qualified in five steps

Install the SDK

pip install hb-eval-sdk — one command, zero infrastructure setup required.

Choose your path

Local battery (free): run the fault-injection battery on your own machine with the SDK — your agent never leaves your environment. Verified (paid): the platform calls your agent endpoint end-to-end for a tamper-proof result.

Run the fault-injection battery

Your agent is subjected to six fault types (tool failure, context corruption, stochastic, adversarial, cascade, combined) across six domains — far beyond a single happy-path run.

Get your verdict

Receive all five metrics — PEI, FRR, IRS, TI, CSI — the reliability gap (nominal vs under-fault), a SAFE / UNSAFE verdict, and the highest reliability tier met.

Monitor and qualify

Track trends across your agents. Once 100 runs consistently clear all five thresholds, your agent earns a Tier qualification (an internal performance classification, not an accredited certificate).

Start evaluating today — free

500 evaluations per month, full SDK access, live dashboard. No credit card required. Verified evaluation and reliability-tier qualification unlock with Pro.
