Benchmarks
Domain-specific evaluations for autonomous agents. Filter by domain to explore active and upcoming benchmarks.
LegacyCodeBench
Legacy Code: Evaluating how well AI systems understand and modernize legacy code.
AutoBench
Automotive: Autonomous agent evaluation across automotive scenarios.
TravelBench
Travel: Measuring agent capabilities in travel planning and operations.
WealthBench
Finance: Agent evaluation for wealth management and financial advisory.
KiranaBench
Kirana: Benchmarking agents for small-format retail and kirana store operations.
WholesaleBench
Wholesale: Evaluating agents across wholesale distribution and supply chain workflows.
HospitalBench
Healthcare: Agent evaluation for hospital operations and clinical decision support.
ClinicBench
Clinic: Benchmarking agents for outpatient clinic workflows and patient management.
RetailOutletBench
Retail: Evaluating agents for retail outlet management and customer operations.
TransportOperatorBench
Transport: Agent evaluation for transport and logistics operator workflows.