Kalmantic Labs — AI Research and Tools for Production Systems

Benchmarks

Domain-specific evaluations for autonomous agents, grouped by domain and covering both active and upcoming benchmarks.

LegacyCodeBench

Legacy Code

Evaluating how well AI systems understand and modernize legacy code.

120 Tasks

8 Models

0.623 Top Score

AutoBench

Automotive

Autonomous agent evaluation across automotive scenarios.

48 Tasks

12 Models

0.847 Top Score

TravelBench

Travel

Measuring agent capabilities in travel planning and operations.

64 Tasks

10 Models

0.751 Top Score

WealthBench

Finance

Agent evaluation for wealth management and financial advisory.

Coming Soon

KiranaBench

Kirana

Benchmarking agents for small-format retail and kirana store operations.

Coming Soon

WholesaleBench

Wholesale

Evaluating agents across wholesale distribution and supply chain workflows.

Coming Soon

HospitalBench

Healthcare

Agent evaluation for hospital operations and clinical decision support.

Coming Soon

ClinicBench

Clinic

Benchmarking agents for outpatient clinic workflows and patient management.

Coming Soon

RetailOutletBench

Retail

Evaluating agents for retail outlet management and customer operations.

Coming Soon

TransportOperatorBench

Transport

Agent evaluation for transport and logistics operator workflows.

Coming Soon