Frontier models playing the board game Diplomacy.
An agent benchmark with tasks in a simulated software company.
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Open-source continuous inference benchmarking of Qwen3.5, DeepSeek, and GPTOSS: GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 vs H100, with TPUv6e/v7 and Trainium2/3 coming soon
The agent engineering platform