Notice: This resource is provided by a third-party author. Please review the code with AI tools or manually before use to ensure security and compatibility.
Python · suyoumo/ClawProBench

ClawProBench

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
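As a rough illustration of what "deterministic grading" combined with "repeated-trial reliability" can mean in practice, the sketch below grades the same task several times and summarizes how consistently it passes. This is not ClawProBench's actual code or API: `grade`, `run_trials`, `summarize`, and the toy agent are hypothetical names chosen for this example, and exact-match grading is an assumed stand-in for whatever checks the harness really uses.

```python
import itertools
from typing import Callable, Dict, List


def grade(output: str, expected: str) -> bool:
    """Hypothetical deterministic grader: the same output always gets the same verdict."""
    return output.strip() == expected.strip()


def run_trials(agent: Callable[[str], str], prompt: str, expected: str, trials: int) -> List[bool]:
    """Run the agent `trials` times on one prompt and grade each attempt."""
    return [grade(agent(prompt), expected) for _ in range(trials)]


def summarize(results: List[bool]) -> Dict[str, object]:
    """Summarize repeated trials: average pass rate plus strict and lenient views."""
    n = len(results)
    passed = sum(results)
    return {
        "trials": n,
        "pass_rate": passed / n if n else 0.0,
        "all_passed": n > 0 and passed == n,  # reliable only if every trial passes
        "any_passed": passed > 0,             # at least one trial succeeded
    }


def make_flaky_agent() -> Callable[[str], str]:
    """Toy agent (assumption for the demo) that answers wrongly on every third call."""
    answers = itertools.cycle(["4", "4", "5", "4"])
    return lambda prompt: next(answers)


results = run_trials(make_flaky_agent(), "What is 2 + 2?", "4", trials=4)
summary = summarize(results)
print(summary)
```

Reporting `all_passed` alongside `pass_rate` separates an agent that usually succeeds from one that succeeds every time, which is the distinction repeated trials are meant to expose.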

69.9/100
800 · Forks: 52

Similar Projects

claw-eval

48

Claw-Eval is an evaluation harness for LLMs acting as agents. All tasks are verified by humans.

Python · 683

opencompass

86

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python · 7.1K

nexent

88

Nexent is a zero-code platform for auto-generating production-grade AI agents using Harness Engineering principles — unified tools, skills, memory, and orchestration with built-in constraints, feedback loops, and control planes.

Python · 5.3K

opensquilla

82

OpenSquilla — Token-Efficient AI Agent with same budget, higher intelligence density

Python · 4.8K