Project
QDevBench — Multi-step Scientific Agent Benchmark
Project Brief
A benchmark harness for evaluating scientific agents through multi-step task execution, trace capture, and trajectory-level grading.
Approach
Task definitions are turned into reproducible benchmark workspaces with runnable assets, Jupyter- and MCP-backed execution, trace capture, and grading pipelines that preserve what the agent actually did.
The benchmark emphasizes multi-step behavior, intermediate decisions, and trajectory-level grading rather than static single-turn prompts or final-answer-only benchmarks.
That gives QDevBench a specific identity inside the evaluation stack: it measures how scientific agents behave across a full workflow, not just whether they produced the correct final answer.
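The trajectory-level grading idea above can be sketched minimally. This is a hypothetical illustration, not QDevBench's actual API (the work is unreleased): `TraceStep`, `Trajectory`, and `grade_trajectory` are invented names, and the scoring rule (fraction of required tools used, plus exact-match on the final answer) is a stand-in for whatever graders the harness actually runs.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One recorded agent action: the tool invoked, its arguments, and the observation."""
    tool: str
    arguments: dict
    observation: str

@dataclass
class Trajectory:
    """A full multi-step run: the ordered trace plus the agent's final answer."""
    steps: list = field(default_factory=list)
    final_answer: str = ""

def grade_trajectory(traj: Trajectory, required_tools: list, expected_answer: str) -> dict:
    """Grade the process (which required tools appeared in the trace) separately
    from the outcome (whether the final answer matches), so intermediate
    decisions count even when the final answer is wrong."""
    used = {step.tool for step in traj.steps}
    process = sum(t in used for t in required_tools) / max(len(required_tools), 1)
    outcome = 1.0 if traj.final_answer.strip() == expected_answer.strip() else 0.0
    return {"process": process, "outcome": outcome}
```

A trace where the agent called the right tools but flubbed the answer would score `{"process": 1.0, "outcome": 0.0}`, which is exactly the distinction a final-answer-only benchmark cannot make.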
Outcome
QDevBench supplies a reproducible grading layer for multi-step scientific agent work, making traces, tool use, and recovery behavior first-class evaluation targets.
Media
Work soon to be released.