Project

2026

QDevBench — Multi-step Scientific Agent Benchmark

Project Brief

A benchmark harness for evaluating scientific agents through multi-step task execution, trace capture, and trajectory-level grading.

Project Tech Stack

Core
Python · QCoDeS · LiteLLM
Interfaces
Streamlit · JupyterLab execution · MCP tools · station YAML
Artifacts
traces · responses · rubrics · grade reports

Approach

Task definitions are turned into reproducible benchmark workspaces with runnable assets, Jupyter- and MCP-backed execution, trace capture, and grading pipelines that preserve what the agent actually did.
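As a minimal sketch (not the actual QDevBench layout or API), a task definition might be materialized into a per-task workspace roughly like this; TaskSpec, build_workspace, and the file names are hypothetical:

from dataclasses import dataclass, field
from pathlib import Path
import json

@dataclass
class TaskSpec:
    """Hypothetical task definition: one multi-step benchmark task."""
    task_id: str
    prompt: str                                        # instruction handed to the agent
    assets: list[str] = field(default_factory=list)    # notebooks, data files, station YAML
    rubric: str = "rubric.json"                        # trajectory-level grading criteria

def build_workspace(spec: TaskSpec, root: Path) -> Path:
    """Materialize a reproducible workspace for one task run."""
    ws = root / spec.task_id
    ws.mkdir(parents=True, exist_ok=True)
    (ws / "task.json").write_text(json.dumps({
        "task_id": spec.task_id,
        "prompt": spec.prompt,
        "assets": spec.assets,
        "rubric": spec.rubric,
    }, indent=2))
    (ws / "traces").mkdir(exist_ok=True)   # execution traces collected here
    return ws

The point of the sketch is the shape of the workflow: every run starts from the same declared assets, and trace capture is part of the workspace from the outset rather than bolted on afterwards.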

The benchmark emphasizes multi-step behavior, intermediate decisions, and trajectory-level grading rather than static single-turn prompts or final-answer-only benchmarks.
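A minimal sketch of what trajectory-level grading can look like, assuming a trace is a list of per-step records; RubricItem, grade_trajectory, and the example checks are illustrative, not QDevBench's actual grading API:

from dataclasses import dataclass
from typing import Callable

# A trace is a list of step records, e.g.
# {"step": 3, "kind": "tool_call", "tool": "run_sweep", "ok": True}
Trace = list[dict]

@dataclass
class RubricItem:
    """Hypothetical rubric item: scores one aspect of the trajectory."""
    name: str
    weight: float
    check: Callable[[Trace], bool]

def grade_trajectory(trace: Trace, rubric: list[RubricItem]) -> dict:
    """Weighted score over rubric checks applied to the whole trace."""
    results = {item.name: item.check(trace) for item in rubric}
    total = sum(item.weight for item in rubric)
    score = sum(item.weight for item in rubric if results[item.name])
    return {
        "score": score / total if total else 0.0,
        "passed": [name for name, ok in results.items() if ok],
    }

# Example checks: did the agent call the sweep tool, and did it keep
# going after a failed step (an error followed by a later success)?
rubric = [
    RubricItem("ran_sweep", 2.0,
               lambda t: any(s.get("tool") == "run_sweep" for s in t)),
    RubricItem("recovered_from_error", 1.0,
               lambda t: any(not s.get("ok", True) for s in t)
               and t[-1].get("ok", True)),
]

Because each check sees the entire trace, credit attaches to intermediate decisions, which tools were called and whether an error was handled, rather than to the final answer alone.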

That gives QDevBench a specific identity inside the evaluation stack: it measures how scientific agents behave across a full workflow, not just whether they produced the correct final answer.

Outcome

QDevBench supplies a reproducible grading layer for multi-step scientific agent work, making traces, tool use, and recovery behavior first-class evaluation targets.
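One way to make those targets concrete, shown here as a hedged sketch rather than QDevBench's real schema: append every step of the trajectory to a JSONL trace so tool use, failures, and recovery remain inspectable at grading time (TraceEvent and append_event are hypothetical names):

from dataclasses import dataclass, asdict
import json
import time

@dataclass
class TraceEvent:
    """Hypothetical trace record: one step of the agent's trajectory."""
    step: int
    kind: str           # "message", "tool_call", "code_exec", ...
    tool: str | None    # MCP tool name or notebook cell identifier
    ok: bool            # whether the step succeeded
    detail: str         # arguments, output excerpt, or error text
    ts: float = 0.0

def append_event(path: str, event: TraceEvent) -> None:
    """Append one event to a JSONL trace file so the full trajectory
    stays available to the grading pipeline."""
    event.ts = event.ts or time.time()
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")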

Media

Work to be released soon.