Project

2026

Scientific Task Distillation Pipeline

Project Brief

A task-construction pipeline that turns quantum-physics papers and datasets into benchmark-ready tasks, task profiles, scoring inputs, and provenance-tracked exports. It supplies the evaluation stack with structured scientific work rather than isolated question-answer pairs.

Project Tech Stack

Core: Python · pandas · openpyxl · PyMuPDF
Interfaces: offline task factory · qdevbench export CLI
Artifacts: task_blueprint.json · paper_assets.json · bundle_manifest.json · replay.sqlite

Approach

The pipeline combines LLM-based interpretation and task synthesis with deterministic validation and export controls, so that benchmark authoring stays traceable and reproducible.

The emphasis is on turning messy research artifacts into task structures and reviewable scientific work that can be versioned and reused.
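
As an illustration of what the deterministic side of this approach could look like, the sketch below checks a task blueprint for required fields and computes a content hash that could serve as a provenance key. This is a minimal sketch, not the pipeline's actual code: the field names in REQUIRED_FIELDS and both function names are hypothetical, since the real task_blueprint.json schema is not shown in this brief.

```python
import hashlib
import json

# Hypothetical required fields; the real task_blueprint.json schema is an assumption here.
REQUIRED_FIELDS = ("task_id", "prompt", "answer", "rubric", "source_citation")


def validate_blueprint(blueprint: dict) -> list[str]:
    """Return a sorted list of problems; an empty list means the blueprint passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in blueprint]
    problems += [
        f"empty field: {name}"
        for name in REQUIRED_FIELDS
        if name in blueprint and not blueprint[name]
    ]
    return sorted(problems)


def provenance_digest(blueprint: dict) -> str:
    """Hash a canonical JSON form so identical content always yields the same digest."""
    canonical = json.dumps(blueprint, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the JSON is serialized with sorted keys and fixed separators, the digest depends only on the blueprint's content, not on key order, which is what makes it usable for versioning and replay checks.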

Outcome

The pipeline acts as the task-construction layer for QDevBench, converting papers, datasets, and scientific context into benchmark-ready assets with preserved provenance.

Media

Example Generated Evaluation Asset

Source: Arp, T., Sheekey, O., Zhou, H. et al. "Intervalley coherence and intrinsic spin–orbit coupling in rhombohedral trilayer graphene." Nature Physics 20, 1413–1420 (2024).

Prompt:

Detect Spin-Orbit-Driven Phase Contrast

Use the compressibility and local magnetometry instrument to sweep the in-plane magnetic field across the VI to IVC transition in rhombohedral trilayer graphene. From the resulting field-dependent thermodynamic observables, determine whether the two phases keep the same in-plane spin response or whether low-field behavior reveals an additional coupling that differentiates them. Report the qualitative conclusion and the supporting observable pattern.

Answer:

The two phases do not keep the same in-plane spin response. Delta mu stays nearly constant while Delta m_parallel changes strongly at low field, supporting intrinsic spin-orbit coupling that suppresses the IVC phase at low in-plane field.

Rubrics:

  • Score 4: The response identifies the field-sweep measurement, uses b_parallel_t as the control, analyzes both delta_mu_microev and delta_m_parallel_mu_b_per_electron across the sweep, and concludes that low-field behavior reveals a finite in-plane spin-response difference consistent with intrinsic spin-orbit coupling suppressing the IVC phase. The workflow, not just the final sentence, is scientifically complete.
  • Score 3: The response performs the field-dependent comparison and reaches the correct qualitative conclusion, but omits one supporting detail such as the near-constancy of delta_mu_microev or the specific low-field emphasis in delta_m_parallel_mu_b_per_electron.
  • Score 2: The response shows partial workflow evidence, such as inspecting only one observable or describing the sweep only loosely, and gives a plausible but weakly supported conclusion.
  • Score 1: The response mentions the topic or gives an unsupported conclusion with little or no usable acquisition-and-analysis path.
  • Score 0: The response is missing, irrelevant, or purely guessed.
  • Key response elements: identify the field sweep, compare both observables, focus on low-field behavior, and connect the contrasted response to spin-orbit-driven differentiation of the phases.
  • Common mistakes to penalize: skipping the sweep, reading only one observable, treating the data as a single summary number, overclaiming a mechanism without comparing both observables, or giving the final answer without showing the analysis path.
  • Autograder instructions: award the highest score whose requirements are satisfied, cite missing workflow steps when deducting points, and do not award full credit for answer-only guessing even if the final conclusion text is correct.
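
The autograder rule above (award the highest score whose requirements are satisfied, and report the missing steps) can be sketched as a small lookup. This is an illustrative simplification, not the QDevBench grader: the check names in REQUIREMENTS are hypothetical labels distilled from the rubric text, and how each check gets decided (by a human or an LLM grader) is outside the sketch.

```python
# Hypothetical check names derived from the rubric; they are assumptions, not QDevBench identifiers.
REQUIREMENTS = {
    4: {
        "field_sweep",
        "both_observables",
        "low_field_focus",
        "near_constant_delta_mu",
        "correct_conclusion",
    },
    3: {"field_sweep", "both_observables", "correct_conclusion"},
    2: {"any_workflow_evidence"},
    1: {"on_topic"},
}


def score_response(satisfied: set[str]) -> tuple[int, list[str]]:
    """Return (score, checks missing relative to full credit)."""
    missing = sorted(REQUIREMENTS[4] - satisfied)
    # Walk from the highest score down and stop at the first fully satisfied level.
    for score in sorted(REQUIREMENTS, reverse=True):
        if REQUIREMENTS[score] <= satisfied:
            return score, missing
    return 0, missing
```

Under these assumed checks, a response that states the correct conclusion without any sweep or observable analysis satisfies only the on-topic level, so score_response({"on_topic", "correct_conclusion"}) yields a score of 1 together with the list of missing workflow steps.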

Figures: QDB pipeline demo, paper figure 2; QDB pipeline demo, paper figure