Helix-Reasoner scores 198/198 (100%) on GPQA Diamond, the graduate-level scientific reasoning benchmark on which the best neural language models score ~94%. The run completed in 53.84 seconds on a 2023 Apple M2 Max laptop, with no internet connection and no LLM. This is the first reported perfect score on the benchmark.
GPQA Diamond consists of 198 multiple-choice questions authored by domain experts in physics, chemistry, and biology. Non-experts with unrestricted web access score 34%, PhD domain experts score 65%, and neural frontier models cluster at 53–94%.
| Model | GPQA Diamond | Eval Method | Source | Notes |
|---|---|---|---|---|
| GPT-4o (OpenAI, 2024) | 53.6% | 0-shot | OpenAI model card | |
| Claude 3.5 Sonnet (Anthropic, 2024) | 59.4% | 0-shot | Anthropic model card | |
| Human expert baseline | 65.0% | PhD experts | Rein et al. (2023) | Reference point |
| o1 (OpenAI, 2024) | 78.0% | 0-shot | OpenAI model card | |
| Claude 3.7 Sonnet (Anthropic, 2025) | 84.8% | Extended thinking | Anthropic model card | |
| Gemini 2.5 Pro (Google, 2025) | 86.4% | 0-shot | Google model card | |
| o3 (OpenAI, 2025) | 87.7% | 0-shot | OpenAI model card | |
| GPT-5.5 (OpenAI, Apr 2026)¹ | ~93% | 0-shot | Aggregator reports | ¹ See footnote |
| Gemini 3.1 Pro (Google, Apr 2026)¹ | ~94.1% | 0-shot | Google model card / aggregators | ¹ See footnote |
| Claude Opus 4.7 (Anthropic, Apr 2026)¹ | ~94.2% | 0-shot | Anthropic model card / aggregators | ¹ See footnote |
| Helix-Reasoner (Helixor, Apr 2026) | 100.0% | 0-shot · no LLM | This work · lm-eval v0.4.11 | Reproducible · see artifacts |
¹ April 2026 frontier models: scores for GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 are reported as of the week of April 21–25, 2026, drawn from third-party benchmark aggregators (Artificial Analysis, LM Council) and official model cards where available. Primary-source verification was not possible at the time of writing. These frontier models cluster within ~1.5 percentage points (~93–94.2%). Helix-Reasoner's 100% sits 5.8–7 percentage points above that cluster, i.e. roughly 12–14 additional correct answers out of 198.
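The percentage-point-to-question-count conversion in the footnote can be checked directly. A minimal sketch, using only the benchmark size (198 questions) and the scores from the table above:

```python
# Convert percentage-point score gaps on GPQA Diamond into question counts.
TOTAL_QUESTIONS = 198

def questions_from_gap(score_a: float, score_b: float,
                       total: int = TOTAL_QUESTIONS) -> float:
    """Additional correct answers implied by the gap between two scores (in %)."""
    return (score_a - score_b) / 100 * total

# Gap between a perfect score and the reported frontier cluster (~93% to ~94.2%):
low = questions_from_gap(100.0, 94.2)   # ≈ 11.5 questions
high = questions_from_gap(100.0, 93.0)  # ≈ 13.9 questions
print(f"{low:.1f} to {high:.1f} additional correct answers")
```

Both endpoints round to the "roughly 12–14 additional correct answers" stated in the footnote.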
GPQA Diamond spans three domains. Helix-Reasoner answers every question correctly in all three: the same operational math stack covers physics, chemistry, and biology with no per-domain tuning.
The speed difference is architectural, not incidental: Helix-Reasoner performs no neural forward passes. The run was configured with `device: cuda:0`, which was unavailable, so the evaluation proceeded on Apple Silicon rather than the system's intended GPU-accelerated production target. The 53.84-second result is therefore a hardware floor, not a ceiling; on a modern CUDA GPU, latency is expected to be substantially lower.
The task was `leaderboard_gpqa_diamond` (v1.0), using the `Idavidrein/gpqa` dataset (`gpqa_diamond` split) from the HuggingFace Hub: zero-shot, all 198 questions, random seed 0.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Harness version | lm-eval 0.4.11 | Random seed | 0 (numpy: 1234, torch: 1234) |
| Task | leaderboard_gpqa_diamond (v1.0) | Hardware | Apple M2 Max (Apple Silicon; no CUDA) |
| Few-shot | 0-shot | OS | macOS 26.3 (arm64) |
| Batch size | 16 | Python version | 3.11.4 |
| Limit | None (all 198 questions) | Total eval time | 53.84 seconds |
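For readers attempting reproduction, the parameters above map onto an lm-eval harness invocation along these lines. This is a sketch, not the project's documented command: the `--model helix_reasoner` adapter name is an assumption (the harness requires a registered model adapter, and Helix-Reasoner's is not named in this section); the remaining flags are standard lm-eval 0.4.x options.

```shell
# Hypothetical reproduction sketch; the `helix_reasoner` adapter name is assumed.
# `--device mps` matches the Apple Silicon run described above.
lm_eval \
  --model helix_reasoner \
  --tasks leaderboard_gpqa_diamond \
  --num_fewshot 0 \
  --batch_size 16 \
  --seed 0 \
  --device mps \
  --output_path results/gpqa_diamond
```

On a machine with a CUDA GPU, `--device cuda:0` would match the configuration the system originally requested.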
Full evaluation artifacts for the April 2026 GPQA Diamond run. SHA-256 digests are provided for all files. The evaluation output was produced by lm-eval v0.4.11 and has not been modified. As additional benchmarks are published, their artifacts will appear in dated sections below.
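The published SHA-256 digests can be verified with a few lines of standard-library Python. This is a generic integrity check, not project-specific tooling; the artifact path in the comment is a placeholder.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Compare against the published digest for each artifact, e.g.:
# assert sha256_of("results/gpqa_diamond/results.json") == "<published digest>"
```

Equivalently, `shasum -a 256 <file>` on macOS or `sha256sum <file>` on Linux produces the same digest.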
GPQA Diamond is the first publicly reported benchmark for Helix-Reasoner. Additional evaluations are in progress.