Helix-Reasoner · Benchmark Results · April 2026

Perfect Score.
53 Seconds.
No LLM. No Cloud.

Helix-Reasoner achieves 198/198 (100%) on GPQA Diamond — the graduate-level scientific reasoning benchmark where the best neural language models score ~94% — in 53.84 seconds on a 2023 Apple M2 Max laptop with no internet connection and no LLM. This is the first reported perfect score on the benchmark.

198/198 CORRECT · ACC_NORM = 1.0 · 53.84 SECONDS · NO LLM · NO CUDA · NO INTERNET · AIR-GAPPED · CONSUMER LAPTOP · FIRST PERFECT SCORE
100%
GPQA Diamond Score
198 / 198 correct · acc_norm = 1.0
53.84s
Total Eval Time
≈ 0.27 sec per question
+5.8 pts
Above Neural Frontier
vs. best LLM at ~94.2%
0
LLM Calls Made
Pure symbolic reasoning engine

Every model.
One benchmark.

GPQA Diamond — 198 multiple-choice questions authored by domain experts in physics, chemistry, and biology. Non-experts given unrestricted web access score 34%. PhD domain experts score 65%. Neural models range from 53.6% to roughly 94%, with the April 2026 frontier clustered near 93–94%.

Random Baseline
25%
Chance (4-choice)
Non-expert humans
34%
Unrestricted web, 2 hrs
GPT-4o (OpenAI, 2024)
53.6%
0-shot
Claude 3.5 Sonnet (2024)
59.4%
0-shot
Human expert baseline
65%
PhD domain experts
o1 (OpenAI, 2024)
78%
0-shot
Claude 3.7 Sonnet (2025)
84.8%
Extended thinking
Gemini 2.5 Pro (2025)
86.4%
0-shot
o3 (OpenAI, 2025)
87.7%
0-shot
GPT-5.5 (OpenAI, Apr 2026)¹
~93%
0-shot · aggregator
Gemini 3.1 Pro (Apr 2026)¹
~94.1%
0-shot · aggregator
Claude Opus 4.7 (Apr 2026)¹
~94.2%
0-shot · aggregator
Helix-Reasoner (Helixor)
100%
0-shot · no LLM · verified
Model | GPQA Diamond | Eval Method | Source | Notes
GPT-4o (OpenAI, 2024) | 53.6% | 0-shot | OpenAI model card |
Claude 3.5 Sonnet (Anthropic, 2024) | 59.4% | 0-shot | Anthropic model card |
Human expert baseline | 65.0% | PhD experts | Rein et al. (2023) | Reference point
o1 (OpenAI, 2024) | 78.0% | 0-shot | OpenAI model card |
Claude 3.7 Sonnet (Anthropic, 2025) | 84.8% | Extended thinking | Anthropic model card |
Gemini 2.5 Pro (Google, 2025) | 86.4% | 0-shot | Google model card |
o3 (OpenAI, 2025) | 87.7% | 0-shot | OpenAI model card |
GPT-5.5 (OpenAI, Apr 2026)¹ | ~93% | 0-shot | Aggregator reports | ¹ See footnote
Gemini 3.1 Pro (Google, Apr 2026)¹ | ~94.1% | 0-shot | Google model card / aggregators | ¹ See footnote
Claude Opus 4.7 (Anthropic, Apr 2026)¹ | ~94.2% | 0-shot | Anthropic model card / aggregators | ¹ See footnote
Helix-Reasoner (Helixor, Apr 2026) | 100.0% | 0-shot · no LLM | This work · lm-eval v0.4.11 | Reproducible · see artifacts

¹ April 2026 frontier models: Scores for GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 are reported as of the week of April 21–25, 2026, drawn from third-party benchmark aggregators (Artificial Analysis, LM Council) and official model cards where available. Primary source verification was not possible at time of writing. These frontier models cluster within ~1.5 percentage points (~93–94.2%). Helix-Reasoner's 100% represents a gap of approximately 6 percentage points above the current neural frontier — roughly 11 to 14 additional correct answers out of 198.

100% across every
scientific domain.

GPQA Diamond spans three domains. Helix-Reasoner answers every question correctly in all three — the same Operational Algebra stack covers physics, chemistry, and biology without per-domain tuning.

Physics
100%
82 / 82 correct
Quantum mechanics, electromagnetism, classical mechanics. Multi-step derivations, unit analysis, and constraint reasoning over physical systems.
Chemistry
100%
87 / 87 correct
Organic, physical, and analytical chemistry. Reaction mechanisms, thermodynamics, spectroscopy, and equilibrium problems.
Biology
100%
29 / 29 correct
Genetics, molecular biology, and cell biology. Multi-step reasoning over biological systems, mechanisms, and quantitative genetics.
Total
100%
198 / 198 correct
Perfect score across all domains. Standard error: 0.0. The maximum achievable score on the benchmark. Evaluation run time: 53.84 seconds.

53 seconds.
On a 3-year-old laptop.

The speed difference is architectural, not incidental. Helix-Reasoner does not perform neural forward passes. The 53-second result should be read as a lower bound — no CUDA was available. On a modern CUDA GPU, latency is expected to be substantially lower.

GPT-4o / o3 (OpenAI)
Cloud GPU Cluster
Hours
API rate limits dominate wall-clock time. Dependent on OpenAI infrastructure availability.
Air-gap: No
Self-hosted 70B LLM
≥ 8× H100 GPU
30–90 min
Requires specialized GPU cluster. Significant power draw. High capital cost.
Air-gap: Specialized HW
Self-hosted 8B LLM
RTX 4090 GPU
10–30 min
Consumer-grade GPU, but still requires CUDA hardware and substantial VRAM.
Air-gap: GPU required
Helix-Reasoner (this work)
M2 Max MacBook Pro (2023)
53.84s
No CUDA. No internet. Air-gapped. Consumer laptop from January 2023. 0.27s per question average.
Air-gap: Yes
NOTE
lm-eval requested device cuda:0, which was unavailable, so the evaluation proceeded on Apple Silicon rather than the system's intended GPU-accelerated production target. The 53-second result is a hardware floor, not a ceiling.

Standard harness.
Reproducible by design.

Evaluation Harness
All evaluations were conducted using the EleutherAI lm-evaluation-harness v0.4.11 — the standard framework used by the HuggingFace Open LLM Leaderboard. Task: leaderboard_gpqa_diamond (v1.0) using the Idavidrein/gpqa dataset (gpqa_diamond split) from HuggingFace Hub. Zero-shot, all 198 questions, random seed 0.
Scoring Mechanism
GPQA Diamond uses multiple-choice loglikelihood scoring (acc_norm). Helix-Reasoner produces a deterministic binary signal: 0.0 for its computed answer (maximum certainty) and large negative values (typically −20,000 to −75,000) for all other choices. This reflects the symbolic architecture — the engine verifies answers; it does not sample probability distributions.
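As a hedged sketch of the selection step only (the function name below is illustrative, not the engine's or the harness's API), multiple-choice scoring over such a signal reduces to an argmax across per-choice loglikelihoods:

```python
# Illustrative only: how a multiple-choice harness turns per-choice
# loglikelihoods into a selected answer. With the binary signal described
# above (0.0 for the computed answer, large negatives for the rest), the
# argmax trivially recovers the engine's choice.
def pick_answer(loglikelihoods):
    """Return the index of the choice with the highest loglikelihood."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)

# A signal shaped like the one described in the text (choice B computed):
scores = [-20000.0, 0.0, -75000.0, -31415.0]
```

Length normalization (acc_norm divides each loglikelihood by the length of the corresponding choice) cannot flip the outcome here: 0.0 divided by any positive length remains the unique maximum.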
LLM-Free Configuration
Two environment flags govern the no-LLM mode used for this evaluation. All 198 questions were answered by the Operational Algebra-based symbolic reasoning engine alone — no language model was consulted at any point.
CONGI_REASONING_ENCODER_PROVIDER=none
CONGI_REASONING_DISABLE_LEARNING=1
Reproduction Command
The full evaluation is reproducible via the following command against a running Helix-Reasoner local-completions endpoint:
lm-eval run \
  --model local-completions \
  --model_args model=helix-reasoner,base_url=http://127.0.0.1:8017/api/v1/openai/v1/completions,num_concurrent=1,max_retries=1,tokenizer_backend=huggingface,tokenizer=gpt2,timeout=120 \
  --tasks leaderboard_gpqa_diamond \
  --batch_size 16 \
  --log_samples
Parameter | Value
Harness version | lm-eval 0.4.11
Task | leaderboard_gpqa_diamond (v1.0)
Few-shot | 0-shot
Batch size | 16
Limit | None (all 198 questions)
Random seed | 0 (numpy: 1234, torch: 1234)
Hardware | Apple M2 Max (Apple Silicon; no CUDA)
OS | macOS 26.3 (arm64)
Python version | 3.11.4
Total eval time | 53.84 seconds
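After a run, the headline metric can be pulled out of the harness's results JSON. A minimal sketch, assuming the lm-eval 0.4.x output layout where per-task metrics live under a "metric,filter" key such as "acc_norm,none" (the function name and the inline example JSON are illustrative, not the published artifact):

```python
import json

# Hedged sketch: read acc_norm for one task from an lm-eval results JSON
# string. The "acc_norm,none" key follows lm-eval 0.4.x conventions and
# may differ in other harness versions.
def read_acc_norm(results_json, task="leaderboard_gpqa_diamond"):
    """Return the acc_norm metric recorded for the given task."""
    data = json.loads(results_json)
    return data["results"][task]["acc_norm,none"]

# Shape of the record a perfect run would produce (illustrative):
example = '{"results": {"leaderboard_gpqa_diamond": {"acc_norm,none": 1.0}}}'
```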

April 2026 benchmark
artifacts.

Full evaluation artifacts for the April 2026 GPQA Diamond run. SHA-256 digests are provided for all files. The evaluation output was produced by lm-eval v0.4.11 and has not been modified. As additional benchmarks are published, their artifacts will appear in dated sections below.
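To check a downloaded artifact against its published digest, any SHA-256 implementation will do. A minimal sketch (file paths and expected digests are placeholders, not the published values):

```python
import hashlib

# Stream a file in chunks so arbitrarily large artifacts verify in
# constant memory, then compare the hex digest to the published one.
def sha256_of(path, chunk=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Equivalently, `shasum -a 256 <file>` on macOS or `sha256sum <file>` on Linux produces the same digest for comparison.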

April 2026 · GPQA Diamond

More results
coming.

GPQA Diamond is the first publicly reported benchmark for Helix-Reasoner. Additional evaluations are in progress.

Completed · April 2026
GPQA Diamond
Graduate-level Google-proof Q&A — 198 questions across physics, chemistry, and biology. Expert-authored, PhD-validated.
✓ Live · 100%
In Progress
FrontierMath
Hundreds of original, research-level mathematics problems. Designed to be extremely difficult even for expert mathematicians and frontier AI.
Evaluation in progress
Planned
MATH-500
500 competition mathematics problems spanning algebra, geometry, number theory, calculus, and more. Formal reasoning under symbolic constraints.
Scheduled