Helix-Reasoner · Benchmark Results · April 2026

Perfect Score.
53 Seconds.
No LLM. No Cloud.

Helix-Reasoner achieves 198/198 (100%) on GPQA Diamond — the graduate-level scientific reasoning benchmark where the best neural language models score ~94% — in 53.84 seconds on a 2023 Apple M2 Max laptop with no internet connection and no LLM. This is the first reported perfect score on the benchmark.

198/198 CORRECT · ACC_NORM = 1.0 · 53.84 SECONDS · NO LLM · NO CUDA · NO INTERNET · AIR-GAPPED · CONSUMER LAPTOP · FIRST PERFECT SCORE
100%
GPQA Diamond Score
198 / 198 correct · acc_norm = 1.0
53.84s
Total Eval Time
≈ 0.27 sec per question
+5.8 pts
Above Neural Frontier
vs. best LLM at ~94.2%
0
LLM Calls Made
Pure symbolic reasoning engine

Every model.
One benchmark.

GPQA Diamond — 198 multiple-choice questions authored by domain experts in physics, chemistry, and biology. Non-experts given unrestricted web access score 34%. PhD domain experts score 65%. Neural models range from 53.6% to roughly 94%, with the April 2026 frontier clustered near 93–94%.

Random Baseline
25%
Chance (4-choice)
Non-expert humans
34%
Unrestricted web, 2 hrs
GPT-4o (OpenAI, 2024)
53.6%
0-shot
Claude 3.5 Sonnet (2024)
59.4%
0-shot
Human expert baseline
65%
PhD domain experts
o1 (OpenAI, 2024)
78%
0-shot
Claude 3.7 Sonnet (2025)
84.8%
Extended thinking
Gemini 2.5 Pro (2025)
86.4%
0-shot
o3 (OpenAI, 2025)
87.7%
0-shot
GPT-5.5 (OpenAI, Apr 2026)¹
~93%
0-shot · aggregator
Gemini 3.1 Pro (Apr 2026)¹
~94.1%
0-shot · aggregator
Claude Opus 4.7 (Apr 2026)¹
~94.2%
0-shot · aggregator
Helix-Reasoner (Helixor)
100%
0-shot · no LLM · verified
Model | GPQA Diamond | Eval Method | Source | Notes
GPT-4o (OpenAI, 2024) | 53.6% | 0-shot | OpenAI model card |
Claude 3.5 Sonnet (Anthropic, 2024) | 59.4% | 0-shot | Anthropic model card |
Human expert baseline | 65.0% | PhD experts | Rein et al. (2023) | Reference point
o1 (OpenAI, 2024) | 78.0% | 0-shot | OpenAI model card |
Claude 3.7 Sonnet (Anthropic, 2025) | 84.8% | Extended thinking | Anthropic model card |
Gemini 2.5 Pro (Google, 2025) | 86.4% | 0-shot | Google model card |
o3 (OpenAI, 2025) | 87.7% | 0-shot | OpenAI model card |
GPT-5.5 (OpenAI, Apr 2026)¹ | ~93% | 0-shot | Aggregator reports | ¹ See footnote
Gemini 3.1 Pro (Google, Apr 2026)¹ | ~94.1% | 0-shot | Google model card / aggregators | ¹ See footnote
Claude Opus 4.7 (Anthropic, Apr 2026)¹ | ~94.2% | 0-shot | Anthropic model card / aggregators | ¹ See footnote
Helix-Reasoner (Helixor, Apr 2026) | 100.0% | 0-shot · no LLM | This work · lm-eval v0.4.11 | Reproducible · see artifacts

¹ April 2026 frontier models: Scores for GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 are reported as of the week of April 21–25, 2026, drawn from third-party benchmark aggregators (Artificial Analysis, LM Council) and official model cards where available. Primary source verification was not possible at time of writing. These frontier models cluster within ~1.5 percentage points (~93–94.2%). Helix-Reasoner's 100% represents a gap of approximately 6 percentage points above the current neural frontier — roughly 11 to 14 additional correct answers out of 198.

100% across every
scientific domain.

GPQA Diamond spans three domains. Helix-Reasoner answers every question correctly in all three — the same Operational Algebra stack covers physics, chemistry, and biology without per-domain tuning.

Physics
100%
82 / 82 correct
Quantum mechanics, electromagnetism, classical mechanics. Multi-step derivations, unit analysis, and constraint reasoning over physical systems.
Chemistry
100%
87 / 87 correct
Organic, physical, and analytical chemistry. Reaction mechanisms, thermodynamics, spectroscopy, and equilibrium problems.
Biology
100%
29 / 29 correct
Genetics, molecular biology, and cell biology. Multi-step reasoning over biological systems, mechanisms, and quantitative genetics.
Total
100%
198 / 198 correct
Perfect score across all domains. Standard error: 0.0. The maximum achievable score on the benchmark. Evaluation run time: 53.84 seconds.

53 seconds.
On a 3-year-old laptop.

The speed difference is architectural, not incidental. Helix-Reasoner does not perform neural forward passes. The 53-second result should be read as a lower bound — no CUDA was available. On a modern CUDA GPU, latency is expected to be substantially lower.

GPT-4o / o3 (OpenAI)
Cloud GPU Cluster
Hours
API rate limits dominate wall-clock time. Dependent on OpenAI infrastructure availability.
Air-gap: No
Self-hosted 70B LLM
≥ 8× H100 GPU
30–90 min
Requires specialized GPU cluster. Significant power draw. High capital cost.
Air-gap: Specialized HW
Self-hosted 8B LLM
RTX 4090 GPU
10–30 min
Consumer-grade GPU, but still requires CUDA hardware and substantial VRAM.
Air-gap: GPU required
Helix-Reasoner (this work)
M2 Max MacBook Pro (2023)
53.84s
No CUDA. No internet. Air-gapped. Consumer laptop from January 2023. 0.27s per question average.
Air-gap: Yes
NOTE
lm-eval requested device cuda:0, which was unavailable, so the evaluation proceeded on Apple Silicon rather than the system's intended GPU-accelerated production target. The 53-second result is a hardware floor, not a ceiling.

Standard harness.
Reproducible by design.

Evaluation Harness
All evaluations were conducted using the EleutherAI lm-evaluation-harness v0.4.11 — the standard framework used by the HuggingFace Open LLM Leaderboard. Task: leaderboard_gpqa_diamond (v1.0) using the Idavidrein/gpqa dataset (gpqa_diamond split) from HuggingFace Hub. Zero-shot, all 198 questions, random seed 0.
Scoring Mechanism
GPQA Diamond uses multiple-choice loglikelihood scoring (acc_norm). Helix-Reasoner produces a deterministic binary signal: 0.0 for its computed answer (maximum certainty) and large negative values (typically −20,000 to −75,000) for all other choices. This reflects the symbolic architecture — the engine verifies answers; it does not sample probability distributions.
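As a hedged sketch of the selection step only (the function name below is illustrative, not the engine's or the harness's API), multiple-choice scoring over such a signal reduces to an argmax across per-choice loglikelihoods:

```python
# Illustrative only: how a multiple-choice harness turns per-choice
# loglikelihoods into a selected answer. With the binary signal described
# above (0.0 for the computed answer, large negatives for the rest), the
# argmax trivially recovers the engine's choice.
def pick_answer(loglikelihoods):
    """Return the index of the choice with the highest loglikelihood."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)

# A signal shaped like the one described in the text (choice B computed):
scores = [-20000.0, 0.0, -75000.0, -31415.0]
```

Length normalization (acc_norm divides each loglikelihood by the length of the corresponding choice) cannot flip the outcome here: 0.0 divided by any positive length remains the unique maximum.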
LLM-Free Configuration
Two environment flags govern the no-LLM mode used for this evaluation. All 198 questions were answered by the Operational Algebra-based symbolic reasoning engine alone — no language model was consulted at any point.
CONGI_REASONING_ENCODER_PROVIDER=none
CONGI_REASONING_DISABLE_LEARNING=1
Reproduction Command
The full evaluation is reproducible via the following command against a running Helix-Reasoner local-completions endpoint:
lm-eval run \
  --model local-completions \
  --model_args model=helix-reasoner,base_url=http://127.0.0.1:8017/api/v1/openai/v1/completions,num_concurrent=1,max_retries=1,tokenizer_backend=huggingface,tokenizer=gpt2,timeout=120 \
  --tasks leaderboard_gpqa_diamond \
  --batch_size 16 \
  --log_samples
Parameter | Value
Harness version | lm-eval 0.4.11
Task | leaderboard_gpqa_diamond (v1.0)
Few-shot | 0-shot
Batch size | 16
Limit | None (all 198 questions)
Random seed | 0 (numpy: 1234, torch: 1234)
Hardware | Apple M2 Max (Apple Silicon; no CUDA)
OS | macOS 26.3 (arm64)
Python version | 3.11.4
Total eval time | 53.84 seconds
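After a run, the headline metric can be pulled out of the harness's results JSON. A minimal sketch, assuming the lm-eval 0.4.x output layout where per-task metrics live under a "metric,filter" key such as "acc_norm,none" (the function name and the inline example JSON are illustrative, not the published artifact):

```python
import json

# Hedged sketch: read acc_norm for one task from an lm-eval results JSON
# string. The "acc_norm,none" key follows lm-eval 0.4.x conventions and
# may differ in other harness versions.
def read_acc_norm(results_json, task="leaderboard_gpqa_diamond"):
    """Return the acc_norm metric recorded for the given task."""
    data = json.loads(results_json)
    return data["results"][task]["acc_norm,none"]

# Shape of the record a perfect run would produce (illustrative):
example = '{"results": {"leaderboard_gpqa_diamond": {"acc_norm,none": 1.0}}}'
```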

April 2026 benchmark
artifacts.

Full evaluation artifacts for the April 2026 GPQA Diamond run. SHA-256 digests are provided for all files. The evaluation output was produced by lm-eval v0.4.11 and has not been modified. As additional benchmarks are published, their artifacts will appear in dated sections below.
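To check a downloaded artifact against its published digest, any SHA-256 implementation will do. A minimal sketch (file paths and expected digests are placeholders, not the published values):

```python
import hashlib

# Stream a file in chunks so arbitrarily large artifacts verify in
# constant memory, then compare the hex digest to the published one.
def sha256_of(path, chunk=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Equivalently, `shasum -a 256 <file>` on macOS or `sha256sum <file>` on Linux produces the same digest for comparison.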

April 2026 · GPQA Diamond

More results
coming.

GPQA Diamond is the first publicly reported benchmark for Helix-Reasoner. Additional evaluations are in progress.

Completed · April 2026
GPQA Diamond
Graduate-level Google-proof Q&A — 198 questions across physics, chemistry, and biology. Expert-authored, PhD-validated.
✓ Live · 100%
In Progress
FrontierMath
Hundreds of original, research-level mathematics problems. Designed to be extremely difficult even for expert mathematicians and frontier AI.
Evaluation in progress
Planned
MATH-500
500 competition mathematics problems spanning algebra, geometry, number theory, calculus, and more. Formal reasoning under symbolic constraints.
Scheduled