Even the Best Model Got Only 36.1% — OpenAI's LifeSciBench Is a Reality Check on AI Doing Science
On June 17, OpenAI released LifeSciBench, a benchmark grading AI on real life-science research — 750 tasks built by 173 PhD scientists, scored on expert rubrics averaging 25 criteria each. The strongest model passed just 36.1%. It's a ruler for how true 'AI does science' really is.

Just how true is "AI does science"?
Headlines say AI designs drugs and analyzes genomes, but how well does it actually do either? Demos always look great; what models do when handed real research tasks has rarely been measured well. OpenAI's LifeSciBench, released June 17, puts a ruler to that question.
The headline number is humbling: the best model passed only 36.1% of the 750 tasks, which means it failed nearly two of every three. That matters because it gives a cold baseline against the overheated "AI will replace scientists" story while also ruling out the opposite extreme, since 36.1% isn't zero. It shows, in numbers, how far you can trust AI and where humans must verify.
The players — OpenAI, 173 scientists, and GPT-Rosalind
OpenAI shipped an evaluation tool, not a product: a public benchmark that grades multiple models on one yardstick, which is also a bid to define the field's standard. The tasks and rubrics were written by 173 PhD scientists from biotech and pharma, with about 25 grading criteria per task and 19,020 in total. So a model isn't graded on getting one answer right; its logic, evidence, and interpretation are checked item by item. GPT-Rosalind, OpenAI's life-sciences model, topped GPT-5.5, Grok 4.3, and Gemini 3.1 Pro, but topping out at ~36.1% is the whole point: even the best has far to go.
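To make the rubric idea concrete, here is a minimal Python sketch of grading one task criterion by criterion. The article does not say how LifeSciBench turns criteria into a pass or fail, so the `Criterion` class, the `rubric_score` and `task_passes` helpers, and the 0.8 threshold are all hypothetical illustrations, not OpenAI's method. The only figures taken from the source are the 750 tasks and the 19,020 criteria.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item, e.g. "cites the correct control experiment" (made-up example)."""
    description: str
    met: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the fraction of rubric criteria a response satisfied."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def task_passes(criteria: list[Criterion], threshold: float = 0.8) -> bool:
    """Hypothetical pass rule: enough criteria met. The real rule is not given in the article."""
    return rubric_score(criteria) >= threshold


# Sanity check on the article's own numbers: 19,020 criteria over 750 tasks.
criteria_per_task = 19_020 / 750  # about 25.4, consistent with "averaging 25"

# A made-up response that satisfies 20 of 25 criteria.
example = [Criterion(f"criterion {i}", met=(i < 20)) for i in range(25)]
print(rubric_score(example), task_passes(example))
```

Under this assumed rule, a response can get most of the reasoning right and still fail the task, which is one plausible reason free-response rubric grading produces lower pass rates than multiple choice.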
What makes it different
| Item | LifeSciBench |
|---|---|
| Tasks | 750 |
| Authors | 173 PhD scientists |
| Scoring | expert rubrics averaging 25 criteria/task (19,020 total) |
| Domains | evidence handling, analysis, design & optimization, scientific reasoning, validation & operations, translation & communication |
| Top score | GPT-Rosalind, ~36.1% pass |
| Inputs | genomic sequence files, chemical structure files, experimental figures |
It's not multiple choice. Prior AI-biology benchmarks leaned on clean MCQs; LifeSciBench is free-response, graded against expert rubrics — like real research. Models must interpret real scientific artifacts — sequence files, chemical structures, figures — not recite memorized text, which sharply raises difficulty. And it splits ability into six domains, so you can diagnose "reasons well but weak at validation" — a far more useful map than a single score.
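The per-domain diagnosis described above can be sketched as a simple aggregation. This is a minimal Python example under stated assumptions: the `pass_rate_by_domain` helper and the sample results are invented for illustration, and only the domain names come from the table. It shows how a single overall score can hide a weak domain.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_domain(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Group (domain, passed) pairs and return the pass rate for each domain."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(passed)
    return {domain: passes[domain] / totals[domain] for domain in totals}


# Made-up results for illustration only; these are not LifeSciBench data.
sample_results = [
    ("scientific reasoning", True),
    ("scientific reasoning", True),
    ("scientific reasoning", False),
    ("scientific reasoning", True),
    ("validation & operations", False),
    ("validation & operations", False),
    ("validation & operations", True),
    ("validation & operations", False),
]

rates = pass_rate_by_domain(sample_results)
overall = sum(passed for _, passed in sample_results) / len(sample_results)

for domain, rate in rates.items():
    print(f"{domain}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

In this toy data the overall rate is 50%, but the split between the two domains (75% versus 25%) is what would tell a team where human review matters most.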
Who welcomes it
Pharma and biotech are the most direct beneficiaries. They keep running into the question "how far can I trust this model," and LifeSciBench answers with data, letting them decide exactly where humans must verify instead of choosing between blanket trust and blanket rejection. AI companies get a clear target: 36.1% says "raise this," and the six-domain breakdown shows which ability to shore up. Scientists get a reality check: in an age of hype, a rigorous evaluation by 173 peers sets an agreed baseline for "this is where AI is now."
Past parallels — benchmarks that pulled fields forward
Good benchmarks often pull whole fields forward. ImageNet did it for vision: a clear metric sparked global competition that helped trigger the deep-learning revolution. MMLU played a similar role for language models. The benchmarks that won out looked enough like reality: too easy and a benchmark saturates, too detached and high scores don't transfer. LifeSciBench's free-response format, expert rubrics, and real-data inputs are exactly that effort at realism. The trap to watch: once a benchmark gains authority, companies tune models to ace that test (overfitting). The low 36.1% start is paradoxically healthy, because it reflects real limits, not test-taking tricks.
Counter-play — the benchmark-authority race
Shipping a life-sciences benchmark is strategic: defining the field's evaluation standard, and having your model (GPT-Rosalind) top it, is a powerful message — "our model wins on the hardest test we set." Rivals (Google's Gemini, xAI's Grok) can either compete on LifeSciBench scores or propose their own standards, entering the rule-making contest; who owns AI evaluation is becoming as important a front as model performance. But going public is double-edged — others can now measure and compare, and if a rival passes GPT-Rosalind, that shows on the same yardstick too. Transparency builds trust and takes on risk.
So what actually changes
If you work in pharma/biotech, this is concrete grounding for "how far to delegate to AI" — the 36.1% and the six-domain strengths let you design workflows where AI does first-pass analysis and humans verify the core call. If you carry vague hope or fear about AI, this number is ballast — neither "AI replaces scientists soon" nor "it's all lies," but "today's best is ~36.1%." For researchers/students, it's a reference for the trust range when using AI as a research assistant.
🥄 Three Things You're Probably Wondering
- So what does this mean for me? If you're in bio, pharma, or research, quite a lot: it's a real basis for deciding how far to trust AI. Elsewhere, it's a reference point for "where AI's science ability stands."
- Is 36.1% good or bad? Depends on the lens. "Only 36%" says there is far to go; "already 36%" says progress is fast. The point is that it's neither 0 nor 100, so read it as a signal that human verification is still needed.
- Does GPT-Rosalind topping it mean OpenAI is best? Too early to say. A model topping its maker's own benchmark deserves a grain of salt. But the benchmark is public, so others can measure on the same yardstick, and a more objective picture will emerge over time.
Sources
- Introducing LifeSciBench — OpenAI
- OpenAI Releases LifeSciBench, a 750-Task Benchmark — MarkTechPost
- OpenAI Life Science Benchmark Reveals AI Passes Only 1 in 3 Scientific Research Tasks — TechTimes
Numbers and criteria are as of announcement and may change.