spoonai

Even the Best Model Got Only 36.1% — OpenAI's LifeSciBench Is a Reality Check on AI Doing Science

On June 17, OpenAI released LifeSciBench, a benchmark grading AI on real life-science research — 750 tasks built by 173 PhD scientists, scored on expert rubrics averaging 25 criteria each. The strongest model passed just 36.1%. It's a ruler for how true 'AI does science' really is.

OpenAI LifeSciBench — grading AI on real life-science research tasks
Source: OpenAI

Just how true is "AI does science"?

Here's the deal: headlines say AI designs drugs and analyzes genomes — but exactly how well? Demos always look great; what models actually do when handed real research tasks has rarely been well measured. OpenAI's LifeSciBench, released June 17, puts a ruler to that question.

The headline number is humbling: the best model passed only 36.1% of the 750 tasks, so it failed nearly two of every three. That matters because it gives a cold baseline against the overheated "AI will replace scientists" story, while also blocking the opposite extreme: 36.1% isn't zero. It shows, in numbers, how far you can trust AI and where humans must verify.

The players — OpenAI, 173 scientists, and GPT-Rosalind

OpenAI shipped an evaluation tool, not a product: a public benchmark that grades multiple models on one yardstick, which is also a bid to define the field's standard. 173 PhD scientists from biotech and pharma authored the tasks and rubrics, averaging 25 grading criteria per task and 19,020 in total, so the test is not "get one answer right"; it dissects the logic, evidence, and interpretation item by item. GPT-Rosalind, OpenAI's life-sciences model, topped GPT-5.5, Grok 4.3, and Gemini 3.1 Pro, but topping out at ~36.1% is the whole point: even the best has far to go.
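The reported figures can be cross-checked with a few lines of arithmetic. This is an illustrative sketch only: it uses just the three numbers quoted in this article (750 tasks, 19,020 criteria, a 36.1% top pass rate), and the variable names are our own. The task counts are rounded estimates derived from the published percentage, not figures OpenAI reported.

```python
# Figures quoted in the article.
TASKS = 750
TOTAL_CRITERIA = 19_020
TOP_PASS_RATE = 0.361  # best model's pass rate

# Average rubric criteria per task.
avg_criteria = TOTAL_CRITERIA / TASKS

# Approximate task counts implied by the pass rate (rounded estimate).
passed = round(TASKS * TOP_PASS_RATE)
failed = TASKS - passed

print(f"{avg_criteria:.2f} criteria per task")       # 25.36 criteria per task
print(f"~{passed} passed, ~{failed} failed")         # ~271 passed, ~479 failed
print(f"{1 - TOP_PASS_RATE:.1%} of tasks not passed")  # 63.9% of tasks not passed
```

So "averaging 25 criteria" is a rounding of about 25.4, and the best model fell short on close to two of every three tasks.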

What makes it different

| Item | LifeSciBench |
| --- | --- |
| Tasks | 750 |
| Authors | 173 PhD scientists |
| Scoring | expert rubrics averaging 25 criteria/task (19,020 total) |
| Domains | evidence handling, analysis, design & optimization, scientific reasoning, validation & operations, translation & communication |
| Top score | GPT-Rosalind, ~36.1% pass |
| Inputs | genomic sequence files, chemical structure files, experimental figures |

It's not multiple choice. Prior AI-biology benchmarks leaned on clean MCQs; LifeSciBench is free-response, graded against expert rubrics — like real research. Models must interpret real scientific artifacts — sequence files, chemical structures, figures — not recite memorized text, which sharply raises difficulty. And it splits ability into six domains, so you can diagnose "reasons well but weak at validation" — a far more useful map than a single score.
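To make rubric grading and the per-domain breakdown concrete, here is a minimal sketch. It is not OpenAI's grading code: the `GradedTask` class, the 0.7 pass threshold, and the sample results are all hypothetical, because the article does not say how rubric results are turned into a pass. Only the two domain names come from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical cutoff; the article does not state the real pass rule.
PASS_THRESHOLD = 0.7


@dataclass
class GradedTask:
    """One task's rubric outcome: True for each criterion the answer met."""
    domain: str
    criteria_met: list[bool]

    @property
    def score(self) -> float:
        # Fraction of rubric criteria satisfied.
        return sum(self.criteria_met) / len(self.criteria_met)

    @property
    def passed(self) -> bool:
        return self.score >= PASS_THRESHOLD


def pass_rate_by_domain(tasks: list[GradedTask]) -> dict[str, float]:
    """Share of passed tasks per domain."""
    counts = defaultdict(lambda: [0, 0])  # domain -> [passed, total]
    for task in tasks:
        counts[task.domain][0] += task.passed
        counts[task.domain][1] += 1
    return {domain: p / n for domain, (p, n) in counts.items()}


# Made-up results for illustration only.
tasks = [
    GradedTask("scientific reasoning", [True, True, True, False]),
    GradedTask("scientific reasoning", [True, True, True, True]),
    GradedTask("validation & operations", [True, False, False, False]),
    GradedTask("validation & operations", [True, True, True, False]),
]
rates = pass_rate_by_domain(tasks)
print(rates)  # {'scientific reasoning': 1.0, 'validation & operations': 0.5}
```

With these invented results the toy model "reasons well but is weak at validation", which is the kind of diagnosis a single overall score would hide.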

Who welcomes it

Pharma/biotech are the most direct beneficiaries — they keep hitting "how far can I trust this model," and LifeSciBench answers with data, letting them design exactly where humans must verify instead of blanket trust or rejection. AI companies get a clear target — 36.1% says "raise this," and the six-domain breakdown shows which ability to shore up. Scientists get a reality check: in an age of hype, a rigorous eval by 173 peers sets an agreed baseline for "this is where AI is now."

Past parallels — benchmarks that pulled fields forward

Good benchmarks often pull whole fields forward. ImageNet did it for vision: a clear metric sparked global competition that helped trigger the deep-learning revolution. MMLU played a similar role for language models. The benchmarks that won out looked enough like reality: too easy and a benchmark saturates, too detached and high scores don't transfer. LifeSciBench's free-response format, expert rubrics, and real-data inputs are exactly that effort at realism. The trap to watch: once a benchmark gains authority, companies tune models to ace that test (overfitting). The low 36.1% start is paradoxically healthy, since it reflects real limits, not test-taking tricks.

Counter-play — the benchmark-authority race

Shipping a life-sciences benchmark is a strategic move. Defining the field's evaluation standard, and having your own model (GPT-Rosalind) top it, sends a powerful message: "our model wins on the hardest test we set." Rivals (Google's Gemini, xAI's Grok) can either compete on LifeSciBench scores or propose their own standards and enter the rule-making contest; who owns AI evaluation is becoming as important a front as model performance. But going public is double-edged: others can now measure and compare, and if a rival passes GPT-Rosalind, that shows on the same yardstick too. Transparency builds trust, but it also takes on risk.

So what actually changes

If you work in pharma/biotech, this is concrete grounding for "how far to delegate to AI": the 36.1% headline and the six-domain breakdown let you design workflows where AI does first-pass analysis and humans verify the core call. If you carry vague hope or fear about AI, this number is ballast: neither "AI replaces scientists soon" nor "it's all lies," but "today's best is ~36.1%." For researchers and students, it's a reference for how far to trust AI as a research assistant.

🥄 Three Things You're Probably Wondering

— So what does this mean for me? If you're in bio/pharma/research, it's a real basis for deciding how far to trust AI. Elsewhere, it's a reference point for "where AI's science ability stands."

— Is 36.1% good or bad? Depends on the lens. "Only 36%" says far to go; "already 36%" says fast progress. The point is it's neither 0 nor 100 — read it as a signal that human verification is still needed.

— Does GPT-Rosalind topping it mean OpenAI is best? Too early. A model topping its maker's own benchmark deserves a grain of salt. But it's public, so others can measure on the same yardstick — a more objective picture will emerge over time.

Sources

Numbers and criteria are as of announcement and may change.
