GPQA 32% in 10 Hours -- HuggingFace's AI Intern Outperformed Claude Code
An open-source agent that automates the entire LLM post-training pipeline: literature scan, dataset discovery, training scripts, evaluation, and iteration. 6,800 stars and climbing at 260 per day.

32% on a Benchmark Designed to Stump Experts
GPQA -- Graduate-Level Google-Proof Q&A -- is a benchmark of graduate-level science questions written so that even skilled non-experts can't solve them with unrestricted web access; hence "Google-proof." It's designed to test genuine reasoning, not retrieval.
HuggingFace's ml-intern agent scored 32% on it. Claude Code scored 22.99%. The agent ran for 10 hours with zero human intervention.
But the score isn't the real story. What matters is the scope of what ml-intern did: it searched papers, found relevant datasets, wrote training scripts, trained a model, evaluated the results, identified gaps, and iterated. The entire post-training pipeline, automated end to end.
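The article doesn't say which harness produced these scores, but for a sense of what a GPQA evaluation involves, here is how you might score a checkpoint with EleutherAI's lm-evaluation-harness. The model path and task variant are illustrative, not what ml-intern or Claude Code actually ran:

```python
import lm_eval

# Score a fine-tuned checkpoint on GPQA's main zero-shot split.
# Model path and task name are illustrative; the article does not
# specify how either system was evaluated.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-finetuned-model",
    tasks=["gpqa_main_zeroshot"],
)
print(results["results"]["gpqa_main_zeroshot"])
```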
What ml-intern Actually Does
ml-intern's automated post-training pipeline
Think of it as an ML research intern that never sleeps. You give it a goal -- "improve scientific reasoning" -- and it executes:
- Literature Scan: Searches arXiv and Semantic Scholar for relevant papers, extracts key techniques
- Dataset Discovery: Finds training data on HuggingFace Hub, evaluates quality
- Training Script Generation: Writes fine-tuning code using TRL (Transformer Reinforcement Learning)
- Evaluation: Benchmarks the trained model and analyzes results
- Iteration: If results are lacking, repeats the cycle with improvements
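ml-intern's internal orchestration isn't documented in this article, but the shape of the loop is easy to picture. Here's a minimal sketch with stubbed-out stages; none of these names come from ml-intern's codebase, they simply mirror the five stages above:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the iterate loop -- invented names, not ml-intern's API.

@dataclass
class EvalReport:
    score: float
    gaps: list[str] = field(default_factory=list)

def literature_scan(goal: str) -> list[str]:
    """Stub for the arXiv / Semantic Scholar search stage."""
    return [f"technique relevant to: {goal}"]

def discover_datasets(techniques: list[str]) -> list[str]:
    """Stub for HuggingFace Hub search plus quality filtering."""
    return ["some-org/some-dataset"]

def train_and_evaluate(techniques: list[str], data: list[str]) -> EvalReport:
    """Stub for TRL fine-tuning followed by benchmarking."""
    return EvalReport(score=0.32, gaps=["multi-step reasoning"])

def run(goal: str, target: float = 0.30, max_rounds: int = 3) -> EvalReport:
    techniques = literature_scan(goal)
    report = EvalReport(score=0.0)
    for _ in range(max_rounds):
        data = discover_datasets(techniques)
        report = train_and_evaluate(techniques, data)
        if report.score >= target:
            break
        techniques += report.gaps  # feed eval gaps back into the next round
    return report

print(run("improve scientific reasoning"))
```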
A human researcher takes weeks to do this. ml-intern did it in 10 hours.
Tech Stack
- Framework: smolagents (HuggingFace's lightweight agent framework)
- Training: Transformers, TRL
- Data: Datasets library
- Language: Python
- License: Apache-2.0
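To make the training step concrete, a minimal TRL supervised fine-tuning run looks roughly like this, assuming a recent TRL version. The model and dataset are illustrative, not what ml-intern would necessarily pick:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative model and dataset -- not ml-intern's actual choices.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # recent TRL accepts a Hub model id directly
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()
```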
Building on smolagents is significant. Unlike LangChain's heavy abstraction layers, smolagents uses a "code agent" approach -- it generates and executes code directly, giving it far more flexibility in tool usage.
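A taste of what that looks like in practice -- a minimal smolagents sketch, where the model class and search tool are illustrative rather than ml-intern's actual configuration:

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

# A CodeAgent writes and executes Python to drive its tools, rather than
# emitting structured tool-call JSON. Model and tool choices here are
# illustrative, not ml-intern's actual configuration.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=InferenceClientModel(),  # defaults to a hosted model on the HF Inference API
)
agent.run("Find recent post-training techniques for scientific reasoning")
```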
Performance Context
| System | GPQA Score | Time | Human Input | Cost |
|---|---|---|---|---|
| ml-intern | 32% | 10 hours | None | GPU costs only |
| Claude Code | 22.99% | N/A (human-driven) | Full | API costs |
| Human ML researcher | Varies | Weeks | Full | Salary |
Fair warning: this isn't a perfectly apples-to-apples comparison. Claude Code is a general-purpose coding agent; ml-intern is specialized for ML post-training. But the point stands -- a specialized automated agent outperformed a general-purpose AI tool in its domain.
Competitor Landscape
| Project | Stars | Focus | Automation Scope | Framework |
|---|---|---|---|---|
| ml-intern | 6.8K | ML post-training | Papers → data → training → eval → iterate | smolagents |
| SWE-agent | 15K | Code bug fixing | Issue → patch → test | Custom |
| STORM | 9K | Paper writing | Research → outline → draft | Custom |
SWE-agent (Princeton NLP) automates code fixes. STORM (Stanford OVAL) automates paper writing. ml-intern automates the experiments themselves. Different lanes, minimal overlap.
Why It's Trending Now
Climbing GitHub Trending at 260 stars/day
At 260 stars per day, ml-intern is punching well above its weight for a 6.8K-star project. A few reasons it's gaining traction.
The project targets an acute pain point: the repetitive grind of ML research. Reading papers, finding datasets, writing training code, tuning hyperparameters -- roughly 80% of this work is mechanical, not creative. ml-intern automates the mechanical parts.
It also benefits from deep HuggingFace ecosystem integration. Transformers, TRL, Datasets, Hub -- everything works natively. If you're already in the HuggingFace stack, adoption cost is near zero.
And the Product Hunt launch on April 23 (365 upvotes) gave it visibility beyond the typical GitHub audience. Running an open-source launch and a product launch simultaneously was a smart play.
Ecosystem Context
ml-intern signals the start of "AI training AI" becoming practical. Where Hermes Agent handles agent self-improvement and Google ADK handles deployment infrastructure, ml-intern carves out research automation as its own axis. The 2026 agent ecosystem is specializing fast.
The bigger implication is a shift in what ML researchers actually do. Less hyperparameter tuning and dataset curation, more setting research direction and interpreting results.
Getting Started
```bash
pip install ml-intern
ml-intern run --task "improve scientific reasoning"
```
One command kicks off the full pipeline. You'll need GPU access (A100 recommended minimum) and a HuggingFace Hub token configured.
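One common way to configure the token -- the CLI login and the HF_TOKEN variable are standard HuggingFace Hub mechanisms, not ml-intern-specific:

```bash
huggingface-cli login            # interactive; stores the token locally
# or, for non-interactive environments:
export HF_TOKEN=hf_xxxxxxxxxxxx
```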
Who Should Skip This
- If you only need inference, not training
- If you have no GPU access
- If you want fine-grained control over every training step (ml-intern is highly autonomous, making mid-run intervention difficult)
What's Next
ml-intern development roadmap
- Multi-GPU and multi-node training support
- Experiment tracking integration (Weights & Biases, MLflow)
- Automated paper draft generation
An agent that automates 80% of ML post-training. If this is the "intern" version, the "senior" version is going to be something else entirely.