Nebius Build Berlin  ·  April 28, 2026

The VLA Reality Check

PhAIL — the Physical AI Leaderboard. Real hardware. Real metrics.

Is today's model better than yesterday's?

Are we actually making progress?

The problem

Answering honestly is harder than it sounds.

Four methodological traps make most VLA comparisons misleading.

01 Operator and environment shift outcomes
02 Different models speak different languages
03 One metric isn't enough — speed, reliability, failure modes
04 10 runs don't prove anything
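
Trap 04 can be made concrete with a confidence interval on a success rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is an illustration only, not necessarily how PhAIL computes uncertainty, and the success counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: model A succeeds 70% of the time, model B 50%.
for n in (10, 100):
    a = wilson_interval(round(0.7 * n), n)
    b = wilson_interval(round(0.5 * n), n)
    print(n, a, b, "overlap" if intervals_overlap(a, b) else "separated")
```

With 10 rollouts each, the two intervals overlap heavily, so a 70% vs 50% gap cannot be told apart from noise. With 100 rollouts each, the same gap separates.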

The response

What honest eval looks like.

Four principles. Each one a direct answer to a trap above.

Same-session, blinded A/B: no drift, no bias
One inference API: apples-to-apples
Full data: any metric you want
Enough rollouts: signal, not noise
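
One way to read "same-session, blinded A/B" is as a block-randomized schedule with opaque labels: every model runs once per block, in random order, and the operator only ever sees a label such as `policy-2`. The sketch below assumes that reading. The function name, label format, and model names are hypothetical and are not PhAIL's actual tooling.

```python
import random


def blinded_schedule(models, rollouts_per_model, seed):
    """Return (schedule, key): an interleaved list of opaque labels, plus the
    label-to-model mapping that stays hidden from the operator until the end."""
    rng = random.Random(seed)

    # Assign opaque labels in random order so a label reveals nothing.
    shuffled = list(models)
    rng.shuffle(shuffled)
    key = {f"policy-{i}": model for i, model in enumerate(shuffled)}

    # Block randomization: each block contains every label exactly once, so
    # slow drift in lighting, wear, or operator fatigue hits all models alike.
    schedule = []
    for _ in range(rollouts_per_model):
        block = list(key)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule, key


schedule, key = blinded_schedule(
    ["model_a", "model_b", "model_c", "model_d"], rollouts_per_model=30, seed=7
)
print(schedule[:8])  # what the operator sees
```

The key is only joined back to the results after the session, which is what keeps operator expectations out of the outcome.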

Live

PhAIL — Physical AI Leaderboard

joint with Nebius

PhAIL run explorer: episode viewer with camera feeds, 3D trajectory, and telemetry. phail.ai ↗

Headline results

Where VLAs are in April 2026.

Three numbers from running four open-source VLAs on the same rig.

5% of human throughput (best model, pick-and-place)
~4 min mean time between human assists
−22 pp for GR00T when the camera is occluded; OpenPI loses just 6 pp. Robustness is not equal.
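
For reference, here is a minimal sketch of how these three kinds of numbers could be derived from per-episode logs. The `Episode` fields and the metric definitions (items per hour, operating time divided by assist count, difference in success rate) are assumptions for illustration, not PhAIL's published definitions, and the data is made up.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    duration_s: float  # wall-clock length of the episode, in seconds
    items_moved: int   # completed pick-and-place items
    assists: int       # times a human had to step in


def throughput_per_hour(episodes):
    """Items completed per hour of operation."""
    total_s = sum(e.duration_s for e in episodes)
    return 3600 * sum(e.items_moved for e in episodes) / total_s


def relative_throughput(robot_episodes, human_episodes):
    """Robot throughput as a fraction of human throughput on the same task."""
    return throughput_per_hour(robot_episodes) / throughput_per_hour(human_episodes)


def mean_time_between_assists_s(episodes):
    """Total operating time divided by number of assists (inf if none)."""
    total_assists = sum(e.assists for e in episodes)
    total_s = sum(e.duration_s for e in episodes)
    return math.inf if total_assists == 0 else total_s / total_assists


def percentage_point_drop(baseline_rate, perturbed_rate):
    """Absolute drop between two rates, in percentage points."""
    return 100 * (baseline_rate - perturbed_rate)


# Made-up logs, for illustration only.
robot = [Episode(600, 5, 2), Episode(600, 7, 1)]
human = [Episode(600, 120, 0)]
print(relative_throughput(robot, human))
print(mean_time_between_assists_s(robot))
print(percentage_point_drop(0.75, 0.5))
```

Because the raw episodes are kept, any other metric can be computed later from the same logs.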

What's next

PhAIL in the coming months.

Trossen bimanual
New embodiment landing on the rig.
More tasks
Custom evaluation tasks on request.
Get your model on the board
Free Nebius credits to fine-tune & submit. Talk to us after the talk.

So, are we making progress?

Now we can answer.

— and you can, too.

phail.ai ↗

Sergey Arkhangelsky · Positronic Robotics · Nebius Build Berlin 2026