Nebius Build Berlin · April 28, 2026

The VLA Reality Check

PhAIL — the Physical AI Leaderboard. Real hardware. Real metrics.

Is today's model better than yesterday's?

Are we actually making progress?

The problem

Answering honestly is harder than it sounds.

Four methodological traps make most VLA comparisons misleading.

01 Operator and environment shift outcomes

02 Different models speak different languages

03 One metric isn't enough — speed, reliability, failure modes

04 Ten runs don't prove anything
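
To make trap 04 concrete, here is a minimal sketch of how wide the uncertainty on a success rate is at small sample sizes. It uses the standard Wilson score interval for a binomial proportion; the 70% observed success rate and the sample sizes are illustrative assumptions, not PhAIL results. Under those assumptions, 7 successes in 10 runs is compatible with a true success rate anywhere from roughly 40% to 89%.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Hypothetical example: the same 70% observed success rate at three sample sizes.
for n in (10, 100, 1000):
    lo, hi = wilson_interval(round(0.7 * n), n)
    print(f"n={n:4d}  observed 70%  95% interval = [{lo:.2f}, {hi:.2f}]")
```

The interval narrows only with the square root of the number of rollouts, which is why a ten-run comparison between two models rarely separates them.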

The response

What honest eval looks like.

Four principles. Each one a direct answer to a trap above.

Same-session, blinded A/B → No drift, no bias

One inference API → Apples-to-apples

Full data → Any metric you want

Enough rollouts → Signal, not noise
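
One way the same-session, blinded A/B principle could be set up is a randomized block schedule: every model runs once per block in shuffled order, and the operator sees only opaque labels. This is a sketch under stated assumptions, not PhAIL's actual procedure; the function name `blinded_schedule`, the `policy-A` style labels, and the seed handling are all hypothetical.

```python
import random


def blinded_schedule(models, rollouts_per_model, seed=0):
    """Return (schedule, key).

    schedule: opaque labels in randomized block order, shown to the operator.
    key: label -> model mapping, kept hidden until scoring is finished.
    """
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)  # label assignment is itself randomized
    key = {f"policy-{chr(ord('A') + i)}": m for i, m in enumerate(shuffled)}

    schedule = []
    for _ in range(rollouts_per_model):
        block = list(key)   # every model appears exactly once per block
        rng.shuffle(block)  # order within the block is random
        schedule.extend(block)
    return schedule, key


# Hypothetical usage with the two models named in this talk.
schedule, key = blinded_schedule(["GR00T", "OpenPI"], rollouts_per_model=5)
print(schedule)
```

Interleaving models within one session means lighting changes, operator fatigue, and hardware wear affect all of them roughly equally, and the hidden key keeps the operator from favoring one model.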

Live

PhAIL — Physical AI Leaderboard

joint with Nebius

PhAIL run explorer: episode viewer with camera feeds, 3D trajectory, and telemetry.

phail.ai ↗

Headline results

Where VLAs are in April 2026.

Three numbers from running four open-source VLAs on the same rig.

5%

of human throughput

best model, pick-and-place

~4 min

between human assists

mean time between assists

−22 pp

for GR00T when the camera is occluded. OpenPI loses just 6 pp.

robustness is not equal
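
For readers who want to compute the first two kinds of numbers from their own logs, here is a rough sketch of relative throughput and mean time between assists. The `Episode` record and its fields are assumptions for illustration, not the PhAIL data schema, and the values below are made up; they are not the results reported above.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    duration_s: float  # wall-clock length of the episode, in seconds
    items_moved: int   # completed pick-and-place units
    assists: int       # human interventions during the episode


def throughput_per_hour(episodes):
    """Items completed per hour of wall-clock time."""
    total_s = sum(e.duration_s for e in episodes)
    return 3600 * sum(e.items_moved for e in episodes) / total_s


def mean_time_between_assists_s(episodes):
    """Total run time divided by the number of human assists."""
    total_assists = sum(e.assists for e in episodes)
    if total_assists == 0:
        return float("inf")  # no assists observed
    return sum(e.duration_s for e in episodes) / total_assists


# Made-up logs for illustration only.
robot = [Episode(600, 3, 1), Episode(600, 5, 3)]
human = [Episode(600, 120, 0)]

relative = throughput_per_hour(robot) / throughput_per_hour(human)
print(f"relative throughput: {relative:.1%}")
print(f"mean time between assists: {mean_time_between_assists_s(robot) / 60:.1f} min")
```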

What's next

PhAIL in the coming months.

Trossen bimanual

New embodiment landing on the rig.

More tasks

Custom evaluation tasks on request.

Get your model on the board

Free Nebius credits to fine-tune & submit. Talk to us after the talk.

So, are we making progress?

Now we can answer.

— and you can, too.

phail.ai ↗

Sergey Arkhangelsky · Positronic Robotics · Nebius Build Berlin 2026