We measure how physical AI actually performs.
Every lab claims their model works. Conference demos look great. But can these models handle real commercial tasks – reliably, repeatedly, at production speed?
We built PhAIL (Physical AI Leaderboard) to answer that question. PhAIL is the first real-hardware benchmark for foundation models in robotics. We test leading vision-language-action (VLA) models on physical robots doing commercial tasks, and measure what businesses actually care about: throughput, reliability, and failure modes.
Not success rates in simulation. Real results on real hardware.
Models on the leaderboard include OpenPI 0.5 (Physical Intelligence), GR00T and DreamZero (NVIDIA), and SmolVLA (HuggingFace) – tested alongside human and teleoperated baselines.
The infrastructure problem
New foundation models ship every month. Each requires its own inference setup, its own data format, its own training recipe. Teams spend weeks on integration work that becomes obsolete when the next model drops.
Running OpenPI 0.5 needs 78 GB of VRAM. GR00T needs CUDA on Linux. SmolVLA runs on a consumer GPU. Getting all of them to talk to the same robot arm is its own engineering project – every single time.
This is the problem we solve.
What Positronic does
Positronic is an open-source Python toolkit that handles the full lifecycle of deploying AI on real robots – from data collection through fine-tuning to production inference. One codebase, any model vendor, any hardware.
- Collect – teleoperate in simulation or on hardware (phone, VR, leader arm)
- Train – fine-tune on your data. Switch models without re-recording your data.
- Run – unified inference across vendors. Same protocol, any model, any robot.
- Iterate – measure what works, collect edge cases, retrain.
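The Run step above rests on one idea: robot-side code should depend on a single inference protocol, not on any vendor's SDK. A minimal sketch of what such a protocol can look like – the class and method names here are illustrative assumptions, not Positronic's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any

class Policy(ABC):
    """Vendor-agnostic policy interface (illustrative, not Positronic's real API)."""

    @abstractmethod
    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        """Map one observation (camera frames, joint states) to one action."""

class HoldStillPolicy(Policy):
    """Trivial stand-in for a real model: command the arm to hold its pose."""

    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        return {"joint_positions": observation["joint_positions"]}

def control_loop(policy: Policy, observations: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # The driver code only sees the Policy interface, so swapping
    # vendors means swapping one object, not rewriting the loop.
    return [policy.act(obs) for obs in observations]
```

With this shape, a GR00T wrapper and a SmolVLA wrapper are interchangeable from the robot's point of view; only the object handed to the control loop changes.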
We built this to run PhAIL. Every model on the leaderboard goes through the same infrastructure: same data pipeline, same hardware drivers, same evaluation protocol. The codec layer translates between one canonical task representation and each model's expected format – so every model gets a fair test under identical conditions.
The same infrastructure is available for your deployment.
The long view
Despite the hype, this field is far less mature than most people realize. We are where self-driving was in 2015 – real potential, but years of groundwork ahead.
The teams that deploy physical AI at scale will need more than a trained model. They will need infrastructure that doesn't break every time a better model ships: reliable evaluation, unified inference, production data pipelines. That is what we are building. PhAIL is where we prove it works.
Get involved
- PhAIL launches March 24 – the full leaderboard, methodology, and evaluation data.
- Star on GitHub – the open-source infrastructure behind PhAIL.
- Join Discord – questions, discussion, feature requests.
- Email: [email protected]