Definitions
Zero-shot means executing a new task or handling a new object with no new demonstrations — the robot relies entirely on what it learned during pre-training. Few-shot means adapting to a new task with a small number of new demonstrations, typically 5-100. These are meaningfully different capabilities, and conflating them leads to unrealistic planning.
The Current Reality for Zero-Shot
Foundation models for robot manipulation — OpenVLA, Octo, RT-2 — have demonstrated genuine zero-shot capability on simple tasks within their training distribution. Results that have been validated and reproduced across multiple labs include:
- Open-vocabulary object detection + simple top-down grasp planning works zero-shot for ~60% of common household objects presented in upright orientation on a clear surface
- Language-conditioned navigation in previously explored environments works zero-shot with ~75% success in structured settings
- Pick-and-place with familiar object categories (mugs, bottles, blocks) achieves 50-65% success zero-shot with OpenVLA on standard benchmarks
Where zero-shot reliably fails: precision tasks requiring sub-5mm placement, dexterous manipulation, novel tool use, tasks where the object pose is non-canonical (tilted bottles, stacked cups), and any task involving deformable objects not well-represented in training data.
Few-Shot Fine-Tuning: The More Practical Capability
Few-shot fine-tuning (20-100 demonstrations on a new task) is where foundation models show their clearest practical value. The comparison that matters:
| Training Approach | Demos Required | Typical Success Rate | Time to Train |
|---|---|---|---|
| Foundation model, zero-shot | 0 | 30–65% (simple tasks) | N/A |
| Foundation model + 20 demo fine-tune | 20 | 70–80% | 30 min GPU |
| Foundation model + 100 demo fine-tune | 100 | 80–90% | 2 hr GPU |
| Train from scratch (ACT) | 500 | 75–88% | 3–4 hr GPU |
| Train from scratch (Diffusion) | 1000 | 82–92% | 8–12 hr GPU |
The foundation model advantage is most pronounced in the low-data regime (under 100 demonstrations). With 20 demonstrations, a fine-tuned foundation model achieves success rates comparable to training ACT from scratch on 500 demonstrations. That is a 25× data efficiency improvement — which translates directly to cost and time savings.
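The trade-off in the table can be framed as a small budgeting exercise. The sketch below encodes the table's rows and picks the cheapest approach that meets a target success rate; the $6/demo collection rate is an assumed mid-range figure, not a quoted price.

```python
# Illustrative sketch: pick the cheapest training approach from the table above
# that meets a target success rate. Success ranges mirror the table; the
# per-demo cost is an assumed parameter.
from dataclasses import dataclass

@dataclass
class Approach:
    name: str
    demos: int
    success_low: float   # lower bound of typical success rate (fraction)
    success_high: float

APPROACHES = [
    Approach("foundation zero-shot", 0, 0.30, 0.65),
    Approach("foundation + 20-demo fine-tune", 20, 0.70, 0.80),
    Approach("foundation + 100-demo fine-tune", 100, 0.80, 0.90),
    Approach("ACT from scratch", 500, 0.75, 0.88),
    Approach("Diffusion from scratch", 1000, 0.82, 0.92),
]

def demo_cost(demos: int, dollars_per_demo: float = 6.0) -> float:
    """Data-collection cost; $6/demo is an assumed mid-range rate."""
    return demos * dollars_per_demo

def cheapest_reaching(target: float) -> Approach:
    """Cheapest approach whose *lower-bound* success rate meets the target."""
    viable = [a for a in APPROACHES if a.success_low >= target]
    return min(viable, key=lambda a: demo_cost(a.demos))

best = cheapest_reaching(0.70)  # the 20-demo fine-tune wins at this target
```

Note that the comparison uses lower bounds deliberately: for planning purposes, the pessimistic end of a published range is the safer estimate.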
Where Zero-Shot Actually Works Today
- Structured picking with clear object detection: Detic + GraspNet-style open-vocabulary detection + simple grasp planner works zero-shot for regular objects in organized bins. This is production-ready today for e-commerce and logistics.
- Language-conditioned navigation in known spaces: VLN (Vision-Language Navigation) models work zero-shot in spaces they were trained to understand, with good generalization to same-layout spaces in different buildings.
- Object recognition and sorting by category: Language-conditioned sorting ("put the red items in the left bin") works zero-shot for known categories with RGB classification.
Where It Does Not (Yet)
- Contact-rich manipulation: Peg-in-hole, snap connectors, folding fabric, unstacking cups — zero-shot success rates are 10-30% for current foundation models. Not reliable for production.
- Novel tool use: Using an unfamiliar tool (a can opener, a specific screwdriver) zero-shot is not yet reliable. Few-shot (20-50 demos) works.
- Dexterous manipulation: In-hand re-grasping, rotation of objects using finger control — outside current zero-shot capability for all production models.
Foundation Model Benchmark Results (2025-2026)
| Model | Provider | Zero-Shot (SimplerEnv) | 20-Shot Fine-Tune | Parameters |
|---|---|---|---|---|
| OpenVLA | Stanford/TRI | 48-62% | 72-80% | 7B |
| Octo | Berkeley | 35-55% | 65-78% | 93M |
| pi-0 | Physical Intelligence | 55-70% | 78-88% | 3B |
| RT-2-X | Google DeepMind | 50-65% | 75-85% | 55B |
| ACT (from scratch) | Stanford | N/A | 40-55% (20 demos) | 12M |
These numbers represent performance on simple manipulation benchmarks (pick-place, drawer opening, button pressing) in controlled lab environments. Real-world deployment numbers are typically 10-20 percentage points lower due to lighting variation, background clutter, and object diversity not represented in benchmark conditions.
Few-Shot Data Efficiency Curves
The most valuable insight for practitioners is how performance scales with the number of fine-tuning demonstrations. Based on published results and SVRC's internal evaluations:
- 5 demonstrations: Foundation models show a 5-15 percentage-point improvement over zero-shot. Training from scratch produces unreliable policies. The foundation model advantage is largest here -- roughly 10x more data-efficient than scratch training.
- 20 demonstrations: The sweet spot for foundation model fine-tuning. Most models achieve 70-80% of their final performance at this point. From scratch, ACT typically reaches 40-55%. The gap between pre-trained and scratch is 20-30 percentage points.
- 50 demonstrations: Foundation models approach their ceiling (80-88%). Scratch-trained models close the gap, reaching 60-75% with well-collected data. The cost difference is significant: 50 demos costs $200-500 at SVRC rates vs. the months of pre-training compute invested in the foundation model.
- 100 demonstrations: Foundation model fine-tuning reaches diminishing returns for most tasks. Scratch-trained models catch up further (75-85%). The practical question becomes whether the remaining performance gap justifies the complexity of using a 3-7B parameter model for inference.
- 500 demonstrations: Scratch-trained ACT and Diffusion Policy typically match or exceed fine-tuned foundation models. At this data volume, the advantage of pre-training is minimal for single-task deployment. Foundation models retain an advantage for multi-task generalization.
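The qualitative shape of these bullets is a pair of saturating curves: fine-tuning starts high and saturates early, scratch training starts near zero and saturates late, and they cross somewhere in the hundreds of demos. A saturating-exponential model makes this concrete; the parameter values below were chosen by hand to roughly echo the bullets and are illustrative, not fitted to real data.

```python
# Toy model of the data-efficiency curves above: success rate as a
# saturating exponential in the number of demonstrations. Parameters are
# hand-picked to roughly match the bullets, not fitted to real results.
import math

def success_rate(n_demos: int, s0: float, s_max: float, tau: float) -> float:
    """Starts at s0 with zero demos, asymptotes to s_max with scale tau."""
    return s_max - (s_max - s0) * math.exp(-n_demos / tau)

def finetune(n: int) -> float:
    # High starting point (zero-shot ~50%), fast saturation (assumed).
    return success_rate(n, s0=0.50, s_max=0.86, tau=18.0)

def scratch(n: int) -> float:
    # Near-zero start, slower saturation, slightly higher ceiling (assumed).
    return success_rate(n, s0=0.10, s_max=0.88, tau=40.0)
```

Under these assumed parameters, fine-tuning leads by 20-30 points at 20 demos, the gap narrows through 100, and scratch training edges ahead by 500, mirroring the crossover described above.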
When to Use Which Approach: Decision Matrix
| Scenario | Recommendation | Why |
|---|---|---|
| Quick feasibility test, simple task | Zero-shot with OpenVLA/pi-0 | No data cost; instant evaluation |
| New task, limited budget (<$2K data) | Foundation model + 20-50 demo fine-tune | Maximum performance per dollar |
| Production deployment, single task | ACT/Diffusion from scratch, 300-500 demos | Smaller model = faster inference, simpler deployment |
| Multi-task deployment (10+ tasks) | Foundation model + per-task fine-tuning | Shared backbone amortizes model cost |
| Precision task (sub-2mm tolerance) | Scratch Diffusion Policy, 500+ demos | Foundation models lack precision for tight tolerances |
| Edge compute (Jetson, no GPU server) | Small model from scratch (ACT 12M params) | 7B VLA models cannot run on edge devices |
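The decision matrix can be collapsed into a small rule chain. The helper below is a toy encoding of the table rows in priority order; the thresholds mirror the table but should be treated as heuristics, not hard cutoffs.

```python
# Toy encoding of the decision matrix above. Row order doubles as priority:
# hard constraints (feasibility test, edge compute, precision) are checked
# before budget-driven choices. Thresholds mirror the table; all heuristic.
def recommend(task_count: int, precision_mm: float, edge_only: bool,
              data_budget_usd: float, feasibility_test: bool = False) -> str:
    if feasibility_test:
        return "zero-shot foundation model"
    if edge_only:
        return "small model from scratch (ACT-class)"
    if precision_mm < 2.0:
        return "Diffusion Policy from scratch, 500+ demos"
    if task_count >= 10:
        return "foundation model + per-task fine-tuning"
    if data_budget_usd < 2000:
        return "foundation model + 20-50 demo fine-tune"
    return "ACT/Diffusion from scratch, 300-500 demos"

rec = recommend(task_count=1, precision_mm=5.0, edge_only=False,
                data_budget_usd=1500)
```

Placing the hard constraints first matters: an edge-compute deployment rules out 7B-parameter models regardless of how attractive their data efficiency is.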
The Embodiment Gap: Why Zero-Shot Is Harder Than It Looks
Zero-shot performance in NLP improves reliably with model scale: a 70B language model is consistently better at zero-shot text tasks than a 7B model. The same scaling relationship does not hold for robot foundation models, and understanding why is critical for setting realistic expectations.
The fundamental challenge is the embodiment gap: unlike text (which has a universal tokenization), robot actions are embodiment-specific. A 7-DOF joint velocity command for a Franka Research 3 is meaningless for a 6-DOF OpenArm or a mobile manipulator with a different kinematic chain. Current foundation models handle this through either action space normalization (mapping all robots to a shared 7-DOF representation, which loses information for robots with more or fewer DOF) or embodiment-specific output heads (which require at least a few demonstrations on the target embodiment to calibrate the output head).
This means that "zero-shot" on a new robot embodiment is functionally impossible with current architectures -- you always need at least a calibration step. The zero-shot capability that works today is zero-shot to new tasks on the same embodiment that the model was trained on. Transfer to a new embodiment is always few-shot at minimum.
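The action-space normalization scheme described above can be sketched in a few lines. This is an illustration of the padding/truncation idea, not any model's actual implementation: commands with fewer than 7 DOF are zero-padded (with a mask marking real entries), and commands with more DOF are truncated, which is exactly the information loss noted above.

```python
# Sketch of 7-DOF action-space normalization across embodiments.
# Pad short commands, truncate long ones; the mask records which slots
# carry real joint values. Illustrative, not any model's actual code.
import numpy as np

SHARED_DOF = 7  # shared action dimensionality

def to_shared_action(joint_cmd: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    action = np.zeros(SHARED_DOF, dtype=np.float32)
    mask = np.zeros(SHARED_DOF, dtype=bool)
    n = min(len(joint_cmd), SHARED_DOF)  # truncation point for >7-DOF robots
    action[:n] = joint_cmd[:n]
    mask[:n] = True
    return action, mask

a6, m6 = to_shared_action(np.ones(6))  # 6-DOF arm: last slot is padding
a9, m9 = to_shared_action(np.ones(9))  # 9-DOF robot: two joints dropped
```

The 9-DOF case makes the information loss visible: two joint commands simply never reach the model, which is why embodiment-specific output heads (and hence at least a few calibration demos) are the alternative.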
The perception gap is the second major challenge. Foundation models are trained on datasets collected in specific labs with specific cameras, lighting, and backgrounds. When deployed in a new environment with different visual conditions, the visual encoder's representations shift, degrading policy performance. This is why the published zero-shot numbers (collected in labs similar to the training data) are 10-20 percentage points higher than real-world numbers. Domain randomization during training and vision-language pre-training both help, but neither fully closes this gap today.
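Domain randomization of the kind mentioned above is, at its simplest, photometric jitter applied to training images so the visual encoder stops relying on one lab's lighting. A minimal sketch, with illustrative jitter ranges:

```python
# Minimal photometric domain randomization: random global brightness plus
# per-channel color gain. Ranges are illustrative, not tuned values.
import numpy as np

rng = np.random.default_rng(0)

def randomize_image(img: np.ndarray) -> np.ndarray:
    brightness = rng.uniform(0.7, 1.3)                  # global lighting shift
    channel_gain = rng.uniform(0.9, 1.1, size=(1, 1, 3))  # color temperature
    out = img.astype(np.float32) * brightness * channel_gain
    return np.clip(out, 0.0, 255.0).astype(np.uint8)

frame = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = randomize_image(frame)
```

Real training pipelines add geometric jitter, background swaps, and camera-pose perturbation on top of this, but even simple photometric jitter measurably narrows the lab-to-deployment gap.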
The action distribution gap is the third challenge. Different tasks have fundamentally different action distributions -- pick-and-place involves discrete grasp/release events, while wiping involves continuous contact maintenance. A foundation model trained primarily on pick-and-place data will have poor zero-shot performance on contact-rich tasks even if the visual understanding transfers perfectly. This is why task diversity in the pre-training data matters more than dataset size for zero-shot generalization.
Inference Cost and Latency Comparison
A factor that is often overlooked in the zero-shot vs. few-shot discussion: the inference cost of running foundation models in production. A 7B-parameter VLA model requires a dedicated GPU for inference (A10G minimum, ~$0.75/hour on cloud) and introduces 100-300ms latency per action. A 12M-parameter ACT model runs on a $200 Jetson Orin Nano at 5-15ms per action.
For deployment at scale (10+ robots), the GPU inference cost of foundation models can exceed $50,000/year -- potentially more than the cost of collecting enough data to train smaller task-specific models. This is the counterintuitive economic argument: investing in data collection (a one-time cost) can be cheaper than paying ongoing inference costs for foundation models.
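The break-even argument is easy to make concrete. In the sketch below, the $0.75/hour rate matches the A10G figure above; the always-on duty cycle, one-GPU-per-robot default, and $6/demo collection rate are assumptions you should replace with your own numbers.

```python
# Back-of-envelope break-even between foundation-model inference cost and
# one-time data collection. $0.75/hr matches the A10G figure in the text;
# duty cycle, GPU sharing, and per-demo cost are assumed parameters.
def annual_gpu_cost(robots: int, hourly_rate: float = 0.75,
                    hours_per_day: float = 24.0,
                    robots_per_gpu: int = 1) -> float:
    gpus = -(-robots // robots_per_gpu)  # ceiling division
    return gpus * hourly_rate * hours_per_day * 365

def breakeven_demos(robots: int, dollars_per_demo: float = 6.0) -> float:
    """Demos whose one-time collection cost equals one year of inference."""
    return annual_gpu_cost(robots) / dollars_per_demo

fleet_cost = annual_gpu_cost(10)       # 10 robots, one always-on GPU each
breakeven = breakeven_demos(10)        # demos that cost the same amount
```

Under these assumptions a 10-robot fleet pays roughly $65K/year for inference, enough to fund on the order of ten thousand demonstrations, which is far more than the 300-500 demos a scratch-trained single-task policy needs.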
Practical Guidance
Plan for 200-500 demonstrations for any new task, even with foundation models. Zero-shot performance is a bonus to be measured, not a baseline to be assumed. If zero-shot achieves 60%+ success on your task, consider yourself ahead of schedule. If it achieves 30%, proceed with your planned fine-tuning data collection.
The SVRC data services team can assess your specific task for zero-shot viability and recommend a realistic demonstration budget before you commit to a collection timeline.
Related Reading
- RL vs. Imitation Learning Decision Guide -- The broader training approach decision that precedes the zero/few-shot question
- Robot Data Collection Cost Breakdown -- Budget planning for the demos you will need for fine-tuning
- LeRobot Guide -- Training ACT and Diffusion Policy for few-shot and from-scratch approaches
- What Makes Good Robot Training Data? -- Maximizing the value of every demonstration in low-data regimes
- Embodied AI Explained Simply -- Foundation models and their role in physical AI
- SVRC Platform -- Dataset management and model training infrastructure
- Data Services -- Collecting the 20-500 demonstrations you need for fine-tuning