Across 1,110 real-robot trials on a Unitree G1, three pick-and-place tasks of increasing whole-body difficulty, and three VLA backbones (ψ0, π0.5, GR00T N1.6), policies fine-tuned purely on LEGS data match or exceed human teleoperation on every experiment. The same recorded motion re-renders under new scenes and objects at ~15× lower cost than re-collecting teleoperation demonstrations.
Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively to real humanoid loco-manipulation tasks.
We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera.
On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (ψ0, π0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos on every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer.
LEGS records humanoid motion independently of scene appearance, so the same auto-generated demonstrations can be re-rendered under new backgrounds and object meshes to augment training data for robustness to scene variations. Covering a new scene this way costs more than 15× less than teleoperation. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely.
The LEGS pipeline. A scene video and object photo are reconstructed into a 3DGS background and SAM3D meshes, which feed the LEGS simulator. The simulator decouples a visual frontend (3DGS and mesh compositor with color calibration) from a physics backend (MuJoCo with the low-level whole-body controller). A procedural generator produces labeled demonstrations, which are re-rendered under scene and object augmentations and then used to fine-tune a VLA backbone (ψ0, π0.5, GR00T N1.6) for real-robot deployment.
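To make the visual frontend concrete, the following is a minimal sketch of compositing a rendered mesh foreground over a 3DGS background image. It assumes NumPy float images in [0, 1] and straight (non-premultiplied) alpha. The function names are hypothetical, and the affine color map is only a stand-in for illustration: the two-stage calibration used by LEGS is not specified here, and applying the map to the background before compositing is likewise an assumption.

```python
import numpy as np


def apply_affine_color(image_rgb: np.ndarray, matrix: np.ndarray, offset: np.ndarray) -> np.ndarray:
    """Stand-in color correction (assumption, not the LEGS calibration).

    Maps every pixel c to matrix @ c + offset and clips the result to [0, 1].
    """
    return np.clip(image_rgb @ matrix.T + offset, 0.0, 1.0)


def composite_over(foreground_rgba: np.ndarray, background_rgb: np.ndarray) -> np.ndarray:
    """Alpha-composite a mesh render (H, W, 4) over a background render (H, W, 3)."""
    if foreground_rgba.shape[:2] != background_rgb.shape[:2]:
        raise ValueError("foreground and background must share height and width")
    alpha = foreground_rgba[..., 3:4]   # keep the last axis so it broadcasts over RGB
    rgb = foreground_rgba[..., :3]
    return alpha * rgb + (1.0 - alpha) * background_rgb


def render_frame(foreground_rgba, background_rgb, matrix, offset):
    """Color-correct the background, then place the mesh foreground on top."""
    calibrated = apply_affine_color(background_rgb, matrix, offset)
    return composite_over(foreground_rgba, calibrated)
```

With an identity matrix and zero offset, `render_frame` reduces to plain alpha-over compositing, which is a convenient sanity check before fitting any color correction.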
Each pick-and-place task decomposes into Walk → Pick → Place motion primitives. A procedural generator samples scene-level arguments under each randomized initial condition, executes the trajectory in MuJoCo with the whole-body controller, and saves only verified successful episodes.
Video panels: Task 3 (third-person view); Task 3 (egocentric view).
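The generate-and-filter loop described above can be sketched as follows. This is an illustration under stated assumptions, not the LEGS generator: `rollout` is a hypothetical callable standing in for the MuJoCo and whole-body-controller execution, `toy_rollout` is a dummy success check, and the sampling ranges reuse the ±5 cm, ±10 cm, and ±10° perturbations reported for evaluation, which may differ from the generator's actual ranges.

```python
import random
from dataclasses import dataclass

PRIMITIVES = ("walk", "pick", "place")


@dataclass
class InitialCondition:
    object_xy: tuple      # object offset in metres
    base_xy: tuple        # robot base offset in metres
    heading_deg: float    # robot heading offset in degrees


def sample_initial_condition(rng: random.Random) -> InitialCondition:
    """Draw one randomized initial condition (ranges are assumptions)."""
    return InitialCondition(
        object_xy=(rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05)),
        base_xy=(rng.uniform(-0.10, 0.10), rng.uniform(-0.10, 0.10)),
        heading_deg=rng.uniform(-10.0, 10.0),
    )


def generate_demos(num_episodes, rollout, seed=0, max_attempts=10_000):
    """Keep sampling and executing until enough verified successes are saved."""
    rng = random.Random(seed)
    kept, attempts = [], 0
    while len(kept) < num_episodes and attempts < max_attempts:
        attempts += 1
        condition = sample_initial_condition(rng)
        trajectory, success = rollout(condition, PRIMITIVES)
        if success:  # failed episodes are discarded, never saved
            kept.append({"condition": condition, "trajectory": trajectory})
    return kept


def toy_rollout(condition, primitives):
    """Dummy simulator: fails when the object is more than 4 cm off laterally."""
    trajectory = [(name, condition) for name in primitives]
    return trajectory, abs(condition.object_xy[0]) <= 0.04


demos = generate_demos(num_episodes=20, rollout=toy_rollout, seed=1)
```

Filtering on a verified success flag means every saved episode is a valid demonstration by construction, at the price of discarding some simulated rollouts.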
Policies fine-tuned on LEGS-generated data, deployed zero-shot on a Unitree G1 with a head-mounted Intel RealSense D435 (30 Hz RGB). Ten trials per (data, backbone, task) cell, with object positions perturbed by ±5 cm, robot heading by ±10°, and robot base by ±10 cm on loco-manipulation tasks.
Video panels: Task 1 (manipulation only); Task 2 (loco-manipulation); Task 3 (walk, pick, turn, place). Backbones shown: ψ0, π0.5, GR00T N1.6.
Real-robot end-task success across three tasks, three VLA backbones, and four data conditions — 1,110 trials total.
Q1. Can teleoperation-free synthetic data match human teleoperation for VLA fine-tuning?
Yes, on every (backbone, task) cell. LEGS (200) matches or exceeds Teleop (50) across all nine experiments. On the long-horizon Task 3, teleoperation collapses to 0/10 across all three backbones, whereas LEGS achieves up to 6/10.
Q2. Is the improvement attributable to dataset size?
No. At a budget-matched 50 demonstrations, LEGS (50) still matches or surpasses Teleop (50) on every experiment, isolating the gain to the data pipeline rather than its scale.
Q3. Does photorealistic rendering matter, or does mesh-only synthesis suffice?
Photorealism approximately doubles end-task success. Holding the pipeline fixed, the mesh-only SAM3D (200) baseline averages 33% TSR versus 67% for LEGS (200), with the gap concentrated at the close-range pick and place stages.
Q4. How efficiently can LEGS adapt to new appearance conditions?
About 15× cheaper than teleoperation, with task success retained. Each new appearance condition requires ~0.1 GPU-hr to re-render versus >1.5 operator-hr to re-teleoperate. Under the hardest object-and-scene shift, LEGS-AUG reaches 100%, 80%, and 40% on Tasks 1, 2, and 3 respectively, while both teleoperation and unaugmented LEGS fail.
| Data condition | Task 1, ψ0 | Task 1, π0.5 | Task 1, GR00T | Task 2, ψ0 | Task 2, π0.5 | Task 2, GR00T | Task 3, ψ0 | Task 3, π0.5 | Task 3, GR00T |
|---|---|---|---|---|---|---|---|---|---|
| Teleop (50) | 8/10 | 2/10 | 4/10 | 6/10 | 0/10 | 2/10 | 0/10 | 0/10 | 0/10 |
| SAM3D (200) | 7/10 | 1/10 | 6/10 | 4/10 | 3/10 | 5/10 | 1/10 | 0/10 | 3/10 |
| LEGS (50) | 8/10 | 5/10 | **9/10** | 8/10 | 4/10 | 7/10 | 3/10 | 1/10 | 5/10 |
| LEGS (200) | **10/10** | **6/10** | **9/10** | **9/10** | **5/10** | **8/10** | **5/10** | **2/10** | **6/10** |
End-task success out of 10 trials per data condition, backbone, and task. Best per (backbone, task) in bold. LEGS (200) is the main dataset; LEGS (50) is a 50-episode subsample for a budget-matched comparison against the 50-demo teleop baseline.
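The aggregate figures quoted in Q1 to Q3 can be recomputed from the table. The sketch below transcribes the success counts (in column order: Task 1, Task 2, Task 3, each with ψ0, π0.5, GR00T) and averages them over all nine cells. The helper names are illustrative only.

```python
TRIALS_PER_CELL = 10

# Successes out of 10, transcribed from the table above.
RESULTS = {
    "Teleop (50)": [8, 2, 4, 6, 0, 2, 0, 0, 0],
    "SAM3D (200)": [7, 1, 6, 4, 3, 5, 1, 0, 3],
    "LEGS (50)": [8, 5, 9, 8, 4, 7, 3, 1, 5],
    "LEGS (200)": [10, 6, 9, 9, 5, 8, 5, 2, 6],
}


def mean_success_rate(condition: str) -> float:
    """Average task success rate over all (backbone, task) cells."""
    successes = RESULTS[condition]
    return sum(successes) / (TRIALS_PER_CELL * len(successes))


def matches_or_exceeds(a: str, b: str) -> bool:
    """True if condition a is at least as good as condition b in every cell."""
    return all(x >= y for x, y in zip(RESULTS[a], RESULTS[b]))


for name in RESULTS:
    print(f"{name}: {100 * mean_success_rate(name):.0f}%")
```

SAM3D (200) totals 30 successes in 90 trials and LEGS (200) totals 60 in 90, which gives the 33% and 67% averages cited in Q3.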
Task 3 deployments from the same VLA backbone fine-tuned on four data conditions. LEGS matches or exceeds teleoperation and outperforms the mesh-only SAM3D baseline, identifying photorealistic rendering as the key enabler of synthetic-data transfer.
Video panels: LEGS (ours); LEGS-AUG (ours); SAM3D (mesh-only); Teleoperation.
Task 3 deployments under lateral perturbations of the orange's initial position pushed beyond the ±5 cm training distribution, probing out-of-distribution robustness.
Video panels: lateral perturbation, far left (out-of-distribution); lateral perturbation, far right (out-of-distribution).
Because foreground motion is recorded independently of scene appearance, each additional appearance condition is generated by re-rendering the existing motion at ~0.1 GPU-hr versus >1.5 operator-hr for teleoperation (≈15× cheaper). Policies trained on re-rendered LEGS-AUG data retain task success under scene-only, object-only, and combined shifts.
Video panels: scene shift; object shift; scene + object shift.
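Because motion and appearance are decoupled, augmentation amounts to pairing each recorded motion with each new appearance condition. The sketch below illustrates that pairing and the cost arithmetic behind the ≈15× figure. The episode, scene, and object names are hypothetical; the per-condition costs are the approximate values stated above, with 1.5 operator-hr being a lower bound. The ratio compares GPU-hours to operator-hours directly, as the text does.

```python
from itertools import product

GPU_HOURS_PER_CONDITION = 0.1        # "~0.1 GPU-hr" to re-render
OPERATOR_HOURS_PER_CONDITION = 1.5   # ">1.5 operator-hr" to re-teleoperate (lower bound)


def rerender_jobs(motion_ids, scenes, object_meshes):
    """Pair every recorded motion with every (scene, object) appearance."""
    return [
        {"motion": m, "scene": s, "object": o}
        for m, s, o in product(motion_ids, scenes, object_meshes)
    ]


def adaptation_cost(num_conditions: int):
    """Return (GPU-hours to re-render, operator-hours to re-teleoperate)."""
    return (
        num_conditions * GPU_HOURS_PER_CONDITION,
        num_conditions * OPERATOR_HOURS_PER_CONDITION,
    )


# Two recorded motions, two scenes, two objects: four appearance conditions.
jobs = rerender_jobs(["ep_000", "ep_001"], ["lab", "kitchen"], ["orange", "apple"])
gpu_hours, operator_hours = adaptation_cost(num_conditions=4)
print(len(jobs), "re-render jobs from 2 recorded motions")
print(f"{gpu_hours:.1f} GPU-hr vs {operator_hours:.1f} operator-hr "
      f"(ratio {operator_hours / gpu_hours:.0f}x)")
```

No new motion is recorded in this loop; only the rendering is repeated, which is why the per-condition cost stays at the re-rendering figure.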
@article{legs2026,
title = {LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World},
author = {Anonymous},
journal = {arXiv preprint},
year = {2026},
}