LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

0 teleoperation demos needed

1,110 real-robot trials on a Unitree G1

9/9 experiments match or beat human teleop

~15× cheaper adaptation to new scenes & objects

Three pick-and-place tasks of increasing whole-body difficulty × three VLA backbones (ψ₀, π_0.5, GR00T N1.6).

Collecting teleoperation demonstrations for humanoids is slow and expensive — and simulator-trained VLA policies have, until now, failed to transfer to real humanoid loco-manipulation. What if a phone scan is all you need? LEGS composites the robot and objects over a photorealistic 3D Gaussian Splatting background inside MuJoCo, and procedurally generates labeled demonstrations — no teleoperation, no seed demos, no human video. We find that LEGS:

Matches or beats human teleoperation on all nine (backbone, task) experiments — zero-shot on a real Unitree G1, across ψ₀, π_0.5, and GR00T N1.6.
Owes the gain to photorealism, not dataset size: end-task success improves 1.6×–3.25× over mesh-only simulation across backbones.
Re-renders one motion dataset to new scenes, objects, and prompts at ~15× lower cost than re-teleoperating — retaining success under appearance shifts where every default-only baseline collapses to 0–1/10.

Method

A phone scan in. A humanoid skill out. Everything in between is synthesized.

1. Capture: handheld scene video + one photo per object
2. Reconstruct: 3DGS background + SAM3D object meshes
3. Simulate & Generate: procedural demos in MuJoCo physics ⊕ calibrated 3DGS render
4. Deploy: fine-tune a VLA → zero-shot on Unitree G1
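The compositing in the Simulate & Generate step can be illustrated with the standard "over" operator. This is only a minimal sketch, not the LEGS implementation: it assumes the simulator provides a foreground render of the robot and objects plus a per-pixel coverage mask, and that the 3DGS background is rendered from the same camera pose. The function name `composite_over` and the nested-list image layout are hypothetical.

```python
def composite_over(fg_rgb, fg_alpha, bg_rgb):
    """Blend a foreground render over a background, pixel by pixel.

    fg_rgb, bg_rgb: rows of (r, g, b) tuples with values in [0, 1].
    fg_alpha: rows of coverage values in [0, 1]; 1 means the robot or
    object fully covers the pixel (assumed to come from the simulator).
    """
    out = []
    for fg_row, a_row, bg_row in zip(fg_rgb, fg_alpha, bg_rgb):
        row = []
        for fg_px, a, bg_px in zip(fg_row, a_row, bg_row):
            # Standard "over" operator: a * foreground + (1 - a) * background.
            row.append(tuple(a * f + (1.0 - a) * b for f, b in zip(fg_px, bg_px)))
        out.append(row)
    return out


# Toy 1x2 frame: left pixel covered by the (red) foreground, right pixel not.
fg = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
alpha = [[1.0, 0.0]]
bg = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
frame = composite_over(fg, alpha, bg)
```

In the toy frame, the covered pixel keeps the foreground color and the uncovered pixel shows the background, which is the behavior a robot-over-scan composite needs.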


Two-stage color calibration

A deterministic two-stage calibration aligns the render to the robot's deployment camera — the first stage calibrates the object mesh, and the second is applied to both the mesh and the 3DGS background.
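The page does not state the functional form of either calibration stage, so the following is only a sketch under a loud assumption: each stage is modeled as a per-channel gain and offset, fitted in closed form by least squares from paired rendered and real color samples. The names `fit_gain_offset`, `apply_stage`, and `calibrate` are hypothetical. What the sketch does take from the text is the ordering: the first stage touches mesh pixels only, and the second is applied to both mesh and 3DGS background pixels.

```python
def fit_gain_offset(rendered, real):
    """Closed-form least-squares fit of real ≈ gain * rendered + offset.

    rendered, real: paired samples of one color channel in [0, 1].
    The fit is deterministic: the same samples always give the same result.
    """
    n = len(rendered)
    mean_x = sum(rendered) / n
    mean_y = sum(real) / n
    var_x = sum((x - mean_x) ** 2 for x in rendered)
    if var_x == 0:
        raise ValueError("rendered samples must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rendered, real))
    gain = cov_xy / var_x
    return gain, mean_y - gain * mean_x


def apply_stage(value, params):
    """Apply one (gain, offset) stage and clamp to the valid range."""
    gain, offset = params
    return min(1.0, max(0.0, gain * value + offset))


def calibrate(value, is_mesh, mesh_stage, global_stage):
    """Stage 1 on mesh pixels only, then stage 2 on every pixel."""
    if is_mesh:
        value = apply_stage(value, mesh_stage)
    return apply_stage(value, global_stage)


# Toy samples (made up for illustration): the real camera compresses contrast.
mesh_stage = fit_gain_offset([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

A real pipeline would fit one such transform per channel, or a full color matrix. The gain/offset form is chosen here only because it is the simplest deterministic fit.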

[Figure: two examples, each shown at four stages: raw 3DGS + mesh, mesh calibrated, mesh + 3DGS calibrated, and the real camera image.]

One episode, many appearances

One recorded episode stores motion only, independent of appearance. It can be re-rendered with a new 3DGS background, new objects, and new prompts at ~0.1 GPU-hr per condition.

[Figure: three re-rendered examples, each shown in wood, blue, and white variants.]

Real-Robot Deployment

Policies fine-tuned on LEGS-generated data, deployed zero-shot on a Unitree G1.

Three tasks of increasing difficulty

Task 1 — manipulation (ψ₀)

Task 2 — loco-manipulation (ψ₀)

Task 3 — long-horizon (GR00T N1.6)

Task 3 across three VLA backbones

GR00T N1.6

ψ₀

π_0.5

LEGS-AUG: Appearance Randomization & Robustness

Motion is recorded independently of appearance, so each episode re-renders under new objects and backgrounds at ~0.1 GPU-hr versus >1.5 operator-hr for re-teleoperation (≈15× cheaper).
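The ≈15× figure follows directly from the two per-condition costs quoted above. The short calculation below makes the arithmetic explicit. The function name `adaptation_cost` is hypothetical, and the comparison treats one GPU-hour and one operator-hour as comparable units, as the page itself does.

```python
GPU_HR_PER_CONDITION = 0.1       # from the text: ~0.1 GPU-hr to re-render
OPERATOR_HR_PER_CONDITION = 1.5  # from the text: >1.5 operator-hr, a lower bound


def adaptation_cost(num_conditions):
    """Return (re-render hours, re-teleop hours, cost ratio) for N conditions."""
    rerender = num_conditions * GPU_HR_PER_CONDITION
    reteleop = num_conditions * OPERATOR_HR_PER_CONDITION
    return rerender, reteleop, reteleop / rerender


rerender_hr, reteleop_hr, ratio = adaptation_cost(3)
print(f"{rerender_hr:.1f} GPU-hr vs {reteleop_hr:.1f} operator-hr ({ratio:.0f}x)")
```

Because the teleoperation cost is stated as a lower bound, the ratio is at least about 15× for any number of conditions.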

Without re-rendering: the default-only policy collapses

Even on Task 1 — the simplest, stationary pick-and-place — a policy trained only on the default demonstrations (orange→plate on a wooden table) collapses once the objects, scene, or prompt change.

[Videos, shown at 2× speed: three failure cases.]

Scene shift: “place the orange on the plate” (failure)
Object shift: “place the apple in the box” (failure)
Scene + object shift: “place the apple in the box” (failure)

With LEGS re-rendering: zero-shot under appearance shift

[Videos: three success cases.]

Scene shift: “pick the orange, turn right, place it on the plate” (success)
Object shift: “pick the apple, turn right, and put it in the box” (success)
Scene + object shift: “pick the apple, turn right, and put it in the box” (success)

Out-of-distribution object poses

Task 3 with the orange pushed beyond the training distribution.

[Videos: two OOD probes, with the orange placed far left and far right of the training distribution.]

Key Results

Real-robot end-task success across three tasks, three VLA backbones, and four data conditions — 1,110 trials total.

LEGS (200) is best or tied on every task and every backbone

Even at the same data budget, LEGS (50) beats Teleop (50).

[Figure: grouped bar charts, one panel per backbone (ψ₀, GR00T N1.6, π_0.5), showing results on T1, T2, and T3 with a y-axis from 0 to 10. Legend: Teleop (50), SAM3D (200), LEGS (50), LEGS (200) (ours).]

Under the hardest shift (objects + scene), re-rendering wins

(a) Photorealism beats mesh-only

Task      SAM3D-aug (200)    LEGS-aug (200) (ours)
Task 1    60%                100%
Task 2    50%                80%
Task 3    20%                40%

(b) Augmentation beats scale

Task      LEGS (200), default-only    LEGS-aug (50)
Task 1    10%                         50%
Task 2    10%                         40%
Task 3    10%                         30%
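As a quick reading aid, the per-task improvement factors implied by panels (a) and (b) can be computed from the percentages above. The dictionary layout and the helper name `gain_ratios` are illustrative only. These per-task ratios under the hardest shift are a different quantity from the 1.6×–3.25× backbone-level figure quoted elsewhere on the page.

```python
# Success rates (%) under the hardest shift, copied from panels (a) and (b).
# Each entry is (baseline, LEGS-aug).
panel_a = {"Task 1": (60, 100), "Task 2": (50, 80), "Task 3": (20, 40)}  # vs SAM3D-aug (200)
panel_b = {"Task 1": (10, 50), "Task 2": (10, 40), "Task 3": (10, 30)}   # vs LEGS (200), default-only


def gain_ratios(panel):
    """Map each task to the ratio of LEGS-aug success over the baseline."""
    return {task: ours / base for task, (base, ours) in panel.items()}


ratios_a = gain_ratios(panel_a)
ratios_b = gain_ratios(panel_b)
for task in panel_a:
    print(f"{task}: (a) {ratios_a[task]:.2f}x, (b) {ratios_b[task]:.2f}x")
```

The panel (b) ratios are larger than those in panel (a) even though LEGS-aug there uses only 50 demonstrations, which is the sense in which augmentation beats scale.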

Q1. Can teleoperation-free synthetic data match human teleoperation for VLA fine-tuning?

Yes — on every (backbone, task) cell. LEGS (200) matches or exceeds Teleop (50) across all nine experiments. On the long-horizon Task 3, teleoperation collapses to 0/10 across all three backbones, whereas LEGS achieves up to 6/10.

Q2. Is the improvement attributable to dataset size?

No. At a budget-matched 50 demonstrations, LEGS (50) still matches or surpasses Teleop (50) on every experiment, isolating the gain to the data pipeline rather than its scale.

Q3. Does photorealistic rendering matter, or does mesh-only synthesis suffice?

Photorealism improves end-task success by 1.6×–3.25× across the three VLA backbones. Holding the pipeline fixed, LEGS (200) beats the mesh-only SAM3D (200) baseline on all nine (backbone, task) experiments.

Q4. How efficiently can LEGS adapt to new appearance conditions?

~15× cheaper than teleoperation, with task success retained. Each new appearance condition requires ~0.1 GPU-hr to re-render versus >1.5 operator-hr to re-teleoperate. Under the hardest object-and-scene shift, LEGS-AUG reaches 100 / 80 / 40% on Tasks 1–3, while both teleoperation and unaugmented LEGS fail (0–10%).

BibTeX

@article{kim2026legs,
  title   = {LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World},
  author  = {Kim, Hojune and Chen, Timothy and Sun, Jiankai and Osterberg, Lars W. and Chen, Qianzhong and Wang, Ke and Schwager, Mac},
  journal = {arXiv preprint arXiv:2606.01458},
  year    = {2026}
}
