Stack two world models at different time-scales — and a robot that couldn't pick up a cup from a single goal image (0%) now succeeds 70% of the time. No new training data. No new policy. Just a smarter way to plan.
FAIR at Meta · NYU · Mila · Brown
In Plain English
A modern world-model robot “sees” the world and the goal, imagines a few hundred short action sequences, and picks the one that ends up closest to the goal. On simple tasks — push a thing forward, slide a drawer in a straight line — this works. On anything that requires going the long way around, it collapses. Pick-and-place is the cleanest example: to put a cup somewhere new, the gripper has to first move down to the cup — which, in the moment, looks like moving away from the goal.
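To make the sample-and-score loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "world model" is a toy `rollout` function that just adds up 2-D displacements, the latent distance is plain Euclidean distance, and `plan_by_sampling` and its parameters are hypothetical names, not anything taken from the paper.

```python
import random


def rollout(state, actions):
    """Toy stand-in for a world model: the state is a 2-D point, each action a displacement."""
    x, y = state
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)


def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def plan_by_sampling(state, goal, horizon=5, num_samples=300, step=0.2, seed=0):
    """Imagine num_samples random action sequences and keep the one whose
    predicted final state lands closest to the goal."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), None
    for _ in range(num_samples):
        actions = [(rng.uniform(-step, step), rng.uniform(-step, step))
                   for _ in range(horizon)]
        cost = distance(rollout(state, actions), goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


actions, cost = plan_by_sampling(state=(0.0, 0.0), goal=(1.0, 0.0))
```

Note what the score looks at: only the predicted end point. Nothing in it rewards a sequence that first moves away from the goal, which is exactly the weakness the cup example exposes.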
The paper's fix is almost embarrassingly clean. Train two latent world models on the same data — one that reasons step-by-step (low level), one that reasons in big abstract leaps (high level). Have them share the same internal language. Then, at inference time, let the high-level model sketch out a coarse plan and hand off intermediate “subgoals” for the low-level model to chase.
No new training data. No new policy network. No reward function. It's a planning trick that bolts onto existing world models like V-JEPA 2, DINO-WM, and PLDM — and it beats vision-language-action models trained on 77× more robot data.
Why a single-level planner can't pick up a cup from a single goal image — and what changes when you give it a hierarchy. Both panels show the same task and the same world model. Only the planning differs.
Flat planner (left). Optimises a single objective: minimise the distance from the current latent state to the goal latent state. The greedy direction is straight to the target zone, so the gripper drifts there, hovers, and never lifts the cup. Cup stays put. Goal not reached.
Hierarchical planner (right). A high-level world model first proposes a subgoal in the same latent space: “be over the cup, ready to grasp.” The low-level planner then optimises primitive actions to reach that subgoal. Once reached, the high level re-plans, this time toward the target zone with the cup in hand.
What you're seeing
Left. The flat planner has one objective in its head: the goal state (cup at the target zone). Every candidate action sequence it samples is scored on how close the predicted final state lands to that goal. The straight-line moves get the best scores. The arm drifts to the target zone and hovers. The cup, which the arm never went near, is exactly where it started. Failure.
Right. The hierarchical planner first asks a different question: “what's a reasonable intermediate state on the way to the goal?” Its high-level world model proposes a handful of candidate subgoals and the best one — “gripper hovering over the cup” — gets handed to the low-level planner as its new target. The low-level planner doesn't know or care about the eventual goal; it just gets to the subgoal. Once it arrives, the high level re-plans, this time toward the target zone with the cup attached.
That's the whole idea. The same world model, the same actions, the same scene — but the planning is structured to make “detours” cheap and natural rather than impossible.
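The contrast between the two panels can be reproduced in a toy that fits on one screen. This is a sketch under loud assumptions, not the paper's method: the "latent state" is a pair of integers (gripper cell, cup cell) on a 1-D track, the cup attaches automatically when the gripper shares its cell, the search is exhaustive rather than sampled, and every name (`step`, `macro_step`, `first_action`, `run_flat`, `run_hierarchical`) is hypothetical. The low-level planner only looks two primitive actions ahead; the high-level planner looks two abstract leaps ahead, each leap standing for up to five unit moves.

```python
from itertools import product

GOAL = (3, 3)    # (gripper cell, cup cell): cup delivered to the target zone
START = (0, -2)  # gripper at 0, cup two cells behind it


def step(state, a):
    """Toy 1-D dynamics: the cup travels with the gripper once they share a cell."""
    gripper, cup = state
    if gripper == cup:
        cup += a
    return (gripper + a, cup)


def macro_step(state, m):
    """Toy high-level model: one abstract action stands for |m| unit moves."""
    unit = 1 if m > 0 else -1
    for _ in range(abs(m)):
        state = step(state, unit)
    return state


def distance(state, target):
    return abs(state[0] - target[0]) + abs(state[1] - target[1])


def first_action(state, target, choices, model, horizon=2):
    """Exhaustive search. The score is the distance of the final predicted
    state to the target; earlier predicted states only break ties."""
    def score(seq):
        s, costs = state, []
        for a in seq:
            s = model(s, a)
            costs.append(distance(s, target))
        return tuple(reversed(costs))
    return min(product(choices, repeat=horizon), key=score)[0]


def run_flat(state, goal, steps=20):
    for _ in range(steps):
        state = step(state, first_action(state, goal, (-1, 0, 1), step))
    return state


def run_hierarchical(state, goal, steps=20):
    subgoal = None
    for _ in range(steps):
        if state == goal:
            break
        if subgoal is None or state == subgoal:
            # High level: plan in big leaps, keep the first predicted state as the subgoal.
            m = first_action(state, goal, range(-5, 6), macro_step)
            subgoal = macro_step(state, m)
        # Low level: chase the subgoal only, one primitive action at a time.
        state = step(state, first_action(state, subgoal, (-1, 0, 1), step))
    return state


flat_result = run_flat(START, GOAL)                  # (3, -2): arm at target, cup left behind
hierarchical_result = run_hierarchical(START, GOAL)  # (3, 3): cup delivered
```

Both runs use the same `step` dynamics and the same low-level search. The only difference is the target handed to that search: the final goal in the flat case, a subgoal proposed in the same state space in the hierarchical case.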
HWM's structural trick: train two world models that operate at different time-scales but share the same encoder — and therefore the same latent vocabulary. Anything one model predicts is something the other model can act on.
Why this is the whole game
Hierarchical control isn't new — robotics has tried for decades to stack a “manager” on top of a “worker.” The classical problem is the handoff: the manager speaks in goals (“move to the kitchen”), the worker speaks in actions (“rotate joint 3 by 0.04 rad”), and you need a glue layer — an inverse model, a skill library, a goal-conditioned policy — to translate between them.
HWM eliminates the glue. Both world models map into the same latent space, so a subgoal predicted by the high-level model is, by construction, a valid target for the low-level model. There's no translation step that can go wrong. The shared encoder is the entire interface.
What the agent actually does at every step: the four stages of the hierarchical MPC loop.
The agent receives two images: the current observation s₁ and the final goal s_g. The shared encoder turns both into latent vectors — z₁ and z_g — that live in the same space.
From here on, everything is a latent. The model never reasons about pixels again — too expensive, too noisy.
In Plain English
The whole loop runs once per primitive action. The agent doesn't commit to a long plan and march through it; it plans, executes one action, and re-plans. That's why error doesn't compound: a wrong prediction at step 5 of a 16-step plan never gets a chance to matter, because the plan is thrown out and rebuilt as soon as the first action has been executed.
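A small numerical sketch shows why re-planning every step keeps prediction error from piling up. The setup is an assumption chosen for clarity, not the paper's experiment: a 1-D state, a planner whose model believes each action moves the state by exactly `a`, and true dynamics in which every move falls 20% short. All function names are hypothetical.

```python
def clip(value, low=-1.0, high=1.0):
    return max(low, min(high, value))


def model_step(x, a):
    """What the planner believes: the action moves the state by exactly a."""
    return x + a


def true_step(x, a):
    """What actually happens in this toy: every move falls 20% short."""
    return x + 0.8 * a


def plan(x, goal, horizon=16):
    """Plan under the (wrong) model: head for the goal at up to one unit per step."""
    actions = []
    for _ in range(horizon):
        a = clip(goal - x)
        actions.append(a)
        x = model_step(x, a)
    return actions


def open_loop(x, goal, horizon=16):
    """Plan once, then execute the whole plan blindly."""
    for a in plan(x, goal, horizon):
        x = true_step(x, a)
    return x


def receding_horizon(x, goal, horizon=16):
    """Re-plan before every action and execute only the first action of each plan."""
    for _ in range(horizon):
        a = plan(x, goal, horizon)[0]
        x = true_step(x, a)
    return x


goal = 10.0
open_loop_error = abs(goal - open_loop(0.0, goal))          # about 2.0
replanned_error = abs(goal - receding_horizon(0.0, goal))   # below 0.01
```

The model is equally wrong in both runs. The open-loop run inherits the accumulated shortfall of ten mispredicted moves, while the re-planning run corrects for it each time it observes where it actually ended up.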
The cost of all this planning is real, but the math works out. The high-level model covers ground in big leaps (so fewer total prediction steps), and the low-level model only has to handle short, easy stretches between subgoals — which it's good at. Total compute: 3–4× less than the flat planner trying to brute-force a long horizon directly.
How does the high-level model represent a multi-step move as a single action? The naïve answer — just store the net displacement — throws away too much. The paper learns a richer alternative.
A learned 4-dim latent that captures the structure of the whole sequence — including the non-greedy moves a delta-pose would erase.
| High-level action | Cosine similarity ↑ | L₁ distance ↓ |
|---|---|---|
| Delta-pose only | 0.80 ± 0.02 | 0.088 ± 0.005 |
| Learned latent action | 0.88 ± 0.03 | 0.080 ± 0.002 |
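For readers who want the two column headers spelled out, here is a minimal sketch of both metrics plus the relative gap between the table's rows. The definitions are the standard ones and are an assumption: this page does not say which vectors the paper compares or whether the L₁ figure is a mean or a sum. The `planned` and `expert` vectors are made up purely to exercise the functions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Mean absolute difference (assumed; the table may use a sum instead)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


# Hypothetical planned vs. expert vectors, only to show the functions in use.
planned = [1.0, 0.0, 1.0]
expert = [1.0, 1.0, 1.0]
example_cosine = cosine_similarity(planned, expert)
example_l1 = l1_distance(planned, expert)

# Relative gaps between the two rows of the table above.
cosine_gain = (0.88 - 0.80) / 0.80       # about 0.10: 10% higher similarity
l1_reduction = (0.088 - 0.080) / 0.088   # about 0.09: 9% lower distance
```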
In Plain English
Imagine summarising the move “walk to the fridge, open it, grab milk, close it, walk back” as a single tag. The simple summary — “I ended up where I started” — is technically true but useless: it erases the entire reason for the trip. That's what the delta-pose representation does to a pick-and-place sequence.
The learned latent action is a richer summary: four numbers that compress the shape of the whole sequence, not just its endpoints. Empirically it produces high-level plans that align about 10% better with expert human behaviour on the same task (cosine similarity 0.88 versus 0.80 for delta-pose in the table above).
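The fridge analogy has a direct numerical counterpart. In the sketch below, which uses hypothetical integer moves and a hypothetical `delta_pose` helper, a reach-down-and-lift sequence and a do-nothing sequence collapse to the same net displacement. It only illustrates what the endpoint summary loses; it does not implement the paper's learned latent action.

```python
def delta_pose(moves):
    """Net displacement of a sequence of (dx, dy, dz) moves: the endpoint summary."""
    return tuple(sum(axis) for axis in zip(*moves))


# Hypothetical gripper moves in integer grid units.
reach_down_and_back = [(0, 0, -3), (0, 0, 3)]   # descend to grasp, then lift
do_nothing = [(0, 0, 0), (0, 0, 0)]             # hover in place

same_summary = delta_pose(reach_down_and_back) == delta_pose(do_nothing)  # True
same_motion = reach_down_and_back == do_nothing                           # False
```

A high-level model fed only `delta_pose` cannot tell these two sequences apart, even though only one of them ends with a cup in the gripper.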
HWM is a planning abstraction, not an architecture. It bolts on top of existing latent world models without retraining them. Same backbones, same data — just a smarter planner.
From a single goal image, on a real 7-DoF arm. The flat planner can't solve this without manually provided intermediate subgoals.
Also beats π₀.₅-DROID (68%), trained on ~77× more robot data.
Fine-grained manipulation across a long horizon. Flat planners collapse as the horizon grows; HWM stays robust.
DINO-WM baseline at the original 25-step horizon: 84% → 89% with HWM
Out-of-distribution navigation. HWM more than doubles success on hard, unseen layouts — and is 4× cheaper at planning time.
Outperforms goal-conditioned and zero-shot RL baselines (HIQL, HILP).
Why this generalises
VJEPA2-AC, DINO-WM, and PLDM are three different latent world models built by different teams for different domains. The fact that HWM lifts results on all three — by 39–70 percentage points — is the evidence that this isn't a dataset-specific trick. It's a property of how to plan, not what to plan with.
The pick-and-place result deserves a moment. It's a real robot, not a simulator. The goal is a single photograph of the desired final configuration. The agent has never seen this exact object-target pair before. It still solves the task 70% of the time — and it beats two state-of-the-art vision-language-action models that were trained on roughly 77 times more robotic interaction data.
The hierarchical planner doesn't just succeed more often — it succeeds with less compute per planning step. Higher and to the left is better.
The HWM curve sits above and to the left of the flat planner at every operating point: the same or better success with 3–4× less compute. At the flat planner's peak budget, HWM does its job in roughly a third of the time.
Why this matters in deployment
A planning algorithm that's 30% more accurate but takes 10× longer per step isn't useful on a real robot: by the time the plan is ready, the world has moved on. HWM moves in the opposite direction: better and faster. That's what makes it a deployable abstraction rather than a research curiosity.
HWM is, structurally, a small idea: train a second world model at a longer time-scale, share the latent space, plan over both. But the consequences are outsized — because the failure mode it fixes (non-greedy tasks) is the failure mode that has kept latent-world-model robots from doing useful, long-horizon manipulation in the real world.
Crucially, HWM is a planning-time intervention. It needs no new training data, no reward function, no task-specific fine-tuning, no policy network. The underlying world models — V-JEPA 2, DINO-WM, PLDM — are unchanged. That makes it the kind of result that other labs can adopt next week, on whatever latent world model they've already built.
The headline benchmark — beating vision-language-action models trained on 77× more robotic data — is the empirical receipt for a deeper claim: that the bottleneck in modern robotics isn't the amount of data you've scraped from the internet, but the structure of how a model uses what it already knows.
Read alongside the lab's other recent work, a pattern emerges. LeWorldModel showed how to train a latent world model cleanly from raw pixels — strip the engineering down to two losses. V-JEPA 2.1 showed how to scale latent representations to dense, feature-rich video. HWM shows how to plan with whatever latent world model you have, on whatever task you point it at — without retraining anything.
Three independent threads, one direction of travel: world models become the substrate, planning becomes the lever. The future isn't a single monolithic foundation model. It's a stack of composable pieces that any team can mix and extend.
@article{zhang2026hwm,
  title   = {Hierarchical Planning with Latent World Models},
  author  = {Zhang, Wancong and Terver, Basile and Zholus, Artem
             and Chitnis, Soham and Sutaria, Harsh and Assran, Mido
             and Balestriero, Randall and Bar, Amir and Bardes, Adrien
             and LeCun, Yann and Ballas, Nicolas},
  journal = {arXiv:2604.03208},
  year    = {2026}
}