The first Joint Embedding Predictive Architecture (JEPA) that trains stably end-to-end from raw pixels — using only two loss terms.
Mila · NYU · Meta FAIR

Live planning demos across all four evaluation environments
In Plain English
Think of world models like a flight simulator for AI. Instead of learning from text the way ChatGPT does, a world model tries to build an internal mental picture of how things work — objects move, gravity pulls, a ball bounces off a wall. The idea is that an AI with this kind of understanding could plan ahead, anticipate consequences, and eventually interact with the physical world in ways that today's chatbots simply can't.
The problem? The leading approach — JEPA (Joint Embedding Predictive Architecture) — has a nasty habit of collapsing during training. Imagine trying to teach someone to draw, but every time they pick up the pencil, they just scribble a single dot and say “done.” That's essentially what happens: the model finds a shortcut where it maps every situation to the same meaningless representation. Loss goes to zero. The model learns nothing.
To prevent this, researchers piled on fixes — freezing parts of the model, adding extra training objectives, pre-training components separately. It worked, sort of, but the result was fragile, expensive, and hard to reproduce.
LeWorldModel takes a different approach: instead of adding more complexity, it strips the problem down to its mathematical core. The result is a system that trains stably from raw pixels using just two simple objectives — and runs on a single GPU in a few hours.
LeWM follows the JEPA paradigm: predict in latent space, not pixel space. This avoids wasting model capacity on unpredictable pixel-level details like lighting and texture.
Why the Architecture Matters Beyond the Lab
Previous world models had a dependency problem. They relied on massive pre-trained vision models just to get started — like needing a fully equipped kitchen before you can boil an egg. That made them expensive, hard to customize, and nearly impossible for smaller teams to work with.
LeWorldModel trains everything from scratch. The encoder that processes visual information and the predictor that imagines what happens next all learn together, end-to-end, from raw pixels. No pre-trained components required. This is a meaningful step toward world models that could be tailored to specific industries — factory floors, surgical robotics, autonomous vehicles — without needing a giant foundation model as a starting point.
It's also small and fast. At roughly 15 million parameters, it's a fraction of the size of most modern AI models. It trains in hours on a single GPU and plans 48× faster than comparable systems. A robot that needs minutes to decide its next move is useless. One that can plan in milliseconds is a product.
The key innovation is training stability through a novel regulariser, eliminating the need for complex multi-term objectives.
Sketch Isotropic Gaussian Regulariser
Enforces that the latent embedding distribution is Gaussian-shaped, preventing representational collapse — the main failure mode of end-to-end JEPA training.
Uses random projections and Fourier analysis to efficiently measure distributional properties without needing a discriminator or contrastive pairs.
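Measuring "Gaussian-ness" through random one-dimensional projections can be sketched in a few lines. This toy version is not the paper's exact statistic — it simply projects a batch of embeddings onto random unit directions and penalises the gap between each projection's empirical characteristic function and that of 𝒩(0, 1), whose characteristic function is exp(−t²/2):

```python
import numpy as np

def sigreg_sketch(z, n_proj=64, freqs=(0.5, 1.0, 2.0, 4.0), seed=0):
    """Toy SIGReg-style penalty: compare 1-D random projections of the
    batch embeddings against N(0, 1) via the characteristic function
    E[exp(i t x)] = exp(-t^2 / 2), evaluated at a few frequencies t."""
    rng = np.random.default_rng(seed)
    # Random unit directions (the "slices").
    dirs = rng.standard_normal((n_proj, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = z @ dirs.T                                # (batch, n_proj)
    loss = 0.0
    for t in freqs:
        # Empirical characteristic function along each direction.
        ecf = np.exp(1j * t * proj).mean(axis=0)     # (n_proj,)
        gauss_cf = np.exp(-0.5 * t ** 2)             # CF of N(0, 1)
        loss += np.mean(np.abs(ecf - gauss_cf) ** 2)
    return loss / len(freqs)

rng = np.random.default_rng(1)
z_healthy = rng.standard_normal((4096, 192))   # well-spread embeddings
z_collapsed = np.zeros((4096, 192))            # every input -> same point
```

A collapsed batch maps every direction to a constant, so its empirical characteristic function is flat at 1 and the penalty is large; a healthy Gaussian-shaped batch scores near zero. No discriminator or negative pairs are needed — only projections and a closed-form target.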
ℒSIGReg = 𝔼[‖proj(z) − 𝒩(0,1)‖]
MSE loss in latent space
The predictor is trained to minimise the mean-squared error between its predicted future latent ẑt+k and the encoder's actual embedding of the future frame zt+k.
By predicting in latent space rather than pixel space, the model focuses on learning the dynamics of the world, not irrelevant visual details.
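Put together, the two-term objective is short enough to write out directly. A minimal numpy sketch — `sigreg_value` stands in for the SIGReg term, and the 0.09 weight is the one given in the text:

```python
import numpy as np

def pred_loss(z_hat, z_target):
    """L_pred: MSE between the predicted future latent and the
    encoder's actual embedding of the future frame."""
    return np.mean((z_hat - z_target) ** 2)

def total_loss(z_hat, z_target, sigreg_value, lam=0.09):
    """L = L_pred + 0.09 * L_SIGReg — the entire training objective."""
    return pred_loss(z_hat, z_target) + lam * sigreg_value

rng = np.random.default_rng(0)
z_target = rng.standard_normal((32, 192))                 # encoder output z_{t+k}
z_hat = z_target + 0.1 * rng.standard_normal((32, 192))   # predictor output
```

That is the whole loss: no adversarial term, no contrastive pairs, no stop-gradient schedule to tune.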
ℒpred = ‖ẑt+k − zt+k‖²
ℒ = ℒpred + 0.09 · ℒSIGReg
In Plain English
Previous systems needed up to six different training objectives, each requiring careful tuning. LeWorldModel gets away with just two: one that teaches the model to predict what happens next, and one (SIGReg) that prevents collapse by mathematically forcing the model's internal representations to stay spread out and meaningful.
Think of SIGReg as a rule that says: “every situation must look different on the inside” — which prevents the model from taking shortcuts and mapping everything to the same boring answer.
At inference time, LeWM uses model-predictive planning: sample candidate action sequences, roll them out in latent space, and pick the best one — all 48× faster than foundation-model-based alternatives.
PushT-style environment — agent (amber) plans to reach the goal (green)
Draw 300 candidate action sequences from a Gaussian distribution.
AR Predictor autoregressively predicts future latent states for each candidate — no pixel decoding needed.
Score each trajectory by its final latent distance to the goal. Keep the top-k and refit the Gaussian.
Execute the first action of the best sequence, observe the new state, and replan from scratch.
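The four steps above condense into a short cross-entropy-method loop. This is an illustrative sketch, not the paper's code — the toy linear "predictor" and all names are stand-ins:

```python
import numpy as np

def cem_plan(z0, z_goal, predict, horizon=5, n_samples=300, n_elites=30,
             n_iters=4, seed=0):
    """CEM planning in latent space: sample action sequences, roll them
    out with the predictor, score by final distance to the goal latent,
    refit the sampling Gaussian on the elites, repeat."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, z0.shape[0]))
    std = np.ones_like(mu)
    for _ in range(n_iters):
        # 1. Draw candidate action sequences from the current Gaussian.
        actions = mu + std * rng.standard_normal((n_samples, *mu.shape))
        # 2. Autoregressively roll out each candidate — no pixel decoding.
        z = np.broadcast_to(z0, (n_samples, z0.shape[0])).copy()
        for t in range(horizon):
            z = predict(z, actions[:, t])
        # 3. Score by final latent distance to the goal; keep the top-k.
        costs = np.linalg.norm(z - z_goal, axis=1)
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    # 4. Execute only the first action of the best sequence, then replan.
    return mu[0]

# Toy dynamics standing in for the learned predictor: z' = z + 0.1 * a.
predict = lambda z, a: z + 0.1 * a
z0, z_goal = np.zeros(2), np.array([1.0, 0.5])
a0 = cem_plan(z0, z_goal, predict)
```

Because every rollout happens on small latent vectors rather than images, the 300-candidate search is cheap — this is where the speed advantage over pixel-space planners comes from.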
In Plain English
Once the model has learned how the world works, it uses that knowledge to make decisions by imagining 300 different possible action sequences, mentally simulating what would happen for each one, keeping the best options, and repeating — all without ever touching the real environment.
It's the same basic logic you use when you mentally rehearse different routes to work and pick the fastest one. The key difference from prior approaches: because LeWorldModel's internal world is so compact, this mental rehearsal runs 48× faster.
Pre-computed rollout from the trained LeWM checkpoint. Step through 80 frames of CEM planning: coloured lines are candidate action trajectories (teal = low cost, red = high cost), the highlighted path is the elite set the model selected. The UMAP plot tracks the agent's position in learned latent space.
Reading the Viewer
Each colored line is a possible future the model imagined. Teal lines are plans the model liked (low cost — they get close to the goal). Red lines are plans it rejected (high cost — they go the wrong way). The highlighted path is the winner.
The UMAP plot on the right is a window into the model's “mind” — it shows how the model internally represents the agent's position, compressed from 192 dimensions down to a 2D map you can see.
Click the canvas to place a goal, press ▶, and watch the agent (the red circle) navigate toward it in real time — the entire model runs inside your browser, no server involved.
What did the model actually learn? During training, LeWM was shown thousands of video frames of an agent moving around this environment. It learned to compress each frame into a compact “fingerprint” — 192 numbers that capture the essentials of the scene: roughly where things are and how they relate. Crucially, scenes that look similar get similar fingerprints. The model can then reason about what fingerprint to expect after a given action, without ever having to predict raw pixels. That is how it plans: it imagines several possible moves, checks which imagined fingerprint lands closest to the goal, and takes that step.
What the right-hand plot is showing you. Because a fingerprint is 192 numbers, we can't draw it directly — so we flatten it to 2D and plot it as a dot. The blue dots come from one pre-recorded run where the agent successfully reached a goal; each dot is one moment in time, in order, so together they trace the "journey" through fingerprint-space for that run. The red dot is the fingerprint of whatever is on your canvas right now, updated live after each step. As your agent moves you should see the red dot travel — and if it drifts into or alongside the blue cloud, that means the model is encountering scenes it recognises as similar to what it saw during that successful run. The blue dots are fixed; only the red one moves.
How long to run it: A few seconds is enough to see the agent move. Let it run for 10–20 seconds to watch the red dot settle into the blue cloud as the situation becomes familiar to the model. Try clicking a new goal mid-run to see it re-plan.
Press ▶ to start. The red agent moves toward your goal using kinematic CEM planning. After each step, the 420×420 frame is encoded by the LeWM ViT encoder running entirely in your browser via ONNX Runtime WebAssembly — no server call. The red dot shows where that frame lands in the 192-dim JEPA latent space, projected to 2D with PCA.
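Projecting a 192-dimensional latent to 2D is cheap enough to do once per frame. A numpy sketch of the idea — fit a 2-D PCA basis on the pre-recorded trajectory, then map each live latent onto it (the data here is synthetic, and the function names are illustrative):

```python
import numpy as np

def fit_pca2(latents):
    """Fit a 2-D PCA basis on recorded latents (the blue-dot run).
    Rows of Vt from the SVD of centred data are principal directions."""
    mean = latents.mean(axis=0)
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    return mean, vt[:2]

def project2(z, mean, basis):
    """Map a 192-dim latent (or a batch of them) to 2-D coordinates."""
    return (z - mean) @ basis.T

rng = np.random.default_rng(0)
trajectory = rng.standard_normal((500, 192)).cumsum(axis=0)  # fake latents
mean, basis = fit_pca2(trajectory)
blue_dots = project2(trajectory, mean, basis)    # precomputed cloud (500, 2)
red_dot = project2(trajectory[-1], mean, basis)  # one live frame, shape (2,)
```

The basis and mean are fixed after fitting, so the live red dot is a single 192×2 matrix multiply per frame — easily fast enough for in-browser inference.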
What's Actually Happening
When you click the canvas and set a goal, the model performs a full cycle of intelligent planning on every single frame — entirely inside your browser, with no server involved.
1. It “sees” the current scene — processing raw pixels through its vision encoder, compressing a 224×224 image down to a single 192-number summary of “what the world looks like right now.”
2. It imagines 300 possible futures — mentally simulating different action sequences entirely in its compressed internal world, not in pixel-space (which would be slow and expensive).
3. It picks the best plan and acts — scores each imagined future by how close it gets to your goal, takes the first step of the winning plan, then replans from scratch.
4. The scatter plot shows the model's “mental map” — projecting the model's 192-dimensional internal state down to 2D so you can see it.
This is a tiny model — about 15 million parameters, small enough to run in a web browser. And yet it's performing the core loop of autonomous decision-making: perceive, imagine, plan, act, repeat. That loop is the foundation of everything from warehouse robots to self-driving cars to surgical assistants. Each time you click and set a new goal, you're testing whether the model has genuinely learned the physics of this little world — not memorized a fixed path, but understood the underlying rules well enough to plan a new route on the fly.
Tested across four tasks of increasing complexity — 50 episodes each, 5-step planning horizon, 300 CEM samples.
Agent navigates between two rooms through a narrow doorway. Tests long-horizon planning and spatial reasoning.
Push a T-shaped block to a target position and orientation. A classic robotic manipulation benchmark.
Manipulate a cube to a goal pose via OGBench's cube_single task. Tests 3D spatial understanding.
Continuous control via DeepMind Control Suite Reacher. Tests fine-grained motor control from pixels.
In Plain English
The team tested LeWorldModel across four tasks that get progressively harder: navigating between rooms, pushing an object to a target, manipulating a 3D cube, and controlling a robotic arm. These are standard benchmarks in the field — the AI equivalent of standardized tests.
LeWorldModel holds its own against much larger systems on simpler tasks, though bigger pre-trained models still perform better in visually complex 3D settings. The point isn't that it's the best at everything — it's that it's competitive while being dramatically simpler and faster.
LeWorldModel is not a breakthrough in what world models can do. It's a breakthrough in how simply they can be built.
For years, the world model approach to AI — building systems that understand and simulate physical reality rather than just predicting text — has been stuck behind an engineering wall. The systems were too fragile, too complex, and too expensive to train reliably. LeWorldModel doesn't demolish that wall, but it shows a much simpler path through it.
By replacing a tangled web of training tricks with a clean mathematical solution, it demonstrates that a fully end-to-end JEPA world model can be trained from raw pixels, on a single GPU, in a few hours. That's significant not because of the benchmarks (which are solid but not record-breaking), but because it changes the economics and accessibility of this entire line of research.
If this recipe scales — and that's still a big “if” — it could mean that building a world model stops being something only a handful of well-funded labs can attempt, and starts being something any AI team can experiment with. In a field defined by the mantra “bigger models, more compute,” LeWorldModel quietly suggests that sometimes, the answer is a better equation.
LeWorldModel isn't happening in isolation. LeCun's broader research program is pursuing two parallel tracks: LeWorldModel asks how simple world models can be made while keeping them functional — strip away the complexity, find the minimal recipe. V-JEPA 2.1 asks how rich and expressive world model representations can become — more supervision, finer details, scaled across images and video.
These aren't competing approaches. One is compressing the engine to its essence. The other is expanding the fuel supply. For AMI, having both threads advancing simultaneously means the research isn't betting on a single path — it's building a toolkit.
@article{maes2026lewm,
title = {LeWorldModel: End-to-End JEPA World Model from Pixels},
author = {Maes, Lucas and Le Lidec, Quentin and Scieur, Damien
and LeCun, Yann and Balestriero, Randall},
journal = {arXiv:2603.19312},
year = {2026}
}