Yann LeCun is out at Meta. If you don't know Yann, he's a Turing Award winner who invented convolutional neural networks back in the '80s and had led Meta's AI research since 2013, most recently as its Chief AI Scientist ('til today, anyway). Now, he's starting his own company…and if ex-OpenAI alums Ilya Sutskever and Mira Murati are any guide, it's about to be worth a billion dollars…
Why’s he leaving Meta? He thinks the path to AGI (artificial general intelligence, where AI can do everything a human can do as well as a human can) won't go through large language models (LLMs). While everyone's scaling up ChatGPT clones, LeCun's betting on “world models”, or AI that learns by watching video and understanding physical reality, not just predicting words. He's not alone btw; Dr. Fei-Fei Li, the "Godmother of AI", is focused on the same idea.
LeCun's approach uses JEPA (Joint Embedding Predictive Architecture), which lets AI build internal simulations of how the world works. Think: understanding how ingredients interact vs. memorizing recipes (video about it here, and related paper here).
Specifically, he's proposing three interconnected things:
1. Objective-Driven AI Architecture (the big vision)
An AI system with these components working together:
- Perception - observes the world.
- World Model - predicts what happens when you take actions (the centerpiece).
- Memory - stores past information.
- Objectives - goals and guardrails the system tries to achieve.
- Actor - produces actions by optimizing those objectives.
Think of it like: the system predicts different action sequences, sees which outcomes best satisfy its goals, then executes those actions.
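To make that loop concrete, here's a minimal sketch in plain Python. Everything in it (the toy dynamics, the cost function, the random-search planner) is a hypothetical stand-in for illustration, not LeCun's actual implementation:

```python
import random

def perceive(observation):
    return observation                       # Perception: encode the current state of the world

def world_model(state, action):
    return state + action                    # World Model: predict the next state (toy dynamics)

def cost(state, goal):
    return abs(goal - state)                 # Objectives: lower "cost" / "discomfort" is better

def plan(state, goal, horizon=5, candidates=200):
    """Actor: imagine many action sequences, keep the one whose outcome best meets the goal."""
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [random.uniform(-1, 1) for _ in range(horizon)]
        s = state
        for a in actions:                    # roll the sequence through the world model
            s = world_model(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions

state = perceive(observation=0.0)
print(plan(state, goal=3.0))                 # the action plan the system would then execute
```

A real system would swap the random search for gradient-based optimization and re-plan after every step, but the shape of the loop is the same.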
2. JEPA (the technical breakthrough)
This is his solution to making world models work. Instead of predicting every pixel in a video (which creates blurry garbage), JEPA:
- Takes two views of data (like two parts of a video).
- Runs them through encoders to create compressed representations.
- Predicts one representation from the other.
- Only cares about the meaningful information, not every detail.
Why it matters: Generative models waste resources predicting irrelevant details. JEPA focuses on what actually matters for understanding and prediction.
3. V-JEPA (the practical demo)
In the above video, LeCun demonstrates a video version of JEPA that:
- Learns from video by predicting masked portions.
- Creates world models without labeled data.
- Could enable systems that plan actions, not just words.
The bottom line: LeCun is proposing we abandon today's dominant approaches (generative models like diffusion, autoregressive LLMs) for systems that learn compressed world models and plan actions to achieve objectives. It's a fundamentally different architecture than current AI.
Below, we dive a bit deeper into his arguments to fully explore them.
Yann LeCun's Radical Vision: Why Today's AI "Sucks" and What Comes Next
In a world captivated by the seemingly magical abilities of Large Language Models (LLMs) like ChatGPT, Meta's Chief AI Scientist and Turing Award winner Yann LeCun has a sobering message: "Machine learning sucks." This isn't the cynical take of a laggard, but the calculated critique of a pioneer who sees a fundamental wall that current approaches are destined to hit. While the rest of Silicon Valley is locked in a race to scale up existing models, LeCun is charting a completely different course—one that abandons today's dominant paradigms in favor of an architecture inspired by the most efficient learners on the planet: babies and animals.
His argument is simple yet profound. Today's AI systems, for all their fluency, lack the basic building blocks of true intelligence. They have no common sense, cannot truly reason or plan, and don't understand the physical world in any meaningful way. An LLM can write a sonnet about gravity, but it doesn't know that an unsupported object will fall. This, LeCun argues, is because they are trained almost exclusively on text, a dataset that is shockingly small compared to the sensory firehose a child experiences. A four-year-old, he points out, has processed roughly 50 times more data through their eyes alone than the entire public text corpus used to train today's most powerful language models. That massive stream of visual data is what builds our internal "world model"—our intuitive grasp of physics, object permanence, and cause and effect. Language is learned on top of this foundation, not the other way around.
This fundamental flaw is why current systems are so inefficient. A teenager can learn to drive a car in about 20 hours of practice, drawing on a lifetime of learned intuitive physics. A self-driving AI, by contrast, requires millions of miles of data and is still brittle. LLMs also suffer from the technical limitations of their auto-regressive design. They generate text one token at a time, unable to plan a full response or correct early mistakes—a process LeCun calls an "exponentially divergent process" where errors compound over time.
The Blueprint for a Smarter AI: Objective-Driven Architecture
To overcome these hurdles, LeCun proposes the Objective-Driven AI Architecture framework. The architecture is composed of several interconnected modules designed to work in concert, much like different regions of the brain.
At its heart are four key components:
- A Perception Module: This takes in sensory data (like video) and creates an internal representation of the world's current state.
- A World Model: This is the centerpiece of the system. It's an internal simulator that can predict how the world will evolve. Crucially, it takes a sequence of imagined actions from the Actor and predicts the likely future states.
- A Cost Module: This module computes a single number representing "discomfort" or "energy." The agent's ultimate goal is to choose actions that will lead to future states with the lowest possible cost. This module includes a hard-wired "Intrinsic Cost" (defining basic drives like curiosity or avoiding "pain") and a trainable "Critic" that learns to predict future costs.
- An Actor Module: This is the agent's decision-maker. It proposes sequences of actions to the World Model, observes the predicted costs, and uses optimization techniques (like gradient descent) to find the action plan that minimizes future discomfort.
This design allows for two modes of thinking, analogous to Daniel Kahneman's "System 1" and "System 2." Mode-1 is reactive and fast: the Actor produces an immediate action based on perception. Mode-2 is deliberative and slow: the agent engages in planning, running simulations through its World Model to find the best long-term strategy. This is how humans decide whether to swerve instinctively versus carefully planning a multi-stop road trip.
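Purely as an illustration, the two modes boil down to a switch like this, assuming a cheap reactive policy and a plan() routine like the toy planner sketched earlier (both hypothetical):

```python
def act(state, goal, policy, plan, deliberate=False):
    if not deliberate:
        return policy(state)            # Mode-1: reactive, a single cheap forward pass
    return plan(state, goal)[0]         # Mode-2: simulate futures with the world model,
                                        # then execute the first action of the best plan

# toy usage with stand-in functions
print(act(0.0, 3.0, policy=lambda s: -s, plan=lambda s, g: [g - s]))                    # Mode-1
print(act(0.0, 3.0, policy=lambda s: -s, plan=lambda s, g: [g - s], deliberate=True))   # Mode-2
```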
JEPA: The Engine That Builds the World Model
This all sounds good in theory, but how do you actually build a world model that can learn intuitive physics from raw video? LeCun and his team at Meta AI have developed a specific technique called the Joint Embedding Predictive Architecture (JEPA).
JEPA is a direct response to the failure of so-called "generative" models. If you ask a generative AI to predict the next frame of a video, it will try to predict every single pixel. Since the real world is filled with unpredictable details—the exact rustle of leaves on a tree, the texture of a carpet—the model hedges its bets and produces a blurry, averaged-out mess.
JEPA takes a smarter, non-generative approach. Instead of predicting pixels, it predicts in an abstract representation space. Here’s how it works:
- The model looks at a piece of data, like a video clip with a section blacked out (x). It runs this through an encoder to get an abstract representation (sx).
- It then looks at the full, complete video clip (y) and runs that through a separate encoder to get its representation (sy).
- The system's only job is to train a predictor that can guess sy based on sx.
The magic is in what the encoders learn to ignore. To make the prediction task manageable, the encoders learn to strip out all the unpredictable, high-frequency details (the "noise") and keep only the essential, abstract information that is predictable. By learning to predict that a representation of a ball will move in a parabolic arc, the system learns about gravity without ever needing to predict the ball's scuff marks or the clouds in the background.
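Here's a minimal sketch of that training loop in PyTorch. To be clear, this is illustrative, not Meta's I-JEPA/V-JEPA code: the tiny MLP encoders, the dimensions, and the EMA-updated target encoder (a common anti-collapse choice in this family of methods) are all simplifying assumptions.

```python
import torch
import torch.nn as nn

dim = 256
context_encoder = nn.Sequential(nn.Linear(1024, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder  = nn.Sequential(nn.Linear(1024, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor       = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

target_encoder.load_state_dict(context_encoder.state_dict())   # start the two branches in sync
for p in target_encoder.parameters():
    p.requires_grad_(False)                                     # target branch gets no gradients

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(x_masked, y_full, ema=0.996):
    """x_masked: clip with a region blanked out; y_full: the complete clip (both flattened)."""
    sx = context_encoder(x_masked)              # abstract representation of the partial view
    with torch.no_grad():
        sy = target_encoder(y_full)             # representation of the full view
    loss = ((predictor(sx) - sy) ** 2).mean()   # predict sy from sx in representation space,
                                                # never in pixel space
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # slowly drag the target encoder toward the
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

# toy usage: random tensors standing in for real (masked, full) video clips
print(training_step(torch.randn(8, 1024), torch.randn(8, 1024)))
```

Because the loss lives in representation space, the encoders are free to throw away unpredictable pixel-level detail, which is exactly the point.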
Meta AI has already demonstrated this with I-JEPA for images and V-JEPA for videos. These models learn rich, common-sense representations about object shapes, collision physics, and 3D space simply by observing, much like a baby in a crib. LeCun envisions a Hierarchical JEPA (H-JEPA), where stacked models learn abstractions at different timescales, enabling complex, long-range planning.
Answering the Big Question: How It All Fits Together
So, is LeCun proposing World Models, JEPA, or Objective-Driven AI? The answer is all three, working in a clear hierarchy:
- Objective-Driven AI is the overarching cognitive architecture—the blueprint for an intelligent agent.
- The World Model is a critical component within that architecture, acting as the agent's internal simulator of reality.
- JEPA is the specific technical method or training paradigm used to build and train the World Model.
Think of it like building a car. The Objective-Driven AI is the complete car design. The World Model is the engine. And JEPA is the advanced manufacturing process you use to build that engine.
A Future That's Open and Safe
LeCun’s vision extends beyond technical architecture. He is a fierce advocate for open-source AI, arguing that it is essential for democracy. As AI systems become the primary mediators of our digital lives, control by a handful of corporations would create a dangerous concentration of power. Instead, he envisions a diverse ecosystem of models, where communities can build and fine-tune systems that reflect their own languages and values.
He also sees his proposed architecture as inherently safer. Unlike the black-box nature of LLMs, an objective-driven agent's behavior is guided by an explicit, auditable Cost Module. Safety can be engineered directly into the system through "guardrail objectives" that make harmful actions generate an extremely high cost, causing the planner to avoid them entirely.
For LeCun, the path to human-level AI is not a sprint to scale the systems we have today. It's a long, patient effort to build machines that learn about the world in the same way we do: by watching, predicting, and acting with purpose. It's a future that promises not just more fluent chatbots, but truly intelligent partners.
What this means for you...
LeCun is making a bold bet against the "scaling is all you need" philosophy that currently dominates AI (although, as we've written before, other AI legends including Andrej Karpathy and Richard Sutton tend to agree). He believes the next breakthrough won't come from making LLMs bigger, but from building systems with foundational world knowledge. For developers, this points to a new frontier in non-generative, predictive models. For everyone else, it’s a powerful reminder that while today's AI is a useful tool, the systems that can truly plan, reason, and understand our world are still on the horizon—and they might look completely different from what we have now.
As John Coogan curated recently, most "in-the-know" observers (including Karpathy, Dwarkesh Patel, George Hotz, and, not mentioned by Coogan, Google DeepMind CEO Demis Hassabis) seem to have landed on the consensus that "AGI" is more or less a decade away, for a variety of technical reasons that have more to do with the realities of developing systems that can handle all the edge cases that come with trying to automate human behavior (and, more importantly for business uses, meaningful human labor).
If AGI really is a decade away, the next 10 years will be the last decade of "business as usual" for knowledge work. Here's what that means practically:
For your career: The question isn't "will AI replace my job?" but "which parts of my job create leverage that AI can't replicate?" Focus on building skills in three categories that remain hard to automate even with AGI:
- Judgment calls under ambiguity - When there's no clear right answer and the stakes matter (What should our company strategy be? Should we enter this market? Is this the right hire?)
- Relationship arbitrage - Trust, reputation, and networks you've built over years. AGI can draft the email, but it can't be the person people want to take a meeting with.
- Taste and curation - Knowing what's good, what matters, and what resonates. AI can generate 1,000 options; you need to know which three are worth pursuing.
For your kids' education: Don't train them for 2035 jobs. Train them to be the kind of humans that thrive when AI handles the rote work. That means focusing on fundamentals: clear communication, creative problem-solving, building things, understanding incentives and human behavior.
For your business: Start experimenting now with AI tools, not because you need to maximize efficiency today, but because you need to understand where the efficiency gains come from. The companies that survive the transition won't be the ones with the most AI - they'll be the ones who best understand which problems AI actually solves vs. which problems still need humans.
So What Does AGI Actually Require?
If AGI really is a decade away, here's what needs to happen: a combination of (1) scaling up current systems to their limits, (2) new algorithmic breakthroughs like LeCun's world models or other competing approaches, and (3) technical breakthroughs in chip design, hardware form factors, energy management, and system architecture—all working together to create something powerful enough and efficient enough to actually work.
AGI will likely come in many flavors. Some systems might claim AGI by being narrowly superhuman across many domains. Others might achieve true general intelligence—able to do any cognitive task a human can. But here's the brutal constraint: to truly match human versatility, you need human-level performance at human-level efficiency.
That means a digital AGI running in the cloud needs to operate on roughly 120 watts (about what your entire body burns at rest, plus some overhead for the rest of the cognitive system). A physical humanoid robot doing real-world tasks needs about 500 watts—the same power budget as a human doing moderate physical labor.
Let's also be clear about what kind of intelligence we're actually targeting: the human brain, which delivers roughly 1 exaflop of compute on only ~20 watts (call it 22 W if it's working overtime). That's an LED bulb's worth of power performing a billion-billion calculations per second, and about one-sixth of an average human's total power draw on a given day. But since we're talking about omni-models here, you might need a more embodied system with lots of interconnected parts working together, so let's give ourselves the goal of a 120 W system to start.
The Efficiency Gap: El Capitan vs. Your Brain
Start with what's theoretically possible today. The world's fastest supercomputer as of June 2025, El Capitan, can actually match (or even exceed) the brain's computational performance—it hits 1.7 exaflops. But here's the problem: it burns 29.6 megawatts to do it.
Scale El Capitan down to human power levels and you would get:
- At 500W (robot budget): ~0.003% of brain-level compute → need ~34,000x better efficiency
- At 120W (cloud AGI budget): ~0.0007% of brain-level compute → need ~141,000x better efficiency
This is the theoretical ceiling. If you could take today's peak supercomputing capability and somehow shrink it down to fit human power constraints, you'd need to make the entire system somewhere between 34,000-141,000 times more efficient.
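If you want to sanity-check those figures, the arithmetic fits in a few lines of Python (using the numbers cited above: El Capitan at ~1.74 exaflops on 29.6 MW, and a ~1 exaflop, ~20 W brain as the target):

```python
BRAIN_FLOPS, BRAIN_WATTS = 1e18, 20
brain_eff = BRAIN_FLOPS / BRAIN_WATTS / 1e12          # ≈ 50,000 TFLOPS per watt

el_capitan_eff = 1.742e18 / 29.6e6                    # FLOPS per watt today

def gap(budget_watts):
    """How much more efficient El Capitan would need to be to deliver 1 exaflop in this budget."""
    required_eff = BRAIN_FLOPS / budget_watts
    return required_eff / el_capitan_eff

print(f"brain efficiency: {brain_eff:,.0f} TFLOPS/W")
print(f"robot budget (500 W):     {gap(500):,.0f}x")  # ≈ 34,000x
print(f"cloud AGI budget (120 W): {gap(120):,.0f}x")  # ≈ 141,000x
```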
El Capitan wastes massive power on interconnects, cooling, and networking between 11 million cores. That's the price of building a supercomputer—you're essentially paying for the infrastructure to coordinate all that compute.
What About Deployable Chips? The B200 Reality Check
Now let's talk about what you'd actually put in a robot or data center today. Nvidia's B200—the cutting-edge chip powering the next generation of AI systems—delivers 2.5 petaflops at 1,200W. A single B200 is roughly 1/700th of El Capitan's total output, but it's designed to be deployable.
Here's the B200 math:
- Current efficiency: ~2.08 TFLOPS per watt (2.5 petaflops / 1,200 W).
- The brain's efficiency (~1 exaflop on ~20 W): ~50,000 TFLOPS per watt.
- Gap for a robot (1 exaflop on a 500 W budget, i.e. ~2,000 TFLOPS per watt): need ~961x better efficiency.
- Gap for cloud AGI (1 exaflop on a 120 W budget, i.e. ~8,333 TFLOPS per watt): need ~4,006x better efficiency.
Why such a huge difference from El Capitan's numbers? Because the B200 is already optimized for deployment: it doesn't carry the overhead of massive interconnects and cooling systems. Even so, this optimized chip is still roughly 1,000x (robot budget) to 4,000x (cloud budget) short of the efficiency those power envelopes demand.
So What's Realistic? The Timeline Math
Historically, GPU efficiency has doubled roughly every 2-3 years. Even at the optimistic two-year pace (see the quick calculation after this list):
- For a robot (961x improvement): ~20 years
- For cloud AGI (4,006x improvement): ~24 years
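The same arithmetic for the B200, plus the timeline implied by the doubling trend; figures are the ones cited above, and the ~20 and ~24 year numbers correspond to the two-year end of the range:

```python
import math

BRAIN_FLOPS = 1e18
b200_eff = 2.5e15 / 1200                              # ≈ 2.08 TFLOPS/W (2.5 PFLOPS at 1,200 W)

for label, budget_watts in [("robot (500 W)", 500), ("cloud AGI (120 W)", 120)]:
    needed = (BRAIN_FLOPS / budget_watts) / b200_eff  # ≈ 960x and ≈ 4,000x (the ~961x / ~4,006x above)
    doublings = math.log2(needed)                     # each doubling takes ~2-3 years
    print(f"{label}: {needed:,.0f}x gap → {2 * doublings:.0f}-{3 * doublings:.0f} years")
```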
But AI insiders keep saying AGI is a decade away. The only way those timelines match is if we're about to hit massive algorithmic breakthroughs that skip past incremental hardware improvements. We're talking about 10-100x improvements from fundamentally rethinking how AI systems work.
What about NVIDIA's actual roadmap? The company has already mapped out its next three generations:
- Blackwell Ultra (late 2025): 50% more performance than B200, but still at similar power levels. Gets us to ~3.1 TFLOPS/watt—better, but still 16,000x away from brain efficiency.
- Vera Rubin (late 2026): This is the big one. Each GPU hits 50 petaflops of FP4 compute—2.5x better than B200. Full rack systems deliver 3.6 exaflops. But here's the catch: estimated power consumption jumps to ~1800W per chip. That puts efficiency at roughly 28 TFLOPS/watt. Still need 1,800x improvement to match your brain.
- Rubin Ultra (2027): Doubles down to 100 petaflops per GPU, with rack systems hitting 15 exaflops. NVIDIA claims this will be 14x faster than current systems—but that's raw performance, not efficiency. Even if we're generous and assume they keep power under 2000W per chip, we're looking at maybe 50 TFLOPS/watt. Still need 1,000x to match brain efficiency.
The pattern is clear: Raw performance is scaling fast (10-15 petaflops → 50 → 100 petaflops over 3 years). But power consumption is scaling almost as fast (1200W → 1800W → 2000W+). The efficiency improvements are incremental—maybe 10-15x over the next 3 years if we're optimistic.
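Expressed as efficiency against the ~50,000 TFLOPS/W brain figure (the Rubin-generation power numbers are the estimates used above, not confirmed specs):

```python
TARGET_TFLOPS_PER_W = 50_000                      # ≈ 1 exaflop on 20 W

roadmap = [  # (chip, peak petaflops as cited above, estimated watts)
    ("Blackwell Ultra", 3.75, 1200),              # ~50% more than B200's 2.5 PFLOPS, similar power
    ("Vera Rubin",      50,   1800),
    ("Rubin Ultra",     100,  2000),
]

for chip, pflops, watts in roadmap:
    eff = pflops * 1000 / watts                   # TFLOPS per watt
    print(f"{chip:>15}: {eff:6.1f} TFLOPS/W, {TARGET_TFLOPS_PER_W / eff:,.0f}x short of the brain")
```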
To hit brain-level efficiency (50,000 TFLOPS/watt) by 2027, we'd need Rubin Ultra to be 1,000x more efficient than expected. Hardware alone isn't going to cut it. This is where LeCun's bet comes in: algorithmic breakthroughs that do 100-500x more with the same silicon.
Where Could Those Efficiency Gains Come From?
This is speculative, but here's one way to think about where a full AGI system's computational budget might go, based on LeCun's proposed architecture and what we know about current robotics systems (a rough watt-level illustration follows the breakdown):
Perception (~30-40% of compute):
- Vision processing: multiple camera feeds, depth sensing, object recognition.
- Proprioception: joint angles, force sensors, balance.
- Current bottleneck: Traditional cameras generate massive data streams that need constant processing.
- Potential breakthrough: Neuromorphic vision chips that work like biological retinas—only transmitting changes in the visual field rather than every pixel. Think event-based cameras that could be 10-100x more efficient.
World Model (~30-40% of compute):
- This is LeCun's centerpiece: an internal physics simulator predicting what happens when you take actions.
- Needs to run faster than real-time for planning (imagine your brain simulating "what if I pick this up?" before you do it).
- Current bottleneck: Generative models try to predict every pixel of what happens next—massively wasteful.
- Potential breakthrough: JEPA-style abstract representations. Instead of predicting pixels, predict compressed abstractions of what matters. This is LeCun's core bet—that this alone could provide 50-500x efficiency gains.
Motor Control & Planning (~20-30% of compute):
- Trajectory optimization for 28+ actuators (Tesla Optimus has 28 structural actuators).
- Real-time control loops running at 1000Hz+.
- Current bottleneck: Centralized control from a single chip trying to coordinate everything.
- Potential breakthrough: Distributed control + morphological computation. Let the body's physics do some of the computational work. Example: a compliant robotic gripper can passively adapt to object shapes, reducing the control problem complexity by 10-20x.
Memory & Goal Management (~5-10% of compute):
- Episodic memory, learned behaviors, cost calculations.
- Relatively small compared to perception and world modeling.
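Purely for illustration, here's what that speculative split looks like in watts for the two power envelopes discussed earlier, using rough midpoints of the ranges above:

```python
budget_share = {                       # rough midpoints of the ranges above (sums to 1.0)
    "perception":               0.35,
    "world model":              0.35,
    "motor control & planning": 0.25,
    "memory & goal management": 0.05,
}

for envelope, watts in [("cloud AGI (120 W)", 120), ("robot (500 W)", 500)]:
    print(envelope)
    for module, share in budget_share.items():
        print(f"  {module:<26} {share * watts:5.1f} W")
```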
The Core Insight: It's Not Just About Better Chips
Here's the math we established:
- Deployable chips (B200): need 961-4,006x efficiency gains.
- Theoretical peak (El Capitan): need 34,000-141,000x efficiency gains.
If hardware improvements alone followed historical trends (2x every 2-3 years), we're looking at 20-24 years for the B200 path. But multiple leading AI researchers believe AGI is roughly a decade away.
That gap only makes sense if we're about to see massive algorithmic breakthroughs. The kind where the efficiency gains don't come from shrinking transistors—they come from fundamentally rethinking how AI learns and operates.
This is precisely LeCun's bet. Your brain doesn't waste compute predicting every pixel; it builds compressed world models and plans actions through those abstractions. An AI that learns the same way could plausibly be 100-1,000x more computationally efficient for the same capability.
We know hardware will keep getting better: GPU efficiency will keep improving, neuromorphic hardware (or thermodynamic hardware) will mature, and system architectures will get smarter.
The real question is whether we're building the right kind of intelligence on top of those chips. LeCun left Meta to prove the answer is no—that scaling up today's word-prediction models won't get us to AGI, no matter how much compute we throw at them.
And given that he invented the convolutional neural networks underpinning so much of today's AI industry back in the '80s, that's not a bet to ignore. If he's even half-right about world models being the breakthrough we need, the next decade will be defined by whoever figures out how to make AI learn the way humans do—efficiently. And right now, on that front, China appears to be moving faster than the U.S.