Teaching AI to Play Minecraft by Watching YouTube Videos

You know the usual way to train a game-playing AI: set up a reward function, run a million simulations, tweak hyperparameters, pray. But what if the AI could just watch unlabeled Minecraft videos on YouTube and figure out how to play? That's the idea behind OpenAI's Video Pre-Training (VPT) project.

Instead of hand-crafting reward signals or requiring massive amounts of labeled gameplay data, VPT learns mostly from tens of thousands of hours of raw, unlabeled video, bootstrapped by a much smaller labeled set. It's like giving an AI a YouTube binge session and asking it to pick up the basics of Minecraft from scratch. No human-written labels on the bulk of the data, no explicit reward functions during pretraining. Just pixels.

What It Does

Video Pre-Training is a framework for training foundation models that can play Minecraft by watching unlabeled video data. The core idea is straightforward:

Train an inverse dynamics model (IDM) on a small amount of labeled gameplay (about 2,000 hours). The IDM learns to infer the action taken at each step from the surrounding video frames, looking at both past and future frames.
Apply the IDM to a huge dataset of unlabeled Minecraft YouTube videos (70,000+ hours). This generates action labels for the video frames, creating a massive labeled dataset.
Train the VPT foundation model on this generated dataset with behavioral cloning: given the frames seen so far, predict the next action, using the IDM-generated labels as targets.
Fine-tune the VPT model on specific tasks using reinforcement learning or supervised learning (behavioral cloning on a smaller, task-specific dataset).
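
The first three steps can be sketched in a toy setting. This is a minimal illustration under loud assumptions, not OpenAI's implementation: a "frame" is just an integer position on a line, the hypothetical player always walks toward position 0, and both the IDM and the policy are simple majority-vote lookup tables rather than neural networks. All names here (expert_action, record_episode, majority) are made up for the example. Fine-tuning (step four) is left out.

```python
from collections import Counter, defaultdict

# Toy stand-in for Minecraft (an assumption for illustration only):
# a "frame" is an integer position, and the player walks toward position 0.
STEP = {"left": -1, "stay": 0, "right": 1}


def expert_action(pos):
    """Hypothetical human player: always move toward position 0."""
    if pos > 0:
        return "left"
    if pos < 0:
        return "right"
    return "stay"


def record_episode(start, length):
    """Return (frames, actions) for one episode of the hypothetical player."""
    frames, actions = [start], []
    for _ in range(length):
        action = expert_action(frames[-1])
        actions.append(action)
        frames.append(frames[-1] + STEP[action])
    return frames, actions


def majority(counts_by_key):
    """Map each key to its most frequently observed action."""
    return {key: counts.most_common(1)[0][0] for key, counts in counts_by_key.items()}


# Step 1: fit the IDM on a small labeled set (frames AND actions).
# It sees the frame before and after a step and infers the action between them.
labeled = [record_episode(start, 5) for start in (3, -2)]
idm_counts = defaultdict(Counter)
for frames, actions in labeled:
    for before, after, action in zip(frames, frames[1:], actions):
        idm_counts[after - before][action] += 1
idm = majority(idm_counts)

# Step 2: pseudo-label a larger set of "videos" for which we keep frames only.
unlabeled_videos = [record_episode(start, 8)[0] for start in range(-6, 7)]
pseudo_labeled = [
    (frames, [idm[after - before] for before, after in zip(frames, frames[1:])])
    for frames in unlabeled_videos
]

# Step 3: behavioral cloning on the pseudo-labels.
# The policy sees only the current frame and predicts the next action.
policy_counts = defaultdict(Counter)
for frames, actions in pseudo_labeled:
    for frame, action in zip(frames, actions):
        policy_counts[frame][action] += 1
policy = majority(policy_counts)

print(idm)
print(policy[5], policy[-4], policy[0])
```

Note the asymmetry the approach relies on: the IDM gets to look at what happened after each step, which makes its job much easier than the policy's, and the labeled episodes only cover positions -2 through 3 while the cloned policy ends up covering every position that appears in the unlabeled videos.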

The result? The AI learns to do things like chop trees, craft tools, and navigate the world—without ever being explicitly told what to do.

Why It’s Cool

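This isn't another "watch the AI fail at Minecraft" demo. OpenAI reports that, after fine-tuning with reinforcement learning, the VPT model learned to craft diamond tools, a long multi-step task that takes proficient human players many minutes, and it does so through the same keyboard-and-mouse interface humans use. Here's what makes it stand out: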

Almost zero reward design. The small labeled dataset (about 2,000 hours) is used only to train the IDM. The main 70,000 hours of video are completely unlabeled, and pretraining on them needs no reward function at all. Compare that to traditional reinforcement learning, where you'd hand-craft a reward function for each task you care about.
Works from raw pixels. No need to compress the game state into clean vectors. The model consumes raw video frames like a human would.
Generalizes surprisingly well. The pretrained model isn't just good at one task. It learns a broad suite of Minecraft behaviors—mining, crafting, building—and can be fine-tuned for specific goals.
Data source is YouTube. The videos are real human gameplay collected from the web. That’s messy, noisy, and imperfect, but also rich with the kind of behavior that matters in open-world games.
