This week at AWS re:Invent, Corey sat down with TwelveLabs to discuss their newly launched Marengo 3.0, a video foundation model that actually understands video the way humans do—across time, space, and all modalities (visual, audio, text) simultaneously.
Check out the video below, or keep reading to learn all about it.
Here's the problem:
Traditional computer vision treats video like a flipbook, analyzing it frame by frame as a series of still images. That means AI misses what happens between frames: the temporal relationships, audio cues, and cross-modal connections that give video its meaning. Meanwhile, an estimated 80% of the world's data sits trapped in video (often literally on tapes in storage), largely unusable because there has been no good way to search or understand it.
What makes Marengo 3.0 different:
Marengo 3.0 enables natural language search across your entire video library. Instead of scrubbing through hours of footage or relying on manual tagging, you can search with queries like "find the moment when the player in the red jersey scores a jump shot" or "show me all segments where the mechanic points to the engine component."
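To make that concrete, here's a rough sketch of what embedding-based video search looks like under the hood: precompute an embedding for every clip, embed the query, and rank clips by similarity. The `embed_text()` helper and the clip index format below are illustrative stand-ins, not the actual TwelveLabs API.

```python
# Illustrative sketch of embedding-based video search (not the actual TwelveLabs API).
# Assumes you already have per-clip embeddings from a video model such as Marengo,
# stored alongside timestamps; embed_text() is a hypothetical helper standing in
# for whatever SDK or Bedrock call you use to embed the query.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_clips(query_embedding: np.ndarray, clip_index: list[dict], top_k: int = 5) -> list[dict]:
    """Rank precomputed clip embeddings against a query embedding."""
    scored = [
        {**clip, "score": cosine_similarity(query_embedding, clip["embedding"])}
        for clip in clip_index
    ]
    return sorted(scored, key=lambda c: c["score"], reverse=True)[:top_k]

# Each clip_index entry looks like:
# {"video_id": "game_042", "start": 1831.2, "end": 1838.7, "embedding": np.ndarray}
# query = embed_text("player in the red jersey scores a jump shot")  # hypothetical helper
# hits = search_clips(query, clip_index)
```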
Even better: it supports composed queries, combining an image with text (say, a photo of a specific player plus "scored a three-pointer") to pinpoint exact moments. It handles hour-long videos without breaking, works across 50+ languages natively, and processes everything 30x faster than competitors like Amazon Nova while using 6x less storage space.
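Conceptually, a composed query follows the same retrieval pattern: get an embedding for the image, one for the text, fuse them, and search with the combined vector. The weighted average below is only a stand-in for however the model actually combines modalities; `embed_image()` and `embed_text()` are hypothetical helpers, and `search_clips()` is the function from the sketch above.

```python
# Sketch of a composed (image + text) query reusing search_clips() from the previous
# example. The simple weighted average is illustrative only, not the model's real
# fusion mechanism; embed_image() and embed_text() are hypothetical helpers.
import numpy as np

def compose_query(image_emb: np.ndarray, text_emb: np.ndarray, text_weight: float = 0.5) -> np.ndarray:
    """Blend two embeddings and re-normalize so cosine scores stay comparable."""
    fused = (1 - text_weight) * image_emb + text_weight * text_emb
    return fused / np.linalg.norm(fused)

# player_photo = embed_image("photos/number_23.jpg")      # hypothetical helper
# action_text = embed_text("scored a three-pointer")      # hypothetical helper
# hits = search_clips(compose_query(player_photo, action_text), clip_index)
```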
Real-world use cases other models can't handle:
- Sports analysis: Search thousands of hours of game footage for specific plays, player actions, or strategic moments without watching everything
- Media production: Find the exact B-roll footage you need from your archive by describing what you want in natural language
- Insurance claims: Replace tedious PDF processing by analyzing video evidence of damage, incidents, or repairs
- Security & compliance: Identify critical events across surveillance footage without human review
- Automotive diagnostics: Build AI mechanics that can visually identify problems and provide repair guidance
TwelveLabs' platform pairs two models: Marengo (for search and retrieval with natural language queries) and Pegasus (for generating text outputs like summaries, chapters, or classifications). Both are now available on Amazon Bedrock and through TwelveLabs' APIs; you can also try them out in the playground here.
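If you're calling these models through Bedrock, the boto3 flow looks roughly like the sketch below. The model ID and request-body fields are placeholders rather than the confirmed schema, so check the Bedrock model catalog for the exact identifiers and request format.

```python
# Rough sketch of invoking a TwelveLabs generation model (Pegasus) through Amazon
# Bedrock with boto3. The MODEL_ID and request-body fields are placeholders, not
# the confirmed schema; consult the Bedrock model catalog for the real values.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "twelvelabs.pegasus-..."  # placeholder: look up the exact ID in Bedrock

request_body = {
    # Placeholder fields: point the model at a video you've already uploaded and
    # ask for a summary. Field names will differ in the actual API.
    "video_source": {"s3_uri": "s3://my-bucket/claims/incident_0142.mp4"},
    "prompt": "Summarize the damage shown in this claim video in three bullet points.",
}

response = bedrock.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps(request_body),
    contentType="application/json",
    accept="application/json",
)

print(json.loads(response["body"].read()))
```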
Bottom line: if your company has video archives gathering digital dust, Marengo 3.0 makes that content searchable, analyzable, and actually useful for the first time.