SHARE

Your Laptop Can Now Clone Anyone's Voice—No GPU Required

Kyutai's Pocket TTS is a 100M-parameter text-to-speech model that runs faster than real-time on your CPU. Clone any voice with just 5 seconds of audio—fully local, fully open-source, and no cloud required.

Written By

Grant Harvey

Jan 16, 2026

3 minute read

Most AI voice models demand expensive GPUs and cloud APIs to generate speech. Not ideal if you're building a voice assistant or just want to clone your voice without burning through compute credits.

Kyutai just released Pocket TTS, a text-to-speech model so small (100M parameters) it runs faster than real-time on your CPU—no fancy GPU needed. Watch the demo here.

The model delivers high-quality voice cloning using just 5 seconds of audio. Give it 5 seconds of someone's voice, and it'll clone their tone, accent, emotion, and even the room acoustics and microphone quality. Kinda like how your nephew can do a perfect impression of that one annoying TikTok video on repeat, so now you can do it too. Anyone else’s extended family ban the phrase “6 7” after last year’s Thanksgiving?

Once the voice is cloned, it then generates speech in that voice at 6x faster than real-time: all running locally on a standard laptop processor.

The numbers speak for themselves:

Best-in-class accuracy: Lowest Word Error Rate (1.84%) among competitors—including models 7x larger.
Truly portable: Runs on Apple M3 or Intel Core Ultra CPUs without dedicated graphics.
Open everything: Fully open-source under MIT license with full training code and 88k hours of public data.

‍

The breakthrough comes from Continuous Audio Language Models (CALM), a new framework that predicts audio directly instead of converting it to discrete tokens first. This eliminates the computational bottleneck that made previous TTS models GPU-dependent.

Traditional voice AI follows a three-step process:

Convert audio to discrete tokens.
Run the model.
Decode back to audio.

Each step adds overhead. It’s faster to process, but you lose quality with every cut.

CALM (Continuous Audio Language Models) skips chunking and even tokenization entirely, predicting smooth, uncompressed representations of sound. It's like describing a sunset with infinite color gradients instead of, y’know, 8 crayons.

Why this matters: Voice AI just became accessible to any developer (or even you) with a laptop. Building voice assistants, audiobook narrators, or accessibility tools no longer requires expensive infrastructure (or an expensive ElevenLabs subscription, tho don’t cry for them; they just hit $330M in ARR, which = annualized recurring revenue). This is part of a broader shift toward edge AI that we’ll see get even bigger this year: powerful models getting small enough to run locally while maintaining quality.

What you can do today that was impossible yesterday:

A solo game developer adds 50 unique character voices without hiring a single actor—or paying for cloud API calls
A podcaster on a flight edits an episode and re-records a flubbed line in their exact voice, no internet required
Someone with ALS banks their voice on a laptop before it deteriorates, keeping their identity in a private file they control
A language teacher creates pronunciation guides in their own voice across 200 vocabulary words in an afternoon
A content creator in rural Montana with spotty WiFi produces professional voiceovers without uploading anything to the cloud

The privacy angle matters most. Until now, voice cloning meant sending audio to someone else's servers. Medical dictation, legal depositions, confidential business communications; all required trusting a third party. Now? Your voice never leaves your machine.

Over the next year, expect more foundation models to follow this pattern: smaller, faster, and designed for devices you already own (led by the new partnership between Apple and Google). Developers can start using Pocket TTS immediately; if you wanna try it yourself, the full technical report includes setup instructions and voice samples.

Grant Harvey

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

Your Laptop Can Now Clone Anyone's Voice—No GPU Required

Grant Harvey

Company

Categories