Hands On With Speak's AI Language Tutor: Stop Memorizing Words, And Start Building "Micro-Fluency."

Ever feel like you’ve been "learning" a language for a decade but still freeze up when a waiter actually asks you a question? You’re not alone. The gap between knowing vocabulary and having functional conversational ability is massive—and it’s where most apps fail.

We just wrapped a deep-dive livestream with Andrew Hsu, Co-founder and CTO of Speak, and walked away with a completely new perspective on how AI is finally solving the "silent learner" problem.

What follows is everything we learned: the philosophy behind their "Learn, Drill, Apply" method, the exact mechanics of their new Winter Release, and why major global enterprises are ditching traditional language classes for AI tutors.

Fair warning: we geek out on voice AI tech here, too (it's the best part IMO!).

Also, shout out to Andrew for dealing with our live tech hiccups like a pro! You’re a legend for that, Andrew!

The TL;DR: if you only have 3 mins, read this...
The Live Stream Itself...
The Winter Release: A New Way to Learn
"Vibe Coding" for Language?
The B2B Unlock: Customizing the Curriculum
The Technical Reality Check on Voice AI
Why It Matters
Key Takeaways: What To Do Next
Q&A: Andrew Hsu on the Future of Speak

The TL;DR: if you only have 3 mins, read this...

IMO, the most important takeaway from the stream wasn't just the tech—it was the pedagogy. Andrew introduced us to the concept of "Functional Fluency" over perfection. It’s not about getting the grammar 100% right; it’s about reducing the latency between your brain and your mouth.

The Speak Method (14:45): Speak ditches the overly "gamified translation" model for a three-step loop: Learn (concept), Drill (rapid-fire repetition), and Apply (roleplay).
The "Tutor Lesson" (15:26): This is the new killer feature. Instead of watching a static video, an AI tutor teaches you a concept and immediately makes you say it out loud, correcting pronunciation in real-time using multimodal models.
The B2B Pivot (40:24): This isn't just for tourists. Andrew revealed how companies like Samsung and LG use Speak to generate hyper-custom curriculums based on employees' actual meeting calendars (so get your boss to pay for it for your international meetings!).
The Future of Voice (44:23): We dug into the limitations of current AI. Andrew was refreshingly honest: audio models are "one or two generations behind text models," specifically when it comes to "code-switching" (mixing two languages in one sentence).

The Live Stream Itself...

Timestamps for Our Top Takeaways

If you only have a few minutes or want to jump to the key moments, here are our favorite parts.

The Tech & The Winter Release

(05:06) Pre-LLM Origins: Speak didn't jump on the ChatGPT bandwagon; they started in 2016 building deep learning models for accent detection before Transformers took over.
(15:26) The "Tutor Lesson" Demo: Andrew demos the new interactive lesson format. The AI tutor asks him to say "Mucho gusto," detects a pronunciation error on the "ch" sound, and forces him to try again.
(24:40) Immersive Roleplay: We saw a "Bad Bunny" themed scenario where the user practices Puerto Rican slang. The AI adapts the difficulty based on user performance.
(33:30) "Speak Tutor" Agent: You can now summon a floating ghost icon (the Speak Tutor) to generate on-the-fly lessons. Andrew asked it: "I'm going to Mexico City next week. Make me a lesson to practice ordering at a taco truck."

Business & Future Vision

(39:35) From Consumer to Enterprise: Why Speak is exploding in the B2B market (Speak for Business), enabling customized L&D training for global workforces.
(44:07) The Frontier of Voice AI: Andrew breaks down the hardest technical challenges: natural turn-taking (knowing when to interrupt) and handling mixed-language input.
(51:50) Beyond Language: While focused on language now, Andrew admits the underlying "adaptive learning engine" is designed to eventually teach any subject.
(53:46) Speak for Sign Language?! Someone asked Andrew if they could use Speak for sign language, and his answer spoke volumes. While they didn't have any concrete plans for this (at least not any he let on), he did mention a few ways you could do it, and seemed open to the idea... let's hope they do it, because that would be awesome!

The Winter Release: A New Way to Learn

Let’s dive into the new features dropped in the Winter Release (available now). Andrew walked us through the three pillars of the new experience:

1. The Learning Path: This isn't your standard "tree" of lessons. Speak has reorganized its curriculum into Units containing Learning Loops.

Old Way: Watch a video explanation, answer a multiple-choice question.
New Way (Tutor Lessons): An AI avatar explains the concept and immediately demands you speak. It uses OpenAI’s real-time API capabilities to listen not just for the right words, but for the right sounds.

"We think sometimes users want to see a human face... but a lot of times, especially for custom generative courses... that has to be AI generated." — Andrew Hsu

2. Immersive Roleplay (The "Free Talk" Tab): This was the coolest demo. Andrew jumped into a scenario to practice slang.

Dynamic Difficulty: If you’re a beginner, it guides you with structured turns. If you’re advanced, it enters "Auto Mode" (like ChatGPT Voice) for a fluid, back-and-forth conversation.
Real-world Prep: As Andrew put it: "Imagine if you could practice exactly whatever custom scenario that you want 15 times before you went and actually did it."

3. Smart Review: Speak tracks every mistake you make across every lesson. The Smart Review tab uses a spaced-repetition algorithm (similar to Anki but powered by LLMs) to surface the exact phrases you’re struggling with, right before you forget them.
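
To make the spaced-repetition idea concrete, here is a minimal Python sketch. It is not Speak's algorithm, which wasn't described on the stream. It assumes the simplest possible rule (double the gap after a correct answer, reset to one day after a miss), and the names ReviewItem, record_review, and due_items are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ReviewItem:
    phrase: str
    due: date               # next day this phrase should be reviewed
    interval_days: int = 1  # current gap between reviews (assumed starting value)


def record_review(item: ReviewItem, correct: bool, today: date) -> ReviewItem:
    """Double the interval after a correct answer; reset it to one day after a miss."""
    interval = item.interval_days * 2 if correct else 1
    return ReviewItem(item.phrase, today + timedelta(days=interval), interval)


def due_items(items: list[ReviewItem], today: date) -> list[ReviewItem]:
    """Return the items whose due date has arrived, most overdue first."""
    return sorted((i for i in items if i.due <= today), key=lambda i: i.due)
```

In this sketch, a phrase you keep getting right drifts out to 2, 4, 8 days between reviews, while a phrase you miss comes back the next day. An LLM-powered version would presumably also decide what counts as a "miss," but that part is a guess on our end.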

"Vibe Coding" for Language?

We talked a lot about how users interact with the app. Andrew described a phenomenon we're calling "Vibe Learning." In the Drill phase, the app forces you to repeat variations of a sentence ("How's the family?", "How is your mother?", "How are they?") dozens of times. The goal isn't intellectual understanding; it's muscle memory.

"We’re actually optimizing for communication... We're trying to get you to speak out loud and have that feel as automatic as possible, even if you're making mistakes."

This aligns perfectly with the "Vibe Coding" concept we see in engineering—getting into a flow state where the tool (or the language) becomes an extension of your thought process, rather than a syntax puzzle you have to solve manually.

The B2B Unlock: Customizing the Curriculum

One of the most surprising insights was Speak's massive traction with enterprises like Samsung and LG. Standard corporate language training is notoriously bad: generic videos about "business meetings" that nobody watches. Speak for Business changes the game by ingesting context.

The Workflow: An employee connects their calendar or job description.
The Output: Speak generates a custom roleplay scenario based on tomorrow's meeting.
The Result: Employees aren't learning "Business English"; they are rehearsing for their actual job.
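
For the builders reading along, the workflow above can be sketched in a few lines of Python. This is our own illustration, not Speak's pipeline (Andrew didn't share implementation details): the Meeting class, the roleplay_prompt function, and the prompt wording are all hypothetical, and the step that actually sends the prompt to a model is left out.

```python
from dataclasses import dataclass


@dataclass
class Meeting:
    """One calendar entry (hypothetical shape, not a real calendar API)."""
    title: str
    attendees: list[str]
    language: str


def roleplay_prompt(meeting: Meeting, job_role: str) -> str:
    """Turn one calendar entry into instructions for a roleplay partner."""
    people = ", ".join(meeting.attendees)
    return (
        f"You are playing the other participants ({people}) in a meeting "
        f"titled '{meeting.title}'. The learner is a {job_role}. "
        f"Hold the conversation in {meeting.language} and stay on the "
        f"meeting's topic."
    )


tomorrow = Meeting("Q3 supplier review", ["Dana", "Lee"], "English")
print(roleplay_prompt(tomorrow, "procurement manager"))
```

The point of the sketch is how little context it takes: a title, a few names, and a job role are enough to replace a generic "business meeting" lesson with a rehearsal for a specific one.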

The Technical Reality Check on Voice AI

We tried to press Andrew on when we'll have a universal translator or a perfect AI tutor. His answer was grounded in engineering reality. He highlighted that Voice AI is lagging behind Text AI.

Latency: Humans interrupt each other constantly. AI still struggles with "turn-taking"—knowing if a pause is a comma or a period.
Code-Switching: This is a "frontier problem." If a user says, "¿Cómo se dice 'turn left' en español?", the model has to process English and Spanish simultaneously without hallucinating the accent.
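
To see why the "comma or period" problem is hard, here is a deliberately naive Python sketch of turn-taking: wait for speech, then declare the turn over once silence lasts long enough. This is a textbook silence-timeout heuristic, not what Speak or any particular voice model does. The function name, the loudness units, and the default thresholds are all assumptions for illustration.

```python
def end_of_turn_frame(frames, silence_level=0.01, frame_ms=20, pause_ms=700):
    """Return the index of the frame where the turn is judged over, or None.

    frames is a sequence of per-frame loudness values (hypothetical units).
    A turn ends once speech has been heard and silence then lasts pause_ms.
    """
    needed = pause_ms // frame_ms  # consecutive quiet frames required
    heard_speech = False
    quiet_run = 0
    for index, level in enumerate(frames):
        if level > silence_level:
            heard_speech = True
            quiet_run = 0          # any speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return index
    return None


# Speech, a 400 ms thinking pause, more speech, then a long silence.
audio = [0.5] * 10 + [0.0] * 20 + [0.5] * 10 + [0.0] * 40
print(end_of_turn_frame(audio))                # 74: waits through the short pause
print(end_of_turn_frame(audio, pause_ms=300))  # 24: cuts the speaker off mid-thought
```

The tradeoff is right there in the one parameter: a long pause_ms feels sluggish, a short one interrupts learners who are still searching for a word. No fixed timeout gets both right, which is one way to read Andrew's point.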

Why It Matters

The results speak for themselves. Speak has hit $100M ARR with 15M+ downloads. They are proving that AI isn't just a "feature" in education—it's the entire product.

"We basically realized that what we'd be able to build using these powerful tools was a better way of helping people get to fluency... [it] allows us to create a totally new product category here of AI native language learning tools."

Key Takeaways: What To Do Next

If You’re a User Wanting to Learn a Language:

Download the App: It’s on iOS and Android.
Try the "Speak Tutor": Tap the ghost icon and ask for a custom scenario (e.g., "I'm fighting with a taxi driver in Paris").
Don't be shy: The app is designed to be a safe space. Speak loudly.

If You’re in EdTech / AI:

Focus on Latency: The magic of Speak is the speed of the feedback loop.
Build for "Micro-Fluency": Don't try to teach everything. Teach a specific, usable skill (like ordering coffee) and drill it until it sticks.
Personalize Context: The B2B success proves that generic content is dead. The future is curriculum generated just for you.

Where to Learn More

Speak Blog: Read about the Winter Release
Andrew on Twitter/X: Follow his journey building the tech.
The Neuron: Subscribe for more AI deep dives

Q&A: Andrew Hsu on the Future of Speak

The following transcript has been lightly edited for clarity and length.

Question: What was the thesis when you started the company back in 2016?

Andrew Hsu: My co-founder and I were really interested in deep learning. This was pre-LLM, but speech recognition systems were starting to come back. We realized that if you look at the curve of progress, models were clearly going to become superhuman in 5 or 10 years. We wanted to use these tools to build a better way of helping people get to fluency—actually speaking out loud and getting immediate feedback, which previously required hiring a human.

Question: Does it switch to "easy mode" if it detects you're having trouble, like if you have a lisp?

Andrew Hsu: That's a good question. What we're optimizing for is communication, not harsh judgment. We want you to speak out loud and have it feel automatic. We tune the speech systems so that it is quite easy to match the words if you are communicating in good faith. We do have a separate pronunciation coach for phone-level feedback, but the drills are designed to get you into a flow state.

Question: Can it help non-native speakers of English speak more comfortably in professional settings?

Andrew Hsu: Yes, absolutely. We started as a consumer app, but in South Korea, big companies like Samsung and LG started using us for employee language training. Our B2B product, Speak for Business, allows us to generate custom curriculum based on your specific job role or even your meeting calendar. It’s hyper-customized training that wasn’t possible before AI.

Question: As someone building voice AI, what are the current limits? What hasn't been unlocked yet?

Andrew Hsu: Audio model intelligence is still one or two generations behind text models. The really hard thing is turn-taking—knowing when to respond. Even humans talk over each other, but the latency in AI makes it noticeable. Also, "code-switching"—speaking two languages in one sentence—is a frontier problem for both speech generation and recognition.

Question: Would you ever expand beyond language to teach other subjects?

Andrew Hsu: The long-term vision is to build a superhuman tutor. Our adaptive learning engine isn't specific to language; it tracks what you know and what you forget. Theoretically, we could expand to other subjects. But right now, there is such an enormous prize in winning language learning that we are hyper-focused on that.