The Whisper Family, Explained
OpenAI released Whisper in September 2022 as an open-source speech recognition model. Unlike Siri or Google's speech API, Whisper was trained on 680,000 hours of multilingual audio from the web — making it remarkably accurate across accents, background noise, and technical vocabulary. Since then, OpenAI has released several versions:

| Model | Parameters | Speed | Accuracy | Released |
|---|---|---|---|---|
| Whisper tiny | 39M | Very fast | Low | Sep 2022 |
| Whisper base | 74M | Fast | Fair | Sep 2022 |
| Whisper small | 244M | Medium | Good | Sep 2022 |
| Whisper medium | 769M | Slow | Very good | Sep 2022 |
| Whisper large-v2 | 1.55B | Very slow | Excellent | Dec 2022 |
| Whisper large-v3 | 1.55B | Very slow | Best | Nov 2023 |
| Whisper large-v3-turbo | 809M | Fast | Near-best | Oct 2024 |
What Makes large-v3-turbo Different
The "turbo" in the name comes from a technique called knowledge distillation. Here's the idea:

Start with the full model
Whisper large-v3 has a 32-layer encoder and a 32-layer decoder. The encoder converts audio into internal representations. The decoder converts those representations into text.
Keep the encoder, shrink the decoder
The turbo variant keeps all 32 encoder layers (the "listening" part) but reduces the decoder from 32 layers to just 4 (the "writing" part). The encoder does the heavy lifting — it needs to understand speech. The decoder just needs to output the right tokens.
Train the small decoder to mimic the big one
The 4-layer decoder is trained to produce the same outputs as the original 32-layer decoder. It loses some nuance but retains 99% of the accuracy — at a fraction of the computational cost.
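The steps above can be sketched in miniature. This is a toy illustration of distillation, not the real training recipe: the actual turbo training matches a 4-layer decoder's token distributions to large-v3's outputs, while here a "student" with a single weight learns to reproduce a frozen "teacher" function by gradient descent on the mismatch between their outputs.

```python
import random

random.seed(0)

def teacher(x):
    # Stands in for the frozen 32-layer decoder whose behavior we copy.
    return 2.0 * x

w = 0.0    # the student's single parameter (stands in for the 4-layer decoder)
lr = 0.1
data = [random.uniform(-1, 1) for _ in range(100)]

for _ in range(200):
    x = random.choice(data)
    err = w * x - teacher(x)   # distillation loss: mismatch with the teacher
    w -= lr * err * x          # gradient step on 0.5 * err**2

print(round(w, 2))  # converges to ~2.0
```

The key property carries over: the student is never shown "correct" transcripts, only the teacher's outputs, so it inherits the teacher's behavior at a fraction of the size.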
How It Runs on Your Mac
Running an 809-million-parameter model in real time sounds impossible for a laptop. But Apple Silicon Macs have a secret weapon: the Neural Engine.

Audio capture
Your Mac's microphone captures audio via AVFoundation. The audio is chunked into 30-second segments (Whisper's native input size).
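The chunking step is simple enough to sketch. MetaWhisp does this natively in Swift; the Python below is only an illustration, assuming 16 kHz mono samples (the rate Whisper expects) with the final chunk zero-padded to a full 30 seconds.

```python
SAMPLE_RATE = 16_000            # Whisper's expected input rate
CHUNK_SAMPLES = SAMPLE_RATE * 30  # 30-second native window = 480,000 samples

def chunk_audio(samples):
    """Split raw samples into 30 s chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad to 30 s
        chunks.append(chunk)
    return chunks

audio = [0.0] * (SAMPLE_RATE * 45)   # 45 seconds of silence
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]))   # 2 480000
```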
Mel spectrogram
The raw audio waveform is converted into a mel spectrogram — a visual representation of sound frequencies over time. This is what the model actually "sees."
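The geometry of that spectrogram follows from Whisper's published hyperparameters: 16 kHz input, 400-sample (25 ms) analysis windows, and a 160-sample (10 ms) hop, so each 30-second chunk becomes 3,000 frames of mel-scaled frequency bins. The sketch below computes those numbers, plus the standard Hz-to-mel mapping that spaces the frequency bins the way human hearing does.

```python
import math

SAMPLE_RATE, N_FFT, HOP = 16_000, 400, 160
frames = (30 * SAMPLE_RATE) // HOP   # 3,000 frames per 30 s chunk

def hz_to_mel(f):
    # HTK mel-scale formula: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(frames)                  # 3000
print(round(hz_to_mel(1000)))  # 1000 -- the scale is anchored at 1 kHz
```

The encoder then downsamples those 3,000 frames by a factor of two, so the model works with 1,500 positions per chunk.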
Neural Engine inference
The spectrogram is fed through the encoder (32 layers) and decoder (4 layers) on the Neural Engine. This happens in milliseconds — not seconds.
Token decoding
The model outputs a sequence of tokens that are decoded into text. Language is auto-detected from the first few seconds of audio.
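Conceptually, decoding is a lookup from token IDs to text, with special control tokens (such as the language marker Whisper emits after auto-detection) stripped out. The toy vocabulary below is hypothetical; Whisper's real tokenizer has roughly 51,000 byte-pair entries.

```python
# Hypothetical 5-entry vocabulary; <|en|> mimics Whisper's language token
# and <|eot|> its end-of-transcript marker.
vocab = {0: "<|en|>", 1: "Hello", 2: ",", 3: " world", 4: "<|eot|>"}
tokens = [0, 1, 2, 3, 4]

# Drop control tokens, concatenate the rest.
text = "".join(vocab[t] for t in tokens if not vocab[t].startswith("<|"))
print(text)  # Hello, world
```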
Auto-paste
The transcribed text is inserted directly into the active application via system-level accessibility — no clipboard, no intermediate window.
The key insight
Because everything runs on the Neural Engine, your CPU and GPU stay free for other work. Dictating while coding, browsing, or video calling has virtually no impact on system performance.
Why On-Device Matters
Cloud speech recognition (Google, Amazon Transcribe, Otter.ai) sends your audio to remote servers for processing. On-device processing with Whisper keeps everything local.
| Factor | Cloud | On-device (Whisper) |
|---|---|---|
| Privacy | Audio sent to servers | Never leaves your Mac |
| Latency | 200-500ms network delay | Near-instant |
| Offline | Requires internet | Works in airplane mode |
| Cost | Per-minute pricing | Free forever |
| Data retention | May be stored/used | Zero retention |
| Accuracy | Excellent (large models) | Excellent (large-v3-turbo) |
Whisper vs. Other Speech Models
Whisper isn't the only speech recognition model. Here's how it compares to the alternatives:

| Model | Open-source | On-device | Languages | Best for |
|---|---|---|---|---|
| Whisper large-v3-turbo | Yes | Yes (Apple Silicon) | 30+ | General dictation, multilingual |
| Apple Speech (Siri) | No | Partial | 20+ | Short commands, Siri integration |
| Google Speech-to-Text | No | No (cloud only) | 125+ | Enterprise, real-time captions |
| Amazon Transcribe | No | No (cloud only) | 100+ | AWS integration, call centers |
| Meta MMS | Yes | Possible (GPU) | 1,000+ | Low-resource languages |
| Deepgram Nova-2 | No | No (cloud only) | 36 | Real-time streaming, API |
How MetaWhisp Uses Whisper
MetaWhisp runs Whisper large-v3-turbo through WhisperKit, optimized specifically for Apple Silicon. On top of the base transcription, MetaWhisp adds:

Processing modes
Raw gives you verbatim Whisper output. Correct removes filler words and fixes grammar. Rewrite transforms casual speech into polished text. Translate outputs text in a different language.
Auto-learning corrections dictionary
Whisper sometimes misses domain-specific terms (company names, jargon, acronyms). MetaWhisp learns your corrections and applies them automatically.
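A corrections dictionary can be sketched as whole-word, case-insensitive replacement (MetaWhisp's actual matching logic is not public, and the terms below are made up for illustration). Matching on word boundaries avoids corrupting substrings inside longer words.

```python
import re

# Hypothetical learned corrections: misheard phrase -> intended term.
corrections = {"meta whisp": "MetaWhisp", "whisper kit": "WhisperKit"}

def apply_corrections(text):
    """Replace each known misheard phrase, whole words only, any casing."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)
    return text

print(apply_corrections("I dictated this with meta whisp"))
# I dictated this with MetaWhisp
```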
Real-time translation
Speak in one of 30+ languages and get text output in another. Whisper's multilingual training makes cross-language transcription remarkably accurate.
Zero-cloud architecture
The entire Whisper inference pipeline runs on your Mac. Raw and Correct modes never touch the internet. Rewrite and Translate modes use your own API key — MetaWhisp never sees your data.
Frequently Asked Questions
Is Whisper large-v3-turbo free to use?
Yes. Whisper is open-source under the MIT license. You can use it for any purpose — personal, commercial, or academic — at no cost. MetaWhisp bundles it for free.
How much storage does the model need?
The Whisper large-v3-turbo model is approximately 1.5 GB. It downloads once on first launch and is stored locally on your Mac.
Does it work on Intel Macs?
No. Whisper large-v3-turbo requires Apple Silicon (M1 or later) to run at real-time speeds via the Neural Engine. Intel Macs lack the dedicated ML hardware needed for local inference.
Is it as accurate as cloud speech recognition?
For most use cases, yes. Whisper large-v3-turbo matches or exceeds Google Speech-to-Text and Amazon Transcribe on standard benchmarks. It's particularly strong with accented speech, background noise, and technical vocabulary.
What languages does Whisper support?
Whisper was trained on 99 languages. The large-v3-turbo variant provides high-quality transcription for 30+ languages and can auto-detect which language you're speaking.