Whisper large-v3-turbo

[Diagram: 🎤 audio input → ⚙️ encoder (32 layers) → decoder (4 layers, turbo) → 📝 text output]

809M parameters · 8x faster than large-v3 · 30+ languages
If you've used MetaWhisp, SuperWhisper, or MacWhisper, you've used Whisper. It's the open-source speech recognition model from OpenAI that quietly powers most of the best voice-to-text tools on Mac. But what actually is Whisper large-v3-turbo? How is it different from other Whisper models? And why does it matter that it runs on your Mac instead of in the cloud?
TL;DR: Whisper large-v3-turbo is a distilled version of OpenAI's Whisper large-v3 that keeps 99% of the accuracy while running 8x faster. Combined with Apple's Neural Engine, it enables real-time on-device transcription without sending your voice to any server.
---

The Whisper Family, Explained

OpenAI released Whisper in September 2022 as an open-source speech recognition model. Unlike Siri or Google's speech API, Whisper was trained on 680,000 hours of multilingual audio from the web — making it remarkably accurate across accents, background noise, and technical vocabulary. Since then, OpenAI has released several versions:
| Model | Parameters | Speed | Accuracy | Released |
|---|---|---|---|---|
| Whisper tiny | 39M | Very fast | Low | Sep 2022 |
| Whisper base | 74M | Fast | Fair | Sep 2022 |
| Whisper small | 244M | Medium | Good | Sep 2022 |
| Whisper medium | 769M | Slow | Very good | Sep 2022 |
| Whisper large-v2 | 1.55B | Very slow | Excellent | Dec 2022 |
| Whisper large-v3 | 1.55B | Very slow | Best | Nov 2023 |
| Whisper large-v3-turbo | 809M | Fast | Near-best | Oct 2024 |
The pattern is clear: bigger models are more accurate but slower. The turbo variant breaks this tradeoff.

---

What Makes large-v3-turbo Different

The "turbo" in the name comes from a technique called knowledge distillation. Here's the idea:
1. Start with the full model. Whisper large-v3 has a 32-layer encoder and a 32-layer decoder. The encoder converts audio into internal representations; the decoder converts those representations into text.

2. Keep the encoder, shrink the decoder. The turbo variant keeps all 32 encoder layers (the "listening" part) but reduces the decoder from 32 layers to just 4 (the "writing" part). The encoder does the heavy lifting — it needs to understand speech. The decoder just needs to output the right tokens.

3. Train the small decoder to mimic the big one. The 4-layer decoder is trained to produce the same outputs as the original 32-layer decoder. It loses some nuance but retains 99% of the accuracy — at a fraction of the computational cost.

The result: 809 million parameters instead of 1.55 billion. About half the size, roughly 8x faster inference, and almost identical accuracy.
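The "mimic the big one" step can be sketched as soft-target matching. The snippet below is a minimal, illustrative version of a distillation loss — not OpenAI's actual training code — assuming the standard formulation: a KL divergence between the teacher's and student's token probability distributions, which goes to zero when the small decoder reproduces the large decoder's outputs.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student token distributions.

    Minimizing this pushes the small (4-layer) decoder to reproduce
    the large (32-layer) decoder's output probabilities.
    """
    p = softmax(teacher_logits, temperature)  # teacher = big decoder
    q = softmax(student_logits, temperature)  # student = small decoder
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student that matches the teacher has zero loss;
# a mismatched one has a larger loss.
teacher = np.array([4.0, 1.0, 0.5, 0.1])
matched = distillation_loss(teacher.copy(), teacher)
mismatched = distillation_loss(np.array([0.1, 4.0, 1.0, 0.5]), teacher)
print(matched < mismatched)  # True
```

The temperature parameter is a common distillation knob (it softens the distributions so the student also learns the teacher's "near-miss" preferences); the specific value here is arbitrary.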
[Chart: speed and accuracy compared — large-v3 (full) vs. smaller variants]
---

How It Runs on Your Mac

Running an 809-million-parameter model in real time sounds impossible for a laptop. But Apple Silicon Macs have a secret weapon: the Neural Engine.
[Diagram: Apple M-series chip — CPU (general tasks), GPU (graphics), Neural Engine (AI/ML inference)]
The Neural Engine handles Whisper inference — from 11 TOPS on M1 up to 38 TOPS on M4
Every Apple Silicon Mac (M1 and later) includes a dedicated Neural Engine — a specialized processor designed exclusively for machine learning workloads. It's not the CPU. It's not the GPU. It's a separate block on the chip, optimized for exactly the kind of math neural networks need.

MetaWhisp uses WhisperKit — a Swift framework from Argmax that converts the Whisper model into Apple's Core ML format and runs it directly on the Neural Engine. The pipeline works like this:
1. Audio capture. Your Mac's microphone captures audio via AVFoundation. The audio is chunked into 30-second segments (Whisper's native input size).

2. Mel spectrogram. The raw audio waveform is converted into a mel spectrogram — a visual representation of sound frequencies over time. This is what the model actually "sees."

3. Neural Engine inference. The spectrogram is fed through the encoder (32 layers) and decoder (4 layers) on the Neural Engine. This happens in milliseconds — not seconds.

4. Token decoding. The model outputs a sequence of tokens that are decoded into text. Language is auto-detected from the first few seconds of audio.

5. Auto-paste. The transcribed text is inserted directly into the active application via system-level accessibility — no clipboard, no intermediate window.

The key insight

Because inference runs on the Neural Engine, your CPU and GPU stay free for other work. Dictating while coding, browsing, or video calling has negligible impact on system performance.

---

Why On-Device Matters

Cloud speech recognition (Google, Amazon Transcribe, Otter.ai) sends your audio to remote servers for processing. On-device processing with Whisper keeps everything local.
☁️ Cloud transcription: 🎤 your voice → 🌐 remote server → 📝 text (+ latency)

💻 On-device (Whisper + Neural Engine): 🎤 your voice → 💻 your Mac → 📝 text (instant)
The practical differences:

| Factor | Cloud | On-device (Whisper) |
|---|---|---|
| Privacy | Audio sent to servers | Never leaves your Mac |
| Latency | 200-500 ms network delay | Near-instant |
| Offline | Requires internet | Works in airplane mode |
| Cost | Per-minute pricing | Free forever |
| Data retention | May be stored/used | Zero retention |
| Accuracy | Excellent (large models) | Excellent (large-v3-turbo) |
For users handling sensitive information — lawyers, therapists, medical professionals, journalists — on-device processing isn't a nice-to-have. It's a requirement. Read more about how MetaWhisp handles privacy.

---

Whisper vs. Other Speech Models

Whisper isn't the only speech recognition model. Here's how it compares to the alternatives:
| Model | Open-source | On-device | Languages | Best for |
|---|---|---|---|---|
| Whisper large-v3-turbo | Yes | Yes (Apple Silicon) | 30+ | General dictation, multilingual |
| Apple Speech (Siri) | No | Partial | 20+ | Short commands, Siri integration |
| Google Speech-to-Text | No | No (cloud only) | 125+ | Enterprise, real-time captions |
| Amazon Transcribe | No | No (cloud only) | 100+ | AWS integration, call centers |
| Meta MMS | Yes | Possible (GPU) | 1,000+ | Low-resource languages |
| Deepgram Nova-2 | No | No (cloud only) | 36 | Real-time streaming, API |
Whisper's unique advantage: it's the only model that combines state-of-the-art accuracy, full open-source availability, and practical on-device performance on consumer hardware.

---

How MetaWhisp Uses Whisper

MetaWhisp runs Whisper large-v3-turbo through WhisperKit, optimized specifically for Apple Silicon. On top of the base transcription, MetaWhisp adds:

Processing modes

Raw gives you verbatim Whisper output. Correct removes filler words and fixes grammar. Rewrite transforms casual speech into polished text. Translate outputs text in a different language.
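MetaWhisp's actual processing modes are not public, but the kind of cleanup a "Correct"-style pass performs can be illustrated with a small sketch. Everything here — the `FILLERS` pattern, the `correct` function, the specific filler list — is a hypothetical illustration, not the app's implementation.

```python
import re

# Hypothetical filler-word pattern; a real pass would be far more
# extensive and context-aware.
FILLERS = re.compile(r"\b(?:um+|uh+|erm+|you know)\b[,.]?\s*",
                     re.IGNORECASE)

def correct(raw: str) -> str:
    """Strip common filler words, then tidy spacing and capitalization."""
    text = FILLERS.sub("", raw)                 # drop fillers
    text = re.sub(r"\s{2,}", " ", text).strip() # collapse leftover spaces
    return text[:1].upper() + text[1:] if text else text

print(correct("um so the deadline uh moved to Friday"))
# → "So the deadline moved to Friday"
```

Note the `\b` word boundaries: they keep the pattern from eating the "um" inside words like "umbrella."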

📚

Auto-learning corrections dictionary

Whisper sometimes misses domain-specific terms (company names, jargon, acronyms). MetaWhisp learns your corrections and applies them automatically.
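The mechanism can be pictured as a learned substitution table applied after transcription. This is a hypothetical sketch — the `learn`/`apply_corrections` names and the plain-dict storage are illustrative assumptions, not MetaWhisp's internals.

```python
import re

# Learned (misheard -> intended) pairs; in the real app these would be
# captured from the user's manual edits and persisted.
corrections = {}

def learn(misheard: str, intended: str) -> None:
    """Record a user's fix so it is applied to future transcripts."""
    corrections[misheard.lower()] = intended

def apply_corrections(transcript: str) -> str:
    """Case-insensitive whole-phrase replacement of known mistakes."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right,
                            transcript, flags=re.IGNORECASE)
    return transcript

learn("whisper kit", "WhisperKit")
print(apply_corrections("I integrated whisper kit yesterday"))
# → "I integrated WhisperKit yesterday"
```

Because the match is case-insensitive and word-bounded, the same rule fixes "Whisper Kit," "whisper kit," and "WHISPER KIT" without touching longer words that happen to contain the phrase.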

🌐

Real-time translation

Speak in one of 30+ languages and get text output in another. Whisper's multilingual training makes cross-language transcription remarkably accurate.

🔒

Zero-cloud architecture

The entire Whisper inference pipeline runs on your Mac. Raw and Correct modes never touch the internet. Rewrite and Translate modes use your own API key — MetaWhisp never sees your data.

---

Test Your Knowledge

Think you understood everything? Take this quick quiz to find out.

Interactive widget · Quiz: 5 questions on Whisper large-v3-turbo — its architecture (decoder layers, 809M parameters), where it runs (Apple Neural Engine, on-device), performance, and hardware requirements.

Sample question (1 of 5): How many decoder layers does Whisper large-v3-turbo have?
Answer: 4 — the "turbo" variant reduces the decoder from 32 layers to just 4 through knowledge distillation, while keeping all 32 encoder layers intact.
---

Frequently Asked Questions

Is Whisper large-v3-turbo free to use?

Yes. Whisper is open-source under the MIT license. You can use it for any purpose — personal, commercial, or academic — at no cost. MetaWhisp bundles it for free.

How much storage does the model need?

The Whisper large-v3-turbo model is approximately 1.5 GB. It downloads once on first launch and is stored locally on your Mac.
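The ~1.5 GB figure is consistent with the parameter count: 809 million weights stored at 16-bit precision. A quick back-of-the-envelope check (assuming float16 storage; quantized builds can be smaller, and the exact Core ML file size may differ slightly):

```python
params = 809_000_000        # large-v3-turbo parameter count
bytes_per_param = 2         # float16: 2 bytes per weight
size_gb = params * bytes_per_param / 1024**3
print(f"{size_gb:.2f} GB")  # ≈ 1.51 GB
```

By the same arithmetic, the full large-v3 (1.55B parameters) needs roughly twice the storage — about 2.9 GB.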

Does it work on Intel Macs?

No. Whisper large-v3-turbo requires Apple Silicon (M1 or later) to run at real-time speeds via the Neural Engine. Intel Macs lack the dedicated ML hardware needed for local inference.

Is it as accurate as cloud speech recognition?

For most use cases, yes. Whisper large-v3-turbo matches or exceeds Google Speech-to-Text and Amazon Transcribe on standard benchmarks. It's particularly strong with accented speech, background noise, and technical vocabulary.

What languages does Whisper support?

Whisper was trained on 99 languages. The large-v3-turbo variant provides high-quality transcription for 30+ languages and can auto-detect which language you're speaking.

---

Related Reading

- How to Use Dictation on Mac: The Complete 2026 Guide — Step-by-step setup for macOS Dictation and MetaWhisp
- 7 Best Voice-to-Text Apps for Mac in 2026 — Full comparison of every dictation tool for macOS
- On-Device Transcription — How MetaWhisp processes your voice without cloud servers
- Download MetaWhisp — Free, no account required