What Does "On-Device Transcription" Mean?

Most voice-to-text tools record your speech, send the audio to a remote server, process it in the cloud, and return the text. Your voice data travels across the internet, gets stored on third-party infrastructure, and is often used to train models or improve services.

On-device transcription eliminates every step of that. The speech recognition model lives on your Mac. Audio is captured, processed in memory, converted to text, and discarded — all within your machine. The audio never touches a network interface. There is no server to send it to.

This is not a "local cache" that syncs later. It is not a "privacy mode" that still phones home. The entire inference pipeline — from raw audio waveform to final text output — runs locally on Apple Silicon hardware. You can unplug your Ethernet, turn off Wi-Fi, and MetaWhisp works exactly the same.

How It Works: WhisperKit + Neural Engine

Three components work together to turn your speech into text in under a second.

1. Audio Capture and Preprocessing

When you press the hotkey, MetaWhisp begins recording from your Mac's microphone using Apple's AVFoundation framework. The raw audio is captured at 16kHz mono — the format Whisper expects. A voice activity detector (VAD) identifies when you start and stop speaking, so silence is trimmed automatically. The audio buffer lives in RAM and is never written to disk.
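The silence-trimming step can be sketched as a simple energy-based VAD: a frame of samples counts as speech when its RMS energy clears a noise floor. This is an illustration of the technique, not MetaWhisp's actual detector; the `isSpeech` and `trimSilence` names and the 0.02 threshold are invented for the example.

```swift
import Foundation

// Illustrative energy-based voice activity detector (VAD).
// A frame of 16 kHz mono samples is treated as "speech" when its
// root-mean-square energy exceeds a fixed noise threshold.
func isSpeech(frame: [Float], threshold: Float = 0.02) -> Bool {
    guard !frame.isEmpty else { return false }
    let meanSquare = frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)
    return meanSquare.squareRoot() > threshold
}

// Trim leading and trailing silent frames from a recording,
// keeping only the span that contains detected speech.
func trimSilence(frames: [[Float]], threshold: Float = 0.02) -> [[Float]] {
    guard let first = frames.firstIndex(where: { isSpeech(frame: $0, threshold: threshold) }),
          let last = frames.lastIndex(where: { isSpeech(frame: $0, threshold: threshold) })
    else { return [] }
    return Array(frames[first...last])
}
```

A production VAD is usually smarter (spectral features, hangover frames), but the shape is the same: classify frames, then keep the speech span.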

2. WhisperKit Model Inference

WhisperKit is an open-source Swift framework from Argmax that optimizes OpenAI's Whisper model for Apple hardware. It converts the Whisper architecture into Core ML format, which means inference runs directly on the Apple Neural Engine — the dedicated machine learning accelerator built into every Apple Silicon chip. The Neural Engine handles up to 18 trillion operations per second on M3, which is why transcription finishes in under a second even with the large-v3-turbo model.
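Calling WhisperKit looks roughly like the sketch below. It follows the framework's published Swift API, but the exact initializer, the model identifier string, and the result type vary between WhisperKit releases, so treat this as an illustration rather than MetaWhisp's actual code.

```swift
import WhisperKit

// Sketch of the inference step. The "large-v3-turbo" identifier is a
// placeholder; consult the WhisperKit release you use for exact names.
func transcribe(audioPath: String) async throws -> String {
    // Loads the Core ML-converted model; WhisperKit routes inference
    // to the Apple Neural Engine where available.
    let pipe = try await WhisperKit(model: "large-v3-turbo")

    // Run the encoder/decoder pass over the audio file and join the
    // text of the resulting segments.
    let results = try await pipe.transcribe(audioPath: audioPath)
    return results.map(\.text).joined(separator: " ")
}
```

The first call downloads and compiles the model; subsequent calls reuse the cached Core ML artifacts, which is why only the first launch needs a network connection.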

3. Text Output and Auto-Paste

Once inference completes, the text is placed on the system clipboard and automatically pasted into the active application using macOS accessibility APIs. The audio buffer is released from memory. The entire pipeline — capture, preprocess, infer, paste — typically completes in 500-900ms for a 10-second recording. You can use the text in any app: Slack, VS Code, Terminal, Notes, Safari, or any text field on your Mac.
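A minimal sketch of the output step using public macOS APIs: NSPasteboard for the clipboard, then a synthetic Cmd+V via CGEvent, which is one common way to implement auto-paste (MetaWhisp's exact mechanism may differ).

```swift
import AppKit
import CoreGraphics

// Place transcribed text on the clipboard and paste it into the
// frontmost app. Posting synthetic key events requires the
// Accessibility permission (System Settings > Privacy & Security).
func pasteText(_ text: String) {
    // Replace the clipboard contents with the transcription.
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize a Cmd+V key press (0x09 is the virtual key code for 'V').
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 0x09, keyDown: true)
    keyDown?.flags = .maskCommand
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: 0x09, keyDown: false)
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```

Synthesizing Cmd+V (rather than writing characters directly) is why this works in any app that accepts paste, including terminals and full-screen applications.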

Key takeaway: The Neural Engine is the reason on-device transcription is now practical. Previous attempts at local speech recognition relied on the CPU or GPU, which were too slow for real-time use and drained battery. Apple Silicon's dedicated ML hardware changed the equation entirely.

Why On-Device Matters

Running transcription locally is not just a privacy feature. It changes how the tool performs in practice.

Privacy by architecture, not policy. Cloud transcription services ask you to trust their privacy policy. On-device transcription makes the question irrelevant. If audio never leaves your machine, it cannot be intercepted, stored, leaked, subpoenaed, or used for training. This matters for anyone working with medical records, legal documents, financial data, journalistic sources, or proprietary code. Read our full privacy policy — it is short because there is nothing to disclose.

Speed without network dependency. Cloud transcription adds 200-800ms of network latency on top of processing time. On a slow connection or congested Wi-Fi, that becomes 2-5 seconds. On-device transcription has consistent sub-second latency regardless of your network. This matters at airports, coffee shops, rural areas, or any environment where connectivity is unreliable.

Reliability without failure modes. Cloud services go down. APIs hit rate limits. Servers experience latency spikes during peak hours. ISPs have outages. On-device transcription has exactly one dependency: your Mac being turned on. No service degradation, no outages, no "sorry, something went wrong" errors. It works in airplane mode, on a submarine, or in a Faraday cage.

Cost that stays at zero. Cloud transcription APIs charge per minute of audio. Even "free tiers" have limits. On-device transcription has no per-use cost because there is no server to pay for. MetaWhisp is free to download and use without limits. There is no premium tier, no usage cap, and no account required.

On-Device vs. Cloud Transcription

A direct comparison of the two approaches across the dimensions that matter most.

| Dimension | On-Device (MetaWhisp) | Cloud-Based |
| --- | --- | --- |
| Privacy | Audio never leaves your Mac | Audio sent to remote servers |
| Works offline | Yes, always | No, requires internet |
| Latency | <1 second, consistent | 1-5 seconds, variable |
| Accuracy | 5.7% WER (Whisper large-v3-turbo) | 4-8% WER (varies by service) |
| Languages | 30+ with auto-detection | 50-100+ (varies) |
| Uptime | 100% (no server dependency) | 99.5-99.9% (service outages) |
| Cost | Free, unlimited | $0.006-0.024/min or subscription |
| Data retention | None — audio discarded instantly | Varies — often stored 30+ days |
| Hardware requirement | Apple Silicon Mac | Any device with a browser |
| Speaker diarization | Single speaker only | Multi-speaker identification |

Technical Specifications

The specifics for those who want to understand exactly what is running on their hardware.

| Specification | Details |
| --- | --- |
| Model | OpenAI Whisper large-v3-turbo (distilled) |
| Framework | WhisperKit (Swift, Core ML) |
| Inference engine | Apple Neural Engine via Core ML |
| Model size | ~809 MB (downloaded once on first launch) |
| App binary size | 7.5 MB |
| Transcription latency | <1 second for 10-second recordings |
| Audio format | 16kHz mono PCM (converted from input) |
| Supported languages | 30+ including English, Spanish, Chinese, Japanese, German, French, Russian, Korean, Arabic, Hindi |
| Language detection | Automatic (built into model) |
| CPU usage at idle | ~2% |
| Minimum macOS | macOS 14 Sonoma |
| Required hardware | Apple Silicon (M1, M2, M3, M4 or later) |
| Word error rate | 5.7% (Whisper large-v3-turbo benchmark) |

Why Whisper large-v3-turbo? OpenAI released the large-v3-turbo variant as a distilled version of large-v3. It achieves nearly identical accuracy (5.7% vs 5.5% WER) while running significantly faster on edge hardware. This is what makes real-time on-device transcription possible — the model is small enough to fit in Neural Engine memory and fast enough to process speech faster than you can produce it.

Frequently Asked Questions

Does on-device transcription work without an internet connection?

Yes. MetaWhisp runs entirely on your Mac using Apple's Neural Engine and the WhisperKit framework. No internet connection is needed for transcription. The only feature that requires internet is the optional AI post-processing (Correct and Rewrite modes), which uses the OpenAI API. The core voice-to-text functionality is 100% offline.

How accurate is local speech recognition compared to cloud services?

MetaWhisp uses OpenAI's Whisper large-v3-turbo model, which achieves a 5.7% word error rate on standard benchmarks. For comparison, Google Cloud Speech-to-Text and Amazon Transcribe typically achieve 4-8% WER depending on the audio conditions. In practice, on-device accuracy is comparable to cloud for clear single-speaker dictation. Cloud services may have an edge in noisy multi-speaker environments where server-side noise reduction helps.

What Mac hardware is required for on-device transcription?

You need a Mac with Apple Silicon (M1, M2, M3, M4, or later) running macOS 14 Sonoma or later. The Neural Engine in Apple Silicon chips is what makes real-time local transcription possible — it provides dedicated hardware for ML inference that is both fast and power-efficient. Intel Macs are not supported because they lack a Neural Engine capable of running the Whisper model at real-time speeds.

Is my voice data stored or sent anywhere?

No. Audio is captured into a RAM buffer, processed by the on-device model, and the buffer is released as soon as transcription completes. Nothing is written to disk, sent over the network, or retained in any form. MetaWhisp has no servers, no user accounts, no analytics, and no telemetry. You can verify this by monitoring network activity with Little Snitch or running the app in airplane mode. Read the full privacy policy for details.

Can I use on-device transcription in any app on my Mac?

Yes. MetaWhisp works system-wide via a global hotkey (Right Option key by default). The transcribed text is automatically pasted into the active application — wherever your cursor is. This includes Slack, VS Code, Terminal, Notes, Safari, Chrome, Mail, Pages, and any other app that accepts text input. It also works in full-screen applications. See the complete dictation guide for setup instructions.

Try On-Device Transcription Free

Download MetaWhisp and experience offline, private voice-to-text on your Mac. No account, no subscription, no data collection.

Download for macOS

macOS 14+ · Apple Silicon · Free