Documentation Index Fetch the complete documentation index at: https://mintlify.com/argmaxinc/WhisperKit/llms.txt
Use this file to discover all available pages before exploring further.
Optimizing WhisperKit performance involves balancing speed, accuracy, and resource usage. This guide covers compute units, model selection, decoding options, and platform-specific optimizations.
Compute Units
CoreML models can target different hardware accelerators on Apple devices:
Neural Engine Specialized ML accelerator (fastest, most efficient)
GPU Graphics processor (good balance)
CPU Central processor (most compatible)
ModelComputeOptions
Configure compute units for each model component:
import WhisperKit
import CoreML
let computeOptions = ModelComputeOptions (
melCompute : . cpuAndGPU , // Mel spectrogram extraction
audioEncoderCompute : . cpuAndNeuralEngine , // Audio encoder
textDecoderCompute : . cpuAndNeuralEngine , // Text decoder
prefillCompute : . cpuOnly // KV cache prefill
)
let config = WhisperKitConfig (
model : "large-v3" ,
computeOptions : computeOptions
)
let pipe = try await WhisperKit (config)
Available Compute Units
CPU only - most compatible, slowest
CPU and GPU - good for macOS < 14
CPU and Neural Engine - recommended for macOS 14+, iOS 17+
All available compute units - lets CoreML decide
Recommended Configurations
macOS 14+ (Recommended)
macOS 13 and Earlier
iOS 18+
Low Memory
let computeOptions = ModelComputeOptions (
melCompute : . cpuAndGPU ,
audioEncoderCompute : . cpuAndNeuralEngine ,
textDecoderCompute : . cpuAndNeuralEngine ,
prefillCompute : . cpuOnly
)
Best performance on M1/M2/M3 Macs let computeOptions = ModelComputeOptions (
melCompute : . cpuAndGPU ,
audioEncoderCompute : . cpuAndGPU ,
textDecoderCompute : . cpuAndNeuralEngine ,
prefillCompute : . cpuOnly
)
Neural Engine support limited for audio encoder let computeOptions = ModelComputeOptions (
melCompute : . cpuAndGPU ,
audioEncoderCompute : . cpuAndNeuralEngine ,
textDecoderCompute : . cpuAndNeuralEngine ,
prefillCompute : . cpuOnly
)
Optimized for iPhone/iPad let computeOptions = ModelComputeOptions (
melCompute : . cpuOnly ,
audioEncoderCompute : . cpuOnly ,
textDecoderCompute : . cpuOnly ,
prefillCompute : . cpuOnly
)
Minimize memory usage
CLI Configuration
Set compute units via command-line:
swift run whisperkit-cli transcribe \
--audio-path "audio.wav" \
--model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
--audio-encoder-compute-units cpuAndNeuralEngine \
--text-decoder-compute-units cpuAndNeuralEngine
Model Selection
Model size significantly impacts speed and accuracy:
Speed : ~32x real-time | WER : ~10-15% | Size : ~75 MBBest for: Real-time applications, low-end devices, quick prototyping
Speed : ~16x real-time | WER : ~8-12% | Size : ~142 MBBest for: Balanced speed/quality, general transcription
Speed : ~6x real-time | WER : ~5-8% | Size : ~466 MBBest for: Production applications requiring accuracy
Speed : ~2x real-time | WER : ~4-6% | Size : ~1.5 GBBest for: High-accuracy requirements, offline processing
Speed : ~1x real-time | WER : ~3-5% | Size : ~3 GBBest for: Maximum accuracy, multilingual, post-processing
Distil Models
Distilled models offer 2-3x speedup with minimal accuracy loss:
let config = WhisperKitConfig (
model : "distil*large-v3" , // Glob pattern
modelRepo : "argmaxinc/whisperkit-coreml"
)
Benchmarks: WhisperKit Benchmarks
Decoding Options
Temperature and Sampling
Control randomness and diversity:
var options = DecodingOptions ()
options. temperature = 0.0 // Deterministic (default)
options. topK = 5 // Top-K sampling candidates
0.0: Deterministic (always pick most likely token)
0.0 - 1.0: Increasing randomness
Higher values = more creative but less accurate
Number of top candidates to sample from when temperature > 0
Fallback Strategy
Automatically retry failed segments:
var options = DecodingOptions ()
options. temperatureIncrementOnFallback = 0.2 // Increment per retry
options. temperatureFallbackCount = 5 // Max retries
Quality Thresholds
Detect and reject poor transcriptions:
var options = DecodingOptions ()
options. compressionRatioThreshold = 2.4 // Detect repetition
options. logProbThreshold = -1.0 // Average confidence
options. firstTokenLogProbThreshold = -1.5 // First token confidence
options. noSpeechThreshold = 0.6 // Silence detection
Lower thresholds = more rejections = better quality but longer processing Default values work well for most use cases.
Timestamp Control
var options = DecodingOptions ()
options. withoutTimestamps = false // Include timestamps
options. wordTimestamps = true // Word-level timing
options. maxInitialTimestamp = 1.0 // First timestamp limit
Word timestamps require ~10-15% more processing time.
Parallel Processing
Concurrent Workers
Process multiple audio segments in parallel:
var options = DecodingOptions ()
# if os ( macOS )
options. concurrentWorkerCount = 16 // Default on macOS
# else
options. concurrentWorkerCount = 4 // Default on iOS/iPadOS
# endif
iOS devices show regression with >4 workers. macOS handles 16+ workers efficiently.
Chunking Strategy
var options = DecodingOptions ()
options. chunkingStrategy = . vad // Voice Activity Detection
// or
options. chunkingStrategy = . none // Process entire audio
Voice Activity Detection (VAD) :
Automatically splits audio at silence
Reduces unnecessary processing
Better for long audio with pauses
Slightly slower initialization
No Chunking :
Process entire audio file
Faster for short clips
May hit token limits on very long audio
Prefill Optimization
KV Cache Prefill
Accelerate initial decoding with cached key-value pairs:
var options = DecodingOptions ()
options. usePrefillCache = true // Use prefilled KV cache (faster)
options. usePrefillPrompt = true // Force initial prompt tokens
Prefill reduces first-token latency by 2-3x but requires compatible models with prefill data.
Language and Task Prefill
var options = DecodingOptions (
task : . transcribe ,
language : "en" ,
usePrefillPrompt : true ,
usePrefillCache : true
)
Automatic prefill when:
Language is specified
Task is .translate
Custom prompt tokens provided
Memory Management
Model Prewarming
Reduce peak memory during model loading:
let config = WhisperKitConfig (
model : "large-v3" ,
prewarm : true // Load-unload-load pattern
)
CoreML models need “specialization” on first load (compiling for your device). This specialized cache is maintained by Apple but evicted after OS updates. With prewarm=true :
Models loaded sequentially
Each model unloaded after specialization
Lower peak memory (1 model at a time)
2x longer load time if cache is hit
With prewarm=false (default):
Models loaded in parallel
Higher peak memory (all models + compilation)
Faster load time when cache is hit
Enable when : Minimizing peak memory is critical (older devices, background apps)Disable when : Load time is critical (real-time apps, foreground processing)
let config = WhisperKitConfig (
model : "large-v3" ,
prewarm : true ,
verbose : true // See timing breakdown
)
let pipe = try await WhisperKit (config)
// Logs: prewarmLoadTime, modelLoading, encoderLoadTime, decoderLoadTime
See Memory Management for detailed strategies.
Real-Time Factor
let result = try await pipe. transcribe ( audioPath : "audio.wav" )
let rtf = result ? . timings . realTimeFactor ?? 0
print ( "Real-time factor: \( rtf ) " )
// 0.25 = 4x faster than real-time
// 1.0 = same speed as audio duration
// 2.0 = 2x slower than real-time
Tokens Per Second
let tps = result ? . timings . tokensPerSecond ?? 0
print ( "Tokens per second: \( tps ) " )
// Higher is better (typically 50-200 for large models)
Speed Factor
let speedFactor = result ? . timings . speedFactor ?? 0
print ( "Speed factor: \( speedFactor ) x" )
// Inverse of real-time factor
// 4.0 = 4x faster than real-time
macOS
M1/M2/M3 Optimization let computeOptions = ModelComputeOptions (
melCompute : . cpuAndGPU ,
audioEncoderCompute : . cpuAndNeuralEngine ,
textDecoderCompute : . cpuAndNeuralEngine ,
prefillCompute : . cpuOnly
)
var decodingOptions = DecodingOptions ()
decodingOptions. concurrentWorkerCount = 16
decodingOptions. chunkingStrategy = . vad
iOS/iPadOS
iPhone/iPad Optimization // Use smaller models on mobile
let config = WhisperKitConfig (
model : "small" , // or "distil*small"
computeOptions : ModelComputeOptions (
audioEncoderCompute : . cpuAndNeuralEngine ,
textDecoderCompute : . cpuAndNeuralEngine
)
)
var decodingOptions = DecodingOptions ()
decodingOptions. concurrentWorkerCount = 4 // Don't exceed 4
watchOS
WhisperKit on watchOS requires tiny/base models with CPU-only compute units due to memory constraints.
Benchmarking
Measure performance on your target devices:
View results:
See BENCHMARKS.md for details.
Optimization Checklist
Choose the right model
Start with small for balance, tiny for speed, large-v3 for accuracy
Configure compute units
Use .cpuAndNeuralEngine on macOS 14+ and iOS 17+
Enable prefill caching
Set usePrefillCache = true for faster first-token generation
Adjust concurrent workers
16 on macOS, 4 on iOS for optimal parallelism
Use VAD chunking
Enable chunkingStrategy = .vad for long audio files
Tune quality thresholds
Lower thresholds for better quality, higher for speed
Monitor metrics
Track real-time factor and tokens per second
Test on target devices
Always benchmark on actual hardware
Next Steps
Memory Management Advanced memory optimization strategies
Custom Models Fine-tune models for your domain