Performance Optimization

Optimizing WhisperKit performance involves balancing speed, accuracy, and resource usage. This guide covers compute units, model selection, decoding options, and platform-specific optimizations.

Compute Units

CoreML models can target different hardware accelerators on Apple devices:

Neural Engine

Specialized ML accelerator (fastest, most efficient)

GPU

Graphics processor (good balance)

CPU

Central processor (most compatible)

ModelComputeOptions

Configure compute units for each model component:

import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,              // Mel spectrogram extraction
    audioEncoderCompute: .cpuAndNeuralEngine,  // Audio encoder
    textDecoderCompute: .cpuAndNeuralEngine,   // Text decoder
    prefillCompute: .cpuOnly             // KV cache prefill
)

let config = WhisperKitConfig(
    model: "large-v3",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)

Available Compute Units

.cpuOnly

MLComputeUnits

CPU only - most compatible, slowest

.cpuAndGPU

MLComputeUnits

CPU and GPU - good for macOS < 14

.cpuAndNeuralEngine

MLComputeUnits

CPU and Neural Engine - recommended for macOS 14+, iOS 17+

.all

MLComputeUnits

All available compute units - lets CoreML decide

Recommended Configurations

macOS 14+ (Recommended)
macOS 13 and Earlier
iOS 18+
Low Memory

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

Best performance on M1/M2/M3 Macs

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndGPU,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

Neural Engine support limited for audio encoder

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

Optimized for iPhone/iPad

let computeOptions = ModelComputeOptions(
    melCompute: .cpuOnly,
    audioEncoderCompute: .cpuOnly,
    textDecoderCompute: .cpuOnly,
    prefillCompute: .cpuOnly
)

Minimize memory usage

CLI Configuration

Set compute units via command-line:

swift run whisperkit-cli transcribe \
  --audio-path "audio.wav" \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-encoder-compute-units cpuAndNeuralEngine \
  --text-decoder-compute-units cpuAndNeuralEngine

Model Selection

Model size significantly impacts speed and accuracy:

tiny

39M parameters

Speed: ~32x real-time | WER: ~10-15% | Size: ~75 MBBest for: Real-time applications, low-end devices, quick prototyping

base

74M parameters

Speed: ~16x real-time | WER: ~8-12% | Size: ~142 MBBest for: Balanced speed/quality, general transcription

small

244M parameters

Speed: ~6x real-time | WER: ~5-8% | Size: ~466 MBBest for: Production applications requiring accuracy

medium

769M parameters

Speed: ~2x real-time | WER: ~4-6% | Size: ~1.5 GBBest for: High-accuracy requirements, offline processing

large-v3

1550M parameters

Speed: ~1x real-time | WER: ~3-5% | Size: ~3 GBBest for: Maximum accuracy, multilingual, post-processing

Distil Models

Distilled models offer 2-3x speedup with minimal accuracy loss:

let config = WhisperKitConfig(
    model: "distil*large-v3",  // Glob pattern
    modelRepo: "argmaxinc/whisperkit-coreml"
)

Benchmarks: WhisperKit Benchmarks

Decoding Options

Temperature and Sampling

Control randomness and diversity:

var options = DecodingOptions()
options.temperature = 0.0  // Deterministic (default)
options.topK = 5           // Top-K sampling candidates

temperature

Float

default:"0.0"

0.0: Deterministic (always pick most likely token)
0.0 - 1.0: Increasing randomness
Higher values = more creative but less accurate

topK

Int

default:"5"

Number of top candidates to sample from when temperature > 0

Fallback Strategy

Automatically retry failed segments:

var options = DecodingOptions()
options.temperatureIncrementOnFallback = 0.2  // Increment per retry
options.temperatureFallbackCount = 5          // Max retries

Quality Thresholds

Detect and reject poor transcriptions:

var options = DecodingOptions()
options.compressionRatioThreshold = 2.4       // Detect repetition
options.logProbThreshold = -1.0               // Average confidence
options.firstTokenLogProbThreshold = -1.5     // First token confidence
options.noSpeechThreshold = 0.6               // Silence detection

Lower thresholds = more rejections = better quality but longer processingDefault values work well for most use cases.

Timestamp Control

var options = DecodingOptions()
options.withoutTimestamps = false      // Include timestamps
options.wordTimestamps = true          // Word-level timing
options.maxInitialTimestamp = 1.0      // First timestamp limit

Word timestamps require ~10-15% more processing time.

Parallel Processing

Concurrent Workers

Process multiple audio segments in parallel:

var options = DecodingOptions()

#if os(macOS)
options.concurrentWorkerCount = 16  // Default on macOS
#else
options.concurrentWorkerCount = 4   // Default on iOS/iPadOS
#endif

iOS devices show regression with >4 workers. macOS handles 16+ workers efficiently.

Chunking Strategy

var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice Activity Detection
// or
options.chunkingStrategy = .none  // Process entire audio

Voice Activity Detection (VAD):

Automatically splits audio at silence
Reduces unnecessary processing
Better for long audio with pauses
Slightly slower initialization

No Chunking:

Process entire audio file
Faster for short clips
May hit token limits on very long audio

Prefill Optimization

KV Cache Prefill

Accelerate initial decoding with cached key-value pairs:

var options = DecodingOptions()
options.usePrefillCache = true   // Use prefilled KV cache (faster)
options.usePrefillPrompt = true  // Force initial prompt tokens

Prefill reduces first-token latency by 2-3x but requires compatible models with prefill data.

Language and Task Prefill

var options = DecodingOptions(
    task: .transcribe,
    language: "en",
    usePrefillPrompt: true,
    usePrefillCache: true
)

Automatic prefill when:

Language is specified
Task is .translate
Custom prompt tokens provided

Memory Management

Model Prewarming

Reduce peak memory during model loading:

let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true  // Load-unload-load pattern
)

What is prewarming?

CoreML models need “specialization” on first load (compiling for your device). This specialized cache is maintained by Apple but evicted after OS updates.With prewarm=true:

Models loaded sequentially
Each model unloaded after specialization
Lower peak memory (1 model at a time)
2x longer load time if cache is hit

With prewarm=false (default):

Models loaded in parallel
Higher peak memory (all models + compilation)
Faster load time when cache is hit

Enable when: Minimizing peak memory is critical (older devices, background apps)Disable when: Load time is critical (real-time apps, foreground processing)

let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true,
    verbose: true  // See timing breakdown
)

let pipe = try await WhisperKit(config)
// Logs: prewarmLoadTime, modelLoading, encoderLoadTime, decoderLoadTime

See Memory Management for detailed strategies.

Performance Metrics

Real-Time Factor

let result = try await pipe.transcribe(audioPath: "audio.wav")
let rtf = result?.timings.realTimeFactor ?? 0

print("Real-time factor: \(rtf)")
// 0.25 = 4x faster than real-time
// 1.0 = same speed as audio duration
// 2.0 = 2x slower than real-time

Tokens Per Second

let tps = result?.timings.tokensPerSecond ?? 0
print("Tokens per second: \(tps)")
// Higher is better (typically 50-200 for large models)

Speed Factor

let speedFactor = result?.timings.speedFactor ?? 0
print("Speed factor: \(speedFactor)x")
// Inverse of real-time factor
// 4.0 = 4x faster than real-time

Platform-Specific Tips

macOS

M1/M2/M3 Optimization

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 16
decodingOptions.chunkingStrategy = .vad

iOS/iPadOS

iPhone/iPad Optimization

// Use smaller models on mobile
let config = WhisperKitConfig(
    model: "small",  // or "distil*small"
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    )
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 4  // Don't exceed 4

watchOS

WhisperKit on watchOS requires tiny/base models with CPU-only compute units due to memory constraints.

Benchmarking

Measure performance on your target devices:

make benchmark-devices

View results:

WhisperKit Benchmarks Space
Local: fastlane/benchmark_data/

See BENCHMARKS.md for details.

Optimization Checklist

Choose the right model

Start with small for balance, tiny for speed, large-v3 for accuracy

Configure compute units

Use .cpuAndNeuralEngine on macOS 14+ and iOS 17+

Enable prefill caching

Set usePrefillCache = true for faster first-token generation

Adjust concurrent workers

16 on macOS, 4 on iOS for optimal parallelism

Use VAD chunking

Enable chunkingStrategy = .vad for long audio files

Tune quality thresholds

Lower thresholds for better quality, higher for speed

Monitor metrics

Track real-time factor and tokens per second

Test on target devices

Always benchmark on actual hardware

Next Steps

Memory Management

Advanced memory optimization strategies

Custom Models

Fine-tune models for your domain

Documentation Index

​Compute Units

Neural Engine

GPU

CPU

​ModelComputeOptions

​Available Compute Units

​Recommended Configurations

​CLI Configuration

​Model Selection

​Distil Models

​Decoding Options

​Temperature and Sampling

​Fallback Strategy

​Quality Thresholds

​Timestamp Control

​Parallel Processing

​Concurrent Workers

​Chunking Strategy

​Prefill Optimization

​KV Cache Prefill

​Language and Task Prefill

​Memory Management

​Model Prewarming

​Performance Metrics

​Real-Time Factor

​Tokens Per Second

​Speed Factor

​Platform-Specific Tips

​macOS

M1/M2/M3 Optimization

​iOS/iPadOS

iPhone/iPad Optimization

​watchOS

​Benchmarking

​Optimization Checklist

​Next Steps

Memory Management

Custom Models

Compute Units

ModelComputeOptions

Available Compute Units

Recommended Configurations

CLI Configuration

Model Selection

Distil Models

Decoding Options

Temperature and Sampling

Fallback Strategy

Quality Thresholds

Timestamp Control

Parallel Processing

Concurrent Workers

Chunking Strategy

Prefill Optimization

KV Cache Prefill

Language and Task Prefill

Memory Management

Model Prewarming

Performance Metrics

Real-Time Factor

Tokens Per Second

Speed Factor

Platform-Specific Tips

macOS

iOS/iPadOS

watchOS

Benchmarking

Optimization Checklist

Next Steps