> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/argmaxinc/WhisperKit/llms.txt
> Use this file to discover all available pages before exploring further.

# Voice Activity Detection

> Detect speech segments and optimize transcription with VAD

# Voice Activity Detection

Voice Activity Detection (VAD) identifies segments of audio containing speech versus silence. WhisperKit includes VAD capabilities to improve transcription accuracy, reduce computation, and enable intelligent audio chunking.

## VoiceActivityDetector Base Class

The `VoiceActivityDetector` is a base class that provides common functionality for all VAD implementations:

```swift theme={null}
open class VoiceActivityDetector {
    public let sampleRate: Int
    public let frameLengthSamples: Int
    public let frameOverlapSamples: Int
    
    open func voiceActivity(in waveform: [Float]) -> [Bool] {
        // Override in subclass
    }
}
```

See [VoiceActivityDetector](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:8)

### Properties

<ParamField path="sampleRate" type="Int" default="16000">
  Audio sample rate in Hz. WhisperKit uses 16kHz by default.
</ParamField>

<ParamField path="frameLengthSamples" type="Int">
  Length of each analysis frame in samples.
</ParamField>

<ParamField path="frameOverlapSamples" type="Int" default="0">
  Number of samples overlapping between consecutive frames.
</ParamField>

## EnergyVAD

WhisperKit includes `EnergyVAD`, a simple energy-based voice activity detector:

```swift theme={null}
let vad = EnergyVAD(
    sampleRate: 16000,
    frameLength: 0.1,        // 100ms frames
    frameOverlap: 0.0,       // No overlap
    energyThreshold: 0.02    // Energy threshold
)

let audioSamples: [Float] = loadAudio()
let voiceActivity: [Bool] = vad.voiceActivity(in: audioSamples)

// voiceActivity[i] == true means frame i contains speech
```

See [EnergyVAD](~/workspace/source/Sources/WhisperKit/Core/Audio/EnergyVAD.swift:7)

### EnergyVAD Initialization

```swift theme={null}
public init(
    sampleRate: Int = 16000,
    frameLength: Float = 0.1,          // Seconds
    frameOverlap: Float = 0.0,         // Seconds
    energyThreshold: Float = 0.02
)
```

See [EnergyVAD.init](~/workspace/source/Sources/WhisperKit/Core/Audio/EnergyVAD.swift:16-29)

### Parameters

<ParamField path="sampleRate" type="Int" default="16000">
  Audio sample rate matching `WhisperKit.sampleRate`.
</ParamField>

<ParamField path="frameLength" type="Float" default="0.1">
  Frame length in seconds. Default 0.1 = 100ms frames.
</ParamField>

<ParamField path="frameOverlap" type="Float" default="0.0">
  Overlap in seconds. Helps catch speech at frame boundaries.
</ParamField>

<ParamField path="energyThreshold" type="Float" default="0.02">
  Minimum energy level to consider as speech. Lower values are more sensitive.
</ParamField>

## Using VAD for Audio Chunking

VAD enables intelligent audio chunking based on speech activity:

```swift theme={null}
import WhisperKit

let whisperKit = try await WhisperKit()

// Create VAD instance
let vad = EnergyVAD(
    frameLength: 0.1,
    energyThreshold: 0.02
)

// Configure WhisperKit to use VAD
whisperKit.voiceActivityDetector = vad

// Transcribe with VAD-based chunking
var options = DecodingOptions(
    chunkingStrategy: .vad  // Enable VAD chunking
)

let results = try await whisperKit.transcribe(
    audioPath: "long_audio.wav",
    decodeOptions: options
)
```

See [DecodingOptions.chunkingStrategy](~/workspace/source/Sources/WhisperKit/Core/Configurations.swift:184)

### ChunkingStrategy Enum

```swift theme={null}
public enum ChunkingStrategy: String, Codable {
    case none  // No chunking (default)
    case vad   // VAD-based chunking
}
```

## VAD Methods

### Calculate Active Chunks

Get start/end indices of speech segments:

```swift theme={null}
let vad = EnergyVAD()
let audioSamples: [Float] = loadAudio()

let activeChunks = vad.calculateActiveChunks(in: audioSamples)

for chunk in activeChunks {
    let startTime = Float(chunk.startIndex) / Float(vad.sampleRate)
    let endTime = Float(chunk.endIndex) / Float(vad.sampleRate)
    print("Speech: \(startTime)s - \(endTime)s")
}
```

See [VoiceActivityDetector.calculateActiveChunks](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:52-79)

### Find Longest Silence

Identify the longest silent period:

```swift theme={null}
let vadResult: [Bool] = vad.voiceActivity(in: audioSamples)

if let silence = vad.findLongestSilence(in: vadResult) {
    let startTime = vad.voiceActivityIndexToSeconds(silence.startIndex)
    let endTime = vad.voiceActivityIndexToSeconds(silence.endIndex)
    print("Longest silence: \(startTime)s - \(endTime)s")
}
```

See [VoiceActivityDetector.findLongestSilence](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:95-125)

### Voice Activity Clip Timestamps

Generate clip timestamps for active segments:

```swift theme={null}
let clipTimestamps = vad.voiceActivityClipTimestamps(in: audioSamples)
// Returns [start1, end1, start2, end2, ...]

var options = DecodingOptions(
    clipTimestamps: clipTimestamps
)

let results = try await whisperKit.transcribe(
    audioArray: audioSamples,
    decodeOptions: options
)
```

See [VoiceActivityDetector.voiceActivityClipTimestamps](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:129-141)

## Index Conversion Utilities

### Convert VAD Index to Audio Sample

```swift theme={null}
let vadIndex = 10
let sampleIndex = vad.voiceActivityIndexToAudioSampleIndex(vadIndex)
// sampleIndex = vadIndex * frameLengthSamples
```

See [VoiceActivityDetector.voiceActivityIndexToAudioSampleIndex](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:84-86)

### Convert VAD Index to Seconds

```swift theme={null}
let vadIndex = 10
let seconds = vad.voiceActivityIndexToSeconds(vadIndex)
```

See [VoiceActivityDetector.voiceActivityIndexToSeconds](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:88-90)

## VAD in Configuration

Configure VAD in `WhisperKitConfig`:

```swift theme={null}
let vad = EnergyVAD(
    energyThreshold: 0.015  // More sensitive
)

let config = WhisperKitConfig(
    model: "base",
    voiceActivityDetector: vad
)

let whisperKit = try await WhisperKit(config)
```

See [WhisperKitConfig.voiceActivityDetector](~/workspace/source/Sources/WhisperKit/Core/Configurations.swift:34)

## Streaming with VAD

`AudioStreamTranscriber` uses VAD by default:

```swift theme={null}
let streamTranscriber = AudioStreamTranscriber(
    // ... other parameters
    useVAD: true,              // Enable VAD
    silenceThreshold: 0.3      // VAD threshold
)

try await streamTranscriber.startStreamTranscription()
```

See [AudioStreamTranscriber](~/workspace/source/Sources/WhisperKit/Core/Audio/AudioStreamTranscriber.swift:40-71)

### How Streaming VAD Works

1. Audio buffer accumulates samples
2. Relative energy is calculated for recent audio
3. VAD checks if energy exceeds `silenceThreshold`
4. If no voice detected, transcription is skipped
5. If voice detected, buffer is transcribed

```swift theme={null}
let voiceDetected = AudioProcessor.isVoiceDetected(
    in: relativeEnergy,
    nextBufferInSeconds: bufferDuration,
    silenceThreshold: 0.3
)

if !voiceDetected {
    // Skip transcription for this buffer
}
```

## Custom VAD Implementation

Implement your own VAD by subclassing `VoiceActivityDetector`:

```swift theme={null}
class MyCustomVAD: VoiceActivityDetector {
    private let model: MLModel
    
    init(model: MLModel) {
        self.model = model
        super.init(
            sampleRate: 16000,
            frameLengthSamples: 1600,  // 100ms at 16kHz
            frameOverlapSamples: 0
        )
    }
    
    override func voiceActivity(in waveform: [Float]) -> [Bool] {
        // Custom ML-based VAD logic
        var results: [Bool] = []
        
        let frameCount = waveform.count / frameLengthSamples
        for i in 0..<frameCount {
            let start = i * frameLengthSamples
            let end = min(start + frameLengthSamples, waveform.count)
            let frame = Array(waveform[start..<end])
            
            // Run your ML model or algorithm
            let hasVoice = runVADModel(on: frame)
            results.append(hasVoice)
        }
        
        return results
    }
    
    private func runVADModel(on frame: [Float]) -> Bool {
        // Your custom VAD logic here
        return true
    }
}
```

### Async VAD

For ML models requiring async operations:

```swift theme={null}
class AsyncVAD: VoiceActivityDetector {
    override func voiceActivityAsync(in waveform: [Float]) async throws -> [Bool] {
        // Async VAD logic using ML models
        var results: [Bool] = []
        
        // Process frames asynchronously
        for frame in extractFrames(from: waveform) {
            let hasVoice = try await runAsyncVADModel(on: frame)
            results.append(hasVoice)
        }
        
        return results
    }
}
```

See [VoiceActivityDetector.voiceActivityAsync](~/workspace/source/Sources/WhisperKit/Core/Audio/VoiceActivityDetector.swift:45-47)

## VAD Benefits

<CardGroup cols={2}>
  <Card title="Reduced Computation" icon="gauge-high">
    Skip transcription of silent segments, saving CPU/GPU cycles and battery.
  </Card>

  <Card title="Better Accuracy" icon="bullseye">
    Avoid hallucinations on background noise by only transcribing speech.
  </Card>

  <Card title="Smart Chunking" icon="scissors">
    Split long audio at natural silence boundaries instead of arbitrary time points.
  </Card>

  <Card title="Real-time Optimization" icon="rocket">
    Streaming transcription skips silent buffers for better responsiveness.
  </Card>
</CardGroup>

## Tuning Energy Threshold

The `energyThreshold` parameter is critical for VAD performance:

### Too Low (e.g., 0.001)

* Detects very quiet speech
* May trigger on background noise
* More false positives

### Optimal (e.g., 0.02)

* Balances sensitivity and specificity
* Good for typical recording conditions
* Default value works for most cases

### Too High (e.g., 0.1)

* Only detects loud speech
* May miss quiet speakers
* More false negatives

### Testing Different Thresholds

```swift theme={null}
let audioSamples: [Float] = loadAudio()

for threshold in [0.01, 0.02, 0.03, 0.05] {
    let vad = EnergyVAD(energyThreshold: threshold)
    let activity = vad.voiceActivity(in: audioSamples)
    let speechFrames = activity.filter { $0 }.count
    let totalFrames = activity.count
    let speechPct = Float(speechFrames) / Float(totalFrames) * 100
    
    print("Threshold \(threshold): \(speechPct)% speech detected")
}
```

## Complete Example

```swift theme={null}
import WhisperKit

func transcribeWithVAD() async throws {
    // Initialize WhisperKit with VAD
    let vad = EnergyVAD(
        frameLength: 0.1,
        energyThreshold: 0.02
    )
    
    let config = WhisperKitConfig(
        model: "base",
        voiceActivityDetector: vad,
        verbose: true
    )
    
    let whisperKit = try await WhisperKit(config)
    
    // Load audio
    let audioPath = "interview.wav"
    let audioSamples = try AudioProcessor.loadAudioAsFloatArray(
        fromPath: audioPath
    )
    
    // Analyze with VAD
    let activeChunks = vad.calculateActiveChunks(in: audioSamples)
    print("Found \(activeChunks.count) speech segments")
    
    for (i, chunk) in activeChunks.enumerated() {
        let start = Float(chunk.startIndex) / Float(vad.sampleRate)
        let end = Float(chunk.endIndex) / Float(vad.sampleRate)
        print("Segment \(i): \(start)s - \(end)s")
    }
    
    // Transcribe with VAD chunking
    var options = DecodingOptions(
        chunkingStrategy: .vad,
        verbose: true
    )
    
    let results = try await whisperKit.transcribe(
        audioArray: audioSamples,
        decodeOptions: options
    )
    
    // Print results
    for result in results {
        print("\nTranscription:")
        print(result.text)
        
        print("\nSegments:")
        for segment in result.segments {
            print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
        }
    }
}
```

## Next Steps

<CardGroup cols={2}>
  <Card title="Streaming" icon="podcast" href="/whisperkit/streaming">
    Use VAD in real-time streaming transcription
  </Card>

  <Card title="Configuration" icon="gear" href="/whisperkit/configuration">
    Advanced configuration options
  </Card>
</CardGroup>
