# 🏋️‍♂️ Training (Voice Activity Detection - VAD)
This page describes how to train a Voice Activity Detection (VAD) model using the soundkit CLI. You can configure model architecture, feature extraction, loss function, and learning schedule in your YAML config.
## 🚀 Run `train` Mode
```bash
soundkit -t vad -m train -c configs/vad/vad.yaml
```
This starts training using settings from vad.yaml, including TFRecord input, features, and model parameters.
To monitor training live, open a second terminal and run:
```bash
soundkit -t vad -m train --tensorboard -c configs/vad/vad.yaml
```
This starts TensorBoard. Visit http://localhost:6006 to view real-time metrics and logs.
## 🧾 Training Parameters
| Parameter | Description |
|---|---|
| `initial_lr` | Initial learning rate (uses cosine decay) |
| `batchsize` | Batch size for training |
| `epochs` | Total number of training epochs |
| `warmup_epochs` | Warm-up period for the learning-rate ramp-up |
| `epoch_loaded` | Resume strategy: `random`, `latest`, `best`, or a specific epoch number |
| `loss_function.type` | Loss function (e.g., `cross_entropy`) |
| `path.checkpoint_dir` | Where to save model checkpoints |
| `path.tensorboard_dir` | Where to save TensorBoard logs |
| `num_lookahead` | Number of lookahead frames (0 for a causal model) |
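For orientation, these keys might appear in `vad.yaml` roughly as sketched below. The dotted names in the table (`loss_function.type`, `path.checkpoint_dir`) imply the nesting shown; the placement of the remaining keys and all values here are illustrative assumptions, not the authoritative schema:

```yaml
# Illustrative sketch only -- key placement and values are assumptions
initial_lr: 0.001          # decayed with a cosine schedule
batchsize: 64
epochs: 100
warmup_epochs: 5           # learning rate ramps up over the first epochs
epoch_loaded: best         # or: random, latest, or a specific epoch number
num_lookahead: 0           # 0 keeps the model causal
loss_function:
  type: cross_entropy
path:
  checkpoint_dir: ./exp/vad/checkpoints
  tensorboard_dir: ./exp/vad/tensorboard
```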
## 🎛️ Feature Extraction
```yaml
feature:
  frame_size: 480
  hop_size: 160
  fft_size: 512
  type: logpspec
  bins: 257
```
| Parameter | Description |
|---|---|
| `type` | Feature type: `logpspec`, `mel`, or `hybrid` |
| `bins` | Number of FFT or mel bins |
| `frame_size` | STFT window size in samples |
| `hop_size` | Hop size between frames, in samples |
| `fft_size` | FFT length used in the STFT |
These settings must match those used for TFRecord generation.
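As a sanity check on the example above: assuming 16 kHz audio (the sample rate is not stated here), `frame_size: 480` is a 30 ms window with a 10 ms hop (`hop_size: 160`), and `fft_size: 512` yields 512/2 + 1 = 257 one-sided spectrum bins, which is why `bins` is set to 257.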
## 📏 Standardization
```yaml
standardization: true
```
Enable mean-variance normalization during training for improved convergence.
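Mean-variance normalization conventionally standardizes each feature dimension as `(x - mean) / std`; whether soundkit computes these statistics per utterance or across the training set is not specified here, so treat that detail as an assumption.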
## 🧠 Model Configuration
Specify the model architecture via:
```yaml
model:
  config_dir: ./soundkit/models/arch_configs
  config_file: config_crnn_vad.yaml
```
A sample CRNN VAD configuration:
```yaml
name: crnn
units: 22
len_time: 6
dropout_rate_input: 0.1
dropout_rate: 0.2
stride_time: 1
layer_configs:
  - type: dropout
    rate: ${dropout_rate_input}
  - type: fc
    units: ${units}
    activation: relu
  - type: dropout
    rate: ${dropout_rate}
  - type: conv2d
    filters: ${units}
    kernel_size: ["${len_time}", "${units}"]
    strides: ["${stride_time}", 1]
    activation: relu
  - type: dropout
    rate: ${dropout_rate}
  - type: lstm
    units: ${units}
  - type: dropout
    rate: ${dropout_rate}
  - type: fc
    units: ${units}
    activation: relu
  - type: dropout
    rate: ${dropout_rate}
  - type: fc
    units: ${units}
    activation: relu
  - type: dropout
    rate: ${dropout_rate}
  - type: fc
    units: 2
    activation: linear
```
You can switch between CRNN and other architectures by modifying the config. For custom models, see BYOM.
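For example, switching architectures only requires pointing `config_file` at another file in `config_dir`. The filename below is hypothetical; use a config that actually exists in your `config_dir`:

```yaml
model:
  config_dir: ./soundkit/models/arch_configs
  config_file: config_my_arch_vad.yaml  # hypothetical filename for illustration
```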
## 📦 Output
After training completes:
- Checkpoints saved to `checkpoint_dir`
- Training logs in `tensorboard_dir`
- Model ready for evaluation and export using the same `name` and `epoch_loaded`
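If a run is interrupted, `epoch_loaded` (see the training parameters above) selects which checkpoint a subsequent run loads. A minimal sketch, assuming the key sits at the top level of `vad.yaml` as in the earlier example:

```yaml
epoch_loaded: latest   # resume from the newest checkpoint in checkpoint_dir
```

Then re-run the same train command as before.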
To visualize metrics:
```bash
soundkit -t vad -m train --tensorboard -c configs/vad/vad.yaml
```