# Data Preparation (Voice Activity Detection - VAD)
This page outlines how to prepare training and validation datasets for Voice Activity Detection (VAD) using the soundkit CLI. This step synthesizes examples by combining clean speech and noise (optionally applying reverb), sets SNR levels, and performs amplitude scaling before saving the outputs to TFRecords.
## Run `data` Mode

```bash
soundkit -t vad -m data -c vad/vad.yaml
```
## Data Parameters

| Parameter | Description |
|---|---|
| `path_tfrecord` | Output path for storing generated TFRecords. Set to `${job_dir}/tfrecords`. |
| `tfrecord_datalist_name` | CSV file listing TFRecord shards for training and validation. |
| `num_samples_per_noise` | Number of synthesized samples per noise clip for each split (train, val). |
| `force_download` | Forces re-download of corpora if set to `true`. |
| `reverb_prob` | Probability of applying reverb via impulse response recordings. |
| `num_processes` | Number of worker processes used during data synthesis. |
| `snr_dbs` | List of signal-to-noise ratios (in dB) used for mixing clean speech and noise. |
| `target_length_in_secs` | Duration of each generated audio clip in seconds. |
| `min_amp`, `max_amp` | Amplitude range used to randomly scale each sample. |
| `debug` | Enables verbose logging for debugging if set to `true`. |
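To make the `snr_dbs` and `min_amp`/`max_amp` parameters concrete, the sketch below mixes a speech clip with noise at a target SNR and then rescales the mixture's peak amplitude. This is an illustrative reimplementation, not soundkit's actual synthesis code, and the `0.3`/`0.9` bounds are stand-ins for `min_amp`/`max_amp`:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng):
    """Mix speech with noise at a target SNR, then randomly scale amplitude.

    Illustrative sketch only; soundkit's internal pipeline may differ.
    """
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the mixture hits the requested SNR (in dB).
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = speech + gain * noise

    # Random peak-amplitude scaling, as controlled by min_amp / max_amp.
    peak = np.max(np.abs(mixture)) + 1e-12
    target_amp = rng.uniform(0.3, 0.9)  # stand-ins for min_amp, max_amp
    return mixture * (target_amp / peak)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of dummy "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter noise clip, will be looped
mixed = mix_at_snr(speech, noise, snr_db=5.0, rng=rng)
```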
## Corpora Configuration

Corpora define which datasets are mixed during data generation. Each entry in the `corpora` field includes:

- `name`: Dataset name (must match a registered loader)
- `type`: One of `speech`, `noise`, or `reverb`
- `split`: Dataset subset (`train`, `val`, or `train-val`)
```yaml
corpora:
  - {name: vad_train-clean-100, type: speech, split: train}
  - {name: vad_train-clean-360, type: speech, split: train}
  - {name: vad_dev-clean, type: speech, split: val}
  - {name: vad_thchs30, type: speech, split: train-val}
  - {name: ESC-50-master, type: noise, split: train-val}
  - {name: FSD50K, type: noise, split: train-val}
  - {name: musan, type: noise, split: train-val}
  - {name: wham_noise, type: noise, split: train-val}
  - {name: rirs_noises, type: reverb, split: train-val}
```
To register your own dataset, refer to the BYOD documentation.
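The `train-val` value means an entry feeds both splits. A minimal sketch of how a loader might partition such entries (an assumption about the semantics, using plain Python dicts rather than soundkit's own config objects):

```python
# Hypothetical corpora entries, mirroring the YAML config above.
corpora = [
    {"name": "vad_dev-clean", "type": "speech", "split": "val"},
    {"name": "musan", "type": "noise", "split": "train-val"},
]

# Expand "train-val" entries into both splits; keep others as-is.
by_split = {"train": [], "val": []}
for entry in corpora:
    splits = ["train", "val"] if entry["split"] == "train-val" else [entry["split"]]
    for s in splits:
        by_split[s].append((entry["name"], entry["type"]))
```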
## Signal Preprocessing

The `signal` section defines how the waveform is prepared before feature extraction:

```yaml
signal:
  sampling_rate: 16000
  dc_removal: true
```
| Parameter | Description |
|---|---|
| `sampling_rate` | Sampling rate (Hz) for all audio. |
| `dc_removal` | Removes DC bias before performing STFT if set to `true`. |
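In its simplest form, DC removal subtracts the clip's mean so the waveform is centered at zero before the STFT. A minimal mean-subtraction sketch, which is not necessarily the exact filter soundkit applies:

```python
import numpy as np

def remove_dc(waveform):
    """Remove DC bias by subtracting the per-clip mean (simplest form)."""
    return waveform - np.mean(waveform)

x = np.array([0.5, 0.7, 0.6, 0.8])  # toy waveform with a positive DC offset
y = remove_dc(x)                    # now centered around zero
```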
## Output

After running the `data` mode:

- TFRecord files (e.g., `train-00001.tfrecord`) are generated in `${job_dir}/tfrecords`
- CSV files (`train_tfrecord.csv`, `val_tfrecord.csv`) list the TFRecord files
- These outputs are consumed during the training and evaluation phases
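The datalist CSVs can be inspected with the standard library. A sketch assuming each row holds a shard path (the real column layout of `train_tfrecord.csv` may differ):

```python
import csv
import io

# Hypothetical datalist contents; substitute a real file handle for StringIO.
datalist_text = "train-00001.tfrecord\ntrain-00002.tfrecord\n"

# Collect shard paths from the first column, skipping blank rows.
shards = [row[0] for row in csv.reader(io.StringIO(datalist_text)) if row]
```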