Data Preparation for Keyword Spotting (KWS)
The data preparation step in soundKIT for KWS involves generating training and validation datasets by mixing keyword utterances with background noise and non-keyword speech at controlled SNR levels.
The process yields reproducible, varied datasets suited to training robust keyword detection models.
Overview
The data mode prepares TFRecord datasets by:
- Combining keywords with garbage speech and environmental noise
- Controlling signal-to-noise ratio (SNR)
- Configuring reverberation and amplitude thresholds
- Supporting frame extension for keyword detection alignment
Run with:

    soundkit -t kws -m data -c configs/kws/kws.yaml
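
To make "controlling SNR" concrete, the sketch below scales a noise clip so that the speech-to-noise power ratio matches a requested value in dB before adding it to the speech. It is a minimal illustration, not soundKIT's implementation: `mix_at_snr` and `measured_snr_db` are hypothetical helper names, the signals are random stand-ins, and the power is assumed to be averaged over the whole clip (the tool may measure it differently, for example over the keyword segment only).

```python
import numpy as np

SNR_DBS = [-12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 30]  # snr_dbs from kws.yaml

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise

def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture from the clean speech that went into it."""
    residual = mixture - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)   # stand-in for a 1 s keyword clip at 16 kHz
noise = 0.05 * rng.standard_normal(16000)   # stand-in for a 1 s noise clip

for snr_db in SNR_DBS:
    mixture = mix_at_snr(speech, noise, snr_db)
    print(f"target {snr_db:>4} dB -> measured {measured_snr_db(speech, mixture):6.2f} dB")
```

Lower SNR values mean louder noise relative to the keyword, so the negative entries in `snr_dbs` produce the hardest training examples.
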
Configuration Parameters (kws.yaml)
Here are the key settings for the data section:

    data:
      path_tfrecord: ${job_dir}/tfrecords
      tfrecord_datalist_name:
        train: train_tfrecord.csv
        val: val_tfrecord.csv
      num_samples_per_noise:
        train: 757
        val: 177
      force_download: false
      reverb_prob: 0.0
      num_processes: 8
      corpora:
        - {name: train-galaxy, type: speech, split: train}
        - {name: val-galaxy, type: speech, split: val}
        - {name: vad_train-clean-100, type: garbage, split: train}
        - {name: vad_train-clean-360, type: garbage, split: train}
        - {name: vad_dev-clean, type: garbage, split: val}
        - {name: vad_thchs30, type: garbage, split: train-val}
        - {name: ESC-50-master, type: noise, split: train-val}
        - {name: FSD50K, type: noise, split: train-val}
        - {name: musan, type: noise, split: train-val}
        - {name: wham_noise, type: noise, split: train-val}
        - {name: rirs_noises, type: reverb, split: train-val}
      snr_dbs: [-12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 30]
      target_length_in_secs: 15
      min_amp: 0.01
      max_amp: 0.95
      signal:
        sampling_rate: 16000
        dc_removal: true
      target_frames_extension: 30
      debug: false
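
Each `corpora` entry carries a `name`, a `type` (speech, garbage, noise, or reverb), and a `split`. The sketch below groups a few of the entries by split and type to show how the list can be read. It assumes that `train-val` means the corpus contributes to both splits; how soundKIT actually partitions such a corpus is not stated here. `group_corpora` is a hypothetical helper, not part of the tool.

```python
from collections import defaultdict

# A subset of the corpora entries from kws.yaml, written as Python dicts.
corpora = [
    {"name": "train-galaxy", "type": "speech", "split": "train"},
    {"name": "val-galaxy", "type": "speech", "split": "val"},
    {"name": "vad_dev-clean", "type": "garbage", "split": "val"},
    {"name": "vad_thchs30", "type": "garbage", "split": "train-val"},
    {"name": "musan", "type": "noise", "split": "train-val"},
    {"name": "rirs_noises", "type": "reverb", "split": "train-val"},
]

def group_corpora(entries):
    """Return {split: {type: [names]}}.

    Assumption: a 'train-val' entry is listed under both 'train' and 'val'.
    """
    grouped = {"train": defaultdict(list), "val": defaultdict(list)}
    for entry in entries:
        for split in entry["split"].split("-"):
            grouped[split][entry["type"]].append(entry["name"])
    return grouped

grouped = group_corpora(corpora)
for split, by_type in grouped.items():
    for kind, names in sorted(by_type.items()):
        print(f"{split:5s} {kind:8s} {names}")
```
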
Data Parameters
| Parameter | Description | Example/Values |
|---|---|---|
| path_tfrecord | Output directory for TFRecords | ${job_dir}/tfrecords |
| tfrecord_datalist_name | Filenames listing TFRecord entries | train: train_tfrecord.csv, val: val_tfrecord.csv |
| num_samples_per_noise | Number of samples per noise type | train: 757, val: 177 |
| force_download | Force data re-download if true | false |
| accept_qualcomm_license | Allow automatic download of Qualcomm KWS dataset | false |
| reverb_prob | Probability of adding reverberation | 0.0 |
| num_processes | Number of parallel processes for data prep | 8 |
| corpora | List of keyword, garbage, noise, and reverb data sources | See the corpora list in the configuration above |
| snr_dbs | SNR values (dB) used for augmentation | [-12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 30] |
| target_length_in_secs | Duration of each audio sample in seconds | 15 |
| min_amp | Minimum amplitude threshold for clipping | 0.01 |
| max_amp | Maximum amplitude threshold for clipping | 0.95 |
| signal.sampling_rate | Audio sampling rate in Hz | 16000 |
| signal.dc_removal | Apply DC offset removal if true | true |
| target_frames_extension | Number of frames by which to extend the ground truth target span | 30 |
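
With `signal.sampling_rate` at 16000 Hz and `target_length_in_secs` at 15, each sample holds 16000 × 15 = 240000 audio samples. The sketch below shows one plausible way to condition a clip to those settings. It assumes DC removal is a simple mean subtraction and that clips are truncated or zero-padded to the target length; soundKIT may do either differently (for example, high-pass filtering or random cropping). `remove_dc` and `fit_length` are hypothetical helpers.

```python
import numpy as np

SAMPLING_RATE = 16000        # signal.sampling_rate
TARGET_LENGTH_IN_SECS = 15   # target_length_in_secs

def remove_dc(audio):
    """Assumed form of dc_removal: subtract the mean so the clip is centered on zero."""
    return audio - np.mean(audio)

def fit_length(audio, num_samples):
    """Truncate or zero-pad at the end so the clip has exactly `num_samples` samples."""
    if len(audio) >= num_samples:
        return audio[:num_samples]
    return np.pad(audio, (0, num_samples - len(audio)))

num_samples = SAMPLING_RATE * TARGET_LENGTH_IN_SECS  # 240000

rng = np.random.default_rng(0)
# Stand-in for a 5 s recording with a constant DC offset of 0.02.
raw = 0.1 * rng.standard_normal(5 * SAMPLING_RATE) + 0.02

clip = fit_length(remove_dc(raw), num_samples)
print(len(clip))
```
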
Notes
- target_frames_extension: Extends the ground truth window to capture the full keyword span.
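- corpora: Every keyword, garbage speech, noise, and reverb source must be listed explicitly, with its type and split.
- snr_dbs: Lists the SNR values (in dB) at which keywords are mixed with noise. Covering both low and high SNRs makes the trained model more robust.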
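
The effect of `target_frames_extension` can be pictured on a frame-level label sequence. The sketch below is an illustration under stated assumptions, not soundKIT's code: it treats the ground truth as a list of 0/1 labels per frame and extends each keyword span past its trailing edge by the given number of frames. The direction of the extension and the label format are assumptions, and `extend_targets` is a hypothetical helper. The example uses 3 frames for readability; the configuration above uses 30.

```python
def extend_targets(labels, extension):
    """Extend each run of 1s in a frame-level label list by `extension` frames past its end."""
    extended = list(labels)
    n = len(labels)
    for i in range(n):
        # A keyword span ends at frame i when the label drops to 0 or the sequence ends.
        if labels[i] == 1 and (i + 1 == n or labels[i + 1] == 0):
            for j in range(i + 1, min(n, i + 1 + extension)):
                extended[j] = 1
    return extended

labels = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
print(extend_targets(labels, 3))  # [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```

The extension is clipped at the end of the sequence, so a span near the end of a clip never produces labels beyond the clip length.
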
See the QuickStart Guide if you're running this step for the first time.