Data Preparation for Keyword Spotting (KWS)

The data preparation step in soundKIT for KWS involves generating training and validation datasets by mixing keyword utterances with background noise and non-keyword speech at controlled SNR levels.

This process yields reproducible, diverse datasets suited to training robust keyword detection models.

Overview

The data mode prepares TFRecord datasets by:

- Combining keywords with garbage speech and environmental noise
- Controlling signal-to-noise ratio (SNR)
- Configuring reverberation and amplitude thresholds
- Supporting frame extension for keyword detection alignment

Run with:

soundkit -t kws -m data -c configs/kws/kws.yaml
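
To make "controlling signal-to-noise ratio" concrete, here is a minimal sketch of mixing a speech clip with noise at a requested SNR. It is not soundKIT's implementation: the helper names `rms`, `mix_at_snr`, and `measured_snr_db` are hypothetical, and the sketch assumes SNR is defined on RMS levels over the whole clip.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise RMS ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # silent noise: nothing to mix in
    # From SNR_dB = 20 * log10(rms_speech / rms_noise):
    target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 20.0 * math.log10(rms(speech) / rms(residual))

# One second of synthetic "speech" and "noise" at 16 kHz (illustrative tones only).
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [0.2 * math.sin(2 * math.pi * 1234 * t / 16000 + 0.3) for t in range(16000)]
mixture = mix_at_snr(speech, noise, snr_db=6)
```

Repeating this for each value in `snr_dbs` would give one mixture per SNR level; whether soundKIT samples one SNR per clip or sweeps all of them is not stated here.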

Configuration Parameters (`kws.yaml`)

Here are the key settings for the data section:

data:
  path_tfrecord: ${job_dir}/tfrecords
  tfrecord_datalist_name:
    train: train_tfrecord.csv
    val: val_tfrecord.csv
  num_samples_per_noise:
    train: 757
    val: 177
  force_download: false
  reverb_prob: 0.0
  num_processes: 8
  corpora:
    - {name: train-galaxy, type: speech, split: train}
    - {name: val-galaxy, type: speech, split: val}
    - {name: vad_train-clean-100, type: garbage, split: train}
    - {name: vad_train-clean-360, type: garbage, split: train}
    - {name: vad_dev-clean, type: garbage, split: val}
    - {name: vad_thchs30, type: garbage, split: train-val}
    - {name: ESC-50-master, type: noise, split: train-val}
    - {name: FSD50K, type: noise, split: train-val}
    - {name: musan, type: noise, split: train-val}
    - {name: wham_noise, type: noise, split: train-val}
    - {name: rirs_noises, type: reverb, split: train-val}
  snr_dbs: [-12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 30]
  target_length_in_secs: 15
  min_amp: 0.01
  max_amp: 0.95

  signal:
    sampling_rate: 16000
    dc_removal: true
  target_frames_extension: 30
  debug: false
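
The `corpora` list above tags each source with a `type` and a `split`. The sketch below groups those entries per split, assuming (this is an assumption, not something the configuration states) that `train-val` means the corpus contributes to both splits. The function name `corpora_for_split` is hypothetical, and the list is copied by hand rather than loaded from `kws.yaml`.

```python
from collections import defaultdict

# Entries copied from the `corpora` section of the configuration above.
CORPORA = [
    {"name": "train-galaxy", "type": "speech", "split": "train"},
    {"name": "val-galaxy", "type": "speech", "split": "val"},
    {"name": "vad_train-clean-100", "type": "garbage", "split": "train"},
    {"name": "vad_train-clean-360", "type": "garbage", "split": "train"},
    {"name": "vad_dev-clean", "type": "garbage", "split": "val"},
    {"name": "vad_thchs30", "type": "garbage", "split": "train-val"},
    {"name": "ESC-50-master", "type": "noise", "split": "train-val"},
    {"name": "FSD50K", "type": "noise", "split": "train-val"},
    {"name": "musan", "type": "noise", "split": "train-val"},
    {"name": "wham_noise", "type": "noise", "split": "train-val"},
    {"name": "rirs_noises", "type": "reverb", "split": "train-val"},
]

def corpora_for_split(corpora, split):
    """Group corpus names by type for one split ('train' or 'val').

    Assumes a split value of 'train-val' means the corpus serves both splits.
    """
    grouped = defaultdict(list)
    for corpus in corpora:
        # Only the `split` field is tokenized; hyphens in names are untouched.
        if split in corpus["split"].split("-"):
            grouped[corpus["type"]].append(corpus["name"])
    return dict(grouped)

train = corpora_for_split(CORPORA, "train")
val = corpora_for_split(CORPORA, "val")
```

Under that assumption, `val["garbage"]` holds `vad_dev-clean` and `vad_thchs30`, while the four noise corpora appear in both splits.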

Data Parameters

Parameter	Description	Example/Values
`path_tfrecord`	Output directory for TFRecords	`${job_dir}/tfrecords`
`tfrecord_datalist_name`	Filenames listing TFRecord entries	`train: train_tfrecord.csv`, `val: val_tfrecord.csv`
`num_samples_per_noise`	Number of samples per noise type	`train: 757`, `val: 177`
`force_download`	Force data re-download if `true`	`false`
`accept_qualcomm_license`	Allow automatic download of Qualcomm KWS dataset	`false`
`reverb_prob`	Probability of adding reverberation	`0.0`
`num_processes`	Number of parallel processes for data prep	`8`
`corpora`	List of keyword, garbage, noise, and reverb data sources	See the `corpora` list in the configuration above
`snr_dbs`	SNR values in dB used when mixing keywords with noise	`[-12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 30]`
`target_length_in_secs`	Duration of each audio sample in seconds	`15`
`min_amp`	Minimum amplitude threshold for clipping	`0.01`
`max_amp`	Maximum amplitude threshold for clipping	`0.95`
`signal.sampling_rate`	Audio sampling rate in Hz	`16000`
`signal.dc_removal`	Apply DC offset removal if `true`	`true`
`target_frames_extension`	Number of frames to extend ground truth target span	`30`
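
As a worked example of how `target_length_in_secs` and `signal.sampling_rate` interact: 15 s at 16000 Hz is 15 × 16000 = 240000 samples per clip. The sketch below fits an arbitrary clip to that length by truncating or zero-padding. The helper `fit_to_target_length` is hypothetical, and zero-padding is an assumption; soundKIT may fill the clip differently, for example with additional garbage speech.

```python
def fit_to_target_length(samples, target_length_in_secs, sampling_rate):
    """Truncate or zero-pad `samples` to exactly the target duration."""
    target_len = int(target_length_in_secs * sampling_rate)
    if len(samples) >= target_len:
        return list(samples[:target_len])
    # Assumption: pad the tail with silence when the clip is too short.
    return list(samples) + [0.0] * (target_len - len(samples))

# A short 1000-sample clip padded out to 15 s at 16 kHz.
clip = fit_to_target_length([0.1] * 1000, target_length_in_secs=15, sampling_rate=16000)
num_samples = len(clip)
```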

Notes

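- `target_frames_extension`: Extends the ground truth window by the given number of frames so that the target covers the full keyword span.
- `corpora`: Keyword, garbage speech, noise, and reverb sources should each be listed explicitly with a `type` and a `split`.
- `snr_dbs`: Lists the SNR values (in dB) used for mixing; covering both low and high SNRs makes the model more robust to noise.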
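
One way to picture `target_frames_extension` is as a per-frame 0/1 label sequence in which every keyword region is stretched by a fixed number of frames. The sketch below assumes the extension is applied after the end of each keyword region; the direction of the extension and the function name `extend_targets` are assumptions, not confirmed soundKIT behavior.

```python
def extend_targets(labels, extension):
    """Extend each run of 1s in `labels` by `extension` frames past its end.

    Extensions that would run past the end of the sequence are clipped.
    """
    extended = list(labels)
    remaining = 0  # frames still to be filled after the last keyword frame
    for i, label in enumerate(labels):
        if label == 1:
            remaining = extension  # reset the budget while inside a keyword
        elif remaining > 0:
            extended[i] = 1
            remaining -= 1
    return extended

# Keyword occupies frames 2-3; extend the target by 3 frames.
labels = [0, 0, 1, 1, 0, 0, 0, 0, 0]
extended = extend_targets(labels, extension=3)
```

With the configured value of `30`, each keyword target would cover 30 extra frames under this assumption.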

See the QuickStart Guide if you're running this step for the first time.
