Speech Enhancement (SE)

The Speech Enhancement (SE) module in soundKIT enables denoising of speech signals for real-time and embedded applications. It is designed for both research and deployment.

This module is optimized for deployment on Ambiq's family of ultra-low power SoCs, enabling efficient and low-latency speech enhancement on edge devices.

📘 Try it now: Explore the SE Tutorial Notebook for a hands-on walkthrough.



Features

  • Noise suppression for clean speech recovery
  • Real-time frame-by-frame inference
  • Modular support for CRNN and UNet architectures
  • Export for embedded deployment (TFLite, CMSIS, etc.)
  • Demo on Ambiq's family of ultra-low power SoCs via WebUSB

Install soundKIT

Follow the instructions in the QuickStart to set up your environment.


SE Task Modes

The soundkit CLI provides multiple modes for running the SE task. All modes are configured through a YAML file (e.g., se.yaml). Below is a breakdown of the configuration structure and CLI commands.

se.yaml
name: unet_experiment
project: se
job_dir: ./soundkit/tasks/s

data:
  path_tfrecord: ${job_dir}/tfrecords
  tfrecord_datalist_name: # list of saved tfrecords
    train: train_tfrecord.csv
    val: val_tfrecord.csv
  num_samples_per_noise:
    train: 1000
    val: 250
  force_download: false
  reverb_prob: 0.5
  num_processes: 8
  corpora:
    - {name: train-clean-360, type: speech, split: train}
    - {name: train-clean-100, type: speech, split: train}
    - {name: dev-clean, type: speech, split: val}
    - {name: thchs30, type: speech, split: train-val}
    - {name: ESC-50-master, type: noise, split: train-val}
    - {name: FSD50K, type: noise, split: train-val}
    - {name: musan, type: noise, split: train-val}
    - {name: wham_noise, type: noise, split: train-val}
    - {name: rirs_noises, type: reverb, split: train-val}
  snr_dbs: [-6, -3, 0, 3, 6, 9, 12, 15, 30] # mixture of signal-to-noise ratios
  target_length_in_secs: 5
  min_amp: 0.03
  max_amp: 0.95

  signal:
    sampling_rate: 16000
    dc_removal: true
  debug: false

train:
  initial_lr: 4e-4
  batchsize: 32
  epochs: 150
  warmup_epochs: 5
  epoch_loaded: random
  loss_function: {
    type: compressed_mse,
    params: {exp: 0.6, eps: 1e-8}
    }
  path:
    full_name: ${name}_unit64_la${train.num_lookahead}_dropout0.2_${train.feature.type}_feat
    model_dir:       ${job_dir}/models_trained/${train.path.full_name}
    tensorboard_dir: ${job_dir}/tensorboard/${train.path.full_name}
  num_lookahead: 2

  feature:
    frame_size: 480
    hop_size: 160
    fft_size: 512
    type: logpspec
    bins: 257
    # type: hybrid
    # bins_fft: 100
    # n_mels: 72

  standardization: true

  model:
    config_dir: ./soundkit/models/arch_configs
    config_file: config_unet.yaml

  debug: false

evaluate:
  epoch_loaded: best

  data:
    dir: "./wavs/se/test_wavs"
    files: [keyboard_steak.wav, i_like_steak.wav, steak_hairdryer.wav]
    # # dir: ./wavs/LibriSpeech/test-clean
    # # files:
    result_folder: ${job_dir}/test_results/${train.path.full_name}

export:
  epoch_loaded: best
  tflite_dir: ${job_dir}/tflite

demo:
  platform: pc
  epoch_loaded: best
  tflite_dir: ${job_dir}/tflite
  evb_dir: ${job_dir}/evb
  pre_gain: 1
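
The feature settings above (frame_size: 480, hop_size: 160, fft_size: 512) imply 257 frequency bins per frame, i.e. fft_size / 2 + 1. As an illustration of what a logpspec feature might look like, here is a minimal NumPy sketch of windowed framing plus log power spectrum extraction. This is an assumed implementation for intuition only, not soundKIT's actual feature code:

```python
import numpy as np

def logpspec_frames(audio, frame_size=480, hop_size=160, fft_size=512, eps=1e-12):
    """Slice audio into overlapping frames and compute log power spectra."""
    num_frames = 1 + (len(audio) - frame_size) // hop_size
    window = np.hanning(frame_size)
    feats = []
    for i in range(num_frames):
        frame = audio[i * hop_size : i * hop_size + frame_size] * window
        spec = np.fft.rfft(frame, n=fft_size)          # fft_size/2 + 1 = 257 complex bins
        feats.append(np.log(np.abs(spec) ** 2 + eps))  # log power spectrum
    return np.stack(feats)

feats = logpspec_frames(np.random.randn(16000))  # 1 second of audio at 16 kHz
```

With a 16 kHz sampling rate, the 160-sample hop corresponds to a 10 ms frame rate, which is what makes frame-by-frame streaming inference practical.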

SE Task Mode Selection

Download and prepare the training and validation data by generating TFRecords from raw audio corpora.

soundkit -t se -m data -c configs/se/se.yaml
See Data in detail.
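
The data stage mixes clean speech with noise at the signal-to-noise ratios listed in snr_dbs. A common way to do this is to scale the noise so the speech-to-noise power ratio hits the target; the sketch below illustrates that idea and is not soundKIT's actual pipeline code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech/noise power ratio matches snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain that brings the noise to the desired SNR relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=6)
```

Sampling snr_dbs from -6 dB up to 30 dB, as in the config, exposes the model to both heavily corrupted and nearly clean inputs.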

Train the speech enhancement model using the specified configuration and dataset.

soundkit -t se -m train -c configs/se/se.yaml

To monitor training progress in real-time, open a new terminal and launch TensorBoard:

soundkit -t se -m train --tensorboard -c configs/se/se.yaml
This will open TensorBoard with logs from the specified training run. Visit http://localhost:6006 in your browser to view metrics and visualizations.

See Train in detail.
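
The loss_function entry in the config (compressed_mse with exp: 0.6) likely refers to a power-law compressed spectral MSE, a common speech enhancement objective. Under that assumption, a minimal sketch:

```python
import numpy as np

def compressed_mse(est_mag, ref_mag, exp=0.6, eps=1e-8):
    """MSE between power-law compressed spectral magnitudes.

    Compression with an exponent below 1 de-emphasizes loud bins, so quiet
    time-frequency regions still contribute meaningfully to the loss.
    """
    est_c = (est_mag + eps) ** exp
    ref_c = (ref_mag + eps) ** exp
    return np.mean((est_c - ref_c) ** 2)

ref = np.abs(np.random.randn(98, 257))
loss = compressed_mse(ref * 0.5, ref)
```

The exp and eps values here mirror the params block in se.yaml; whether soundKIT compresses magnitudes only or complex spectra as well is not specified here.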

Evaluate the model on a test set and compute metrics such as SI-SDR, STOI, PESQ, or DNSMOS.

soundkit -t se -m evaluate -c configs/se/se.yaml
See Evaluate in detail.
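
Of the metrics listed, SI-SDR is simple enough to sketch directly. The standard definition projects the estimate onto the reference, so any rescaling of the estimate leaves the score unchanged; this is a reference sketch, not soundKIT's evaluation code:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

Higher is better; a clean estimate that differs from the reference only by a constant gain scores very high, while additive distortion lowers the score.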

Convert the trained model into formats suitable for embedded or web deployment (e.g., TFLite, C arrays).

soundkit -t se -m export -c configs/se/se.yaml
See Export in detail.
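
Embedding a .tflite flatbuffer for firmware is typically just a hex dump of the model bytes into a C array (the xxd -i pattern). The helper below is a hypothetical illustration of that step; the names and output layout are assumptions, not soundKIT's export API:

```python
def tflite_to_c_array(model_bytes, var_name="g_model"):
    """Render a binary model blob as a C source snippet (xxd -i style)."""
    hex_bytes = ", ".join(f"0x{b:02x}" for b in model_bytes)
    return (
        f"const unsigned char {var_name}[] = {{{hex_bytes}}};\n"
        f"const unsigned int {var_name}_len = {len(model_bytes)};\n"
    )

# In practice model_bytes would be read from the exported .tflite file.
snippet = tflite_to_c_array(b"\x00\x01\x02")
```

The resulting snippet can be compiled into the EVB firmware image so the model ships inside the binary rather than being loaded from a filesystem.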

Run real-time inference on either:

  • Your PC
  • A connected embedded development board (EVB) via WebUSB

We suggest starting on the PC for fast testing.

soundkit -t se -m demo -c configs/se/se.yaml demo.platform=pc
# or
soundkit -t se -m demo -c configs/se/se.yaml demo.platform=evb
See Demo in detail.