Training

This page describes how to train a Speech Enhancement (SE) model using the soundkit CLI. You can customize the architecture, feature extraction, loss functions, and learning rate schedule via the configuration YAML file.


Run train Mode

```shell
soundkit -t se -m train -c your_config.yaml
```

This command starts training using the provided configuration, including TFRecord input, feature extraction settings, and model architecture.

To monitor training progress in real-time, open a new terminal and launch TensorBoard:

```shell
soundkit -m train --tensorboard -c your_config.yaml
```

This launches TensorBoard with the logs from the specified training run. Open http://localhost:6006 in your browser to view metrics and visualizations.


Training Parameters

| Parameter | Description |
| --- | --- |
| `initial_lr` | Initial learning rate for the optimizer; used with the cosine decay schedule |
| `lr_schedule` | Learning rate schedule configuration. Supported options: `cosine`, `constant` |
| `batchsize` | Mini-batch size used during training |
| `epochs` | Total number of training epochs |
| `warmup_epochs` | Number of warm-up epochs for the linear learning-rate ramp-up |
| `epoch_loaded` | Lets you resume training if the procedure was interrupted. One of:<br>• `random`: start from scratch<br>• `latest`: resume from the last checkpoint<br>• `best`: resume from the best-performing checkpoint<br>• `<int>`: resume from a specific epoch |
| `reset_states_every_batch` | If true, resets model states (e.g., for RNNs) at the start of every batch. Useful for non-causal or stateful models. |
| `loss_function.type` | Loss function type: `mse` or `compressed_mse` |
| `loss_function.params.exp` | Exponent for `compressed_mse` (e.g., 0.6) |
| `loss_function.params.eps` | Epsilon to avoid division by zero in the magnitude computation (see `compressed_mse`) |
| `path.checkpoint_dir` | Path where model checkpoints are saved |
| `path.tensorboard_dir` | Path where TensorBoard logs are saved |
| `num_lookahead` | Number of lookahead frames used during training (0 for causal models) |
| `feature` | Feature extraction settings: frame size, hop size, FFT size, type, bins, etc. Must match the settings used for TFRecord generation. |
| `standardization` | If true, applies mean and variance normalization to features during training. |
| `model` | Model architecture configuration: the config directory and file containing the network definition. |
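Putting these keys together, a sketch of how they might appear in `your_config.yaml` follows. The exact nesting and the values shown (learning rate, paths, etc.) are illustrative assumptions; treat your working config as the authoritative layout:

```yaml
initial_lr: 0.001
lr_schedule: cosine
batchsize: 32
epochs: 100
warmup_epochs: 5
epoch_loaded: random
reset_states_every_batch: false
num_lookahead: 0
standardization: true
loss_function:
  type: compressed_mse
  params:
    exp: 0.6
    eps: 1.0e-8
path:
  checkpoint_dir: ./checkpoints
  tensorboard_dir: ./logs
```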

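The `compressed_mse` loss listed above is not spelled out in this page; a common formulation compresses spectral magnitudes with an exponent below 1 before taking the MSE, which de-emphasizes high-energy bins. The sketch below follows that convention and is an assumption, not soundkit's actual implementation:

```python
import numpy as np

def compressed_mse(est, ref, exp=0.6, eps=1e-8):
    """Illustrative compressed MSE on spectral magnitudes (not soundkit code).

    Magnitudes are raised to the power `exp` before the MSE; `eps` guards
    the magnitude computation against division by zero near silence.
    """
    est_mag = (np.abs(est) + eps) ** exp
    ref_mag = (np.abs(ref) + eps) ** exp
    return np.mean((est_mag - ref_mag) ** 2)

# Identical estimate and reference spectra give zero loss.
loss = compressed_mse(np.ones((4, 257)), np.ones((4, 257)))
```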
Feature Extraction Parameters

```yaml
feature:
  frame_size: 480
  hop_size: 160
  fft_size: 512
  type: mel
  bins: 72
```
| Parameter | Description |
| --- | --- |
| `type` | Feature type: `mel`, `logpsec`, or `hybrid` |
| `bins` | Number of mel bins or FFT bins |
| `frame_size` | Window size in samples |
| `hop_size` | Hop length in samples |
| `fft_size` | FFT length used for the STFT |
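With the example values above (`frame_size: 480`, `hop_size: 160`, `fft_size: 512`, i.e. 30 ms windows with a 10 ms hop at 16 kHz), the resulting feature shape follows the standard STFT bookkeeping. This is a quick sketch of that arithmetic, not soundkit internals:

```python
def stft_shape(num_samples, frame_size=480, hop_size=160, fft_size=512):
    """Frames and bins produced by a non-padded, one-sided STFT.

    Standard bookkeeping implied by the feature parameters; not soundkit code.
    """
    num_frames = 1 + (num_samples - frame_size) // hop_size
    # One-sided spectrum: fft_size // 2 + 1 = 257 bins, matching the
    # 257-unit output layer in the CRNN example below.
    num_bins = fft_size // 2 + 1
    return num_frames, num_bins

frames, bins = stft_shape(16000)  # one second of 16 kHz audio -> (98, 257)
```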

Standardization

```yaml
standardization: true
```

If enabled, mean and variance normalization is applied to features during training.
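Mean and variance normalization of this kind can be sketched as follows. Whether soundkit computes per-bin, global, or running statistics is not specified here, so per-feature-bin statistics are an assumption:

```python
import numpy as np

def standardize(features, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature bin.

    `features` is (num_frames, num_bins). Per-bin statistics are an
    illustrative assumption, not necessarily what soundkit computes.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize one utterance worth of 72-bin mel features.
x = np.random.default_rng(0).normal(5.0, 2.0, size=(100, 72))
z = standardize(x)
```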


Model Configuration

Specify the architecture using a YAML file:

```yaml
model:
  config_dir: ./soundkit/models/arch_configs
  config_file: config_crnn.yaml
```

The referenced file defines the network. For example, `./soundkit/models/arch_configs/config_simple_crnn.yaml`:

```yaml
name: crnn

units: 100

len_time: 6

layer_configs:
  - type: dropout
    rate: 0.1

  - type: conv2d
    filters: ${units}
    kernel_size: ["${len_time}", 72]
    strides: [1, 1]
    activation: relu

  - type: lstm
    units: ${units}

  - type: fc
    units: ${units}
    activation: relu

  - type: fc
    units: ${units}
    activation: relu

  - type: fc
    units: 257
    activation: sigmoid
```
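The `${units}` and `${len_time}` placeholders reference top-level keys in the same file. How soundkit resolves them is not documented here, but a minimal interpolation pass could look like this (an illustrative sketch, not soundkit's implementation):

```python
import re

def interpolate(value, variables):
    """Recursively replace ${name} placeholders with top-level config values."""
    if isinstance(value, str):
        # A placeholder occupying the whole string keeps the variable's type.
        full = re.fullmatch(r"\$\{(\w+)\}", value)
        if full:
            return variables[full.group(1)]
        return re.sub(r"\$\{(\w+)\}", lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, list):
        return [interpolate(v, variables) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, variables) for k, v in value.items()}
    return value

config = {
    "units": 100,
    "len_time": 6,
    "layer_configs": [
        {"type": "conv2d", "filters": "${units}", "kernel_size": ["${len_time}", 72]},
    ],
}
resolved = interpolate(config, config)
# resolved["layer_configs"][0] is {"type": "conv2d", "filters": 100, "kernel_size": [6, 72]}
```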

This makes it easy to switch among CRNN, UNet, or other registered architectures. To register your own architecture, see Bring-Your-Own-Model (BYOM).


Output

After training:

  • Model checkpoints are saved to `path.checkpoint_dir`
  • Training logs are written to `path.tensorboard_dir` and can be viewed in TensorBoard
  • You can evaluate or export the model by reusing the same `name` and `epoch_loaded` settings
