Training

This page describes how to train a Speech Enhancement (SE) model using the soundkit CLI. You can customize the architecture, feature extraction, loss functions, and learning rate schedule via the configuration YAML file.


Run train Mode

```shell
soundkit -t se -m train -c your_config.yaml
```

This command starts training using the provided configuration, including TFRecord input, feature extraction settings, and model architecture.

To monitor training progress in real-time, open a new terminal and launch TensorBoard:

```shell
soundkit -m train --tensorboard -c your_config.yaml
```

This launches TensorBoard with the logs from the specified training run. Open http://localhost:6006 in your browser to view metrics and visualizations.


Training Parameters

| Parameter | Description |
| --- | --- |
| `initial_lr` | Initial learning rate for the optimizer; used with the cosine decay schedule |
| `lr_schedule` | Learning rate schedule configuration. Supported options: `cosine`, `constant` |
| `batchsize` | Mini-batch size used during training |
| `epochs` | Total number of training epochs |
| `warmup_epochs` | Number of warm-up epochs for the linear learning-rate ramp-up |
| `epoch_loaded` | Lets you resume training if the procedure was interrupted. One of:<br>• `random`: start from scratch<br>• `latest`: resume from the last checkpoint<br>• `best`: resume from the best-performing checkpoint<br>• `<int>`: resume from a specific epoch |
| `reset_states_every_batch` | If true, resets model states (e.g., for RNNs) at the start of every batch. Useful for non-causal or stateful models. |
| `loss_function.type` | Loss function type: `mse` or `compressed_mse` |
| `loss_function.params.exp` | Exponent for `compressed_mse` (e.g., 0.6) |
| `loss_function.params.eps` | Epsilon to avoid division by zero in the magnitude computation (see `compressed_mse`) |
| `path.checkpoint_dir` | Path where model checkpoints are saved |
| `path.tensorboard_dir` | Path where TensorBoard logs are saved |
| `num_lookahead` | Number of lookahead frames used during training (0 for causal models) |
| `feature` | Feature extraction settings: frame size, hop size, FFT size, type, bins, etc. Must match the settings used for TFRecord generation. |
| `standardization` | If true, applies mean and variance normalization to features during training. |
| `model` | Model architecture configuration: the config directory and file containing the network definition. |
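Putting these keys together, a sketch of how they might appear in `your_config.yaml` follows. The exact nesting and the values shown (learning rate, paths, etc.) are illustrative assumptions; treat your working config as the authoritative layout:

```yaml
initial_lr: 0.001
lr_schedule: cosine
batchsize: 32
epochs: 100
warmup_epochs: 5
epoch_loaded: random
reset_states_every_batch: false
num_lookahead: 0
standardization: true
loss_function:
  type: compressed_mse
  params:
    exp: 0.6
    eps: 1.0e-8
path:
  checkpoint_dir: ./checkpoints
  tensorboard_dir: ./logs
```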

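The `compressed_mse` loss listed above is not spelled out in this page; a common formulation compresses spectral magnitudes with an exponent below 1 before taking the MSE, which de-emphasizes high-energy bins. The sketch below follows that convention and is an assumption, not soundkit's actual implementation:

```python
import numpy as np

def compressed_mse(est, ref, exp=0.6, eps=1e-8):
    """Illustrative compressed MSE on spectral magnitudes (not soundkit code).

    Magnitudes are raised to the power `exp` before the MSE; `eps` guards
    the magnitude computation against division by zero near silence.
    """
    est_mag = (np.abs(est) + eps) ** exp
    ref_mag = (np.abs(ref) + eps) ** exp
    return np.mean((est_mag - ref_mag) ** 2)

# Identical estimate and reference spectra give zero loss.
loss = compressed_mse(np.ones((4, 257)), np.ones((4, 257)))
```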
Feature Extraction Parameters

```yaml
feature:
  frame_size: 480
  hop_size: 160
  fft_size: 512
  type: mel
  bins: 72
```
| Parameter | Description |
| --- | --- |
| `type` | Feature type: `mel`, `logpsec`, or `hybrid` |
| `bins` | Number of mel bins or FFT bins |
| `frame_size` | Window size in samples |
| `hop_size` | Hop length in samples |
| `fft_size` | FFT length used for the STFT |
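With the example values above (`frame_size: 480`, `hop_size: 160`, `fft_size: 512`, i.e. 30 ms windows with a 10 ms hop at 16 kHz), the resulting feature shape follows the standard STFT bookkeeping. This is a quick sketch of that arithmetic, not soundkit internals:

```python
def stft_shape(num_samples, frame_size=480, hop_size=160, fft_size=512):
    """Frames and bins produced by a non-padded, one-sided STFT.

    Standard bookkeeping implied by the feature parameters; not soundkit code.
    """
    num_frames = 1 + (num_samples - frame_size) // hop_size
    # One-sided spectrum: fft_size // 2 + 1 = 257 bins, matching the
    # 257-unit output layer in the CRNN example below.
    num_bins = fft_size // 2 + 1
    return num_frames, num_bins

frames, bins = stft_shape(16000)  # one second of 16 kHz audio -> (98, 257)
```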

Standardization

```yaml
standardization: true
```

If enabled, mean and variance normalization is applied to features during training.
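Mean and variance normalization of this kind can be sketched as follows. Whether soundkit computes per-bin, global, or running statistics is not specified here, so per-feature-bin statistics are an assumption:

```python
import numpy as np

def standardize(features, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature bin.

    `features` is (num_frames, num_bins). Per-bin statistics are an
    illustrative assumption, not necessarily what soundkit computes.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize one utterance worth of 72-bin mel features.
x = np.random.default_rng(0).normal(5.0, 2.0, size=(100, 72))
z = standardize(x)
```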


Model Configuration

Specify the architecture using a YAML file:

```yaml
model:
  config_dir: ./soundkit/models/arch_configs
  config_file: config_crnn.yaml
```

The referenced file defines the network. For example, `./soundkit/models/arch_configs/config_simple_crnn.yaml`:

```yaml
name: crnn

units: 100

len_time: 6

layer_configs:
  - type: dropout
    rate: 0.1

  - type: conv2d
    filters: ${units}
    kernel_size: ["${len_time}", 72]
    strides: [1, 1]
    activation: relu

  - type: lstm
    units: ${units}

  - type: fc
    units: ${units}
    activation: relu

  - type: fc
    units: ${units}
    activation: relu

  - type: fc
    units: 257
    activation: sigmoid
```
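The `${units}` and `${len_time}` placeholders reference top-level keys in the same file. How soundkit resolves them is not documented here, but a minimal interpolation pass could look like this (an illustrative sketch, not soundkit's implementation):

```python
import re

def interpolate(value, variables):
    """Recursively replace ${name} placeholders with top-level config values."""
    if isinstance(value, str):
        # A placeholder occupying the whole string keeps the variable's type.
        full = re.fullmatch(r"\$\{(\w+)\}", value)
        if full:
            return variables[full.group(1)]
        return re.sub(r"\$\{(\w+)\}", lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, list):
        return [interpolate(v, variables) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, variables) for k, v in value.items()}
    return value

config = {
    "units": 100,
    "len_time": 6,
    "layer_configs": [
        {"type": "conv2d", "filters": "${units}", "kernel_size": ["${len_time}", 72]},
    ],
}
resolved = interpolate(config, config)
# resolved["layer_configs"][0] is {"type": "conv2d", "filters": 100, "kernel_size": [6, 72]}
```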

This makes it easy to switch among CRNN, UNet, or other registered architectures. To register your own architecture, see Bring-Your-Own-Model (BYOM).


Output

After training:

  • Model checkpoints are saved to `path.checkpoint_dir`
  • Training logs are written to `path.tensorboard_dir` and can be viewed in TensorBoard
  • You can evaluate or export the model by reusing the same `name` and `epoch_loaded` settings
