# Training
This page describes how to train a Speech Enhancement (SE) model using the soundkit CLI. You can customize the architecture, feature extraction, loss functions, and learning rate schedule via the configuration YAML file.
## Run train Mode

```shell
soundkit -t se -m train -c your_config.yaml
```

This command starts training using the provided configuration, including the TFRecord input, feature extraction settings, and model architecture.

To monitor training progress in real time, open a new terminal and launch TensorBoard:

```shell
soundkit -m train --tensorboard -c your_config.yaml
```

This opens TensorBoard with logs from the specified training run. Visit http://localhost:6006 in your browser to view metrics and visualizations.
## Training Parameters

| Parameter | Description |
|---|---|
| `initial_lr` | Initial learning rate for the optimizer; used with the cosine decay schedule. |
| `lr_schedule` | Learning rate schedule configuration. Supported options: `cosine`, `constant`. |
| `batchsize` | Mini-batch size used during training. |
| `epochs` | Total number of training epochs. |
| `warmup_epochs` | Number of warm-up epochs for linear learning-rate ramp-up. |
| `epoch_loaded` | Lets you continue training if the procedure was interrupted for any reason. One of: `random` (start from scratch), `latest` (resume from the last checkpoint), `best` (resume from the best-performing checkpoint), or an integer (resume from that specific epoch). |
| `reset_states_every_batch` | If `true`, resets model states (e.g., for RNNs) at the start of every batch. Useful for non-causal or stateful models. |
| `loss_function.type` | Loss function type: `mse` or `compressed_mse`. |
| `loss_function.params.exp` | Exponent for `compressed_mse` (e.g., 0.6). |
| `loss_function.params.eps` | Epsilon added to avoid division by zero in the magnitude computation (see `compressed_mse`). |
| `path.checkpoint_dir` | Path where model checkpoints are saved. |
| `path.tensorboard_dir` | Path where TensorBoard logs are saved. |
| `num_lookahead` | Number of lookahead frames used during training (0 for causal models). |
| `feature` | Feature extraction settings: frame size, hop size, FFT size, type, bins, etc. Must match the settings used during TFRecord generation. |
| `standardization` | If `true`, applies mean and variance normalization to features during training. |
| `model` | Model architecture configuration. Specifies the config directory and file for the network definition. |
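The `compressed_mse` loss listed above is commonly implemented as a squared error between power-law-compressed complex spectra; the sketch below assumes that standard formulation (soundkit's exact definition may differ), with `exp` and `eps` playing the roles of `loss_function.params.exp` and `loss_function.params.eps`:

```python
import numpy as np

def compressed_mse(est_spec, ref_spec, exp=0.6, eps=1e-8):
    """Power-law-compressed spectral MSE (a common formulation; the
    definition used by soundkit may differ in detail)."""
    est_mag = np.abs(est_spec)
    ref_mag = np.abs(ref_spec)
    # Compress the magnitude while preserving phase; eps guards the
    # division by the magnitude when recovering the unit-phase factor.
    est_c = (est_mag ** exp) * est_spec / (est_mag + eps)
    ref_c = (ref_mag ** exp) * ref_spec / (ref_mag + eps)
    return float(np.mean(np.abs(est_c - ref_c) ** 2))
```

With `exp < 1`, low-energy time-frequency bins are boosted relative to high-energy ones, which tends to improve perceptual quality of the enhanced speech.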
## Feature Extraction Parameters

```yaml
feature:
  frame_size: 480
  hop_size: 160
  fft_size: 512
  type: mel
  bins: 72
```
| Parameter | Description |
|---|---|
| `type` | Feature type: `mel`, `logspec`, or `hybrid`. |
| `bins` | Number of mel bins or FFT bins. |
| `frame_size` | Window size in samples. |
| `hop_size` | Hop length in samples. |
| `fft_size` | FFT length used for the STFT. |
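The timing implied by the example settings can be sanity-checked. Assuming a 16 kHz sample rate (which this config excerpt does not state), `frame_size: 480` is a 30 ms analysis window and `hop_size: 160` is a 10 ms hop:

```python
# Assumes a 16 kHz sample rate, which the config excerpt does not state.
sr = 16000
frame_size, hop_size, fft_size = 480, 160, 512

window_ms = 1000 * frame_size / sr   # length of one analysis window in ms
hop_ms = 1000 * hop_size / sr        # hop between successive windows in ms
# Frames produced by one second of audio (no padding at the edges):
num_frames = 1 + (sr - frame_size) // hop_size
print(window_ms, hop_ms, num_frames)  # 30.0 10.0 98
```

Since `fft_size` (512) exceeds `frame_size` (480), each window would be zero-padded to the FFT length before the STFT.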
## Standardization

```yaml
standardization: true
```

If enabled, mean and variance normalization is applied to features during training.
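A minimal sketch of what mean/variance normalization typically looks like, normalizing each feature bin across time (soundkit may instead use statistics precomputed over the whole training set; this is an illustrative assumption):

```python
import numpy as np

def standardize(features, eps=1e-8):
    # features: array of shape (num_frames, num_bins).
    # Normalize each bin to zero mean and unit variance across time;
    # eps guards against division by zero for constant bins.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```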
## Model Configuration

Specify the architecture using a YAML file:

```yaml
model:
  config_dir: ./soundkit/models/arch_configs
  config_file: config_crnn.yaml
```

The referenced file configures your network. For example, `./soundkit/models/arch_configs/config_simple_crnn.yaml`:
```yaml
name: crnn
units: 100
len_time: 6
layer_configs:
  - type: dropout
    rate: 0.1
  - type: conv2d
    filters: ${units}
    kernel_size: ["${len_time}", 72]
    strides: [1, 1]
    activation: relu
  - type: lstm
    units: ${units}
  - type: fc
    units: ${units}
    activation: relu
  - type: fc
    units: ${units}
    activation: relu
  - type: fc
    units: 257
    activation: sigmoid
```
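The `${units}` and `${len_time}` placeholders refer back to the top-level keys of the same file. A hypothetical resolver illustrating how such substitution could work (soundkit's actual interpolation mechanism is not shown in this document):

```python
import re

def resolve_vars(cfg, scope=None):
    """Recursively replace ${name} placeholders with top-level scalar
    values. Hypothetical sketch; soundkit's real resolver may differ."""
    if scope is None:
        # Build the substitution scope from top-level scalar entries.
        scope = {k: v for k, v in cfg.items()
                 if not isinstance(v, (dict, list))}
    if isinstance(cfg, dict):
        return {k: resolve_vars(v, scope) for k, v in cfg.items()}
    if isinstance(cfg, list):
        return [resolve_vars(v, scope) for v in cfg]
    if isinstance(cfg, str):
        m = re.fullmatch(r"\$\{(\w+)\}", cfg)
        if m and m.group(1) in scope:
            return scope[m.group(1)]
    return cfg
```

Applied to the CRNN example, `filters: ${units}` would resolve to `100` and `kernel_size: ["${len_time}", 72]` to `[6, 72]`.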
This allows switching between CRNN, UNet, or other registered architectures. To register your own network architecture, see Bring-Your-Own-Model (BYOM).
## Output

After training:

- Model checkpoints will be saved to `path.checkpoint_dir`
- Training logs will be available in TensorBoard (`path.tensorboard_dir`)
- You can evaluate or export the model using the same `name` and `epoch_loaded` settings

To visualize training:

```shell
soundkit -m train --tensorboard -c your_config.yaml
```