🗣️ Keyword Spotting (KWS)

Keyword Spotting (KWS) is the task of identifying whether a specific keyword or phrase (e.g., "galaxy") occurs in an audio stream. It is a foundational technology behind wake-word engines in smart devices and low-power voice interfaces.


🧠 Problem Formulation

The input signal \(y(t)\) is a continuous audio stream that may include:

  • Target keywords to detect
  • Background speech not containing the keyword
  • Environmental noise and silence

The goal of KWS is to predict a binary or probabilistic mask \(m_k(t)\) for each keyword \(k\):

\[ m_k(t) = \begin{cases} 1, & \text{if keyword } k \text{ is present at time } t \\ 0, & \text{otherwise} \end{cases} \]
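The mask above can be sketched in a few lines. This is a minimal, hypothetical helper (names, frame rate, and interval format are illustrative assumptions, not part of any library) that turns annotated keyword intervals into the framewise labels \(m_k(t)\):

```python
# Hypothetical sketch: build the framewise mask m_k(t) from a list of
# annotated keyword intervals (start_s, end_s) in seconds, sampled at a
# fixed frame rate. The 100 Hz frame rate is an illustrative assumption.

def keyword_mask(intervals, duration_s, frame_rate_hz=100):
    """Return a list of 0/1 frame labels: 1 where the keyword is active."""
    n_frames = round(duration_s * frame_rate_hz)
    mask = [0] * n_frames
    for start_s, end_s in intervals:
        lo = max(0, round(start_s * frame_rate_hz))
        hi = min(n_frames, round(end_s * frame_rate_hz))
        for i in range(lo, hi):
            mask[i] = 1
    return mask

# Example: the keyword "galaxy" spoken from 1.2 s to 1.8 s in a 3 s clip.
mask = keyword_mask([(1.2, 1.8)], duration_s=3.0)
print(sum(mask))  # 60 active frames (0.6 s at 100 frames/s)
```

In practice the same labels are usually derived automatically during synthetic dataset generation, since the injection point of each keyword is known.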

🔍 Why KWS Matters

KWS enables:

  • Always-on voice interfaces in smart speakers, earbuds, and wearables
  • Privacy-preserving local voice triggers without cloud processing
  • Ultra-low-power wake word detection for battery-constrained applications
  • Real-time command recognition in embedded systems

🎧 Noisy & Long-Tail Detection

KWS must perform reliably under:

  • Speech clutter: background conversations and overlapping talkers
  • Noise corruption: audio with non-speech sound sources
  • Far-field speech: distant or reverberant keyword utterances
  • Rare-event scenarios: few positive keyword examples in long audio

🎯 KWS Target

Given \(y(t)\), the system performs framewise keyword detection at a fixed hop (e.g., one decision every 10 ms). The output is:

  • A per-frame binary mask (1 = keyword present)
  • Or a score/probability curve that peaks around keyword regions

This is useful for visualization, alignment, and event-based inference.
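For event-based inference, the score curve is typically converted into discrete keyword events. Below is a minimal sketch (the function name, threshold, and 10 ms hop are illustrative assumptions) that merges consecutive above-threshold frames into (start, end) events:

```python
# Hypothetical sketch: turn a per-frame score curve into discrete keyword
# events by thresholding and merging runs of above-threshold frames.
# The 0.5 threshold and 10 ms hop are illustrative assumptions.

def scores_to_events(scores, threshold=0.5, hop_s=0.01):
    """Group consecutive above-threshold frames into (start_s, end_s) events."""
    events = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                                  # event begins
        elif s < threshold and start is not None:
            events.append((start * hop_s, i * hop_s))  # event ends
            start = None
    if start is not None:                              # event runs to the end
        events.append((start * hop_s, len(scores) * hop_s))
    return events

scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6, 0.3]
print(scores_to_events(scores))  # two events, roughly 0.02-0.05 s and 0.07-0.09 s
```

Smoothing the score curve (e.g., with a short median filter) before thresholding is a common way to suppress spurious one-frame detections.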


🧰 soundKIT for KWS

soundKIT offers a complete KWS pipeline:

  • Synthetic dataset generation with keyword injection in noise
  • Time-frequency feature extraction (e.g., log-power spectrogram)
  • CRNN-based model training with focal loss
  • Evaluation and framewise prediction visualization
  • Model export to TFLite and embedded C headers for evaluation-board (EVB) deployment
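The focal loss used in training down-weights the abundant easy negatives (silence, background speech) so the rare keyword frames dominate the gradient. A minimal sketch of the binary focal loss for a single frame follows; the gamma and alpha defaults are the commonly cited values from Lin et al., not necessarily soundKIT's defaults:

```python
import math

# Sketch of binary focal loss for framewise KWS training. gamma controls
# how strongly easy examples are down-weighted; alpha balances the
# positive/negative classes. Both defaults are illustrative assumptions.

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one frame with predicted probability p, label y."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)

# A confidently correct prediction contributes far less than a miss:
easy = focal_loss(0.95, 1)  # near zero
hard = focal_loss(0.05, 1)  # large
print(easy < hard)  # True
```

This is exactly the property needed for the rare-event scenarios above: with standard cross-entropy, millions of easy non-keyword frames would swamp the few positives.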

Real-time keyword spotting unlocks natural and responsive voice UIs, even on ultra-low-power edge platforms.