🗣️ Keyword Spotting (KWS)
Keyword Spotting (KWS) is the task of identifying whether a specific keyword or phrase (e.g., "galaxy") occurs in an audio stream. It is a foundational technology behind wake-word engines in smart devices and low-power voice interfaces.
🧠 Problem Formulation
The input signal \(y(t)\) is a continuous audio stream that may include:
- Target keywords to detect
- Background speech not containing the keyword
- Environmental noise and silence
The goal of KWS is to predict a binary or probabilistic mask \(m_k(t)\) for each keyword \(k\), where \(m_k(t) = 1\) (or a score close to 1) indicates that keyword \(k\) is present at time \(t\).
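One way to make this concrete (the feature vector \(\mathbf{x}(t)\), score function \(f_\theta\), per-frame probability \(p_k(t)\), and threshold \(\tau\) below are illustrative notation, not part of soundKIT) is to score each frame with a model and threshold the result:

\[
p_k(t) = f_\theta\big(\mathbf{x}(t)\big), \qquad m_k(t) = \mathbb{1}\big[\, p_k(t) \ge \tau \,\big],
\]

where \(\mathbf{x}(t)\) collects the time-frequency features from a short context window of \(y\) around time \(t\).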
🔍 Why KWS Matters
KWS enables:
- Always-on voice interfaces in smart speakers, earbuds, and wearables
- Privacy-preserving local voice triggers without cloud processing
- Ultra-low-power wake-word detection for battery-constrained applications
- Real-time command recognition in embedded systems
🎧 Noisy & Long-Tail Detection
KWS must perform reliably under:
- Speech clutter: background conversations and overlapping talkers
- Noise corruption: audio with non-speech sound sources
- Far-field speech: distant or reverberant keyword utterances
- Rare-event scenarios: few positive keyword examples in long audio
🎯 KWS Target
Given \(y(t)\), the system performs framewise keyword detection, producing one decision per short analysis frame (e.g., every 10 ms). The output is:
- A per-frame binary mask (1 = keyword present)
- Or a score/probability curve that peaks around keyword regions
This is useful for visualization, alignment, and event-based inference.
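As a rough illustration of this post-processing (the 0.5 threshold and the 10 ms hop are assumptions, and `frame_scores` stands in for whatever the detector actually outputs), a score curve can be thresholded into a binary mask and then grouped into keyword events:

```python
import numpy as np

def scores_to_events(frame_scores, threshold=0.5, hop_s=0.010):
    """Convert a per-frame score curve into a binary mask and (start, end) events.

    frame_scores : 1-D array of keyword probabilities, one value per frame.
    threshold    : score above which a frame counts as "keyword present".
    hop_s        : hop between frames in seconds (10 ms here, as in the text).
    """
    mask = (np.asarray(frame_scores) >= threshold).astype(np.int8)

    events = []
    start = None
    for i, active in enumerate(mask):
        if active and start is None:            # rising edge: keyword region begins
            start = i
        elif not active and start is not None:  # falling edge: keyword region ends
            events.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:                       # region extends to the end of the stream
        events.append((start * hop_s, len(mask) * hop_s))

    return mask, events

# Example: a score curve that peaks around one keyword occurrence.
scores = np.array([0.1, 0.2, 0.8, 0.9, 0.95, 0.7, 0.2, 0.1])
mask, events = scores_to_events(scores)
print(mask)    # [0 0 1 1 1 1 0 0]
print(events)  # one event from ~20 ms to ~60 ms
```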
🧰 soundKIT for KWS
soundKIT offers a complete KWS pipeline (illustrative sketches of several of these steps follow the list):
- Synthetic dataset generation with keyword injection in noise
- Time-frequency feature extraction (e.g., log-power spectrogram)
- CRNN-based model training with focal loss
- Evaluation and framewise prediction visualization
- Model export to TFLite and embedded C headers for evaluation-board (EVB) deployment
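For the keyword-injection step, a minimal sketch of mixing a keyword clip into background noise at a chosen SNR might look as follows (the function name `mix_at_snr`, the 16 kHz sample rate, and the 10 ms label hop are assumptions for illustration, not the soundKIT API):

```python
import numpy as np

def mix_at_snr(keyword, noise, snr_db, sample_rate=16000, offset_s=0.0):
    """Inject a keyword clip into a longer noise clip at a chosen SNR.

    keyword, noise : 1-D float arrays (mono audio at the same sample rate).
    snr_db         : desired keyword-to-noise ratio in dB.
    offset_s       : where in the noise the keyword starts, in seconds.
    Assumes the keyword fits entirely inside the noise clip.
    """
    start = int(offset_s * sample_rate)
    segment = noise[start:start + len(keyword)]

    kw_power = np.mean(keyword ** 2) + 1e-12
    noise_power = np.mean(segment ** 2) + 1e-12

    # Scale the keyword so that 10*log10(scaled_kw_power / noise_power) == snr_db.
    gain = np.sqrt(noise_power / kw_power * 10.0 ** (snr_db / 10.0))

    mixture = noise.copy()
    mixture[start:start + len(keyword)] += gain * keyword

    # Per-frame labels (10 ms hop): 1 where the injected keyword is active.
    hop = int(0.010 * sample_rate)
    labels = np.zeros(len(noise) // hop, dtype=np.int8)
    labels[start // hop:(start + len(keyword)) // hop] = 1
    return mixture, labels
```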
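For feature extraction, a plain-NumPy sketch of a log-power spectrogram with a 25 ms window and 10 ms hop (typical defaults, not necessarily soundKIT's) could be:

```python
import numpy as np

def log_power_spectrogram(y, sample_rate=16000, win_s=0.025, hop_s=0.010):
    """Compute a log-power spectrogram: one feature vector per 10 ms frame.

    y : 1-D mono waveform.
    Returns an array of shape (num_frames, win // 2 + 1).
    """
    win = int(win_s * sample_rate)          # 25 ms analysis window
    hop = int(hop_s * sample_rate)          # 10 ms hop, matching the framewise output
    window = np.hanning(win)

    frames = []
    for start in range(0, len(y) - win + 1, hop):
        frame = y[start:start + win] * window
        spectrum = np.fft.rfft(frame)                    # one-sided FFT
        power = np.abs(spectrum) ** 2
        frames.append(10.0 * np.log10(power + 1e-10))    # log-power in dB

    return np.stack(frames)
```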
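For training, focal loss (Lin et al., 2017) down-weights the many easy negative frames so that the rare keyword frames dominate the gradient. A Keras-style sketch follows; the TensorFlow backend is an assumption suggested by the TFLite export step, and the exact loss soundKIT uses may differ:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss for framewise keyword labels.

    Down-weights easy, confidently classified frames so that the scarce
    positive (keyword) frames contribute more to the gradient.
    """
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)

        # p_t is the predicted probability of the true class for each frame.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)

        # Standard cross-entropy, modulated by (1 - p_t)^gamma.
        loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
        return tf.reduce_mean(loss)

    return loss_fn

# Usage with a framewise CRNN that outputs per-frame keyword probabilities:
# model.compile(optimizer="adam", loss=binary_focal_loss(gamma=2.0, alpha=0.25))
```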
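For deployment, converting a trained Keras model to TFLite uses the standard `tf.lite.TFLiteConverter` API, and embedding the resulting flatbuffer in a C header is commonly done with `xxd`; soundKIT's own export helpers may wrap these steps differently:

```python
import tensorflow as tf

def export_tflite(model, path="kws_model.tflite"):
    """Convert a trained Keras KWS model to a TFLite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Default optimizations enable post-training quantization hooks.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(path, "wb") as f:
        f.write(tflite_model)
    return path

# Embedding the flatbuffer as a C header for the EVB is commonly done with xxd:
#   xxd -i kws_model.tflite > kws_model_data.h
```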
Real-time keyword spotting unlocks natural and responsive voice UIs, even on ultra-low-power edge platforms.