Wake-word detection

Wake-word detection, also called keyword spotting, is what lets a voice assistant appear to be always listening without sending everything it hears to the cloud. A small program runs continuously on the device, monitoring the microphone for one specific trigger phrase, such as “Alexa,” “OK Google,” or “Hey Siri.” Only when it is confident it has heard that phrase does it wake the rest of the system and begin the heavier work of recognizing and understanding what the user says next.
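The always-listening loop can be sketched as a small streaming gate. This is a minimal illustration under stated assumptions, not any vendor's implementation: the class name `WakeWordGate`, the moving-average smoothing, the 0.8 threshold, the 5-frame window, and the 20-frame cooldown are all hypothetical choices. The per-frame probabilities stand in for the output of a small on-device classifier that is not shown.

```python
from collections import deque


class WakeWordGate:
    """Hypothetical streaming gate over per-frame wake-word probabilities.

    The probabilities are assumed to come from some small on-device
    classifier (not shown). The gate smooths them with a moving average and
    fires when the average crosses a threshold, then suppresses further
    triggers for a cooldown period so one utterance does not fire repeatedly.
    """

    def __init__(self, threshold=0.8, window=5, cooldown=20):
        self.threshold = threshold          # confidence required to wake
        self.recent = deque(maxlen=window)  # last `window` frame probabilities
        self.cooldown = cooldown            # frames during which no new trigger fires
        self.frames_since_trigger = cooldown

    def push(self, prob):
        """Feed one frame's probability; return True if the system should wake."""
        self.recent.append(prob)
        self.frames_since_trigger += 1
        smoothed = sum(self.recent) / len(self.recent)
        if smoothed >= self.threshold and self.frames_since_trigger > self.cooldown:
            self.frames_since_trigger = 0
            self.recent.clear()  # start fresh for the next utterance
            return True
        return False


# Toy input: background noise, a burst of wake-word frames, then noise again.
probs = [0.1] * 10 + [0.95] * 8 + [0.1] * 10
gate = WakeWordGate(threshold=0.8, window=5, cooldown=20)
fired = [i for i, p in enumerate(probs) if gate.push(p)]
print(fired)  # → [14]
```

The classifier is confident from frame 10, but the smoothed score only clears 0.8 at frame 14, once the whole window holds wake-word frames. Smoothing trades a few frames of latency for robustness against isolated spikes, and the cooldown keeps the remaining wake-word frames from firing a second time.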

The technical demands are unusually tight. The detector has to fit within a small memory and computation budget so it can run constantly on a phone or an inexpensive speaker, respond almost instantly, and keep two kinds of error low at the same time: false rejections, where it misses the wake word when the user says it, and false accepts, where it triggers on ordinary conversation, music, or TV. The two pull against each other, since lowering the detection threshold reduces misses but increases false triggers. A 2014 Google paper by Chen, Parada, and Heigold showed that a compact deep neural network could meet these demands markedly better than the older Hidden Markov Model approach, and small on-device neural classifiers of that kind became the standard.
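The tension between missed wake words and false triggers is usually managed by choosing a detection threshold. The sketch below evaluates one threshold at a time, assuming the detector has already produced a confidence score for each test clip. The function name `error_rates` and all the scores are invented for illustration, not measured data, and counting false accepts from a list of scored segments is a simplification of how continuous audio would be evaluated.

```python
def error_rates(pos_scores, neg_scores, neg_hours, threshold):
    """Return (false-rejection rate, false accepts per hour) at one threshold.

    pos_scores: detector confidence on clips that DO contain the wake word.
    neg_scores: peak confidence on segments of audio that do NOT contain it
                (conversation, music, TV), drawn from neg_hours of audio.
    """
    misses = sum(1 for s in pos_scores if s < threshold)
    false_accepts = sum(1 for s in neg_scores if s >= threshold)
    return misses / len(pos_scores), false_accepts / neg_hours


# Hypothetical toy scores, not results from any real system.
pos = [0.95, 0.91, 0.88, 0.72, 0.65, 0.97, 0.83, 0.59]
neg = [0.20, 0.55, 0.61, 0.35, 0.74, 0.12, 0.90, 0.48]

for t in (0.5, 0.7, 0.9):
    frr, fa = error_rates(pos, neg, neg_hours=2.0, threshold=t)
    print(f"threshold={t:.1f}  miss rate={frr:.3f}  false accepts/hour={fa:.1f}")
```

On these toy numbers, raising the threshold from 0.5 to 0.9 cuts false accepts from 2.0 to 0.5 per hour while the miss rate climbs from 0 to 62.5%. A better model shifts the whole trade-off curve rather than just moving along it.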

This design is also central to how companies describe the privacy of voice assistants: because only the audio beginning around the wake word (the trigger phrase and the request that follows) is supposed to be sent for full processing, the claim is that the device is not continuously uploading what it hears. In practice, the reliability and tuning of wake-word detection, and what happens to audio captured after a false trigger, have been recurring sources of both annoyance and privacy concern.
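One common way to capture audio from just before a trigger without storing anything longer is a fixed-size ring buffer that continuously overwrites its oldest frames. The sketch below is a hypothetical illustration of that idea; the class name `PreRollBuffer`, the suggested sizing, and the integer stand-ins for audio frames are assumptions, and it does not describe how any particular product handles its audio.

```python
from collections import deque


class PreRollBuffer:
    """Hypothetical fixed-size ring buffer holding only the newest audio frames.

    Frames older than `capacity` are discarded as new ones arrive, so audio
    never accumulates on the device unless a trigger asks for a snapshot.
    Sizing is an assumption: e.g. 50 frames of 10 ms is about 0.5 s.
    """

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, frame):
        # When the deque is full, appending silently drops the oldest frame.
        self.frames.append(frame)

    def snapshot(self):
        # Oldest-first copy of what is buffered, e.g. the moments just before
        # and during the wake word, to prepend to the request audio.
        return list(self.frames)


# Toy demo with integer "frames" and a tiny capacity.
demo = PreRollBuffer(capacity=3)
for frame in range(10):
    demo.push(frame)
print(demo.snapshot())  # → [7, 8, 9]
```

Because the buffer's size is fixed, memory use stays constant no matter how long the device listens; only a trigger causes its contents to be copied out.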

For a general reader, wake-word detection is the small gatekeeper behind smart speakers and voice assistants: the piece that is constantly deciding whether the AI should start paying attention.