A Survey of Code-switched Speech and Language Processing

“A Survey of Code-switched Speech and Language Processing,” posted to arXiv on March 25, 2019 by Sunayana Sitaram (Microsoft Research India), Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W. Black (Carnegie Mellon), reviews how natural language and speech systems cope with code-switching: the alternation of languages within a single conversation or even a single utterance.

Code-switching is not a rare edge case. It is, as the authors put it, a common communicative phenomenon across multilingual communities worldwide, where bilingual speakers fluidly mix languages such as Hindi and English or Spanish and English. Yet most NLP systems assume one language per input. The survey catalogs available datasets and computational approaches across speech recognition, language identification, parsing, and end-to-end systems, and identifies the central bottleneck plainly: code-switching data and resources are scarce, because most written and recorded corpora are filtered down to a single language.
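To make the phenomenon concrete, here is a minimal sketch of what a code-switched utterance looks like once each word carries a language tag. It assumes the tags already exist: the Hindi-English sentence and its `hi`/`en` labels are hand-written for illustration and are not taken from the survey. The helper name `switch_points` is hypothetical. In practice, producing those tags is the word-level language identification task the survey covers.

```python
from typing import List, Tuple

# (token, language tag) pairs; tags here are hand-assigned for illustration only.
TaggedToken = Tuple[str, str]


def switch_points(tagged: List[TaggedToken]) -> List[int]:
    """Return the indices of tokens whose language tag differs from the previous token's."""
    points = []
    for i in range(1, len(tagged)):
        if tagged[i][1] != tagged[i - 1][1]:
            points.append(i)
    return points


def is_code_switched(tagged: List[TaggedToken]) -> bool:
    """An utterance counts as code-switched here if it contains at least one switch point."""
    return len(switch_points(tagged)) > 0


# Hypothetical romanized Hindi-English utterance.
utterance = [
    ("main", "hi"), ("kal", "hi"), ("office", "en"), ("jaunga", "hi"),
    ("for", "en"), ("the", "en"), ("meeting", "en"),
]

print(switch_points(utterance))      # indices where the language changes
print(is_code_switched(utterance))
```

A system that assumes one language per input would have to assign this whole utterance a single label, which is the mismatch the survey describes.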

The paper argues that handling code-switched text and speech is essential for building agents and assistants that serve multilingual users as they actually talk, rather than forcing them into a monolingual mold.

For product teams, the practical point is that a large share of real users do not stay in one language, and systems trained on clean monolingual data can fail in exactly the conversational settings where mixed-language speech is most natural.
