SentencePiece: Language-Independent Subword Tokenizer

“SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing” by Taku Kudo and John Richardson (submitted August 19, 2018) made subword tokenization practical across many languages at once. Earlier subword segmentation tools assumed the input had already been pre-tokenized into words, an assumption that breaks down for languages such as Japanese and Chinese, which do not put spaces between words. SentencePiece removes that assumption.

As the authors explain, the tool “can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system.” It treats the input as a raw sequence of characters and handles whitespace as an ordinary symbol, so the normalized input text can be reconstructed exactly from the tokens. In a validation experiment on English-Japanese neural machine translation, the authors report that training subword models directly from raw sentences achieves accuracy comparable to approaches that rely on pre-tokenization. They also released open-source C++ and Python implementations.
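
The reversibility described above comes from escaping whitespace before splitting. The following is a minimal sketch of that idea only, assuming a hand-written toy vocabulary and greedy longest-match splitting in place of a trained model. The names `encode`, `decode`, and `TOY_VOCAB` are hypothetical and are not the library's API; the `▁` (U+2581) meta symbol is the one the paper uses to mark whitespace.

```python
# Toy illustration of whitespace escaping; this is NOT the SentencePiece library.
META = "\u2581"  # "▁", the meta symbol used to mark whitespace

# Hypothetical hand-written vocabulary; a real model learns its pieces from data.
TOY_VOCAB = {META + "Hello", META + "wor", "ld", "."}


def encode(text, vocab=TOY_VOCAB):
    """Escape spaces, then split greedily into the longest known pieces."""
    # Assumes the input does not already contain the meta symbol.
    escaped = META + text.replace(" ", META)
    pieces = []
    i = 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def decode(pieces):
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    text = "".join(pieces).replace(META, " ")
    return text[1:]  # drop the dummy leading space added by encode


pieces = encode("Hello world.")
print(pieces)          # ['▁Hello', '▁wor', 'ld', '.']
print(decode(pieces))  # Hello world.
```

Because spaces survive as ordinary symbols inside the pieces, detokenization needs no language-specific rules: text without any spaces, such as Japanese, round-trips through the same two functions.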

SentencePiece matters because it became the default tokenizer for many influential models, including Google’s T5 and a wide range of multilingual systems. Its reversible, language-neutral design is a quiet but essential piece of plumbing that lets one model handle dozens of languages without custom text-processing rules for each one.
