CodeT5

CodeT5 is a pre-trained code model from Salesforce Research, introduced in “CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation” by Yue Wang, Weishi Wang, Shafiq Joty, and Steven Hoi. The paper appeared on arXiv in September 2021 and was published at EMNLP 2021. Unlike encoder-only models in the BERT style or decoder-only models in the GPT style, CodeT5 uses a full encoder-decoder architecture built on the T5 design, which makes it well suited to both understanding code and generating it.

Its distinctive idea is identifier-aware pretraining. The model is trained to recognize which tokens are identifiers, the names developers give to variables and functions, and to recover those names when they are masked, so it can exploit the meaning they carry. CodeT5 also uses a bimodal objective that pairs code with its comments to better align natural language and programming language. The result is a single model that handles both understanding tasks, such as defect detection, and generation tasks, such as code-to-code translation.
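
To make the identifier-aware idea concrete, here is a minimal sketch of how a training example of this kind could be built for a Python snippet. It is an illustration under stated assumptions, not the CodeT5 preprocessing pipeline: the function name mask_identifiers is hypothetical, Python's standard tokenize module stands in for a real parser, every non-keyword name (including built-ins) is treated as an identifier, and the <extra_id_N> sentinel format is borrowed from T5's convention. Each distinct identifier gets a single sentinel that is reused for all of its occurrences.

```python
import io
import keyword
import tokenize

# Token types that only describe layout and carry no text worth tagging.
LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER)


def mask_identifiers(source):
    """Mask every identifier in `source`, one sentinel per distinct name.

    Hypothetical helper for illustration. Returns (masked, target, tags):
      masked: the code with identifiers replaced by sentinels
      target: the sequence a decoder would be asked to produce
      tags:   (token_text, is_identifier) pairs, a toy tagging label set
    """
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

    sentinels = {}    # identifier -> sentinel, in order of first appearance
    tags = []         # per-token identifier labels
    name_tokens = []  # identifier tokens, kept for the replacement pass
    for tok in tokens:
        if tok.type in LAYOUT:
            continue
        # Simplifying assumption: any NAME that is not a keyword is an identifier.
        is_identifier = (tok.type == tokenize.NAME
                         and not keyword.iskeyword(tok.string))
        tags.append((tok.string, is_identifier))
        if is_identifier:
            sentinels.setdefault(tok.string, f"<extra_id_{len(sentinels)}>")
            name_tokens.append(tok)

    # Replace right to left so earlier column offsets stay valid.
    lines = source.splitlines(keepends=True)
    for tok in reversed(name_tokens):
        row, start = tok.start
        _, end = tok.end
        line = lines[row - 1]
        lines[row - 1] = line[:start] + sentinels[tok.string] + line[end:]

    masked = "".join(lines)
    target = " ".join(f"{s} {name}" for name, s in sentinels.items())
    return masked, target, tags


example = "def add(a, b):\n    total = a + b\n    return total\n"
masked, target, tags = mask_identifiers(example)
print(masked)
print(target)  # <extra_id_0> add <extra_id_1> a <extra_id_2> b <extra_id_3> total
```

The masked code is what an encoder would read, the target string is what a decoder would be trained to emit, and the tags list corresponds to the simpler task of labeling which tokens are identifiers at all. Reusing one sentinel per name means the model has to infer a name from every place it is used, which is where the meaning carried by identifiers comes into play.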

CodeT5 was an influential step in treating code understanding and generation together rather than as separate problems. For businesses, the same model can power code search, bug detection, summarization, and generation, and its attention to identifiers reflects a key insight: in real code, the names matter as much as the syntax.
