WMDP (Weapons of Mass Destruction Proxy)

WMDP, the Weapons of Mass Destruction Proxy, is a public benchmark dataset designed to measure hazardous knowledge in large language models and to support its removal. It was introduced in a paper led by Nathaniel Li with 55 co-authors, first submitted on March 5, 2024, with a final version in May 2024. The work responds directly to concerns, raised in the US executive order on AI, that LLMs might help malicious actors.

The benchmark contains 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge across three domains: biosecurity, cybersecurity, and chemical security. Rather than directly testing dangerous instructions, the questions probe the surrounding expert knowledge that would be a precursor to misuse, which lets researchers gauge a model’s hazardous knowledge without publishing a how-to guide.
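
As a rough illustration of how a multiple-choice proxy like this can be scored, the sketch below computes per-domain accuracy from a model's chosen answers. The record fields (`domain`, `answer`), the `predictions` list, and the toy items are all hypothetical and are not the dataset's actual schema; the placeholder records carry no real benchmark content.

```python
from collections import defaultdict


def accuracy_by_domain(items, predictions):
    """Return {domain: fraction correct} for multiple-choice items.

    items: list of dicts with hypothetical keys "domain" and "answer"
           (the index of the correct choice).
    predictions: list of chosen choice indices, aligned with items.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["domain"]] += 1
        if pred == item["answer"]:
            correct[item["domain"]] += 1

    # Every domain in `total` has at least one item, so no division by zero.
    return {domain: correct[domain] / total[domain] for domain in total}


# Toy placeholder records (assumed structure, not real benchmark questions).
toy_items = [
    {"domain": "bio", "answer": 2},
    {"domain": "bio", "answer": 0},
    {"domain": "cyber", "answer": 1},
    {"domain": "chem", "answer": 3},
]
toy_predictions = [2, 1, 1, 0]

scores = accuracy_by_domain(toy_items, toy_predictions)
```

If each question offers four choices (an assumption here), random guessing scores about 0.25, so a domain score near that level suggests the model is behaving as though it lacks the probed knowledge.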

The same work introduces an unlearning method called RMU (Representation Misdirection for Unlearning), which works by controlling a model’s internal representations. The authors show that it can reduce performance on the hazardous proxy while largely preserving general capability in fields like biology and computer science. The benchmark and code were released at wmdp.ai.
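
To make the idea of controlling internal representations more concrete, here is a minimal sketch of a representation-steering unlearning loss. It is an illustrative simplification, not the authors' implementation: it assumes activations are plain lists of floats, that activations on hazardous ("forget") inputs are pushed toward a fixed, scaled random direction, and that activations on benign ("retain") inputs are kept close to those of a frozen copy of the model. The function names and the `steer_scale` and `alpha` parameters are hypothetical.

```python
import math
import random


def mean_squared_distance(a, b):
    """Mean squared difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def random_unit_vector(dim, seed=0):
    """Sample a fixed random direction with unit Euclidean norm."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def unlearning_loss(forget_act, retain_act, frozen_retain_act,
                    control, steer_scale=5.0, alpha=1.0):
    """Combine a forget term and a retain term (assumed formulation).

    forget term: pull activations on hazardous inputs toward
                 steer_scale * control, scrambling what they encode.
    retain term: keep activations on benign inputs close to the
                 frozen model's activations, preserving capability.
    """
    target = [steer_scale * c for c in control]
    forget_loss = mean_squared_distance(forget_act, target)
    retain_loss = mean_squared_distance(retain_act, frozen_retain_act)
    return forget_loss + alpha * retain_loss


# Toy usage with made-up 2-dimensional activations.
control = [1.0, 0.0]
loss = unlearning_loss(
    forget_act=[0.0, 0.0],
    retain_act=[1.0, 1.0],
    frozen_retain_act=[0.0, 1.0],
    control=control,
    steer_scale=2.0,
    alpha=2.0,
)
```

The `alpha` weight captures the trade-off described above: a larger value favors preserving general capability, while a smaller one favors more aggressive removal.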

For a general reader, WMDP is notable because it tackles a hard tension head-on: how to measure and reduce a model’s most dangerous knowledge without creating a roadmap for harm in the process.
