In 2020, BMJ Open published a study by Gilbert and colleagues that put consumer symptom-checker apps to a head-to-head test against doctors. The researchers built 200 primary-care clinical vignettes with gold-standard answers and ran them through eight symptom-assessment apps and seven general practitioners, measuring two things: whether the tool suggested the right condition and whether its urgency advice was safe.
No app matched the doctors on diagnosis. The GPs included the correct condition among their top three suggestions 82.1 percent of the time on average. The best app, Ada, reached 70.5 percent and had the broadest coverage, suggesting conditions for 99 percent of vignettes. On safety (whether the urgency advice would steer a patient somewhere reasonable), only three apps came within one standard deviation of the GPs: Ada at 97.0 percent (matching the GP mean), Babylon at 95.1 percent, and Symptomate at 97.8 percent. Many apps were both less accurate and more risk-averse, tending to over-refer.
The study is a useful reality check on a product category that markets itself as a substitute for, or front door to, professional care. It showed that the better symptom checkers can give defensibly safe triage advice while still falling well short of a clinician’s diagnostic breadth, and that “safe” and “accurate” are different axes that must be measured separately.
Why business readers should care: this is what honest benchmarking of a health AI product looks like, with the human professional as the baseline and safety and accuracy reported as distinct metrics. It also explains why symptom checkers are positioned as triage aids rather than diagnosticians.