Large Language Models Encode Clinical Knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, … (+26), Vivek Natarajan
Dec 2022
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries, and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing the prior state of the art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
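The abstract describes instruction prompt tuning as learning from a few exemplars while keeping the LLM itself frozen. The toy sketch below illustrates that idea in miniature, with an assumed stand-in: a fixed linear "model" plays the role of the frozen LLM, and only a small matrix of soft prompt vectors prepended to the input embeddings is trained. This is not the paper's implementation; all names and shapes here are illustrative assumptions.

```python
import numpy as np

# Toy illustration of soft prompt tuning (assumption-based sketch, not
# the paper's actual method): the "model" weights W_frozen never change;
# only the prepended soft_prompt vectors are optimized on an exemplar.

rng = np.random.default_rng(0)
d_model, n_prompt, n_tokens = 8, 2, 4

# Frozen "model": a fixed linear read-out over the mean embedding.
W_frozen = rng.normal(size=(d_model,))

def model_loss(soft_prompt, x_embeds, target):
    """Prepend soft prompt, mean-pool, apply frozen read-out, squared error."""
    seq = np.concatenate([soft_prompt, x_embeds], axis=0)
    pred = seq.mean(axis=0) @ W_frozen
    return float((pred - target) ** 2)

# One exemplar (embedded input, scalar target) stands in for the small
# set of clinician-curated examples the abstract mentions.
x = rng.normal(size=(n_tokens, d_model))
y = 1.0

soft_prompt = np.zeros((n_prompt, d_model))

# Finite-difference gradient descent over the soft prompt only;
# W_frozen is never touched, mirroring the parameter-efficient setup.
lr, eps = 0.1, 1e-5
for _ in range(200):
    base = model_loss(soft_prompt, x, y)
    grad = np.zeros_like(soft_prompt)
    for i in range(n_prompt):
        for j in range(d_model):
            bumped = soft_prompt.copy()
            bumped[i, j] += eps
            grad[i, j] = (model_loss(bumped, x, y) - base) / eps
    soft_prompt -= lr * grad

print(model_loss(soft_prompt, x, y))  # loss after tuning the prompt alone
```

The design point the sketch captures is that the trainable surface is tiny (here 2×8 values) relative to the frozen model, which is what makes the approach practical for adapting a 540B-parameter LLM with only a handful of exemplars.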