
Large Language Models Encode Clinical Knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, ... (+26), Vivek Natarajan
Dec 2022
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.


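The "Ten Questions for Reading a Paper" were proposed by Dr. Harry Shum (沈向洋) to encourage readers to approach a paper with these ten questions in mind and to build a mental model from the useful information they find.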

Q1 What problem does the paper try to solve?
Q2 Is this a new problem?
Q3 What scientific hypothesis does the paper aim to verify?