QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen
Oct 2022
Abstract
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, the performance of these models drops as we reduce the number of layers, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving an x3 speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more highly efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to any other efficient approaches per any computational budget on the SQuAD1.1 dataset (up to an x8.8 speedup with <1% accuracy loss). The code to reproduce this work will be publicly released on GitHub soon.
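The abstract names low-bit quantization as the final efficiency step on top of MiniLM distillation and LAT. Since the authors' code is not yet released, the snippet below is only a minimal sketch of what post-training INT8 quantization of a MiniLM question-answering model can look like; the checkpoint name, the use of PyTorch dynamic quantization, and the QA head (which would still need SQuAD1.1 fine-tuning) are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch only: post-training INT8 dynamic quantization of a MiniLM QA model.
# The checkpoint name and quantization method are assumptions; the paper's own
# MiniLM + LAT + quantization pipeline is not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "microsoft/MiniLM-L12-H384-uncased"  # hypothetical base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)  # QA head needs SQuAD1.1 fine-tuning
model.eval()

# Quantize the Linear layers to INT8 to shrink the model and speed up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run an example question/context pair through the quantized model.
question = "What does QuaLA-MiniLM combine?"
context = ("QuaLA-MiniLM combines MiniLM distillation, the Length Adaptive "
           "Transformer technique, and low-bit quantization.")
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```

Dynamic quantization is only one possible realization of the "low-bit quantization" step; the speedups reported in the abstract come from combining it with the length-adaptive inference that LAT provides.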