GPT-2
GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on the WebText dataset, text scraped from 45 million web links. It largely follows the previous GPT architecture with some modifications: layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block. A modified initialization accounts for the accumulation along the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers. The vocabulary is expanded to 50,257 tokens, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
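As a rough illustration of these changes, the sketch below (PyTorch, with hypothetical names such as Gpt2Block; the original model was not implemented this way) shows a pre-norm Transformer block, the extra final layer normalization, and the $1/\sqrt{N}$ residual-weight scaling at initialization. Here $N$ is taken to be the number of residual connections (two per block), which is one common reading of the paper.

```python
# Minimal sketch (PyTorch, hypothetical names) of the GPT-2-style modifications:
# pre-norm residual blocks, a final LayerNorm, and 1/sqrt(N) residual scaling.
import math
import torch
import torch.nn as nn

N_LAYERS = 12          # 12 blocks for the smallest GPT-2 configuration
D_MODEL = 768
N_HEADS = 12
CONTEXT = 1024         # context size expanded from 512 to 1024 tokens
VOCAB = 50257          # expanded BPE vocabulary

class Gpt2Block(nn.Module):
    """Pre-norm block: LayerNorm is applied to the *input* of each sub-block
    (pre-activation style) rather than after the residual addition."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(D_MODEL)
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(D_MODEL)
        self.mlp = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL),
            nn.GELU(),
            nn.Linear(4 * D_MODEL, D_MODEL),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                        # residual path, no post-norm
        x = x + self.mlp(self.ln2(x))    # second residual sub-block
        return x

blocks = nn.ModuleList([Gpt2Block() for _ in range(N_LAYERS)])
final_ln = nn.LayerNorm(D_MODEL)         # extra LayerNorm after the final block

# Modified initialization: scale the residual-projection weights by 1/sqrt(N)
# so that activations do not grow with depth as residual contributions accumulate.
N_RESIDUAL = 2 * N_LAYERS                # assumption: two residual layers per block
with torch.no_grad():
    for block in blocks:
        block.attn.out_proj.weight *= 1.0 / math.sqrt(N_RESIDUAL)
        block.mlp[2].weight *= 1.0 / math.sqrt(N_RESIDUAL)
```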
Related topics: GPT-3, GPT, Text Generation, XLNet, RoBERTa, BERT, BART, Transformer-XL, Story Generation, T5
Notable scholars
Richard Socher: 81897 citations, 249 papers
Andrew McCallum: 53962 citations, 484 papers
Dawn Song: 47685 citations, 470 papers
Huan Liu: 47556 citations, 767 papers
Alexandre Gramfort: 43993 citations, 287 papers
Xiang Gao: 38760 citations, 1475 papers
Ting Wang: 37128 citations, 1144 papers
Luke Zettlemoyer: 35709 citations, 278 papers
Tat-Seng Chua: 30731 citations, 750 papers
Thomas L. Griffiths: 30505 citations, 557 papers