GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) at the time of its release. The model is pretrained on WebText, a dataset of text scraped from 45 million website links. It largely follows the previous GPT architecture with some modifications: layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block. A modified initialization is used that accounts for accumulation on the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers. The vocabulary is expanded to 50,257 tokens, the context size is expanded from 512 to 1024 tokens, and a larger batch size of 512 is used.
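The pre-LN sub-block ordering and the $1/\sqrt{N}$ residual-weight scaling can be sketched as follows. This is a minimal numpy illustration, not the original implementation; the hidden size, depth, and the base standard deviation of 0.02 are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned scale/bias for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d = 8      # hidden size (illustrative)
N = 24     # number of residual layers (illustrative)

# Residual-path projection weights scaled at init by 1/sqrt(N),
# so the variance of the residual stream stays bounded with depth.
W = rng.standard_normal((d, d)) * 0.02 / np.sqrt(N)

def pre_ln_block(x, W):
    # GPT-2 style: layer norm is applied to the *input* of the sub-block,
    # and the unnormalized input is carried on the residual path.
    return x + layer_norm(x) @ W

x = rng.standard_normal((4, d))
y = pre_ln_block(x, W)
```

The key design point is that normalization happens before the transformation rather than after the residual addition (as in the original post-LN Transformer), which keeps the identity path clean and eases optimization at depth.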
Related: GPT-3, GPT, Text Generation, XLNet, RoBERTa, BERT, BART, Transformer-XL, Story Generation, T5