Electric is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.

Specifically, like BERT, Electric models $p_{\text{data}}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right)$, but it does not use masking or a softmax layer. Electric first maps the unmasked input $\mathbf{x}=\left[x_{1}, \ldots, x_{n}\right]$ into contextualized vector representations $\mathbf{h}(\mathbf{x})=\left[\mathbf{h}_{1}, \ldots, \mathbf{h}_{n}\right]$ using a transformer network. The model assigns a given position $t$ an energy score

$$E(\mathbf{x})_{t}=\mathbf{w}^{T} \mathbf{h}(\mathbf{x})_{t}$$

using a learned weight vector $\mathbf{w}$. The energy function defines a distribution over the possible tokens at position $t$ as

$$\begin{aligned} p_{\theta}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right) &= \exp \left(-E(\mathbf{x})_{t}\right) / Z_{\theta}\left(\mathbf{x}_{\backslash t}\right) \\ &= \frac{\exp \left(-E(\mathbf{x})_{t}\right)}{\sum_{x^{\prime} \in \mathcal{V}} \exp \left(-E\left(\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)\right)_{t}\right)} \end{aligned}$$

where $\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)$ denotes replacing the token at position $t$ with $x^{\prime}$ and $\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike BERT, which produces the probabilities for all possible tokens $x^{\prime}$ using a softmax layer, Electric takes each candidate $x^{\prime}$ as input to the transformer. As a result, computing $p_{\theta}$ is prohibitively expensive because the partition function $Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)$ requires running the transformer $|\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)$ is more due t
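
To make the cost of the partition function concrete, the following is a minimal sketch of the exact computation of $p_{\theta}$ at one position. It is not the authors' implementation: the `encode` function is a hypothetical toy stand-in for the transformer, and the vocabulary `VOCAB`, the embeddings `EMB`, and the weight vector `W` are made-up values chosen only for illustration. The point it shows is that normalizing the energies requires one encoder run per vocabulary entry.

```python
import math

# All values below are illustrative assumptions, not taken from the source.
VOCAB = ["the", "cat", "sat", "mat"]   # toy vocabulary V
EMB = {                                # hypothetical 2-d token embeddings
    "the": (0.1, 0.3),
    "cat": (0.9, -0.2),
    "sat": (-0.4, 0.7),
    "mat": (0.5, 0.5),
}
W = (1.0, -0.5)                        # stand-in for the learned weight vector w

calls = 0  # counts how many times the toy encoder is run


def encode(tokens):
    """Toy stand-in for the transformer: returns one vector h_t per position.

    h_t mixes the token's own embedding with the mean embedding of the
    other positions, so it depends on both x_t and its context.
    """
    global calls
    calls += 1
    hidden = []
    for t, tok in enumerate(tokens):
        others = [EMB[o] for i, o in enumerate(tokens) if i != t]
        if others:
            ctx = [sum(dim) / len(others) for dim in zip(*others)]
        else:
            ctx = [0.0, 0.0]
        hidden.append([math.tanh(e + c) for e, c in zip(EMB[tok], ctx)])
    return hidden


def energy(tokens, t):
    """E(x)_t = w^T h(x)_t for the unmasked input x."""
    h_t = encode(tokens)[t]
    return sum(w_i * h_i for w_i, h_i in zip(W, h_t))


def replace(tokens, t, new_token):
    """REPLACE(x, t, x'): copy of x with position t set to x'."""
    return tokens[:t] + [new_token] + tokens[t + 1:]


def cloze_distribution(tokens, t):
    """Exact p_theta(. | x_\\t): needs one encoder run per candidate x' in V."""
    unnorm = {v: math.exp(-energy(replace(tokens, t, v), t)) for v in VOCAB}
    z = sum(unnorm.values())  # partition function Z_theta(x_\t)
    return {v: u / z for v, u in unnorm.items()}


x = ["the", "cat", "sat"]
probs = cloze_distribution(x, 1)
print(probs["cat"], "encoder runs:", calls)
```

In this sketch the encoder is run `len(VOCAB)` times for a single position. With a real transformer and a word-piece vocabulary of tens of thousands of entries, the same loop is what makes exact normalization impractical.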








