Split Attention

A Split Attention block enables attention across feature-map groups. As in ResNeXt blocks, the features can be divided into several groups, and the number of feature-map groups is given by a cardinality hyperparameter $K$. The resulting feature-map groups are called cardinal groups. Split Attention blocks introduce a new radix hyperparameter $R$ that indicates the number of splits within a cardinal group, so the total number of feature groups is $G = KR$. We may apply a series of transformations $\{\mathcal{F}_1, \mathcal{F}_2, \cdots, \mathcal{F}_G\}$ to each individual group, so that the intermediate representation of each group is $U_i = \mathcal{F}_i\left(X\right)$ for $i \in \{1, 2, \cdots, G\}$.

A combined representation for each cardinal group can be obtained by fusing its splits via element-wise summation. The representation for the $k$-th cardinal group is $\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j$, where $\hat{U}^k \in \mathbb{R}^{H\times W\times C/K}$ for $k \in \{1, 2, \cdots, K\}$, and $H$, $W$ and $C$ are the block output feature-map sizes. Global contextual information with embedded channel-wise statistics $s^k\in\mathbb{R}^{C/K}$ can be gathered with global average pooling across the spatial dimensions. Here the $c$-th component is calculated as:

$$ s^k_c = \frac{1}{H\times W} \sum_{i=1}^H\sum_{j=1}^W \hat{U}^k_c(i, j). $$

A weighted fusion of the cardinal group representation $V^k\in\mathbb{R}^{H\times W\times C/K}$ is aggregated using channel-wise soft attention, where each feature-map channel is produced using a weighted combination over splits. The $c$-th channel is calculated as:

$$ V^k_c=\sum_{i=1}^R a^k_i(c) U_{R(k-1)+i}, $$

where $a_i^k(c)$ denotes a (soft) assignment weight given by:

$$ a_i^k(c) = \begin{cases} \frac{\exp(\mathcal{G}^c_i(s^k))}{\sum_{j=1}^R \exp(\mathcal{G}^c_j(s^k))} & \quad\textrm{if } R>1, \\ \frac{1}{1+\exp(-\mathcal{G}^c_i(s^k))} & \quad\textrm{if } R=1, \end{cases} $$

and the mapping $\mathcal{G}_i^c$ determines the weight of each split f
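The computation described above for a single cardinal group (sum the splits, pool over space, compute per-channel weights across splits, then fuse) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: feature maps are nested lists of shape $H \times W \times C/K$, the function names are made up for this example, and `gate` is a hypothetical stand-in for the mapping $\mathcal{G}$, whose form is not given in the text. Split indices are 0-based in the code.

```python
import math


def global_avg_pool(u_hat):
    """Average an H x W x C' feature map over space, giving s^k (length C')."""
    h, w, c = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    return [
        sum(u_hat[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
        for ch in range(c)
    ]


def attention_weights(logits):
    """Turn logits[i][c], standing in for G_i^c(s^k), into weights a_i^k(c).

    Softmax across the R splits when R > 1, sigmoid when R == 1.
    """
    r, n_ch = len(logits), len(logits[0])
    if r == 1:
        return [[1.0 / (1.0 + math.exp(-z)) for z in logits[0]]]
    weights = [[0.0] * n_ch for _ in range(r)]
    for c in range(n_ch):
        # Subtracting the max keeps exp() from overflowing; the result is unchanged.
        m = max(logits[i][c] for i in range(r))
        exps = [math.exp(logits[i][c] - m) for i in range(r)]
        total = sum(exps)
        for i in range(r):
            weights[i][c] = exps[i] / total
    return weights


def split_attention(splits, gate):
    """Fuse the R splits of one cardinal group into V^k.

    splits: list of R feature maps, each an H x W x C' nested list.
    gate:   hypothetical callable mapping s^k (length C') to R x C' logits.
    """
    r = len(splits)
    h, w, c = len(splits[0]), len(splits[0][0]), len(splits[0][0][0])
    # U-hat^k: element-wise sum across the R splits.
    u_hat = [
        [[sum(splits[s][i][j][ch] for s in range(r)) for ch in range(c)]
         for j in range(w)]
        for i in range(h)
    ]
    s_k = global_avg_pool(u_hat)
    a = attention_weights(gate(s_k))
    # V^k_c: per-channel weighted combination over the splits.
    return [
        [[sum(a[s][ch] * splits[s][i][j][ch] for s in range(r)) for ch in range(c)]
         for j in range(w)]
        for i in range(h)
    ]


# Toy input: R = 2 splits, H = W = 1, C' = 2.
splits = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]


def uniform_gate(s):
    # Assumed gate that returns equal logits for both splits.
    return [[0.0] * len(s) for _ in range(2)]


print(split_attention(splits, uniform_gate))  # → [[[2.0, 3.0]]]
```

With equal logits the softmax assigns weight 0.5 to each split, so the fused output is simply the per-channel mean of the two splits. A learned gate would instead shift weight toward one split or the other on a channel-by-channel basis.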
Related topics: ResNeSt, ResMLP, Grouped Convolution, Math Word Problem Solving, MLP-Mixer, Video Instance Segmentation, Face Hallucination, Depthwise Convolution, Swin Transformer, EfficientNet








