Vision Transformer
The Vision Transformer, or ViT, is a model for image classification that applies a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of prepending an extra learnable "classification token" to the sequence is used.
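The preprocessing pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration, not the reference implementation: the image size, patch size, and embedding dimension below are hypothetical choices, and the random matrices stand in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 32x32 RGB image, 8x8 patches, embedding dim 64.
H = W = 32
P = 8                                # patch size
C = 3                                # channels
D = 64                               # embedding dimension
num_patches = (H // P) * (W // P)    # 16 patches

image = rng.standard_normal((H, W, C))

# 1. Split the image into fixed-size patches and flatten each patch.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(num_patches, P * P * C)        # (16, 192)

# 2. Linearly embed each patch with a single (here random) projection.
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_embed                               # (16, 64)

# 3. Prepend the learnable [class] token, whose final hidden state
#    is what the classification head reads.
cls_token = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls_token, tokens], axis=0)     # (17, 64)

# 4. Add learned position embeddings before the Transformer encoder.
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (17, 64): 16 patch tokens plus one class token
```

The resulting `(num_patches + 1, D)` sequence is exactly what a standard Transformer encoder consumes; everything downstream is unchanged from NLP-style Transformers.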
Related topics: Swin Transformer, DeiT, MLP-Mixer, MViT, EfficientNet, Self-Supervised Learning, DeepFake Detection, T2T-ViT, Multi-Head Attention, DINO
Important Scholars
- Kaiming He: 202,871 citations, 125 papers
- Ross Girshick: 150,810 citations, 165 papers
- Jitendra Malik: 118,374 citations, 531 papers
- Luc Van Gool: 91,399 citations, 1,409 papers
- Cordelia Schmid: 82,310 citations, 551 papers
- Piotr Dollár: 75,386 citations, 100 papers
- Shuai Liu: 74,044 citations, 1,329 papers
- Zhen Li: 72,612 citations, 1,738 papers
- Ming-Hsuan Yang: 61,951 citations, 641 papers
- Dacheng Tao: 57,097 citations, 1,414 papers