Vision Transformer

The Vision Transformer, or ViT, is an image-classification model that applies a Transformer-style architecture to patches of an image. The image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of prepending an extra learnable “classification token” to the sequence is used.
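The front end described above (patchify, linearly embed, prepend a classification token, add position embeddings) can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the projection weights, [CLS] token, and position embeddings are random stand-ins for parameters that would be learned during training, and the patch size and embedding dimension are arbitrary choices.

```python
import numpy as np

def vit_embed(image, patch_size=16, dim=64, rng=None):
    """ViT front end: split an image into patches, linearly embed each one,
    prepend a [CLS] token, and add position embeddings."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    p = patch_size
    n = (H // p) * (W // p)  # number of non-overlapping patches
    # 1. Split into p x p patches and flatten each one to a vector.
    patches = (image[:H // p * p, :W // p * p]
               .reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n, p * p * C))
    # 2. Linear projection ("patch embedding"); random stand-in for
    #    learned weights.
    W_e = rng.normal(scale=0.02, size=(p * p * C, dim))
    tokens = patches @ W_e                            # (n, dim)
    # 3. Prepend the learnable classification token.
    cls = rng.normal(scale=0.02, size=(1, dim))
    tokens = np.concatenate([cls, tokens], axis=0)    # (n + 1, dim)
    # 4. Add learned position embeddings.
    pos = rng.normal(scale=0.02, size=(n + 1, dim))
    return tokens + pos  # this sequence is fed to the Transformer encoder

# A 224x224 RGB image with 16x16 patches yields 196 patches + 1 [CLS] token.
seq = vit_embed(np.zeros((224, 224, 3)))
print(seq.shape)  # (197, 64)
```

After the encoder, classification is done from the output at the [CLS] position, which is why the extra token is prepended to the sequence.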
Related topics: Swin Transformer, DeiT, MLP-Mixer, MViT, EfficientNet, Self-Supervised Learning, DeepFake Detection, T2T-ViT, Multi-Head Attention, DINO


Key Scholars

Kaiming He: 202,871 citations, 125 papers
Ross Girshick: 150,810 citations, 165 papers
Jitendra Malik: 118,374 citations, 531 papers
Luc Van Gool: 91,399 citations, 1,409 papers
Cordelia Schmid: 82,310 citations, 551 papers
Piotr Dollár: 75,386 citations, 100 papers
Shuai Liu: 74,044 citations, 1,329 papers
Zhen Li: 72,612 citations, 1,738 papers
Ming-Hsuan Yang: 61,951 citations, 641 papers
Dacheng Tao: 57,097 citations, 1,414 papers