Unsupervised Open-Vocabulary Object Localization in Videos
Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
Sep 2023
Abstract
In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by reading localized semantic information from the pre-trained CLIP model in an unsupervised way. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
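To make the second stage more concrete, the following is a minimal sketch of how text could be assigned to slots once each slot has a feature vector in the same embedding space as a set of candidate labels. It is an illustration under stated assumptions, not the authors' implementation: the function name `assign_labels`, the toy vocabulary, and the hand-written feature arrays are all hypothetical, and the arrays merely stand in for embeddings that would in practice come from CLIP. The matching rule used here (nearest label by cosine similarity) is likewise an assumption, since the abstract does not specify how the text is assigned.

```python
import numpy as np


def assign_labels(slot_features, text_features, vocabulary):
    """Pick, for every slot, the label whose embedding is most similar.

    slot_features: (num_slots, dim) array, one row per slot (hypothetical input).
    text_features: (num_labels, dim) array, one row per candidate label.
    vocabulary:    list of num_labels strings.
    """
    # L2-normalize rows so that the dot product equals cosine similarity.
    slots = slot_features / np.linalg.norm(slot_features, axis=1, keepdims=True)
    texts = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

    similarity = slots @ texts.T          # shape: (num_slots, num_labels)
    best = similarity.argmax(axis=1)      # index of the closest label per slot
    return [vocabulary[i] for i in best], similarity


# Toy stand-ins for real embeddings (assumption: both live in one 4-d space).
vocabulary = ["dog", "car", "person"]
text_features = np.eye(3, 4)
slot_features = np.array([
    [0.1, 0.9, 0.0, 0.2],   # closest to the "car" direction
    [0.8, 0.1, 0.1, 0.0],   # closest to the "dog" direction
])

labels, similarity = assign_labels(slot_features, text_features, vocabulary)
print(labels)
```

Normalizing both sides before the matrix product keeps the comparison independent of feature magnitude, which is the usual convention when comparing CLIP-style embeddings.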