Label: CVPR

DynIBaR: Space-time view synthesis from videos of dynamic scenes

RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

Pic2Word: Mapping pictures to words for zero-shot composed image retrieval

Unifying image-caption and image-classification datasets with prefix conditioning

Google at CVPR 2023

Enabling delightful user experiences via predictive models of human attention

Sparse video tubes for joint video and image vision transformers

Vid2Seq: a pretrained visual language model for describing multi-event videos

View Synthesis with Transformers

LOLNeRF: Learn from One Look

Revisiting Mask Transformer from a Clustering Perspective