Label: video

Modular visual question answering via code generation

Sparse video tubes for joint video and image vision transformers

MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks

Vid2Seq: a pretrained visual language model for describing multi-event videos

Large Motion Frame Interpolation

End-to-end Generative Pre-training for Multimodal Video Captioning

Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion

Experimenting with Automatic Video Creation from a Web Page

RepNet: Counting Repetitions in Videos

Audio and Visual Quality Measurement Using Fréchet Distance

Video Understanding Using Temporal Cycle-Consistency Learning