⏱️ Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding

Upload a video and ask any question! Tempo dynamically compresses visual tokens based on your query to achieve SOTA performance. 🏠 Project Page | 💻 GitHub | 📄 Paper | 👨‍💻 @Junjie Fei

⏳ Slow preprocessing? Try Examples 4 & 5 below, decrease Max Sampled Frames in Advanced Settings, or check our GitHub for full-speed local deployment.


💡 Try an Example

Examples