⏱️ Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding

Upload a video and ask any question! Tempo dynamically compresses visual tokens based on your query to achieve state-of-the-art (SOTA) performance on long video understanding.

🏠 Project Page | 💻 GitHub | 📄 Paper | 👨‍💻 @Junjie Fei

⏳ Slow preprocessing? Try Examples 4 & 5 below, decrease Max Sampled Frames in Advanced Settings, or check our GitHub for full-speed local deployment.

💡 Try an Example