Collections
Collections including paper arxiv:2404.03413. Each group of entries below corresponds to one community collection.
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
  Paper • 2404.03413 • Published • 28
- openai/clip-vit-large-patch14-336
  Zero-Shot Image Classification • Updated • 5.28M • 281
- openai/clip-vit-base-patch32
  Zero-Shot Image Classification • Updated • 19.1M • 823 (usage sketch below)
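The two openai/clip-vit-* checkpoints in this collection are tagged for zero-shot image classification. A minimal sketch of that usage with the Hugging Face transformers CLIP classes follows; the checkpoint name comes from the listing above, while the example image URL and candidate labels are illustrative placeholders, not part of the collection.

```python
# Zero-shot image classification sketch with a CLIP checkpoint from this collection.
# Requires: pip install transformers pillow requests
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # or "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Any RGB image works; this URL is only an example placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity scores, normalized to probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```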

- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 23
- MoAI: Mixture of All Intelligence for Large Language and Vision Models
  Paper • 2403.07508 • Published • 77
- DragAnything: Motion Control for Anything using Entity Representation
  Paper • 2403.07420 • Published • 15
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 33

- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29
- VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
  Paper • 2403.09530 • Published • 10
- VidToMe: Video Token Merging for Zero-Shot Video Editing
  Paper • 2312.10656 • Published • 11
- TC4D: Trajectory-Conditioned Text-to-4D Generation
  Paper • 2403.17920 • Published • 18

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 14
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 60
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 48

- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
  Paper • 2403.11481 • Published • 13
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29

- Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
  Paper • 2403.09626 • Published • 16
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
  Paper • 2403.13501 • Published • 9
- LITA: Language Instructed Temporal-Localization Assistant
  Paper • 2403.19046 • Published • 19

- Video as the New Language for Real-World Decision Making
  Paper • 2402.17139 • Published • 21
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Paper • 2310.19512 • Published • 16
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  Paper • 2401.09047 • Published • 14

- PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
  Paper • 2312.13964 • Published • 20
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
  Paper • 2312.11514 • Published • 260
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
  Paper • 2312.12491 • Published • 74
- LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
  Paper • 2401.02330 • Published • 18