Collections
Collections including paper arxiv:2404.03413. Each group of entries below corresponds to one community collection.
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
  Paper • 2404.03413 • Published • 28
- openai/clip-vit-large-patch14-336
  Zero-Shot Image Classification • Updated • 5.28M • 281
- openai/clip-vit-base-patch32
  Zero-Shot Image Classification • Updated • 19.1M • 823 (usage sketch below)
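The two openai/clip-vit-* checkpoints in this collection are tagged for zero-shot image classification. A minimal sketch of that usage with the Hugging Face transformers CLIP classes follows; the checkpoint name comes from the listing above, while the example image URL and candidate labels are illustrative placeholders, not part of the collection.

```python
# Zero-shot image classification sketch with a CLIP checkpoint from this collection.
# Requires: pip install transformers pillow requests
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # or "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Any RGB image works; this URL is only an example placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity scores, normalized to probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```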

- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 23
- MoAI: Mixture of All Intelligence for Large Language and Vision Models
  Paper • 2403.07508 • Published • 77
- DragAnything: Motion Control for Anything using Entity Representation
  Paper • 2403.07420 • Published • 15
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 33

- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29
- VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
  Paper • 2403.09530 • Published • 10
- VidToMe: Video Token Merging for Zero-Shot Video Editing
  Paper • 2312.10656 • Published • 11
- TC4D: Trajectory-Conditioned Text-to-4D Generation
  Paper • 2403.17920 • Published • 18

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 14
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 60
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 48

- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
  Paper • 2403.11481 • Published • 13
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 29

- Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
  Paper • 2403.09626 • Published • 16
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
  Paper • 2403.10517 • Published • 37
- VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
  Paper • 2403.13501 • Published • 9
- LITA: Language Instructed Temporal-Localization Assistant
  Paper • 2403.19046 • Published • 19

- Video as the New Language for Real-World Decision Making
  Paper • 2402.17139 • Published • 21
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Paper • 2310.19512 • Published • 16
- VideoMamba: State Space Model for Efficient Video Understanding
  Paper • 2403.06977 • Published • 30
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  Paper • 2401.09047 • Published • 14

- PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
  Paper • 2312.13964 • Published • 20
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
  Paper • 2312.11514 • Published • 260
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
  Paper • 2312.12491 • Published • 74
- LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
  Paper • 2401.02330 • Published • 18