
Video-R1: Reinforcing Video Reasoning in MLLMs - GitHub
Feb 23, 2025 · Video-R1 significantly outperforms previous models across most benchmarks. Notably, on VSI-Bench, which focuses on spatial reasoning in videos, Video-R1-7B achieves a new state-of-the-art accuracy of 35.8%, surpassing GPT-4o, a proprietary model, while using only 32 frames and 7B parameters. This highlights the necessity of explicit reasoning capability in …
[EMNLP 2024] Video-LLaVA: Learning United Visual ... - GitHub
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. If you like our project, please give us a star ⭐ on GitHub for the latest updates. 💡 I also have other video-language projects that may interest you. Open-Sora Plan: Open-Source Large Video Generation Model
yt-dlp/yt-dlp: A feature-rich command-line audio/video …
yt-dlp is a feature-rich command-line audio/video downloader with support for thousands of sites. The project is a fork of youtube-dl based on the now-inactive youtube-dlc.
Frontier Multimodal Foundation Models for Image and Video
Jan 21, 2025 · VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.
The swiss army knife of lossless video/audio editing - GitHub
The main feature is lossless trimming and cutting of video and audio files, which is great for saving space by rough-cutting your large video files taken from a video camera, GoPro, drone, etc. It lets you quickly extract the good parts from your videos and discard many gigabytes of data without doing a slow re-encode and thereby losing quality.
Let Them Talk: Audio-Driven Multi-Person Conversational Video
We propose MultiTalk, a novel framework for audio-driven multi-person conversational video generation. Given a multi-stream audio input, a reference image, and a prompt, MultiTalk generates a video containing interactions that follow the prompt, with consistent lip motions aligned with the audio.
HunyuanVideo: A Systematic Framework For Large Video ... - GitHub
Jan 13, 2025 · HunyuanVideo introduces the Transformer design and employs a Full Attention mechanism for unified image and video generation. Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to …
GitHub - deepbeepmeep/Wan2GP: A fast AI Video Generator for …
A fast AI Video Generator for the GPU Poor. Supports Wan 2.1/2.2, Hunyuan Video, LTX Video and Flux. - deepbeepmeep/Wan2GP
GitHub - huggingface/diffusers: 🤗 Diffusers: State-of …
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model …
Jun 3, 2024 · Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding This is the repo for the Video-LLaMA project, which is working on empowering large language models with video and audio understanding capabilities.