StreamGVE teaser showing text-driven and image-prompted video editing results
Training-free, text-driven (optionally image-conditioned) streaming video editing. StreamGVE supports both text-driven editing and first-frame-prompted editing, which makes it suitable for a variety of editing tasks while offering superior background preservation. Built on text-to-video streaming generation models, StreamGVE edits videos of any length efficiently, at less than 0.32 seconds per frame (5 sampling steps) on a single A100 GPU.

Abstract

Although existing video editing methods are generally workable, they often require many costly iterations and still struggle to deliver high-quality, satisfying edits. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation.

To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions.

Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality.

The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

Short-Video Editing

Long-Video Editing

Method

The Paradigm Shift: Data-to-Data → Noise-to-Data

Conventional training-free editing follows a data → data paradigm: inversion-based methods perform a source → noise → target transfer, while inversion-free methods perform a direct source → target transfer. Both are tightly coupled to iterative procedures and cannot fully exploit fast few-step sampling.
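To make the cost argument concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It counts model evaluations under simplifying assumptions: one model call per step, no classifier-free guidance, and no extra cost for running two branches. The function names and the 50-step setting are illustrative assumptions; only the 5-step setting comes from the teaser caption above.

```python
def evals_inversion_based(num_steps: int) -> int:
    """source -> noise (inversion) followed by noise -> target (sampling).

    Assumes one model call per inversion step and one per sampling step.
    """
    return 2 * num_steps


def evals_noise_to_data(num_steps: int) -> int:
    """Source-conditioned noise -> target: only the sampling pass is needed.

    Assumes one model call per sampling step.
    """
    return num_steps


# 50 steps is an assumed, illustrative setting for an iterative editor.
iterative_cost = evals_inversion_based(50)   # 100 model calls
# 5 steps is the few-step setting quoted in the teaser caption.
few_step_cost = evals_noise_to_data(5)       # 5 model calls

print(iterative_cost, few_step_cost)
```

Real costs depend on guidance, on how many branches are evaluated per step, and on the backbone, so the numbers only illustrate why removing the inversion pass and keeping few-step sampling matters.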

We propose viewing video editing as source-conditioned noise → target generation. This reformulation preserves the few-step sampling capability of streaming generators while injecting source-video conditions for controllability. Built on pre-trained streaming models (Self Forcing, LongLive), StreamGVE realizes this paradigm through dual-branch fast sampling and attention manipulation.

StreamGVE framework overview
Overview of the StreamGVE framework. (a) The generation process of dual-branch fast sampling with source-oriented guidance. (b) Self-attention bridge (S.A.B.) for source condition injection and cross-attention grounding for editing-related region identification.
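The caption names the two attention mechanisms but does not spell out their formulas, so the following NumPy sketch shows only one plausible reading. It assumes that the self-attention bridge lets target-branch queries attend over concatenated source and target keys/values, and that grounding thresholds the cross-attention weight of an edit-related text token. Both function names, the concatenation scheme, the min-max normalization, and the 0.5 threshold are assumptions for illustration, not the confirmed StreamGVE implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bridged_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src):
    """Assumed bridge: target queries attend over [source; target] keys/values."""
    k = np.concatenate([k_src, k_tgt], axis=0)            # (2N, d)
    v = np.concatenate([v_src, v_tgt], axis=0)            # (2N, d)
    scores = q_tgt @ k.T / np.sqrt(q_tgt.shape[-1])       # (N, 2N)
    return softmax(scores) @ v                            # (N, d)


def grounding_mask(q_tgt, k_text, edit_token_idx, threshold=0.5):
    """Assumed grounding: flag visual tokens that attend strongly to the edit word."""
    attn = softmax(q_tgt @ k_text.T / np.sqrt(q_tgt.shape[-1]))  # (N, T)
    a = attn[:, edit_token_idx]
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)        # min-max normalize to [0, 1]
    return a > threshold                                  # boolean mask of shape (N,)


# Toy usage with random features: 6 visual tokens, 3 text tokens, dimension 4.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
k_src, v_src = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
text_keys = rng.normal(size=(3, 4))

out = bridged_self_attention(q, k, v, k_src, v_src)
mask = grounding_mask(q, text_keys, edit_token_idx=1)
print(out.shape, mask.shape)
```

In this reading, the mask would separate editing-related regions from the background, so that source features dominate where nothing should change. How the two signals are actually combined is left to the paper.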

Results

StreamGVE achieves state-of-the-art performance on FiVE-Bench, a comprehensive benchmark spanning 100 videos and 420 editing cases across six edit types: color alteration, material modification, object substitution with/without non-rigid deformation, addition, and removal.

Comparison of video editing methods on FiVE-Bench. StreamGVE (Ours SF and Ours LL§) delivers consistently better results with minimal time cost.

Short-Video Editing

Comparison of different methods on short-video editing tasks.

Qualitative comparisons with baseline methods
Qualitative comparisons. StreamGVE (Ours SF and Ours LL§) demonstrates advantages over previous state-of-the-art methods.

Long-Video Editing

Comparison of different methods on long-video editing tasks.

Long video editing results
Long video editing comparisons. Results on videos with over 470 frames (16 FPS, 30 seconds) demonstrating streaming editing capability.
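As a rough sanity check on what streaming editing means for a clip of this length, the sketch below combines the figures quoted on this page: 16 FPS, 30 seconds, and the upper bound of 0.32 seconds per frame (5 steps, single A100). It assumes the per-frame cost stays constant over the whole video, which is an assumption rather than a reported result; the function names are illustrative.

```python
def frame_count(fps: int, duration_s: int) -> int:
    """Number of frames in a clip of the given length."""
    return fps * duration_s


def editing_time_upper_bound(frames: int, seconds_per_frame: float) -> float:
    """Total editing time if every frame costs at most `seconds_per_frame`."""
    return frames * seconds_per_frame


frames = frame_count(fps=16, duration_s=30)          # 480 frames
total_s = editing_time_upper_bound(frames, 0.32)     # upper bound in seconds
throughput = 1 / 0.32                                # lower bound in frames per second

print(f"{frames} frames, under {total_s:.1f} s total, over {throughput:.3f} frames/s")
```

Under these assumptions, a 30-second clip is edited in roughly two and a half minutes at most, about five times slower than real-time playback at 16 FPS.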

Citation

If you find this work useful, please consider citing:

TODO

Acknowledgements

We sincerely thank the open-source community for their awesome work, particularly Self Forcing, LongLive, and UniEdit-Flow. 😊

We would also like to thank the FiVE-Bench team for providing a comprehensive baseline survey and an excellent benchmark.