StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Abstract

Existing video editing methods are workable, but they often require many costly iterations and still struggle to deliver high-quality, satisfying results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation.

To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions.

Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality.

The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

Video Gallery

Short-Video Editing

Source Edited

Change the dog into a seal.

Source Edited

Replace the man with a spiderman.

Source Edited

Change the color of the car to pink.

Source Edited

Remove the toy from the scene.

Source Edited

Replace the dog with a horse.

Source Edited

Change the bus to a jeep.

Source Edited

Change the red sports car to a jeep.

Source Edited

Replace the man with a panda.

Source Edited

Add sunglasses to the dog.

Source Edited

Change the city bus to a glass bus.

Source Edited

Replace the man with a robot.

Source Edited

Replace the swan with a paper boat.

Source Edited

Replace the crow with an owl.

Source Edited

Change the helicopter's color to fuchsia.

Source Edited

Replace the jacket with a dragon robe.

Long-Video Editing

Source Edited

Replace the tiger with an elephant.

Source Edited

Change the coffee's color to light brown.

Source Edited

Replace the monkey with a red panda.

Source Edited

Change the color of the dress to red.

Source Edited

Change the thirsty dog to a goat.

Source Edited

Replace the dancer with a brown horse.

Method

The Paradigm Shift: Data-to-Data → Noise-to-Data

Conventional training-free editing follows a data → data paradigm: inversion-based methods perform source → noise → target transfer, while inversion-free methods perform direct source → target transfer. Both approaches are tightly coupled to iterative procedures and cannot fully exploit fast few-step sampling.

We propose viewing video editing as source-conditioned noise → target generation. This reformulation preserves the few-step sampling capability of streaming generators while injecting source-video conditions for controllability. Built on pre-trained streaming models (Self Forcing, LongLive), StreamGVE realizes this paradigm through dual-branch fast sampling and attention manipulation.

**Overview of the StreamGVE framework.** (a) The generation process of dual-branch fast sampling with source-oriented guidance. (b) Self-attention bridge (S.A.B.) for source condition injection and cross-attention grounding for editing-related region identification.

Technical Details

Dual-Branch Fast Sampling

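We extend stochastic few-step sampling to a dual-branch framework. Given a source video with its source prompt and a target prompt, we denoise a source branch and a target branch in parallel using the same noise $\epsilon_{t_i}$ at each step, where $x_0^{\text{src}}$ is the source latent and $z_{t_i}^{\text{tgt}}$ is the current estimate of the clean target: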

$$x_{t_i}^{\text{src}} = (1-t_i)x_0^{\text{src}} + t_i\epsilon_{t_i}, \quad x_{t_i}^{\text{tgt}} = (1-t_i)z_{t_i}^{\text{tgt}} + t_i\epsilon_{t_i}, \quad \epsilon_{t_i} \sim \mathcal{N}(0,1)$$

This allows injecting source conditions at every denoising step:

$$v_{t_i}^{\text{src}} = v_\theta(x_{t_i}^{\text{src}}, t_i), \quad v_{t_i}^{\text{tgt}} = v_\theta^*(x_{t_i}^{\text{tgt}}, t_i \mid x_{t_i}^{\text{src}}), \quad i \in [1,N]$$
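The shared-noise construction can be illustrated with a small NumPy sketch. It covers only the noising step, not the velocity network. The array shapes, the zero initialisation of the target estimate, and the function name `noise_both_branches` are assumptions made for illustration and are not taken from the StreamGVE code.

```python
import numpy as np

def noise_both_branches(x0_src, z_tgt, t, rng):
    """Noise the source latent and the current target estimate with the SAME epsilon."""
    eps = rng.standard_normal(x0_src.shape)   # one noise sample shared by both branches
    x_src = (1.0 - t) * x0_src + t * eps      # source branch input
    x_tgt = (1.0 - t) * z_tgt + t * eps       # target branch input
    return x_src, x_tgt, eps

rng = np.random.default_rng(0)
x0_src = rng.standard_normal((4, 8, 8))       # toy source latent, shape (C, H, W)
z_tgt = np.zeros_like(x0_src)                 # assumed initial target estimate

# At t = 1 both branches start from identical pure noise,
# so the initial target estimate has no influence there.
x_src, x_tgt, eps = noise_both_branches(x0_src, z_tgt, 1.0, rng)
```

Because the noise is shared, the two branch inputs differ only by $(1-t_i)(x_0^{\text{src}} - z_{t_i}^{\text{tgt}})$, which keeps the branches aligned wherever the target estimate agrees with the source.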

Self-Attention Bridge

We build a self-attention bridge between source and target branches to enable source condition injection. It consists of three components:

Query Blending: Structure and Motion Preservation

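Query blending uses a timestep-dependent blending ratio to preserve the structure and motion of the source. The blending ratio $r_{t_i}$ increases toward 1 as the timestep decreases, so the source query dominates at high noise levels and the target query dominates near the end of sampling; $\rho$ controls the preservation strength: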

$$\tilde{Q}_{\text{SA}}^{\text{tgt}} = r_{t_i} Q_{\text{SA}}^{\text{tgt}} + (1 - r_{t_i}) Q_{\text{SA}}^{\text{src}}, \quad r_{t_i} = 1 - t_{i-1}^{\rho}$$
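As a worked example, the snippet below evaluates the ratio $r_{t_i} = 1 - t_{i-1}^{\rho}$ and applies it to toy query vectors. The four-step timestep schedule and the value $\rho = 2$ are hypothetical choices for illustration, not the settings used by StreamGVE.

```python
def blend_ratio(t_prev, rho):
    """r_{t_i} = 1 - t_{i-1}^rho, where t_prev is the next (lower) timestep."""
    return 1.0 - t_prev ** rho

def blend(tgt, src, r):
    """Element-wise r * tgt + (1 - r) * src for two equal-length vectors."""
    return [r * a + (1.0 - r) * b for a, b in zip(tgt, src)]

timesteps = [0.0, 0.25, 0.5, 0.75, 1.0]   # t_0 .. t_N, hypothetical schedule
rho = 2.0                                  # hypothetical preservation strength

# Sampling runs from i = N down to i = 1.
ratios = {i: blend_ratio(timesteps[i - 1], rho)
          for i in range(len(timesteps) - 1, 0, -1)}

q_tgt, q_src = [1.0, 1.0], [0.0, 0.0]      # toy query vectors
q_first = blend(q_tgt, q_src, ratios[4])   # first step: mostly source query
q_last = blend(q_tgt, q_src, ratios[1])    # last step: pure target query
```

For any timestep strictly between 0 and 1, a larger $\rho$ shrinks $t^{\rho}$ and therefore raises $r_{t_i}$, which shifts weight from the source query to the target query.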

Key Blending: Editing Effectiveness and Consistency

Key blending stabilizes attention for editing consistency. We blend the current keys and apply masked blending to the previous-frame keys:

$$\tilde{K}_{\text{SA}}^{\text{tgt}} = r_{t_i} K_{\text{SA}}^{\text{tgt}} + (1 - r_{t_i}) K_{\text{SA}}^{\text{src}}$$

$$\tilde{K}_{\text{prev}}^{\text{tgt}} = M_{\text{prev}} \odot (r_{t_i} K_{\text{prev}}^{\text{tgt}} + (1 - r_{t_i}) K_{\text{prev}}^{\text{src}}) + (1 - M_{\text{prev}}) \odot K_{\text{prev}}^{\text{tgt}}$$
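The two key-blending rules can be sketched in NumPy as follows. Keys are assumed to be stored as `(tokens, dim)` arrays and the mask as a `(tokens, 1)` binary column; the function names and toy values are illustrative only.

```python
import numpy as np

def blend_keys(k_tgt, k_src, r):
    """Current-frame keys: r * K_tgt + (1 - r) * K_src."""
    return r * k_tgt + (1.0 - r) * k_src

def blend_prev_keys(k_prev_tgt, k_prev_src, mask_prev, r):
    """Previous-frame keys: blend only where mask_prev == 1, keep target keys elsewhere."""
    blended = blend_keys(k_prev_tgt, k_prev_src, r)
    return mask_prev * blended + (1.0 - mask_prev) * k_prev_tgt

k_prev_tgt = np.ones((4, 2))                        # toy target keys
k_prev_src = np.zeros((4, 2))                       # toy source keys
mask_prev = np.array([[1.0], [0.0], [1.0], [0.0]])  # toy mask over 4 tokens

k_prev_blended = blend_prev_keys(k_prev_tgt, k_prev_src, mask_prev, r=0.25)
```

In this toy case the masked tokens (rows 0 and 2) move to the blended value while the unmasked tokens keep their original target keys.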

Source KV Injection: Background Preservation

Source KV injection provides background details when $t < t^{\text{inj}}$ (where $t^{\text{inj}} = 0.5$). The Iverson bracket $[t < t^{\text{inj}}]$ equals 1 when the condition holds and 0 otherwise, so the source term is excluded when $t \ge t^{\text{inj}}$. The final self-attention uses:

$$Q = \tilde{Q}_{\text{SA}}^{\text{tgt}}, \quad K = [\tilde{K}_{\text{SA}}^{\text{tgt}}, \tilde{K}_{\text{prev}}^{\text{tgt}}, [t < t^{\text{inj}}] \cdot (M_{\text{curr}} \odot K_{\text{SA}}^{\text{src}})]$$

$$V = [V_{\text{SA}}^{\text{tgt}}, V_{\text{prev}}^{\text{tgt}}, [t < t^{\text{inj}}] \cdot (M_{\text{curr}} \odot V_{\text{SA}}^{\text{src}})]$$
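A minimal NumPy sketch of how the key and value sets might be assembled is given below. It makes several assumptions: "excluding" the source term is read as omitting those tokens from the concatenation rather than appending zeros, the mask is applied by element-wise multiplication exactly as the formula is written, and attention is standard scaled dot-product attention. All names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bridged_self_attention(q, k_cur, k_prev, k_src, v_cur, v_prev, v_src,
                           mask_curr, t, t_inj=0.5):
    """Attend over current, previous-frame, and (only if t < t_inj) masked source tokens."""
    keys, values = [k_cur, k_prev], [v_cur, v_prev]
    if t < t_inj:                              # Iverson bracket [t < t_inj] equals 1
        keys.append(mask_curr * k_src)
        values.append(mask_curr * v_src)
    k = np.concatenate(keys, axis=0)
    v = np.concatenate(values, axis=0)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, k.shape[0]

rng = np.random.default_rng(0)
L, d = 6, 4                                    # toy token count and head dimension
q, k_cur, k_prev, k_src, v_cur, v_prev, v_src = (
    rng.standard_normal((L, d)) for _ in range(7))
mask_curr = np.array([1, 1, 0, 0, 1, 0], dtype=float)[:, None]  # toy mask

out_early, n_early = bridged_self_attention(
    q, k_cur, k_prev, k_src, v_cur, v_prev, v_src, mask_curr, t=0.75)
out_late, n_late = bridged_self_attention(
    q, k_cur, k_prev, k_src, v_cur, v_prev, v_src, mask_curr, t=0.25)
```

The output keeps the query's shape in both cases; only the number of attended tokens changes once the timestep drops below the injection threshold.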

Cross-Attention Grounding and Boosting

Grounding: Editing-Related Mask

Grounding identifies editing regions via foreground-background attention differences. We compute the mask from the difference between averaged attentions over trigger words $\mathcal{T}$ and other words in the prompt $\mathcal{P}$, and use the Heaviside step function $H$ for binarization:

$$M^{\phi} = H\left( \frac{\sum_{p \in \mathcal{T}} A_{\text{CA}}^{\phi}(p, Q_{\text{CA}}^{\phi})}{|\mathcal{T}|} - \frac{\sum_{p \in \mathcal{P} \setminus \mathcal{T}} A_{\text{CA}}^{\phi}(p, Q_{\text{CA}}^{\phi})}{|\mathcal{P} \setminus \mathcal{T}|} \right)$$

where $\phi \in \{\text{src}, \text{tgt}\}$ denotes the branch. We obtain $M^{\text{src}}$ at $t=0$ and $M^{\text{tgt}}$ at $t=t^{\text{inj}}$, then compute the union mask $M = \bigcup_{\phi} M^{\phi}$.
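The grounding rule can be traced on a toy attention map. In the sketch below, each cross-attention map is assumed to be a `(prompt tokens, query positions)` array, and the Heaviside step is taken as strictly positive (a difference of exactly 0 maps to 0, which the text does not specify). The attention values and trigger indices are invented for illustration.

```python
import numpy as np

def grounding_mask(attn, trigger_idx):
    """Binary mask: 1 where trigger words receive more average attention than other words."""
    is_trigger = np.zeros(attn.shape[0], dtype=bool)
    is_trigger[list(trigger_idx)] = True
    trigger_mean = attn[is_trigger].mean(axis=0)    # average over trigger words T
    other_mean = attn[~is_trigger].mean(axis=0)     # average over P \ T
    return (trigger_mean - other_mean > 0).astype(np.uint8)

# Toy maps: 3 prompt tokens (rows) x 3 query positions (columns).
attn_src = np.array([[0.7, 0.1, 0.2],
                     [0.2, 0.5, 0.3],
                     [0.1, 0.4, 0.5]])
attn_tgt = np.array([[0.2, 0.1, 0.6],
                     [0.4, 0.5, 0.2],
                     [0.4, 0.4, 0.2]])

m_src = grounding_mask(attn_src, trigger_idx=[0])
m_tgt = grounding_mask(attn_tgt, trigger_idx=[0])
m_union = np.maximum(m_src, m_tgt)                  # union of the two binary masks
```

Here the source mask marks the first position, the target mask marks the last, and the union covers both, so an edit that moves or resizes the object is still covered.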

Boosting: Regional Editing Enhancement

Boosting enhances trigger words inside editing regions. We modify the attention scores with weight $w_{p,q}$ controlled by $\omega$:

$$A_{\text{CA}}^{\text{tgt}}(p,q) = \frac{\exp(S(p,q) + \ln w_{p,q})}{\sum_{j \in \mathcal{P}} \exp(S(j,q) + \ln w_{j,q})}, \quad w_{p,q} = \begin{cases} \omega & \text{if } p \in \mathcal{T} \text{ and } M(q)=1 \\ 1 & \text{otherwise} \end{cases}$$

where $S(p,q)$ is the pre-softmax attention score, and $\omega$ controls editing strength.
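Adding $\ln w_{p,q}$ to a score before the softmax is equivalent to multiplying the unnormalised weight $\exp(S(p,q))$ by $w_{p,q}$. The sketch below shows this for a single query position, normalising over the prompt tokens; the scores, the trigger set, and $\omega = 2$ are toy values chosen for illustration.

```python
import math

def boosted_attention(scores, trigger_idx, in_mask, omega):
    """Softmax over prompt tokens for one query position, with trigger tokens
    up-weighted by omega when the query lies inside the editing mask."""
    weights = [omega if (p in trigger_idx and in_mask) else 1.0
               for p in range(len(scores))]
    logits = [s + math.log(w) for s, w in zip(scores, weights)]
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0, 0.0, 0.0]                             # toy pre-softmax scores, 3 prompt tokens
inside = boosted_attention(scores, trigger_idx={0}, in_mask=True, omega=2.0)
outside = boosted_attention(scores, trigger_idx={0}, in_mask=False, omega=2.0)
```

With equal scores, the trigger token's share rises from one third to one half inside the mask, while attention outside the mask is left unchanged.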

Source-Oriented Guidance

To suppress stochastic artifacts, we compute the velocity prediction error of the source branch by comparing its prediction against the linear-interpolation ground truth:

$$v_{t_i}^{\text{gt}} = \epsilon_{t_i} - x_0^{\text{src}}, \quad g_{t_i} = v_{t_i}^{\text{gt}} - v_{t_i}^{\text{src}}$$

This error is projected onto the target branch through the soft mask AMN(·) (Abs-Mean-Norm: the channel-wise mean of absolute values, followed by min-max normalization) to correct the target velocity:

$$\tilde{v}_{t_i}^{\text{tgt}} = v_{t_i}^{\text{tgt}} + \text{AMN}(v_{t_i}^{\text{tgt}} - v_{t_i}^{\text{src}}) \odot g_{t_i}, \quad z_{t_{i-1}}^{\text{tgt}} = x_{t_i}^{\text{tgt}} - t_i \tilde{v}_{t_i}^{\text{tgt}}$$
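The guidance step can be sketched in NumPy as below. The sketch assumes latents of shape `(C, H, W)`, reads "channel-wise mean" as a mean over the channel axis that yields one spatial map, and adds a small constant to the min-max denominator to avoid division by zero; these choices and the function names are assumptions, not details confirmed by the text.

```python
import numpy as np

def abs_mean_norm(x, tiny=1e-8):
    """AMN: mean of |x| over channels, then min-max normalised to [0, 1]."""
    m = np.abs(x).mean(axis=0, keepdims=True)        # shape (1, H, W)
    return (m - m.min()) / (m.max() - m.min() + tiny)

def correct_velocity(v_tgt, v_src, x0_src, noise):
    """Add the source-branch velocity error to the target velocity, weighted by AMN."""
    v_gt = noise - x0_src                            # ground-truth velocity of the source
    g = v_gt - v_src                                 # source-branch prediction error
    return v_tgt + abs_mean_norm(v_tgt - v_src) * g

def guided_step(x_tgt, v_tgt, v_src, x0_src, noise, t):
    """Return the clean-target estimate z = x_tgt - t * corrected velocity."""
    return x_tgt - t * correct_velocity(v_tgt, v_src, x0_src, noise)

rng = np.random.default_rng(0)
x0_src = rng.standard_normal((4, 8, 8))
noise = rng.standard_normal((4, 8, 8))
t = 0.5
x_t = (1.0 - t) * x0_src + t * noise

v_src = (noise - x0_src) + 0.1 * rng.standard_normal((4, 8, 8))  # imperfect prediction
edit = np.zeros((4, 8, 8))
edit[:, 2:4, 2:4] = 1.0                              # toy region where the branches differ
v_tgt = v_src + edit

soft_mask = abs_mean_norm(v_tgt - v_src)
v_corr = correct_velocity(v_tgt, v_src, x0_src, noise)
z_next = guided_step(x_t, v_tgt, v_src, x0_src, noise, t)
```

In this toy setup the soft mask is close to 1 inside the region where the two branches differ and 0 elsewhere, so the correction is applied only there.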

Visual Prompting

For fine-grained visual control, users can provide an edited first frame as a visual prompt. The framework treats it as a previously generated video chunk, incorporating detailed editing effects into target generation. It requires only one additional forward pass of the streaming generator.
