A Review of Video Evaluation Metrics

Video generative models have been rapidly improving recently, but how do we evaluate them efficiently and effectively? In this blog post, we review the existing evaluation metrics and highlight their pros and cons.

Introduction

Video evaluation metrics fall into two categories: 1) set-to-set comparison metrics and 2) unary metrics.

Video generative models have been booming recently with the advent of powerful deep learning architectures and large-scale video datasets. However, evaluating the quality of generated videos remains a challenging task. The lack of a robust and reliable metric makes it difficult to assess the performance of video generative models quantitatively.

Arguably, the ultimate goal of a video evaluation metric is to align with human judgment: the desideratum is for generative models to create videos that meet our aesthetic standards. To demonstrate the quality of a new model, human subjects typically rate its generated samples against an existing baseline. Subjects are usually presented with pairs of generated clips from two different video models and asked to indicate which of the two they prefer with regard to a specific evaluation criterion. Depending on the study, the ratings can either purely reflect the subject's personal preference, or they can refer to specific aspects of the video such as temporal consistency and adherence to the prompt. Humans are very good at judging what looks natural and at identifying small temporal inconsistencies. However, the downsides of human ratings, as in every other human-in-the-loop machine learning task, are poor scalability and high cost. For this reason, it is important to develop automated evaluation metrics for model development and related purposes. Human studies can not only be used to evaluate model performance but also to measure how well the automated metrics align with human preferences: specifically, they can statistically evaluate whether human judgments agree with metric-given rankings when assessing the same videos.

Video evaluation metrics can be categorized into two types: 1) set-to-set comparison metrics and 2) unary metrics. The first type measures the difference between the generated set of data and a reference dataset, typically using statistical measures such as the Fréchet distance. The second type, unary metrics, does not require a reference set, which makes it suitable for video generation in the wild or video editing, where a gold-standard reference is absent.

Below, we elaborate on the most commonly used video evaluation metrics and provide a quantitative comparison of these metrics on the TikTok dataset.

Set-to-set Comparison Metrics

Set-to-set metrics evaluate the disparity between a generated dataset and a reference dataset, usually within the feature space.

Fréchet Inception Distance (FID) was originally proposed to measure the similarity between the output distribution of an image generative model and its training data. Generated images are first passed through a pre-trained Inception Net to extract features, which are then used to calculate the Fréchet distance between the real and synthetic data distributions. It has been extended to the video domain by computing the FID between the features of individual frames in the generated and reference videos. However, as one could imagine, this metric does not consider the temporal coherence between frames.
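To make this concrete, here is a minimal NumPy/SciPy sketch of the Fréchet distance between Gaussians fitted to two sets of features. The names `real_feats` and `gen_feats` are hypothetical (N, D) arrays of backbone features (Inception activations for FID, or per-frame features when FID is applied to videos); this is an illustrative sketch, not the reference implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```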

Fréchet Video Distance (FVD) has been proposed as an extension of FID to the video domain: the Inception backbone is replaced by a 3D ConvNet (I3D) pre-trained on action recognition in YouTube videos. The authors show that FVD is sensitive not only to spatial degradation (different kinds of noise) but also to temporal aberrations such as the swapping of video frames. Kernel Video Distance (KVD), proposed in the same work, is an alternative that replaces the Fréchet distance with a kernel two-sample statistic (a polynomial-kernel Maximum Mean Discrepancy) computed on the same I3D features. However, FVD was found to align better with human judgments than KVD. Nevertheless, both are commonly reported as benchmark metrics for unconditional video generation.
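For illustration, below is a hedged sketch of a polynomial-kernel MMD estimate in the spirit of KVD; the exact kernel and estimator used in the original work may differ. `real_feats` and `gen_feats` are assumed to be (N, D) arrays of I3D features.

```python
import numpy as np

def poly_kernel(x: np.ndarray, y: np.ndarray, degree: int = 3) -> np.ndarray:
    """Cubic polynomial kernel, a common choice for kernel-based distances."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** degree

def kernel_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Unbiased polynomial-kernel MMD^2 estimate between two feature sets."""
    k_rr = poly_kernel(real_feats, real_feats)
    k_gg = poly_kernel(gen_feats, gen_feats)
    k_rg = poly_kernel(real_feats, gen_feats)
    m, n = len(real_feats), len(gen_feats)
    # Exclude diagonal terms of the within-set kernels for the unbiased estimate.
    mmd2 = ((k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
            + (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
            - 2.0 * k_rg.mean())
    return float(mmd2)
```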

Fréchet Video Motion Distance (FVMD) is a metric focused on temporal consistency: it measures the similarity between the motion features of generated and reference videos using the Fréchet distance. It first tracks keypoints with the pre-trained PIPs++ model, then computes velocity and acceleration fields for each frame, aggregates these features into statistical histograms, and measures the difference between the histograms with the Fréchet distance. The underlying assumption is that plausible motion should follow physical laws and avoid abrupt changes in speed and acceleration. A rough sketch of these motion features is given below.
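The sketch below illustrates how velocity and acceleration fields can be derived from keypoint trajectories and binned into histograms. Here `tracks` is a hypothetical (T, K, 2) array of K keypoints tracked over T frames (e.g., by a point tracker such as PIPs++); the actual FVMD implementation may bin and normalize differently.

```python
import numpy as np

def motion_fields(tracks: np.ndarray):
    """Velocity and acceleration fields from (T, K, 2) keypoint trajectories."""
    velocity = np.diff(tracks, n=1, axis=0)        # (T-1, K, 2) per-frame displacement
    acceleration = np.diff(velocity, n=1, axis=0)  # (T-2, K, 2) change in velocity
    return velocity, acceleration

def motion_histogram(field: np.ndarray, bins: int = 16) -> np.ndarray:
    """Aggregate the magnitudes of a motion field into a normalized histogram,
    which can then be compared across videos with the Fréchet distance."""
    magnitudes = np.linalg.norm(field, axis=-1).ravel()
    hist, _ = np.histogram(magnitudes, bins=bins, range=(0.0, magnitudes.max() + 1e-8))
    return hist / max(hist.sum(), 1)
```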

In addition to these modern video-based metrics, the traditional Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are image-level metrics used for video quality assessment. Specifically, SSIM characterizes the brightness, contrast, and structural attributes of the reference and generated frames, while PSNR quantifies the ratio of the peak signal power to the Mean Squared Error (MSE). Originally used for image-level tasks such as super-resolution and inpainting, these metrics are nonetheless repurposed for video evaluation. Unlike the aforementioned methods, PSNR and SSIM do not need pre-trained models; however, they are computed frame by frame and do not consider the temporal coherence between frames, which is crucial for video generation tasks.
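For reference, PSNR follows directly from its definition; the sketch below assumes `ref` and `gen` are uint8 frames of identical shape, and leaves SSIM to an existing implementation such as skimage.metrics.structural_similarity.

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```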

Unary Metrics

Unary metrics assess the quality of given video samples without the need for a reference set, making them ideal for applications such as video generation in the wild or video editing where a gold-standard reference is unavailable.

VBench proposes a comprehensive set of fine-grained video evaluation metrics to assess temporal and frame-wise video quality, as well as video-text consistency in terms of semantics and style. It employs a number of pre-trained models (e.g., RAFT for dynamic degree and MUSIQ for imaging quality) along with heuristic algorithms (e.g., motion smoothness and temporal flickering scores based on inter-frame interpolation and reconstruction error). The overall score is a weighted sum of the fine-grained metrics, and the authors also conduct human studies to validate the effectiveness of these metrics.
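As a rough illustration of the interpolation-and-reconstruction idea mentioned above (not VBench's actual implementation), one can reconstruct each interior frame from its neighbors and score a video by the reconstruction error; `frames` is a hypothetical (T, H, W, C) float array with values in [0, 1].

```python
import numpy as np

def flicker_proxy(frames: np.ndarray) -> float:
    """Heuristic smoothness score: reconstruct each interior frame as the
    average of its neighbors and penalize the reconstruction error."""
    interpolated = 0.5 * (frames[:-2] + frames[2:])      # linear interpolation of neighbors
    error = np.abs(interpolated - frames[1:-1]).mean()   # mean reconstruction error
    return float(1.0 - error)  # higher = smoother, less flickering
```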

For text-to-video generation tasks, CLIP cosine similarity is often used to measure the consistency between text prompts and video frames. CLIP is a family of dual-encoder models, pairing an image encoder (typically a vision transformer) with a text encoder, that map images and text into a shared embedding space. During training, a contrastive objective pulls embedded images toward their associated text captions, so that visual concepts end up represented close to the words that describe them. The similarity between text and image CLIP embeddings is measured with cosine similarity, where values close to 1 indicate closely matching concepts and lower values indicate unrelated content. To determine how well a video sequence adheres to the text prompt, the average similarity between each video frame and the text prompt is calculated (prompt consistency). Temporal coherence can be assessed by computing the mean CLIP similarity between adjacent video frames (frame consistency). In video editing tasks, the percentage of edited frames with a higher prompt consistency score than their original counterparts is also reported (frame accuracy).
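Below is a minimal sketch of prompt and frame consistency using a Hugging Face transformers CLIP checkpoint; the checkpoint name and the way frames are sampled are assumptions, and papers differ in which CLIP variant they use.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_consistency(frames, prompt: str):
    """frames: list of PIL images sampled from one generated video."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    prompt_consistency = (image_emb @ text_emb.T).mean()                  # frame-to-text similarity
    frame_consistency = (image_emb[:-1] * image_emb[1:]).sum(-1).mean()   # adjacent-frame similarity
    return prompt_consistency.item(), frame_consistency.item()
```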

For generative models trained on video data with categorical labels, the Inception Score (IS) is a widely used metric. Like FID, IS was originally proposed for image generation tasks: an Inception Net classifier pre-trained on the ImageNet dataset first predicts the class labels of each generated image, and the IS score is then the exponential of the average Kullback-Leibler divergence between the conditional class probability distribution $p(y|x)$ and the marginal class distribution $p(y)$ of the generated samples, where $y$ is the discrete label and $x$ is the generated image. It has been generalized to the video domain, specifically for the UCF101 dataset, where a pre-trained action recognition classifier (C3D) is used to compute the score. In practice, however, this metric is highly specific to UCF101 and is hardly applicable to videos in the wild, since such content is difficult for the classifier to categorize.
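The score itself is straightforward once class probabilities are available. The sketch below assumes `probs` is a hypothetical (N, C) array of predicted class probabilities from the pre-trained classifier (Inception for images, an action-recognition model such as C3D for videos) and omits the common practice of averaging over several splits.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp(mean KL(p(y|x) || p(y))) over N generated samples."""
    marginal = probs.mean(axis=0, keepdims=True)                                # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)   # KL(p(y|x) || p(y))
    return float(np.exp(kl.mean()))
```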

Comparison on TikTok Dataset

Let’s see how these evaluation metrics work in real life! We adopt a generic setup without using text prompts or discrete labels in the video generation task. We use the TikTok dataset to provide a quantitative comparison of various video evaluation metrics.

Specifically, we generate 50 videos using five different checkpoints, named (a) through (e), and measure their performance using the FVD, FID, VBench, and FVMD metrics. The video samples are reproduced from the following models: (a) is from Magic Animate; (b), (c), and (e) are from Animate Anyone, each with different training hyperparameters; and (d) is from DisCo. We do not use the CLIP score or IS in this comparison, as they are not suitable for our setup (there are no text prompts or categorical labels). The models (a) to (e) are sorted from worst to best visual quality based on human ratings collected through a user study (model (e) has the best visual quality and model (a) the worst). We can then compare how well each evaluation metric aligns with human judgments.

We evaluate video samples created by various video generative models trained on the TikTok dataset to compare the fidelity of different evaluation metrics.

We put together videos generated by the different models, which clearly differ in visual quality. Models (a), (b), and (c) produce videos with incomplete human shapes and unnatural motions. Model (d) yields better visual quality, but the motion is still not smooth, resulting in a lot of flickering. In comparison, model (e) generates videos with both better visual quality and better motion consistency. Disclaimer: these video samples are nowhere near perfect; however, they are sufficient for comparing different evaluation metrics.

Quantitative Results.

Metrics   Model (a)       Model (b)       Model (c)       Model (d)       Model (e)      Human Corr.↑
FID↓      73.20   (3rd)   79.35   (4th)   63.15   (2nd)   89.57   (5th)   18.94  (1st)   0.3
FVD↓      405.26  (4th)   468.50  (5th)   247.37  (2nd)   358.17  (3rd)   147.90 (1st)   0.8
VBench↑   0.7430  (5th)   0.7556  (4th)   0.7841  (2nd)   0.7711  (3rd)   0.8244 (1st)   0.9
FVMD↓     7765.91 (5th)   3178.80 (4th)   2376.00 (3rd)   1677.84 (2nd)   926.55 (1st)   1.0

In this table, we show the raw scores given by each metric; FVD, FID, and FVMD are lower-is-better metrics, while VBench is higher-is-better. The scores are computed by comparing the set of generated videos (as shown in the video above) to a set of reference videos. In parentheses we report each model's ranking under the corresponding metric, and the last column gives the rank correlation between that metric's ranking and the human ratings, where a higher value indicates better alignment with human judgments.
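For reference, the Human Corr. column can be reproduced as the Spearman rank correlation between each metric's model ranking and the human ranking (1 = best, 5 = worst); assuming that definition, the rankings copied from the table recover the reported values.

```python
from scipy.stats import spearmanr

human_rank  = [5, 4, 3, 2, 1]   # models (a)-(e), from worst to best per the user study
fid_rank    = [3, 4, 2, 5, 1]
fvd_rank    = [4, 5, 2, 3, 1]
vbench_rank = [5, 4, 2, 3, 1]
fvmd_rank   = [5, 4, 3, 2, 1]

for name, rank in [("FID", fid_rank), ("FVD", fvd_rank),
                   ("VBench", vbench_rank), ("FVMD", fvmd_rank)]:
    rho, _ = spearmanr(human_rank, rank)
    print(f"{name}: {rho:.1f}")   # prints 0.3, 0.8, 0.9, 1.0
```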

We can see the ambiguity of some evaluation metrics. Model (a), which has the poorest quality, cannot be effectively distinguished from models (b-d) by the FID or VBench metrics. Additionally, model (c) is mistakenly ranked higher than model (d) by all metrics except FVMD. In particular, VBench gives very close scores to models (a-d) despite their clearly different visual quality, which is inconsistent with human judgments. FVMD, on the other hand, ranks the models exactly in line with human ratings. Moreover, FVMD assigns clearly separated scores to video samples of different quality, showing a sharper distinction between models. This suggests that FVMD is a promising metric for evaluating video generative models, especially when motion consistency is a concern.

Frames Comparison.
We also present visualizations of video frames from one randomly selected scene to further compare the metrics' fidelity.


Summary

We review the video evaluation metrics used to assess video generative models. These metrics can be categorized into two types: set-to-set comparison metrics (FID, FVD, KVD, FVMD, PSNR, and SSIM) and unary metrics (VBench, CLIP score, and IS). We discuss the pros and cons of each type and provide a detailed comparison using the TikTok dataset. The results show that the FVMD metric aligns better with human judgments than other metrics, especially for assessing motion consistency. This suggests that FVMD is a promising metric for evaluating video generative models.

Wonder why FVMD performs so much better than other metrics? Check out the second part of our blog post to find out more! We will delve into the details of the FVMD metric and explain why it is more effective in assessing video quality and motion consistency.