This is part two in a series of blog posts about flow matching. In this post, we dig deep into continuous normalizing flows as they form the basis for flow matching methods. We discuss the benefits but also the drawbacks of continuous normalizing flows.

This blog post is part 2 in a series covering flow matching. Check out part 1 of the series for the background on discrete normalizing flows.

In the normalizing flows setup, the transformation from the simple distribution to the data distribution is expressed as a finite composition of functions. We can intepret this as a discrete time process with \(K\) time steps. At each time step, there is a corresponding intermediary distribution. But how can we obtain a transformation from \(p\) to \(q\) in continuous time rather than discrete time? Imagine this as taking the composition of infinitely many functions. We can express this idea using Ordinary Differential Equations (ODE), the fundamental component of Continuous Normalizing Flows (CNF).

There is an even deeper connection between ODEs and residual flows that will lead us to continuous time flows. We can write the residual layer more generally as,

\[\mathbf{x}_{t+1} = \mathbf{x}_t + h u(\mathbf{x}_t),\]where \(h > 0\) is some constant and \(u\) is the neural network. First, observe that this equation looks like the Euler discretization of an ODE. Following the analogy, \(\mathbf{x}_t\) represents the current point we are at. To get to the point \(\mathbf{x}_{t+1}\) we move in the direction of the derivative, \(u(\mathbf{x}_t)\) with step size \(h\). In fact, if we rearrange this equation, we start to see something that resembles the definition of the derivative,

\[\frac{\mathbf{x}_{t+1} - \mathbf{x}_t}{h} = u(\mathbf{x}_t).\]If we take \(h \to 0\) and increase the number of layers \(t \to \infty\) we arrive at the following ODE:

\[\frac{d\mathbf{x}(t)}{dt} = u_t(\mathbf{x}(t)),\]where \(u_t\) is a time varying vector field that we parameterize with a neural network with parameters \(\theta\). This is called a Neural Ordinary Differential Equation. When we first introduced residual flows, it may have seemed strange to denote the layers with a time parameter \(t\). Now we know that residual layers are just a discretization of the continuous time dynamics of an ODE. Also, since we have represented residual flows in continuous time, each layer does not have its own parameters. Instead, the parameters are shared across time. Now, we are modeling the time varying vector field that transforms a distribution \(p\) to \(q\). There are a few main benefits that we gain from using Neural ODEs.

1) The Euler discretization method is very rudimentary. ODEs and numerical integration is a mature field and we have much better numerical integrators at our disposal. With CNFs, we can use faster and more accurate solvers to integrate the time varying vector field we model with a neural network. Residual flows required specifying the number of layers of the ResNet which we no longer need to do. ODE solvers can determine the discretization steps needed to obtain a certain error threshold.

2) Discrete Normalizing flows required computing the determinant of the Jacobian matrix which is an \(\mathcal{O}(d^3)\) operation. As we will see, CNFs allow us to perform the same operation with some numerical approximation in just \(\mathcal{O}(d)\) time.

To gain some intuition for flows and ODEs, consider a two dimensional vector field \(v(x,y)\) that describes the movement of water flowing along a river. For simplicity, assume it’s time-independent. The velocity of the water at point \((x,y)\) is the vector \(v(x,y)\). The path of a pebble thrown into the water at time \(t=0\) is a curve we can parameterize as a function of time:

\[\mathbf{r}(t) = \langle x(t), y(t) \rangle, \qquad \mathbf{r}(0) = \langle x(0), y(0) \rangle.\]We can solve for the position of the pebble at time \(t\) by making the following observation. At time \(t\), the velocity of the pebble, \(\frac{d\mathbf{r}(t)}{dt}\), is the same as the velocity of the water at the position of the pebble, \(\mathbf{r}(t)\). We can model this with the following ODE:

\[\frac{d\mathbf{r}(t)}{dt} = v(\mathbf{r}(t)) = v(x(t), y(t)), \qquad \mathbf{r}(0) = \langle x(0), y(0) \rangle.\]This example demonstrate how we can describe the movement of a particle induced by a vector field given some initial position. Specifically, we can construct a function \(\mathbf{r}(t)\) that describes the path taken by a single particle starting at a specific point in space at \(t=0\). As we will see, a flow in the context of CNFs is a more general object that represents the motion of all particles through time.

Let’s provide a more rigorous definition of a flow. Suppose we have a vector field \(u: \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d\). Unlike the example above, this is a time-dependent vector field and we will denote the time parameter as a subscript, \(u_t(x)\). In this setup, \(d\) is the dimension of our data space.

A flow, which is induced by the vector field \(u_t\), is a mapping \(\phi: \mathbb{R}^d \times [0,1] \to \mathbb{R}^d\) which satisfies the following ODE:

\[\frac{d\phi_t(\mathbf{x})}{dt} = u_t(\phi_t(\mathbf{x})),\]with initial condition \(\phi_0(\mathbf{x}) = \mathbf{x}\).

To gain a better intiution of what \(\phi\) represents we can compare it to \(\mathbf{r}(t)\). Given some initial point \(\mathbf{x_0}\), \(\mathbf{r}(t)\) is the position of that point at time \(t\) induced by the movement of water. Similarly, when we provide \(\mathbf{x_0}\) as input to \(\phi\), we will get the function \(\phi(t, \mathbf{x_0}): [0, 1] \to \mathbb{R}^d\) which is only a function of time. It parameterizes a curve in \(\mathbb{R}^d\) that represents the position of the point \(\mathbf{x_0}\) with time induced by the vector field \(u_t\). We can view \(\phi\) from another perspective. Given a specific point in time \(t_0 \in [0,1]\) as input to \(\phi\), we will obtain a function \(\phi(t_0, \mathbf{x}): \mathbb{R}^d \to \mathbb{R}^d\). This function maps all points at time \(t=0\) to the position they would be at time \(t=t_0\). Overall, the mapping \(\phi\) describes the movement of all points starting from time \(t=0\) to time \(t = 1\).For consistent notation, we will denote the time parameter as a subscript \(\phi_t\).

Another important object in CNFs is the probability density path \({p_t: \mathbb{R}^d \times [0,1] \to \mathbb{R}_{>0}}\). It is a time-dependent probability density function i.e. \(\int p_t(\mathbf{x})d\mathbf{x} = 1\). Similar to normalizing flows, we let \(p_0 = p\) be a simple distribution such as a canonical Gaussian. Then \(p_t\) is defined by a change of variables from \(p_0\) using mapping \(\phi_t\):

\[\begin{equation}\label{COV_CNF} p_t(\mathbf{x}) = p_0(\phi_t^{-1}(\mathbf{x}))\det \left| \frac{\partial \phi_t^{-1}}{\partial \mathbf{x}}(\mathbf{x}) \right|. \end{equation}\]With some regularity conditions on \(u_t\), we can gaurauntee that \(\phi_t\) is invertible. Therefore, a vector field generates a single unique probability density path. This also implies that the paths generated by the flow ODE are non-crossing which can be shown by simple contradiction. Suppose the paths of two different points do overlap at some point in time \(t \in [0,1]\). This means that two different points are mapped to the same point at time \(t\). But this would mean that \(\phi_t\) is not an invertible mapping.

In the setting of CNFs, we let \(p_1\) be the data distibution. The goal is to learn a vector field \(v_t\) which induces a flow \(\phi_t\). This flow is responsible for transforming the simple distribution \(p_0 = p\) at time \(t=0\) to the data distribution \(p_1 = q\) at time \(t=1\).

The training objective is the same as in normalizing flows. We maximize the log-likelihood of the data. Given a data point \(\mathbf{x_1} \in \mathbb{R}^d\), to compute \(\log p_1(\mathbf{x_1})\) we could use Equation \(\eqref{COV_CNF}\). However, as in normalizing flows, that would require computing the Jacobian which is an \(O(d^3)\) operation. A benefit of CNFs is that once we are in the continuous setting, there is an alternative method available so we don’t have to do this computation. The alternative method involves the continuity equation:

\[\begin{equation}\label{cont_eq} \frac{\partial}{\partial t}p_t(\mathbf{x}) + \nabla \cdot (p_t(\mathbf{x})u_t(\mathbf{x})) = 0. \end{equation}\]The continuity equation is a Partial Differential Equation (PDE) where \(\nabla \cdot\) represents the divergence operator. The divergence is computed with respect to the spatial dimensions \(\frac{\partial}{\partial x_i}\). The continuity equation provides a necassary and sufficient condition to ensure that a vector field \(u_t\) generates the probability density path \(p_t\). A key detail to note is that a given probability density path can have infinitely many vector fields that generate it. Although, a specific vector field generates only one unique probability density path.

The continuity equation can be derived using some basic vector calculus. It also has a nice physics interpretation. Let’s start by considering an arbitary volume \(V\) in \(\mathbb{R}^3\) for the purposes of visualization. The volume \(V\) is enclosed by the surface \(S\). By definition, \(p_t\) has to integrate to \(1\) over \(\mathbb{R}^3\). This is a key observation. It means that analagous to mass, the probability density \(p_t\) is a conserved quantity. It cannot appear or disappear out of thin air. Therefore, the change in probability density across the volume must equal the difference in probablity density that has entered the volume and the density that has exited the volume. To gain some physical intiution, imagine \(u_t\) as the vector field representing the flow of water through the volume \(V\). Let \(p_t\) be the mass of the water. The change in mass of the flowing water in the volume must be the difference in the mass of water entering and mass of water leaving. So, we can write the change in probability density as follows:

\[\frac{d}{dt}\iiint_V p_t dV.\]The triple integral is the total mass or probability density inside the volume. To measure the change, we take the derivative. Notice the only way for density to enter or leave the volume is through the surface \(S\). Now, let \(n: \mathbb{R}^3 \to \mathbb{R}^3\) represent the outward normal vector to \(S\) at point \((x,y,z)\). Consider an infinitesimally small part of the surface \(S\). The flow of density entering or leaving is the dot product of the normal \(n\) in that small region and the flow vector field \(u_t\). Then the amount of probability density entering or leaving the small region is \((u_t \cdot n)p_t\). Therefore, the change of probability density can also be represented as

\[\frac{d}{dt}\iiint_V p_t dV = - \iint_S (u_t \cdot n) p_t dS.\]We have a negative sign because any density leaving the volume means a negative rate of change of the probability density. Now we can apply Gauss’s divergence theorem:

\[- \iint_S (u_t \cdot n) p_t dS = - \iiint_V \nabla \cdot (p_tu_t) dV.\]We have written the surface integral as a volume integral. Then,

\[\frac{d}{dt}\iiint_V p_t dV = - \iiint_V \nabla \cdot (p_tu_t) dV.\]Moving everything to one side and simplfying we get,

\[\iiint_V \left[ \frac{d}{dt}p_t + \nabla \cdot (p_tu_t) \right] dV = 0.\]Since this is true for any arbitrary volume \(V\) it must be that the quantity inside the integral is equal to \(0\). This results in the continuity equation.

Using the continuity equation and the ODE describing the flow \(\phi_t\) we get the instantaneous change of variable equation:

\[\frac{d}{dt}\log p_t(\phi_t(\mathbf{x})) + \nabla \cdot u_t(\phi_t(\mathbf{x})) = 0.\]The proof of this fact is rather short so we provide it here. Consider the total derivative of \(\log p_t(\phi_t(\mathbf{x}))\),

\[\begin{align} \frac{d\log p_t(\phi_t(\mathbf{x}))}{dt} &= \frac{\partial \log p_t(\phi_t(\mathbf{x}))}{\partial t} \cdot \frac{\partial t}{\partial t} + \nabla_{\mathbf{x}} \log p_t(\phi_t(\mathbf{x})) \cdot \frac{d \phi_t(\mathbf{x})}{d t} \notag \\ &= \frac{\partial \log p_t(\phi_t(\mathbf{x}))}{\partial t} + \nabla_{\mathbf{x}} \log p_t(\phi_t(\mathbf{x})) \cdot \frac{d \phi_t(\mathbf{x})}{d t} \notag \\ &= \frac{\partial \log p_t(\phi_t(\mathbf{x}))}{\partial t} + \nabla_{\mathbf{x}} \log p_t(\phi_t(\mathbf{x})) \cdot u_t(\phi_t(x)) \label{cov_deriv} \end{align}\]Notice the first term is the partial derivative with respect to \(t\). We can obtain this term by rearranging the continuity equation. One property of the divergence operator is that \(\nabla \cdot (p_t(\mathbf{x})u_t(\mathbf{x})) = p_t(\mathbf{x}) \nabla \cdot u_t(\mathbf{x}) + u_t(\mathbf{x}) \cdot \nabla_\mathbf{x} p_t(\mathbf{x})\). So the continuity equation becomes,

\[\begin{equation*} \frac{\partial}{\partial t}p_t(\phi_t(\mathbf{x})) + p_t(\phi_t(\mathbf{x})) \nabla \cdot u_t(\phi_t(\mathbf{x})) + u_t(\phi_t(\mathbf{x})) \cdot \nabla_\mathbf{x} p_t(\phi_t(\mathbf{x})) = 0. \end{equation*}\]Now divide by \(p_t(\phi_t(\mathbf{x}))\),

\[\begin{equation*} \frac{1}{p_t(\phi_t(\mathbf{x}))}\frac{\partial}{\partial t}p_t(\phi_t(\mathbf{x})) + \nabla \cdot u_t(\phi_t(\mathbf{x})) + u_t(\phi_t(\mathbf{x})) \cdot \nabla_\mathbf{x} \frac{p_t(\phi_t(\mathbf{x}))}{p_t(\phi_t(\mathbf{x}))} = 0. \end{equation*}\]Recognize the derivative of \(\log\) and move some terms to the other side to get,

\[\begin{equation*} \frac{\partial}{\partial t}\log p_t(\phi_t(\mathbf{x})) = -\nabla \cdot u_t(\phi_t(\mathbf{x})) - u_t(\phi_t(\mathbf{x})) \cdot \nabla_\mathbf{x} \log p_t(\phi_t(\mathbf{x})). \end{equation*}\]Now substitute this formula into \(\eqref{cov_deriv}\) to obtain the desired result. Remember that in the discrete normalizing flow setup, the change of variable formula required computing the determinant of the Jacobian which was a \(\mathcal{O}(d^3)\) operation. Using the instantaneous change of variables formula we can compute the log-likelihood by integrating the ODE,

\[\log p_1(\phi_1(\mathbf{x})) = \log p_0(\phi_0(\mathbf{x})) - \int_0^1 \nabla \cdot u_t(\phi_t(\mathbf{x})) dt.\]Observe that divergence with respect to the spatial dimension is the same as trace of the Jacobian of \(u_t\). Computing the trace is an \(\mathcal{O}(d^2)\) operation. Using Hutchinson’s trace estimator formula we can reduce the cost down to \(\mathcal{O}(d)\).

Now we have an ODE that describes the change of the log-probability along the flow trajectory. So how can we use this ODE to compute \(\log p_1(\mathbf{x_1})\), and train a CNF with maximum likelihood? So far, we have discussed ODEs in the forward direction i.e. increasing time which is needed to transform the noise distribution into a data distribution. We can also compute and solve ODEs in the reverse direction allowing us to transfrom \(q\) to \(p\). In order to compute the log-likelihood of the data, we need to use the reverse direction ODE. First, we sample a point \(\mathbf{x_1}\) from \(q\). Then we solve the reverse ODE,

\[\frac{d\phi_{1-s}(\mathbf{x})}{ds} = -u_{1-s}(\phi_{1-s}(\mathbf{x})),\]with initial condition \(\phi_1(\mathbf{x}) = \mathbf{x_1}\) with \(s \in [0,1]\). The solution to this is a point \(\mathbf{x_0}\) from the noise distribution. Now we can solve the reverse ODE corresponding to instantenous change of variables formula,

\[\frac{d}{ds}\log p_{1-s}(\phi_{1-s}(\mathbf{x})) = \nabla \cdot u_{1-s}(\phi_{1-s}(\mathbf{x})).\]with initial condition \(\log p_0(\phi_0(\mathbf{x})) = \log p(\mathbf{x_0})\). The fact that \(p_0 = p\) is a simple distribution is a key property because that allows us to evaluate the log-likelihood \(\log p_0(\mathbf{x_0})\). Instead of having to evaluate the \(u_{1-s}\) again to solve this ODE, we can solve the log-likelihood and flow trajectory in a coupled manner:

\[\frac{d}{ds} \begin{bmatrix} \phi_{1-s}(\mathbf{x}) \\ f(1-s) \end{bmatrix} = \begin{bmatrix} -u_{1-s}(\phi_{1-s}(\mathbf{x})) \\ \nabla \cdot u_{1-s}(\phi_{1-s}(\mathbf{x})) \end{bmatrix}\]with \(f(t) = \log p_t(\phi_t(\mathbf{x})) - \log p_1(\phi_1(\mathbf{x}))\). At \(t=1\) we want the difference between the two likelihoods to match so our initial condition is \(f(1) = 0\). The combined initial conditions are,

\[\begin{bmatrix} \phi_{1}(\mathbf{x}) \\ f(1) \end{bmatrix} = \begin{bmatrix} \mathbf{x_1} \\ 0 \end{bmatrix}.\]To summarize, we can train CNFs with maximum likelihood using reverse ODEs. Unlike training discrete normalizing flows which require computing the determinant with cost \(\mathcal{O}(d^3)\), CNFs only need \(\mathcal{O}(d)\) for computing the divergence. However, there is still a downside to training CNFs. The caveat is we have to simulate the flow trajectory to obtain the log-probability. Simulation is very slow even with the \(\mathcal{O}(d)\) operation cost. As a result, continuous normalizing flows scale very poorly which is why they were not as popular as other deep generative methods. In the next blog post, we will discuss flow matching which aims to solve this issue.