An introduction to Markov Chains.
A stochastic process $X = \{X(t): t\in T\}$ is a collection of random variables. There are two elements: the index set $T$ (usually interpreted as time) and the state space $\Omega$ in which the random variables take their values.
A Markov chain is a discrete-time process for which the future behaviour, given the past and the present, depends only on the present and not on the past.
A Markov process is the continuous-time analogue of a Markov chain.
Definition 1. [Markov chain] A discrete time stochastic process $X_0, X_1, X_2, \ldots$ is a Markov chain if the following Markov property holds: \[\small P(X_{t} = a_t \vert X_{t-1} = a_{t-1}, X_{t-2} = a_{t-2}, \ldots, X_0 = a_0) = P(X_{t} = a_t \vert X_{t-1} = a_{t-1}) = P_{a_{t-1}, a_{t}}.\]
Remark 1: This is a time-homogeneous Markov chain: for all $t$ and all $a_{t-1}, a_{t} \in \Omega$, the transition probability $P_{a_{t-1}, a_{t}}$ does not depend on $t$.
Remark 2: In DDPM (denoising diffusion probabilistic models), the forward noising process is defined as a Markov chain of exactly this form.
The state $X_{t}$ depends on the previous state $X_{t-1}$ but is independent of the particular history $X_{t-2}, X_{t-3},…$. This is called the Markov property or memoryless property.
The Markov property does not imply that $X_{t}$ is independent of the random variables $X_{0}$, $X_{1}$,…, $X_{t-2}$; it just implies that any dependency of $X_{t}$ on the past is captured in the value of $X_{t-1}$.
The Markov chain is uniquely defined by the one-step transition probability matrix $P$:
\[P = \begin{pmatrix} P_{0,0} & P_{0, 1} & \cdots & P_{0, j} & \cdots\\ \vdots & \vdots & \ddots & \vdots& \vdots\\ P_{i,0} & P_{i, 1} & \cdots & P_{i, j} & \cdots\\ \vdots & \vdots & \ddots & \vdots& \vdots\\ \end{pmatrix},\]where $P_{i,j}$ is the probability of transition from state $i$ to state $j$. $P_{i,j} = P(X_{t} = j \vert X_{t-1} = i), i,j \in \Omega$. $\forall i, \sum_{j \geq 0} P_{i,j} = 1$.
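To make the notation concrete, here is a minimal sketch in Python with NumPy (the particular 3-state matrix is only an illustration, not from the text): $P$ is stored row-wise with rows summing to 1, and a distribution is propagated one step via $\mu_t = \mu_{t-1} P$.

```python
import numpy as np

# Hypothetical 3-state transition matrix: row i holds P(X_t = j | X_{t-1} = i).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

# Distribution over states at time t-1 (a row vector).
mu = np.array([1.0, 0.0, 0.0])

# One step of the chain: mu_t = mu_{t-1} P.
mu_next = mu @ P
print(mu_next)  # [0.5 0.3 0.2]
```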
For simplicity, we assume that the state space $\Omega$ is finite. Below we introduce some concepts related to the classification of states.
Definition 2. [Communicating class] A state $j$ is reachable from state $i$ if there exists a positive integer $n$ such that $P_{i,j}^{(n)} > 0$. We write $i \rightarrow j$. If $j$ is reachable from $i$, and $i$ is reachable from $j$, then the states $i$ and $j$ are said to communicate, denoted by $i \leftrightarrow j$. A communicating class $C$ is a maximal set of states that communicate with each other. No state in $C$ communicates with any state not in $C$.
Definition 3. [Irreducibility] A Markov chain is irreducible if all states belong to one communicating class.
This means that any state can be reached from any other state, i.e., $\forall i, j \in \Omega$, there exists an $n$ such that $P_{i,j}^{(n)} > 0$.
Lemma 1. A finite Markov chain is irreducible if and only if its graph representation is a strongly connected graph.
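Lemma 1 also suggests a simple computational check. The sketch below (illustrative code, with a made-up absorbing-state example) views $P$ as a directed graph with an edge $i \to j$ whenever $P_{i,j} > 0$ and tests strong connectivity by breadth-first search.

```python
from collections import deque
import numpy as np

def is_irreducible(P: np.ndarray) -> bool:
    """Check that every state is reachable from every other state."""
    n = P.shape[0]

    def reachable_from(s: int) -> set:
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if P[u, v] > 0 and v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    # Strongly connected graph: every state is reachable from every start state.
    return all(len(reachable_from(s)) == n for s in range(n))

# Example: a 2-state chain with an absorbing state is not irreducible.
print(is_irreducible(np.array([[0.5, 0.5], [0.0, 1.0]])))  # False
```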
Let $r_{i,j}^{t}$ denote the probability that, starting at state $i$, the chain first visits state $j$ at time $t$. That is, $ r_{i,j}^{t} = P(X_{t} = j, X_{s} \neq j \ \forall 1 \leq s \leq t-1 \vert X_{0} = i). $
Definition 4. [Transient and Recurrent States] A state $i$ is recurrent if $\sum_{t \geq 1} r_{i,i}^{t} = 1$ and transient if $\sum_{t \geq 1} r_{i,i}^{t} < 1$. A Markov chain is recurrent if every state in the chain is recurrent.
If state $i$ is recurrent then, once the chain visits that state, it will (with probability 1) eventually return to that state. Hence the chain will visit state $i$ over and over again, infinitely often.
A transient state has the property that a Markov chain starting at this state returns to this state only finitely often, with probability 1.
If one state in a communicating class is transient (respectively, recurrent) then all states in that class are transient (respectively, recurrent).
Definition 5. An irreducible Markov chain is called recurrent if at least one (equivalently, every) state in this chain is recurrent. An irreducible Markov chain is called transient if at least one (equivalently, every) state in this chain is transient.
Let $\mu_{i} = \sum_{t \geq 1} t \cdot r_{i,i}^{t}$ denote the expected time to return to state $i$ when starting at state $i$.
Definition 6. A state $i$ is positive recurrent if $\mu_{i} < \infty$ and null recurrent if $\mu_{i} = \infty$.
Here we give an example of a Markov chain that has null recurrent states. Consider the Markov chain whose states are the positive integers and in which, from state $i$, the chain moves to state $i+1$ with probability $\frac{i}{i+1}$ and returns to state $1$ with probability $\frac{1}{i+1}$.
Starting at state 1, the probability of not having returned to state 1 within the first $t$ steps is \(\prod_{j=1}^{t} \frac{j}{j+1} = \frac{1}{t+1},\) which tends to 0 as $t \to \infty$; hence the probability of never returning to state 1 is 0 and state 1 is recurrent. The probability that the first return to state $1$ occurs at time $t$ is \(r_{1,1}^{t} = \frac{1}{t} \cdot \frac{1}{t+1} = \frac{1}{t(t+1)}.\) The expected number of steps until the first return to state 1 when starting at state 1 is therefore \(\mu_{1} = \sum_{t = 1}^{\infty} t \cdot r_{1,1}^{t} = \sum_{t = 1}^{\infty} \frac{1}{t+1} = \infty.\) So state 1 is recurrent, but only null recurrent.
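As a sanity check (not part of the argument above), one can simulate this chain: every excursion from state 1 does return, but the empirical mean return time does not settle down, reflecting $\mu_1 = \infty$. A minimal sketch, assuming the transition rule described above:

```python
import random

def return_time_to_1() -> int:
    """Simulate one excursion from state 1 and return the first return time."""
    state, t = 1, 0
    while True:
        t += 1
        if random.random() < 1.0 / (state + 1):
            return t          # jump back to state 1
        state += 1            # otherwise move from state i to state i+1

random.seed(0)
samples = [return_time_to_1() for _ in range(10_000)]
# Every sample is finite (state 1 is recurrent), but the empirical mean
# does not stabilise as the sample size grows (null recurrence).
print(sum(samples) / len(samples))
```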
Lemma 2. In a finite Markov chain:
- at least one state is recurrent; and
- all recurrent states are positive recurrent.
Thus, all states of a finite, irreducible Markov chain are positive recurrent.
Definition 7. A state $j$ in a discrete time Markov chain is periodic if there exists an integer $k>1$ such that $P(X_{t+s}= j \vert X_t = j) = 0$ unless $s$ is divisible by $k$. A discrete time Markov chain is periodic if any state in the chain is periodic. A state or chain that is not periodic is aperiodic.
That a state $j$ is periodic with period $k$ means that a return to $j$ can only occur at times $s$ that are multiples of $k$ (N.B.: $k > 1$); for all other $s$, $P(X_{t+s}= j \vert X_t = j) = 0$.
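For intuition, the period of a state can also be computed numerically as the greatest common divisor of the step counts $t$ (up to some cutoff) for which $(P^t)_{j,j} > 0$. A sketch, with an illustrative deterministic 2-cycle; the cutoff and function name are assumptions for illustration:

```python
import math
import numpy as np

def period(P: np.ndarray, j: int, max_t: int = 50) -> int:
    """gcd of all t <= max_t with (P^t)_{j,j} > 0; a result of 1 means state j is aperiodic."""
    g = 0
    Pt = np.eye(P.shape[0])
    for t in range(1, max_t + 1):
        Pt = Pt @ P                      # Pt now holds P^t
        if Pt[j, j] > 1e-12:
            g = math.gcd(g, t)
    return g

# A deterministic 2-cycle (state 0 <-> state 1): every return takes an even
# number of steps, so state 0 has period 2.
print(period(np.array([[0.0, 1.0], [1.0, 0.0]]), j=0))  # 2
```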
Definition 8. An aperiodic, positive recurrent state is an ergodic state. A Markov chain is ergodic if all its states are ergodic.
Corollary 1. Any finite, irreducible, and aperiodic Markov chain is an ergodic chain.
Consider the two-state “broken printer” Markov chain with states 0 and 1 and transition matrix
\[P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}.\]Assume that the initial distribution is:
\[P(X_0 = 0) = \frac{\beta}{\alpha+\beta}, P(X_0 = 1) = \frac{\alpha}{\alpha+\beta}.\]Then, according to the transition probability matrix $P$, after one step, the distribution is
\[\small \begin{align*} P(X_1 = 0) &= P(X_0 = 0)P(X_1 = 0 \vert X_0 = 0) + P(X_0 = 1)P(X_1 = 0 \vert X_0 = 1) \\ &= \frac{\beta}{\alpha+\beta} \cdot (1-\alpha) + \frac{\alpha}{\alpha+\beta} \cdot \beta = \frac{\beta}{\alpha+\beta}, \\ P(X_1 = 1) &= P(X_0 = 0)P(X_1 = 1 \vert X_0 = 0) + P(X_0 = 1)P(X_1 = 1 \vert X_0 = 1) \\ &= \frac{\beta}{\alpha+\beta} \cdot \alpha + \frac{\alpha}{\alpha+\beta} \cdot (1-\beta) = \frac{\alpha}{\alpha+\beta}. \end{align*}\]Evidently, the distribution of $X_1$ is the same as the initial distribution. Similarly, we can show that the distribution of $X_t$ is the same as the initial distribution for any $t$. Here, $\pi = (\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta})$ is called a stationary distribution.
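A quick numerical check of this calculation, with illustrative values of $\alpha$ and $\beta$ (not taken from the text): the distribution $\pi = (\beta/(\alpha+\beta), \alpha/(\alpha+\beta))$ is unchanged by one step of the chain.

```python
import numpy as np

alpha, beta = 0.3, 0.1  # illustrative breakdown/repair probabilities
P = np.array([
    [1 - alpha, alpha],
    [beta, 1 - beta],
])
pi = np.array([beta, alpha]) / (alpha + beta)

print(np.allclose(pi @ P, pi))  # True: pi is stationary
```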
Definition 9. A probability distribution $\pi = (\pi_i)$ (a row vector) on the state space $\Omega$ with $\sum_{i \in \Omega} \pi_i = 1$ is called a stationary distribution (or an equilibrium distribution) for the Markov chain with transition probability matrix $P$ if $\pi = \pi P$, equivalently, $\pi_j = \sum_{i \in \Omega}\pi_i P_{i,j}$ for all $j \in \Omega$.
One interpretation of the stationary distribution: if we started off a thousand Markov chains, choosing each starting position to be state $i$ with probability $\pi_i$, then (roughly) $1000 \pi_j$ of them would be in state $j$ at any time in the future – but not necessarily the same ones each time.
If a chain ever reaches a stationary distribution then it maintains that distribution for all future time, and thus a stationary distribution represents a steady state or an equilibrium in the chain’s behavior.
Consider the following no-claims discount Markov chain with state space $\Omega = \{1,2,3\}$ and transition matrix:
\[P = \begin{pmatrix} \frac{1}{4} & \frac{3}{4} & 0\\ \frac{1}{4} & 0 & \frac{3}{4}\\ 0 & \frac{1}{4} & \frac{3}{4} \end{pmatrix}.\]Step 1: Assume $\pi = (\pi_1, \pi_2, \pi_3)$ is a stationary distribution. According to Definition 9 of the stationary distribution, we need to solve the following equations:
\[\begin{align*} \pi_1 &= \frac{1}{4}\pi_1 + \frac{1}{4}\pi_2, \\ \pi_2 &= \frac{3}{4}\pi_1 + \frac{1}{4}\pi_3, \\ \pi_3 &= \frac{3}{4}\pi_2 + \frac{3}{4}\pi_3. \end{align*}\]Adding the normalising condition $\pi_1 + \pi_2 + \pi_3 = 1$, we get four equations in three unknown parameters.
Step 2: Choose one of the parameters, say $\pi_1$, and solve for the other two parameters in terms of $\pi_1$. We get
\[\pi_1 = \frac{1}{4}\pi_1 + \frac{1}{4}\pi_2 \Rightarrow \pi_2 = 3\pi_1, \pi_3 = 3\pi_2 = 9\pi_1.\]Step 3: Combining with the normalising condition, we get
\[\pi_1 + 3\pi_1 + 9\pi_1 = 1 \Rightarrow \pi_1 = \frac{1}{13}, \pi_2 = \frac{3}{13}, \pi_3 = \frac{9}{13}.\]Finally, we get the stationary distribution $\pi = (\frac{1}{13}, \frac{3}{13}, \frac{9}{13})$.
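The same stationary distribution can be found numerically, for example by solving $\pi(P - I) = 0$ together with the normalising condition; this is just a sketch of that linear-algebra route, not part of the derivation above.

```python
import numpy as np

P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])

# Solve (P^T - I) pi^T = 0 with sum(pi) = 1 by replacing one of the
# (linearly dependent) stationarity equations with the normalising condition.
n = P.shape[0]
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
b = np.zeros(n)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)  # approximately [1/13, 3/13, 9/13]
```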
Given a Markov chain, how can we know whether it has a stationary distribution? If it does, is it unique? In this part, we answer these questions.
Some notations:
Hitting time of state $j$: $H_{j} = \min \{ t \in \{0, 1, 2,\ldots\}: X_t = j\}$. Note that here we include time $t = 0$.
Hitting probability of state $j$ starting from state $i$: $h_{i,j} = P(X_t = j \ \text{for some} \ t \geq 0 \vert X_0 = i) = P(H_{j} < \infty \vert X_0 = i) = \sum_{t \geq 0} r_{i,j}^{t}.$
Note that this is different from $r_{i,j}^{t}$, which is the probability that, starting at state $i$, the chain first visits state $j$ at time $t$.
We also have
\[h_{i,j} = \begin{cases} \sum_{k \in \Omega}P_{i,k}h_{k,j}, & \text{if} \quad j \ne i, \\ 1, & \text{if} \quad j = i. \end{cases}\]Expected hitting time: $\eta_{i,j} = \mathbb{E}(H_{j} \vert X_0 = i) = \sum_{t \geq 0} t \cdot r_{i,j}^{t}$. This is the expected time until we hit state $j$ starting from state $i$. We also have
\[\eta_{i,j} = \begin{cases} 1 + \sum_{k \in \Omega}P_{i,k}\eta_{k,j}, & \text{if} \quad j \ne i, \\ 0, & \text{if} \quad j = i. \end{cases}\](For the first case, we add 1 because we need to consider the first step from state $i$ to state $k$.)
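As an illustration of this recursion, the sketch below solves the hitting-time equations as a linear system for the no-claims discount chain from the previous example, computing $\eta_{i,j}$ for a chosen target state; the helper name is made up for illustration.

```python
import numpy as np

def expected_hitting_times(P: np.ndarray, j: int) -> np.ndarray:
    """Solve eta_{i,j} = 1 + sum_k P_{i,k} eta_{k,j} for i != j, with eta_{j,j} = 0."""
    n = P.shape[0]
    others = [i for i in range(n) if i != j]
    Q = P[np.ix_(others, others)]              # transitions among states != j
    eta_others = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    eta = np.zeros(n)
    eta[others] = eta_others
    return eta

P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])
# Expected times to hit state 3 (index 2) starting from each state.
print(expected_hitting_times(P, j=2))
```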
Theorem 1. Consider an irreducible Markov chain (finite or infinite). (1) If it is positive recurrent, then there exists a unique stationary distribution $\pi$, given by $\pi_i = \frac{1}{\mu_{i}}$. (2) If it is null recurrent or transient, no stationary distribution exists.
Remark: If the chain is finite and irreducible, it must be positive recurrent, and thus it has a unique stationary distribution.
Remark: If the Markov chain is not irreducible, we can decompose the state space into several communicating classes. Then, we can consider each communicating class separately.
Now we give the proof of Theorem 1. We first prove that if a Markov chain is irreducible and positive recurrent, then there exists a stationary distribution. Next, we prove that this stationary distribution is unique. Since the part concerning null recurrent or transient Markov chains is less important and more complicated, we omit it; if you are interested, you can refer to the book Markov Chains by James Norris.
Proof. (1) Suppose that $(X_0, X_1, \ldots)$ is a recurrent Markov chain, which can be positive recurrent or null recurrent. Then we can construct a stationary distribution as follows. (If we can construct one, then one certainly exists.)
Fix a reference state $k$ and let $M_k$ denote the first return time to $k$ when the chain starts from $k$. Let $\nu_i$ be the expected number of visits to $i$ before the chain returns to $k$,
\[\begin{align*} \nu_i &= \mathbb{E}\big(\#\,\text{visits to $i$ before returning to } k \,\big\vert\, X_0 = k\big) \\ &= \mathbb{E}\Big(\sum_{t=1}^{M_k} \mathbf{1}\{X_t = i\} \,\Big\vert\, X_0 = k\Big) \\ &= \mathbb{E}\Big(\sum_{t = 0}^{M_k - 1} \mathbf{1}\{X_t = i\} \,\Big\vert\, X_0 = k\Big). \end{align*}\]The last equality holds because $X_0 = X_{M_k} = k$: for $i \neq k$ the terms at $t = 0$ and $t = M_k$ both vanish, and for $i = k$ both sums equal 1.
To be a stationary distribution, $\pi$ must satisfy $\pi P = \pi$ and $\sum_{i \in \Omega}\pi_i = 1$.
(a) We first prove that $\nu P = \nu$.
\[\begin{align*} \sum_{i \in \Omega} \nu_i P_{i,j} &= \mathbb{E}\Big(\sum_{i \in \Omega} \sum_{t = 0}^{M_k - 1} \mathbf{1}\{X_t = i, X_{t+1} = j\} \,\Big\vert\, X_0 = k\Big) \\ &= \mathbb{E}\Big(\sum_{t = 0}^{M_k - 1} \sum_{i \in \Omega} \mathbf{1}\{X_t = i, X_{t+1} = j\} \,\Big\vert\, X_0 = k\Big) \\ &= \mathbb{E}\Big(\sum_{t = 0}^{M_k - 1} \mathbf{1}\{X_{t+1} = j\} \,\Big\vert\, X_0 = k\Big) \\ &= \mathbb{E}\Big(\sum_{t = 1}^{M_k} \mathbf{1}\{X_{t} = j\} \,\Big\vert\, X_0 = k\Big) \\ &= \nu_j. \end{align*}\](The first equality uses the Markov property, together with the fact that the event $\{t \leq M_k - 1\}$ is determined by $X_0, \ldots, X_t$.) (b) Next, we normalize $\nu$ to obtain a stationary distribution. We have
\[\begin{align*} \sum_{i \in \Omega} \nu_i &= \sum_{i \in \Omega} \mathbb{E}\Big(\sum_{t = 0}^{M_k - 1} \mathbf{1}\{X_t = i\} \,\Big\vert\, X_0 = k\Big) \\ &=\mathbb{E}\Big(\sum_{t = 0}^{M_k - 1} \sum_{i \in \Omega} \mathbf{1}\{X_t = i\} \,\Big\vert\, X_0 = k\Big) \\ &= \mathbb{E}(M_k \vert X_0 = k) \\ &= \mu_k. \end{align*}\]Since the chain is positive recurrent, $\mu_k < \infty$, so we can define $\pi_i = \nu_i/\mu_k$; then $\pi = (\pi_i)_{i \in \Omega}$ is a stationary distribution.
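As a sanity check on this construction (not part of the proof), one can estimate $\nu_i$ and $\mu_k$ by simulating excursions of the no-claims discount chain from a reference state and confirm that $\nu_i/\mu_k$ reproduces $(1/13, 3/13, 9/13)$. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])
k = 0                       # reference state
n_excursions = 20_000
visits = np.zeros(3)        # estimates of nu_i (i != k) times n_excursions
total_steps = 0             # sum of excursion lengths, i.e. n_excursions * (estimate of mu_k)

for _ in range(n_excursions):
    state = k
    while True:
        state = rng.choice(3, p=P[state])
        total_steps += 1
        if state == k:
            break           # excursion ends at the first return to k
        visits[state] += 1

visits[k] = n_excursions    # nu_k = 1 per excursion
print(visits / total_steps) # approaches [1/13, 3/13, 9/13]
```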
(2) Next, we prove that if a Markov chain is irreducible and positive recurrent, then the stationary distribution is unique and is given by $\pi_j = \frac{1}{\mu_j}$.
Given a stationary distribution $\pi$, if we can show that $\pi_j = \frac{1}{\mu_j}$ for all $j$, then every stationary distribution is equal to this one, so the stationary distribution is unique.
Recall the recursion for the expected hitting time:
\begin{equation} \eta_{i,j} = 1 + \sum_{k \in \Omega}P_{i,k}\eta_{k,j}, j \ne i \label{eq:1} \end{equation}
We multiply both sides of Eq. \eqref{eq:1} by $\pi_i$ and sum over $i (i \ne j)$ to get
\begin{equation} \sum_{i \ne j} \pi_i \eta_{i,j} = \sum_{i \ne j} \pi_i + \sum_{i \ne j} \sum_{k \in \Omega} \pi_i P_{i,k}\eta_{k,j} \label{eq:2} \end{equation}
Since $\eta_{j,j} = 0$, we can rewrite the above equation as
\begin{equation} \sum_{i \in \Omega} \pi_i \eta_{i,j} = \sum_{i \ne j} \pi_i + \sum_{i \ne j} \sum_{k \in \Omega} \pi_i P_{i,k}\eta_{k,j}. \label{eq:3} \end{equation}
(The right-hand side of Eq. \eqref{eq:3} still excludes the term with $i = j$, and we want the expected return time $\mu_j$ to appear, since our goal is $\pi_j = 1/\mu_j$.)
Recall the expression for the expected return time:
\[\mu_{j} = 1 + \sum_{i \in \Omega} P_{j,i}\eta_{i,j}.\]We multiply both sides of this equation by $\pi_j$ to get
\begin{equation} \pi_j \mu_{j} =\pi_j + \sum_{k \in \Omega} \pi_j P_{j,k}\eta_{k,j} \label{eq:4} \end{equation}
Adding Eq. \eqref{eq:3} and Eq. \eqref{eq:4}, we get
\[\begin{align*} \sum_{i \in \Omega} \pi_i \eta_{i,j} + \pi_j \mu_{j} &= \sum_{i \in \Omega} \pi_i + \sum_{i \in \Omega} \sum_{k \in \Omega} \pi_i P_{i,k}\eta_{k,j} \\ &= 1 + \sum_{k \in \Omega} \sum_{i \in \Omega} \pi_i P_{i,k}\eta_{k,j} \\ &= 1 + \sum_{k \in \Omega} \pi_k \eta_{k,j} \qquad (\text{since} \sum_{i \in \Omega} \pi_i P_{i,k} = \pi_k) \\ \end{align*}\]Since the Markov chain is irreducible and positive recurrent, all states communicate and every expected return (and hitting) time is finite, so the term $\sum_{k \in \Omega} \pi_k \eta_{k,j}$ appearing on both sides is finite (this is immediate for a finite state space; for an infinite chain a little more care is needed, which we omit).
We can subtract the (equal and finite) terms $\sum_{k \in \Omega} \pi_k \eta_{k,j}$ and $\sum_{i \in \Omega} \pi_i \eta_{i,j}$ from the two sides of the above equation to get $ \pi_j \mu_{j}=1, $ i.e., $\pi_j = 1/\mu_j$. Since $j$ was arbitrary, $\pi_j = 1/\mu_j$ for all $j \in \Omega$, which completes the proof.
Theorem 2. [Limit theorem] Consider an irreducible, aperiodic Markov chain (possibly infinite). Then $\lim\limits_{t \to \infty} P_{i,j}^{(t)} = \frac{1}{\mu_{j}}$ for all $i, j$. Specifically, (1) if the Markov chain is positive recurrent, then $\lim\limits_{t \to \infty} P_{i,j}^{(t)} = \pi_j = \frac{1}{\mu_{j}}$; (2) if the Markov chain is null recurrent or transient, then $\lim\limits_{t \to \infty} P_{i,j}^{(t)} = 0$, so there is no limiting probability distribution.
Three conditions are needed for convergence to an equilibrium probability distribution: irreducibility, aperiodicity, and positive recurrence. In the finite case, the limit of the $t$-step transition matrix is
\[\lim_{t \to \infty} P^{t} = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_N\\ \pi_1 & \pi_2 & \cdots & \pi_N\\ \vdots & \vdots & \ddots & \vdots\\ \pi_1 & \pi_2 & \cdots & \pi_N\\ \end{pmatrix},\]where each row is identical and equals the stationary distribution $\pi$.
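Theorem 2 can be illustrated numerically: raising the no-claims discount transition matrix (which is irreducible and aperiodic) to a large power gives a matrix whose rows are all approximately $(1/13, 3/13, 9/13)$. A minimal sketch:

```python
import numpy as np

P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])

print(np.linalg.matrix_power(P, 50))
# Every row is approximately [0.0769, 0.2308, 0.6923], i.e. (1/13, 3/13, 9/13).
```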
Define $V_{i,j}^{t} = \vert \{\, n < t : X_n = j \,\} \vert$ given $X_0 = i$. That is, $V_{i,j}^{t}$ is the number of visits to state $j$ before time $t$ starting from state $i$. Then we can interpret $V_{i,j}^{t}/t$ as the proportion of time up to time $t$ spent in state $j$.
Theorem 3. [Ergodic theorem] Consider an irreducible Markov chain. Then $\lim\limits_{t \to \infty} V_{i,j}^{t}/t = \frac{1}{\mu_{j}}$ almost surely. Specifically, (1) if the Markov chain is positive recurrent, then $\lim\limits_{t \to \infty} V_{i,j}^{t}/t = \pi_j = \frac{1}{\mu_{j}}$ almost surely; (2) if the Markov chain is null recurrent or transient, then $V_{i,j}^{t}/t \to 0$ almost surely for all $j$.
The term almost surely means that the event in question (here, convergence) occurs with probability 1.
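A simulation sketch of the ergodic theorem for the same no-claims discount chain: the fraction of time a single long trajectory spends in each state approaches $\pi$. The run length is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])

T = 100_000
visits = np.zeros(3)
state = 0
for _ in range(T):
    visits[state] += 1
    state = rng.choice(3, p=P[state])

print(visits / T)  # approaches [1/13, 3/13, 9/13] ~ [0.077, 0.231, 0.692]
```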
Theorem 4. [Detailed balance condition]. Consider a finite, irreducible, and ergodic Markov chain with transition matrix $P$. If there are nonnegative numbers $\bar{\pi} = (\pi_0, \pi_1, …, \pi_n)$ such that $\sum_{i=0}^{n} \pi_i = 1$ and if, for any pair of states $i, j$, $ \pi_i P_{i,j} = \pi_{j} P_{j,i}, $ then $\bar{\pi}$ is the stationary distribution corresponding to $P$.
Proof. For every state $j$ we have $ \sum_{i} \pi_i P_{i,j} = \sum_{i}\pi_{j} P_{j,i} = \pi_{j} \sum_{i} P_{j,i} = \pi_{j}. $ Thus, $\bar{\pi} = \bar{\pi}P$. Since this is a finite, irreducible, and ergodic Markov chain, $\bar{\pi}$ must be its unique stationary distribution.
Remark: The detailed balance condition in Theorem 4 is sufficient but not necessary for $\bar{\pi}$ to be the stationary distribution.
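For example, one can check numerically that the no-claims discount chain above happens to satisfy detailed balance with $\pi = (1/13, 3/13, 9/13)$, giving another way to confirm its stationary distribution. A minimal sketch:

```python
import numpy as np

P = np.array([
    [1/4, 3/4, 0],
    [1/4, 0, 3/4],
    [0, 1/4, 3/4],
])
pi = np.array([1, 3, 9]) / 13

# Detailed balance: pi_i P_{i,j} == pi_j P_{j,i} for every pair of states,
# i.e. the matrix with entries pi_i P_{i,j} is symmetric.
M = pi[:, None] * P
print(np.allclose(M, M.T))  # True
```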
We have introduced the basic concepts of Markov chains, including the definition, classification of states, and theorems on irreducible Markov chains. If you are interested in more details, you can refer to the following materials: