Formula Derivation
A mixture distribution is a linear combination of several probability density functions, of the form:

$$
p(x \mid \Theta) = \sum_{k=1}^{M} \alpha_k \, p_k(x \mid \theta_k), \quad \text{with} \quad \sum_{k=1}^{M} \alpha_k = 1 \qquad (1.1)
$$
Here $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$, meaning the mixture consists of $M$ component distributions, the $k$-th having weight $\alpha_k$. When every component is a Gaussian, the mixture is called a Gaussian mixture model with $M$ components.
Suppose the observed samples are $X = \{x_i\},\ 1 \le i \le N$. Then, following the expression above, each Gaussian component of the mixture has probability density function $p_k(x \mid \theta_k)$ with $\theta_k = (\mu_k, \sigma_k^2)$:

$$
p_k(x_i \mid \theta_k) = \frac{1}{\sqrt{2\pi \sigma_k^2}} \, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}} \qquad (1.2)
$$
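As a concrete check of Eqs. (1.1) and (1.2), the mixture density can be evaluated pointwise. A minimal sketch (function and variable names are illustrative, not from the text):

```python
import math

def mixture_pdf(x, alphas, mus, sigma2s):
    """Eq. (1.1): weighted sum of the Gaussian component densities of Eq. (1.2)."""
    total = 0.0
    for a, mu, s2 in zip(alphas, mus, sigma2s):
        # one term alpha_k * p_k(x | theta_k) of the mixture
        total += a * math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return total
```

With a single component of weight 1 this reduces to the plain Gaussian density, which is an easy sanity check.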
For $N$ independent and identically distributed observations, the joint distribution is:

$$
p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) \qquad (1.3)
$$
The log-likelihood function can then be written as:

$$
\ln L(\Theta \mid X) = \ln \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \ln \left[ \sum_{k=1}^{M} \alpha_k \, p_k(x_i \mid \theta_k) \right] \qquad (1.4)
$$
Our goal is to find $\hat{\Theta} = \arg\max_{\Theta} L(\Theta \mid X)$. In principle one could differentiate the log-likelihood above and solve for a stationary point, but this is difficult because the logarithm is taken of a sum of terms. To overcome this difficulty, we introduce latent variables $Y = \{y_i\},\ 1 \le i \le N$, where each $y_i \in \{1, 2, \ldots, M\}$, and $y_i = k$ means that the $i$-th observation $x_i$ was generated by the $k$-th component of the mixture. With $Y$ introduced, the complete-data log-likelihood becomes

$$
\ln L(\Theta \mid X, Y) = \ln \prod_{i=1}^{N} p(x_i, y_i \mid \Theta) = \sum_{i=1}^{N} \ln \left[ \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \right] \qquad (1.5)
$$
With the likelihood rewritten in this form, the EM algorithm can be used to estimate the model parameters.

Suppose at the start of iteration $t$ we have the estimate $\hat{\Theta}^{t-1} = (\alpha_1^{t-1}, \ldots, \alpha_M^{t-1}, \theta_1^{t-1}, \ldots, \theta_M^{t-1})$.

In the E-step of the EM algorithm, we compute the expectation of the complete-data log-likelihood:
$$
\begin{aligned}
Q(\Theta \mid \Theta^{t-1})
&= \mathbb{E} \left[ \ln L(\Theta \mid X, Y) \right] \\
&= \sum_{y} \ln \left[ L(\Theta \mid X, y) \right] \, p(y \mid X, \Theta^{t-1}) \\
&= \sum_{i=1}^{N} \sum_{y_i=1}^{M} \ln \left[ \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \right] \, p(y_i \mid x_i, \Theta^{t-1})
\end{aligned} \qquad (1.6)
$$

In the last step, the complete-data log-likelihood of Eq. (1.5) is a sum of per-sample terms, so the sum over all assignment vectors $y$ collapses to a sum over the value of each $y_i$ separately.
Here $p(y_i = k \mid x_i, \Theta^{t-1})$ is the posterior probability that the $i$-th observation $x_i$ came from the $k$-th component. Writing $k$ for $y_i$, the expression above becomes:
$$
\begin{aligned}
Q(\Theta \mid \Theta^{t-1})
&= \sum_{k=1}^{M} \sum_{i=1}^{N} \ln \left[ \alpha_k \, p_k(x_i \mid \theta_k) \right] \, p(k \mid x_i, \Theta^{t-1}) \\
&= \sum_{k=1}^{M} \sum_{i=1}^{N} \ln (\alpha_k) \, p(k \mid x_i, \Theta^{t-1}) + \sum_{k=1}^{M} \sum_{i=1}^{N} \ln \left[ p_k(x_i \mid \theta_k) \right] \, p(k \mid x_i, \Theta^{t-1})
\end{aligned} \qquad (1.7)
$$
By Bayes' theorem, and since $p(k, x_i \mid \Theta^{t-1}) = \alpha_k^{t-1} \, p_k(x_i \mid \theta_k^{t-1})$,

$$
p(k \mid x_i, \Theta^{t-1}) =
\frac{p(k, x_i \mid \Theta^{t-1})}
{\sum\limits_{s=1}^{M} p(s, x_i \mid \Theta^{t-1})} =
\frac{\alpha_k^{t-1} \, p_k(x_i \mid \theta_k^{t-1})}
{\sum\limits_{s=1}^{M} \alpha_s^{t-1} \, p_s(x_i \mid \theta_s^{t-1})} \qquad (1.8)
$$
where $p_k(x_i \mid \theta_k)$ is given by Eq. (1.2).
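Equation (1.8) is exactly what the E-step computes in practice: given the current parameters, the posterior membership probability of every sample for every component. A minimal NumPy sketch (names are illustrative, not from the text):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Component density of Eq. (1.2): univariate Gaussian with variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step(x, alpha, mu, sigma2):
    """Posteriors p(k | x_i, Theta) of Eq. (1.8); returns an (N, M) matrix."""
    # weighted densities alpha_k * p_k(x_i | theta_k), shape (N, M) by broadcasting
    weighted = alpha[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma2[None, :])
    # normalize each row: the denominator of Eq. (1.8)
    return weighted / weighted.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, since for every sample the posterior is a distribution over the $M$ components.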
In the M-step of the EM algorithm, we find $\Theta^{t}$ by maximizing the function in Eq. (1.7).

First, for the mean $\mu_k$: take the partial derivative of $Q(\Theta \mid \Theta^{t-1})$ with respect to $\mu_k$ (only the term containing $p_k$ depends on $\mu_k$) and set it to zero:
$$
\begin{aligned}
\frac{\partial Q(\Theta \mid \Theta^{t-1})}{\partial \mu_k}
&= \frac{\partial}{\partial \mu_k} \left[ \sum_{i=1}^{N} \ln \left[ p_k(x_i \mid \theta_k) \right] \, p(k \mid x_i, \Theta^{t-1}) \right] \\
&= \frac{\partial}{\partial \mu_k} \left[ \sum_{i=1}^{N} \left( \ln \frac{1}{\sqrt{2\pi \sigma_k^2}} - \frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right) p(k \mid x_i, \Theta^{t-1}) \right] \\
&= \frac{1}{\sigma_k^2} \sum_{i=1}^{N} (x_i - \mu_k) \, p(k \mid x_i, \Theta^{t-1}) \\
&= \frac{1}{\sigma_k^2} \sum_{i=1}^{N} x_i \, p(k \mid x_i, \Theta^{t-1}) - \frac{\mu_k}{\sigma_k^2} \sum_{i=1}^{N} p(k \mid x_i, \Theta^{t-1}) \\
&= 0
\end{aligned}
$$
Solving for $\mu_k$ gives:

$$
\mu_k = \frac{\sum\limits_{i=1}^{N} x_i \, p(k \mid x_i, \Theta^{t-1})}
{\sum\limits_{i=1}^{N} p(k \mid x_i, \Theta^{t-1})}
\qquad (1.9)
$$
Similarly, for $\sigma_k^2$: taking the partial derivative of $Q(\Theta \mid \Theta^{t-1})$ with respect to $\sigma_k^2$ and setting it to zero yields:

$$
\sigma_k^2 = \frac{\sum\limits_{i=1}^{N} (x_i - \mu_k)^2 \, p(k \mid x_i, \Theta^{t-1})}
{\sum\limits_{i=1}^{N} p(k \mid x_i, \Theta^{t-1})}
\qquad (1.10)
$$
Finally, to solve for $\alpha_k$ under the constraint $\sum\limits_{k=1}^{M} \alpha_k = 1$, introduce a Lagrange multiplier $\lambda$:

$$
\begin{aligned}
& \frac{\partial}{\partial \alpha_k} \left[ Q(\Theta \mid \Theta^{t-1}) + \lambda \left( \sum_{k=1}^{M} \alpha_k - 1 \right) \right] \\
&= \frac{\partial}{\partial \alpha_k} \left[ \sum_{k=1}^{M} \sum_{i=1}^{N} \ln(\alpha_k) \, p(k \mid x_i, \Theta^{t-1}) + \lambda \left( \sum_{k=1}^{M} \alpha_k - 1 \right) \right] \\
&= \frac{1}{\alpha_k} \sum_{i=1}^{N} p(k \mid x_i, \Theta^{t-1}) + \lambda \\
&= 0
\end{aligned}
$$
Multiplying through by $\alpha_k$ gives:

$$
\sum_{i=1}^{N} p(k \mid x_i, \Theta^{t-1}) + \lambda \alpha_k = 0, \quad k = 1, 2, \ldots, M
\qquad (1.11)
$$
Summing these $M$ equations gives:

$$
\sum_{k=1}^{M} \left[ \sum_{i=1}^{N} p(k \mid x_i, \Theta^{t-1}) + \lambda \alpha_k \right] = 0
\qquad (1.12)
$$
Since $\sum\limits_{k=1}^{M} p(k \mid x_i, \Theta^{t-1}) = 1$ and $\sum\limits_{k=1}^{M} \alpha_k = 1$, substituting these into the above yields $\lambda = -N$. Substituting this back into Eq. (1.11) gives:

$$
\alpha_k = \frac{\sum\limits_{i=1}^{N} p(k \mid x_i, \Theta^{t-1})}{N}
\qquad (1.13)
$$
In summary, we have obtained the update formulas for all parameters: each EM iteration computes the posteriors via Eq. (1.8) (E-step) and then updates $\mu_k$, $\sigma_k^2$, and $\alpha_k$ via Eqs. (1.9), (1.10), and (1.13) (M-step).
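Putting the pieces together, one EM pass alternates Eq. (1.8) with the updates (1.9), (1.10), and (1.13). A self-contained NumPy sketch for the univariate case (the quantile initialization and fixed iteration count are illustrative choices, not part of the derivation):

```python
import numpy as np

def em_gmm(x, M, n_iter=100):
    """Fit a univariate M-component GMM to samples x with EM, Eqs. (1.8)-(1.13)."""
    N = len(x)
    alpha = np.full(M, 1.0 / M)                    # uniform initial weights
    mu = np.quantile(x, (np.arange(M) + 0.5) / M)  # spread initial means over the data
    sigma2 = np.full(M, np.var(x))                 # one shared initial variance
    for _ in range(n_iter):
        # E-step, Eq. (1.8): responsibilities r[i, k] = p(k | x_i, Theta^{t-1})
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        w = alpha * dens
        r = w / w.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0)                         # effective sample count per component
        # M-step: means (1.9), variances (1.10) with the updated means, weights (1.13)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        alpha = nk / N
    return alpha, mu, sigma2
```

On well-separated data this recovers the component means and mixing proportions; a production implementation would also monitor the log-likelihood of Eq. (1.4) for convergence and guard against degenerate (collapsing) variances.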