楼宇建筑是当下能源消耗的 “大户”。建筑物严重依赖照明, 供暖, 空调等设施, 给楼层中的人们提供舒适的办公, 居住, 以及生活环境。研究人员为了研究能源使用的效率, 模拟了12 种不同的建筑外形。每一种建筑在覆盖玻璃的区域, 玻璃区域的分布位置, 建筑方位等方面都有所不同。 本案例提供了一份有关楼宇能源的数据集, 共有 10 个特征, 768 个样本。 需要关注的重点是建筑物的供暖负荷和冷却负荷。

我们使用高斯混合模型,对数据集进行聚类,并使用KMeans模型作为对照,展现高斯混合模型在聚类问题上的应用。

数据预处理

In [40]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
In [17]:
data = pd.read_csv('./input/energy-efficiency.csv')
type(data.values)
Out[17]:
numpy.ndarray
In [104]:
data.head()
Out[104]:
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28
In [37]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
X1    768 non-null float64
X2    768 non-null float64
X3    768 non-null float64
X4    768 non-null float64
X5    768 non-null float64
X6    768 non-null int64
X7    768 non-null float64
X8    768 non-null int64
Y1    768 non-null float64
Y2    768 non-null float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB
In [38]:
data.describe()
Out[38]:
X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
count 768.000000 768.000000 768.000000 768.000000 768.00000 768.000000 768.000000 768.00000 768.000000 768.000000
mean 0.764167 671.708333 318.500000 176.604167 5.25000 3.500000 0.234375 2.81250 22.307201 24.587760
std 0.105777 88.086116 43.626481 45.165950 1.75114 1.118763 0.133221 1.55096 10.090196 9.513306
min 0.620000 514.500000 245.000000 110.250000 3.50000 2.000000 0.000000 0.00000 6.010000 10.900000
25% 0.682500 606.375000 294.000000 140.875000 3.50000 2.750000 0.100000 1.75000 12.992500 15.620000
50% 0.750000 673.750000 318.500000 183.750000 5.25000 3.500000 0.250000 3.00000 18.950000 22.080000
75% 0.830000 741.125000 343.000000 220.500000 7.00000 4.250000 0.400000 4.00000 31.667500 33.132500
max 0.980000 808.500000 416.500000 220.500000 7.00000 5.000000 0.400000 5.00000 43.100000 48.030000

数据标准化

In [31]:
scaler = StandardScaler().fit(data)
In [33]:
scaler.mean_
Out[33]:
array([  7.64166667e-01,   6.71708333e+02,   3.18500000e+02,
         1.76604167e+02,   5.25000000e+00,   3.50000000e+00,
         2.34375000e-01,   2.81250000e+00,   2.23072005e+01,
         2.45877604e+01])
In [35]:
scaler.scale_
Out[35]:
array([  0.10570859,  88.02874964,  43.59806953,  45.13653573,
         1.75      ,   1.11803399,   0.1331338 ,   1.5499496 ,
        10.08362445,   9.50710999])
In [44]:
scaled_data = scaler.transform(data)
scaled_data
Out[44]:
array([[ 2.04177671, -1.78587489, -0.56195149, ..., -1.81457514,
        -0.67011624, -0.34266569],
       [ 2.04177671, -1.78587489, -0.56195149, ..., -1.81457514,
        -0.67011624, -0.34266569],
       [ 2.04177671, -1.78587489, -0.56195149, ..., -1.81457514,
        -0.67011624, -0.34266569],
       ..., 
       [-1.36381225,  1.55394308,  1.12390297, ...,  1.41133622,
        -0.58185433, -0.78654401],
       [-1.36381225,  1.55394308,  1.12390297, ...,  1.41133622,
        -0.5778875 , -0.83913623],
       [-1.36381225,  1.55394308,  1.12390297, ...,  1.41133622,
        -0.56202019, -0.9001432 ]])

数据可视化

In [18]:
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
In [54]:
iris = load_iris()
X_tsne = TSNE(n_components=3, learning_rate=100).fit_transform(scaled_data)
X_pca = PCA(n_components=3).fit_transform(scaled_data)

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c = scaled_data[:,-2])
plt.subplot(122)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c = scaled_data[:, -2])
Out[54]:
<matplotlib.collections.PathCollection at 0x1188eb150>
In [55]:
plt.show()
In [62]:
from mpl_toolkits.mplot3d import Axes3D

X_tsne = TSNE(n_components=3, learning_rate=100).fit_transform(scaled_data)
X_pca = PCA(n_components=3).fit_transform(scaled_data)
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c = scaled_data[:,-2])
ax = fig.add_subplot(122, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c = scaled_data[:,-2])
plt.show()

KMeans聚类模型

In [95]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=3)
estimator.fit(data.values)
labels = estimator.labels_
In [96]:
X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(scaled_data)

plt.figure(figsize=(10, 5))

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c = labels)
plt.show()

高斯混合模型

假定数据集中的数据记录来自不同的数据分布,那么对数据集进行聚类可以使用高斯混合(Gaussian Mixture Model)模型。GMM使用EM算法确定每个组成部分所在分布的参数。

给定$z$是一个$K$维的随机隐变量,其中$z_{k} \in \{0, 1\}, \quad \sum_{k}z_{k} = 1$,变量$z$的分布为$p(z_{k} = 1) = \pi_{k}$

\begin{equation} p(x) = \sum_{z}p(z)p(x \vert z) \\ = \sum_{k = 1}^{K}p(z)\prod_{k = 1}^{K} (\mathcal{N}(x \vert \mu_{k}, \Sigma_{k}))^{z_{k}} = \sum_{k = 1}^{K}\pi_{k} \mathcal{N}(x \vert \mu_{k}, \Sigma_{k}) \end{equation}

从上面的公式可以看出,GMM模型实质上是K个高斯模型的线性叠加。进一步来讲,对于聚类问题,我们关注$p(z_{k} = 1 \vert x, \theta )$,也就是数据x属于簇$k$的后验概率概率。

我们给出高斯混模型求参的EM框架:

  • Expecation步,计算$P(z_{k}= 1 \vert x_{n}, \theta)$,记为$\gamma_{n}(z_{k})$
\begin{equation} \gamma_{n}(z_{k}) = \dfrac{p(z_{k} = 1)p(x_{n} \vert z_{k} = 1)}{p(x_{n})} = \dfrac{p(z_{k} = 1)p(x_{n} \vert z_{k} = 1)}{\sum_{j = 1}^{K}p(z_{j} = 1)p(x_{n} \vert z_{j} = 1)} = \dfrac{\pi_{k}\mathcal{N}(x_{n} \vert \mu_{k}, \Sigma_{k})}{\sum_{k = 1}^{K}\pi_{k}\mathcal{N}(x \vert \mu_{k})} \end{equation}
  • Maximization步,最大化似然函数下界$LB(\theta, \theta^{old})$,即对数似然函数的期望
\begin{equation} LB(\theta, \theta^{old}) = \mathbb{E}(\log p(x, z \vert \theta)) = \mathbb{E}(\sum_{n = 1}^{N}\log p(x_{n}, z_{n} \vert \theta)) \\ = \sum_{n = 1}^{N} \mathbb{E}(\log (p(z_{n})p(x_{n} \vert z_{n}))) = \sum_{n = 1}^{N} \mathbb{E} (\log(\prod_{k = 1}^{K}(\pi_{k}p(x_{n} \vert \theta_{k}))^{z_{k}}))\\ =\sum_{n}\sum_{k}\mathbb{E}(z_{k})\log(\pi_{k}p(x_{n}\vert\theta_{k})) = \sum_{n}\sum_{k} \gamma_{n}(z_{k})\log \pi_{k}+ \sum_{n}\sum_{k}\gamma_{n}(z_{k})\log p(x_{n} \vert \theta_{k}) \end{equation}

对上式求偏导,就可以得到参数的解析式为

\begin{equation} \pi_{k} = \dfrac{1}{N}\sum_{n = 1}^{N}\gamma_{n}(z_{k}) \\ \mu_{k} = \sum_{n = 1}^{N}\dfrac{\gamma_{n}(z_{k})x_{n}}{\gamma(z_{k})} \\ \Sigma_{k} = \dfrac{\sum_{n = 1}^{N}\gamma_{n}(z_{k})x_{n}x_{n}^{T}}{\gamma(z_{k})} - \mu_{k}\mu_{k}^{T} \end{equation}
In [81]:
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
In [98]:
gmm_random = GaussianMixture(n_components=3, covariance_type='full', init_params='random').fit(scaled_data)

labels_random = gmm.predict(scaled_data)
In [99]:
gmm_kmeans = GaussianMixture(n_components=3, covariance_type='full').fit(scaled_data)

labels_kmeans = gmm.predict(scaled_data)
In [101]:
X_tsne = TSNE(n_components=2, learning_rate=100).fit_transform(scaled_data)

#plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c = labels_random)
plt.subplot(122)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c = labels_kmeans)
plt.show()