15. K-Means

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
import pandas as pd
import os

15.1. Toy Example: A Quick End-to-End Run

15.1.1. Generate the data

  • We use a built-in function to generate a simulated dataset.

# from sklearn.datasets.samples_generator import make_blobs  # old import path, no longer works
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
# shape of the data
print(X.shape)
# first five rows
print(X[:5]) 
(300, 2)
[[ 0.83685684  2.13635938]
 [-1.4136581   7.40962324]
 [ 1.15521298  5.09961887]
 [-1.01861632  7.81491465]
 [ 1.27135141  1.89254207]]
ax = plt.gca()
ax.axis('equal')
ax.scatter(X[:, 0], X[:, 1], s=40, cmap='viridis', zorder=2);
../_images/KMeans_7_0.png
  • From the data and the plot above, we can see that we have an n x p = 300 x 2 dataset.

  • Plotting these 300 points in the coordinate plane gives the figure above.

  • Visually, the points fall into four clusters. Since this is simulated data, we actually know they really come from 4 clusters, and we even have their true labels (y_true).

15.1.2. Concept
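
  • The concept write-up is left empty here; as a rough substitute for intuition, below is a minimal NumPy sketch of my own of the classic Lloyd iteration behind k-means (assign each point to its nearest center, then move each center to the mean of its assigned points). The name simple_kmeans is purely illustrative.

def simple_kmeans(X, k, n_iter=100, seed=0):
    # minimal sketch of Lloyd's algorithm; no handling of empty clusters
    rng = np.random.RandomState(seed)
    centers = X[rng.permutation(len(X))[:k]]  # random init: pick k data points as centers
    for _ in range(n_iter):
        # assignment step: nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# e.g. centers, labels = simple_kmeans(X, k=4)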

15.1.3. Code implementation

15.1.3.1. training

# required parameter: the number of clusters we want
num_clusters = 4

# training
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(X);
  • A few things we want to inspect:

    • What are the coordinates of the final centers? -> check kmeans.cluster_centers_.

    • After clustering, what is the inertia (i.e. the total within-cluster sum of squares)? -> check kmeans.inertia_.

# coordinates of the cluster centers
kmeans.cluster_centers_
array([[-1.58438467,  2.83081263],
       [-1.37324398,  7.75368871],
       [ 1.98258281,  0.86771314],
       [ 0.94973532,  4.41906906]])

15.1.3.2. Predict which cluster current/future instances belong to

# prediction
y_kmeans = kmeans.predict(X)
  • The cluster assignments look like this:

y_kmeans
array([2, 1, 3, 1, 2, 2, 0, 3, 1, 1, 0, 1, 3, 1, 2, 3, 3, 2, 0, 0, 2, 2,
       3, 0, 0, 3, 2, 3, 0, 3, 1, 1, 3, 1, 1, 1, 1, 1, 0, 2, 3, 0, 3, 3,
       0, 0, 1, 0, 1, 2, 0, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 3, 1, 0, 0, 0,
       1, 2, 1, 0, 3, 0, 1, 0, 0, 1, 0, 3, 2, 1, 2, 3, 2, 2, 1, 3, 2, 3,
       1, 1, 3, 2, 1, 0, 0, 3, 2, 2, 3, 0, 1, 2, 1, 2, 3, 2, 2, 3, 1, 3,
       0, 0, 2, 1, 2, 3, 1, 2, 2, 3, 0, 2, 0, 2, 2, 2, 2, 0, 2, 0, 1, 0,
       0, 2, 1, 0, 0, 1, 3, 1, 1, 0, 3, 0, 3, 0, 1, 3, 1, 1, 1, 3, 1, 3,
       2, 0, 1, 0, 2, 3, 1, 3, 3, 2, 3, 0, 0, 3, 2, 3, 3, 1, 2, 3, 0, 1,
       2, 2, 3, 0, 2, 3, 0, 0, 3, 3, 3, 3, 2, 1, 3, 0, 3, 3, 0, 0, 0, 3,
       0, 1, 3, 0, 2, 0, 3, 1, 0, 1, 3, 1, 3, 0, 3, 3, 1, 0, 0, 2, 2, 3,
       1, 2, 2, 0, 2, 0, 3, 1, 1, 3, 3, 1, 3, 2, 0, 3, 2, 0, 1, 0, 2, 3,
       2, 1, 1, 1, 1, 0, 0, 1, 3, 0, 2, 3, 0, 0, 0, 2, 2, 1, 3, 3, 0, 2,
       1, 0, 3, 1, 3, 2, 2, 0, 0, 3, 2, 2, 2, 3, 1, 1, 2, 2, 3, 2, 2, 2,
       1, 0, 1, 3, 2, 2, 1, 1, 1, 2, 2, 3, 1, 0], dtype=int32)

15.1.3.3. Visualization

ax = plt.gca()
ax.axis('equal')
ax.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=40, cmap='viridis');  # color points by their assigned cluster

centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c='red', s=100, alpha=0.5);
../_images/KMeans_20_0.png

15.2. Details and Discussion

15.2.1. Discussion: clustering quality

15.2.1.1. Is the clustering result actually good?

  • Look at a plot:

    • Plotting is always the first step (plot directly in 2-D; for higher dimensions, reduce the dimension first and then plot).

    • In this example the result is clearly good: the data look like 4 groups and were indeed split into 4 groups.

  • Look at a quantitative metric:

    • We can treat each center as the predicted value for its cluster and compute the total squared error, which tells us how well the clustering fits.

    • This total squared error is exactly the within-group variance (within-cluster sum of squares).

    • In k-means this quantity is called the inertia, and we can get it like this:

print("inertia(i.e. total square error; within group sum of square): ", kmeans.inertia_)
print("RMSE(i.e. sqrt(inertia/n_points): ", np.sqrt(kmeans.inertia_/X.shape[0]))
inertia(i.e. total square error; within group sum of square):  212.00599621083478
RMSE(i.e. sqrt(inertia/n_points):  0.8406465690384491
  • If we have a feel for the domain knowledge, we can interpret this number through the RMSE: on average, after clustering, each point sits about RMSE away from its cluster center.

15.2.1.2. What if the clustering is poor?

  • Sometimes one look at the plot after clustering tells you the result is a bit off, for example the moon-shaped data shown later in this chapter.

  • In that case, options include:

    • If the data are not linearly separable, switch to a nonlinear method (e.g. a kernelized version of k-means, or an algorithm that is nonlinear by nature, such as …; spectral clustering and DBSCAN both appear later in this chapter).

    • The preset number of clusters may simply be a poor choice; the methods in the "How many clusters?" section below (elbow method, silhouette score) can help pick it.

15.2.2. Discussion: EDA-related questions

15.2.2.1. Does k-means make any assumptions?

  • K-means looks simple to compute and never explicitly assumes any distribution for the data, but it does carry assumptions; they are just left unstated.

  • Continuing the example above, we mark each cluster's center and use the distance from the center to the farthest point in that cluster as a radius, which gives the figure below:

from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=20, cmap='viridis', alpha = 0.5, zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.8, zorder=1))
kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X)
../_images/KMeans_33_0.png
  • From the figure, the points inside each circle are the ones assigned to that cluster.

  • Take the two middle clusters: the yellow point in between lies between the two groups, but because it is closer to the yellow center than to the blue center, it is assigned to the yellow cluster. In other words, everything inside that circle belongs to the yellow cluster.

  • So if we treat this circle as a contour, it corresponds to a bivariate normal whose covariance matrix is diagonal with equal diagonal elements (i.e. \(\Sigma = c \cdot I\) for a constant c; the contour is a perfect circle rather than a tilted ellipse, meaning the variances are equal).

  • The assumptions of k-means can therefore be written as:

    • \((X_i | S_i = k) \sim N(\mu_k, \Sigma_k)\): given that observation i belongs to cluster k (\(S_i = k\)), it follows a multivariate normal distribution.

    • Moreover, \(\Sigma_k = \lambda_k * I, k = 1,2,...,K\): each covariance matrix is not only diagonal, but has identical diagonal elements.

    • \(X_i\) thus comes from a mixture of K multivariate normals.

  • Hence the Gaussian Mixture Model (GMM), introduced later, can be viewed as a generalization of k-means, and k-means as a special case of GMM.

  • A quick preview (see the sketch right after this list):

    • If the data are spread out as ellipses rather than circles, k-means will do poorly but GMM will do well.

    • If the data really are circular, GMM will do just as well as k-means.
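
  • A hedged preview sketch of my own (GMM is covered in detail in a later chapter): stretch the toy blobs with an arbitrary linear map so the clusters become ellipses, then compare KMeans with GaussianMixture.

from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(13)
X_stretched = np.dot(X, rng.randn(2, 2))   # arbitrary stretch: circular blobs become ellipses

labels_km  = KMeans(n_clusters=4, random_state=0).fit_predict(X_stretched)
labels_gmm = GaussianMixture(n_components=4, random_state=0).fit(X_stretched).predict(X_stretched)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_stretched[:, 0], X_stretched[:, 1], c=labels_km, s=20, cmap='viridis')
axes[0].set_title("KMeans on stretched blobs")
axes[1].scatter(X_stretched[:, 0], X_stretched[:, 1], c=labels_gmm, s=20, cmap='viridis')
axes[1].set_title("GaussianMixture on stretched blobs")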

15.2.2.2. Can k-means handle imbalanced cluster sizes?

  • Suppose we have 100 observations whose true labels we hold (2 classes), with 90 in one class and 10 in the other. Can k-means split them into two clusters for us? (A small experiment sketch follows.)
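
  • The original leaves the question open; below is a minimal experiment sketch of my own under exactly that 90/10 setup. Whether the split is recovered depends mainly on how well separated the two groups are.

# hypothetical 90/10 imbalanced data: two Gaussian groups of very different size
rng = np.random.RandomState(0)
X_imb = np.vstack([
    rng.randn(90, 2),             # majority group around (0, 0)
    rng.randn(10, 2) + [6, 6],    # minority group around (6, 6)
])
labels_imb = KMeans(n_clusters=2, random_state=0).fit_predict(X_imb)
print(np.bincount(labels_imb))    # cluster sizes; well-separated groups are usually recovered as 90/10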

15.2.2.3. How many clusters?

  • First, plot the data to build intuition.

    • In the example above, one look at the plot tells you that 4 clusters is a good choice.

    • If the features have more than 2 dimensions and cannot be plotted directly, reduce the dimension first and judge the number of clusters from a 2-D plot.

  • If the plot still does not make it obvious, go further with:

    • inertia + the elbow method:

      • In a loop, cluster into k groups for k = 1, 2, …, K.

      • Compute the inertia for each k, and look for the k where the curve has an elbow.

    • silhouette plots:

inertia_list = []
for k in range(1, 10):
    try_kmeans = KMeans(n_clusters=k)
    try_kmeans.fit(X)
    inertia_list.append(try_kmeans.inertia_)
plt.plot(range(1, 10), inertia_list, '--bo');  # plot inertia against k
../_images/KMeans_39_0.png
  • The elbow appears at k = 4, so 4 is a suitable number of clusters.

  • An example with the silhouette score follows. The computation works like this:

    • For each instance, first compute its silhouette score: (b - a) / max(b, a), where

      • a: the average distance between this instance and all points in its own cluster, i.e. how far it is from its own cluster on average.

      • b: the average distance between this instance and the nearest neighboring cluster, where the nearest neighboring cluster is defined as the one that minimizes b. Think of it as the average distance to the cluster this instance could most easily defect to.

      • The resulting value lies between -1 and 1:

        • 1 means b - a is close to b, i.e. b is much larger than a, so the instance is well placed inside its own cluster.

        • 0 means the instance sits on the decision boundary, since a and b are about equally large.

        • -1 means the instance should really be assigned to the nearest neighboring cluster, since it is closer to that cluster than to its own.

    • Averaging the silhouette score over all instances gives the silhouette coefficient.

from sklearn.metrics import silhouette_score

kmeans_per_k = [KMeans(n_clusters=k).fit(X) for k in range(2, 10)]  # start at k = 2: the silhouette score is undefined for k = 1
silhouette_scores = [silhouette_score(X, model.labels_) for model in kmeans_per_k]
plt.figure(figsize=(8, 3))
plt.plot(range(2, 10), silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
#plt.axis([1.8, 8.5, 0.55, 0.7])
#save_fig("silhouette_score_vs_k_plot")
#plt.show()
Text(0, 0.5, 'Silhouette score')
../_images/KMeans_43_1.png
  • From the plot above, k = 3, 4, or 5 all look reasonable. This is already more informative than the elbow plot.

  • Next, draw the per-cluster silhouette diagrams:

from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter
import matplotlib as mpl

plt.figure(figsize=(11, 9))

for k in (3, 4, 5, 6):
    plt.subplot(2, 2, k - 2)
    
    y_pred = kmeans_per_k[k - 2].labels_  # kmeans_per_k[0] is the k=2 model, so index is k - 2
    silhouette_coefficients = silhouette_samples(X, y_pred)

    padding = len(X) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()

        color = mpl.cm.Spectral(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (3, 5):
        plt.ylabel("Cluster")
    
    if k in (5, 6):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient")
    else:
        plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 2], color="red", linestyle="--")
    plt.title("$k={}$".format(k), fontsize=16)

#save_fig("silhouette_analysis_plot")
#plt.show()
../_images/KMeans_45_0.png
  • If plotting and the elbow method still leave the number of clusters unclear, consider:

    • Using another criterion (e.g. AIC) instead of inertia: for instance, switch to a Gaussian Mixture Model for clustering; its AIC/BIC may do a better job of choosing k.

    • Using a clustering method that picks the number of clusters for you:

      • Bayesian GMM

      • DBSCAN

      • mean-shift

      • affinity propagation

15.2.2.4. Do outliers affect k-means?
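
  • This section is left as an open question in the original. As a minimal experiment sketch of my own: since every center is a mean, a single extreme point can pull its cluster center noticeably (means are not robust to outliers).

# append one extreme (hypothetical) outlier to the toy blobs and compare the fitted centers
X_out = np.vstack([X, [[50.0, 50.0]]])
centers_clean = KMeans(n_clusters=4, random_state=0).fit(X).cluster_centers_
centers_out   = KMeans(n_clusters=4, random_state=0).fit(X_out).cluster_centers_
print(centers_clean)
print(centers_out)   # one center may be dragged toward (or devoted to) the outlier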

15.2.2.5. How to handle non-numeric features?

15.2.2.6. Is preprocessing needed before k-means?

  • Conclusions first (see the sketch after this list):

    • required: normalization (per-feature scaling).

    • optional: PCA

  • Because k-means treats every feature identically when computing the Euclidean distance between two points, putting all features on the same scale matters a lot.

  • If we normalize each feature, we effectively set every diagonal element of the covariance matrix to 1 (equal variances). The contours are still not necessarily circles; they can be tilted ellipses whose ranges along x and y are the same.

  • If we want to make sure the contours are true circles, follow up with PCA (which de-correlates the features).
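
  • A minimal sketch of the preprocessing described above, using StandardScaler (per-feature standardization) plus an optional PCA step in a pipeline. The concrete choices here are mine, not a fixed recipe, and with only 2 features in this toy data the PCA step is purely illustrative.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

preprocess_kmeans = make_pipeline(
    StandardScaler(),                      # required: put every feature on the same scale
    PCA(n_components=2),                   # optional: de-correlate so contours become circular
    KMeans(n_clusters=4, random_state=0),  # then cluster
)
preprocess_kmeans.fit(X)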

15.2.3. Discussion: modeling choices

15.2.3.1. Do the initial values matter?

  • With random initialization, k-means scatters the starting centers by randomly drawing k rows from the data (this is the 'random' init option; sklearn's default is the smarter 'k-means++').

  • So if the initial centers happen to land badly, the final clustering can be poor.

  • For example, suppose we already know a bad set of initial values:

# a known bad set of initial centers
rng = np.random.RandomState(0)
i = rng.permutation(X.shape[0])[:num_clusters]
print("initial centers are taken from these 4 rows of the data: ", i)
print("initial center coordinates:")
print(X[i])
initial centers are taken from these 4 rows of the data:  [208 188  12 221]
initial center coordinates:
[[1.07627418 4.68480619]
 [2.47019077 1.31451315]
 [1.24258802 4.50399192]
 [2.5270643  0.6178122 ]]
plt.scatter(
    x = X[:, 0], 
    y = X[:, 1], 
    s = 50, 
    cmap='viridis'
)

plt.scatter(X[i][:, 0], X[i][:, 1], c='red', s=200, alpha=0.9);
../_images/KMeans_57_0.png
  • If the algorithm starts from that set of initial centers, the result turns out badly:

kmeans_bad = KMeans(
    n_init = 1, # default is 10
    n_clusters = num_clusters,
    init = X[i] # use the bad initial centers from above
)
kmeans_bad.fit(X)
y_pred_bad = kmeans_bad.predict(X)

plt.scatter(
    x = X[:, 0], 
    y = X[:, 1], 
    c = y_pred_bad, 
    s = 50, 
    cmap='viridis'
)

centers_bad = kmeans_bad.cluster_centers_
plt.scatter(centers_bad[:, 0], centers_bad[:, 1], c='red', s=200, alpha=0.5);
../_images/KMeans_59_0.png
  • As you can see, the clustering result is not good.

  • In practice, though, this is not something to worry much about,

  • because sklearn knows about this weakness of k-means and therefore:

    • by default tries 10 sets of initial centers (controlled by the n_init = 10 parameter);

    • runs a full k-means clustering for each initialization and records its inertia;

    • returns only the best of the 10 runs, i.e. the one with the smallest inertia (a small comparison sketch follows after this list).

  • Other ways to choose initial values: see another book, from p. 237 onward.
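
  • As a small illustration of my own, the snippet below compares a single random initialization against sklearn's default 'k-means++' with 10 restarts, using the resulting inertia:

km_single  = KMeans(n_clusters=num_clusters, init='random', n_init=1, random_state=0).fit(X)
km_default = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, random_state=0).fit(X)
print("inertia with a single random init :", km_single.inertia_)
print("inertia with k-means++ x 10 inits :", km_default.inertia_)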

15.2.3.2. Use ? to get the full details of the function

?KMeans
Init signature:
KMeans(
    n_clusters=8,
    *,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    verbose=0,
    random_state=None,
    copy_x=True,
    algorithm='auto',
)
Docstring:     
K-Means clustering.

Read more in the :ref:`User Guide <k_means>`.

Parameters
----------

n_clusters : int, default=8
    The number of clusters to form as well as the number of
    centroids to generate.

init : {'k-means++', 'random'}, callable or array-like of shape             (n_clusters, n_features), default='k-means++'
    Method for initialization:

    'k-means++' : selects initial cluster centers for k-mean
    clustering in a smart way to speed up convergence. See section
    Notes in k_init for more details.

    'random': choose `n_clusters` observations (rows) at random from data
    for the initial centroids.

    If an array is passed, it should be of shape (n_clusters, n_features)
    and gives the initial centers.

    If a callable is passed, it should take arguments X, n_clusters and a
    random state and return an initialization.

n_init : int, default=10
    Number of time the k-means algorithm will be run with different
    centroid seeds. The final results will be the best output of
    n_init consecutive runs in terms of inertia.

max_iter : int, default=300
    Maximum number of iterations of the k-means algorithm for a
    single run.

tol : float, default=1e-4
    Relative tolerance with regards to Frobenius norm of the difference
    in the cluster centers of two consecutive iterations to declare
    convergence.

verbose : int, default=0
    Verbosity mode.

random_state : int, RandomState instance or None, default=None
    Determines random number generation for centroid initialization. Use
    an int to make the randomness deterministic.
    See :term:`Glossary <random_state>`.

copy_x : bool, default=True
    When pre-computing distances it is more numerically accurate to center
    the data first. If copy_x is True (default), then the original data is
    not modified. If False, the original data is modified, and put back
    before the function returns, but small numerical differences may be
    introduced by subtracting and then adding the data mean. Note that if
    the original data is not C-contiguous, a copy will be made even if
    copy_x is False. If the original data is sparse, but not in CSR format,
    a copy will be made even if copy_x is False.

algorithm : {"auto", "full", "elkan"}, default="auto"
    K-means algorithm to use. The classical EM-style algorithm is "full".
    The "elkan" variation is more efficient on data with well-defined
    clusters, by using the triangle inequality. However it's more memory
    intensive due to the allocation of an extra array of shape
    (n_samples, n_clusters).

    For now "auto" (kept for backward compatibility) chooses "elkan" but it
    might change in the future for a better heuristic.

    .. versionchanged:: 0.18
        Added Elkan algorithm

Attributes
----------
cluster_centers_ : ndarray of shape (n_clusters, n_features)
    Coordinates of cluster centers. If the algorithm stops before fully
    converging (see ``tol`` and ``max_iter``), these will not be
    consistent with ``labels_``.

labels_ : ndarray of shape (n_samples,)
    Labels of each point

inertia_ : float
    Sum of squared distances of samples to their closest cluster center,
    weighted by the sample weights if provided.

n_iter_ : int
    Number of iterations run.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
MiniBatchKMeans : Alternative online implementation that does incremental
    updates of the centers positions using mini-batches.
    For large scale learning (say n_samples > 10k) MiniBatchKMeans is
    probably much faster than the default batch implementation.

Notes
-----
The k-means problem is solved using either Lloyd's or Elkan's algorithm.

The average complexity is given by O(k n T), where n is the number of
samples and T is the number of iteration.

The worst case complexity is given by O(n^(k+2/p)) with
n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii,
'How slow is the k-means method?' SoCG2006)

In practice, the k-means algorithm is very fast (one of the fastest
clustering algorithms available), but it falls in local minima. That's why
it can be useful to restart it several times.

If the algorithm stops before fully converging (because of ``tol`` or
``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
i.e. the ``cluster_centers_`` will not be the means of the points in each
cluster. Also, the estimator will reassign ``labels_`` after the last
iteration to make ``labels_`` consistent with ``predict`` on the training
set.

Examples
--------

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
File:           /Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py
Type:           type
Subclasses:     MiniBatchKMeans

15.2.3.3. How to make k-means nonlinear?

  • K-means works by measuring the distance from each point to a center, so conceptually it draws "circular" decision boundaries and can only separate clusters linearly.

  • So on the classic moon dataset below, k-means does poorly:

from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)

labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
../_images/KMeans_65_0.png
  • K-means splits the data into a left half and a right half, but the right answer is the upper moon and the lower moon.

  • There are two ways to improve this:

    • Project the data into a higher-dimensional space first, then run k-means on the high-dimensional data (conceptually the same idea as kernel methods, e.g. in SVM).

    • Switch to a method that can cluster nonlinearly (spectral clustering is used below; DBSCAN is another option).

  • Here we try the kernel-style route:

from sklearn.cluster import SpectralClustering
model = SpectralClustering(
    n_clusters=2, 
    affinity='nearest_neighbors',
    assign_labels='kmeans'
)

labels = model.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/manifold/_spectral_embedding.py:260: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn(
../_images/KMeans_67_1.png

15.2.3.4. Can k-means handle big data?

  • At every iteration, k-means has to recompute the cluster means.

  • That means it repeatedly computes mean vectors over *all* of the data.

  • With a lot of data, this becomes slow.

  • The remedy is:

    • use batch-based (mini-batch) k-means,

    • i.e. at each iteration update the mean vectors using only one batch of data.

    • In sklearn this is handled by sklearn.cluster.MiniBatchKMeans; a minimal sketch follows below.
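
  • No MiniBatchKMeans example appears later in this chapter, so here is a minimal sketch of my own; the dataset size and batch_size are arbitrary choices.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# a larger synthetic dataset, just to exercise the mini-batch variant
X_big, _ = make_blobs(n_samples=100_000, centers=4, cluster_std=0.6, random_state=0)

mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=0)
mbk.fit(X_big)
print(mbk.cluster_centers_)
print(mbk.inertia_)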

15.2.4. Discussion: downstream applications

15.2.4.1. Can k-means cluster signals?

  • Yes; the stock-movement example below shows this.

15.2.4.2. Can k-means cluster images?

  • Yes; the digit-clustering example below shows this.

15.3. Recommended workflow

  • EDA:

  • Modeling:

15.4. Real world examples

15.4.1. [Signal clustering] Stock Movements

  • This is a fun application: we cluster stocks by the shape of their price "movements".

  • The data look like this:

movements = pd.read_csv("data/company-stock-movements.csv", index_col = 0)
movements.head()
2010-01-04 2010-01-05 2010-01-06 2010-01-07 2010-01-08 2010-01-11 2010-01-12 2010-01-13 2010-01-14 2010-01-15 ... 2013-10-16 2013-10-17 2013-10-18 2013-10-21 2013-10-22 2013-10-23 2013-10-24 2013-10-25 2013-10-28 2013-10-29
Apple 0.580000 -0.220005 -3.409998 -1.170000 1.680011 -2.689994 -1.469994 2.779997 -0.680003 -4.999995 ... 0.320008 4.519997 2.899987 9.590019 -6.540016 5.959976 6.910011 -5.359962 0.840019 -19.589981
AIG -0.640002 -0.650000 -0.210001 -0.420000 0.710001 -0.200001 -1.130001 0.069999 -0.119999 -0.500000 ... 0.919998 0.709999 0.119999 -0.480000 0.010002 -0.279998 -0.190003 -0.040001 -0.400002 0.660000
Amazon -2.350006 1.260009 -2.350006 -2.009995 2.960006 -2.309997 -1.640007 1.209999 -1.790001 -2.039994 ... 2.109985 3.699982 9.570008 -3.450013 4.820008 -4.079986 2.579986 4.790009 -1.760009 3.740021
American express 0.109997 0.000000 0.260002 0.720002 0.190003 -0.270001 0.750000 0.300004 0.639999 -0.130001 ... 0.680001 2.290001 0.409996 -0.069999 0.100006 0.069999 0.130005 1.849999 0.040001 0.540001
Boeing 0.459999 1.770000 1.549999 2.690003 0.059997 -1.080002 0.360000 0.549999 0.530002 -0.709999 ... 1.559997 2.480003 0.019997 -1.220001 0.480003 3.020004 -0.029999 1.940002 1.130005 0.309998

5 rows × 963 columns

  • Each row is a company, each column is a date, and the values are daily price movements (the prices have already been adjusted/differenced somehow, which is why there are positive and negative values; we won't worry about the details and simply treat them as prices).

  • Let's take these first 5 companies and look at their movement curves.

  • If the goal were to cluster on the raw absolute values, we could cluster this table directly,

  • because the Euclidean distance between two rows measures how close the two signals are, so nearby signals end up in the same cluster.

  • But if we want to cluster by the "shape" of the movement, we should rescale each row.

    • For example, suppose TSMC's prices move as 600, 580, 600, 620, 640, while a small stock (啟基) moves as 60, 58, 60, 62, 64.

    • Shape-wise the two stocks move identically and should land in the same cluster, but plain k-means would never put them together, since the distance between 600 and 60 alone is already huge.

    • Also, TSMC's price changes are -20, 20, 20, 20 while the small stock's are -2, 2, 2, 2, so the absolute changes differ as well; but expressed as percentage changes (price in the denominator, change in the numerator) the two are again almost the same.

    • So if we first standardize each row, both companies become roughly [-0.39, -1.37, -0.39, 0.59, 1.57]: exactly the same, their Euclidean distance becomes 0, and they will certainly be clustered together.

  • For this example we therefore rescale the rows, using the Normalizer preprocessor (note that sklearn's Normalizer rescales each row to unit norm rather than z-scoring it, but either way each company's row ends up on a comparable scale). A small numeric sketch of the row-scaling idea comes right after this list, followed by the actual Normalizer code.
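
  • A tiny numeric sketch of my own, using the two price series from the example above. Here the small stock is exactly one tenth of the large one, so both per-row z-scoring and sklearn's unit-norm Normalizer make the two rows identical.

from sklearn.preprocessing import Normalizer

prices = np.array([
    [600, 580, 600, 620, 640],   # large-cap stock
    [ 60,  58,  60,  62,  64],   # small-cap stock with the same shape
], dtype=float)

# per-row z-score: both rows become [-0.39, -1.37, -0.39, 0.59, 1.57]
z = (prices - prices.mean(axis=1, keepdims=True)) / prices.std(axis=1, keepdims=True)
print(z)

# sklearn's Normalizer (unit L2 norm per row): the two rows also become identical here
print(Normalizer().fit_transform(prices))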

from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline, make_pipeline

# Create a normalizer: normalizer
normalizer = Normalizer()
normalizer.fit(movements)
movements_norm = normalizer.transform(movements)
movements.head()
2010-01-04 2010-01-05 2010-01-06 2010-01-07 2010-01-08 2010-01-11 2010-01-12 2010-01-13 2010-01-14 2010-01-15 ... 2013-10-16 2013-10-17 2013-10-18 2013-10-21 2013-10-22 2013-10-23 2013-10-24 2013-10-25 2013-10-28 2013-10-29
Apple 0.580000 -0.220005 -3.409998 -1.170000 1.680011 -2.689994 -1.469994 2.779997 -0.680003 -4.999995 ... 0.320008 4.519997 2.899987 9.590019 -6.540016 5.959976 6.910011 -5.359962 0.840019 -19.589981
AIG -0.640002 -0.650000 -0.210001 -0.420000 0.710001 -0.200001 -1.130001 0.069999 -0.119999 -0.500000 ... 0.919998 0.709999 0.119999 -0.480000 0.010002 -0.279998 -0.190003 -0.040001 -0.400002 0.660000
Amazon -2.350006 1.260009 -2.350006 -2.009995 2.960006 -2.309997 -1.640007 1.209999 -1.790001 -2.039994 ... 2.109985 3.699982 9.570008 -3.450013 4.820008 -4.079986 2.579986 4.790009 -1.760009 3.740021
American express 0.109997 0.000000 0.260002 0.720002 0.190003 -0.270001 0.750000 0.300004 0.639999 -0.130001 ... 0.680001 2.290001 0.409996 -0.069999 0.100006 0.069999 0.130005 1.849999 0.040001 0.540001
Boeing 0.459999 1.770000 1.549999 2.690003 0.059997 -1.080002 0.360000 0.549999 0.530002 -0.709999 ... 1.559997 2.480003 0.019997 -1.220001 0.480003 3.020004 -0.029999 1.940002 1.130005 0.309998

5 rows × 963 columns

  • Finally, we cluster the companies and get the following result:

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)
Pipeline(steps=[('normalizer', Normalizer()),
                ('kmeans', KMeans(n_clusters=10))])
# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 
                   'companies': movements.index})

# Display df sorted by cluster label
df.sort_values(["labels"])
labels companies
57 0 Exxon
44 0 Schlumberger
42 0 Royal Dutch Shell
10 0 ConocoPhillips
53 0 Valero Energy
12 0 Chevron
51 1 Texas instruments
31 1 McDonalds
50 1 Taiwan Semiconductor Manufacturing
23 1 IBM
32 1 3M
20 1 Home Depot
47 1 Symantec
30 1 MasterCard
54 1 Walgreen
35 1 Navistar
8 1 Caterpillar
4 1 Boeing
13 1 DuPont de Nemours
58 1 Xerox
15 2 Ford
16 2 General Electrics
55 2 Wells Fargo
18 2 Goldman Sachs
5 2 Bank of America
3 2 American express
1 2 AIG
26 2 JPMorgan Chase
36 3 Northrop Grumman
29 3 Lookheed Martin
27 4 Kimberly-Clark
25 4 Johnson & Johnson
9 4 Colgate-Palmolive
56 4 Wal-Mart
38 4 Pepsi
40 4 Procter Gamble
41 4 Philip Morris
28 4 Coca Cola
52 5 Unilever
49 5 Total
19 5 GlaxoSmithKline
37 5 Novartis
39 5 Pfizer
6 5 British American Tobacco
43 5 SAP
46 5 Sanofi-Aventis
0 6 Apple
33 6 Microsoft
24 6 Intel
11 6 Cisco
34 7 Mitsubishi
48 7 Toyota
21 7 Honda
7 7 Canon
45 7 Sony
17 8 Google/Alphabet
2 8 Amazon
59 8 Yahoo
22 9 HP
14 9 Dell

15.4.2. [Image clustering] Number Digits

15.4.2.1. Data

from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
(1797, 64)
  • We use the Number Digits data again, but this time the goal is to ignore y and use only X (64 features), and see whether the data can be split into 10 clusters.

  • Before jumping straight into k-means, let's first check whether 10 clusters are easy to spot once the data are projected down to 2 dimensions.

  • Projecting onto (PC1, PC2), (PC2, PC3) and (PC1, PC3), the separation does not look that clean (the projection code is sketched below).
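
  • The projection code is not shown in the original; a minimal sketch of my own of what such a check could look like:

from sklearn.decomposition import PCA

pcs = PCA(n_components=3).fit_transform(digits.data)
pairs = [(0, 1), (1, 2), (0, 2)]  # (PC1, PC2), (PC2, PC3), (PC1, PC3)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax_, (i, j) in zip(axes, pairs):
    ax_.scatter(pcs[:, i], pcs[:, j], c=digits.target, s=8, cmap='viridis')
    ax_.set(xlabel="PC{}".format(i + 1), ylabel="PC{}".format(j + 1))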

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
(10, 64)
  • Let's see what these 10 centers look like as images:

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
../_images/KMeans_97_0.png
  • The 10 cluster centers all look like plausible digits, although 1 and 8 seem hard to tell apart.

  • Since we actually have the answers, we can check against them.

  • But we only know which cluster each image falls into, not which digit (0-9) a cluster represents, so we assign labels as follows:

    • For the images in cluster i, take their true labels and use the mode of those labels to represent the cluster.

    • For example, if the cluster in the top-left panel contains 1,000 images and the mode of their labels is 4, we name that cluster 4.

  • The result:

from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print("各張照片被分到第幾群: ", clusters)
print("該群代表的數字 label: ", labels)
各張照片被分到第幾群:  [5 7 7 ... 7 3 3]
該群代表的數字 label:  [0 8 8 ... 8 9 9]
  • Now we can check against the answers. First, the accuracy:

from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
0.7935447968836951
  • Not surprising: judging from the centers, every digit except 1 and 8 looked quite distinct, so about 80% of the data are clustered well.

  • Let's confirm with a confusion matrix:

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
../_images/KMeans_105_0.png
  • 1 and 8 are confused most often (no surprise).

  • 9 and 5 are also separated poorly.

  • So are 9 and 8.

  • If we reduce the raw data to 2 dimensions, we can also see that these digits overlap heavily.

  • To improve k-means, the next step would be kernel k-means.

  • But here we try a different approach instead, just for fun:

  • first run t-SNE on the data and see whether the classes spread out enough for plain k-means to separate them:

from sklearn.manifold import TSNE

# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
digits_proj.shape
(1797, 2)
plt.scatter(
    digits_proj[:, 0], 
    digits_proj[:, 1], 
    c=digits.target,
    cmap='viridis'
);
../_images/KMeans_111_0.png
  • The digits now look well separated, and circular boundaries are enough to carve them apart one by one.

  • Let's see how well k-means can do:

# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Compute the accuracy
accuracy_score(digits.target, labels)
0.9432387312186978
  • The accuracy reaches 94%.

  • Getting that without ever seeing the labels is pretty great!

  • One caveat: in practice this is hard to use, because

    • t-SNE can only transform the samples at hand; when a new image arrives, there is no way to project its 64 dimensions into the 2-D space we see here.

    • So we can only cluster the images we already have, not future ones.

15.4.3. [Image clustering] Olivetti faces

from sklearn.datasets import fetch_olivetti_faces

olivetti = fetch_olivetti_faces()

print(olivetti.target)
print(olivetti.data.shape)
[ 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  2  2  2  2
  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9
  9  9  9  9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16
 16 16 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 19 19
 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21
 21 21 21 21 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23
 24 24 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 26 26 26 26
 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28
 28 28 29 29 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 31 31
 31 31 31 31 31 31 31 31 32 32 32 32 32 32 32 32 32 32 33 33 33 33 33 33
 33 33 33 33 34 34 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 35 35
 36 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37 37 37 38 38 38 38
 38 38 38 38 38 38 39 39 39 39 39 39 39 39 39 39]
(400, 4096)
  • The Olivetti faces dataset has 40 people, each photographed 10 times: 400 grayscale 64x64 face images in total.

  • Each image has been flattened into a 4096-dimensional 1-D vector,

  • and every image has already been normalized to the 0-1 range.

  • This dataset is commonly used to train models that recognize which person is in a photo.

  • Here we practice clustering instead: after k-means, do the faces within each cluster look alike?

# first use stratified random sampling to split the data into train, valid, and test

from sklearn.model_selection import StratifiedShuffleSplit

strat_split = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)
train_valid_idx, test_idx = next(strat_split.split(olivetti.data, olivetti.target))
X_train_valid = olivetti.data[train_valid_idx]
y_train_valid = olivetti.target[train_valid_idx]
X_test = olivetti.data[test_idx]
y_test = olivetti.target[test_idx]

strat_split = StratifiedShuffleSplit(n_splits=1, test_size=80, random_state=43)
train_idx, valid_idx = next(strat_split.split(X_train_valid, y_train_valid))
X_train = X_train_valid[train_idx]
y_train = y_train_valid[train_idx]
X_valid = X_train_valid[valid_idx]
y_valid = y_train_valid[valid_idx]
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
(280, 4096) (280,)
(80, 4096) (80,)
(40, 4096) (40,)
  • First run PCA to bring the dimension down from 4096 (this step plays the same role as extracting features with an auto-encoder):

from sklearn.decomposition import PCA

pca = PCA(0.99)
X_train_pca = pca.fit_transform(X_train)
X_valid_pca = pca.transform(X_valid)
X_test_pca = pca.transform(X_test)

pca.n_components_
199
  • The 4096 dimensions are reduced to 199.

  • Next, run k-means:

from sklearn.cluster import KMeans

k_range = range(5, 150, 5)
kmeans_per_k = []
for k in k_range:
    print("k={}".format(k))
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X_train_pca)
    kmeans_per_k.append(kmeans)
k=5
k=10
k=15
k=20
k=25
k=30
k=35
k=40
k=45
k=50
k=55
k=60
k=65
k=70
k=75
k=80
k=85
k=90
k=95
k=100
k=105
k=110
k=115
k=120
k=125
k=130
k=135
k=140
k=145
  • Next, use silhouette_score to pick the best number of clusters:

from sklearn.metrics import silhouette_score

silhouette_scores = [silhouette_score(X_train_pca, model.labels_)
                     for model in kmeans_per_k]
best_index = np.argmax(silhouette_scores)
best_k = k_range[best_index]
best_score = silhouette_scores[best_index]

print(best_k)

plt.figure(figsize=(8, 3))
plt.plot(k_range, silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.plot(best_k, best_score, "rs");
120
../_images/KMeans_125_1.png
  • The best number of clusters turns out to be 120.

  • Using inertia instead is even less conclusive:

inertias = [model.inertia_ for model in kmeans_per_k]
best_inertia = inertias[best_index]

plt.figure(figsize=(8, 3.5))
plt.plot(k_range, inertias, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.plot(best_k, best_inertia, "rs")
[<matplotlib.lines.Line2D at 0x1302e0be0>]
../_images/KMeans_127_1.png
  • Ending up with 120 clusters rather than the 40 we expected (there are 40 people) is not surprising, because the same person can look different across photos (e.g. sometimes wearing glasses, sometimes not, or the face shifted to the left or right).

  • We keep this clustering model and check whether the photos within each cluster look alike:

best_model = kmeans_per_k[best_index]
def plot_faces(faces, labels, n_cols=5):
    faces = faces.reshape(-1, 64, 64)
    n_rows = (len(faces) - 1) // n_cols + 1
    plt.figure(figsize=(n_cols, n_rows * 1.1))
    for index, (face, label) in enumerate(zip(faces, labels)):
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(face, cmap="gray")
        plt.axis("off")
        plt.title(label)
    plt.show()

for cluster_id in np.unique(best_model.labels_):
    print("Cluster", cluster_id)
    in_cluster = best_model.labels_==cluster_id
    faces = X_train[in_cluster]
    labels = y_train[in_cluster]
    plot_faces(faces, labels)
Cluster 0
../_images/KMeans_130_1.png
Cluster 1
../_images/KMeans_130_3.png
Cluster 2
../_images/KMeans_130_5.png
Cluster 3
../_images/KMeans_130_7.png
Cluster 4
../_images/KMeans_130_9.png
Cluster 5
../_images/KMeans_130_11.png
Cluster 6
../_images/KMeans_130_13.png
Cluster 7
../_images/KMeans_130_15.png
Cluster 8
../_images/KMeans_130_17.png
Cluster 9
../_images/KMeans_130_19.png
Cluster 10
../_images/KMeans_130_21.png
Cluster 11
../_images/KMeans_130_23.png
Cluster 12
../_images/KMeans_130_25.png
Cluster 13
../_images/KMeans_130_27.png
Cluster 14
../_images/KMeans_130_29.png
Cluster 15
../_images/KMeans_130_31.png
Cluster 16
../_images/KMeans_130_33.png
Cluster 17
../_images/KMeans_130_35.png
Cluster 18
../_images/KMeans_130_37.png
Cluster 19
../_images/KMeans_130_39.png
Cluster 20
../_images/KMeans_130_41.png
Cluster 21
../_images/KMeans_130_43.png
Cluster 22
../_images/KMeans_130_45.png
Cluster 23
../_images/KMeans_130_47.png
Cluster 24
../_images/KMeans_130_49.png
Cluster 25
../_images/KMeans_130_51.png
Cluster 26
../_images/KMeans_130_53.png
Cluster 27
../_images/KMeans_130_55.png
Cluster 28
../_images/KMeans_130_57.png
Cluster 29
../_images/KMeans_130_59.png
Cluster 30
../_images/KMeans_130_61.png
Cluster 31
../_images/KMeans_130_63.png
Cluster 32
../_images/KMeans_130_65.png
Cluster 33
../_images/KMeans_130_67.png
Cluster 34
../_images/KMeans_130_69.png
Cluster 35
../_images/KMeans_130_71.png
Cluster 36
../_images/KMeans_130_73.png
Cluster 37
../_images/KMeans_130_75.png
Cluster 38
../_images/KMeans_130_77.png
Cluster 39
../_images/KMeans_130_79.png
Cluster 40
../_images/KMeans_130_81.png
Cluster 41
../_images/KMeans_130_83.png
Cluster 42
../_images/KMeans_130_85.png
Cluster 43
../_images/KMeans_130_87.png
Cluster 44
../_images/KMeans_130_89.png
Cluster 45
../_images/KMeans_130_91.png
Cluster 46
../_images/KMeans_130_93.png
Cluster 47
../_images/KMeans_130_95.png
Cluster 48
../_images/KMeans_130_97.png
Cluster 49
../_images/KMeans_130_99.png
Cluster 50
../_images/KMeans_130_101.png
Cluster 51
../_images/KMeans_130_103.png
Cluster 52
../_images/KMeans_130_105.png
Cluster 53
../_images/KMeans_130_107.png
Cluster 54
../_images/KMeans_130_109.png
Cluster 55
../_images/KMeans_130_111.png
Cluster 56
../_images/KMeans_130_113.png
Cluster 57
../_images/KMeans_130_115.png
Cluster 58
../_images/KMeans_130_117.png
Cluster 59
../_images/KMeans_130_119.png
Cluster 60
../_images/KMeans_130_121.png
Cluster 61
../_images/KMeans_130_123.png
Cluster 62
../_images/KMeans_130_125.png
Cluster 63
../_images/KMeans_130_127.png
Cluster 64
../_images/KMeans_130_129.png
Cluster 65
../_images/KMeans_130_131.png
Cluster 66
../_images/KMeans_130_133.png
Cluster 67
../_images/KMeans_130_135.png
Cluster 68
../_images/KMeans_130_137.png
Cluster 69
../_images/KMeans_130_139.png
Cluster 70
../_images/KMeans_130_141.png
Cluster 71
../_images/KMeans_130_143.png
Cluster 72
../_images/KMeans_130_145.png
Cluster 73
../_images/KMeans_130_147.png
Cluster 74
../_images/KMeans_130_149.png
Cluster 75
../_images/KMeans_130_151.png
Cluster 76
../_images/KMeans_130_153.png
Cluster 77
../_images/KMeans_130_155.png
Cluster 78
../_images/KMeans_130_157.png
Cluster 79
../_images/KMeans_130_159.png
Cluster 80
../_images/KMeans_130_161.png
Cluster 81
../_images/KMeans_130_163.png
Cluster 82
../_images/KMeans_130_165.png
Cluster 83
../_images/KMeans_130_167.png
Cluster 84
../_images/KMeans_130_169.png
Cluster 85
../_images/KMeans_130_171.png
Cluster 86
../_images/KMeans_130_173.png
Cluster 87
../_images/KMeans_130_175.png
Cluster 88
../_images/KMeans_130_177.png
Cluster 89
../_images/KMeans_130_179.png
Cluster 90
../_images/KMeans_130_181.png
Cluster 91
../_images/KMeans_130_183.png
Cluster 92
../_images/KMeans_130_185.png
Cluster 93
../_images/KMeans_130_187.png
Cluster 94
../_images/KMeans_130_189.png
Cluster 95
../_images/KMeans_130_191.png
Cluster 96
../_images/KMeans_130_193.png
Cluster 97
../_images/KMeans_130_195.png
Cluster 98
../_images/KMeans_130_197.png
Cluster 99
../_images/KMeans_130_199.png
Cluster 100
../_images/KMeans_130_201.png
Cluster 101
../_images/KMeans_130_203.png
Cluster 102
../_images/KMeans_130_205.png
Cluster 103
../_images/KMeans_130_207.png
Cluster 104
../_images/KMeans_130_209.png
Cluster 105
../_images/KMeans_130_211.png
Cluster 106
../_images/KMeans_130_213.png
Cluster 107
../_images/KMeans_130_215.png
Cluster 108
../_images/KMeans_130_217.png
Cluster 109
../_images/KMeans_130_219.png
Cluster 110
../_images/KMeans_130_221.png
Cluster 111
../_images/KMeans_130_223.png
Cluster 112
../_images/KMeans_130_225.png
Cluster 113
../_images/KMeans_130_227.png
Cluster 114
../_images/KMeans_130_229.png
Cluster 115
../_images/KMeans_130_231.png
Cluster 116
../_images/KMeans_130_233.png
Cluster 117
../_images/KMeans_130_235.png
Cluster 118
../_images/KMeans_130_237.png
Cluster 119
../_images/KMeans_130_239.png
  • Roughly two thirds of the clusters are useful, in the sense that they contain at least two photos of the same person.

  • The remaining clusters either contain only a single photo or mix several people, and are less useful.

  • Clustering photos like this is far from accurate, but its biggest value is speeding up labeling: labeling on top of such a grouping goes much faster.

15.4.4. [Image classification preprocessing] Olivetti

  • Continuing from above, we now turn this into a classification problem and use the k-means cluster assignment as a new feature, to see whether it helps.

  • First train a random forest classifier (using the same 199 PCA dimensions as features):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=150, random_state=42)
clf.fit(X_train_pca, y_train)
clf.score(X_valid_pca, y_valid)
0.925
  • The prediction accuracy is 92.5%.

  • Next, add the cluster assignment as a feature and see whether the result improves:

X_train_extended = np.c_[X_train_pca, best_model.predict(X_train_pca)]
X_valid_extended = np.c_[X_valid_pca, best_model.predict(X_valid_pca)]
X_test_extended = np.c_[X_test_pca, best_model.predict(X_test_pca)]
clf = RandomForestClassifier(n_estimators=150, random_state=42)
clf.fit(X_train_extended, y_train)
clf.score(X_valid_extended, y_valid)
0.9375
  • A slight improvement.

15.4.5. [Semi-supervised image learning] Number Digits

15.4.6. [Image segmentation]

  • Here we introduce the simplest kind of image segmentation: color segmentation.

  • First, we have an image that looks like this:

from matplotlib.image import imread

image = plt.imread("images/ladybug.png")
print(image.shape)
plt.imshow(image)
plt.axis('off')
(533, 800, 3)
(-0.5, 799.5, 532.5, -0.5)
../_images/KMeans_142_2.png
  • This is a (533, 800, 3) image.

  • Now treat every pixel as a sample and its RGB values as its features.

  • The image then turns into the tabular data familiar from machine learning, with shape (533x800, 3):

X = image.reshape(-1, 3)
X
array([[0.09803922, 0.11372549, 0.00784314],
       [0.09411765, 0.10980392, 0.00392157],
       [0.09411765, 0.11372549, 0.        ],
       ...,
       [0.03921569, 0.22745098, 0.        ],
       [0.01960784, 0.20392157, 0.        ],
       [0.00784314, 0.1882353 , 0.        ]], dtype=float32)
  • For any two pixels, the closer they are (in Euclidean distance over the R, G, B features), the more likely they are to end up in the same group.

  • So running k-means on this image with 2 clusters amounts to reducing the image to just two colors.

  • Like this:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# the cluster each pixel belongs to
print(kmeans.predict(X))

# the centers of the two clusters
kmeans.cluster_centers_
[1 1 1 ... 1 1 1]
array([[0.86087525, 0.8004718 , 0.10489635],
       [0.1336768 , 0.26555252, 0.04000121]], dtype=float32)
  • From the output above, the first pixel is assigned to cluster 1, whose center is [0.13, 0.27, 0.04]; those are RGB values, i.e. the representative color of that cluster.

  • Replacing every pixel with the representative color of its cluster splits the original image into two classes:

segmented_img = kmeans.cluster_centers_[kmeans.predict(X)]
segmented_img = segmented_img.reshape(image.shape)
plt.imshow(segmented_img);
plt.axis("off");
../_images/KMeans_148_0.png
  • The image is now rendered with only two colors.

  • We can go further and look at the effect of the number of clusters:

segmented_imgs = []
n_colors = (10, 8, 6, 4, 2)
for n_clusters in n_colors:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    segmented_img = kmeans.cluster_centers_[kmeans.labels_]
    segmented_imgs.append(segmented_img.reshape(image.shape))
plt.figure(figsize=(10,5))
plt.subplots_adjust(wspace=0.05, hspace=0.1)

plt.subplot(231)
plt.imshow(image)
plt.title("Original image")
plt.axis('off')

for idx, n_clusters in enumerate(n_colors):
    plt.subplot(232 + idx)
    plt.imshow(segmented_imgs[idx])
    plt.title("{} colors".format(n_clusters))
    plt.axis('off')

plt.show()
../_images/KMeans_151_0.png