CLUSTERING
Definition
Clustering is a method for grouping data. It can be described as the process of grouping data into clusters such that data within one cluster are maximally similar, while data in different clusters are minimally similar.
Clustering partitions a set of data objects into subsets called clusters. Objects within a cluster share similar characteristics and differ from objects in other clusters. The partitioning is not done manually but by a clustering algorithm, which is what makes clustering useful: it can discover previously unknown groups in the data. Clustering is widely used in applications such as business intelligence, image pattern recognition, web search, biology, and security. In business intelligence, clustering can organize a large number of customers into groups, for example grouping customers into clusters with strongly similar characteristics. Clustering is also known as data segmentation, because it partitions large data sets into groups based on similarity. In addition, clustering can serve as a form of outlier detection.
The K-Means Clustering Method
K-Means is a clustering algorithm that belongs to unsupervised learning, which means it takes input data without class labels. In general, the k-means method performs clustering with the following procedure (a minimal code sketch appears after the lists below):

1. Determine the number of clusters.
2. Randomly assign the data to the existing clusters.
3. Compute the mean of each cluster from the data currently assigned to it.
4. Reassign all data points to those clusters.
5. Repeat from step 3 until nothing changes, or until the change falls below a threshold.

This basic procedure can vary with the data-assignment approach that is applied, whether *crisp* or *fuzzy*. After examining clustering from other angles, I found that *k-means clustering* has several weaknesses. The purpose of the algorithm is to group data into several clusters, and its characteristics are:

- It works on n data points.
- The input is the number of data points and the number of clusters (k).
- Each cluster has a centroid that represents that cluster.
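To make the procedure concrete, here is a minimal sketch of crisp k-means in NumPy. This code is illustrative rather than part of the original text: the function name `kmeans_sketch` and its parameters are made up, and it assumes no cluster ever ends up empty.

```python
import numpy as np

def kmeans_sketch(data, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal crisp k-means following the five steps above (illustrative)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: k is given; start from k randomly chosen points as centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 4: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its current members
        # (simplification: assumes every cluster keeps at least one member)
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids move less than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```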
K-Means Formula
K-means measures how close a data point is to a centroid using the Euclidean distance between two points $x$ and $y$ over $n$ features:

$$ d(x,y) = \|x - y\| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
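As a quick sanity check of the formula, the snippet below (added for illustration, with made-up points) computes the distance both by hand and with NumPy's built-in norm:

```python
import numpy as np

# Two made-up points with n = 3 features
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# d(x, y) = sqrt(sum_i (x_i - y_i)^2)
d = np.sqrt(np.sum((x - y) ** 2))
print(d)                      # 5.0
print(np.linalg.norm(x - y))  # same result via the built-in Euclidean norm
```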
Implementing K-Means in Python
The example below benchmarks several k-means initializations (k-means++, random, and PCA-based) on scikit-learn's digits dataset, then visualizes the clustering on PCA-reduced data:

```python
from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))

print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))


bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)

# In this case the seeding of the centers is deterministic, hence we run the
# k-means algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based", data=data)
print(82 * '_')

# Visualize the results on PCA-reduced data
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh. Use the last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
```
```
n_digits: 10,    n_samples 1797,    n_features 64
__________________________________________________________________________________
init        time    inertia  homo    compl   v-meas  ARI     AMI     silhouette
k-means++   0.19s   69432    0.602   0.650   0.625   0.465   0.621   0.146
random      0.19s   69694    0.669   0.710   0.689   0.553   0.686   0.147
PCA-based   0.04s   70804    0.671   0.698   0.684   0.561   0.681   0.118
```
K-Means with Three Clusters
The example below runs k-means on the iris dataset with 8 clusters, with 3 clusters, and with 3 clusters from a bad random initialization, then plots each result in 3D alongside the ground-truth species labels:
```python
import numpy as np
import matplotlib.pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]

fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2],
               c=labels.astype(float), edgecolor='k')

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean(),
              X[y == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

fig.show()
```
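Plain k-means relies on means and Euclidean distance, so it does not fit purely categorical data. The next example uses the KModes class from the third-party `kmodes` package (assumed to be installed, e.g. with `pip install kmodes`), which clusters categorical data by matching modes instead of means: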
```python
import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)
```
```
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 28, cost: 793.0
Run 1, iteration: 2/100, moves: 1, cost: 793.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 28, cost: 791.0
Run 2, iteration: 2/100, moves: 4, cost: 789.0
Run 2, iteration: 3/100, moves: 3, cost: 789.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 20, cost: 797.0
Run 3, iteration: 2/100, moves: 7, cost: 792.0
Run 3, iteration: 3/100, moves: 3, cost: 792.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 21, cost: 799.0
Run 4, iteration: 2/100, moves: 6, cost: 798.0
Run 4, iteration: 3/100, moves: 0, cost: 798.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 5, iteration: 1/100, moves: 18, cost: 795.0
Run 5, iteration: 2/100, moves: 6, cost: 795.0
Best run was number 2
[[14  8  0 18  3  7  0  1 16  3]
 [ 7  1 12  4 18 16  5 17  6  2]
 [ 9 17  3  2 11  5 11  0 11  1]
 [ 8 13  8  3  9  0  2 12  6  9]]
```
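The k-prototypes algorithm, also from the `kmodes` package, handles mixed numerical and categorical features. The example below assumes a local `travel.csv` file whose columns are an id, a publisher name, a category score, a category name, and a place name (sample rows are shown after the code):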
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from kmodes.kprototypes import KPrototypes

style.use("ggplot")

colors = ['b', 'orange', 'g', 'r', 'c', 'm', 'y', 'k', 'Brown', 'ForestGreen']

# Data points with their publisher name, category score, category name, place name
syms = np.genfromtxt('travel.csv', dtype=str, delimiter=',')[:, 1]
X = np.genfromtxt('travel.csv', dtype=object, delimiter=',')[:, 2:]
X[:, 0] = X[:, 0].astype(float)

kproto = KPrototypes(n_clusters=15, init='Cao', verbose=2)
clusters = kproto.fit_predict(X, categorical=[1, 2])

# Print cluster centroids of the trained model.
print(kproto.cluster_centroids_)
# Print training statistics
print(kproto.cost_)
print(kproto.n_iter_)

for s, c in zip(syms, clusters):
    print("Result: {}, cluster:{}".format(s, c))

# Plot the results
for i in set(kproto.labels_):
    index = kproto.labels_ == i
    plt.plot(X[index, 0], X[index, 1], 'o')
plt.suptitle('Data points categorized with category score', fontsize=18)
plt.xlabel('Category Score', fontsize=16)
plt.ylabel('Category Type', fontsize=16)
plt.show()

# Clustered result
fig1, ax3 = plt.subplots()
scatter = ax3.scatter(syms, clusters, c=clusters, s=50)
ax3.set_xlabel('Data points')
ax3.set_ylabel('Cluster')
plt.colorbar(scatter)
ax3.set_title('Data points classified according to known centers')
plt.show()

result = zip(syms, kproto.labels_)
sortedR = sorted(result, key=lambda x: x[1])
print(sortedR)
```
Sample rows from travel.csv (id, publisher name, category score, category name, place name):

```
240,Ransika Fernando,0.59375,plant,No Data
240,Ransika Fernando,0.04296875,outdoor_,No Data
240,Ransika Fernando,0.26953125,outdoor_road,No Data
241,Sachini Jagodaarachchi,0.98046875,outdoor_mountain,Manigala Mountain
242,Chathuri Senanayake,0.96484375,outdoor_mountain,Adara Kanda
242,Chathuri Senanayake,0.1953125,building_,No Data
242,Chathuri Senanayake,0.00390625,outdoor_,No Data
242,Chathuri Senanayake,0.23046875,building_,Kuwait
242,Chathuri Senanayake,0.2578125,building_street,Kuwait
242,Chathuri Senanayake,0.015625,outdoor_,Kuwait
243,Nilantha Premakumara,0.9453125,sky_sun,No Data
243,Nilantha Premakumara,0.75,outdoor_mountain,No Data
244,Chathuri Senanayake,0.00390625,outdoor_,Trincomalee
244,Chathuri Senanayake,0.6328125,outdoor_oceanbeach,Trincomalee
245,Surangani Bandara,0.7734375,plant_tree,No Data
246,Hasitha Lakmal,0.4140625,people_many,No Data
246,Hasitha Lakmal,0.0078125,outdoor_,No Data
247,Pradeep Kalansooriya,0.40234375,building_,No Data
247,Pradeep Kalansooriya,0.0078125,outdoor_,No Data
248,Dilini Wijesinghe,0.07421875,outdoor_,Victoria Dam
248,Dilini Wijesinghe,0.0078125,others_,Victoria Dam
249,Chiranthi Vinghghani,0.015625,outdoor_,No Data
249,Chiranthi Vinghghani,0.6484375,outdoor_waterside,No Data
250,Janindu Praneeth Weerawarnakula,0.671875,outdoor_oceanbeach,Galle Fort
251,Chathurangi Shyalika,0.00390625,outdoor_,No Data
252,Chathurangi Shyalika,0.9296875,trans_trainstation,No Data
253,Surangani Bandara,0.625,outdoor_field,No Data
253,Surangani Bandara,0.01171875,outdoor_,No Data
254,Surangani Bandara,0.99609375,sky_object,No Data
255,Chathurangi Shyalika,0.00390625,outdoor_,No Data
256,Chathurangi Shyalika,0.33984375,outdoor_field,No Data
```