K-Means Clustering Algorithm

K-Means concept definition:

K-Means is a distance-based, exclusive, partitioning clustering method.

This description involves several concepts:

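Clustering: K-Means is a cluster analysis method. Clustering groups data objects into multiple classes, or clusters, so that objects in the same cluster are highly similar to one another while objects in different clusters differ substantially.
Partitioning: Clustering can be partition-based or hierarchical. Partitioning divides the objects into separate clusters, whereas hierarchical clustering arranges the objects into levels.
Exclusive: Each data object can be assigned to only one cluster. If a data object may be assigned to more than one cluster, the clustering is called overlapping.
Distance: Distance-based clustering groups together objects that are close to each other and therefore similar. Clustering based on a probability distribution model instead looks, within a set of objects, for a subset that fits a particular distribution model. Those objects are not necessarily the closest or the most similar ones, but they are the ones that best match the distribution the model describes.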

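K-Means problem statement:

Given a data set of n objects, K-Means constructs k partitions of the data, where each partition is a cluster and k ≤ n. The following conditions must also hold:

Each cluster contains at least one object.
Each object belongs to exactly one cluster.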
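The two conditions can be checked mechanically. The sketch below assumes a clustering is stored as an array with one cluster ID per object, the same representation the C# listing later in this article uses. The class and method names (`PartitionCheck`, `IsValidPartition`) are hypothetical and exist only for illustration.

```csharp
public static class PartitionCheck
{
  // Returns true when every object has a cluster ID in [0, k)
  // and each of the k clusters holds at least one object.
  public static bool IsValidPartition(int[] clustering, int k)
  {
    if (k <= 0 || k > clustering.Length)
      return false; // the problem statement requires 1 <= k <= n

    int[] counts = new int[k];
    foreach (int id in clustering)
    {
      if (id < 0 || id >= k)
        return false; // object is not in any of the k clusters
      ++counts[id];
    }

    foreach (int count in counts)
      if (count == 0)
        return false; // an empty cluster violates the first condition

    return true;
  }
}
```

Because the array holds exactly one ID per object, the "exactly one cluster" condition holds by construction; only the ID range and the non-empty condition need an explicit check.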

Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.
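The quantity being minimized can be written out explicitly. The notation here is an assumption introduced for illustration and does not appear in the original text: $x_i$ is a data object, $C_j$ is the $j$-th cluster, and $\mu_j$ is its centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

For a fixed assignment of objects to clusters, taking each $\mu_j$ as the mean of its cluster minimizes $J$, which is why the centroid is computed as an average.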

For example, consider the following set of 10 data items. Each item describes one person's height (in inches) and weight (in kilograms).

Height Weight
-------------
(73.0, 72.6) 
(61.0, 54.4) 
(67.0, 99.9) 
(68.0, 97.3) 
(62.0, 59.0) 
(75.0, 81.6) 
(74.0, 77.1) 
(66.0, 97.3) 
(68.0, 93.3) 
(61.0, 59.0)

Clustering by height and weight divides these 10 items into 3 groups.

Height Weight
-------------
(67.0, 99.9) 
(68.0, 97.3) 
(66.0, 97.3) 
(68.0, 93.3)

(73.0, 72.6) 
(75.0, 81.6) 
(74.0, 77.1)

(61.0, 54.4) 
(62.0, 59.0) 
(61.0, 59.0)

The groups can be described as: medium height and heavy, tall and medium weight, short and light. Plotting the groups makes the result clear at a glance.
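Each group can be summarized by its centroid, the per-column mean of its members. The following is a minimal sketch of that calculation; the class and method names (`CentroidDemo`, `Centroid`) are hypothetical and not part of the article's listing.

```csharp
public static class CentroidDemo
{
  // Computes the mean of each column over the rows of one cluster.
  // Assumes rows is non-empty and all rows have the same length.
  public static double[] Centroid(double[][] rows)
  {
    double[] mean = new double[rows[0].Length];

    foreach (double[] row in rows)
      for (int j = 0; j < mean.Length; ++j)
        mean[j] += row[j]; // accumulate column sums

    for (int j = 0; j < mean.Length; ++j)
      mean[j] /= rows.Length; // turn sums into means

    return mean;
  }
}
```

Applied to the three groups above, this gives centroids of about (67.25, 96.95) for the medium-height heavy group, (74.0, 77.1) for the tall group, and (61.33, 57.47) for the short, light group.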

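K-Means algorithm implementation:

Because K-Means operates only on the given complete data set and needs no special training data, it is an unsupervised machine learning technique.

The most common way to implement K-Means is an iterative refinement heuristic.

Given the number of partitions k, create an initial partition: randomly select k objects from the data set, each of which initially represents a cluster centroid. For every other object, compute its distance to each cluster centroid and assign it to the nearest cluster.
Then use an iterative relocation technique that tries to improve the partition by moving objects between clusters. Relocation means that whenever an object joins or leaves a cluster, the cluster's mean is recomputed and the objects are then reassigned. This process repeats until the objects in each cluster no longer change.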

randomly assign all data items to a cluster 
loop until no change in cluster assignments 
  compute centroids for each cluster 
  reassign each data item to cluster of closest centroid 
end

Stated more concisely:

initialize clustering 
loop 
  update centroids 
  update clustering 
end loop
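The prose description above starts by picking k data objects as initial centroids, whereas the pseudocode (and the full listing later in this article) starts from a random assignment of objects to clusters. Both are common ways to initialize. The sketch below illustrates the first variant under the assumption that k ≤ n; the class and method names (`ForgyInit`, `PickCentroids`, `AssignToNearest`) are hypothetical.

```csharp
using System;

public static class ForgyInit
{
  // Picks k distinct data objects at random and copies them as initial centroids.
  // Assumes 1 <= k <= data.Length.
  public static double[][] PickCentroids(double[][] data, int k, Random rnd)
  {
    int[] indices = new int[data.Length];
    for (int i = 0; i < indices.Length; ++i)
      indices[i] = i;

    double[][] centroids = new double[k][];
    for (int c = 0; c < k; ++c)
    {
      // partial Fisher-Yates shuffle: move a random remaining index into slot c
      int r = rnd.Next(c, indices.Length);
      int tmp = indices[r];
      indices[r] = indices[c];
      indices[c] = tmp;
      centroids[c] = (double[])data[indices[c]].Clone();
    }
    return centroids;
  }

  // Assigns each object to the index of its nearest centroid.
  // Squared Euclidean distance is enough for finding the minimum.
  public static int[] AssignToNearest(double[][] data, double[][] centroids)
  {
    int[] clustering = new int[data.Length];
    for (int i = 0; i < data.Length; ++i)
    {
      double best = double.MaxValue;
      for (int c = 0; c < centroids.Length; ++c)
      {
        double d = 0.0;
        for (int j = 0; j < data[i].Length; ++j)
          d += (data[i][j] - centroids[c][j]) * (data[i][j] - centroids[c][j]);

        if (d < best)
        {
          best = d;
          clustering[i] = c;
        }
      }
    }
    return clustering;
  }
}
```

Calling `PickCentroids` once and then alternating `AssignToNearest` with a centroid update corresponds to the loop in the pseudocode. Unlike the random-assignment start, this variant does not by itself guarantee that every cluster stays non-empty in later iterations.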

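Applying the K-Means algorithm to the height and weight example above gives the clustering process shown in the figure.

[Figure: K-Means clustering process on the height and weight data]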

K-Means advantages and disadvantages:

K-Means works well when the resulting clusters are dense and clearly separated from one another. For large data sets, K-Means is relatively scalable and efficient: its complexity is O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations, and usually k << n and t << n. The algorithm often terminates at a local optimum rather than the global one.

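The biggest problem with K-Means is that the value of k must be given in advance. The choice of k usually relies on experience and repeated experiments, and a value of k that suits one data set does not carry over to another. In addition, K-Means is sensitive to outliers: a small amount of noisy data can shift the mean considerably.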
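The sensitivity to outliers follows directly from the use of the mean. A minimal sketch, with hypothetical names (`OutlierDemo`, `Mean`, `MeanShift`) and a made-up outlier value, shows how far a single extra value can move a cluster mean:

```csharp
public static class OutlierDemo
{
  // Arithmetic mean of a non-empty array.
  public static double Mean(double[] values)
  {
    double sum = 0.0;
    foreach (double v in values)
      sum += v;
    return sum / values.Length;
  }

  // How far the mean moves when one extra value joins the group.
  public static double MeanShift(double[] values, double extra)
  {
    double[] withExtra = new double[values.Length + 1];
    values.CopyTo(withExtra, 0);
    withExtra[values.Length] = extra;
    return Mean(withExtra) - Mean(values);
  }
}
```

For the weights of the short, light group (54.4, 59.0, 59.0) the mean is about 57.47 kg. Adding one hypothetical outlier of 150.0 kg raises it to 80.6 kg, a shift of roughly 23 kg caused by a single data item.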

Basic K-Means - Lloyd's algorithm C# implementation:

The code below is based on Machine Learning Using C# Succinctly by James McCaffrey, and on an article.

using System;

namespace ClusterNumeric
{
  class ClusterNumProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin k-means clustering demo\n");

      double[][] rawData = new double[10][];
      rawData[0] = new double[] { 73, 72.6 };
      rawData[1] = new double[] { 61, 54.4 };
      rawData[2] = new double[] { 67, 99.9 };
      rawData[3] = new double[] { 68, 97.3 };
      rawData[4] = new double[] { 62, 59.0 };
      rawData[5] = new double[] { 75, 81.6 };
      rawData[6] = new double[] { 74, 77.1 };
      rawData[7] = new double[] { 66, 97.3 };
      rawData[8] = new double[] { 68, 93.3 };
      rawData[9] = new double[] { 61, 59.0 };

      Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n");
      Console.WriteLine(" ID Height Weight");
      Console.WriteLine("---------------------");
      ShowData(rawData, 1, true, true);

      int numClusters = 3;
      Console.WriteLine("\nSetting numClusters to " + numClusters);

      Console.WriteLine("Starting clustering using k-means algorithm");
      Clusterer c = new Clusterer(numClusters);
      int[] clustering = c.Cluster(rawData);
      Console.WriteLine("Clustering complete\n");

      Console.WriteLine("Final clustering in internal form:\n");
      ShowVector(clustering, true);

      Console.WriteLine("Raw data by cluster:\n");
      Console.WriteLine(" ID Height Weight");
      ShowClustered(rawData, clustering, numClusters, 1);

      Console.WriteLine("\nEnd k-means clustering demo\n");
      Console.ReadLine();
    }

    static void ShowData(
      double[][] data, int decimals,
      bool indices, bool newLine)
    {
      for (int i = 0; i < data.Length; ++i)
      {
        if (indices == true)
          Console.Write(i.ToString().PadLeft(3) + " ");

        for (int j = 0; j < data[i].Length; ++j)
        {
          double v = data[i][j];
          Console.Write(v.ToString("F" + decimals) + "   ");
        }

        Console.WriteLine("");
      }

      if (newLine == true)
        Console.WriteLine("");
    }

    static void ShowVector(int[] vector, bool newLine)
    {
      for (int i = 0; i < vector.Length; ++i)
        Console.Write(vector[i] + " ");

      if (newLine == true)
        Console.WriteLine("\n");
    }

    static void ShowClustered(
      double[][] data, int[] clustering,
      int numClusters, int decimals)
    {
      for (int k = 0; k < numClusters; ++k)
      {
        Console.WriteLine("===================");
        for (int i = 0; i < data.Length; ++i)
        {
          int clusterID = clustering[i];
          if (clusterID != k) continue;
          Console.Write(i.ToString().PadLeft(3) + " ");
          for (int j = 0; j < data[i].Length; ++j)
          {
            double v = data[i][j];
            Console.Write(v.ToString("F" + decimals) + "   ");
          }
          Console.WriteLine("");
        }
        Console.WriteLine("===================");
      }
    }
  }

  public class Clusterer
  {
    private int numClusters; // number of clusters
    private int[] clustering; // index = a tuple, value = cluster ID
    private double[][] centroids; // mean (vector) of each cluster
    private Random rnd; // for initialization

    public Clusterer(int numClusters)
    {
      this.numClusters = numClusters;
      this.centroids = new double[numClusters][];
      this.rnd = new Random(0); // arbitrary seed
    }

    public int[] Cluster(double[][] data)
    {
      int numTuples = data.Length;
      int numValues = data[0].Length;
      this.clustering = new int[numTuples];

      for (int k = 0; k < numClusters; ++k) // allocate each centroid
        this.centroids[k] = new double[numValues];

      InitRandom(data);

      Console.WriteLine("\nInitial random clustering:");
      for (int i = 0; i < clustering.Length; ++i)
        Console.Write(clustering[i] + " ");
      Console.WriteLine("\n");

      bool changed = true; // change in clustering?
      int maxCount = numTuples * 10; // sanity check
      int ct = 0;
      while (changed == true && ct <= maxCount)
      {
        ++ct; // k-means typically converges very quickly
        UpdateCentroids(data); // no effect if fail
        changed = UpdateClustering(data); // no effect if fail
      }

      int[] result = new int[numTuples];
      Array.Copy(this.clustering, result, clustering.Length);
      return result;
    }

    private void InitRandom(double[][] data)
    {
      int numTuples = data.Length;

      int clusterID = 0;
      for (int i = 0; i < numTuples; ++i)
      {
        clustering[i] = clusterID++;
        if (clusterID == numClusters)
          clusterID = 0;
      }
      for (int i = 0; i < numTuples; ++i)
      {
        int r = rnd.Next(i, clustering.Length);
        int tmp = clustering[r];
        clustering[r] = clustering[i];
        clustering[i] = tmp;
      }
    }

    private void UpdateCentroids(double[][] data)
    {
      int[] clusterCounts = new int[numClusters];
      for (int i = 0; i < data.Length; ++i)
      {
        int clusterID = clustering[i];
        ++clusterCounts[clusterID];
      }

      // zero-out this.centroids so it can be used as scratch
      for (int k = 0; k < centroids.Length; ++k)
        for (int j = 0; j < centroids[k].Length; ++j)
          centroids[k][j] = 0.0;

      for (int i = 0; i < data.Length; ++i)
      {
        int clusterID = clustering[i];
        for (int j = 0; j < data[i].Length; ++j)
          centroids[clusterID][j] += data[i][j]; // accumulate sum
      }

      for (int k = 0; k < centroids.Length; ++k)
        for (int j = 0; j < centroids[k].Length; ++j)
          centroids[k][j] /= clusterCounts[k]; // danger?
    }

    private bool UpdateClustering(double[][] data)
    {
      // (re)assign each tuple to a cluster (closest centroid)
      // returns false if no tuple assignments change OR
      // if the reassignment would result in a clustering where
      // one or more clusters have no tuples.

      bool changed = false; // did any tuple change cluster?

      int[] newClustering = new int[clustering.Length]; // proposed result
      Array.Copy(clustering, newClustering, clustering.Length);

      double[] distances = new double[numClusters]; // from tuple to centroids

      for (int i = 0; i < data.Length; ++i) // walk through each tuple
      {
        for (int k = 0; k < numClusters; ++k)
          distances[k] = Distance(data[i], centroids[k]);

        int newClusterID = MinIndex(distances); // find closest centroid
        if (newClusterID != newClustering[i])
        {
          changed = true; // note a new clustering
          newClustering[i] = newClusterID; // accept update
        }
      }

      if (changed == false)
        return false; // no change so bail

      // check proposed clustering cluster counts
      int[] clusterCounts = new int[numClusters];
      for (int i = 0; i < data.Length; ++i)
      {
        int clusterID = newClustering[i];
        ++clusterCounts[clusterID];
      }

      for (int k = 0; k < numClusters; ++k)
        if (clusterCounts[k] == 0)
          return false; // bad clustering

      Array.Copy(newClustering, clustering, newClustering.Length); // update
      return true; // good clustering and at least one change
    }

    // Euclidean distance between two vectors for UpdateClustering()
    private static double Distance(double[] tuple, double[] centroid)
    {
      double sumSquaredDiffs = 0.0;
      for (int j = 0; j < tuple.Length; ++j)
        sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
      return Math.Sqrt(sumSquaredDiffs);
    }

    // helper for UpdateClustering() to find closest centroid
    private static int MinIndex(double[] distances)
    {
      int indexOfMin = 0;
      double smallDist = distances[0];
      for (int k = 1; k < distances.Length; ++k)
      {
        if (distances[k] < smallDist)
        {
          smallDist = distances[k];
          indexOfMin = k;
        }
      }
      return indexOfMin;
    }
  }
}

Running the program produces the following output:

[Figure: console output of the k-means clustering demo]


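This article, "K-Means Clustering Algorithm", was published by Dennis Gao on his personal blog at cnblogs (博客園). Reproduction in any form without the author's consent is prohibited, and any automated or manual crawler-based reposting is considered rogue behavior.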

posted @ 2015-02-09 23:06 sangmado
