中文字幕精品亚洲无线码二区,国产黄a三级三级三级看三级,亚洲七七久久桃花影院,丰满少妇被猛烈进入,国产小视频在线观看网站

K-Means 聚類算法(fa)

K-Means 概念定義:

K-Means 是一種基(ji)于距離的排他的聚類劃(hua)分(fen)方法。

上面的 K-Means 描述中(zhong)包含了(le)幾(ji)個(ge)概念:

  • 聚類(Clustering):K-Means 是一種聚類分析(Cluster Analysis)方法。聚類就是將數據對象分組成為多個類或者簇 (Cluster),使得在同一個簇中的對象之間具有較高的相似度,而不同簇中的對象差別較大。
  • 劃分(Partitioning):聚類可以基于劃分,也可以基于分層。劃分即將對象劃分成不同的簇,而分層是將對象分等級。
  • 排他(Exclusive):對于一個數據對象,只能被劃分到一個簇中。如果一個數據對象可以被劃分到多個簇中,則稱為可重疊的(Overlapping)。
  • 距離(Distance):基于距離的聚類是將距離近的相似的對象聚在一起。基于概率分布模型的聚類是在一組對象中,找到能符合特定分布模型的對象的集合,他們不一定是距離最近的或者最相似的,而是能完美的呈現出概率分布模型所描述的模型。

K-Means 問題描述:

給定一(yi)個(ge) n 個(ge)對象的(de)數據(ju)集,它可以(yi)構建數據(ju)的(de) k 個(ge)劃分(fen),每個(ge)劃分(fen)就是一(yi)個(ge)簇,并且 k ≤ n。同時還需滿足:

  1. 每個組至少包含一個對象。
  2. 每個對象必須屬于且僅屬于一個簇。

Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.

例(li)如,有如下包含(han) 10 條數據的集(ji)合。集(ji)合中每(mei)項描(miao)述了一個人的身(shen)高(gao)(Height: inches)和體重(Weight: kilograms)。

Height Weight
-------------
(73.0, 72.6) 
(61.0, 54.4) 
(67.0, 99.9) 
(68.0, 97.3) 
(62.0, 59.0) 
(75.0, 81.6) 
(74.0, 77.1) 
(66.0, 97.3) 
(68.0, 93.3) 
(61.0, 59.0)

通過按照(zhao)身高和(he)體(ti)重的聚類,可以將上述(shu) 10 條數據分組(zu)成(cheng) 3 類。

Height Weight
-------------
(67.0, 99.9) 
(68.0, 97.3) 
(66.0, 97.3) 
(68.0, 93.3)

(73.0, 72.6) 
(75.0, 81.6) 
(74.0, 77.1)

(61.0, 54.4) 
(62.0, 59.0) 
(61.0, 59.0)

分類結果(guo)可(ke)以(yi)描述為:中等身(shen)高(gao)并(bing)(bing)且(qie)很重、很高(gao)并(bing)(bing)且(qie)中等體重、矮并(bing)(bing)且(qie)輕。如果(guo)用圖形來觀察分組狀(zhuang)況則結果(guo)一目(mu)了然。

K-Means 算法實現:

由于 K-Means 算法值針對給定的(de)完整(zheng)數據集進行操作,不(bu)需要(yao)任何特殊的(de)訓練數據,所(suo)以 K-Means 是一種無監督(du)的(de)機器學習方法(Unsupervised Machine Learning Technique)。

K-Means 算法最常見的實現方式是(shi)使用迭代式精化(hua)啟(qi)發(fa)法的 。

  • 給定劃分數量 k。創建一個初始劃分,從數據集中隨機地選擇 k 個對象,每個對象初始地代表了一個簇中心(Cluster Centroid)。對于其他對象,計算其與各個簇中心的距離,將它們劃入距離最近的簇。
  • 采用迭代的重定位技術,嘗試通過對象在劃分間移動來改進劃分。所謂重定位技術,就是當有新的對象加入簇或者已有對象離開簇的時候,重新計算簇的平均值,然后對對象進行重新分配。這個過程不斷重復,直到各簇中對象不再變化為止。
randomly assign all data items to a cluster 
loop until no change in cluster assignments 
  compute centroids for each cluster 
  reassign each data item to cluster of closest centroid 
end

簡潔點兒的(de)表述即為:

initialize clustering 
loop 
  update centroids 
  update clustering 
end loop

應用 K-Means 算法到上述(shu)身高與體重的示例,聚類過程如(ru)下圖所示。

K-Means 優缺點:

當結果(guo)簇(cu)(cu)是(shi)(shi)(shi)密集(ji)的(de),而且簇(cu)(cu)和(he)簇(cu)(cu)之間的(de)區別比較明顯時,K-Means 的(de)效果(guo)較好。對于(yu)大(da)數(shu)據(ju)集(ji),K-Means 是(shi)(shi)(shi)相對可伸縮的(de)和(he)高效的(de),它的(de)復雜度(du)是(shi)(shi)(shi) O(nkt),n 是(shi)(shi)(shi)對象的(de)個數(shu),k 是(shi)(shi)(shi)簇(cu)(cu)的(de)數(shu)目(mu),t 是(shi)(shi)(shi)迭代的(de)次數(shu),通常 k << n,且 t << n,所以(yi)算法經常以(yi)局部最(zui)優結束。

K-Means 的最(zui)大問題是要求先給出 k 的個數(shu)。k 的選擇一般基于經驗值和(he)多(duo)次(ci)實驗結果,對(dui)于不同的數(shu)據(ju)(ju)集,k 的取(qu)值沒有可借鑒性。另外(wai),K-Means 對(dui)孤立點數(shu)據(ju)(ju)是敏(min)感的,少量(liang)噪聲數(shu)據(ju)(ju)就能對(dui)平均值造成極大的影響。

Basic K-Means - Lloyd's algorithm C# 代碼實現:

Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article .

  1 using System;
  2 
  3 namespace ClusterNumeric
  4 {
  5   class ClusterNumProgram
  6   {
  7     static void Main(string[] args)
  8     {
  9       Console.WriteLine("\nBegin k-means clustering demo\n");
 10 
 11       double[][] rawData = new double[10][];
 12       rawData[0] = new double[] { 73, 72.6 };
 13       rawData[1] = new double[] { 61, 54.4 };
 14       rawData[2] = new double[] { 67, 99.9 };
 15       rawData[3] = new double[] { 68, 97.3 };
 16       rawData[4] = new double[] { 62, 59.0 };
 17       rawData[5] = new double[] { 75, 81.6 };
 18       rawData[6] = new double[] { 74, 77.1 };
 19       rawData[7] = new double[] { 66, 97.3 };
 20       rawData[8] = new double[] { 68, 93.3 };
 21       rawData[9] = new double[] { 61, 59.0 };
 22 
 23       Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n");
 24       Console.WriteLine(" ID Height Weight");
 25       Console.WriteLine("---------------------");
 26       ShowData(rawData, 1, true, true);
 27 
 28       int numClusters = 3;
 29       Console.WriteLine("\nSetting numClusters to " + numClusters);
 30 
 31       Console.WriteLine("Starting clustering using k-means algorithm");
 32       Clusterer c = new Clusterer(numClusters);
 33       int[] clustering = c.Cluster(rawData);
 34       Console.WriteLine("Clustering complete\n");
 35 
 36       Console.WriteLine("Final clustering in internal form:\n");
 37       ShowVector(clustering, true);
 38 
 39       Console.WriteLine("Raw data by cluster:\n");
 40       Console.WriteLine(" ID Height Weight");
 41       ShowClustered(rawData, clustering, numClusters, 1);
 42 
 43       Console.WriteLine("\nEnd k-means clustering demo\n");
 44       Console.ReadLine();
 45     }
 46 
 47     static void ShowData(
 48       double[][] data, int decimals,
 49       bool indices, bool newLine)
 50     {
 51       for (int i = 0; i < data.Length; ++i)
 52       {
 53         if (indices == true)
 54           Console.Write(i.ToString().PadLeft(3) + " ");
 55 
 56         for (int j = 0; j < data[i].Length; ++j)
 57         {
 58           double v = data[i][j];
 59           Console.Write(v.ToString("F" + decimals) + "   ");
 60         }
 61 
 62         Console.WriteLine("");
 63       }
 64 
 65       if (newLine == true)
 66         Console.WriteLine("");
 67     }
 68 
 69     static void ShowVector(int[] vector, bool newLine)
 70     {
 71       for (int i = 0; i < vector.Length; ++i)
 72         Console.Write(vector[i] + " ");
 73 
 74       if (newLine == true)
 75         Console.WriteLine("\n");
 76     }
 77 
 78     static void ShowClustered(
 79       double[][] data, int[] clustering,
 80       int numClusters, int decimals)
 81     {
 82       for (int k = 0; k < numClusters; ++k)
 83       {
 84         Console.WriteLine("===================");
 85         for (int i = 0; i < data.Length; ++i)
 86         {
 87           int clusterID = clustering[i];
 88           if (clusterID != k) continue;
 89           Console.Write(i.ToString().PadLeft(3) + " ");
 90           for (int j = 0; j < data[i].Length; ++j)
 91           {
 92             double v = data[i][j];
 93             Console.Write(v.ToString("F" + decimals) + "   ");
 94           }
 95           Console.WriteLine("");
 96         }
 97         Console.WriteLine("===================");
 98       }
 99     }
100   }
101 
102   public class Clusterer
103   {
104     private int numClusters; // number of clusters 
105     private int[] clustering; // index = a tuple, value = cluster ID 
106     private double[][] centroids; // mean (vector) of each cluster 
107     private Random rnd; // for initialization 
108 
109     public Clusterer(int numClusters)
110     {
111       this.numClusters = numClusters;
112       this.centroids = new double[numClusters][];
113       this.rnd = new Random(0); // arbitrary seed 
114     }
115 
116     public int[] Cluster(double[][] data)
117     {
118       int numTuples = data.Length;
119       int numValues = data[0].Length;
120       this.clustering = new int[numTuples];
121 
122       for (int k = 0; k < numClusters; ++k) // allocate each centroid 
123         this.centroids[k] = new double[numValues];
124 
125       InitRandom(data);
126 
127       Console.WriteLine("\nInitial random clustering:");
128       for (int i = 0; i < clustering.Length; ++i)
129         Console.Write(clustering[i] + " ");
130       Console.WriteLine("\n");
131 
132       bool changed = true; // change in clustering? 
133       int maxCount = numTuples * 10; // sanity check 
134       int ct = 0;
135       while (changed == true && ct <= maxCount)
136       {
137         ++ct; // k-means typically converges very quickly 
138         UpdateCentroids(data); // no effect if fail 
139         changed = UpdateClustering(data); // no effect if fail 
140       }
141 
142       int[] result = new int[numTuples];
143       Array.Copy(this.clustering, result, clustering.Length);
144       return result;
145     }
146 
147     private void InitRandom(double[][] data)
148     {
149       int numTuples = data.Length;
150 
151       int clusterID = 0;
152       for (int i = 0; i < numTuples; ++i)
153       {
154         clustering[i] = clusterID++;
155         if (clusterID == numClusters)
156           clusterID = 0;
157       }
158       for (int i = 0; i < numTuples; ++i)
159       {
160         int r = rnd.Next(i, clustering.Length);
161         int tmp = clustering[r];
162         clustering[r] = clustering[i];
163         clustering[i] = tmp;
164       }
165     }
166 
167     private void UpdateCentroids(double[][] data)
168     {
169       int[] clusterCounts = new int[numClusters];
170       for (int i = 0; i < data.Length; ++i)
171       {
172         int clusterID = clustering[i];
173         ++clusterCounts[clusterID];
174       }
175 
176       // zero-out this.centroids so it can be used as scratch 
177       for (int k = 0; k < centroids.Length; ++k)
178         for (int j = 0; j < centroids[k].Length; ++j)
179           centroids[k][j] = 0.0;
180 
181       for (int i = 0; i < data.Length; ++i)
182       {
183         int clusterID = clustering[i];
184         for (int j = 0; j < data[i].Length; ++j)
185           centroids[clusterID][j] += data[i][j]; // accumulate sum 
186       }
187 
188       for (int k = 0; k < centroids.Length; ++k)
189         for (int j = 0; j < centroids[k].Length; ++j)
190           centroids[k][j] /= clusterCounts[k]; // danger? 
191     }
192 
193     private bool UpdateClustering(double[][] data)
194     {
195       // (re)assign each tuple to a cluster (closest centroid) 
196       // returns false if no tuple assignments change OR 
197       // if the reassignment would result in a clustering where 
198       // one or more clusters have no tuples. 
199 
200       bool changed = false; // did any tuple change cluster? 
201 
202       int[] newClustering = new int[clustering.Length]; // proposed result 
203       Array.Copy(clustering, newClustering, clustering.Length);
204 
205       double[] distances = new double[numClusters]; // from tuple to centroids
206 
207       for (int i = 0; i < data.Length; ++i) // walk through each tuple 
208       {
209         for (int k = 0; k < numClusters; ++k)
210           distances[k] = Distance(data[i], centroids[k]);
211 
212         int newClusterID = MinIndex(distances); // find closest centroid 
213         if (newClusterID != newClustering[i])
214         {
215           changed = true; // note a new clustering 
216           newClustering[i] = newClusterID; // accept update 
217         }
218       }
219 
220       if (changed == false)
221         return false; // no change so bail 
222 
223       // check proposed clustering cluster counts 
224       int[] clusterCounts = new int[numClusters];
225       for (int i = 0; i < data.Length; ++i)
226       {
227         int clusterID = newClustering[i];
228         ++clusterCounts[clusterID];
229       }
230 
231       for (int k = 0; k < numClusters; ++k)
232         if (clusterCounts[k] == 0)
233           return false; // bad clustering 
234 
235       Array.Copy(newClustering, clustering, newClustering.Length); // update 
236       return true; // good clustering and at least one change 
237     }
238 
239     // Euclidean distance between two vectors for UpdateClustering() 
240     private static double Distance(double[] tuple, double[] centroid)
241     {
242       double sumSquaredDiffs = 0.0;
243       for (int j = 0; j < tuple.Length; ++j)
244         sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
245       return Math.Sqrt(sumSquaredDiffs);
246     }
247 
248     // helper for UpdateClustering() to find closest centroid 
249     private static int MinIndex(double[] distances)
250     {
251       int indexOfMin = 0;
252       double smallDist = distances[0];
253       for (int k = 1; k < distances.Length; ++k)
254       {
255         if (distances[k] < smallDist)
256         {
257           smallDist = distances[k];
258           indexOfMin = k;
259         }
260       }
261       return indexOfMin;
262     }
263   }
264 }

運行結果如下:

參考資料

本篇文章《K-Means 聚類算法》由 Dennis Gao 發表自博客園個人博客,未經作者本人同意禁止以任何的形式轉載,任何自動的或人為的爬蟲轉載行為均為耍流氓。

posted @ 2015-02-09 23:06  sangmado  閱讀(26438)  評論(3)    收藏  舉報