Original source from: http://bikramhanjra.blogspot.ch/2013/05/k-means-clustering-algorithm-and-opencv.html

**K means** is a unsupervised learning algorithm for clustering $N$ observations into $K$ clusters with each cluster having its own cluster center.

**Input to algorithm**

- $K$(number of clusters) – The number of cluster into which the data is to be clustered.
- The observation matrix $x$ i.e $\{ $ $x^{(1)}$,$x^{(2)}$,$x^{(3)}$…..,$x^{(n)}$,…..,$x^{N},$ $\} $

**Output of algorithm**

- $u_c$ This matrix contains the same value for the observations belong to same cluster.
- Cluster Centroids in matrix $u^{(n)}$.

**Parameters**

- $k$ or – Index of Cluster . Here $(1<k<K)$
- $n$ – Index of Observation. Here $(1<n<N)$
- $x^{(n)}$ – nth observation.
- $c^{(n)}$ The $nth$ element of c matrix contains the index of the cluster center to which the nth observation is assigned.
- $u^{(n)}$ The $kth$ element of u matrix represents the cluster centroid $k$ .
- $u_c^{(n)}$ Here the it contains the value of the cluster centroid to which $x^{n}$belongs.

**Note – If there are 4 cluster i.e. $k$=4 and the let $x^{(i)}$ belongs to $k$=2 cluster then $c^{n}$=2 and **** $u_c^{(n)}$=$u^{(n)}$**.

**ALGORITHM**

*Assignment Step*– For all the observations in matrix $X$, update matrix $c$ where the $i$th value contains the index of the cluster which has the lowest euclidean distance from $x_{(i)}$.*Update Step –*For the cluster centroids in in $c$ , update the value of the $k$th element by the average of the observations that are assigned to that cluster.

*In figure -*

$(a)$ A set of 2D Observations (Green dots) .

$(b)$ Randomly initialize 2 cluster centroids (Red and Blue cross).

$(c)$ First *Assignment Step* , calculation of Euclidean distance of each observation from both the cluster centroids and assign it to the cluster which has minimum Euclidean distance (Red dots represent observations assigned to Red cross and vice-versa).

$(d)$ First *Update Step*, update the clusters to the average or centroid or mean of all the observations that are assigned to the cluster in each dimension. (The 2 crosses have there position changed to the centroid of there colors points).

$(e)$ Second *Assignment Step .*

$(f)$ Second *Update Step .*

REFERENCE – https://www.coursera.org/course/ml .

**Implementation of K Means algorithm is also present in OpenCV .(**C++)

double kmeans(InputArray data, int K, InputOutputArray bestLabels, TermCriteria

criteria, int attempts, int flags, OutputArray centers=noArray() )

Here-

*data* – These are the observations which have to be clustered.

*K* – $K$ is the number clusters to be formed.

*bestLabels* – This contains the index of the cluster assigned to the ith observation.

*criteria* – The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy.

*attempts* – Flag to specify the number of times the algorithm is executed using different initial labellings.

*Flag*- This is used to define ,how to select the values of $u$ or the initial cluster centtroids.

*centers* – Returns the cluster centers.

Kmeans clustering is one of the most widely used UnSupervised Learning Algorithms. If you are not sure what Kmeans is, refer this article. Also if you have heard about the term Vector Quantization, Kmeans is closely related to that (refer this article to know more about it). Autonlab has a great ppt on Kmeans Clustering.

First, I’ll talk about the kmeans usage in OpenCV with C++ and then I’ll explain it with a program. If you are not yet comfortable in OpenCV with C++, please refer to this article and the pretty much everything else is the same as in C API (where you use IplImage*,etc).

Btw, My other programs in OpenCV will be posted here.

Function call in C++ API of OpenCV accepts the input in following format:

**double kmeans(const Mat& samples, int clusterCount, Mat& labels, TermCriteria termcrit, int attempts, int flags, Mat* centers);**

Parameters explained as follows:

**samples**: It contains the data. Each row represents a Feature Vector. Each co lumn in a row represent a dimension. So, we can have multiple dimensions of data in the feature vector. Example if we have**50, 5 dimensional feature vector, we will have 50 rows, 5 colums**of this matrix. One thing interesting which I’ve noticed is**kmeans**doesn’t work with**CV_64F**type.**clusterCount**: It should be specified beforehand. We need to know how many clusters do we divide the data into. It is an integer.**labels**: It is an output Matrix. If we had a Matrix of above specified size (i.e**50 x 5**), we will have**50 x 1**output Matrix. It determines which cluster the feature vector belongs. It starts with**0, 1, …. (number of clusters-1)**.**TermCriteria**: It determines the criteria in applying the algorithm. Max iterations, accuracy,etc.**attempts**: number of attempts made with different initial labelling. Also refer documentation for elaborate information on this parameter.**flags**: It can be

**KMEANS_RANDOM_CENTERS**(for random initialization of cluster centers).

**KMEANS_PP_CENTERS**(for kmeans++ version of initializing cluster centers)

**KMEANS_USE_INITIAL_LABELS**(for user defined initialization).**centers**: Matrix holding center of each cluster. If we divide the**50 x 5**feature vector into**2 clusters**, we will have**2 centers of each in 5 dimensions**.

`#include "opencv2/highgui/highgui.hpp"`

`#include "opencv2/core/core.hpp"`

`#include <iostream>`

`using`

`namespace`

`cv;`

`using`

`namespace`

`std;`

`int`

`main( `

`int`

`/*argc*/`

`, `

`char`

`** `

`/*argv*/`

`)`

`{`

` `

`cout << `

`"\n Usage in C++ API:\n double kmeans(const Mat& samples, int clusterCount, Mat& labels, TermCriteria termcrit, int attempts, int flags, Mat* centers) \n\n\n"`

`<< endl;`

` `

`int`

`clusterCount = 2;`

` `

`int`

`dimensions = 5;`

` `

`int`

`sampleCount = 50;`

` `

`Mat points(sampleCount,dimensions, CV_32F,Scalar(10));`

` `

`Mat labels;`

` `

`Mat centers(clusterCount, 1, points.type());`

` `

`// values of 1st half of data set is set to 10`

` `

`//change the values of 2nd half of the data set; i.e. set it to 20`

` `

`for`

`(`

`int`

`i =24;i<points.rows;i++)`

` `

`{`

` `

`for`

`(`

`int`

`j=0;j<points.cols;j++)`

` `

`{`

` `

`points.at<`

`float`

`>(i,j)=20;`

` `

`}`

` `

`}`

` `

`kmeans(points, clusterCount, labels, TermCriteria( CV_TERMCRIT_EPS+CV_TERMCRIT_ITER, 10, 1.0), 3, KMEANS_PP_CENTERS, centers);`

` `

`// we can print the matrix directly.`

` `

`cout<<`

`"Data: \n"`

`<<points<<endl;`

` `

`cout<<`

`"Center: \n"`

`<<centers<<endl;`

` `

`cout<<`

`"Labels: \n"`

`<<labels<<endl;`

` `

`return`

`0;`

`}`