Classification: the task is to learn to assign instances to predefined classes. classification requires supervised learning, i.e., the training data has to specify what we are trying to learn, and further make predications based on the training data.

Clustering: no predefined classification is required. The task is to learn a classification from the data. Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other, they share certain properties. In other words,  clustering is not going to make predications, thus in no need of the training data.
   And the results for the clustering is only a group of clusters without labeling what this cluster represents


What is k-means
   k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
because it is told in advance how many distinct clusters to generate.
  It is the simplest and most commonly used algorithm employing a squared error criterion. It starts with a random initial partition and keeps reassigning the observations to clusters based on the similarity between the observations and the cluster centers until there is no reassignment of any observation from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations).
k-means is polythetic; that is, all features enter into the computation of distances between
observations, and decisions  are based on those distances. it have advantages in applications involving large data sets

How k-means works

J is the target function,  k represents the number of clusters while n represents the number of observations, the target goal of this function is to minimize it to optimum

  Suppose we are given with a dataset of J data items or observations, in which individual data item has several continuous and numeral features or attributes that could be mapped onto multi-dimensional space.
  In the k-means algorism, we are known the dataset is to be clustered into predefined k clusters from m1…,mk,, which may be specified randomly or by some heuristic.  And it is often to call the random cluster point as centroid.
  The algorithm firstly initializes k centroid from the existing data items, and then proceeds by alternating between two steps until convergence has been reached.:
   Assignment step: Assign each observation to the cluster with the closest mean
  Update step: Calculate the new means to be the centroid of the observations in the cluster.

                                Figure 1   the example for k-means

For example as showed in the figure 1, In the first frame, the two centroids (shown as dark circles) are placed randomly. Frame 2 shows that each of the items is assigned to the nearest centroid—in this case, A and B are assigned to the top centroid and C, D, and E are assigned to the bottom centroid. In the third frame, each centroid has been moved to the average location of the items that were assigned to it. When the assignments are calculated again, it turns out that C is now closer to the top centroid, while D and E remain closest to the bottom one. Thus, the final result is reached with A, B, and C in one cluster, and D and E in the other.

Short introduction about hierarchical clustering

    Hierarchical clustering builds up a hierarchy of groups by continuously merging the
two most similar groups. Each of these groups starts as a single item.  In each iteration this method calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group.

                              Figure 2  hierarchical 
  For example,  in the figure 2, the similarity of the items is represented by their relative locations—the closer two items are, the more similar they are. At first, the groups are just individual items. In the second step, you can see that A and B, the two items closest together, have merged to form a new group whose location is halfway between the two. In the third step, this new group is merged with C. Since D and E are now the two closest items, they form a new group. The final step unifies the two remaining groups
  However , it can be very time consuming to do all these calculations, so it’s a good
idea to store the correlation results for each pair, since they will have to be calculated
again and again until one of the items in the pair is merged into another cluster.
  The tree view doesn’t really break the data into distinct groups without
additional work, and the algorithm is extremely computationally intensive. Because
the relationship between every pair of items must be calculated and then recalculated
when items are merged, the algorithm will run slowly on very large datasets.
  Hierarchical clustering is much preferable for detailed data analysis.


The application of k-means clustering
Clustering is used frequently in data-intensive applications.
Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Clustering is also heavily used in computational biology to find groups of genes that exhibit similar behavior, which might indicate that they respond to a treatment in the same way or are part of the same biological pathway.

Image segmentation
The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition.

Image compression

Reducing the number of colors in an image is useful to compress it: assuming to represent each color component with 8 bits, a small 100×100 image would need10000*24 bits (29.3 kBytes); if we manage to accettably represent the image using only 32 different colors, we can encode each with just 5 bits, and reduce the needed memory to 6.1 kBytes, plus the space required to define the 32 used colors (32*24 bits).