The ways to calculate the similarity of distance

In order to categorize the data items into distinct groups, we should find things that share the same patterns or properties. By examining these properties, we could find to what the extent two data items are closely related to each other.
An important component of a clustering algorithm is the distance measure between data points.
Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

1. Euclidean Distance Score

The most popular metric for continuous features is the Euclidean distance.
The Euclidean distance between points p and q is the length of the line segment . If p = (p1, p2,…, pn) andq = (q1, q2,…, qn) are two points, then the distance from p to q is given by:

Suppose the data items or observations are mapped onto a multi-dimensional space, in which each individual dimension is corresponded to a attribute of the data item. Generally , it is better to imagine a data item with two continuous numeral features, which are to be map onto a two dimensional chart, so it is much easier to view the similarities between different data items by watching their distances between each pair, The closer two data items are in the preference space, the more similar their preferences are.
  To calculate the distance between each pair of observations, take the difference of attribute in each dimension, square them and add them together, then take the square root of the sum
   If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances.
   However, even in this case the Euclidean distance can sometimes be misleading. Figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.

2 Pearson Correlation Score

   A slightly more sophisticated way to determine the similarity is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn’t well normalized
Here’s the formula for computing the coefficient for two items x and y is the following:

It then calculates the sums and the sum of the squares of the ratings for the two items ,and calculates the sum of their products . Finally, it uses these results to calculate the Pearson correlation coefficient
   Unlike the distance metric, this formula is not very intuitive, but it does tell you how much the variables change together divided by the product of how much they vary individually.
Remember that the Pearson correlation is 1.0 when two items match perfectly, and is
close to 0.0 when there’s no relationship at all.

What is the condition to stop the iteration

   With every iteration, the rows are each assigned to one of the centroids, and the centroid data is updated to the average of all its assignees. When the assignments are the same as they were the previous time, the process ends and the k lists, each representing a cluster, are returned. The number of iterations it takes to produce the final result is quite small compared to hierarchical clustering.

The comparison between k-means and hierarchical algorithms

The k-means algorithm is popular because it is easy to implement, and its time complexity is O~n!,
The reasons behind the popularity of the k-means algorithm are:
(1) Its time complexity is O~nkl!, where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically,
k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set
(2) Its space complexity is O~k 1 n!. It requires additional space to store the data matrix. It is possible to store the data matrix in a secondary memory and access each pattern based on need. However, this scheme requires a huge access time because of the iterative nature of
the algorithm, and as a consequence processing time increases enormously. It is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

Hierarchical algorithms are more versatile.

But they have the following disadvantages:
(1) The time complexity of hierarchical agglomerative algorithms is O~n2 log n!. It is possible to obtain single-link clusters using an MST of the data, which can be constructed in O~n log2 n! time for two-dimensional data
(2) The space complexity of agglomerative algorithms is O~n2!. This is because
a similarity matrix of size n 3 n has to be stored. To cluster every pixel in a 100 3 100 image,
approximately 200 megabytes of storage would be required (assuning single-precision storage of similarities).
It is possible to compute the entries of this matrix based on need instead of storing them (this would increase the algorithm’s time complexity


 Application of k-means on blog clustering

The application of k-means will help us to cluster the blogs, the topics they discuss, and their particular word usage to show that blogs can be grouped according to their text and that words can be grouped by their usage.
By clustering blogs based on word frequencies, it might be possible to determine if
there are groups of blogs that frequently write about similar subjects or write in similar
styles. Such a result could be very useful in searching, cataloging, and discovering
the huge number of blogs that are currently online.

Description about my dataset

                                        Figure 3  the dataset
The figure 3 shows the dataset I am going to work through . In my dataset, the data items that  to be clustered are a set of 99 of the top blogs, and the data they’ll be clustered on is the number of times a particular set of words appears in each blog’s feed.
   The top row in the figure represents the 706 keywords for each blogs, in other words, these keywords are the key features or attributes that we based on to cluster the blogs into distinct groups. The leftmost rows are representing the 99 blog item to be clustered.
   As we know, k-means algorithm is not feasible for the datasets that are non-continuous, nominal, discrete. So the keywords for clustering are represented in the words frequency that is numeral and continuous.  
1.  feature extraction
   Actually, feature extraction is one significant task before the clustering task. Clustering task is so dependent on the features we collected to make decisions on distinction of each data item. 
Thus it is so crucial to make sure the all the features we selected for the experiment are able to be the representatives of target features.
In this case, the key words selecting from the blog are the words should be within a certain bounds, in this dataset, the creator makes the bound from minimum 10% to maximum 50% for each keyword
2 reading the dataset from csv
   The application reads the top row into the list of column names and reads the leftmost column into a list of row names, then puts all the data into a big list where every item in the list is the data for that row. The count for any cell can be referenced by its row and column in data, which also corresponds to the indices of the row names and column names lists.

The architecture on the implementation
1. The explanation of my class on k-means

On my simple implementation of k-means
Cluster class is referred to the cluster,  in which a centroid  is to store a random assignment,  and two arrays are used to keep the data items assigned to this cluster for last time and present time respectively.
   While Assignment class is referred to the assignment, in which have a array storing all the attributes, in other words,  the key features of  data item.
   There is one K-means class for implementing  the functionality of k-means algorithm, 
   This class  is however based on a interface – IKmeans , in this interface defines some basic functions for k-means to help to further extend k-mean++ and k-medoids in the future.
   DatasetConfig is responsible for retrieving data from file, parsing the data into arrays of specific assignment, then pass the array to Kmean class to handle the logic. 
The figure 4 shows the basic architecture of implementation

2 The process

  1.    K-means clustering begins with k randomly placed points, and then assign each item to the nearest points. After the assignment, the centroids would are moved to the average the location of the nodes around them .

step 1: read the data from the csv
  When reading the data from the csv file, the first most important thing is to make sure the number of all available attributes, since we need to know the number of attributes to initialize some arrays in the assignment class .
step 2:
place k centroid randomly, in practice, we ‘d like to build k clusters and randomly select the centroid from the dataset, assign the centroid into cluster.
   That is  k clusters[] accompanied with k lastTimeCluster[],  and n data items
   lastTimeCluster[] is used to determines when to end iteration, it stores the results last time, and compare with current results,
step 2:
for( each dataItem)
for(each centroid in k)
calculate the current distance between dataItem and centroid,
//compare the current distance with minimum distance
           if(minimuDistance > currentDistance)
//the specific centroid assign the item to him, that is
cluster[centroid ].add dataitem
if(clusters == lastTime_clusters)   break ;   //  finish iteration
{   lasttime_clusters = clusters
    clusters = [];
    go to step 3
step 3:
for (each centroid in cluster)
// move its position to the average position around the items assigned to it
2. the process repeats until the assignment stops assigning.

clustering results

In my experiments, I choose to cluster these 99 blogs into 10 clusters in the way of k-means. Generally only after the 4, or 5, sometimes 9 times iterations at most; the application could stop iteration, and come up with the results,

   In order to be able to better observe the results, after finishing the iteration the application would show, like what we have seen in the figure

1. Evaluation  on results
   About the evaluation of the results, test the resulting clusters intuitively, just inspect them and
see if they make sense. I test the blogs which are into the cluster of web blogs, usually this type of blog composes from 17 to 19 blogs, that the results varies within different experiments, 

   However, in general the results of the clustered blogs are highly consistent with what we make sense in our mind.
  However, one problem with the k-means is that it is so sensitive to the initial selection of centroids. This could be clearly oberserved by the slightly different results carried out by different experiment. It is possibly because the k initial state of centroids.
  This problem of k-means could lead to poor clustering.

2. Explanation on results

  As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. It has been shown that there exist certain point sets on which k-means takes superpolynomial time: 2Ω(√n) to converge, but these point sets do not seem to arise in practice.

  In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

3.  the drawback of k-means

Not stable results
  It only finds a local maximum, not a global one. The clusters it comes up with depend a lot on which random cluster centers are chose initially.

Not feasible for non-continuous data
  Computation of distances between patterns with some or all features being Non-continuous is problematic, since the different types of features are not comparable and the notion of proximity is effectively binary-valued for nominal-scaled features.
  Non-continuous data, like discrete data or nominal data, with labels or names are considered as the data in Nominal scale. So the observation datas can’t represented as String, for it is impossible to calculate the distances or similarities between the things.
Not incremental
  The k-means algorithm is most successfully used on some data sets. This is because k-means algorithm is simple to implement and computationally attractive because of its linear time complexity. However, it is not feasible to use even this linear time algorithm on large data sets. Because it is not incremental for predicting new data items without extra work.