Re: GSoC project: K-medoids clustering in Madlib - Mailing list pgsql-students

From Maxence AHLOUCHE
Subject Re: GSoC project: K-medoids clustering in Madlib
Date
Msg-id CAJeaomW50Bt-RFZs73b6M4iDaqj8mFFrfjXzHsdhSJb4iJzOHQ@mail.gmail.com
Whole thread Raw
In response to Re: GSoC project: K-medoids clustering in Madlib  (Atri Sharma <atri.jiit@gmail.com>)
Responses Re: GSoC project: K-medoids clustering in Madlib
List pgsql-students
Hello!

I've pretty much improved my visualizer:
  • It is now able to color the points according to their assigned cluster. It has occured to me last night that it was the main inconvenient of the k-means algorithm: a point is assigned to the nearest centroid's cluster, which is easy to compute.
  • It's now able to load data from a file, if it has been previously generated, so it is no longer mandatory to specify the number of clusters wished when using "old" data.
  • It displays the original clustering and the k-means++ clustering for comparison purposes, and it is very easy to add new clusterings.
  • It is available on GitHub! https://github.com/viodlen/clustering_visualizer

Still, there is room for improvements:

  • It only tests with 2-dimensional spaces. This won't evolve, as it would get difficult to visualize.
  • It only uses the euclidean distance. This can be fixed, but would be heavy to implement, and probably ugly (a hashmap to match the python's distance function with the MADlib's one).
  • For now, the points are reparted in the clusters folowing a gaussian law (not sure of my vocabulary here). This can be easily fixed, and will probably be in a future version.

In attachment is an output example. It shows the poor results of the k-means algorithm on contiguous clusters.

It also shows the interest of the k-means++ algorithm: the small light-blue cluster (in the original clustering) is correctly identified as a complete cluster by the k-means++ algorithm, when it was  only a part of a bigger cluster with the simple k-means algorithm.


The characteristics of the k-means algorithm make it easy to calculate the clusters only from the points and the centroids, but this won't be true for the k-medoids algorithm. So, sadly, if I implement the latter, it won't be possible to keep the same function signature: some more data will have to be returned, along with the centroids list.


I've also wondered if it would be useful to implement the clustering algorithms for non-float vectors (for example, strings), provided the user gives a distance function for this type?

--
Maxence Ahlouche
06 06 66 97 00
93 avenue Paul DOUMER
24100 Bergerac
Attachment

pgsql-students by date:

Previous
From: Atri Sharma
Date:
Subject: Re: GSoC project: K-medoids clustering in Madlib
Next
From: Atri Sharma
Date:
Subject: Re: GSoC project: K-medoids clustering in Madlib