Home > mailing lists

Re: GSoC project: K-medoids clustering in Madlib - Mailing list pgsql-students

From	Atri Sharma
Subject	Re: GSoC project: K-medoids clustering in Madlib
Date	April 21, 2013 21:03:36
Msg-id	46950856-D053-4ED5-8E4C-9F620E15C981@gmail.com Whole thread Raw
In response to	Re: GSoC project: K-medoids clustering in Madlib (Maxence AHLOUCHE <maxence.ahlouche@gmail.com>)
Responses	Re: GSoC project: K-medoids clustering in Madlib
List	pgsql-students

Tree view

Sent from my iPad

On 21-Apr-2013, at 23:16, Maxence AHLOUCHE <maxence.ahlouche@gmail.com> wrote:

Hello!

I've pretty much improved my visualizer:
It is now able to color the points according to their assigned cluster. It has occured to me last night that it was the main inconvenient of the k-means algorithm: a point is assigned to the nearest centroid's cluster, which is easy to compute.
It's now able to load data from a file, if it has been previously generated, so it is no longer mandatory to specify the number of clusters wished when using "old" data.
It displays the original clustering and the k-means++ clustering for comparison purposes, and it is very easy to add new clusterings.
It is available on GitHub! https://github.com/viodlen/clustering_visualizer
Still, there is room for improvements:
It only tests with 2-dimensional spaces. This won't evolve, as it would get difficult to visualize.
It only uses the euclidean distance. This can be fixed, but would be heavy to implement, and probably ugly (a hashmap to match the python's distance function with the MADlib's one).
For now, the points are reparted in the clusters folowing a gaussian law (not sure of my vocabulary here). This can be easily fixed, and will probably be in a future version.
In attachment is an output example. It shows the poor results of the k-means algorithm on contiguous clusters.
It also shows the interest of the k-means++ algorithm: the small light-blue cluster (in the original clustering) is correctly identified as a complete cluster by the k-means++ algorithm, when it was only a part of a bigger cluster with the simple k-means algorithm.

The characteristics of the k-means algorithm make it easy to calculate the clusters only from the points and the centroids, but this won't be true for the k-medoids algorithm. So, sadly, if I implement the latter, it won't be possible to keep the same function signature: some more data will have to be returned, along with the centroids list.

I've also wondered if it would be useful to implement the clustering algorithms for non-float vectors (for example, strings), provided the user gives a distance function for this type?

Interesting! Good work!

Could you draw up a summary, giving your findings about the performance of different algorithms,and which one should be implemented,or both(k means++ vs k medoids).

Regards,

Atri

pgsql-students by date:

From: Maxence AHLOUCHE
Date: 21 April 2013, 20:46:54
Subject: Re: GSoC project: K-medoids clustering in Madlib

From: Maxence AHLOUCHE
Date: 22 April 2013, 00:21:22
Subject: Re: GSoC project: K-medoids clustering in Madlib

Re: GSoC project: K-medoids clustering in Madlib - Mailing list pgsql-students

Previous

Next