Home > mailing lists

Re: GSoC project: K-medoids clustering in Madlib - Mailing list pgsql-students

From	Maxence AHLOUCHE
Subject	Re: GSoC project: K-medoids clustering in Madlib
Date	April 21, 2013 20:46:54
Msg-id	CAJeaomW50Bt-RFZs73b6M4iDaqj8mFFrfjXzHsdhSJb4iJzOHQ@mail.gmail.com Whole thread Raw
In response to	Re: GSoC project: K-medoids clustering in Madlib (Atri Sharma <atri.jiit@gmail.com>)
Responses	Re: GSoC project: K-medoids clustering in Madlib
List	pgsql-students

Tree view

Hello!

I've pretty much improved my visualizer:

It is now able to color the points according to their assigned cluster. It has occured to me last night that it was the main inconvenient of the k-means algorithm: a point is assigned to the nearest centroid's cluster, which is easy to compute.
It's now able to load data from a file, if it has been previously generated, so it is no longer mandatory to specify the number of clusters wished when using "old" data.
It displays the original clustering and the k-means++ clustering for comparison purposes, and it is very easy to add new clusterings.
It is available on GitHub! https://github.com/viodlen/clustering_visualizer

Still, there is room for improvements:

It only tests with 2-dimensional spaces. This won't evolve, as it would get difficult to visualize.
It only uses the euclidean distance. This can be fixed, but would be heavy to implement, and probably ugly (a hashmap to match the python's distance function with the MADlib's one).
For now, the points are reparted in the clusters folowing a gaussian law (not sure of my vocabulary here). This can be easily fixed, and will probably be in a future version.

In attachment is an output example. It shows the poor results of the k-means algorithm on contiguous clusters.

It also shows the interest of the k-means++ algorithm: the small light-blue cluster (in the original clustering) is correctly identified as a complete cluster by the k-means++ algorithm, when it was only a part of a bigger cluster with the simple k-means algorithm.

The characteristics of the k-means algorithm make it easy to calculate the clusters only from the points and the centroids, but this won't be true for the k-medoids algorithm. So, sadly, if I implement the latter, it won't be possible to keep the same function signature: some more data will have to be returned, along with the centroids list.

I've also wondered if it would be useful to implement the clustering algorithms for non-float vectors (for example, strings), provided the user gives a distance function for this type?

--
Maxence Ahlouche
06 06 66 97 00
93 avenue Paul DOUMER
24100 Bergerac

Attachment

2013-04-21--1366565321_1366x768_scrot.png

pgsql-students by date:

From: Atri Sharma
Date: 20 April 2013, 17:59:48
Subject: Re: GSoC project: K-medoids clustering in Madlib

From: Atri Sharma
Date: 21 April 2013, 21:03:36
Subject: Re: GSoC project: K-medoids clustering in Madlib

Re: GSoC project: K-medoids clustering in Madlib - Mailing list pgsql-students

Attachment

Previous

Next