Thread: GSoC project : K-medoids clustering in Madlib

GSoC project : K-medoids clustering in Madlib

From
viod
Date:
Hello!

I'm an IT student, and I would like to apply for the 2013 GSoC.
I've been following this mailing list for a while now, and I saw a GSoC suggestion that particularly interested me: implementing K-medoids clustering in MADlib, as it is supposed to be more efficient than the K-means algorithm.

I didn't know about these algorithms before, but I have read up on them, and they look quite interesting to me, all the more so as I am currently taking classes on the topic (unfortunately very simplified ones).

I've got a few questions:
Won't this be quite a short project? I can't estimate how long it would take me to implement this algorithm in a way that would be usable by PostgreSQL, but 3 months seems long for this task, doesn't it?

Someone on the IRC channel (can't remember who, sorry) told me it was used in the KNN index. I guess this is used by pg_trgm, but are there other modules using it currently?
And could you please give me some links explaining the internals of this index? I've been through several articles presenting it, but none were very satisfying.

Thanks a lot in advance!
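
For reference, the algorithm under discussion can be sketched in a few lines. This is only a minimal illustration in plain Python of the alternating (assign, then re-pick medoids) variant of K-medoids, not MADlib code and not necessarily the variant a GSoC project would implement. The names `k_medoids`, `assign` and `euclidean` are made up for this sketch, and the initial medoids are assumed to be supplied by the caller as indices into the point list.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any other dissimilarity function could be swapped in."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def assign(points: Sequence[Point], medoids: List[int],
           dist: Callable[[Point, Point], float]) -> List[List[int]]:
    """Group point indices by their nearest medoid (ties go to the first medoid)."""
    clusters: List[List[int]] = [[] for _ in medoids]
    for i, p in enumerate(points):
        nearest = min(range(len(medoids)),
                      key=lambda c: dist(p, points[medoids[c]]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points: Sequence[Point], initial_medoids: Sequence[int],
              dist: Callable[[Point, Point], float] = euclidean,
              max_iter: int = 100) -> Tuple[List[int], List[List[int]]]:
    """Return (medoid indices, clusters of point indices)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids, dist)
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:
                # Only possible with duplicate medoids; keep the old one.
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total
            # distance to every other member of its cluster.
            best = min(members,
                       key=lambda m: sum(dist(points[m], points[o])
                                         for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, assign(points, medoids, dist)
```

Unlike K-means, each cluster centre here is always one of the input points, which is why only a pairwise distance function is needed. Note that the update step in this sketch is quadratic in the cluster size, so K-medoids is usually valued for its robustness to outliers rather than for speed.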

Re: GSoC project : K-medoids clustering in Madlib

From
Atri Sharma
Date:
I suggested a couple of algorithms to be implemented in MADlib (apart
from K-medoids). You could pick some (or all) of them, which together
would take about 3 months to complete.

For more information on the index, you can refer to

http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1

along with the rest of the Postgres wiki. The wiki is the standard
reference for anything Postgres.

pg_trgm uses KNN, but I believe it has its own implementation of the
algorithm. The idea I proposed aims at writing an implementation in
MADlib so that any client program can use the algorithm(s) directly
in its code, through MADlib functions.

Regards,

Atri

On 3/26/13, viod <viod.len@gmail.com> wrote:
> Hello!
>
> I'm an IT student, and I would like to apply for the 2013 GSoC.
> I've been looking at this mailing list for a while now, and I saw a
> suggestion for GSoC that particularly interested me: implementing the
> K-medoids clustering in Madlib, as it is supposed to be more efficient than
> the K-means algorithm.
>
> I didn't know about these algorithms before, but I have documented myself,
> and it looks quite interesting to me, and even more as I currently have
> lessons (but very very simplified unfortunately).
>
> I've got a few questions:
> Won't this be a quite short project? I can't get an idea of how long it
> would take me to implement this algorithm in a way that would be usable by
> postgresql, but 3 months looks long for this task, doesn't it?
>
> Someone on the IRC channel (can't remember who, sorry) told me it was used
> in the KNN index. I guess this is used by pg_trgm, but are there other
> modules using it currently?
> And could you please give me some links explaining the internals of this
> index? I've been through several articles presenting of it, but none very
> satisfying.
>
> Thanks a lot in advance!
>





Re: GSoC project : K-medoids clustering in Madlib

From
Tom Lane
Date:
Atri Sharma <atri.jiit@gmail.com> writes:
> I suggested a couple of algorithms to be implemented in MADLib(apart
> from K Medoids). You could pick some(or all) of them, which would
> require 3 months to be completed.

> As for more information on index, you can refer

> http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1

> along with the postgres wiki. The wiki is the standard for anything postgres.

> pg_trgm used KNN, but I believe it uses its own implementation of the
> algorithm. The idea I proposed aims at writing an implementation in
> the MADlib so that any client program can use the algorithm(s) in
> their code directly, using MADlib functions.

I'm a bit confused as to why this is being proposed as a
Postgres-related project.  I don't even know what MADlib is, but I'm
pretty darn sure that no part of Postgres uses it.  KNNGist certainly
doesn't.
        regards, tom lane



Re: GSoC project : K-medoids clustering in Madlib

From
Daniel Farina
Date:
On Tue, Mar 26, 2013 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> I suggested a couple of algorithms to be implemented in MADLib(apart
>> from K Medoids). You could pick some(or all) of them, which would
>> require 3 months to be completed.
>
>> As for more information on index, you can refer
>
>> http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1
>
>> along with the postgres wiki. The wiki is the standard for anything postgres.
>
>> pg_trgm used KNN, but I believe it uses its own implementation of the
>> algorithm. The idea I proposed aims at writing an implementation in
>> the MADlib so that any client program can use the algorithm(s) in
>> their code directly, using MADlib functions.
>
> I'm a bit confused as to why this is being proposed as a
> Postgres-related project.  I don't even know what MADlib is, but I'm
> pretty darn sure that no part of Postgres uses it.  KNNGist certainly
> doesn't.

It's a reasonably well established extension for Postgres for
statistical and machine learning methods.  Rather neat, but as you
indicate, it's not part of Postgres proper.

http://madlib.net/

https://github.com/madlib/madlib/

--
fdr



Re: GSoC project : K-medoids clustering in Madlib

From
Atri Sharma
Date:
>> I'm a bit confused as to why this is being proposed as a
>> Postgres-related project.  I don't even know what MADlib is, but I'm
>> pretty darn sure that no part of Postgres uses it.  KNNGist certainly
>> doesn't.
>
> It's a reasonably well established extension for Postgres for
> statistical and machine learning methods.  Rather neat, but as you
> indicate, it's not part of Postgres proper.
>
> http://madlib.net/
>
> https://github.com/madlib/madlib/
>

It is the extension that is normally referred to when we talk about
data analytics in Postgres. As you said, it is not part of Postgres
proper, but IMO, if we want to extend the data analytics
functionality of Postgres, we need to work on MADlib.

Regards,

Atri






Re: GSoC project : K-medoids clustering in Madlib

From
Heikki Linnakangas
Date:
On 27.03.2013 08:51, Atri Sharma wrote:
>>> I'm a bit confused as to why this is being proposed as a
>>> Postgres-related project.  I don't even know what MADlib is, but I'm
>>> pretty darn sure that no part of Postgres uses it.  KNNGist certainly
>>> doesn't.
>>
>> It's a reasonably well established extension for Postgres for
>> statistical and machine learning methods.  Rather neat, but as you
>> indicate, it's not part of Postgres proper.
>>
>> http://madlib.net/
>>
>> https://github.com/madlib/madlib/
>
> It is the extension that is normally referred to when we talk about
> data analytics in Postgres. As you said, it is not part of postgres
> proper,but IMO, if we want to extend the data analytics
> functionalities of postgres, we need to work on MADlib.

Perhaps we could do this under the PostgreSQL organization, but we'd
definitely need to get someone from the MADlib project to mentor it.

But it would be even better if MADlib applied to GSoC as an
independent organization. The deadline for organization applications is
March 29th, so if the MADlib people are interested in that, they need
to hurry and send the application right now.

- Heikki



Re: GSoC project : K-medoids clustering in Madlib

From
Atri Sharma
Date:
> But it would be even better if MADLib would apply to GSoC as an independent
> organization. The deadline for organization applications is on March 29th,
> so if the MADLIb people are interested in that, they need to hurry and send
> the application right now.

Agreed. Is there any way we could add in-house support for basic data
analytics, maybe as a proper Postgres extension?

Regards,

Atri





Re: GSoC project : K-medoids clustering in Madlib

From
Thom Brown
Date:
On 27 March 2013 08:12, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 27.03.2013 08:51, Atri Sharma wrote:
>>>>
>>>> I'm a bit confused as to why this is being proposed as a
>>>> Postgres-related project.  I don't even know what MADlib is, but I'm
>>>> pretty darn sure that no part of Postgres uses it.  KNNGist certainly
>>>> doesn't.
>>>
>>>
>>> It's a reasonably well established extension for Postgres for
>>> statistical and machine learning methods.  Rather neat, but as you
>>> indicate, it's not part of Postgres proper.
>>>
>>> http://madlib.net/
>>>
>>> https://github.com/madlib/madlib/
>>
>>
>> It is the extension that is normally referred to when we talk about
>> data analytics in Postgres. As you said, it is not part of postgres
>> proper,but IMO, if we want to extend the data analytics
>> functionalities of postgres, we need to work on MADlib.
>
>
> Perhaps we could do this under the PostgreSQL organization, but we'd
> definitely need to get someone from the MADLib project to mentor it.
>
> But it would be even better if MADLib would apply to GSoC as an independent
> organization. The deadline for organization applications is on March 29th,
> so if the MADLIb people are interested in that, they need to hurry and send
> the application right now.

It would also help if they were able to get in contact so that I could
add them as a project we'd vouch for as part of our own application.

--
Thom