Thread: GSOC Student Project Idea

GSOC Student Project Idea

From: Michael Schuh
Greetings,

Hello, my name is Michael Schuh and I am a PhD student in Computer Science at Montana State University. I have never participated in GSOC before, but I am very excited to propose a project to PostgreSQL that I feel would be a great follow-up to last year's project by Alexander Korotkov (http://www.google-melange.com/gsoc/project/google/gsoc2012/akorotkov/53002). I contacted Mr. Korotkov's mentor from last year, Mr. Heikki Linnakangas, and he suggested I email this mailing list with my idea.

In brief, I would like to implement a state-of-the-art indexing algorithm (named iDistance) directly in PostgreSQL using GiST or SP-GiST trees and whatever means necessary. It is an ideal follow-up to last year's project with Mr. Korotkov, which implemented classical indexing structures for range queries. I strongly believe the community would greatly benefit from the inclusion of iDistance, which has been shown to be dramatically more effective than R-trees and KD-trees, especially for kNN queries and for data with more than 10-20 dimensions.

A major focus of my current PhD thesis is high-dimensional data indexing and retrieval, with an emphasis on applied use in content-based image retrieval (CBIR) systems. Recently, I published work introducing a new open-source implementation of iDistance in C++ (and some Python), which I believe makes me highly qualified and motivated for this opportunity. I have been strongly considering a PostgreSQL implementation for easy plug-and-play use in existing applications, but under my academic grant funding its priority has been low. Below are links to my Google Code repository and recent publication. I am happy to discuss any of this in further detail if you'd like.


Although I do not have a lot of experience with PostgreSQL development, I am eager to learn and commit my summer to enabling another fantastic feature for the community. Since iDistance is a non-recursive, data-driven, space-based partitioning strategy which builds directly onto a B+-tree, I believe the implementation should be possible using only GiST support. Please let me know if this is of any interest, or if you have any additional questions. Unfortunately, I will be unavailable most of the day, but I plan to fill out the GSOC application later this evening.
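To make the B+-tree connection concrete, below is a minimal sketch of the lossy key mapping at the heart of iDistance (assuming Euclidean distance; the function and variable names are illustrative, not taken from my existing code):

    import math

    def idistance_key(point, ref_points, c):
        # Assign the point to its nearest reference point O_i; its key
        # is i*c + dist(point, O_i). The constant c must exceed any
        # distance occurring within a partition, so per-partition key
        # ranges never overlap.
        dists = [math.dist(point, o) for o in ref_points]
        i = dists.index(min(dists))
        return i * c + dists[i]

Every point collapses to a single scalar key, which is why an ordinary B+-tree suffices for storage; the lossiness of the mapping is repaired by an exact distance check on the candidates fetched at query time.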

Thank you for your time,
Mike Schuh





Re: GSOC Student Project Idea

From: Stephen Frost
Michael,

* Michael Schuh (schuh.mike@gmail.com) wrote:
> In brief, I would like to implement a state-of-the-art indexing algorithm
> (named iDistance) directly in PostgreSQL using GiST or SP-GiST trees and
> whatever means necessary.

For my 2c, this sounds fantastic.  Excellent project idea and it sounds
like you could actually get it done.
Thanks,
    Stephen

Re: GSOC Student Project Idea

From: Alvaro Herrera
Michael Schuh wrote:

> http://www.cs.montana.edu/~timothy.wylie/files/bncod13.pdf

Um.  From the paper I get the impression that there's still much to
learn before this indexing method can be made really effective?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GSOC Student Project Idea

From: Alexander Korotkov
Michael,

On Mon, Apr 22, 2013 at 4:28 PM, Michael Schuh <schuh.mike@gmail.com> wrote:
> Although I do not have a lot of experience with PostgreSQL development, I am eager to learn and commit my summer to enabling another fantastic feature for the community. Since iDistance is a non-recursive, data-driven, space-based partitioning strategy which builds directly onto a B+-tree, I believe the implementation should be possible using only GiST support. Please let me know if this is of any interest, or if you have any additional questions. Unfortunately, I will be unavailable most of the day, but I plan to fill out the GSOC application later this evening.

I've taken a brief look at the paper and implementation. As far as I can see, iDistance implements a global building strategy. I mean, for example, it selects some point, calculates the distances from that point to all points in the dataset, and so on. So it uses the whole dataset at the same time.

GiST and SP-GiST behave differently: their interface functions operate on, at most, a very small portion of the data (whatever fits on a single disk page). You can try to fit iDistance into the GiST or SP-GiST interface, but will it be as efficient that way? I experimented with fitting a VP-tree (which looks similar to iDistance at first glance) into SP-GiST, and my results were discouraging: it was useless with Levenshtein distance, and it was worse than an R-tree with multidimensional points. So I think we really need a global index-building algorithm in order to reveal the power of such data structures, but GiST and SP-GiST currently don't support that.
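To illustrate what "global" means here, a rough sketch of one data-based partitioning option described in the iDistance literature (k-means reference points; this is purely illustrative, not code from the paper or from PostgreSQL):

    import math
    import random

    def choose_reference_points(dataset, k, iters=10):
        # Global build step: pick k reference points with plain k-means.
        # This is the part that needs the whole dataset at once, which
        # the page-at-a-time (SP-)GiST interface does not provide.
        centers = random.sample(dataset, k)
        for _ in range(iters):
            buckets = [[] for _ in range(k)]
            for p in dataset:
                i = min(range(k), key=lambda j: math.dist(p, centers[j]))
                buckets[i].append(p)
            centers = [
                [sum(col) / len(b) for col in zip(*b)] if b else centers[i]
                for i, b in enumerate(buckets)
            ]
        return centers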

However, you can try to implement global index building in GiST or SP-GiST. In this case I think you should carefully estimate what you can accomplish within a single GSoC project. You would need to extend the GiST or SP-GiST interface and write a completely new implementation of the tree-building algorithm. The question of how exactly to extend the GiST or SP-GiST interface for this case could turn out to be very hard, even theoretically.

In short, my opinion is this: don't expect a miracle from just implementing the GiST or SP-GiST interface functions.

------
With best regards,
Alexander Korotkov.

Re: GSOC Student Project Idea

From: Florian Pflug
On Apr23, 2013, at 23:25 , Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I've taken a brief look at the paper and implementation. As far as I can
> see, iDistance implements a global building strategy. I mean, for
> example, it selects some point, calculates the distances from that point
> to all points in the dataset, and so on. So it uses the whole dataset at
> the same time.
>
> However, you can try to implement global index building in GiST or
> SP-GiST. In this case I think you should carefully estimate what you can
> accomplish within a single GSoC project. You would need to extend the
> GiST or SP-GiST interface and write a completely new implementation of
> the tree-building algorithm. The question of how exactly to extend the
> GiST or SP-GiST interface for this case could turn out to be very hard,
> even theoretically.

+1. That seemed to be a major roadblock to me too when I read the paper.

You could work around that by making partition identification a separate
step. You'd have a function

  idist_analyze(cfg name, table name, field name)

which'd identify suitable partitions for the data distribution in
table.field and store them somewhere. Such a set of pre-identified
partitions would be akin to a tsearch configuration, i.e. all other parts
of the iDistance machinery would use it to map points to index keys and
queries to ranges of those keys. You'll want to look at how tsearch
handles that, and check if the method can indeed be applied to iDistance.

In a first cut, you'd probably only allow inserts into the index which
don't change the maximum distances from the partition centers that
idist_analyze() found. With that restriction in place, you might get away
without GiST or SP-GiST, and simply use postgres' standard btree, if you
find a way to map queries of the form "field <idistance-knn-operator>
idistance_knn_function(point, distance)" to "where (idistance_keys(field)
between P1_lowerbound AND P1_upperbound) OR (idistance_keys(field) between
P2_lowerbound AND P2_upperbound) ..." or something equivalent. Or you
could start with a GiST-based btree and extend it to support a "distance"
function. Looking at contrib/btree_gist and the built-in GiST operator
class for point types should help.
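For illustration, here is a sketch of how such a query could be decomposed
into key ranges, assuming the reference points and per-partition maximum
radii are available from the idist_analyze() step (all names are made up
for this example):

    import math

    def key_ranges(q, radius, ref_points, max_dists, c):
        # One candidate key range per partition that the query sphere
        # can intersect; each range is clipped to the partition's
        # observed maximum radius.
        ranges = []
        for i, (o, r_max) in enumerate(zip(ref_points, max_dists)):
            d = math.dist(q, o)
            if d - radius > r_max:      # query sphere misses partition i
                continue
            lo = i * c + max(0.0, d - radius)
            hi = i * c + min(r_max, d + radius)
            ranges.append((lo, hi))
        return ranges

Each returned (lo, hi) pair would become one BETWEEN clause, OR'd together
as above, followed by an exact distance recheck on the fetched candidates.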

best regards,
Florian Pflug




Re: GSOC Student Project Idea

From: Michael Schuh
On Wed, Apr 24, 2013 at 5:31 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Apr23, 2013, at 23:25 , Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I've taken a brief look at the paper and implementation. As far as I can see, iDistance implements a global building strategy. I mean, for example, it selects some point, calculates the distances from that point to all points in the dataset, and so on. So it uses the whole dataset at the same time.
>>
>> However, you can try to implement global index building in GiST or SP-GiST. In this case I think you should carefully estimate what you can accomplish within a single GSoC project. You would need to extend the GiST or SP-GiST interface and write a completely new implementation of the tree-building algorithm. The question of how exactly to extend the GiST or SP-GiST interface for this case could turn out to be very hard, even theoretically.
>
> +1. That seemed to be a major roadblock to me too when I read the paper.
>
> You could work around that by making partition identification a separate step. You'd have a function
>
>   idist_analyze(cfg name, table name, field name)
>
> which'd identify suitable partitions for the data distribution in table.field and store them somewhere. Such a set of pre-identified partitions would be akin to a tsearch configuration, i.e. all other parts of the iDistance machinery would use it to map points to index keys and queries to ranges of those keys. You'll want to look at how tsearch handles that, and check if the method can indeed be applied to iDistance.
>
> In a first cut, you'd probably only allow inserts into the index which don't change the maximum distances from the partition centers that idist_analyze() found. With that restriction in place, you might get away without GiST or SP-GiST, and simply use postgres' standard btree, if you find a way to map queries of the form "field <idistance-knn-operator> idistance_knn_function(point, distance)" to "where (idistance_keys(field) between P1_lowerbound AND P1_upperbound) OR (idistance_keys(field) between P2_lowerbound AND P2_upperbound) ..." or something equivalent.


Thank you both for the very helpful feedback. Perhaps the scope of this project (the application's "completeness criteria") is better framed as feasibility prototyping of the global, distance-based index strategy, with B+-tree and/or GiST extension possibilities.

Let me briefly elaborate on the current implementation details related to your comments. The public C++ version uses a stand-alone STX B+-tree implementation, and it has only been used in the research context of one-time data loading and indexing followed by kNN retrieval performance tests. This requires an upfront global partition build (a sequential pass over all data points) and a bit of overhead information about the partitions (such as reference point locations and the maximum distance within each). Of course, there are a lot of improvements, modifications, variants, etc., that can be built on this basic setup... but that's mostly my in-progress thesis work, not yet published or defended :)

A given query is broken down into range(s) of key values in the B+-tree, based on the negligible overhead info kept from partitioning and on where the partitions are located with respect to the query. Only then does this small subset of pages need to be loaded from disk. It is therefore necessary to have left/right leaf pointers within the B+-tree. While I think a DBMS-integrated implementation would be more ideal for general use by everyone, my naive assumption is that I could probably implement the iDistance functionality in stored procedures and metadata tables (at the cost of performance and usability).

Best regards,
Mike Schuh


Re: GSOC Student Project Idea

From: Heikki Linnakangas
On 24.04.2013 14:31, Florian Pflug wrote:
> On Apr23, 2013, at 23:25 , Alexander Korotkov <aekorotkov@gmail.com>
> wrote:
>> I've taken a brief look at the paper and implementation. As far as I
>> can see, iDistance implements a global building strategy. I mean,
>> for example, it selects some point, calculates the distances from
>> that point to all points in the dataset, and so on. So it uses the
>> whole dataset at the same time.
>>
>> However, you can try to implement global index building in GiST or
>> SP-GiST. In this case I think you should carefully estimate what you
>> can accomplish within a single GSoC project. You would need to
>> extend the GiST or SP-GiST interface and write a completely new
>> implementation of the tree-building algorithm. The question of how
>> exactly to extend the GiST or SP-GiST interface for this case could
>> turn out to be very hard, even theoretically.
>
> +1. That seemed to be a major roadblock to me too when I read the
> paper.
>
> You could work around that by making partition identification a
> separate step. You'd have a function
>
>   idist_analyze(cfg name, table name, field name)
>
> which'd identify suitable partitions for the data distribution in
> table.field and store them somewhere. Such a set of pre-identified
> partitions would be akin to a tsearch configuration, i.e. all other
> parts of the iDistance machinery would use it to map points to index
> keys and queries to ranges of those keys. You'll want to look at how
> tsearch handles that, and check if the method can indeed be applied
> to iDistance.

You could perform that step as part of the index build. Before the index 
build starts to add tuples to the index, it could scan a random sample 
of the heap and identify the partitions based on that.
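As a sketch of that idea (standard reservoir sampling over some iterable
heap scan; this is not PostgreSQL's actual sampling code):

    import random

    def reservoir_sample(heap_scan, n):
        # One pass over the heap, keeping a uniform random sample of n
        # tuples to feed the partition-identification step before the
        # index build proper starts.
        sample = []
        for t, tup in enumerate(heap_scan):
            if t < n:
                sample.append(tup)
            else:
                j = random.randint(0, t)
                if j < n:
                    sample[j] = tup
        return sample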

If you need to store the metadata, like a map of partitions, it becomes 
difficult to cajole this into a normal GiST or SP-GiST opclass. The API 
doesn't have any support for storing such metadata.

> In a first cut, you'd probably only allow inserts into the index which
> don't change the maximum distances from the partition centers that
> idist_analyze() found.

That seems like a pretty serious restriction. I'd try to write it so 
that you can insert any value, but if the new values are very different 
from any existing values, it would be OK for the index quality to 
degrade. For example, you could simply add any out-of-bounds values to a 
separate branch in the index, which would have no particular structure 
and would just have to be scanned on every query. You can probably do 
better than that, but that would be a trivial way to deal with it.
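A trivial sketch of that insert path (hypothetical names and btree API,
just to make the idea concrete):

    import math

    def insert(btree, overflow, point, ref_points, max_dists, c):
        # Route the point to its nearest partition; if it would stretch
        # that partition's maximum radius, append it to an unstructured
        # overflow branch that every query must also scan.
        dists = [math.dist(point, o) for o in ref_points]
        i = dists.index(min(dists))
        if dists[i] <= max_dists[i]:
            btree.insert(i * c + dists[i], point)  # hypothetical API
        else:
            overflow.append(point)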

- Heikki



Re: GSOC Student Project Idea

From: Heikki Linnakangas
On 24.04.2013 22:10, Michael Schuh wrote:
> Thank you both for the very helpful feedback. Perhaps the scope of this
> project (the application's "completeness criteria") is better framed as
> feasibility prototyping of the global, distance-based index strategy,
> with B+-tree and/or GiST extension possibilities.

For GSoC, we'd really like to see some code that can be committed as a 
result. Prototyping is important, but if that's all you do during the 
summer, the work is likely going to waste if no-one is going to work 
actively on the prototype afterwards.

At this point, I think we need a more concrete plan on how this would be 
implemented. The idea of using a regular B-tree for this, with some 
functions to do the partition mapping might work. However, that would be 
a clunky interface - I don't think that would be accepted into 
PostgreSQL. So I don't think that makes a good GSoC project.

If you think this can be done with the existing GiST or SP-GiST APIs, 
I'd like to see a more concrete plan on how that would work. What 
datatype would this be for? How would the partitioning be done? If the 
APIs need to be extended, what would the extensions look like? The 
summer is short, so there isn't much time for exploration - we need to 
have a pretty good idea of what the result will look like, right now.

- Heikki



Re: GSOC Student Project Idea

From: Jim Nasby
On 5/8/13 3:54 AM, Heikki Linnakangas wrote:
> On 24.04.2013 14:31, Florian Pflug wrote:
>> On Apr23, 2013, at 23:25 , Alexander Korotkov <aekorotkov@gmail.com>
>> wrote:
>>> I've taken a brief look at the paper and implementation. As far as I
>>> can see, iDistance implements a global building strategy. I mean,
>>> for example, it selects some point, calculates the distances from
>>> that point to all points in the dataset, and so on. So it uses the
>>> whole dataset at the same time.
>>>
>>> However, you can try to implement global index building in GiST or
>>> SP-GiST. In this case I think you should carefully estimate what you
>>> can accomplish within a single GSoC project. You would need to
>>> extend the GiST or SP-GiST interface and write a completely new
>>> implementation of the tree-building algorithm. The question of how
>>> exactly to extend the GiST or SP-GiST interface for this case could
>>> turn out to be very hard, even theoretically.
>>
>> +1. That seemed to be a major roadblock to me too when I read the
>> paper.
>>
>> You could work around that by making partition identification a
>> separate step. You'd have a function
>>
>>   idist_analyze(cfg name, table name, field name)
>>
>> which'd identify suitable partitions for the data distribution in
>> table.field and store them somewhere. Such a set of pre-identified
>> partitions would be akin to a tsearch configuration, i.e. all other
>> parts of the iDistance machinery would use it to map points to index
>> keys and queries to ranges of those keys. You'll want to look at how
>> tsearch handles that, and check if the method can indeed be applied
>> to iDistance.
>
> You could perform that step as part of the index build. Before the
> index build starts to add tuples to the index, it could scan a random
> sample of the heap and identify the partitions based on that.
>
> If you need to store the metadata, like a map of partitions, it
> becomes difficult to cajole this into a normal GiST or SP-GiST
> opclass. The API doesn't have any support for storing such metadata.
>
>> In a first cut, you'd probably only allow inserts into the index
>> which don't change the maximum distances from the partition centers
>> that idist_analyze() found.
>
> That seems like a pretty serious restriction. I'd try to write it so
> that you can insert any value, but if the new values are very
> different from any existing values, it would be OK for the index
> quality to degrade. For example, you could simply add any
> out-of-bounds values to a separate branch in the index, which would
> have no particular structure and would just have to be scanned on
> every query. You can probably do better than that, but that would be
> a trivial way to deal with it.

Or you could use the new insert to start a new partition.

Heck, maybe the focus should actually be on partitions and not
individual records/points. ISTM the entire challenge here is figuring
out a way to maintain a set of partitions that:

- Are limited enough in number that you can quickly perform
  operations/searches across all partitions
- Yet small enough that once you've narrowed down a set of partitions
  you don't have a ton of raw records to still look at

Before we had range types I experimented with representing time ranges
as rectangles of varying size (ie: for (start, end), create
rectangle(point(start,start), point(end,end))). The problem with that
is you had to convert timestamp into a float, which was not exact. So
when querying you could use a GiST index on all the rectangles to
narrow your scope, but you still needed a set of exact clauses (ie:
start >= now() - '1 year' AND end <= now()). Partitions would be
similar in that they wouldn't be exact but could greatly narrow the
search space (of course we'd want to handle the secondary exact
checking internally instead of exposing the user to that).
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: GSOC Student Project Idea

From: Michael Schuh

On Wed, May 8, 2013 at 1:48 PM, Jim Nasby <jim@nasby.net> wrote:
> On 5/8/13 3:54 AM, Heikki Linnakangas wrote:
>> On 24.04.2013 14:31, Florian Pflug wrote:
>>> On Apr23, 2013, at 23:25 , Alexander Korotkov <aekorotkov@gmail.com>
>>> wrote:
>>>> I've taken a brief look at the paper and implementation. As far as
>>>> I can see, iDistance implements a global building strategy. I mean,
>>>> for example, it selects some point, calculates the distances from
>>>> that point to all points in the dataset, and so on. So it uses the
>>>> whole dataset at the same time.
>>>>
>>>> However, you can try to implement global index building in GiST or
>>>> SP-GiST. In this case I think you should carefully estimate what
>>>> you can accomplish within a single GSoC project. You would need to
>>>> extend the GiST or SP-GiST interface and write a completely new
>>>> implementation of the tree-building algorithm. The question of how
>>>> exactly to extend the GiST or SP-GiST interface for this case could
>>>> turn out to be very hard, even theoretically.
>>>
>>> +1. That seemed to be a major roadblock to me too when I read the
>>> paper.
>>>
>>> You could work around that by making partition identification a
>>> separate step. You'd have a function
>>>
>>>   idist_analyze(cfg name, table name, field name)
>>>
>>> which'd identify suitable partitions for the data distribution in
>>> table.field and store them somewhere. Such a set of pre-identified
>>> partitions would be akin to a tsearch configuration, i.e. all other
>>> parts of the iDistance machinery would use it to map points to index
>>> keys and queries to ranges of those keys. You'll want to look at how
>>> tsearch handles that, and check if the method can indeed be applied
>>> to iDistance.
>>
>> You could perform that step as part of the index build. Before the
>> index build starts to add tuples to the index, it could scan a random
>> sample of the heap and identify the partitions based on that.
>>
>> If you need to store the metadata, like a map of partitions, it
>> becomes difficult to cajole this into a normal GiST or SP-GiST
>> opclass. The API doesn't have any support for storing such metadata.
>>
>>> In a first cut, you'd probably only allow inserts into the index
>>> which don't change the maximum distances from the partition centers
>>> that idist_analyze() found.
>>
>> That seems like a pretty serious restriction. I'd try to write it so
>> that you can insert any value, but if the new values are very
>> different from any existing values, it would be OK for the index
>> quality to degrade. For example, you could simply add any
>> out-of-bounds values to a separate branch in the index, which would
>> have no particular structure and would just have to be scanned on
>> every query. You can probably do better than that, but that would be
>> a trivial way to deal with it.
>
> Or you could use the new insert to start a new partition.
>
> Heck, maybe the focus should actually be on partitions and not
> individual records/points. ISTM the entire challenge here is figuring
> out a way to maintain a set of partitions that:
>
> - Are limited enough in number that you can quickly perform
>   operations/searches across all partitions
> - Yet small enough that once you've narrowed down a set of partitions
>   you don't have a ton of raw records to still look at
>
> Before we had range types I experimented with representing time ranges
> as rectangles of varying size (ie: for (start, end), create
> rectangle(point(start,start), point(end,end))). The problem with that
> is you had to convert timestamp into a float, which was not exact. So
> when querying you could use a GiST index on all the rectangles to
> narrow your scope, but you still needed a set of exact clauses (ie:
> start >= now() - '1 year' AND end <= now()). Partitions would be
> similar in that they wouldn't be exact but could greatly narrow the
> search space (of course we'd want to handle the secondary exact
> checking internally instead of exposing the user to that).

 
I appreciate all the responses, and I think everyone has more-or-less confirmed the scope of the project proposal I submitted. It was hard to find time during the final weeks of the semester to explore the (SP-)GiST interfaces in depth, but given the responses here, it seems the fully integrated implementation is clearly beyond the scope of a summer project, which I agree with. Instead, I proposed my original plan, which can surely be accomplished over the summer and which is, in essence, what Florian Pflug and the others have discussed here.

In short, I will use only the btree in PostgreSQL to store single-dimensional values mapped from multi-dimensional point data, and then query ranges of these values in the btree based on partition information stored separately. That information can be gathered upfront and periodically updated as needed; done properly, this will not require downtime or reshuffling of the btree. The result will be a fully usable index, and the project will also include a performance and usability assessment at the end.
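To sketch why no reshuffling is needed (hypothetical names; keys are
i*c + dist(p, O_i) as in the paper): as long as the reference points and
the constant c stay fixed, stored keys never move, so a periodic refresh
only recomputes the per-partition radii used to clip query ranges:

    def refresh_partition_info(keys, c, k):
        # Recompute per-partition maximum radii from the stored keys
        # alone. Since key = i*c + dist(p, O_i), both the partition and
        # the distance can be recovered from a key; the btree itself is
        # never touched.
        max_dists = [0.0] * k
        for key in keys:
            i = int(key // c)
            max_dists[i] = max(max_dists[i], key - i * c)
        return max_dists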

I still believe this would be an excellent GSoC project, one that I am highly motivated to make fully accessible to a community that could really use something like this. The only reason I used the word 'prototype' was to concede several realistic points, the first being the likely "clunky interface" Heikki Linnakangas alluded to (which will be improved immediately if performance results warrant it). The second is several of the algorithmic tuning functions, which will be functional, but which I doubt will be *perfected* for ideal use in only a few short months (hard to expect that from any project, I would think). I thank you all for your time and thoughts.

- Mike Schuh