Re: GSOC Student Project Idea - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: GSOC Student Project Idea
Date
Msg-id 518AAC20.4040906@nasby.net
Whole thread Raw
In response to Re: GSOC Student Project Idea  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: GSOC Student Project Idea  (Michael Schuh <schuh.mike@gmail.com>)
List pgsql-hackers
On 5/8/13 3:54 AM, Heikki Linnakangas wrote:
> On 24.04.2013 14:31, Florian Pflug wrote:
>> On Apr23, 2013, at 23:25 , Alexander Korotkov<aekorotkov@gmail.com>
>> wrote:
>>> I've taken a brief look on the paper and implementation. As I can
>>> see iDistance implements some global building strategy. I mean, for
>>> example, it selects some point, calculates distances from selected
>>> point to all points in dataset etc. So, it uses the whole dataset
>>> at the same time.
>>>
>>> However you can try to implement global index building in GiST or
>>> SP-GiST. In this case I think you should carefully estimate your
>>> capabilities during single GSoC project. You would need to extend
>>> GiST or SP-GiST interface and write completely new implementation
>>> of tree building algorithm. Question of how to exactly extend GiST
>>> or SP-GiST interface for this case could appear to be very hard
>>> even theoretically.
>>
>> +1. That seemed to be a major roadblock to me too when I read the
>> paper.
>>
>> You could work around that by making partition identification a
>> separate step. You'd have a function
>>
>> idist_analyze(cfg name, table name, field name)
>>
>> which'd identify suitable partitions for the data distribution in
>> table.field and store them somewhere. Such a set of pre-identified
>> partitions would be akin to a tsearch configuration, i.e. all other
>> parts of the iDistance machinery would use it to map points to index
>> keys and queries to ranges of those keys. You'll want to look at how
>> tsearch handles that, and check if the method can indeed be applied
>> to iDistance.
>
> You could perform that step as part of the index build. Before the index build starts to add tuples to the index, it
couldscan a random sample of the heap and identify the partitions based on that.
 
>
> If you need to store the metadata, like a map of partitions, it becomes difficult to cajole this into a normal GiST
orSP-GiST opclass. The API doesn't have any support for storing such metadata.
 
>
>> In a first cut, you'd probably only allow inserts into index which
>> don't change the maximum distances from the partition centers that
>> idist_analyze() found.
>
> That seems like a pretty serious restriction. I'd try to write it so that you can insert any value, but if the new
valuesare very different from any existing values, it would be OK for the index quality to degrade. For example, you
couldsimply add any out-of-bounds values to a separate branch in the index, which would have no particular structure
andwould just have to be scanned on every query. You can probably do better than that, but that would be a trivial way
todeal with it.
 

Or you could use the new insert to start a new partition.

Heck, maybe the focus should actually be on partitions and not individual records/points. ISTM the entire challenge
hereis figuring out a way to maintain a set of partitions that:
 

- Are limited enough in number that you can quickly perform operations/searches across all partitions
- Yet small enough that once you've narrowed down a set of partitions you don't have a ton of raw records to still look
at

Before we had range types I experimented with representing time ranges as rectangles of varying size (ie: for (start,
end),create rectangle(point(start,start), point(end,end)). The problem with that is you had to convert timestamp into a
float,which was not exact. So when querying you could use a GiST index on all the rectangles to narrow your scope, but
youstill needed a set of exact clauses (ie: start >= now() - '1 year' AND end <= now()). Partitions would be similar in
thatthey wouldn't be exact but could greatly narrow the search space (of course we'd want to handle the secondary exact
checkinginternally instead of exposing the user to that).
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



pgsql-hackers by date:

Previous
From: Ryan Kelly
Date:
Subject: Re: RETURNING syntax for COPY
Next
From: Bruce Momjian
Date:
Subject: Re: [GENERAL] pg_upgrade fails, "mismatch of relation OID" - 9.1.9 to 9.2.4