Re: Hash index todo list item - Mailing list pgsql-hackers

From Tom Raney
Subject Re: Hash index todo list item
Date
Msg-id 46F9F176.1060705@comcast.net
Whole thread Raw
In response to Re: Hash index todo list item  (Kenneth Marshall <ktm@rice.edu>)
List pgsql-hackers
Kenneth Marshall wrote:
> On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
>   
>> "Kenneth Marshall" <ktm@rice.edu> writes:
>>
>>     
>>> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>>>
>>>       
>>>> Using our implementation, build times and index sizes are
>>>> comparable with btree index build times and index sizes.
>>>>         
>>> ...
>>> That is super! (and timely)
>>>       
>> It is pretty super. I have a few comments to raise but don't take it to be too
>> negative, it sounds like this is a big step forward towards making hash
>> indexes valuable.
>>
>> Firstly in the graphs it seems the index size graph has an exponential x-axis
>> but a linear y-axis. This makes it hard to see what I would expect to be
>> pretty much linear growth. The curves look exponential which would mean linear
>> growth but of course it's hard to tell.
>>
>> Also, the growth in the time chart looks pretty much linear. That seems weird
>> since I would expect there would be a visible extra component since sort times
>> are n-log(n). Perhaps you need to test still larger tables to see that though.
>>
>> In any case it's clear from the results you have there that the change is a
>> positive one and fixes a fundamental problem with the hash index build code.
>>
>> Something else you should perhaps test is indexing something which is
>> substantially larger than hash function output. A hash function is going to
>> make the most sense when indexing something like strings for which you want to
>> avoid the long comparison costs. Especially consider testing this on a UTF8
>> locale with expensive comparisons (like a CJK locale for example).
>>
>> Note that the bottom line for the problems with hash indexes is that the
>> current implementation doesn't offer any advantages over btree indexes. Hash
>> indexes need to be not only as good of a choice as btree indexes but
>> significantly better a significantly better choice at least for some specific
>> circumstances.
>>
>> Also, if you're going to submit a patch you should check out a copy of the CVS
>> HEAD and work from that. I don't think there are huge differences in the area
>> of hash indexes though. But in most other areas you would be spending quite a
>> bit of time dealing details which have changed since.
>>
>> Finally note that we're in the final throes of the 8.3 feature freeze.
>> Normally any patch submitted now would be held until 8.3 is released and
>> development on 8.4 is started. I could imagine an exception being made for
>> hash indexes since they're so moribund currently but probably not. The flip
>> side of that argument is that there's not much point in making an exception
>> for something which will only be really useful once further work is done in
>> the same area.
>>
>>     
>
> Although I am very excited about this patch, I do not see any real value
> in including it in 8.3. As you mention, we need to to have a hash index
> implementation that outperforms btree in some problem regime and that is
> currently not the case. I have just recently started the process of
> gathering ideas and having discussions on various approaches to making
> hash indexes more performant and we have a number of ideas on which to
> start our investigation. I do think that this patch will make the testing
> and evaluation, that will be needed to truly improve the hash index, much
> much easier.
>
> Regards,
> Ken
>
>   
We're glad to contribute and be a part of Postgres.  The patch has been 
posted to pgsql-patches@postgresql.org.

Speeding up the creation time of hash indexes on non-trivial relations 
was our goal.  This will allow some interesting performance tests of the 
hash index on very large relations.  It may be that the near constant 
lookup time of the hash index outperforms the Btree index for some large 
data sets and for certain types of data and distributions.

Sincerely,
Tom Raney





pgsql-hackers by date:

Previous
From: "Luke Lonergan"
Date:
Subject: Re: top for postgresql (ptop?)
Next
From: "Pavel Stehule"
Date:
Subject: Re: plpgsql TABLE patch