Thread: Hash index build patch has *worse* performance at small table sizes

Hash index build patch has *worse* performance at small table sizes

From
Tom Lane
Date:
I've been reviewing the hash index build patch submitted here:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00154.php

Although it definitely helps on large indexes, it's actually
counterproductive on not-so-large ones.  The test case I'm using
is random integers generated like this:create table foo as select (random() * N)::int as f1  from
generate_series(1,N);selectcount(*) from foo; -- force hint bit updatescheckpoint;
 
then timingcreate index fooi on foo using hash(f1);

Using all-default configuration settings on some not-very-new hardware,
at N = 1E6 I see

8.3.1:                    30 sec
With pre-expansion of index (CVS HEAD):    24 sec
With sorting:                72 sec
To build a btree index on same data:    34 sec

Now this isn't amazingly surprising, because the original argument for
doing sorting was to improve locality of access to the index during
the build, and that only matters if you've got an index significantly
bigger than memory.  If the index fits in RAM then the sort is pure
overhead.

The obvious response to this is to use the sorting approach only when
the estimated index size exceeds some threshold.  One possible choice of
threshold would be shared_buffers (or temp_buffers for a temp index)
but I think that is probably too conservative, because in most scenarios
the kernel's disk cache is available too.  Plus you can't tweak that
setting without a postmaster restart.  I'm tempted to use
effective_cache_size, which attempts to measure an appropriate number
and can be set locally within the session doing the CREATE INDEX if
necessary.  Or we could invent a new GUC parameter, but that is probably
overkill.

Comments?
        regards, tom lane


Re: Hash index build patch has *worse* performance at small table sizes

From
Bruce Momjian
Date:
Did we ever do anything about this?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been reviewing the hash index build patch submitted here:
> http://archives.postgresql.org/pgsql-patches/2007-10/msg00154.php
> 
> Although it definitely helps on large indexes, it's actually
> counterproductive on not-so-large ones.  The test case I'm using
> is random integers generated like this:
>     create table foo as select (random() * N)::int as f1
>       from generate_series(1,N);
>     select count(*) from foo; -- force hint bit updates
>     checkpoint;
> then timing
>     create index fooi on foo using hash(f1);
> 
> Using all-default configuration settings on some not-very-new hardware,
> at N = 1E6 I see
> 
> 8.3.1:                    30 sec
> With pre-expansion of index (CVS HEAD):    24 sec
> With sorting:                72 sec
> To build a btree index on same data:    34 sec
> 
> Now this isn't amazingly surprising, because the original argument for
> doing sorting was to improve locality of access to the index during
> the build, and that only matters if you've got an index significantly
> bigger than memory.  If the index fits in RAM then the sort is pure
> overhead.
> 
> The obvious response to this is to use the sorting approach only when
> the estimated index size exceeds some threshold.  One possible choice of
> threshold would be shared_buffers (or temp_buffers for a temp index)
> but I think that is probably too conservative, because in most scenarios
> the kernel's disk cache is available too.  Plus you can't tweak that
> setting without a postmaster restart.  I'm tempted to use
> effective_cache_size, which attempts to measure an appropriate number
> and can be set locally within the session doing the CREATE INDEX if
> necessary.  Or we could invent a new GUC parameter, but that is probably
> overkill.
> 
> Comments?
> 
>             regards, tom lane
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Bruce Momjian <bruce@momjian.us> writes:
> Did we ever do anything about this?

Seems to be in there in CVS HEAD:
   /*    * If we just insert the tuples into the index in scan order, then    * (assuming their hash codes are pretty
random)there will be no locality    * of access to the index, and if the index is bigger than available RAM    * then
we'llthrash horribly.  To prevent that scenario, we can sort the    * tuples by (expected) bucket number.  However,
sucha sort is useless    * overhead when the index does fit in RAM.  We choose to sort if the    * initial index size
exceedseffective_cache_size.    *    * NOTE: this test will need adjustment if a bucket is ever different    * from one
page.   */   if (num_buckets >= (uint32) effective_cache_size)       buildstate.spool = _h_spoolinit(index,
num_buckets);  else       buildstate.spool = NULL;
 

        regards, tom lane