Fixing hash index build time - Mailing list pgsql-hackers

From Tom Lane
Subject Fixing hash index build time
Date
Msg-id 16911.1174420341@sss.pgh.pa.us
Whole thread Raw
Responses Re: Fixing hash index build time  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
There are several reasons why Postgres' hash indexes currently suck,
but one of the bigger ones is that the time to build an index on a
large existing table is excessive, eg
http://archives.postgresql.org/pgsql-novice/2007-03/msg00064.php

I'm not sure if this has been discussed before, but I suddenly realized
while responding to the above message that the reason for the awful
performance is pretty obvious: hashbuild starts with a minimum-size
index (two buckets) and repeatedly splits buckets as insertions are
done, exactly the same as ordinary dynamic growth of the index would do.
This means that for an N-row table, approximately N/entries-per-bucket
splits need to occur during index build, which results in roughly O(N^2)
performance because we have to reprocess already-inserted entries over
and over.  This explains the empiric observation I made a long time ago:
http://archives.postgresql.org/pgsql-hackers/2002-04/msg01379.php

This could be fixed with a relatively small amount of new code: when
beginning hashbuild, estimate the parent table's rowcount (the same
method used by the planner will do fine, viz RelationGetNumberOfBlocks
times an estimated tuple density) and construct the appropriate number
of buckets immediately.  No splits, just write out empty pages as fast
as we can.  *Then* do the insertions.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Money type todos?
Next
From: "D'Arcy J.M. Cain"
Date:
Subject: Re: Money type todos?