Thread: Why hash indexes suck

Why hash indexes suck

From
Tom Lane
Date:
"Dann Corbit" <DCorbit@connx.com> writes:
> There seems to be something seriously defective with hash indexes in old
> versions of PostgreSQL.

They still suck; I'm not aware of any situation where I'd recommend hash
over btree indexes in Postgres.  I think we have fixed the hash indexes'
deadlock problems as of 7.4, but there's still no real performance
advantage.

I just had an epiphany as to the probable reason why the performance sucks.
It's this: the hash bucket size is the same as the page size (ie, 8K).

This means that if you have only one or a few items per bucket, the
information density is awful, and you lose big on I/O requirements
compared to a btree index.  On the other hand, if you have enough
items per bucket to make the storage density competitive, you will
be doing linear searches through dozens if not hundreds of items that
are all in the same bucket, and you lose on CPU time (compared to btree
which can do binary search to find an item within a page).
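
Just to put rough numbers on it, here's a throwaway illustration (the 20-byte
entry size is only an assumption for the sake of the example, not the real
index tuple layout):

    /* Throwaway illustration of the density-vs-search tradeoff; the
     * 20-byte entry size is an assumption, not the real item layout. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int page_size  = 8192;
        const int entry_size = 20;                      /* assumed */
        const int full_page  = page_size / entry_size;  /* ~409 items */
        int sparse_items = 4;

        printf("sparse hash bucket: %d items, %.1f%% of the page used\n",
               sparse_items, 100.0 * sparse_items * entry_size / page_size);
        printf("packed hash bucket: %d items, ~%d compares (linear scan)\n",
               full_page, full_page / 2);
        printf("btree page:         %d items, ~%d compares (binary search)\n",
               full_page, (int) ceil(log2(full_page)));
        return 0;
    }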

It would probably be interesting to look into making the hash bucket
size be just a fraction of a page, with the intent of having no more
than a couple dozen items per bucket.  I'm not sure what the
implications would be for intra-page storage management or index locking
conventions, but offhand it seems like there wouldn't be any
insurmountable problems.

I'm not planning on doing this myself, just throwing it out as a
possible TODO item for anyone who's convinced that hash indexes ought
to work better than they do.
        regards, tom lane


Re: Why hash indexes suck

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
   Tom> This means that if you have only one or a few items per
   Tom> bucket, the information density is awful, and you lose big on
   Tom> I/O requirements compared to a btree index.  On the other
   Tom> hand, if you have enough items per bucket to make the storage
   Tom> density competitive, you will be doing linear searches
   Tom> through dozens if not hundreds of items that are all in the
   Tom> same bucket, and you lose on CPU time (compared to btree
   Tom> which can do binary search to find an item within a page).

This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree ? Then you wouldn't lose
wrt CPU time at least. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: Why hash indexes suck

From
Tom Lane
Date:
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> This is probably a crazy idea, but is it possible to organize the data
> in a page of a hash bucket as a binary tree ?

Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think.  Hash is only supposed to
need equality not ordering.
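
Schematically (a simplification for illustration only, not the actual opclass
catalog entries or index AM API), the two access methods ask for different
things:

    /* Simplified sketch, not the real PostgreSQL opclass API: hash only
     * needs equality plus a hash function, while btree needs a total
     * ordering that the datatype may simply not have. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct HashOpclassSketch
    {
        bool     (*equal)(const void *a, const void *b);
        uint32_t (*hash)(const void *key);
    } HashOpclassSketch;

    typedef struct BtreeOpclassSketch
    {
        int (*compare)(const void *a, const void *b);   /* <0, 0, >0 */
    } BtreeOpclassSketch;
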
        regards, tom lane


Re: Why hash indexes suck

From
Jeff Davis
Date:
On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think.  Hash is only supposed to
> need equality not ordering.

Is it possible to assume some kind of ordering (i.e. strcmp() the binary
data of the type) as long as it's consistent?

Regards,
Jeff



Re: Why hash indexes suck

From
Tom Lane
Date:
Jeff Davis <jdavis-pgsql@empires.org> writes:
> On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
>> Only if you want to require a hash opclass to supply ordering operators,
>> which sort of defeats the purpose I think.  Hash is only supposed to
>> need equality not ordering.

> Is it possible to assume some kind of ordering (i.e. strcmp() the binary
> data of the type) as long as it's consistent?

Not really; that would assume that equality of the datatype is the same
as bitwise equality, which is not the case in general (consider -0
versus +0 in IEEE floats, or any datatype with pad bits in the struct).
Some time ago we got rid of the assumption that hash should hash on all
the bits without any type-specific intervention, and I don't want to
reintroduce those bugs.
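
For instance (standalone C, nothing to do with the backend code):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double a = +0.0, b = -0.0;

        /* Equal according to the datatype ... */
        printf("a == b: %d\n", a == b);                               /* 1 */
        /* ... but not bitwise equal, so an ordering built on raw bytes
         * would file "equal" values in different places. */
        printf("bitwise equal: %d\n", memcmp(&a, &b, sizeof a) == 0); /* 0 */
        return 0;
    }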

We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.
        regards, tom lane


Re: Why hash indexes suck

From
pgsql@mohawksoft.com
Date:
> Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
>> This is probably a crazy idea, but is it possible to organize the data
>> in a page of a hash bucket as a binary tree ?
>
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think.  Hash is only supposed to
> need equality not ordering.
>

A btree is frequently used within the buckets of a hash table, especially
if you expect to have a large number of items in each bucket.

If PostgreSQL could create a hash index that is a single top-level
hash table, with each hash bucket being a btree index, you could eliminate a
number of btree searches by hashing and then fall into btree performance
after the first hash lookup.  The administrator should be able to gather
statistics about the population of the hash buckets and rehash if
performance begins to degrade to that of a btree or the data is not distributed
evenly.  Given a proper selection of the initial number of buckets, a hash
table could blow btree out of the water; given a poor selection of the
number of buckets, i.e. 1, a hash will behave no worse than a btree.
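
Very roughly, the shape I mean is something like this (invented names, nothing
to do with the actual on-disk format):

    /* Invented names, purely to illustrate the proposed structure: a
     * top-level hash table whose buckets are themselves btrees. */
    #include <stdint.h>

    typedef struct BucketBtree BucketBtree;     /* per-bucket btree, elsewhere */

    typedef struct HashOfBtrees
    {
        uint32_t      nbuckets;   /* chosen (and re-chosen) by the admin */
        BucketBtree **buckets;    /* one btree root per bucket */
    } HashOfBtrees;

    /* Lookup is one hash probe followed by an ordinary btree descent:
     *
     *     BucketBtree *b = idx->buckets[hash(key) % idx->nbuckets];
     *     return btree_search(b, key);        -- hypothetical helper
     */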

Also, it would be helpful to be able to specify a hash function during the
create or rehash: given a specific class of data, extraction of the random
elements can be more efficient and/or effective with knowledge of the
data.  Think of something like bar codes: there are portions of the data
which are usually the same and portions of the data which are usually
different.  Focusing on the portions of the data which tend to be different
will generally produce a more evenly distributed hash.
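
For example (illustrative only; the six-character "variable tail" is an
assumption about the data, and FNV-1a is just a convenient stand-in hash):

    #include <stdint.h>
    #include <string.h>

    /* Hash a bar code by weighting its variable tail, on the assumption
     * that the prefix is nearly constant across the data set. */
    static uint32_t barcode_hash(const char *code)
    {
        size_t   len  = strlen(code);
        size_t   tail = len > 6 ? len - 6 : 0;   /* assumed variable portion */
        uint32_t h    = 2166136261u;             /* FNV-1a offset basis */

        for (size_t i = tail; i < len; i++)
        {
            h ^= (unsigned char) code[i];
            h *= 16777619u;                      /* FNV prime */
        }
        return h;
    }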


Re: Why hash indexes suck

From
"Zeugswetter Andreas SB SD"
Date:
> We could safely sort on the hash value, but I'm not sure how effective
> that would be, considering that we're talking about values that already
> hashed into the same bucket --- there's likely not to be very many
> distinct hash values there.

I think we can safely put that on the todo list.
The existing hash algorithm is very good, so I would on the
contrary believe that only a few keys share a hash value per page-sized bucket.
For the equal-keys case it does not matter, since we want all of the rows anyway.
For the equal-hash-value case it would probably be best to sort by ctid.

TODO ?: order heap pointers inside hash index pages by hash value and ctid
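
Roughly the ordering I mean, with an invented entry layout just for illustration:

    /* Invented layout, only to show "order by hash value, then ctid". */
    #include <stdint.h>

    typedef struct HashEntrySketch
    {
        uint32_t hashval;   /* hash of the indexed key */
        uint32_t block;     /* ctid: heap block number */
        uint16_t offset;    /* ctid: line pointer offset */
    } HashEntrySketch;

    static int
    hash_entry_cmp(const HashEntrySketch *a, const HashEntrySketch *b)
    {
        if (a->hashval != b->hashval)
            return a->hashval < b->hashval ? -1 : 1;
        if (a->block != b->block)
            return a->block < b->block ? -1 : 1;
        return (int) a->offset - (int) b->offset;
    }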

Andreas


Re: Why hash indexes suck

From
Bruce Momjian
Date:
Added to TODO:
* Order heap pointers on hash index pages by hash value and ctid

---------------------------------------------------------------------------

Zeugswetter Andreas SB SD wrote:
> 
> > We could safely sort on the hash value, but I'm not sure how effective
> > that would be, considering that we're talking about values that already
> > hashed into the same bucket --- there's likely not to be very many
> > distinct hash values there.
> 
> I think we can safely put that on the todo list.
> The existing hash algorithm is very good, so I would on the 
> contrary believe that only a few keys share a hash value per page-sized bucket.
> For the equal-keys case it does not matter, since we want all of the rows anyway.
> For the equal-hash-value case it would probably be best to sort by ctid.
> 
> TODO ?: order heap pointers inside hash index pages by hash value and ctid
> 
> Andreas
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Why hash indexes suck

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
>     * Order heap pointers on hash index pages by hash value and ctid

[blink]  This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.
        regards, tom lane


Re: Why hash indexes suck

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Added to TODO:
> >     * Order heap pointers on hash index pages by hash value and ctid
> 
> [blink]  This seems to miss out on the actual point of the thread (hash
> bucket size shouldn't be a disk page) in favor of an entirely
> unsupported sub-suggestion.

Yes, I was unsure of the text myself.  I have changed it to:
    * Allow hash buckets to fill disk pages, rather than being sparse

If we sorted the keys, how do we insert new entries efficiently?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Why hash indexes suck

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> [blink]  This seems to miss out on the actual point of the thread (hash
>> bucket size shouldn't be a disk page) in favor of an entirely
>> unsupported sub-suggestion.

> Yes, I was unsure of the text myself.  I have changed it to:
>     * Allow hash buckets to fill disk pages, rather than being
>       sparse

OK, though maybe "pack hash index buckets onto disk pages more
efficiently" would be clearer.

> If we sorted the keys, how do we insert new entries efficiently?

That is why I called it "unsupported".  I'm not clear what would happen
in buckets that overflow onto multiple pages --- do we try to maintain
ordering across all the pages, or just within a page, or what?  How much
does this buy us versus what it costs to maintain?  Maybe there's a win
there but I think it's pretty speculative ...
        regards, tom lane


Re: Why hash indexes suck

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> [blink]  This seems to miss out on the actual point of the thread (hash
> >> bucket size shouldn't be a disk page) in favor of an entirely
> >> unsupported sub-suggestion.
> 
> > Yes, I was unsure of the text myself.  I have changed it to:
> >     * Allow hash buckets to fill disk pages, rather than being
> >       sparse
> 
> OK, though maybe "pack hash index buckets onto disk pages more
> efficiently" would be clearer.

OK, updated.


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Tom Lane
Date:
Kenneth Marshall <ktm@is.rice.edu> writes:
> Would it be possible to use a latch + version number in
> this case to minimize this problem by allowing all but the checkpoint to
> perform a read-only action on the latch?

How would a read-only action work to block out the checkpoint?

More generally, though, this lock is hardly the one I'd be most
concerned about in an SMP situation.  It's only taken once per
transaction, while there are others that may be taken many times.
(At least two of these, the WALInsertLock and the lock on shared
pg_clog, will need to be taken again in the process of recording
transaction commit.)

What I'd most like to find is a way to reduce contention for the
BufMgrLock --- there are at least some behavioral patterns in which
that is demonstrably a dominant cost.  See past discussions in the
archives ("context swap storm" should find you some recent threads).
        regards, tom lane


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Tom Lane
Date:
Kenneth Marshall <ktm@is.rice.edu> writes:
> On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
>> How would a read-only action work to block out the checkpoint?

> The latch + version number is used by the checkpoint process. The
> other processes can do a read of the latch to determine if it has
> been set. This does not cause a cache invalidation hit. If the
> latch is set, the competing processes read until it has been
> cleared and the version updated. This makes the general case of
> no checkpoint not incur a write and the consequent cache-line
> invalidation and reload by all processors on an SMP system.

Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start.  The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions.  I don't see that there's any way to
do that with a read-only latch.
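
In outline (all names here are placeholders, not the real backend functions):

    /* Heavily simplified sketch of the window in question. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Xid;

    static void write_commit_wal_record(Xid xid) { printf("WAL: commit %u\n", (unsigned) xid); }
    static void set_clog_committed(Xid xid)      { printf("clog: %u committed\n", (unsigned) xid); }

    static void record_commit(Xid xid)
    {
        write_commit_wal_record(xid);   /* action 1: COMMIT record into WAL */

        /* A checkpoint must not start here: if it did and we then crashed,
         * recovery would begin after the COMMIT record, yet the clog update
         * below would never have been made. */

        set_clog_committed(xid);        /* action 2: update shared pg_clog */
    }

    int main(void) { record_commit(42); return 0; }
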
        regards, tom lane


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Kenneth Marshall
Date:
> "Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
> > It seems to me this is an interesting phenomenon of interactions between 
> > frequent events of transaction commits and infrequent events of system 
> > checkpoints. A potential alternative solution to adding a new shared 
> > lock to the frequent commit operation is to let the infrequent 
> > checkpoint operation take more overhead. I suppose acquiring/releasing 
> > an extra lock for each commit would incur extra performance overhead, 
> > even when the lock is not contended.  On the other hand, letting the 
> > checkpoint operation acquire some existing locks (exclusively) to 
> > effectively disallow committing transactions from interfering with the 
> > checkpoint process might be a better solution, since it incurs higher 
> > overhead only when necessary.
> 
> Unfortunately, there isn't any pre-existing lock that will serve.
> A transaction that is between XLogInsert'ing its COMMIT record and
> updating the shared pg_clog data area does not hold any lock that
> could be used to prevent a checkpoint from starting.  (Or it didn't
> until yesterday's patch, anyway.)
> 
> I looked briefly at reorganizing the existing code so that we'd do the
> COMMIT XLogInsert while we're holding lock on the shared pg_clog data,
> which would solve the problem without adding any new lock acquisition.
> But this seemed extremely messy to do.  Also it would be optimizing
> transaction commit at the cost of pessimizing other uses of pg_clog,
> which might have to wait longer to get at the shared data.  Adding the
> new lock has the advantage that we can be sure it's not blocking
> anything we don't want it to block.
> 
> Thanks for thinking about the problem though ...
> 
>    regards, tom lane
> 

One problem with a high-traffic LWLock is that it requires a write
to shared memory for both the shared lock and the exclusive lock. On
the increasingly prevalent SMP machines, this causes the invalidation
of the cache line containing the lock and the consequent reload, with its
inherent delay. Would it be possible to use a latch + version number in
this case to minimize this problem, by allowing all but the checkpoint to
perform a read-only action on the latch? This should eliminate the cache-line
shenanigans on SMP machines.
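
Roughly what I have in mind (seqlock-style; illustrative only, and it glosses
over the memory barriers a real SMP implementation would need):

    #include <stdint.h>

    typedef struct
    {
        volatile uint32_t version;   /* odd while the checkpointer holds it */
    } VersionLatch;

    /* Writer (the checkpoint process): bump to odd on acquire, even on release. */
    static void latch_acquire(VersionLatch *l) { l->version++; }
    static void latch_release(VersionLatch *l) { l->version++; }

    /* Readers (committing backends): a plain read in the common case, so no
     * write to the shared cache line; spin only while the latch is held. */
    static void latch_wait_clear(VersionLatch *l)
    {
        while (l->version & 1)
            ;   /* a real version would sleep or yield here */
    }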

Ken Marshall 


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Kenneth Marshall
Date:
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > Would it be possible to use a latch + version number in
> > this case to minimize this problem by allowing all but the checkpoint to
> > perform a read-only action on the latch?
> 
> How would a read-only action work to block out the checkpoint?
> 
> More generally, though, this lock is hardly the one I'd be most
> concerned about in an SMP situation.  It's only taken once per
> transaction, while there are others that may be taken many times.
> (At least two of these, the WALInsertLock and the lock on shared
> pg_clog, will need to be taken again in the process of recording
> transaction commit.)
> 
> What I'd most like to find is a way to reduce contention for the
> BufMgrLock --- there are at least some behavioral patterns in which
> that is demonstrably a dominant cost.  See past discussions in the
> archives ("context swap storm" should find you some recent threads).
> 
>             regards, tom lane
> 

The latch + version number is used by the checkpoint process. The
other processes can do a read of the latch to determine whether it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes keep reading until it has been
cleared and the version updated. This means that the common case, with
no checkpoint in progress, does not incur a write and the consequent
cache-line invalidation and reload by all processors on an SMP system.

Ken


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Kenneth Marshall
Date:
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
> 
> > The latch + version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This makes the general case of
> > no checkpoint not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
> 
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start.  The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions.  I don't see that there's any way to
> do that with a read-only latch.
> 
>             regards, tom lane

Yes, you are correct. I missed that part of the previous thread. When
I saw "exclusive lock" I thought of a latch, since that is what I am
investigating to solve some other performance issues I am addressing.

Ken


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
"Simon@2ndquadrant.com"
Date:
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
>
> > The latch + version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This makes the general case of
> > no checkpoint not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
>
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start.  The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions.  I don't see that there's any way to
> do that with a read-only latch.
>

...just caught up on this.

ISTM that more heavily loading the checkpoint process IS possible if the
checkpoint uses a two-phase lock. That would replace one write lock with two
lock reads... which is likely to be beneficial for SMP, given that I have faith
that the other two problems you mention will succumb to some solution in the
mid-term. The first lock is an "intent lock", followed by a second,
heavyweight lock just as you now have it; a rough sketch of both sides follows
the steps below.

Committer:
1. prior to COMMIT: checks for an intent lock; if found, it attempts to
take the heavyweight lock... if that is not possible, the commit waits until
after the checkpoint, just as you currently suggest
2. prior to updating clog: checks for an intent lock; if found, takes the
heavyweight lock... if that is not possible, report a server error

Checkpointer: (straight to step 4 for a shutdown checkpoint)
1. writes an intent lock (it always can)
2. wait for the group commit timeout
3. wait for 0.5 second more
4. begins to wait on an exclusive heavyweight lock, before starting
checkpoint proper
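
Very roughly, and only to make the steps concrete (invented names, and - as
noted below - not claimed to be correct):

    /* Rough sketch of the intent-lock idea; invented names, and not a
     * correct or complete implementation. */
    #include <stdbool.h>

    typedef struct
    {
        volatile bool intent;   /* written only by the checkpointer */
        /* ... plus the existing heavyweight lock, not shown ... */
    } CheckpointIntent;

    /* Committer, steps 1 and 2: a plain read in the common case. */
    static bool committer_sees_no_checkpoint(const CheckpointIntent *ci)
    {
        return !ci->intent;     /* if set, fall back to the heavyweight lock */
    }

    /* Checkpointer, steps 1-4:
     *
     *     ci->intent = true;                 -- 1. announce intent (one write)
     *     wait_for_group_commit_timeout();   -- 2. (hypothetical helpers)
     *     wait_half_second_more();           -- 3.
     *     acquire_heavyweight_lock();        -- 4. then checkpoint proper
     */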

This is not a provably correct state machine, but the error message should
not occur under current "normal" situations.  (It is possible that an intent
lock could be written by the Checkpointer (step 1) after a Committer checks for
it (step 1), and then a very long delay occurs before the Committer's step 2,
such that Checkpointer step 4 begins before Committer step 2.)  It is very
likely that this would be noticed by Committer step 2 and reported upon, in the
unlikely event that it occurs.

Is a longer term solution for pg to use a background log writer? That would
make group commit much easier to perform automatically without the
false-delay model currently available.

Best Regards, Simon Riggs



Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
Tom Lane
Date:
"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
> This is not a provably correct state machine

I think the discussion ends right there.  You are assuming that the
commit is guaranteed to finish in X amount of time, when it is not
possible to make any such guarantee.  We are not putting in an
unreliable commit mechanism in order to save a small amount of lock
contention.  (As I tried to point out already, but it doesn't seem
to have sunk in: this newly-added lock is not likely to be that much
more contention added to the commit path, seeing that the path of
control it protects already involves taking at least two exclusive
LWLocks.  Those locks will likely each cause as much or more SMP
cache thrashing as this one.)

What we could use is a better way to build LWLocks in general.  I do not
know how to do that, though, in the face of SMP machines that seem to
fundamentally not have any cheap locking mechanisms...
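
To illustrate where the cache traffic comes from, the shape of an LWLock
acquisition is roughly this (a heavily simplified sketch using GCC atomic
builtins, not the real lwlock.c):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        volatile uint8_t mutex;      /* spinlock guarding the fields below */
        bool             exclusive;  /* held exclusively? */
        int              shared;     /* number of shared holders */
    } LWLockSketch;

    static void spin_lock(volatile uint8_t *m)
    {
        /* The test-and-set is a write, so even a shared acquire dirties the
         * lock's cache line on every CPU that recently touched it. */
        while (__sync_lock_test_and_set(m, 1))
            ;                        /* busy-wait */
    }

    static void spin_unlock(volatile uint8_t *m)
    {
        __sync_lock_release(m);
    }

    static bool lwlock_cond_acquire_shared(LWLockSketch *lock)
    {
        bool ok;

        spin_lock(&lock->mutex);
        ok = !lock->exclusive;
        if (ok)
            lock->shared++;
        spin_unlock(&lock->mutex);
        return ok;
    }
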
        regards, tom lane


Re: Re: We have got a serious problem with pg_clog/WAL synchronization

From
"Simon@2ndquadrant.com"
Date:
> Tom Lane
> 
> "Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
> > This is not a provably correct state machine
> 
> I think the discussion ends right there.  

Yes...

Negative results are worth documenting too, IMHO.

Best Regards, Simon Riggs