Thread: Why hash indexes suck
"Dann Corbit" <DCorbit@connx.com> writes: > There seems to be something seriously defective with hash indexes in old > versions of PostgreSQL. They still suck; I'm not aware of any situation where I'd recommend hash over btree indexes in Postgres. I think we have fixed the hash indexes' deadlock problems as of 7.4, but there's still no real performance advantage. I just had an epiphany as to the probable reason why the performance sucks. It's this: the hash bucket size is the same as the page size (ie, 8K). This means that if you have only one or a few items per bucket, the information density is awful, and you lose big on I/O requirements compared to a btree index. On the other hand, if you have enough items per bucket to make the storage density competitive, you will be doing linear searches through dozens if not hundreds of items that are all in the same bucket, and you lose on CPU time (compared to btree which can do binary search to find an item within a page). It would probably be interesting to look into making the hash bucket size be just a fraction of a page, with the intent of having no more than a couple dozen items per bucket. I'm not sure what the implications would be for intra-page storage management or index locking conventions, but offhand it seems like there wouldn't be any insurmountable problems. I'm not planning on doing this myself, just throwing it out as a possible TODO item for anyone who's convinced that hash indexes ought to work better than they do. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

    Tom> This means that if you have only one or a few items per
    Tom> bucket, the information density is awful, and you lose big on
    Tom> I/O requirements compared to a btree index. On the other
    Tom> hand, if you have enough items per bucket to make the storage
    Tom> density competitive, you will be doing linear searches
    Tom> through dozens if not hundreds of items that are all in the
    Tom> same bucket, and you lose on CPU time (compared to btree
    Tom> which can do binary search to find an item within a page).

This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree? Then you wouldn't lose
wrt CPU time at least.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
> This is probably a crazy idea, but is it possible to organize the data
> in a page of a hash bucket as a binary tree ?

Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose, I think. Hash is only supposed to
need equality, not ordering.

			regards, tom lane
On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think. Hash is only supposed to
> need equality not ordering.

Is it possible to assume some kind of ordering (i.e. strcmp() the binary
data of the type) as long as it's consistent?

Regards,
	Jeff
Jeff Davis <jdavis-pgsql@empires.org> writes:
> On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
>> Only if you want to require a hash opclass to supply ordering operators,
>> which sort of defeats the purpose I think. Hash is only supposed to
>> need equality not ordering.

> Is it possible to assume some kind of ordering (i.e. strcmp() the binary
> data of the type) as long as it's consistent?

Not really; that would assume that equality of the datatype is the same
as bitwise equality, which is not the case in general (consider -0
versus +0 in IEEE floats, or any datatype with pad bits in the struct).
Some time ago we got rid of the assumption that hash should hash on all
the bits without any type-specific intervention, and I don't want to
reintroduce those bugs.

We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.

			regards, tom lane
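Tom's IEEE example is easy to reproduce; a minimal standalone C
demonstration (illustrative only, not PostgreSQL code):

    #include <stdio.h>
    #include <string.h>

    /* Equality of a datatype is not bitwise equality: IEEE 754 defines
     * -0.0 == +0.0, yet the two values differ in their sign bit, so a
     * strcmp()/memcmp()-style "ordering" would separate equal keys. */
    int
    main(void)
    {
        double pos = 0.0;
        double neg = -0.0;

        printf("pos == neg: %d\n", pos == neg);              /* prints 1 */
        printf("same bits:  %d\n",
               memcmp(&pos, &neg, sizeof(double)) == 0);     /* prints 0 */
        return 0;
    }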
> Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
>> This is probably a crazy idea, but is it possible to organize the data
>> in a page of a hash bucket as a binary tree ?
>
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think. Hash is only supposed to
> need equality not ordering.

A btree is frequently used within the buckets of a hash table,
especially if you expect to have a large number of items in each
bucket. PostgreSQL could create a hash index which is a single
top-level hash table, with each hash bucket being a btree index. You
can eliminate a number of btree searches by hashing, and then fall into
btree performance after the first hash lookup.

The administrator should be able to gather statistics about the
population of the hash buckets and rehash if performance begins to
behave like a btree or the data is not distributed evenly. Given proper
selection of the initial number of buckets, a hash table could blow
btree out of the water. Given a poor selection of the number of
buckets, i.e. 1, a hash will behave no worse than a btree.

Also, it would be helpful to be able to specify a hash function during
the create or rehash: given a specific class of data, extraction of the
distinguishing elements can be more efficient and/or effective given
knowledge of the data. Think of something like bar codes: there are
portions of the data which are usually the same and portions of the
data which are usually different. Focusing on the portions of the data
which tend to differ will generally produce a more evenly distributed
hash.
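As a hypothetical sketch of that last point (the function name, the
6-byte prefix, and the use of FNV-1a mixing are all invented for
illustration, not anything PostgreSQL provides), a domain-aware hash
for barcode-like keys could skip the near-constant prefix and mix only
the bytes that tend to differ:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical domain-aware hash: assumes the first PREFIX_LEN bytes
     * of a barcode-like key are nearly constant (country/manufacturer
     * codes), so only the trailing, high-variance bytes are mixed.
     * FNV-1a is used here just as a simple, well-known mixing step. */
    #define PREFIX_LEN 6

    static uint32_t
    barcode_hash(const char *key, size_t len)
    {
        uint32_t h = 2166136261u;              /* FNV-1a offset basis */
        size_t   i = (len > PREFIX_LEN) ? PREFIX_LEN : 0;

        for (; i < len; i++)
        {
            h ^= (uint8_t) key[i];
            h *= 16777619u;                    /* FNV-1a prime */
        }
        return h;
    }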
> We could safely sort on the hash value, but I'm not sure how effective
> that would be, considering that we're talking about values that already
> hashed into the same bucket --- there's likely not to be very many
> distinct hash values there.

I think we can safely put that on the todo list. The existing hash
algorithm is very good, so I would on the contrary believe that only a
few keys share a hash value per page-sized bucket. For the equal-keys
case it does not matter, since we want all of the rows anyway. For the
equal-hash-value case it would probably be best to sort by ctid.

TODO ?: order heap pointers inside hash index pages by hash value and ctid

Andreas
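A sketch of what the proposed page-internal ordering might look like,
as a plain qsort()-style comparator (the struct is a stand-in for an
index entry, not PostgreSQL's actual on-page layout):

    #include <stdint.h>

    /* Stand-in for a hash index entry: just enough fields to show the
     * proposed (hash value, ctid) sort key. */
    typedef struct
    {
        uint32_t hash;           /* full hash of the key */
        uint32_t ctid_block;     /* heap block number */
        uint16_t ctid_offset;    /* line pointer within the block */
    } HashEntry;

    /* Order by hash value first, then by ctid, so equal-hash entries can
     * be binary-searched and are then visited in heap order. */
    static int
    hash_entry_cmp(const void *a, const void *b)
    {
        const HashEntry *x = a;
        const HashEntry *y = b;

        if (x->hash != y->hash)
            return (x->hash < y->hash) ? -1 : 1;
        if (x->ctid_block != y->ctid_block)
            return (x->ctid_block < y->ctid_block) ? -1 : 1;
        if (x->ctid_offset != y->ctid_offset)
            return (x->ctid_offset < y->ctid_offset) ? -1 : 1;
        return 0;
    }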
Added to TODO:

	* Order heap pointers on hash index pages by hash value and ctid

---------------------------------------------------------------------------

Zeugswetter Andreas SB SD wrote:
> > We could safely sort on the hash value, but I'm not sure how effective
> > that would be, considering that we're talking about values that already
> > hashed into the same bucket --- there's likely not to be very many
> > distinct hash values there.
>
> I think we can safely put that on the todo list.
> The existing hash algorithm is very good, so I would on the contrary
> believe that only a few keys share a hash value per page-sized bucket.
> For the equal-keys case it does not matter, since we want all of the rows anyway.
> For the equal-hash-value case it would probably be best to sort by ctid.
>
> TODO ?: order heap pointers inside hash index pages by hash value and ctid
>
> Andreas
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
> 	* Order heap pointers on hash index pages by hash value and ctid

[blink] This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Added to TODO:
> > 	* Order heap pointers on hash index pages by hash value and ctid
>
> [blink] This seems to miss out on the actual point of the thread (hash
> bucket size shouldn't be a disk page) in favor of an entirely
> unsupported sub-suggestion.

Yes, I was unsure of the text myself. I have changed it to:

	* Allow hash buckets to fill disk pages, rather than being sparse

If we sorted the keys, how do we insert new entries efficiently?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> [blink] This seems to miss out on the actual point of the thread (hash
>> bucket size shouldn't be a disk page) in favor of an entirely
>> unsupported sub-suggestion.

> Yes, I was unsure of the text myself. I have changed it to:
> 	* Allow hash buckets to fill disk pages, rather than being sparse

OK, though maybe "pack hash index buckets onto disk pages more
efficiently" would be clearer.

> If we sorted the keys, how do we insert new entries efficiently?

That is why I called it "unsupported". I'm not clear what would happen
in buckets that overflow onto multiple pages --- do we try to maintain
ordering across all the pages, or just within a page, or what? How much
does this buy us versus what it costs to maintain? Maybe there's a win
there but I think it's pretty speculative ...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, I was unsure of the text myself. I have changed it to:
> > 	* Allow hash buckets to fill disk pages, rather than being sparse
>
> OK, though maybe "pack hash index buckets onto disk pages more
> efficiently" would be clearer.

OK, updated.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Thread: We have got a serious problem with pg_clog/WAL synchronization

Kenneth Marshall <ktm@is.rice.edu> writes:
> Would it be possible to use a latch + version number in
> this case to minimize this problem by allowing all but the checkpoint to
> perform a read-only action on the latch?

How would a read-only action work to block out the checkpoint?

More generally, though, this lock is hardly the one I'd be most
concerned about in an SMP situation. It's only taken once per
transaction, while there are others that may be taken many times. (At
least two of these, the WALInsertLock and the lock on shared pg_clog,
will need to be taken again in the process of recording transaction
commit.)

What I'd most like to find is a way to reduce contention for the
BufMgrLock --- there are at least some behavioral patterns in which
that is demonstrably a dominant cost. See past discussions in the
archives ("context swap storm" should find you some recent threads).

			regards, tom lane
Kenneth Marshall <ktm@is.rice.edu> writes:
> On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
>> How would a read-only action work to block out the checkpoint?

> The latch+version number is used by the checkpoint process. The
> other processes can do a read of the latch to determine if it has
> been set. This does not cause a cache invalidation hit. If the
> latch is set, the competing processes read until it has been
> cleared and the version updated. This makes the general case of
> no checkpoint not incur a write and the consequent cache-line
> invalidation and reload by all processors on an SMP system.

Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start. The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions. I don't see that there's any way to
do that with a read-only latch.

			regards, tom lane
> "Min Xu (Hsu)" <xu@cs.wisc.edu> writes: > > It seems to me this is an interesting phenomena of interactions between > > frequent events of transaction commits and infrequent events of system > > checkpoints. A potential alternative solution to adding a new shared > > lock to the frequent commit operation is to let the infrequent > > checkpoint operation take more overhead. I suppose acquiring/releasing > > an extra lock for each commit would incur extra performance overhead, > > even when the lock is not contented. On the other hand, let the > > checkpoing operation acquire some existing locks (exclusively) to > > effectively disallowing committing transactions to interfere with the > > checkpoint process might be a better solution since it incur higher > > overhead only when necessary. > > Unfortunately, there isn't any pre-existing lock that will serve. > A transaction that is between XLogInsert'ing its COMMIT record and > updating the shared pg_clog data area does not hold any lock that > could be used to prevent a checkpoint from starting. (Or it didn't > until yesterday's patch, anyway.) > > I looked briefly at reorganizing the existing code so that we'd do the > COMMIT XLogInsert while we're holding lock on the shared pg_clog data, > which would solve the problem without adding any new lock acquisition. > But this seemed extremely messy to do. Also it would be optimizing > transaction commit at the cost of pessimizing other uses of pg_clog, > which might have to wait longer to get at the shared data. Adding the > new lock has the advantage that we can be sure it's not blocking > anything we don't want it to block. > > Thanks for thinking about the problem though ... > > regards, tom lane > One problem with a high-traffic LWLock is that they require a write to shared memory for both the shared lock and the exclusive lock. On the increasingly prevalent SMP machines, this will cause the invalidation of the cache-line containing the lock and the consequent reload and its inherent delay. Would it be possible to use a latch + version number in this case to minimize this problem by allowing all but the checkpoint to perform a read-only action on the latch? This should eliminate the cache-line shenanigans on SMP machines. Ken Marshall
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > Would it be possible to use a latch + version number in
> > this case to minimize this problem by allowing all but the checkpoint to
> > perform a read-only action on the latch?
>
> How would a read-only action work to block out the checkpoint?
>
> More generally, though, this lock is hardly the one I'd be most
> concerned about in an SMP situation. It's only taken once per
> transaction, while there are others that may be taken many times.
> (At least two of these, the WALInsertLock and the lock on shared
> pg_clog, will need to be taken again in the process of recording
> transaction commit.)
>
> What I'd most like to find is a way to reduce contention for the
> BufMgrLock --- there are at least some behavioral patterns in which
> that is demonstrably a dominant cost. See past discussions in the
> archives ("context swap storm" should find you some recent threads).
>
> 			regards, tom lane

The latch+version number is used by the checkpoint process. The other
processes can do a read of the latch to determine if it has been set.
This does not cause a cache invalidation hit. If the latch is set, the
competing processes keep reading until it has been cleared and the
version updated. This makes the general case of no checkpoint not
incur a write and the consequent cache-line invalidation and reload by
all processors on an SMP system.

Ken
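The scheme Ken describes resembles a seqlock; a minimal sketch in C11
atomics (all names invented for illustration), which also shows why a
pure read cannot block a checkpoint that is just about to set the latch:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Illustrative seqlock-style latch + version, loosely matching the
     * description above.  Committers only read the latch, so its cache
     * line stays shared until a checkpoint actually writes it. */
    typedef struct
    {
        atomic_uint version;    /* odd while a checkpoint holds the latch */
    } CheckpointLatch;

    /* Checkpoint side: the only writer. */
    static void
    latch_acquire(CheckpointLatch *l)
    {
        atomic_fetch_add(&l->version, 1);   /* version becomes odd */
    }

    static void
    latch_release(CheckpointLatch *l)
    {
        atomic_fetch_add(&l->version, 1);   /* version becomes even again */
    }

    /* Committer side: a pure read, no cache-line invalidation in the
     * common case.  Note the gap Tom identifies: a checkpoint may set
     * the latch immediately *after* this returns false. */
    static bool
    latch_is_held(const CheckpointLatch *l)
    {
        return atomic_load(&l->version) & 1;
    }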
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
>
> > The latch+version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This makes the general case of
> > no checkpoint not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
>
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start. The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions. I don't see that there's any way to
> do that with a read-only latch.
>
> 			regards, tom lane

Yes, you are correct. I missed that part of the previous thread. When I
saw "exclusive lock" I thought latch, since that is what I am
investigating to solve other performance issues that I am addressing.

Ken
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm@is.rice.edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
>
> > The latch+version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This makes the general case of
> > no checkpoint not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
>
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start. The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions. I don't see that there's any way to
> do that with a read-only latch.

...just caught up on this. ISTM that more heavily loading the
checkpoint process IS possible if the checkpoint uses a two-phase lock.
That would replace 1 write lock with 2 lock reads... which is likely to
be beneficial for SMP, given I have faith that the other two problems
you mention will succumb to some solution in the mid-term.

The first lock is an "intent lock", followed by a second, heavyweight
lock just as you now have it.

Committer:
1. prior to COMMIT: reads for an intent lock; if found, then it
   attempts to take the heavyweight lock... if that is not possible,
   then the commit waits until after the checkpoint, just as you
   currently suggest
2. prior to updating clog: reads for an intent lock; if found, then
   takes the heavyweight lock... if that is not possible, then report a
   server error

Checkpointer: (straight to step 4 for a shutdown checkpoint)
1. writes an intent lock (it always can)
2. waits for the group commit timeout
3. waits for 0.5 second more
4. begins to wait on an exclusive heavyweight lock, before starting the
   checkpoint proper

This is not a provably correct state machine, but the error message
should not occur under current "normal" situations. (It is possible
that an intent lock could be written by the Checkpointer (step 1) after
a Committer reads for it (step 1), and then a very long delay occurs
before the Committer's step 2, such that Checkpointer step 4 begins
before Committer step 2.) It is very likely that this would be noticed
by Committer step 2 and reported upon, in the unlikely event that it
occurs.

Is a longer-term solution for pg to use a background log writer? That
would make group commit much easier to perform automatically, without
the false-delay model currently available.

Best Regards, Simon Riggs
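Simon's committer-side checks, sketched in C with stub primitives
(every name here is invented; this is his proposal as described, not
PostgreSQL code, and the race he concedes is marked in the comments):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stub primitives, for illustration only.  The intent latch is
     * written only by the checkpointer; committers merely read it. */
    static bool intent_latch_is_set(void)        { return false; } /* cheap read */
    static bool try_heavyweight_lock(void)       { return true; }
    static void wait_for_checkpoint_completion(void) { }
    static void write_commit_xlog_record(void)   { }
    static void update_clog_status(void)         { }

    static void
    committer_commit(void)
    {
        /* Step 1: prior to COMMIT. */
        if (intent_latch_is_set() && !try_heavyweight_lock())
            wait_for_checkpoint_completion();

        /* Race window Simon concedes: the checkpointer can set the
         * intent latch here, after the read above, which is why the
         * state machine is not provably correct. */
        write_commit_xlog_record();

        /* Step 2: prior to updating clog. */
        if (intent_latch_is_set() && !try_heavyweight_lock())
        {
            fprintf(stderr,
                    "checkpoint intervened between COMMIT and clog update\n");
            return;     /* the "report a server error" case */
        }
        update_clog_status();
    }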
"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes: > This is not a provably correct state machine I think the discussion ends right there. You are assuming that the commit is guaranteed to finish in X amount of time, when it is not possible to make any such guarantee. We are not putting in an unreliable commit mechanism in order to save a small amount of lock contention. (As I tried to point out already, but it doesn't seem to have sunk in: this newly-added lock is not likely to be that much more contention added to the commit path, seeing that the path of control it protects already involves taking at least two exclusive LWLocks. Those locks will likely each cause as much or more SMP cache thrashing as this one.) What we could use is a better way to build LWLocks in general. I do not know how to do that, though, in the face of SMP machines that seem to fundamentally not have any cheap locking mechanisms... regards, tom lane
> Tom Lane wrote:
> > "Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
> > > This is not a provably correct state machine
> >
> > I think the discussion ends right there.

Yes... Negative results are worth documenting too, IMHO.

Best Regards, Simon Riggs