Thread: Next Steps with Hash Indexes
Hi, I've been investigating hash indexes and have what I think is a clear picture in my head, so time for discussion.

It would be very desirable to allow Hash Indexes to become Primary Key Indexes, which requires both
    amroutine->amcanunique = true;
    amroutine->amcanmulticol = true;
Every other hash index TODO seems like performance tuning, so it can wait a while, even if it is tempting to do that first.

1. Multi-column Indexes seem to have floundered because of this thread, "Multicolumn Hash Indexes", https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us, but those issues don't apply in the most common cases, so they seem acceptable restrictions, especially since some already apply to btrees etc. (And note that hash indexes already assume strict hash operators, so that is not an issue.) For me, this separates into two sub-concerns:

1.1 Allow multiple columns to be defined for hash indexes

Enabling this is a simple one-line patch
    amroutine->amcanmulticol = true;
which works just fine on current HEAD without further changes (manual testing only, as yet). If we do this first, then any work on uniqueness checking can take multiple columns into account.

1.2 Combining multi-column hashes into one hash value

Trivially, this is already how it works, in the sense that we just use the first column, however many columns there are in the index! Doing more is an already-solved problem in Postgres [TupleHashTableHash_internal() in src/backend/executor/execGrouping.c], as pointed out in "Combining hash values", https://www.postgresql.org/message-id/CAEepm%3D3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ%40mail.gmail.com, though noting there was no discussion on that point [1]. This just needs a little refactoring, but it seems more like a nice-to-have than an essential aspect of hash indexes, so it need not block us from enabling multi-column hash indexes. (A small sketch of such a combining step is included below, after 2.2.)

2. Unique Hash Indexes have been summarized here: https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com, which also seems to have two parts to it.

2.1 Uniqueness Check

Amit: "to ensure that there is no duplicate entry we need to traverse the whole bucket chain"
Agreed. That seems straightforward and can also be improved later.

2.2 Locking

Amit's idea of holding ExclusiveLock on the bucket page works for me, but there was some doubt about splitting. Splitting seems to be an awful behavior that users would wish to avoid if they knew about its impact and duration. In my understanding, splitting involves moving 50% of the rows and likely touches all blocks eventually. If the existing index is the wrong shape then just run REINDEX. If we tune the index build, REINDEX looks like a quicker and easier way of growing an index than trying to split an existing one, i.e. rely on ecdysis rather than internal growth. This is much more viable now because of the CIC changes in PG14. (I would argue that removing splitting completely is a good idea, similar to the way we removed the original VACUUM FULL algorithm, but that thought will take a while to swallow.)

Instead, I suggest we introduce a new indexam option for hash indexes, autosplit = on (default) | off, so that users can turn splitting off. Which means we would probably also need another option, initial_buckets, where 0 (the default) means use the number of existing rows to size the index, and N means use that specific number of buckets. Note that turning off splitting does not limit the size of the index; it just stops the index from refreshing its number of buckets.
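As an aside, here is the small combining sketch referred to in 1.2 above. It is only an illustration of the idea, not proposed code: it reuses the mixing step of hash_combine() from src/include/common/hashfn.h (TupleHashTableHash_internal() in execGrouping.c uses a slightly different rotate-and-XOR scheme), and the per-column hash values are made up.

#include <stdint.h>
#include <stdio.h>

/*
 * Same mixing step as hash_combine() in src/include/common/hashfn.h,
 * copied here so the example stands alone.
 */
static uint32_t
combine(uint32_t a, uint32_t b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

int
main(void)
{
	/* Made-up per-column hash values for one 3-column index tuple. */
	uint32_t	colhash[3] = {0x5f3a2b1c, 0x00c0ffee, 0x12345678};
	uint32_t	hashkey = 0;

	/* Fold the per-column hashes, left to right, into one bucket hash. */
	for (int i = 0; i < 3; i++)
		hashkey = combine(hashkey, colhash[i]);

	printf("combined hash = %08x\n", (unsigned) hashkey);
	return 0;
}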
B-trees are the default for PKs, so hash indexes are an option for larger tables only; there is less need for hash indexes to cope with tables of unknown size, since we wouldn't even be using hash unless we already knew it was a big table. If splitting causes any difficulty at all, then we should simply say that Unique Hash Indexes initially force autosplit=off, so we don't have to worry about the correctness of the locking. I suggest we implement that first and then decide if we really care about splitting, because I'd bet we don't. Yes, I consider uniqueness much more desirable than splitting.

I've written a patch that refactors index build so we *never* need to perform a split during index build, allowing us to more credibly skip index splitting completely. (Incidentally, it avoids the need to update the metapage for each row during the build, allowing us to consider writing to the index in batches as a next step.) So there need be no *requirement* for splitting to be supported with uniqueness, while build/reindex looks like it can be much faster. I can post it if anyone wants to see it, but I don't want to distract us from discussion of the main requirements.

I have other performance tuning ideas, but they can wait. Anyway, that's what I think at present. Thoughts?

-- 
Simon Riggs
http://www.EnterpriseDB.com/
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> 2. Unique Hash Indexes have been summarized here:
> https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> which also seems to have two parts to it.
>
> 2.1 Uniqueness Check
> Amit: "to ensure that there is no duplicate entry we need to traverse
> the whole bucket chain"
> Agreed. That seems straightforward and can also be improved later.
>
> 2.2 Locking
> Amit's idea of holding ExclusiveLock on the bucket page works for me,
> but there was some doubt about splitting.
>

I think the main thing to think about for the uniqueness check during a split (where we scan both the old and new buckets) is whether we need to lock both the old (bucket_being_split) and new (bucket_being_populated) buckets, or whether holding a lock on just one of them (the current bucket in which we are inserting) is sufficient. During a scan of the new bucket, we just retain pins on both buckets (see comments in _hash_first()), but if we need to retain locks on both buckets then we need to do something different than we do for scans. However, I think it is sufficient to hold an exclusive lock on the primary bucket page of the bucket we are trying to insert into, plus a pin on the other bucket (the old bucket, as we do for scans). That's because no concurrent inserter should try to insert the same tuple into both the old and the new bucket: before starting the split we always update the metapage's hashm_lowmask and hashm_highmask, which decide the routing of the tuples.

Now, the other problem we need to think about here is that for a hash index, after finding the tuple in the index, we always need to recheck in the heap, as we don't store the actual value in the hash index. For that, in a scan we get the tuple(s) from the index (releasing the locks) and then match the qual after fetching the tuple from the heap. But we can't do that for a uniqueness check, because if we release the locks on the index bucket page then another inserter could come in before we match it in the heap. I think we need some mechanism such that, after fetching the TID from the index, we recheck the actual value in the heap before releasing the lock on the index bucket page.

The other thing could be that if we have unique support for hash indexes then we can probably allow INSERT ... ON CONFLICT if the user specifies a unique index column as conflict_target.

I am not sure if multi-column index support is mandatory to allow uniqueness for hash indexes; sure, it would be good, but I feel that can be done as a separate patch as well.

-- 
With Regards,
Amit Kapila.
On Tue, Jul 20, 2021 at 5:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > > > 2. Unique Hash Indexes have been summarized here: > > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com > > which also seems to have two parts to it. > > > > 2.1 Uniqueness Check > > Amit: "to ensure that there is no duplicate entry we need to traverse > > the whole bucket chain" > > Agreed. That seems straightforward and can also be improved later. > > > > 2.2 Locking > > Amit's idea of holding ExclusiveLock on the bucket page works for me, > > but there was some doubt about splitting. > > > > I think the main thing to think about for uniqueness check during > split (where we scan both the old and new buckets) was whether we need > to lock both the old (bucket_being_split) and new > (bucket_being_populated) buckets or just holding locks on one of them > (the current bucket in which we are inserting) is sufficient? During a > scan of the new bucket, we just retain pins on both the buckets (see > comments in _hash_first()) but if we need to retain locks on both > buckets then we need to do something different then we do it for > scans. But, I think it is sufficient to just hold an exclusive lock on > the primary bucket page in the bucket we are trying to insert and pin > on the other bucket (old bucket as we do for scans). Because no > concurrent inserter should try to insert into the old bucket and new > bucket the same tuple as before starting the split we always update > the metapage for hashm_lowmask and hashm_highmask which decides the > routing of the tuples. > > Now, I think here the other problem we need to think about is that for > the hash index after finding the tuple in the index, we need to always > recheck in the heap as we don't store the actual value in the hash > index. For that in the scan, we get the tuple(s) from the index > (release locks) and then match qual after fetching tuple from the > heap. But we can't do that for uniqueness check because if we release > the locks on the index bucket page then another inserter could come > before we match it in heap. I think we need some mechanism that after > fetching TID from the index, we recheck the actual value in heap > before releasing the lock on the index bucket page. > One more thing we need to think about here is when to find the right bucket page in the chain where we can insert the new tuple. Do we first try to complete the uniqueness check (which needs to scan through the entire bucket chain) and then again scan the bucket with space to insert or do we want to do it along with uniqueness check scan and remember it? -- With Regards, Amit Kapila.
On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > > > 2. Unique Hash Indexes have been summarized here: > > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com > > which also seems to have two parts to it. > > > > 2.1 Uniqueness Check > > Amit: "to ensure that there is no duplicate entry we need to traverse > > the whole bucket chain" > > Agreed. That seems straightforward and can also be improved later. > > > > 2.2 Locking > > Amit's idea of holding ExclusiveLock on the bucket page works for me, > > but there was some doubt about splitting. > > > > I think the main thing to think about for uniqueness check during > split (where we scan both the old and new buckets) was whether we need > to lock both the old (bucket_being_split) and new > (bucket_being_populated) buckets or just holding locks on one of them > (the current bucket in which we are inserting) is sufficient? During a > scan of the new bucket, we just retain pins on both the buckets (see > comments in _hash_first()) but if we need to retain locks on both > buckets then we need to do something different then we do it for > scans. But, I think it is sufficient to just hold an exclusive lock on > the primary bucket page in the bucket we are trying to insert and pin > on the other bucket (old bucket as we do for scans). Because no > concurrent inserter should try to insert into the old bucket and new > bucket the same tuple as before starting the split we always update > the metapage for hashm_lowmask and hashm_highmask which decides the > routing of the tuples. During an incomplete split, we need to scan both old and new. So during insert, we need to scan both old and new, while holding exclusive locks on both bucket pages. I've spent a few days looking at the split behavior and this seems a complete approach. I'm working on a patch now, still at hacking stage. (BTW, my opinion of the split mechanism has now changed from bad to very good. It works really well for unique data, but can be completely ineffective for badly skewed data). > Now, I think here the other problem we need to think about is that for > the hash index after finding the tuple in the index, we need to always > recheck in the heap as we don't store the actual value in the hash > index. For that in the scan, we get the tuple(s) from the index > (release locks) and then match qual after fetching tuple from the > heap. But we can't do that for uniqueness check because if we release > the locks on the index bucket page then another inserter could come > before we match it in heap. I think we need some mechanism that after > fetching TID from the index, we recheck the actual value in heap > before releasing the lock on the index bucket page. I don't think btree does that, so I'm not sure we do need that for hash. Yes, there is a race condition, but someone will win. Do we care who? Do we care enough to take the concurrency hit? Duplicate inserts would be very rare in a declared unique index, so it would be a poor trade-off. > The other thing could be that if we have unique support for hash index > then probably we can allow Insert ... ON Conflict if the user > specifies unique index column as conflict_target. Yes, that looks doable. 
> I am not sure if multicol index support is mandatory to allow > uniqueness for hash indexes, sure it would be good but I feel that can > be done as a separate patch as well. I have a patch for multicol support, attached. -- Simon Riggs http://www.EnterpriseDB.com/
Attachment
On Tue, Jul 20, 2021 at 1:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > One more thing we need to think about here is when to find the right > bucket page in the chain where we can insert the new tuple. Do we > first try to complete the uniqueness check (which needs to scan > through the entire bucket chain) and then again scan the bucket with > space to insert or do we want to do it along with uniqueness check > scan and remember it? The latter approach, but that is just a performance tweak for later. On a unique hash index, regular splitting means there are almost no bucket chains more than 2 long (bucket plus overflow), so it seems like mostly wasted effort at this point. -- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jul 20, 2021 at 6:32 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > > > > I think the main thing to think about for uniqueness check during > > split (where we scan both the old and new buckets) was whether we need > > to lock both the old (bucket_being_split) and new > > (bucket_being_populated) buckets or just holding locks on one of them > > (the current bucket in which we are inserting) is sufficient? During a > > scan of the new bucket, we just retain pins on both the buckets (see > > comments in _hash_first()) but if we need to retain locks on both > > buckets then we need to do something different then we do it for > > scans. But, I think it is sufficient to just hold an exclusive lock on > > the primary bucket page in the bucket we are trying to insert and pin > > on the other bucket (old bucket as we do for scans). Because no > > concurrent inserter should try to insert into the old bucket and new > > bucket the same tuple as before starting the split we always update > > the metapage for hashm_lowmask and hashm_highmask which decides the > > routing of the tuples. > > During an incomplete split, we need to scan both old and new. So > during insert, we need to scan both old and new, while holding > exclusive locks on both bucket pages. > It will surely work if we have an exclusive lock on both the buckets (old and new) in this case but I think it is better if we can avoid exclusive locking the old bucket (bucket_being_split) unless it is really required. We need an exclusive lock on the primary bucket where we are trying to insert to avoid any other inserter with the same key but I think we don't need it for the old bucket because no inserter with the same key can try to insert the key in an old bucket which would belong to the new bucket. > I've spent a few days looking at > the split behavior and this seems a complete approach. I'm working on > a patch now, still at hacking stage. > > (BTW, my opinion of the split mechanism has now changed from bad to > very good. It works really well for unique data, but can be completely > ineffective for badly skewed data). > > > Now, I think here the other problem we need to think about is that for > > the hash index after finding the tuple in the index, we need to always > > recheck in the heap as we don't store the actual value in the hash > > index. For that in the scan, we get the tuple(s) from the index > > (release locks) and then match qual after fetching tuple from the > > heap. But we can't do that for uniqueness check because if we release > > the locks on the index bucket page then another inserter could come > > before we match it in heap. I think we need some mechanism that after > > fetching TID from the index, we recheck the actual value in heap > > before releasing the lock on the index bucket page. > > I don't think btree does that, so I'm not sure we do need that for > hash. Yes, there is a race condition, but someone will win. Do we care > who? Do we care enough to take the concurrency hit? > I think if we don't care we might end up with duplicates. I might be missing something here but let me explain the problem I see. Say, while doing a unique check we found the same hash value in the bucket we are trying to insert, we can't say unique key violation at this stage and error out without checking the actual value in heap. 
This is because there is always a chance that two different key values map to the same hash value. Now, suppose that after checking in the heap we find that the actual value doesn't match, so we decide to insert the value into the hash index; in the meantime, another insert of the same key value has already performed these checks and inserted its value, and that would lead to a duplicate value in the hash index. I think btree doesn't have a similar risk, so we don't need such a mechanism for btree.

-- 
With Regards,
Amit Kapila.
On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > It will surely work if we have an exclusive lock on both the buckets > (old and new) in this case but I think it is better if we can avoid > exclusive locking the old bucket (bucket_being_split) unless it is > really required. We need an exclusive lock on the primary bucket where > we are trying to insert to avoid any other inserter with the same key > but I think we don't need it for the old bucket because no inserter > with the same key can try to insert the key in an old bucket which > would belong to the new bucket. Agreed. > > I don't think btree does that, so I'm not sure we do need that for > > hash. Yes, there is a race condition, but someone will win. Do we care > > who? Do we care enough to take the concurrency hit? > > > > I think if we don't care we might end up with duplicates. I might be > missing something here but let me explain the problem I see. Say, > while doing a unique check we found the same hash value in the bucket > we are trying to insert, we can't say unique key violation at this > stage and error out without checking the actual value in heap. This is > because there is always a chance that two different key values can map > to the same hash value. Now, after checking in the heap if we found > that the actual value doesn't match so we decide to insert the value > in the hash index, and in the meantime, another insert of the same key > value already performed these checks and ends up inserting the value > in hash index and that would lead to a duplicate value in the hash > index. I think btree doesn't have a similar risk so we don't need such > a mechanism for btree. Agreed, after thinking about it more while coding. All of the above implemented in the patches below: Complete patch for hash_multicol.v3.patch attached, slightly updated from earlier patch. Docs, tests, passes make check. WIP for hash_unique.v4.patch attached, patch-on-patch, to allow us to discuss flow of logic and shape of code. The actual comparison is not implemented yet. Not trivial, but can wait until we decide main logic. Passes make check and executes attached tests. Tests in separate file also attached, will eventually be merged into src/test/regress/sql/hash_index.sql No tests yet for splitting or deferred uniqueness checks. The latter is because there are parse analysis changes needed to undo the assumption that only btrees support uniqueness, but nothing there looks like an issue. Thanks for your input, have a good weekend. -- Simon Riggs http://www.EnterpriseDB.com/
Attachment
On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Complete patch for hash_multicol.v3.patch attached, slightly updated
> from earlier patch.
> Docs, tests, passes make check.

I was looking into the hash_multicol.v3.patch, and I have a question:

   <para>
-   Hash indexes support only single-column indexes and do not allow
-   uniqueness checking.
+   Hash indexes support uniqueness checking.
+   Hash indexes support multi-column indexes, but only store the hash value
+   for the first column, so multiple columns are useful only for uniquness
+   checking.
   </para>

The above says that we store the hash value only for the first column; my question is why don't we store it for the other columns as well? I mean, we can search the bucket based on the first column's hash, but the hashes for the other columns could be payload data, and we could use them to match the hash values for the other key columns before accessing the heap, as discussed here [1]. IMHO, this will further reduce heap access.

[1] https://www.postgresql.org/message-id/7192.1506527843%40sss.pgh.pa.us

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 15, 2021 at 9:41 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > It would be very desirable to allow Hash Indexes to become Primary Key > Indexes, which requires both > amroutine->amcanunique = true; > amroutine->amcanmulticol = true; Why do you say that? I don't think it's self-evident that it's desirable. In general I don't think that hash indexes are all that compelling compared to B-Trees. In practice the flexibility of B-Trees tends to win out, even if B-Trees are slightly slower than hash indexes with certain kinds of benchmarks that are heavy on point lookups and have no locality. I have no reason to object to any of this, and I don't object. I'm just asking. -- Peter Geoghegan
On Tue, Aug 10, 2021 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jul 23, 2021 at 6:16 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > > > On Thu, 22 Jul 2021 at 06:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Complete patch for hash_multicol.v3.patch attached, slightly updated > > from earlier patch. > > Docs, tests, passes make check. > > I was looking into the hash_multicoul.v3.patch, I have a question > > <para> > - Hash indexes support only single-column indexes and do not allow > - uniqueness checking. > + Hash indexes support uniqueness checking. > + Hash indexes support multi-column indexes, but only store the hash value > + for the first column, so multiple columns are useful only for uniquness > + checking. > </para> > > The above comments say that we store hash value only for the first > column, my question is why don't we store for other columns as well? > I mean we can search the bucket based on the first column hash but the > hashes for the other column could be payload data and we can use that > to match the hash value for other key columns before accessing the > heap, as discussed here[1]. IMHO, this will further reduce the heap > access. > True, the other idea could be that in the payload we store the value after 'combining multi-column hashes into one hash value'. This will allow us to satisfy queries where the search is on all columns of the index efficiently provided the planner doesn't remove some of them in which case we need to do more work. One more thing which we need to consider is 'hashm_procid' stored in meta page, currently, it works for the single-column index but for the multi-column index, we might want to set it as InvalidOid. -- With Regards, Amit Kapila.
On Tue, Aug 10, 2021 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I was looking into the hash_multicoul.v3.patch, I have a question > > <para> > - Hash indexes support only single-column indexes and do not allow > - uniqueness checking. > + Hash indexes support uniqueness checking. > + Hash indexes support multi-column indexes, but only store the hash value > + for the first column, so multiple columns are useful only for uniquness > + checking. > </para> > > The above comments say that we store hash value only for the first > column, my question is why don't we store for other columns as well? I suspect it would be hard to store multiple hash values, one per column. It seems to me that what we ought to do is combine the hash values for the individual columns using hash_combine(64) and store the combined value. I can't really imagine why we would NOT do that. It seems like it would be easy to do and make the behavior of the feature way less surprising. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I suspect it would be hard to store multiple hash values, one per > column. It seems to me that what we ought to do is combine the hash > values for the individual columns using hash_combine(64) and store the > combined value. I can't really imagine why we would NOT do that. That would make it impossible to use the index except with queries that provide equality conditions on all the index columns. Maybe that's fine, but it seems less flexible than other possible definitions. It really makes me wonder why anyone would bother with a multicol hash index. regards, tom lane
On Wed, Aug 11, 2021 at 10:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I suspect it would be hard to store multiple hash values, one per > > column. It seems to me that what we ought to do is combine the hash > > values for the individual columns using hash_combine(64) and store the > > combined value. I can't really imagine why we would NOT do that. > > That would make it impossible to use the index except with queries > that provide equality conditions on all the index columns. Maybe > that's fine, but it seems less flexible than other possible definitions. > It really makes me wonder why anyone would bother with a multicol > hash index. Hmm. That is a point I hadn't considered. I have to admit that after working with Amit on all the work to make hash indexes WAL-logged a few years ago, I was somewhat disillusioned with the whole AM. It seems like a cool idea to me but it's just not that well-implemented. For example, the strategy of just doubling the number of buckets in one shot seems pretty terrible for large indexes, and ea69a0dead5128c421140dc53fac165ba4af8520 will buy only a limited amount of relief. Likewise, the fact that keys are stored in hash value order within pages but that the bucket as a whole is not kept in order seems like it's bad for search performance and really bad for implementing unique indexes with reasonable amounts of locking. (I don't know how the present patch tries to solve that problem.) It's tempting to think that we should think about creating something altogether new instead of hacking on the existing implementation, but that's a lot of work and I'm not sure what specific design would be best. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I have to admit that after working with Amit on all the work to make > hash indexes WAL-logged a few years ago, I was somewhat disillusioned > with the whole AM. It seems like a cool idea to me but it's just not > that well-implemented. Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme is pretty well designed to ensure bloat, but trying to ameliorate that by reducing the number of buckets creates its own problems (since, as you mention, we have no scheme whatever for searching within a bucket). I'm quite unimpressed with Simon's upthread proposal to turn off bucket splitting without doing anything about the latter issue. I feel like we'd be best off to burn the AM to the ground and start over. I do not know what a better design would look like exactly, but I feel like it's got to decouple buckets from pages somehow. Along the way, I'd want to store 64-bit hash values (we still haven't done that have we?). As far as the specific point at hand is concerned, I think storing a hash value per index column, while using only the first column's hash for bucket selection, is what to do for multicol indexes. We still couldn't set amoptionalkey=true for hash indexes, because without a hash for the first column we don't know which bucket to look in. But storing hashes for the additional columns would allow us to check additional conditions in the index, and usually save trips to the heap on queries that provide additional column conditions. You could also imagine sorting the contents of a bucket on all the hashes, which would ease uniqueness checks. regards, tom lane
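To make the sorting idea concrete, a small self-contained sketch (the entry layout here is hypothetical, not the hash index's actual on-disk tuple format): once a bucket's entries are ordered on all per-column hashes, potential duplicates become adjacent, so a uniqueness check only needs to heap-recheck neighbouring entries with equal hashes.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NKEYS 2					/* per-column hashes stored in each entry (illustrative) */

typedef struct
{
	uint32_t	hashes[NKEYS];	/* one hash per index column (hypothetical layout) */
} HashIndexEntry;

/* Lexicographic comparison over all per-column hashes. */
static int
entry_cmp(const void *a, const void *b)
{
	const HashIndexEntry *x = a;
	const HashIndexEntry *y = b;

	for (int i = 0; i < NKEYS; i++)
	{
		if (x->hashes[i] != y->hashes[i])
			return (x->hashes[i] < y->hashes[i]) ? -1 : 1;
	}
	return 0;
}

int
main(void)
{
	HashIndexEntry bucket[] = {
		{{0xaaaa0001, 0x00000007}},
		{{0xaaaa0001, 0x00000003}},
		{{0xaaaa0001, 0x00000007}},	/* same hashes as the first entry */
	};
	int			n = 3;

	qsort(bucket, (size_t) n, sizeof(HashIndexEntry), entry_cmp);

	/* After sorting, possible duplicates are adjacent; only those need a heap recheck. */
	for (int i = 1; i < n; i++)
		if (entry_cmp(&bucket[i - 1], &bucket[i]) == 0)
			printf("entries %d and %d may collide; recheck in heap\n", i - 1, i);
	return 0;
}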
On Wed, Aug 11, 2021 at 10:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> don't know how the present patch tries to solve that problem.) It's
> tempting to think that we should think about creating something
> altogether new instead of hacking on the existing implementation, but
> that's a lot of work and I'm not sure what specific design would be
> best.
(Standard disclaimer that I'm not qualified to design index AMs) I've seen one mention in the literature about the possibility of simply having a btree index over the hash values. That would require faster search within pages, in particular using abbreviated keys in the ItemId array of internal pages [1] and interpolated search rather than pure binary search (which should work reliably with high-entropy keys like hash values), but doing that would speed up all btree indexes, so that much is worth doing regardless of how hash indexes are implemented. In that scheme, the hash index AM would just be around for backward compatibility.
John Naylor <john.naylor@enterprisedb.com> writes: > (Standard disclaimer that I'm not qualified to design index AMs) I've seen > one mention in the literature about the possibility of simply having a > btree index over the hash values. Yeah, that's been talked about in the past --- we considered it moderately seriously back when the hash AM was really only a toy for lack of WAL support. The main knock on it is that searching a btree is necessarily O(log N), while in principle a hash probe could be O(1). Of course, a badly-implemented hash AM could be worse than O(1), but we'd basically be giving up on ever getting to O(1). There's a separate discussion to be had about whether there should be an easier way for users to build indexes that are btrees of hashes. You can do it today but the indexes aren't pleasant to use, requiring query adjustment. regards, tom lane
On Wed, Aug 11, 2021 at 11:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme
> is pretty well designed to ensure bloat, but trying to ameliorate that
> by reducing the number of buckets creates its own problems (since, as
> you mention, we have no scheme whatever for searching within a bucket).
> I'm quite unimpressed with Simon's upthread proposal to turn off bucket
> splitting without doing anything about the latter issue.

Maybe. I don't think that it should be a huge problem to decide that an occupied bucket has to consume an entire page; if that's a big issue, you should just have fewer buckets. I do think it's a problem that a bucket containing no tuples at all still consumes an entire page, because the data is often going to be skewed so that many buckets are entirely empty. I also think it's a problem that expanding the directory from 2^{N} buckets to 2^{N+1} buckets requires 4 allocations of 2^{N-2} *consecutive* pages. That's a lot of consecutive pages for even fairly modest values of N.

Imagine a design where we have a single-page directory with 1024 slots, each corresponding to one bucket. Each slot stores a block number, which might be InvalidBlockNumber if there are no tuples in that bucket. Given a tuple with hash value H, check slot H%1024 and then go to that page to look further. If there are more tuples in that bucket than can fit on the page, then it can link to another page. If we assume for the sake of argument that 1024 is the right number of buckets, this is going to use about as much space as the current system when the data distribution is uniform, but considerably less when it's skewed. The larger you make the number of buckets, the better this kind of thing looks on skewed data.

Now, you can't just always have 1024 buckets, so we'd actually have to do something a bit more clever, probably involving multiple levels of directories. For example, suppose a directory page contains only 32 slots. That will leave a lot of empty space in the page, which can be used to store tuples. An index search has to scan all tuples that are stored directly in the page, and also use the first 5 bits of the hash key to search the appropriate bucket. But the bucket is itself a directory: it can have some tuples stored directly in the page, and then it has 32 more slots and you use the next 5 bits of the hash key to decide which one of those to search.

Then it becomes possible to incrementally expand the hash index: when the space available in a directory page fills up, you can either create a sub-directory and move as many tuples as you can into that page, or add an overflow page that contains only tuples. It's important to be able to do either one, because sometimes a bucket fills up with tuples that have identical hash values, and sometimes a bucket fills up with tuples that have a variety of hash values. The current implementation tends to massively increase the number of buckets even when it does very little to spread the index entries out. ("Hey, I doubled the number of buckets and the keys are still almost all in one bucket ... let me double the number of buckets again and see if it works better this time!") If we were going to create a replacement, we'd want the index to respond differently to a bunch of dups vs. a bunch of non-dups.
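To visualise the single-page directory being described (purely an illustration; nothing like this exists in the tree today, and the PostgreSQL-style names are defined locally here):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;					/* matches PostgreSQL's BlockNumber */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

#define HASH_DIR_FANOUT 1024					/* slots per directory page */

/* Hypothetical single-page directory: one block number per bucket. */
typedef struct HashDirectory
{
	BlockNumber bucket_blkno[HASH_DIR_FANOUT];	/* InvalidBlockNumber = empty bucket */
} HashDirectory;

/* Route a hash value to its bucket's first block, if any. */
static BlockNumber
bucket_for_hash(const HashDirectory *dir, uint32_t hashval)
{
	return dir->bucket_blkno[hashval % HASH_DIR_FANOUT];
}

int
main(void)
{
	HashDirectory dir;

	/* Start with every bucket empty: skewed data costs no pages at all. */
	for (int i = 0; i < HASH_DIR_FANOUT; i++)
		dir.bucket_blkno[i] = InvalidBlockNumber;

	/* Pretend block 42 holds the bucket for this hash value. */
	dir.bucket_blkno[0x12345678u % HASH_DIR_FANOUT] = 42;

	printf("bucket block for hash 0x12345678: %u\n",
		   (unsigned) bucket_for_hash(&dir, 0x12345678u));
	printf("bucket block for hash 0x0: %u (empty)\n",
		   (unsigned) bucket_for_hash(&dir, 0));
	return 0;
}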
> As far as the specific point at hand is concerned, I think storing > a hash value per index column, while using only the first column's > hash for bucket selection, is what to do for multicol indexes. > We still couldn't set amoptionalkey=true for hash indexes, because > without a hash for the first column we don't know which bucket to > look in. But storing hashes for the additional columns would allow > us to check additional conditions in the index, and usually save > trips to the heap on queries that provide additional column > conditions. You could also imagine sorting the contents of a bucket > on all the hashes, which would ease uniqueness checks. That sounds reasonable I guess, but I don't know how hard it is to implement. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As far as the specific point at hand is concerned, I think storing
> a hash value per index column, while using only the first column's
> hash for bucket selection, is what to do for multicol indexes.
> We still couldn't set amoptionalkey=true for hash indexes, because
> without a hash for the first column we don't know which bucket to
> look in. But storing hashes for the additional columns would allow
> us to check additional conditions in the index, and usually save
> trips to the heap on queries that provide additional column
> conditions. You could also imagine sorting the contents of a bucket
> on all the hashes, which would ease uniqueness checks.

Earlier, I was thinking that we would have two hashes: one for the first key column, for identifying the bucket, and one for the remaining key columns, which would further help with the heap lookup and with ordering for uniqueness checking. But yeah, if we have a hash value for each column then it will make it really flexible.

I was also looking into how other databases support hash indexes, and I see that at least in MySQL [1] a multi-column hash index has the limitation that you have to supply all key columns in the search condition for the index scan to be selected. IMHO, that limitation might be there because they store just one hash value based on all key columns and also select the bucket based on that same hash value.

[1] https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 12, 2021 at 9:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > As far as the specific point at hand is concerned, I think storing
> > a hash value per index column, while using only the first column's
> > hash for bucket selection, is what to do for multicol indexes.
> > We still couldn't set amoptionalkey=true for hash indexes, because
> > without a hash for the first column we don't know which bucket to
> > look in. But storing hashes for the additional columns would allow
> > us to check additional conditions in the index, and usually save
> > trips to the heap on queries that provide additional column
> > conditions.

Yeah, this sounds reasonable, but I think the alternative proposals by Dilip (see below) and me [1] also have merit.

> > You could also imagine sorting the contents of a bucket
> > on all the hashes, which would ease uniqueness checks.
>

Yeah, we can do that, but the current design also seems to have merits for uniqueness checks. For sorting all the hashes in the bucket, we need to read all the overflow pages and then sort, which could lead to additional I/O in some cases. The other possibility is to keep all the bucket pages sorted during insertion, but that would also require additional I/O. OTOH, in the current design, if the value is not found in the current bucket page (which has hash values in sorted order), only then do we move to the next page.

> Earlier, I was thinking that we have two hashes, one for the first key
> column that is for identifying the bucket, and one for the remaining
> key columns which will further help with heap lookup and ordering for
> uniqueness checking.
>

I mentioned an almost identical idea yesterday [1]. If we go with a specification similar to MySQL and SQL Server then that would probably be better than storing the hashes for all the columns.

> But yeah if we have a hash value for each column
> then it will make it really flexible.
>
> I was also looking into other databases that how they support hash
> indexes, then I see at least in MySQL[1] the multiple column index has
> a limitation that you have to give all key columns in search for
> selecting the index scan.

I see that SQL Server also has the same specification for multi-column hash indexes [2]. See the "Multi-column index" section. So it might not be a bad idea to have a similar specification.

[1] - https://www.postgresql.org/message-id/CAA4eK1JD1%3DnPDi0kDPGLC%2BJDGEYP8DgTanobvgve%2B%2BKniQ68TA%40mail.gmail.com
[2] - https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

-- 
With Regards,
Amit Kapila.
On Wed, Aug 11, 2021 at 8:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > I have to admit that after working with Amit on all the work to make > > hash indexes WAL-logged a few years ago, I was somewhat disillusioned > > with the whole AM. It seems like a cool idea to me but it's just not > > that well-implemented. > > Yeah, agreed. The whole buckets-are-integral-numbers-of-pages scheme > is pretty well designed to ensure bloat, but trying to ameliorate that > by reducing the number of buckets creates its own problems (since, as > you mention, we have no scheme whatever for searching within a bucket). > I'm quite unimpressed with Simon's upthread proposal to turn off bucket > splitting without doing anything about the latter issue. > The design of the patch has changed since the initial proposal. It tries to perform unique inserts by holding a write lock on the bucket page to avoid duplicate inserts. -- With Regards, Amit Kapila.
On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The design of the patch has changed since the initial proposal. It > tries to perform unique inserts by holding a write lock on the bucket > page to avoid duplicate inserts. Do you mean that you're holding a buffer lwlock while you search the whole bucket? If so, that's surely not OK. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 11, 2021 at 8:51 AM John Naylor <john.naylor@enterprisedb.com> wrote:
> (Standard disclaimer that I'm not qualified to design index AMs) I've seen one mention in the literature about the possibility of simply having a btree index over the hash values. That would require faster search within pages, in particular using abbreviated keys in the ItemId array of internal pages [1] and interpolated search rather than pure binary search (which should work reliably with high-entropy keys like hash values), but doing that would speed up all btree indexes, so that much is worth doing regardless of how hash indexes are implemented. In that scheme, the hash index AM would just be around for backward compatibility.

I think that it's possible (though hard) to do that without involving hashing, even for datatypes like text. Having some kind of prefix compression that makes the final abbreviated keys have high entropy would be essential, though. I agree that it would probably be significantly easier when you knew you were dealing with hash values, but even there you need some kind of prefix compression. In any case I suspect that it would make sense to reimplement hash indexes as a translation layer between hash index opclasses and nbtree.

Robert said "Likewise, the fact that keys are stored in hash value order within pages but that the bucket as a whole is not kept in order seems like it's bad for search performance". Obviously we've already done a lot of work on an index AM that deals with a fully ordered keyspace very well. This includes dealing with large groups of duplicates gracefully, since in a certain sense there are no duplicate B-Tree index tuples -- the heap TID tiebreaker ensures that. And it ensures that you have heap-wise locality within these large groups, which is a key enabler of things like opportunistic index deletion.

When hash indexes have been used in database systems, it tends to be in-memory database systems where the recovery path doesn't recover indexes -- they're just rebuilt from scratch instead. If that's already a baked-in assumption then hash indexes make more sense.

To me it seems like the problem with true hash indexes is that they're constructed in a top-down fashion, which is approximately the opposite of the bottom-up, incremental approach used by B-Tree indexing. This seems to be where all the skew problems arise from. This approach cannot be robust to changes in the keyspace over time, really.

-- 
Peter Geoghegan
On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The design of the patch has changed since the initial proposal. It > > tries to perform unique inserts by holding a write lock on the bucket > > page to avoid duplicate inserts. > > Do you mean that you're holding a buffer lwlock while you search the > whole bucket? If so, that's surely not OK. > I think here you are worried that after holding lwlock we might perform reads of overflow pages which is not a good idea. I think there are fewer chances of having overflow pages for unique indexes so we don't expect such cases in common and as far as I can understand this can happen in btree as well during uniqueness check. Now, I think the other possibility could be that we do some sort of lock chaining where we grab the lock of the next bucket before releasing the lock of the current bucket as we do during bucket clean up but not sure about the same. I haven't studied the patch in detail so it is better for Simon to pitch in here to avoid any incorrect information or if he has a different understanding/idea. -- With Regards, Amit Kapila.
On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > The design of the patch has changed since the initial proposal. It
> > > tries to perform unique inserts by holding a write lock on the bucket
> > > page to avoid duplicate inserts.
> >
> > Do you mean that you're holding a buffer lwlock while you search the
> > whole bucket? If so, that's surely not OK.
> >
>
> I think here you are worried that after holding lwlock we might
> perform reads of overflow pages which is not a good idea. I think
> there are fewer chances of having overflow pages for unique indexes so
> we don't expect such cases in common

I think if we identify the bucket based on the hash value of all the columns then there should be fewer overflow buckets, but IIUC, in this patch the bucket is identified based on the hash value of the first column only, so there could be a lot of duplicates on the first column.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 13, 2021 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 13, 2021 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > The design of the patch has changed since the initial proposal. It
> > > > tries to perform unique inserts by holding a write lock on the bucket
> > > > page to avoid duplicate inserts.
> > >
> > > Do you mean that you're holding a buffer lwlock while you search the
> > > whole bucket? If so, that's surely not OK.
> > >
> >
> > I think here you are worried that after holding lwlock we might
> > perform reads of overflow pages which is not a good idea. I think
> > there are fewer chances of having overflow pages for unique indexes so
> > we don't expect such cases in common
>
> I think if we identify the bucket based on the hash value of all the
> columns then there should be a fewer overflow bucket, but IIUC, in
> this patch bucket, is identified based on the hash value of the first
> column only so there could be a lot of duplicates on the first column.

IMHO, as discussed above, other databases also have the limitation that if you create a multi-column hash index then the hash index cannot be used unless all the key columns are used in the search condition. So my point is that users might be using hash indexes with this limitation, and their use case might be that they want the best performance for this particular case; they might not be looking for as much flexibility as we provide in BTREE.

For reference:
https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15

We already know that performance will be better with a single hash for multiple columns, but I still wanted to check the performance difference in PG. This might help us decide which approach to go for. With a quick POC of both ideas, I have observed a major performance advantage for the single combined hash over multiple key columns.

Performance Test Details: (Used PGBENCH Tool)
Initialize cmd: "./pgbench -i -s 100 -d postgres"

postgres=# \d+ pgbench_accounts
                                      Table "public.pgbench_accounts"
  Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
 aid      | integer       |           | not null |         | plain    |             |              |
 bid      | integer       |           |          |         | plain    |             |              |
 abalance | integer       |           |          |         | plain    |             |              |
 filler   | character(84) |           |          |         | extended |             |              |
Indexes:
    "pgbench_accounts_idx" hash (aid, bid)
Access method: heap
Options: fillfactor=100

Test Command: "./pgbench -j 1 postgres -C -M prepared -S -T 300"

Performance Test Results:

Idea-1: Single hash value for multiple key columns
TPS = ~292

Idea-2: Separate hash values for each key column, but use only the first one to search the bucket; the other hash values are used as payload to get to the matching tuple before going to the heap.
TPS = ~212

Note: Here we get nearly 25% better performance with the single combined hash approach, with only TWO key columns. If we go for separate hash values for all the key columns mentioned then there will be a performance dip, and storage will also be relatively higher when we have more key columns.
I have just done separate POC patches to get the performance results mentioned above; there are many other scenarios, like the split case, still to be taken care of. Attaching the POC patches here just for reference…

Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Aug 27, 2021 at 4:27 PM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > IMHO, as discussed above, since other databases also have the > limitation that if you create a multi-column hash index then the hash > index can not be used until all the key columns are used in the search > condition. So my point is that users might be using the hash index > with this limitation and their use case might be that they want to > gain the best performance when they use this particular case and they > might not be looking for much flexibility like we provide in BTREE. > > For reference: > https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html > https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15 > > We already know that performance will be better with a single hash for > multiple columns, but still I just wanted to check the performance > difference in PG. This might help us to decide the approach we need to > go for. With a quick POC of both the ideas, I have observed there is a > major performance advantage with single combined hash for multi-key > columns. > > Performance Test Details: (Used PGBENCH Tool) > Initialize cmd: “./pgbench -i -s 100 -d postgres" > > postgres=# \d+ pgbench_accounts > > Table "public.pgbench_accounts" > > Column | Type | Collation | Nullable | Default | Storage > | Compression | Stats target | Description > > ----------+---------------+-----------+----------+---------+----------+-------------+--------------+------------- > > aid | integer | | not null | | plain > | | | > bid | integer | | | | plain > | | | > abalance | integer | | | | plain > | | | > filler | character(84) | | | | extended > | | | > > Indexes: > "pgbench_accounts_idx" hash (aid, bid) > Access method: heap > Options: fillfactor=100 > > Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300” > > Performance Test Results: > Idea-1: Single Hash value for multiple key columns > TPS = ~292 > > Idea-2: Separate Hash values for each key column. But use only the > first one to search the bucket. Other hash values are used as payload > to get to the matching tuple before going to the heap. > TPS = ~212 > > Note: Here we got near to 25% better performance in a single combine > hash approach with only TWO key columns. > That's a significant difference. Have you checked via perf or some other way what causes this difference? I have seen that sometimes single client performance with pgbench is not stable, so can you please once check with 4 clients or so and possibly with a larger dataset as well. One more thing to consider is that it seems that the planner requires a condition for the first column of an index before considering an indexscan plan. See Tom's email [1] in this regard. I think it would be better to see what kind of work is involved there if you want to explore a single hash value for all columns idea. [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us -- With Regards, Amit Kapila.
> That's a significant difference. Have you checked via perf or some
> other way what causes this difference? I have seen that sometimes
> single client performance with pgbench is not stable, so can you
> please once check with 4 clients or so and possibly with a larger
> dataset as well.

I have also verified this manually, without pgbench, and I can see a significant difference for each query fired against the two versions of the patch. As shown below, I ran the SAME queries on the SAME dataset with both patches, and there is a significant performance impact with separate hash values for multiple key columns.

SingleHash_MultiColumn:

postgres=# create table perftest(a int, b int, c int, d int, e int, f int);
CREATE TABLE
postgres=# insert into perftest values (generate_series(1, 10000000), generate_series(1, 10000000), generate_series(1, 10000000), 9, 7);
INSERT 0 10000000
postgres=# create index idx on perftest using hash(a, b, c);
CREATE INDEX
postgres=# select * from perftest where a=5999 and b=5999 and c=5999;
  a   |  b   |  c   | d | e | f
------+------+------+---+---+---
 5999 | 5999 | 5999 | 9 | 7 |
(1 row)

Time: 2.022 ms
postgres=# select * from perftest where a=597989 and b=597989 and c=597989;
   a    |   b    |   c    | d | e | f
--------+--------+--------+---+---+---
 597989 | 597989 | 597989 | 9 | 7 |
(1 row)

Time: 0.867 ms
postgres=# select * from perftest where a=6297989 and b=6297989 and c=6297989;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6297989 | 6297989 | 6297989 | 9 | 7 |
(1 row)

Time: 1.439 ms
postgres=# select * from perftest where a=6290798 and b=6290798 and c=6290798;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290798 | 6290798 | 6290798 | 9 | 7 |
(1 row)

Time: 1.013 ms
postgres=# select * from perftest where a=6290791 and b=6290791 and c=6290791;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290791 | 6290791 | 6290791 | 9 | 7 |
(1 row)

Time: 0.903 ms
postgres=# select * from perftest where a=62907 and b=62907 and c=62907;
   a   |   b   |   c   | d | e | f
-------+-------+-------+---+---+---
 62907 | 62907 | 62907 | 9 | 7 |
(1 row)

Time: 0.894 ms

SeparateHash_MultiColumn:

postgres=# create table perftest(a int, b int, c int, d int, e int, f int);
CREATE TABLE
postgres=# insert into perftest values (generate_series(1, 10000000), generate_series(1, 10000000), generate_series(1, 10000000), 9, 7);
INSERT 0 10000000
postgres=# create index idx on perftest using hash(a, b, c);
CREATE INDEX
postgres=# select * from perftest where a=5999 and b=5999 and c=5999;
  a   |  b   |  c   | d | e | f
------+------+------+---+---+---
 5999 | 5999 | 5999 | 9 | 7 |
(1 row)

Time: 2.915 ms
postgres=# select * from perftest where a=597989 and b=597989 and c=597989;
   a    |   b    |   c    | d | e | f
--------+--------+--------+---+---+---
 597989 | 597989 | 597989 | 9 | 7 |
(1 row)

Time: 1.129 ms
postgres=# select * from perftest where a=6297989 and b=6297989 and c=6297989;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6297989 | 6297989 | 6297989 | 9 | 7 |
(1 row)

Time: 2.454 ms
postgres=# select * from perftest where a=6290798 and b=6290798 and c=6290798;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290798 | 6290798 | 6290798 | 9 | 7 |
(1 row)

Time: 2.327 ms
postgres=# select * from perftest where a=6290791 and b=6290791 and c=6290791;
    a    |    b    |    c    | d | e | f
---------+---------+---------+---+---+---
 6290791 | 6290791 | 6290791 | 9 | 7 |
(1 row)

Time: 1.676 ms
postgres=# select * from perftest where a=62907 and b=62907 and c=62907;
   a   |   b   |   c   | d | e | f
-------+-------+-------+---+---+---
 62907 | 62907 | 62907 | 9 | 7 |
(1 row)

Time: 2.614 ms

If I run the test with 4 clients, there is not much visible difference; I think that is because of contention. In any case, the focus here is single-thread, single-operation performance.

> One more thing to consider is that it seems that the planner requires
> a condition for the first column of an index before considering an
> indexscan plan. See Tom's email [1] in this regard. I think it would
> be better to see what kind of work is involved there if you want to
> explore a single hash value for all columns idea.
>
> [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

About this point, I will analyze further and update.

Thanks & Regards
SadhuPrasad
EnterpriseDB: http://www.enterprisedb.com
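To make the two benchmarked approaches concrete, here is a minimal standalone C sketch (not the patches' actual code) of the difference between them: SingleHash_MultiColumn folds the per-column hashes into one value, SeparateHash_MultiColumn keeps one hash per column. The combiner is modeled on the boost-style hash_combine() in PostgreSQL's common/hashfn.h; the helper names and the toy per-column hash are illustrative assumptions only.

#include <stdint.h>
#include <stdio.h>

/* 32-bit combiner, modeled on the boost-style hash_combine(); illustrative only */
static uint32_t
combine_hashes(uint32_t a, uint32_t b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

/* Toy per-column hash; the real code would call the opclass hash support function */
static uint32_t
hash_int32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6b;
	x ^= x >> 13;
	x *= 0xc2b2ae35;
	x ^= x >> 16;
	return x;
}

int
main(void)
{
	uint32_t	cols[3] = {5999, 5999, 5999};	/* e.g. a, b, c from the test above */
	uint32_t	combined = 0;

	/* SingleHash_MultiColumn: one folded value drives bucket selection */
	for (int i = 0; i < 3; i++)
		combined = combine_hashes(combined, hash_int32(cols[i]));
	printf("combined hash: %08x\n", (unsigned) combined);

	/* SeparateHash_MultiColumn: each column keeps its own hash value */
	for (int i = 0; i < 3; i++)
		printf("column %d hash: %08x\n", i, (unsigned) hash_int32(cols[i]));
	return 0;
}

This only shows the shape of the computation; it is not meant to reproduce the timings above.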
> > One more thing to consider is that it seems that the planner requires > > a condition for the first column of an index before considering an > > indexscan plan. See Tom's email [1] in this regard. I think it would > > be better to see what kind of work is involved there if you want to > > explore a single hash value for all columns idea. > > > > [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us > > About this point, I will analyze further and update. > I have checked the planner code, there does not seem to be any complicated changes needed to cover if we take up a single hash value for all columns... Below are the major part of changes needed: In build_index_paths(), there is a check like, "if (index_clauses == NIL && !index->amoptionalkey)", which helps to figure out if the leading column has any clause or not. This needs to be moved out of the loop and check for clauses on all key columns. With this we need to add a "amallcolumncluse" field to Index structure, which will be set to TRUE for HASH index and FALSE in other cases. And to get the multi-column hash index selected, we may set enable_hashjoin =off, to avoid any condition become join condition, saw similar behaviors in other DBs as well... Thanks & Regards SadhuPrasad EnterpriseDB: http://www.enterprisedb.com
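The planner change described here can be modeled in a few lines. Below is a standalone sketch of the decision rule only; index_is_usable(), require_clause_on_all_columns and the other names are hypothetical stand-ins, not the real build_index_paths() code.

#include <stdbool.h>
#include <stdio.h>

/*
 * has_clause[i] says whether the query supplies a usable clause for index
 * column i.  The proposed rule for multi-column hash indexes requires a
 * clause on every key column, instead of only checking the leading column
 * when amoptionalkey is false.
 */
static bool
index_is_usable(const bool *has_clause, int ncolumns,
				bool amoptionalkey, bool require_clause_on_all_columns)
{
	if (require_clause_on_all_columns)
	{
		for (int i = 0; i < ncolumns; i++)
			if (!has_clause[i])
				return false;
		return true;
	}

	/* existing rule: only the leading column matters when !amoptionalkey */
	if (!amoptionalkey && !has_clause[0])
		return false;
	return true;
}

int
main(void)
{
	bool		clauses[3] = {true, true, false};	/* WHERE a = ... AND b = ... */

	printf("leading-column rule: %d\n", index_is_usable(clauses, 3, false, false));
	printf("all-columns rule:    %d\n", index_is_usable(clauses, 3, false, true));
	return 0;
}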
On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > > One more thing to consider is that it seems that the planner requires > > > a condition for the first column of an index before considering an > > > indexscan plan. See Tom's email [1] in this regard. I think it would > > > be better to see what kind of work is involved there if you want to > > > explore a single hash value for all columns idea. > > > > > > [1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us > > > > About this point, I will analyze further and update. > > > > I have checked the planner code, there does not seem to be any > complicated changes needed to cover if we take up a single hash value > for all columns... Below are the major part of changes needed: > > In build_index_paths(), there is a check like, "if (index_clauses == > NIL && !index->amoptionalkey)", which helps to figure out if the > leading column has any clause or not. This needs to be moved out of > the loop and check for clauses on all key columns. > With this we need to add a "amallcolumncluse" field to Index > structure, which will be set to TRUE for HASH index and FALSE in other > cases. Right we can add an AM level option and based on that we can decide whether to select the index scan if conditions are not given for all the key columns. And changes don't look that complicated. > > And to get the multi-column hash index selected, we may set > enable_hashjoin =off, to avoid any condition become join condition, > saw similar behaviors in other DBs as well... This may be related to Tom's point that, if some of the quals are removed due to optimization or converted to join quals, then now, even if the user has given qual on all the key columns the index scan will not be selected because we will be forcing that the hash index can only be selected if it has quals on all the key attributes? I don't think suggesting enable_hashjoin =off is a solution, this can happen with merge join or the nested loop join with materialized node, in all such cases join filter can not be pushed down to the inner node because the outer node will not start to scan until we materialize/sort/hash the inner node. But yeah if we test this behavior in other databases also and if it appeared that this is how the hash index is being used then maybe this behavior can be documented. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > > And to get the multi-column hash index selected, we may set > > enable_hashjoin =off, to avoid any condition become join condition, > > saw similar behaviors in other DBs as well... > > This may be related to Tom's point that, if some of the quals are > removed due to optimization or converted to join quals, then now, even > if the user has given qual on all the key columns the index scan will > not be selected because we will be forcing that the hash index can > only be selected if it has quals on all the key attributes? > > I don't think suggesting enable_hashjoin =off is a solution, > Yeah, this doesn't sound like a good idea. How about instead try to explore the idea where the hash (bucket assignment and search) will be based on the first index key and the other columns will be stored as payload? I think this might pose some difficulty in the consecutive patch to enable a unique index because it will increase the chance of traversing more buckets for uniqueness checks. If we see such problems, then I have another idea to minimize the number of buckets that we need to lock during uniqueness check which is by lock chaining as is used during hash bucket clean up where at a time we don't need to lock more than two buckets at a time. -- With Regards, Amit Kapila.
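For readers unfamiliar with the term, lock chaining (hand-over-hand locking) means acquiring the next page's lock before releasing the current one, so at most two pages in the chain are locked at any moment. A toy standalone model using pthread mutexes over an in-memory list, rather than the backend's buffer LWLocks, looks roughly like this:

#include <pthread.h>
#include <stdio.h>

/* Toy stand-in for a bucket or overflow page in the chain */
typedef struct Page
{
	pthread_mutex_t lock;
	int			key;			/* stand-in for the page contents */
	struct Page *next;
} Page;

/* Walk the chain holding at most two page locks at any moment */
static int
chain_contains(Page *head, int key)
{
	Page	   *cur = head;

	pthread_mutex_lock(&cur->lock);
	while (cur != NULL)
	{
		if (cur->key == key)
		{
			pthread_mutex_unlock(&cur->lock);
			return 1;
		}
		Page	   *next = cur->next;

		if (next != NULL)
			pthread_mutex_lock(&next->lock);	/* grab the next page ... */
		pthread_mutex_unlock(&cur->lock);		/* ... before letting go of this one */
		cur = next;
	}
	return 0;
}

int
main(void)
{
	Page		pages[3];

	for (int i = 0; i < 3; i++)
	{
		pthread_mutex_init(&pages[i].lock, NULL);
		pages[i].key = i * 10;
		pages[i].next = (i < 2) ? &pages[i + 1] : NULL;
	}
	printf("found 20: %d\n", chain_contains(&pages[0], 20));
	printf("found 99: %d\n", chain_contains(&pages[0], 99));
	return 0;
}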
On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > > > > And to get the multi-column hash index selected, we may set > > > enable_hashjoin =off, to avoid any condition become join condition, > > > saw similar behaviors in other DBs as well... > > > > This may be related to Tom's point that, if some of the quals are > > removed due to optimization or converted to join quals, then now, even > > if the user has given qual on all the key columns the index scan will > > not be selected because we will be forcing that the hash index can > > only be selected if it has quals on all the key attributes? > > > > I don't think suggesting enable_hashjoin =off is a solution, > > > > Yeah, this doesn't sound like a good idea. How about instead try to > explore the idea where the hash (bucket assignment and search) will be > based on the first index key and the other columns will be stored as > payload? I think this might pose some difficulty in the consecutive > patch to enable a unique index because it will increase the chance of > traversing more buckets for uniqueness checks. If we see such > problems, then I have another idea to minimize the number of buckets > that we need to lock during uniqueness check which is by lock chaining > as is used during hash bucket clean up where at a time we don't need > to lock more than two buckets at a time. I have presented a simple, almost trivial, patch to allow multi-col hash indexes. It hashes the first column only, which can be a downside in *some* cases. If that is clearly documented, it would not cause many issues, IMHO. However, it does not have any optimization issues or complexities, which is surely a very good thing. Trying to involve *all* columns in the hash index is a secondary optimization. It requires subtle changes in optimizer code, as Tom points out. It also needs fine tuning to make the all-column approach beneficial for the additional cases without losing against what the "first column" approach gives. I did consider both approaches and after this discussion I am still in favour of committing the very simple "first column" approach to multi-col hash indexes now. -- Simon Riggs http://www.EnterpriseDB.com/
On Fri, 13 Aug 2021 at 05:01, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 12, 2021 at 8:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Aug 12, 2021 at 12:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > The design of the patch has changed since the initial proposal. It > > > tries to perform unique inserts by holding a write lock on the bucket > > > page to avoid duplicate inserts. > > > > Do you mean that you're holding a buffer lwlock while you search the > > whole bucket? If so, that's surely not OK. > > > > I think here you are worried that after holding lwlock we might > perform reads of overflow pages which is not a good idea. I think > there are fewer chances of having overflow pages for unique indexes so > we don't expect such cases in common and as far as I can understand > this can happen in btree as well during uniqueness check. Now, I think > the other possibility could be that we do some sort of lock chaining > where we grab the lock of the next bucket before releasing the lock of > the current bucket as we do during bucket clean up but not sure about > the same. > > I haven't studied the patch in detail so it is better for Simon to > pitch in here to avoid any incorrect information or if he has a > different understanding/idea. That is correct. After analysis of their behavior, I think further simple work on hash indexes is worthwhile. With unique data, starting at 1 and monotonically ascending, hash indexes will grow very nicely from 0 to 10E7 rows without causing >1 overflow block to be allocated for any bucket. This keeps the search time for such data to just 2 blocks (bucket plus, if present, 1 overflow block). The small number of overflow blocks is because of the regular and smooth way that splits occur, which works very nicely without significant extra latency. The probability of bucket collision while we hold the lock is fairly low. This is because even with adjacent data values the hash values would be spread across multiple buckets, so we would expect the contention to be less than we would get on a monotonically increasing btree. So I don't now see any problem from holding the buffer lwlock on the bucket while we do multi-buffer operations. -- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > > > > > > And to get the multi-column hash index selected, we may set > > > > enable_hashjoin =off, to avoid any condition become join condition, > > > > saw similar behaviors in other DBs as well... > > > > > > This may be related to Tom's point that, if some of the quals are > > > removed due to optimization or converted to join quals, then now, even > > > if the user has given qual on all the key columns the index scan will > > > not be selected because we will be forcing that the hash index can > > > only be selected if it has quals on all the key attributes? > > > > > > I don't think suggesting enable_hashjoin =off is a solution, > > > > > > > Yeah, this doesn't sound like a good idea. How about instead try to > > explore the idea where the hash (bucket assignment and search) will be > > based on the first index key and the other columns will be stored as > > payload? I think this might pose some difficulty in the consecutive > > patch to enable a unique index because it will increase the chance of > > traversing more buckets for uniqueness checks. If we see such > > problems, then I have another idea to minimize the number of buckets > > that we need to lock during uniqueness check which is by lock chaining > > as is used during hash bucket clean up where at a time we don't need > > to lock more than two buckets at a time. > > I have presented a simple, almost trivial, patch to allow multi-col > hash indexes. It hashes the first column only, which can be a downside > in *some* cases. If that is clearly documented, it would not cause > many issues, IMHO. However, it does not have any optimization issues > or complexities, which is surely a very good thing. > > Trying to involve *all* columns in the hash index is a secondary > optimization. It requires subtle changes in optimizer code, as Tom > points out. It also needs fine tuning to make the all-column approach > beneficial for the additional cases without losing against what the > "first column" approach gives. > > I did consider both approaches and after this discussion I am still in > favour of committing the very simple "first column" approach to > multi-col hash indexes now. But what about the other approach suggested by Tom, basically we hash only based on the first column for identifying the bucket, but we also store the hash value for other columns? With that, we don't need changes in the optimizer and we can also avoid a lot of disk fetches because after finding the bucket we can match the secondary columns before fetching the disk tuple. I agree, one downside with this approach is we will increase the index size. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > > > On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: > > > > > > > > > > And to get the multi-column hash index selected, we may set > > > > > enable_hashjoin =off, to avoid any condition become join condition, > > > > > saw similar behaviors in other DBs as well... > > > > > > > > This may be related to Tom's point that, if some of the quals are > > > > removed due to optimization or converted to join quals, then now, even > > > > if the user has given qual on all the key columns the index scan will > > > > not be selected because we will be forcing that the hash index can > > > > only be selected if it has quals on all the key attributes? > > > > > > > > I don't think suggesting enable_hashjoin =off is a solution, > > > > > > > > > > Yeah, this doesn't sound like a good idea. How about instead try to > > > explore the idea where the hash (bucket assignment and search) will be > > > based on the first index key and the other columns will be stored as > > > payload? I think this might pose some difficulty in the consecutive > > > patch to enable a unique index because it will increase the chance of > > > traversing more buckets for uniqueness checks. If we see such > > > problems, then I have another idea to minimize the number of buckets > > > that we need to lock during uniqueness check which is by lock chaining > > > as is used during hash bucket clean up where at a time we don't need > > > to lock more than two buckets at a time. > > > > I have presented a simple, almost trivial, patch to allow multi-col > > hash indexes. It hashes the first column only, which can be a downside > > in *some* cases. If that is clearly documented, it would not cause > > many issues, IMHO. However, it does not have any optimization issues > > or complexities, which is surely a very good thing. > > > > Trying to involve *all* columns in the hash index is a secondary > > optimization. It requires subtle changes in optimizer code, as Tom > > points out. It also needs fine tuning to make the all-column approach > > beneficial for the additional cases without losing against what the > > "first column" approach gives. > > > > I did consider both approaches and after this discussion I am still in > > favour of committing the very simple "first column" approach to > > multi-col hash indexes now. > > But what about the other approach suggested by Tom, basically we hash > only based on the first column for identifying the bucket, but we also > store the hash value for other columns? With that, we don't need > changes in the optimizer and we can also avoid a lot of disk fetches > because after finding the bucket we can match the secondary columns > before fetching the disk tuple. I agree, one downside with this > approach is we will increase the index size. Identifying the bucket is the main part of a hash index's work, so that part would be identical. 
Once we have identified the bucket, we sort the bucket page by hash, so having an all-col hash would help de-duplicate multi-col hash collisions, but not enough to be worth it, IMHO, given that storing an extra 4 bytes per index tuple is a significant size increase which would cause extra overflow pages etc.. The same thought applies to 8-byte hashes. IMHO, multi-column hash collisions are a secondary issue, given that we can already select the column order for an index and hash indexes would only be used by explicit user choice. If there are some minor sub-optimal aspects of using hash indexes, then btree was already the default and a great choice for many cases. If btree didn't already exist I would care more about making hash indexes perfect. I just want to make them usable. -- Simon Riggs http://www.EnterpriseDB.com/
On 10/5/21 18:28, Simon Riggs wrote: > On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: >>> >>> On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >>>>> >>>>> On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote: >>>>>> >>>>>> And to get the multi-column hash index selected, we may set >>>>>> enable_hashjoin =off, to avoid any condition become join condition, >>>>>> saw similar behaviors in other DBs as well... >>>>> >>>>> This may be related to Tom's point that, if some of the quals are >>>>> removed due to optimization or converted to join quals, then now, even >>>>> if the user has given qual on all the key columns the index scan will >>>>> not be selected because we will be forcing that the hash index can >>>>> only be selected if it has quals on all the key attributes? >>>>> >>>>> I don't think suggesting enable_hashjoin =off is a solution, >>>>> >>>> >>>> Yeah, this doesn't sound like a good idea. How about instead try to >>>> explore the idea where the hash (bucket assignment and search) will be >>>> based on the first index key and the other columns will be stored as >>>> payload? I think this might pose some difficulty in the consecutive >>>> patch to enable a unique index because it will increase the chance of >>>> traversing more buckets for uniqueness checks. If we see such >>>> problems, then I have another idea to minimize the number of buckets >>>> that we need to lock during uniqueness check which is by lock chaining >>>> as is used during hash bucket clean up where at a time we don't need >>>> to lock more than two buckets at a time. >>> >>> I have presented a simple, almost trivial, patch to allow multi-col >>> hash indexes. It hashes the first column only, which can be a downside >>> in *some* cases. If that is clearly documented, it would not cause >>> many issues, IMHO. However, it does not have any optimization issues >>> or complexities, which is surely a very good thing. >>> >>> Trying to involve *all* columns in the hash index is a secondary >>> optimization. It requires subtle changes in optimizer code, as Tom >>> points out. It also needs fine tuning to make the all-column approach >>> beneficial for the additional cases without losing against what the >>> "first column" approach gives. >>> >>> I did consider both approaches and after this discussion I am still in >>> favour of committing the very simple "first column" approach to >>> multi-col hash indexes now. >> >> But what about the other approach suggested by Tom, basically we hash >> only based on the first column for identifying the bucket, but we also >> store the hash value for other columns? With that, we don't need >> changes in the optimizer and we can also avoid a lot of disk fetches >> because after finding the bucket we can match the secondary columns >> before fetching the disk tuple. I agree, one downside with this >> approach is we will increase the index size. > > Identifying the bucket is the main part of a hash index's work, so > that part would be identical. 
> > Once we have identified the bucket, we sort the bucket page by hash, > so having an all-col hash would help de-duplicate multi-col hash > collisions, but not enough to be worth it, IMHO, given that storing an > extra 4 bytes per index tuple is a significant size increase which > would cause extra overflow pages etc.. The same thought applies to > 8-byte hashes. > IMO it'd be nice to show some numbers to support the claims that storing the extra hashes and/or 8B hashes is not worth it ... I'm sure there are cases where it'd be a net loss, but imagine for example a case when the first column has a lot of duplicate values. Which is not all that unlikely - duplicates seem like one of the natural reasons why people want multi-column hash indexes. And those duplicates are quite expensive, due to having to access the heap. Being able to eliminate those extra accesses cheaply might be a clear win, even if it makes the index a bit larger (shorter hashes might be enough). > IMHO, multi-column hash collisions are a secondary issue, given that > we can already select the column order for an index and hash indexes > would only be used by explicit user choice. If there are some minor > sub-optimal aspects of using hash indexes, then btree was already the > default and a great choice for many cases. > But we can't select arbitrary column order, because the first column is used to select the bucket. Which means it's also about what columns are used by the queries. If the query is not using the first column, it can't use the index. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 5 Oct 2021 at 20:06, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > >>> I have presented a simple, almost trivial, patch to allow multi-col > >>> hash indexes. It hashes the first column only, which can be a downside > >>> in *some* cases. If that is clearly documented, it would not cause > >>> many issues, IMHO. However, it does not have any optimization issues > >>> or complexities, which is surely a very good thing. > >>> > >>> Trying to involve *all* columns in the hash index is a secondary > >>> optimization. It requires subtle changes in optimizer code, as Tom > >>> points out. It also needs fine tuning to make the all-column approach > >>> beneficial for the additional cases without losing against what the > >>> "first column" approach gives. > >>> > >>> I did consider both approaches and after this discussion I am still in > >>> favour of committing the very simple "first column" approach to > >>> multi-col hash indexes now. > >> > >> But what about the other approach suggested by Tom, basically we hash > >> only based on the first column for identifying the bucket, but we also > >> store the hash value for other columns? With that, we don't need > >> changes in the optimizer and we can also avoid a lot of disk fetches > >> because after finding the bucket we can match the secondary columns > >> before fetching the disk tuple. I agree, one downside with this > >> approach is we will increase the index size. > > > > Identifying the bucket is the main part of a hash index's work, so > > that part would be identical. > > > > Once we have identified the bucket, we sort the bucket page by hash, > > so having an all-col hash would help de-duplicate multi-col hash > > collisions, but not enough to be worth it, IMHO, given that storing an > > extra 4 bytes per index tuple is a significant size increase which > > would cause extra overflow pages etc.. The same thought applies to > > 8-byte hashes. > > > > IMO it'd be nice to show some numbers to support the claims that storing > the extra hashes and/or 8B hashes is not worth it ... Using an 8-byte hash is possible, but only becomes effective when 4-byte hash collisions get hard to manage. 8-byte hash also makes the index 20% bigger, so it is not a good default. Let's look at the distribution of values: In a table with 100 million rows, with consecutive monotonic values, starting at 1 No Collisions - 98.8% 1 Collision - 1.15% 2+ Collisions - 0.009% (8979 values colliding) Max=4 In a table with 1 billion rows (2^30), with consecutive monotonic values, starting at 1 No Collisions - 89.3% 1 Collision - 9.8% 2 Collisions - 0.837% 3+ Collisions - 0.0573% (615523 values colliding) Max=9 At 100 million rows, the collisions from a 4-byte hash are not painful, but by a billion rows they are starting to become a problem, and by 2 billion rows we have a noticeable issue (>18% collisions). Clearly, 8-byte hash values would be appropriate for tables larger than this. However, we expect users to know about and to use partitioning, with reasonable limits somewhere in the 100 million row (with 100 byte rows, 10GB) to 1 billion row (with 100 byte rows, 100GB) range. The change from 4 to 8 byte hashes seems simple, so I am not against it for that reason. IMHO there is no use case for 8-byte hashes since reasonable users would not have tables big enough to care. That is my reasoning, YMMV. > I'm sure there are cases where it'd be a net loss, but imagine for > example a case when the first column has a lot of duplicate values. 
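A rough way to reproduce this kind of measurement outside the server is sketched below: hash N consecutive integers with a 32-bit hash, sort the values, and count the keys that share a value. It uses a generic 64-bit mixer truncated to 32 bits rather than PostgreSQL's hashint4(), so the exact percentages will differ from the figures above; it is only meant to show how such numbers can be gathered.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* 64-bit splitmix-style mixer truncated to 32 bits, so collisions can occur */
static uint32_t
hash32(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return (uint32_t) x;
}

static int
cmp_u32(const void *a, const void *b)
{
	uint32_t	x = *(const uint32_t *) a;
	uint32_t	y = *(const uint32_t *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	const uint32_t n = 10 * 1000 * 1000;	/* number of consecutive keys; scale as desired */
	uint32_t   *h = malloc((size_t) n * sizeof(uint32_t));
	uint64_t	colliding = 0;

	if (h == NULL)
		return 1;
	for (uint32_t i = 0; i < n; i++)
		h[i] = hash32(i + 1);
	qsort(h, n, sizeof(uint32_t), cmp_u32);

	/* count keys whose hash value is shared with at least one other key */
	for (uint32_t i = 0; i < n;)
	{
		uint32_t	j = i + 1;

		while (j < n && h[j] == h[i])
			j++;
		if (j - i > 1)
			colliding += j - i;
		i = j;
	}
	printf("%llu of %u keys share a hash value (%.3f%%)\n",
		   (unsigned long long) colliding, (unsigned) n,
		   100.0 * (double) colliding / n);
	free(h);
	return 0;
}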
> Which is not all that unlikely - duplicates seem like one of the natural > reasons why people want multi-column hash indexes. And those duplicates > are quite expensive, due to having to access the heap. Being able to > eliminate those extra accesses cheaply might be a clear win, even if it > makes the index a bit larger (shorter hashes might be enough). I agree, eliminating duplicates would be a good thing, if that is possible. However, hashing on multiple columns doesn't eliminate duplicates, we can still get them from different combinations of rows. With a first-column hash then (1,1) and (1,2) collide. But with an all-column hash then (1,2) and (2,1) collide. So we can still end up with collisions and this depends on the data values and types. We can all come up with pessimal use cases. Perhaps it would be possible to specify a parameter that says how many columns in the index are part of the hash? Not sure how easy that is. If you have a situation with lots of duplicates, then use btrees instead. We shouldn't have to make hash indexes work well for *all* cases before we allow multiple columns for some cases. The user will already get to compare btree-vs-hash before they use them in a particular use case. The perfect should not be the enemy of the good. Storing multiple hashes uses more space and is more complex. It doesn't feel like a good solution to me, given the purpose of an index is not completeness but optimality. Storing 2 4-byte hashes uses 20% more space than one 4-byte hash. Storing 2 8-byte hashes uses 40% more space than one 4-byte hash. > > IMHO, multi-column hash collisions are a secondary issue, given that > > we can already select the column order for an index and hash indexes > > would only be used by explicit user choice. If there are some minor > > sub-optimal aspects of using hash indexes, then btree was already the > > default and a great choice for many cases. > > > > But we can't select arbitrary column order, because the first column is > used to select the bucket. Which means it's also about what columns are > used by the queries. If the query is not using the first column, it > can't use the index. Neither approach works in that case, so it is moot. i.e. you cannot use a first-column hash, nor an all-column hash. -- Simon Riggs http://www.EnterpriseDB.com/
On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > IMO it'd be nice to show some numbers to support the claims that storing > > the extra hashes and/or 8B hashes is not worth it ... > > Using an 8-byte hash is possible, but only becomes effective when > 4-byte hash collisions get hard to manage. 8-byte hash also makes the > index 20% bigger, so it is not a good default. Are you sure? I know that nbtree index tuples for a single-column int8 index are exactly the same size as those from a single column int4 index, due to alignment overhead at the tuple level. So my guess is that hash index tuples (which use the same basic IndexTuple representation) work in the same way. -- Peter Geoghegan
On Wed, Oct 13, 2021 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > Are you sure? I know that nbtree index tuples for a single-column int8 > index are exactly the same size as those from a single column int4 > index, due to alignment overhead at the tuple level. So my guess is > that hash index tuples (which use the same basic IndexTuple > representation) work in the same way. I'm assuming a 64-bit architecture here, by the way. That assumption shouldn't matter, since of course approximately 100% of all computers that run Postgres are 64-bit these days. -- Peter Geoghegan
On Wed, 13 Oct 2021 at 20:16, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > > IMO it'd be nice to show some numbers to support the claims that storing > > > the extra hashes and/or 8B hashes is not worth it ... > > > > Using an 8-byte hash is possible, but only becomes effective when > > 4-byte hash collisions get hard to manage. 8-byte hash also makes the > > index 20% bigger, so it is not a good default. > > Are you sure? I know that nbtree index tuples for a single-column int8 > index are exactly the same size as those from a single column int4 > index, due to alignment overhead at the tuple level. So my guess is > that hash index tuples (which use the same basic IndexTuple > representation) work in the same way. The hash index tuples are 20-bytes each. If that were rounded up to 8-byte alignment, then that would be 24 bytes. Using pageinspect, the max(live_items) on any data page (bucket or overflow) is 407 items, so they can't be 24 bytes long. Some other stats of interest: the current bucket design and page splitting are very effective at maintaining distribution. On a hash index for a table with 2 billion rows in it, with integer values from 1 to 2 billion, there are 3670016 bucket pages and 524286 overflow pages, distributed so that 87.5% of buckets have no overflow pages, and 12.5% of buckets have only one overflow page; there are no buckets with >1 overflow page. The most heavily populated overflow page has 209 items. The CREATE INDEX time is fairly poor at present. That can be optimized easily enough, but I expect to do it after uniqueness is added, since doing that work in a different order would complicate the code. -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > The hash index tuples are 20-bytes each. If that were rounded up to > 8-byte alignment, then that would be 24 bytes. > > Using pageinspect, the max(live_items) on any data page (bucket or > overflow) is 407 items, so they can't be 24 bytes long. That's the same as an nbtree page, which confirms my suspicion. The 20 bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The tuple-level alignment overhead gets you from 12 bytes to 16 bytes with a single int4 column. So the padding is there for the taking. -- Peter Geoghegan
On Thu, 14 Oct 2021 at 16:09, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > The hash index tuples are 20-bytes each. If that were rounded up to > > 8-byte alignment, then that would be 24 bytes. > > > > Using pageinspect, the max(live_items) on any data page (bucket or > > overflow) is 407 items, so they can't be 24 bytes long. > > That's the same as an nbtree page, which confirms my suspicion. The 20 > bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The > tuple-level alignment overhead gets you from 12 bytes to 16 bytes with > a single int4 column. So the padding is there for the taking. Thank you for nudging me to review the tuple length. Since hash indexes never store Nulls, and the hash is always fixed length, ISTM that we can compress the hash index entries down to ItemPointerData (6 bytes) plus any hashes. That doesn't change any arguments about size differences between approaches, but we can significantly reduce index size (by up to 50%). -- Simon Riggs http://www.EnterpriseDB.com/
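The arithmetic behind these sizes can be made visible with a small standalone model. The structs below are simplified stand-ins with the same sizes as ItemPointerData and IndexTupleData, not the backend definitions, and the "packed" entry assumes hash entries could be stored without MAXALIGN padding and ignores where per-tuple flag bits would live.

#include <stdint.h>
#include <stdio.h>

#define MY_MAXALIGN(len) (((uintptr_t) (len) + 7) & ~((uintptr_t) 7))

typedef struct
{
	uint16_t	blk_hi;			/* heap block number, high half */
	uint16_t	blk_lo;			/* heap block number, low half */
	uint16_t	posid;			/* heap line pointer number */
} TidModel;						/* 6 bytes, like ItemPointerData */

typedef struct
{
	TidModel	tid;
	uint16_t	info;			/* flag bits + tuple length */
} IndexTupleHeaderModel;		/* 8 bytes, like IndexTupleData */

int
main(void)
{
	size_t		line_pointer = 4;	/* per-entry line pointer on the page */

	/* today's entry: 8-byte header + 4-byte hash code, MAXALIGNed to 16 */
	size_t		current = MY_MAXALIGN(sizeof(IndexTupleHeaderModel) + sizeof(uint32_t));

	/* hypothetical packed entry: bare TID + 4-byte hash, no padding */
	size_t		packed = sizeof(TidModel) + sizeof(uint32_t);

	printf("current entry: %zu + %zu line pointer = %zu bytes\n",
		   current, line_pointer, current + line_pointer);
	printf("packed entry:  %zu + %zu line pointer = %zu bytes\n",
		   packed, line_pointer, packed + line_pointer);
	return 0;
}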
On Wed, Oct 13, 2021 at 4:13 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > On Tue, 5 Oct 2021 at 20:06, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > >>> I have presented a simple, almost trivial, patch to allow multi-col > > >>> hash indexes. It hashes the first column only, which can be a downside > > >>> in *some* cases. If that is clearly documented, it would not cause > > >>> many issues, IMHO. However, it does not have any optimization issues > > >>> or complexities, which is surely a very good thing. > > >>> > > >>> Trying to involve *all* columns in the hash index is a secondary > > >>> optimization. It requires subtle changes in optimizer code, as Tom > > >>> points out. It also needs fine tuning to make the all-column approach > > >>> beneficial for the additional cases without losing against what the > > >>> "first column" approach gives. > > >>> > > >>> I did consider both approaches and after this discussion I am still in > > >>> favour of committing the very simple "first column" approach to > > >>> multi-col hash indexes now. > > >> > > >> But what about the other approach suggested by Tom, basically we hash > > >> only based on the first column for identifying the bucket, but we also > > >> store the hash value for other columns? With that, we don't need > > >> changes in the optimizer and we can also avoid a lot of disk fetches > > >> because after finding the bucket we can match the secondary columns > > >> before fetching the disk tuple. I agree, one downside with this > > >> approach is we will increase the index size. > > > > > > Identifying the bucket is the main part of a hash index's work, so > > > that part would be identical. > > > > > > Once we have identified the bucket, we sort the bucket page by hash, > > > so having an all-col hash would help de-duplicate multi-col hash > > > collisions, but not enough to be worth it, IMHO, given that storing an > > > extra 4 bytes per index tuple is a significant size increase which > > > would cause extra overflow pages etc.. The same thought applies to > > > 8-byte hashes. > > > > > > > IMO it'd be nice to show some numbers to support the claims that storing > > the extra hashes and/or 8B hashes is not worth it ... > > Using an 8-byte hash is possible, but only becomes effective when > 4-byte hash collisions get hard to manage. 8-byte hash also makes the > index 20% bigger, so it is not a good default. > > Let's look at the distribution of values: > > In a table with 100 million rows, with consecutive monotonic values, > starting at 1 > No Collisions - 98.8% > 1 Collision - 1.15% > 2+ Collisions - 0.009% (8979 values colliding) > Max=4 > > In a table with 1 billion rows (2^30), with consecutive monotonic > values, starting at 1 > No Collisions - 89.3% > 1 Collision - 9.8% > 2 Collisions - 0.837% > 3+ Collisions - 0.0573% (615523 values colliding) > Max=9 > > At 100 million rows, the collisions from a 4-byte hash are not > painful, but by a billion rows they are starting to become a problem, > and by 2 billion rows we have a noticeable issue (>18% collisions). > > Clearly, 8-byte hash values would be appropriate for tables larger > than this. However, we expect users to know about and to use > partitioning, with reasonable limits somewhere in the 100 million row > (with 100 byte rows, 10GB) to 1 billion row (with 100 byte rows, > 100GB) range. > > The change from 4 to 8 byte hashes seems simple, so I am not against > it for that reason. 
> IMHO there is no use case for 8-byte hashes since > reasonable users would not have tables big enough to care. > > That is my reasoning, YMMV. > > > I'm sure there are cases where it'd be a net loss, but imagine for > > example a case when the first column has a lot of duplicate values. > > Which is not all that unlikely - duplicates seem like one of the natural > > reasons why people want multi-column hash indexes. And those duplicates > > are quite expensive, due to having to access the heap. Being able to > > eliminate those extra accesses cheaply might be a clear win, even if it > > makes the index a bit larger (shorter hashes might be enough). > > I agree, eliminating duplicates would be a good thing, if that is possible. > > However, hashing on multiple columns doesn't eliminate duplicates, we > can still get them from different combinations of rows. > > With a first-column hash then (1,1) and (1,2) collide. > But with an all-column hash then (1,2) and (2,1) collide. > So we can still end up with collisions and this depends on the data > values and types. > I don't think this will happen if we store the first column as bucket identifier and other cols as payload. > We can all come up with pessimal use cases. > > Perhaps it would be possible to specify a parameter that says how many > columns in the index are part of the hash? Not sure how easy that is. > > If you have a situation with lots of duplicates, then use btrees > instead. We shouldn't have to make hash indexes work well for *all* > cases before we allow multiple columns for some cases. The user will > already get to compare btree-vs-hash before they use them in a > particular use case. The perfect should not be the enemy of the good. > > Storing multiple hashes uses more space and is more complex. > I agree that storing trailing columns (except the first one) as payload uses more space but it will save heap fetches in many cases. Apart from search, even for unique key insertion, we need to do heap fetches as we can only verify the other values after fetching the row from the heap. Now, here I feel the question is do we want to save random heap I/O or save extra space in a hash? I think both approaches have pros and cons but probably saving heap I/O appears to be important especially for unique index checks where we need to hold bucket lock as well. -- With Regards, Amit Kapila.
On Sun, Oct 17, 2021 at 4:30 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > On Thu, 14 Oct 2021 at 16:09, Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > > > The hash index tuples are 20-bytes each. If that were rounded up to > > > 8-byte alignment, then that would be 24 bytes. > > > > > > Using pageinspect, the max(live_items) on any data page (bucket or > > > overflow) is 407 items, so they can't be 24 bytes long. > > > > That's the same as an nbtree page, which confirms my suspicion. The 20 > > bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The > > tuple-level alignment overhead gets you from 12 bytes to 16 bytes with > > a single int4 column. So the padding is there for the taking. > > Thank you for nudging me to review the tuple length. > > Since hash indexes never store Nulls, and the hash is always fixed > length, ISTM that we can compress the hash index entries down to > ItemPointerData (6 bytes) plus any hashes. > Nice observation but we use INDEX_AM_RESERVED_BIT (as INDEX_MOVED_BY_SPLIT_MASK) for hash indexes, so we need to take care of that in some way. -- With Regards, Amit Kapila.
On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > With unique data, starting at 1 and monotonically ascending, hash > indexes will grow very nicely from 0 to 10E7 rows without causing >1 > overflow block to be allocated for any bucket. This keeps the search > time for such data to just 2 blocks (bucket plus, if present, 1 > overflow block). The small number of overflow blocks is because of the > regular and smooth way that splits occur, which works very nicely > without significant extra latency. It is my impression that with non-unique data things degrade rather badly. There's no way to split the buckets that are overflowing without also splitting the buckets that are completely empty or, in any event, not full enough to need any overflow pages. I think that's rather awful. > The probability of bucket collision while we hold the lock is fairly > low. This is because even with adjacent data values the hash values > would be spread across multiple buckets, so we would expect the > contention to be less than we would get on a monotonically increasing > btree. > > So I don't now see any problem from holding the buffer lwlock on the > bucket while we do multi-buffer operations. I don't think that contention is the primary concern here. I think one major concern is interruptibility: a process must be careful not to hold lwlocks across long stretches of code, because it cannot be cancelled while it does. Even if the code is bug-free and the database has no corruption, long pauses before cancels take effect can be inconvenient. But as soon as you add in those considerations things get much worse. Imagine a hash index that is corrupted so that there is a loop in the list of overflow pages. No matter what, we're going to go into an infinite loop scanning that bucket, but if we're holding a buffer lock while we do it, there's no way to escape short of bouncing the entire server. That's pretty bad. Undetected deadlock is something to think about, too. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > With unique data, starting at 1 and monotonically ascending, hash > > indexes will grow very nicely from 0 to 10E7 rows without causing >1 > > overflow block to be allocated for any bucket. This keeps the search > > time for such data to just 2 blocks (bucket plus, if present, 1 > > overflow block). The small number of overflow blocks is because of the > > regular and smooth way that splits occur, which works very nicely > > without significant extra latency. > > It is my impression that with non-unique data things degrade rather > badly. > But we will hold the bucket lock only for unique-index in which case there shouldn't be non-unique data in the index. The non-unique case should work as it works today. I guess this is the reason Simon took an example of unique data. -- With Regards, Amit Kapila.
On Wed, 27 Oct 2021 at 12:58, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > > With unique data, starting at 1 and monotonically ascending, hash > > > indexes will grow very nicely from 0 to 10E7 rows without causing >1 > > > overflow block to be allocated for any bucket. This keeps the search > > > time for such data to just 2 blocks (bucket plus, if present, 1 > > > overflow block). The small number of overflow blocks is because of the > > > regular and smooth way that splits occur, which works very nicely > > > without significant extra latency. > > > > It is my impression that with non-unique data things degrade rather > > badly. > > > > But we will hold the bucket lock only for unique-index in which case > there shouldn't be non-unique data in the index. Even in unique indexes there might be many duplicate index entries: A frequently updated row, to which HOT cannot apply, whose row versions are waiting for vacuum (which is waiting for that one long-running transaction to commit) will have many entries in each index. Sure, it generally won't hit 10E7 duplicates, but we can hit large numbers of duplicates fast on a frequently updated row. Updating one row 1000 times between two runs of VACUUM is not at all impossible, and although I don't think it happens all the time, I do think it can happen often enough on e.g. an HTAP system to make it a noteworthy test case. Kind regards, Matthias van de Meent
On Wed, Oct 27, 2021 at 4:55 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 27 Oct 2021 at 12:58, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Oct 27, 2021 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > On Tue, Oct 5, 2021 at 6:50 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > > > With unique data, starting at 1 and monotonically ascending, hash > > > > indexes will grow very nicely from 0 to 10E7 rows without causing >1 > > > > overflow block to be allocated for any bucket. This keeps the search > > > > time for such data to just 2 blocks (bucket plus, if present, 1 > > > > overflow block). The small number of overflow blocks is because of the > > > > regular and smooth way that splits occur, which works very nicely > > > > without significant extra latency. > > > > > > It is my impression that with non-unique data things degrade rather > > > badly. > > > > > > > But we will hold the bucket lock only for unique-index in which case > > there shouldn't be non-unique data in the index. > > Even in unique indexes there might be many duplicate index entries: A > frequently updated row, to which HOT cannot apply, whose row versions > are waiting for vacuum (which is waiting for that one long-running > transaction to commit) will have many entries in each index. > > Sure, it generally won't hit 10E7 duplicates, but we can hit large > numbers of duplicates fast on a frequently updated row. Updating one > row 1000 times between two runs of VACUUM is not at all impossible, > and although I don't think it happens all the time, I do think it can > happen often enough on e.g. an HTAP system to make it a noteworthy > test case. > I think it makes sense to test such cases and see the behavior w.r.t. overflow buckets. -- With Regards, Amit Kapila.