Next Steps with Hash Indexes - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Next Steps with Hash Indexes
Date:
Msg-id: CANbhV-G7QLy3hP=6+=09OepzRPj-WgzoSZ8o2X416oxACJ1jUA@mail.gmail.com
Responses:
  Re: Next Steps with Hash Indexes
  Re: Next Steps with Hash Indexes
List: pgsql-hackers
Hi,

I've been investigating hash indexes and have what I think is a clear picture in my head, so it's time for discussion.

It would be very desirable to allow hash indexes to become primary key indexes, which requires both:

    amroutine->amcanunique = true;
    amroutine->amcanmulticol = true;

Every other hash index TODO item looks like performance tuning, so it can wait a while, even if it is tempting to do that first.

1. Multi-column indexes

This seems to have floundered because of the thread "Multicolumn Hash Indexes", https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us, but those issues don't apply in the most common cases, so they seem like acceptable restrictions, especially since some of them already apply to btrees etc. (Note also that hash indexes already assume strict hash operators, so that is not an issue.) For me, this separates into two sub-concerns:

1.1 Allow multiple columns to be defined for hash indexes

Enabling this is a simple one-line patch:

    amroutine->amcanmulticol = true;

which works just fine on current HEAD without further changes (manual testing only, as yet). If we do this first, then any work on uniqueness checking can take multiple columns into account.

1.2 Combining multi-column hashes into one hash value

Trivially, this is already how it "works", in the sense that we just use the hash of the first column, however many columns there are in the index! Doing more is an already-solved problem in Postgres (see TupleHashTableHash_internal() in src/backend/executor/execGrouping.c), as pointed out in "Combining hash values", https://www.postgresql.org/message-id/CAEepm%3D3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ%40mail.gmail.com, though noting there was no discussion on that point [1]. This just needs a little refactoring to improve things, but it seems more like a nice-to-have than an essential aspect of hash indexes, so it need not block us from enabling multi-column hash indexes.

2. Unique hash indexes
Unique hash indexes have been summarized here: https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com, which also has two parts to it.

2.1 Uniqueness check

Amit: "to ensure that there is no duplicate entry we need to traverse the whole bucket chain". Agreed. That seems straightforward, and it can also be improved later.

2.2 Locking

Amit's idea of holding ExclusiveLock on the bucket page works for me, but there was some doubt about splitting.

Splitting seems to be awful behavior that users would wish to avoid if they knew about its impact and duration. In my understanding, splitting involves moving 50% of the rows and is likely to touch all blocks eventually. If the existing index is the wrong shape, then just run REINDEX. If we tune the index build, REINDEX looks like a quicker and easier way of growing an index than trying to split an existing one, i.e. rely on ecdysis rather than internal growth. This is much more viable now because of the CIC changes in PG14. (I would argue that removing splitting completely is a good idea, similar to the way we removed the original VACUUM FULL algorithm, but that thought will take a while to swallow.)

Instead, I suggest we introduce a new indexam option for hash indexes, autosplit=on (default) | off, so that users can turn splitting off. That means we would probably also need another option, initial_buckets, where 0 (the default) means size the index from the number of existing rows, and N means use that specific size. Note that turning off splitting does not limit the size of the index; it just stops the index from refreshing its number of buckets.

B-trees are the default for PKs, so hash indexes are an option for larger tables only; hence there is less need for hash indexes to cope with tables of unknown size, since we wouldn't even be using hash unless we already knew it was a big table.
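To illustrate why the uniqueness check in 2.1 must walk the whole chain, here is a deliberately simplified model of a bucket with overflow pages. The struct and names are illustrative only, not Postgres internals: real hash index pages hold (hash value, heap TID) pairs and the check would still need a heap recheck for genuine equality.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_CAPACITY 4

/* Hypothetical, simplified bucket page: a few hash values plus a link
 * to the next overflow page in the chain. */
typedef struct BucketPage
{
    int nentries;
    uint32_t hashes[PAGE_CAPACITY];
    struct BucketPage *next;    /* overflow chain */
} BucketPage;

/* A uniqueness check cannot stop at the primary page: an entry with
 * the same hash value may live on any page of the chain, so every
 * page must be visited before we can declare the key absent. */
static bool bucket_contains_hash(const BucketPage *page, uint32_t hashvalue)
{
    for (; page != NULL; page = page->next)
        for (int i = 0; i < page->nentries; i++)
            if (page->hashes[i] == hashvalue)
                return true;    /* real code would recheck the heap tuple */
    return false;
}
```

The cost of this full-chain walk is also why keeping chains short (well-sized buckets) matters more once uniqueness is enforced.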
If splitting causes any difficulty at all, then we should simply say that unique hash indexes must initially force autosplit=off, so we don't have to worry about the correctness of the locking. I suggest we implement that first and then decide whether we really care about splitting, because I'd bet we don't. Yes, I consider uniqueness much more desirable than splitting.

I've written a patch that refactors index build so we *never* need to perform a split during index build, allowing us to more credibly skip index splitting completely. (Incidentally, it avoids the need to update the metapage for each row during the build, allowing us to consider writing to the index in batches as a next step.) So there need be no *requirement* for splitting to be supported with uniqueness, while build/REINDEX looks like it can be much faster. I can post the patch if anyone wants to see it, but I don't want to distract us from discussion of the main requirements.

I have other performance tuning ideas, but they can wait.

Anyway, that's what I think at present. Thoughts?

--
Simon Riggs
http://www.EnterpriseDB.com/
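As a sketch of how an initial_buckets=0 default might size a never-splitting index from the existing row count: pick the smallest power-of-two bucket count that keeps the expected chain length at or below some target load. The target of 10 entries per bucket below is an illustrative assumption, not a Postgres constant.

```c
#include <stdint.h>

/* Hypothetical sizing rule for a proposed initial_buckets=0 default:
 * smallest power-of-two bucket count such that the expected number of
 * entries per bucket does not exceed 'entries_per_bucket'. */
static uint64_t initial_bucket_count(uint64_t ntuples, uint64_t entries_per_bucket)
{
    uint64_t want = (ntuples + entries_per_bucket - 1) / entries_per_bucket;
    uint64_t nbuckets = 1;

    while (nbuckets < want)
        nbuckets <<= 1;         /* hash bucket counts grow by doubling */
    return nbuckets;
}
```

With splitting disabled, a mis-sized index would simply grow longer overflow chains until a REINDEX resizes it, which is the ecdysis model argued for above.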