Do we actually need an ItemId array on nbtree pages containing fixed-width tuples? - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
Date |
Msg-id | CAH2-WznALt6PNSsmB+boTL6wr2BgCWS2tyUbx8u1hCoihMitmQ@mail.gmail.com
Responses | Re: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
List | pgsql-hackers
I've been working on adding a new feature to pg_hexedit [1] to allow it to work with GIN indexes. This led to an idea about how we could do better within nbtree, which I will now describe.

I noticed that GIN's internal posting tree pages (B-Trees over TIDs) do away with the need for an ItemId array, despite more or less working in the same way as classic B-Tree pages (posting tree *leaf* pages are a bit weird OTOH, but that's not important). This is possible because internal posting tree pages store fixed-width PostingItem structs instead of IndexTuples. A posting tree only needs to work with TID key values -- not values of some arbitrary data type that could be variable-width. This raises the question: since we really just use ItemIds to deal with variable-width tuples, can we do away with the need for an ItemId array at the start of nbtree pages in some important, common cases?

Applicability
=============

When an IndexTuple has only a single attribute, and we know that every tuple will always be 16 bytes (on 64-bit systems), ISTM that it should be possible to not have any ItemId array. Simple pointer arithmetic on the page, based on a tuple width that is known at compile time, can be used instead -- no need to go off of an lp_off value. See GinDataPageGetPostingItem() for an example of this (a sketch of what the same idea could look like for nbtree appears after the Advantages section below). All ItemId metadata becomes redundant, and we add IndexTuples from the start of the page, just after the page header, and continue to add them until the special area at the end of the page is reached (again, see the GIN posting tree code).

Note that ItemId is already quasi-redundant within indexes. In general, nbtree indexes only actually need the LP_DEAD bit, as well as lp_off. (lp_len is simply redundant, even for variable-width index tuples [2].)

Advantages
==========

Before I go into more depth on implementation issues, let me first list off the ways in which I'd expect this to improve performance:

* We currently fit 367 IndexTuples onto a leaf page of a single-int4-or-int8-attribute nbtree index (with the default leaf fillfactor and BLCKSZ) after many SERIAL/BIGSERIAL style sequential inserts (workloads where all leaf page splits are 10:90 splits of the rightmost page). If we removed the ItemId array, we'd be able to fit 458 IndexTuples on the left half post-split (while still leaving the same amount of free space on the page as before). IOW, we'd make indexes that avail of the optimization about 20% smaller, thereby reducing random I/O and so on.

The benefits for internal pages would also be significant, because we win in at least 3 places:

* The first win comes from the fact that we can store more IndexTuples (downlinks) now, because the optimization works just as well for internal pages as it does for leaf pages. This much is obvious.

* The second win comes from the fact that there are now about 20% fewer downlinks/internal tuples needed in the first place, because their number directly corresponds to the number of pages one level down. The advantages *compound*; our savings in the total number of internal pages are close to 40% (it's less than a full 40% because of the 30:70 splits we do on internal pages).

* The third win comes from the fact that we will make more efficient use of CPU cache within internal pages, by packing IndexTuples together. Binary searches with many cache misses are something that has been identified as a bottleneck for some workloads [3].

I haven't bothered to apply the classic "2/3 of a page" utilization rule for my free space accounting in this example, just to keep things simple. (That rule applies to workloads with random insertions.)
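To spell out the arithmetic behind the 367 vs. 458 figures -- a back-of-the-envelope version that assumes BLCKSZ of 8192, the 24 byte page header, the 16 byte nbtree special area, and 4 byte ItemIds:

    usable space per leaf page:          8192 - 24 (page header) - 16 (special area) = 8152 bytes
    space used at roughly 90% fill:      367 * (16 byte tuple + 4 byte ItemId) = 7340 bytes
    same space with no ItemId array:     7340 / 16 = 458 tuples

The per-tuple cost drops from 20 bytes to 16 bytes, which is where the "about 20% smaller" figure comes from.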
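To make the Applicability section a bit more concrete, here is a minimal sketch of how fixed-width tuple addressing could work without an ItemId array, modeled on GinDataPageGetPostingItem(). To be clear, BTFixedTupleSize, BTFixedPageGetTuple(), and BTFixedPageGetMaxOffsetNumber() are invented names (nothing like them exists today), and using pd_lower to track the number of tuples is just one possible design:

#include "postgres.h"

#include "access/itup.h"
#include "storage/bufpage.h"

/*
 * Each tuple is assumed to be exactly 16 bytes: an 8 byte IndexTupleData
 * header followed by a single MAXALIGN'd 4 or 8 byte attribute.
 */
#define BTFixedTupleSize	16

/*
 * Tuples are laid out contiguously, starting immediately after the page
 * header and growing towards the special area, so an offset number maps
 * to a page location with pointer arithmetic alone -- no lp_off needed.
 */
#define BTFixedPageGetTuple(page, offnum) \
	((IndexTuple) (PageGetContents(page) + \
				   BTFixedTupleSize * ((offnum) - 1)))

/*
 * Number of tuples on the page, assuming pd_lower is still advanced as
 * tuples are added (as it is on ordinary pages today).
 */
#define BTFixedPageGetMaxOffsetNumber(page) \
	((((PageHeader) (page))->pd_lower - MAXALIGN(SizeOfPageHeaderData)) / \
	 BTFixedTupleSize)

With something along these lines, the binary search in _bt_binsrch() walks a contiguous array of tuples directly, which is also where the CPU cache benefit for internal pages described above would come from.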
The general approach here should be to keep things simple by changing the physical layout of B-Tree pages (where the optimization applies) without changing their logical structure. So, very few changes ought to be required within amcheck to make it work with this, and they should all be fairly mechanical. Similarly, nothing would change about the interface of the nbtree pageinspect functions. We never make any IndexTuple bigger, so we can mix the old and new formats when pg_upgrade is used without needing to be very defensive all over the place.

Issues
======

I'm not planning to work on this myself, so I encourage other contributors to think about it. These tricky issues will need to be worked through:

* In the case of GIN, routines like GinDataPageAddPostingItem() are used instead of the generic PageAddItem() routine. This optimization requires similar new code that performs page modifications, and that must care about page space management (code that must memmove() ranges of tuples to make space for a new tuple, and so on). The implementation should probably generalize its approach, so that this optimization could in principle be used by other index AMs, such as the hash AM. The PageAddItem() code (and similar bufpage.c code) that is now used by nbtree would only be used on pages that still need an ItemId array. The new fixed-width-aware code can be maintained alongside its variable-width equivalent, within bufpage.c.

* A lot of the code in nbtinsert.c will need to be adjusted, most notably the page split point picking logic.

* The LP_DEAD/kill_prior_tuple optimization is very valuable, and definitely needs to be totally preserved. I think that we can do this by using the free bit within IndexTupleData (the 13th IndexTuple.t_info bit) instead of LP_DEAD. We'd need to develop alternatives to the ItemId-specific macros that manage the optimization to make it work (a rough sketch of what those might look like appears at the end of this message). This seems doable.

* The representation of NULL values will need some work. It will need to take account of the fact that there cannot be a NULL bitmap when the optimization is used, since IndexTuples must always be the same width. We could represent an optimized IndexTuple as having a t_info-wise length of 0xFFF, a magic value. This would allow us to do away with a NULL bitmap (INDEX_NULL_MASK comes to mean "the only attribute is NULL"), and to readily mix old and new formats in most nbtree code. We can and should represent that pages are in the new format by using a new special area flag bit, too (a BTPageOpaqueData.btpo_flags bit).

* For similar reasons, we'd need to pad out the "minus infinity" item (the first "real" item on nbtree internal pages) to always be 16 bytes (it's 8 bytes now). That's easy, though.

Thoughts?

[1] https://github.com/petergeoghegan/pg_hexedit
[2] https://postgr.es/m/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3] https://www.postgresql.org/message-id/CANP8%2BjLf6ozqc2ie30%2B--h6oxFUL48jmjQXPS5dpiTFtvtYPYQ%40mail.gmail.com
--
Peter Geoghegan
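As a rough illustration of the LP_DEAD point under "Issues", alternatives to the ItemIdMarkDead()/ItemIdIsDead() macros could look something like the following. BTFIXED_DEAD_MASK and both macros are invented names for the sake of the example; the only real ingredient is the currently-unused 0x2000 bit of IndexTupleData.t_info (bits 0-12 store the tuple length, while 0x4000 and 0x8000 are INDEX_VAR_MASK and INDEX_NULL_MASK):

#include "postgres.h"

#include "access/itup.h"

/* hypothetical: use the free 13th t_info bit in place of ItemId's LP_DEAD */
#define BTFIXED_DEAD_MASK	0x2000

#define BTFixedTupleIsDead(itup) \
	((((IndexTuple) (itup))->t_info & BTFIXED_DEAD_MASK) != 0)

#define BTFixedTupleMarkDead(itup) \
	(((IndexTuple) (itup))->t_info |= BTFIXED_DEAD_MASK)

Like the existing LP_DEAD flag, this bit would only ever be set as a hint, so losing it after a crash would be harmless.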