Thread: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Peter Geoghegan
Date:
I've been working on adding a new feature to pg_hexedit [1] to allow it to work with GIN indexes. This led to an idea about how we could do better within nbtree, which I will now describe.

I noticed that GIN's internal posting tree pages (B-Trees over TIDs) do away with the need for an ItemId array, despite more or less working in the same way as classic B-Tree pages (posting tree *leaf* pages are a bit weird OTOH, but that's not important). This is possible because internal posting list pages store fixed-width PostingItem structs instead of IndexTuples. A posting tree only needs to work with TID key values -- not values of some arbitrary data type, which could be a variable-width type. This raises the question: since we really just use ItemIds to deal with variable-width tuples, can we do away with the need for an ItemId array at the start of nbtree pages in some important, common cases?

Applicability
=============

When an index has a single attribute, and we know that its IndexTuples will always be 16 bytes (on 64-bit systems), ISTM that it should be possible to not have any ItemId array. Simple pointer arithmetic on the page, based on a tuple width that is known at compile time, can be used instead -- no need to go off of an lp_off value. See GinDataPageGetPostingItem() for an example of this; a rough sketch of an nbtree equivalent appears after the list of advantages below.

All ItemId metadata becomes redundant, and we add IndexTuples from the start of the page, just after the page header, and continue to add them until the special area at the end of the page is reached (again, see the GIN posting tree code). Note that ItemId is already quasi-redundant within indexes. In general, nbtree indexes only actually need the LP_DEAD bit, as well as lp_off. (lp_len is simply redundant, even for variable-width index tuples [2].)

Advantages
==========

Before I go into more depth on implementation issues, let me first list the ways in which I'd expect this to improve performance:

* We currently fit 367 IndexTuples on to a leaf page of a single-int4-or-int8-attribute nbtree index (with the default leaf fillfactor and BLCKSZ) after many SERIAL/BIGSERIAL style sequential inserts (workloads where all leaf page splits are 10:90 splits of the rightmost page). If we removed the ItemId array, we'd be able to fit 458 IndexTuples on the left half post-split (while still leaving the same amount of free space on the page as before). IOW, we'd make indexes that avail of the optimization about 20% smaller, thereby reducing random I/O and so on.

The benefits for internal pages would also be significant, because we win in at least 3 places:

* The first win comes from the fact that we can store more IndexTuples (downlinks) now, because the optimization works just as well for internal pages as it does for leaf pages. This much is obvious.

* The second win comes from the fact that there are now about 20% fewer downlinks/internal tuples needed in the first place, because that directly corresponds to the number of pages one level down. The advantages *compound*; our savings in the total number of internal pages are close to 40% (it's less than a full 40% because of the 30:70 splits we do on internal pages).

* The third win comes from the fact that we will make more efficient use of CPU cache within internal pages, by packing IndexTuples together. Binary searches with many cache misses are something that has been identified as a bottleneck for some workloads [3].
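To give a feel for the kind of pointer arithmetic I have in mind, here is a rough sketch in the style of GinDataPageGetPostingItem(). None of these names exist today, and the choice to track the end of the packed tuple area in pd_lower is just one possible convention:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/* hypothetical: MAXALIGN'd IndexTuple with a single int4/int8 attribute */
#define BTFixedTupleWidth		16

/*
 * Tuples are packed contiguously, starting right after the page header,
 * so the Nth tuple is found with arithmetic alone -- no lp_off lookup.
 */
#define BTFixedPageGetTuple(page, n) \
	((IndexTuple) (PageGetContents(page) + ((n) - 1) * BTFixedTupleWidth))

/*
 * Assuming pd_lower is maintained as the end of the packed tuple area,
 * the number of tuples on the page falls out of it directly.
 */
#define BTFixedPageGetMaxOffsetNumber(page) \
	((((PageHeader) (page))->pd_lower - MAXALIGN(SizeOfPageHeaderData)) / \
	 BTFixedTupleWidth)

Compared to today's PageGetItemId()/PageGetItem() hops, a binary search over a page laid out like this touches one contiguous array, which is where the CPU cache win in the third bullet above comes from.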
I haven't bothered to apply the classic "2/3 of a page" utilization rule for my free space accounting in this example, just to keep things simple. (That rule applies to workloads with random insertions.)

The general approach here should be to keep things simple by changing the physical layout of B-Tree pages (where the optimization applies) without changing their logical structure. So, very few changes ought to be required within amcheck to make it work with this, and they should all be fairly mechanical. Similarly, nothing would change about the interface of the nbtree pageinspect functions. We never make any IndexTuple bigger, so we can mix the old and new formats when pg_upgrade is used without needing to be very defensive all over the place.

Issues
======

I'm not planning to work on this myself, so I encourage other contributors to think about it. These tricky issues will need to be worked through:

* In the case of GIN, routines like GinDataPageAddPostingItem() are used instead of the generic PageAddItem() routine. This optimization requires similar new code that performs page modifications that must care about page space management (code that must memmove() ranges of tuples to make space for new tuples, and so on). The implementation should probably generalize its approach, so that this optimization could in principle be used by other index AMs, such as the hash AM. The PageAddItem() code (and similar bufpage.c code) that is now used by nbtree would only be used in pages that still need an ItemId array. The new fixed-width-aware code can be maintained alongside its variable-width equivalent, within bufpage.c.

* A lot of the code in nbtinsert.c will need to be adjusted, most notably the page split point picking logic.

* The LP_DEAD/kill_prior_tuple optimization is very valuable, and definitely needs to be totally preserved. I think that we can do this by using the free bit within IndexTupleData (the 13th IndexTuple.t_info bit) instead of LP_DEAD. We'd need to develop alternatives to the ItemId-specific macros that manage the optimization to make it work. This seems doable.

* The representation of NULL values will need some work. It will need to take account of the fact that there cannot be a NULL bitmap when the optimization is used, since IndexTuples must always be the same width. We could represent an optimized IndexTuple as having a t_info-wise length of 0xFFF, a magic value. This would allow us to do away with a NULL bitmap (INDEX_NULL_MASK becomes "only attribute is NULL"), and to readily mix old and new formats in most nbtree code. We can and should represent that pages are in the new format by using a new special area flag bit, too (a BTPageOpaqueData.btpo_flags bit). A rough sketch of these bits appears below.

* For similar reasons, we'd need to pad out the "minus infinity" item (the first "real" item on nbtree internal pages) to always be 16 bytes (it's 8 bytes now). That's easy, though.

Thoughts?

[1] https://github.com/petergeoghegan/pg_hexedit
[2] https://postgr.es/m/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3] https://www.postgresql.org/message-id/CANP8%2BjLf6ozqc2ie30%2B--h6oxFUL48jmjQXPS5dpiTFtvtYPYQ%40mail.gmail.com

--
Peter Geoghegan
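To make the t_info and btpo_flags ideas above a bit more concrete, here is a hypothetical sketch of how the bits could be laid out. Only INDEX_SIZE_MASK, IndexTupleSize() and BTPageOpaqueData exist today; every other name and value here is made up for illustration:

#include "postgres.h"
#include "access/itup.h"
#include "access/nbtree.h"

/*
 * Hypothetical: the currently-unused 13th t_info bit takes over the role
 * that LP_DEAD plays today, so kill_prior_tuple survives the loss of the
 * ItemId array.
 */
#define BT_FIXED_DEAD_MASK		0x2000

#define BTFixedTupleIsDead(itup)	(((itup)->t_info & BT_FIXED_DEAD_MASK) != 0)
#define BTFixedTupleMarkDead(itup)	((itup)->t_info |= BT_FIXED_DEAD_MASK)

/*
 * Hypothetical: a magic t_info length meaning "the single attribute is
 * NULL", standing in for the NULL bitmap that fixed-width tuples cannot
 * carry.  0xFFF can never be the length of a 16-byte fixed-width tuple,
 * so it is free to act as a marker on pages in the new format.
 */
#define BT_FIXED_NULL_LENGTH	0xFFF

#define BTFixedTupleIsNull(itup)	(IndexTupleSize(itup) == BT_FIXED_NULL_LENGTH)

/*
 * Hypothetical: a new special-area flag marking pages that use the
 * fixed-width layout, so old- and new-format pages can coexist after
 * pg_upgrade.
 */
#define BTP_FIXED_WIDTH			(1 << 8)

#define BTPageIsFixedWidth(opaque)	(((opaque)->btpo_flags & BTP_FIXED_WIDTH) != 0)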
Re: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Andrey Borodin
Date:
Hi, Peter!

> On 4 Dec 2017, at 4:55, Peter Geoghegan <pg@bowt.ie> wrote:
> Thoughts?

I like the idea of a more compact B-tree. Chances are that I didn't understand all your ideas.

But ItemIds let you insert a tuple between two existing tuples without data movement: the new tuple is placed wherever free space starts, and you just shift bytes in the ItemId array. And you always have to insert a tuple at a specific position, since the B-tree relies on tuple order.

Best regards, Andrey Borodin.
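For reference, an ordered insert with an ItemId array today looks roughly like this (a simplified sketch, loosely modelled on PageAddItem() in bufpage.c; sanity checks, free space checks and unused-slot recycling are omitted, and the function name is made up):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

static void
page_add_item_simplified(Page page, Item item, Size size, OffsetNumber offnum)
{
	PageHeader	phdr = (PageHeader) page;
	OffsetNumber limit = OffsetNumberNext(PageGetMaxOffsetNumber(page));
	ItemId		itemId = PageGetItemId(page, offnum);
	Size		alignedSize = MAXALIGN(size);
	int			upper;

	/* only the 4-byte line pointers at or after offnum are shifted */
	memmove(itemId + 1, itemId, (limit - offnum) * sizeof(ItemIdData));

	/*
	 * The tuple body itself goes at the top of the free space (pd_upper
	 * grows downward); its physical position never has to match key order.
	 */
	upper = (int) phdr->pd_upper - (int) alignedSize;
	memcpy((char *) page + upper, item, size);
	ItemIdSetNormal(itemId, upper, size);

	phdr->pd_lower += sizeof(ItemIdData);
	phdr->pd_upper = (LocationIndex) upper;
}

The point Andrey raises is the memmove() line: today it only ever moves 4-byte line pointers, never the tuples themselves.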
Re: Do we actually need an ItemId array on nbtree pages containing fixed-width tuples?
From
Peter Geoghegan
Date:
Hi Andrey,

On Sun, Dec 3, 2017 at 10:11 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I like the idea of a more compact B-tree. Chances are that I didn't understand all your ideas.
>
> But ItemIds let you insert a tuple between two existing tuples without data movement: the new tuple is placed wherever free space starts, and you just shift bytes in the ItemId array. And you always have to insert a tuple at a specific position, since the B-tree relies on tuple order.

It's certainly true that the need to memmove() more data is a new cost to be paid under my proposal: a simple random insertion must move almost 50% of the entire page on average, rather than almost 10% on average, as is the case today (see the sketch below). I think that this would probably be acceptable.

We really only pay this extra cost during random writes into the index (sequential writes will still just append within a page). Random inserts are already much more expensive than sequential int4/int8 index inserts, because of the extra FPIs emitted for full_page_writes, the fact that more pages are dirtied, etc.

It's not that I think that random insertions are rare. I think that they're a lot rarer with simple SERIAL/BIGSERIAL primary keys on fact tables, which are what this optimization is really about. The approach might actually be faster for a workload that consists only of random insertions into a table with a SERIAL/BIGSERIAL primary key -- the "worst case". I'm less confident about that, but it seems very possible.

--
Peter Geoghegan
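For comparison with the PageAddItem()-style sketch upthread, here is roughly what an ordered insert into a packed, fixed-width tuple area might look like. As before, BTFixedTupleWidth, the function name, and the convention of tracking the end of the tuple area in pd_lower are assumptions for illustration only:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

#define BTFixedTupleWidth	16		/* hypothetical, as in the earlier sketch */

/*
 * Hypothetical sketch: insert a fixed-width tuple at a given ordered
 * position on a page that has no ItemId array.
 */
static void
bt_fixed_page_add_tuple(Page page, IndexTuple itup, OffsetNumber offnum)
{
	PageHeader	phdr = (PageHeader) page;
	char	   *insertPos = PageGetContents(page) +
							(offnum - 1) * BTFixedTupleWidth;
	Size		bytesAfter = phdr->pd_lower - (insertPos - (char *) page);

	/* free space is assumed to run from pd_lower up to the special area */
	Assert(phdr->pd_special - phdr->pd_lower >= BTFixedTupleWidth);

	/*
	 * Unlike the ItemId-array case, every tuple at or after the target
	 * offset must be shifted: for a random insertion that is about half of
	 * the tuple area (roughly 4KB with BLCKSZ = 8192), versus, on average,
	 * well under 1KB of line pointers today.
	 */
	memmove(insertPos + BTFixedTupleWidth, insertPos, bytesAfter);
	memcpy(insertPos, itup, BTFixedTupleWidth);

	phdr->pd_lower += BTFixedTupleWidth;
}

Hypothetically, something like this would be used in place of PageAddItem() whenever a check along the lines of the BTPageIsFixedWidth() sketch earlier says the page has the new layout.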