Thread: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
"Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'm starting this new thread to discuss the benefits of B-Tree suffix
> truncation, and to demonstrate how effective it can be at increasing
> fan-out and space utilization with a real-world example. I haven't
> explained why suffix truncation matters very well before now. Now that
> I have a patch that works, I can just show the benefit directly. (By
> the way, there are similar benefits to covering INCLUDE indexes, where
> suffix truncation occurs all the time, and not just opportunistically.)
>
> Explanation
> -----------
>
> Pivot tuples describe how the key space will be split up, which will
> *hopefully* be balanced for future insertions. So, apart from the
> obvious issue about the size of pivot tuples, there is a more
> important issue: why unnecessarily commit to certain exact boundaries
> early on, before it's clear how well that will balance values that get
> inserted later?

Actually, this particular example does not really show why general suffix truncation is important. My example did manage to make an index 15% smaller in a practical, workable way, but traditional suffix truncation deserves far less credit for that than I initially claimed. My explanation was about 99% wrong, but my example is still valid.

The true explanation for why my patch (the pending v3 of my unique-key-heap-TID patch) avoided so much bloat is very interesting, because it is related to a broader problem. I'll show a simple example where the bloat that my patch can avoid is far greater still, involving a simple single-attribute secondary index. Before I get to that example, I'll give the real explanation.
The real reason for the marked decrease in the level of bloating is that my patch makes insertions into secondary indexes (non-unique indexes) jump immediately to the leaf page that the tuple unambiguously belongs on -- since it now has a unique key, there must be one exact page that the value is supposed to go on. We avoid trying to find a place among pages full of logical duplicates, and so we avoid the risk of "getting tired" [1] and giving up on finding free space that is actually available to us. "Getting tired" tends to produce really inefficient space utilization among leaf pages full of duplicates, at least past a certain tipping point. The whole "getting tired" thing is the root of the problem here, which is why the pending v3 of my patch will remove that code completely (_bt_findinsertloc() is streamlined). The "tipping point" nature of getting tired is of particular concern to me, since it sounds like something that could have been a factor in Uber's much-publicized blog post. :-(

I attach the test case I mentioned, which I show the output from below, with my commentary in parentheses (note that there are "\echo" psql statements whose output you'll also see):

$ psql -f testcase.sql
autovacuum should probably be disabled:
 autovacuum
────────────
 off
(1 row)

psql:testcase.sql:3: NOTICE: table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE
(Created empty table named "sec_idx_bloat", with a single int4 attribute.)
CREATE INDEX
(Created empty index named "bloated" on that attribute.)
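To make the "getting tired" mechanics concrete for readers who haven't looked at _bt_findinsertloc(): when an insertion lands among duplicates, it moves right through full pages looking for free space, but gives up with a small probability at each step (about 1% per page, per the nbtinsert.c code linked at [1]) and splits wherever it stopped. A rough Python simulation of the post-VACUUM scenario from the test case -- the page capacity, occupancy numbers, and split model are illustrative assumptions, not the real algorithm:

```python
import random

PAGE_CAP = 100        # tuples per leaf page (illustrative)
GIVE_UP = 0.01        # ~1% chance per full page of "getting tired"

rng = random.Random(1)
# 1,000 leaf pages full of one duplicate value; VACUUM removed 90% of
# the entries, leaving each page with just 10 live tuples.
pages = [10] * 1000

splits = 0
for _ in range(90_000):            # reinsert as many tuples as were removed
    i = 0
    while pages[i] >= PAGE_CAP:    # move right through full duplicate pages
        if i == len(pages) - 1 or rng.random() <= GIVE_UP:
            # Get tired: split the full page into two half-full pages,
            # even though free space exists further to the right.
            pages[i] = PAGE_CAP // 2
            pages.insert(i + 1, PAGE_CAP // 2)
            splits += 1
            break
        i += 1
    pages[i] += 1

# Pages beyond the "tiredness horizon" keep their reusable free space
# forever, while splits bloat the left end of the duplicate run.
untouched = sum(1 for p in pages if p == 10)
print(len(pages), splits, untouched)
```

The index grows well past its original 1,000 pages even though the workload never exceeds the original tuple count: free space on the far right is never reached, while splits pile up on the left. That is the "tipping point" behavior in miniature.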
Initial insert of 1e7 sequential values:
INSERT 0 10000000
"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB

Initial insert of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB

Repeated insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

(That is, VACUUM hasn't made either case have a smaller index yet, though it probably gave back many more whole index pages to the FSM in the case of the patch.)

Third insertion of the value 42 1e7 times:
INSERT 0 10000000

(This is where it gets really interesting, because things truly fall apart for master, whereas the patch case manages to reuse existing free space in all cases.)

"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB

Fourth insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB

(The patch can still reuse almost all of the space left behind by VACUUM, though since the index is now a bit bigger than the original high watermark size of 645 MB it's not perfectly efficient. Master flounders even further behind, though.)

To summarize: recognizing that bloat in indexes is correlated with bloat in tables allows us to recycle space in both structures cooperatively, which can be very important. While this example focuses on bloat, there are a lot of other things to be unhappy about, around buffer lock contention and around the number of pages that will be dirtied. Random choices about which page to dirty lead to increased random I/O when checkpointing.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtinsert.c;h=907cce072412adf88d41ce9317a795fb25820be2;hb=refs/heads/REL_11_STABLE#l694

-- Peter Geoghegan
Attachment
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Robert Haas
Date:
On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> The whole "getting tired" thing is the root of the problem here, which
> is why the pending v3 of my patch will remove that code completely
> (_bt_findinsertloc() is streamlined).

Peter,

This seems like really interesting and important work. I wouldn't have foreseen that the "getting tired" code would have led to this kind of bloat (even if I had known about it at all).

I wonder, though, whether it's possible that the reverse could happen in some other scenario. It seems to me that with the existing code, if you reinsert a value many copies of which have been deleted, you'll probably find partially-empty pages whose free space can be reused, but if there's one specific place where each tuple needs to go, you might end up having to split pages if the new TIDs are all larger or smaller than the old TIDs.

I'm really glad that you are working on this.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> The whole "getting tired" thing is the root of the problem here, which
>> is why the pending v3 of my patch will remove that code completely
>> (_bt_findinsertloc() is streamlined).

> This seems like really interesting and important work. I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all). I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario. It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

Yeah ... if memory serves, there were specific usage patterns where that hack made things way better than they'd been before. (I do not recall if the hack itself was mine, but I think I can be blamed for the "getting tired" comment ...) I'd suggest git blaming your way to the commit that put that in, and then checking the hackers archives around that date for more info.

regards, tom lane
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Jul 17, 2018 at 1:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This seems like really interesting and important work. I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).

Thanks! I'm glad that I can come up with concrete, motivating examples like this, because it's really hard to talk about the big picture here. With something like a pgbench workload, there are undoubtedly many different factors in play, since temporal locality influences many different things all at once. I don't think that I understand it all just yet. Taking a holistic view of the problem seems very helpful, but it's also very frustrating at times.

> I wonder, though, whether it's possible that the reverse could happen
> in some other scenario. It seems to me that with the existing code,
> if you reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

That's a legitimate concern. After all, what I've done boils down to adding a restriction on space utilization that wasn't there before. This clearly helps because it makes it practical to rip out the "getting tired" thing, but that's not everything. There are good reasons for that hack, but if latency magically didn't matter then we could just tear the hack out without doing anything else. That would make groveling through pages full of duplicates at least as discerning about space utilization as my patch manages to be, without any of the complexity.

There is actually a flipside to that downside, though (i.e. the downside is also an upside): While not filling up leaf pages that have free space on them is bad, it's only bad when it doesn't leave the pages completely empty.
Leaving the pages completely empty is actually good, because then VACUUM is in a position to delete entire pages, removing their downlinks from parent pages. That's a variety of bloat that we can reverse completely. I suspect that you'll see far more of that favorable case in the real world with my patch. It's pretty much impossible to do page deletions with pages full of duplicates today, because the roughly-uniform distribution of still-live tuples among leaf pages fails to exhibit any temporal locality. So, maybe my patch would still come out ahead of simply ripping out "getting tired" in this parallel universe where latency doesn't matter, and space utilization is everything.

I made one small mistake with my test case: It actually *is* perfectly efficient at recycling space even at the end, since I don't delete all the duplicates (just 90% of them). Getting tired might have been a contributing factor there, too.

-- Peter Geoghegan
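The whole-page-emptiness point can be illustrated with a toy model that contrasts heap-TID-clustered placement of duplicates against the roughly-uniform placement that "getting tired" produces (the page size and tuple counts here are made-up numbers, chosen to mirror the delete-90%-of-dups test case):

```python
import random

PAGE_CAP = 100                 # index tuples per leaf page (illustrative)
tids = list(range(10_000))     # heap TIDs of 10,000 duplicates
dead = set(range(9_000))       # the oldest 90% were deleted and vacuumed

def fully_empty_pages(placement):
    """Count leaf pages on which every index tuple is dead."""
    pages = [placement[i:i + PAGE_CAP]
             for i in range(0, len(placement), PAGE_CAP)]
    return sum(all(t in dead for t in page) for page in pages)

# Heap-TID ordering clusters the dead tuples together, so most pages
# become completely empty -- VACUUM can delete them outright.
clustered = fully_empty_pages(tids)

# "Getting tired" scatters duplicates roughly uniformly across the
# leaf pages, so hardly any page ends up completely empty.
shuffled = tids[:]
random.Random(0).shuffle(shuffled)
uniform = fully_empty_pages(shuffled)

print(clustered, uniform)
```

With clustering, 90 of the 100 pages are wholly dead and thus deletable; with uniform spread, a page is only deletable if all 100 of its random entries happen to be dead, which essentially never happens.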
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Jul 17, 2018 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah ... if memory serves, there were specific usage patterns where
> that hack made things way better than they'd been before. (I do not
> recall if the hack itself was mine, but I think I can be blamed for
> the "getting tired" comment ...) I'd suggest git blaming your way
> to the commit that put that in, and then checking the hackers archives
> around that date for more info.

I've done plenty of research into the history of this hack. It was your work, but it does actually make sense in the context of today's nbtree code. It is essential with scankey-wise duplicates, since groveling through hundreds or even thousands of pages full of duplicates to find free space (and avoid a page split) is going to have a very serious downside for latency.

Vadim wanted to fix the problem by more or less doing what I propose [1], though he never got into figuring out how to make that practical (i.e. how to make it not bloat up internal pages, which must represent heap TID as just another part of the key space). Having unique keys isn't just assumed by Lehman & Yao; I think that it's assumed by most or all real-world B-Tree implementations.

[1] https://www.postgresql.org/message-id/18788.963953289@sss.pgh.pa.us

-- Peter Geoghegan
Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I've done plenty of research into the history of this hack. It was
> your work, but it does actually make sense in the context of today's
> nbtree code. It is essential with scankey-wise duplicates, since
> groveling through hundreds or even thousands of pages full of
> duplicates to find free space (and avoid a page split) is going to
> have a very serious downside for latency.

Well, the actual problem was O(N^2) behavior:

https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6

I certainly have no objection to improving matters, but let's be sure we don't re-introduce any two-decade-old problems.

regards, tom lane
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Jul 17, 2018 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6

Oh, yeah. We still put the new tuple to the left of all existing duplicates on the leaf that we decide to insert on, because "we'll insert it before any existing equal keys because of the way _bt_binsrch() works". I actually plan on mostly fixing that aspect of _bt_binsrch() while I'm at it, which is why I didn't think to frame it that way.

It is claimed that we need to do this _bt_binsrch() go-left-on-equal thing for internal pages because we allow duplicates, contrary to L&Y. I find it much easier to think of it as being necessary due to index scans where only some columns are specified in the initial insertion scan key that finds the initial leaf page (i.e. the _bt_first() stuff). I'm not going to touch the _bt_binsrch() go-left-on-equal thing, because it's actually necessary for any real-world implementation that has multiple columns in tuples. Fortunately, we can usually avoid the go-left-on-equal thing for internal pages by avoiding equality -- a sentinel heap scan TID value may be used for this in many cases.

The Lehman and Yao paper is badly written. They should have talked about the _bt_binsrch() go-left-on-equal thing, rather than leaving it to the reader to figure it out. It's not why L&Y require unique keys, though. Sorry if all this detail isn't useful right now. The important point is that I actually think that this code gets most things right already.

> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

A mountain of work remains to validate the performance of the patch.

-- Peter Geoghegan
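For readers unfamiliar with the "go-left-on-equal" behavior being discussed: it is the same distinction as between leftmost and rightmost binary search. A minimal sketch using Python's bisect as a stand-in for _bt_binsrch() (the keys and TIDs are illustrative, not PostgreSQL's representation):

```python
from bisect import bisect_left

leaf_keys = [10, 20, 20, 20, 30]

# Go-left-on-equal: a search for 20 lands before all existing equal
# keys, so every new duplicate starts from the same leftmost position
# and must then grovel rightward for free space.
assert bisect_left(leaf_keys, 20) == 1

# With heap TID as a tie-breaker column, exact equality between index
# tuples cannot happen: each (key, tid) pair has exactly one legal
# position, found directly by the same binary search.
entries = [(20, (1, 5)), (20, (4, 2)), (20, (9, 7))]
assert bisect_left(entries, (20, (4, 2))) == 1
assert bisect_left(entries, (20, (6, 1))) == 2
```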
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Robert Haas
Date:
On Tue, Jul 17, 2018 at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There is actually a flipside to that downside, though (i.e. the
> downside is also an upside): While not filling up leaf pages that have
> free space on them is bad, it's only bad when it doesn't leave the
> pages completely empty. Leaving the pages completely empty is actually
> good, because then VACUUM is in a position to delete entire pages,
> removing their downlinks from parent pages. That's a variety of bloat
> that we can reverse completely. I suspect that you'll see far more of
> that favorable case in the real world with my patch. It's pretty much
> impossible to do page deletions with pages full of duplicates today,
> because the roughly-uniform distribution of still-live tuples among
> leaf pages fails to exhibit any temporal locality. So, maybe my patch
> would still come out ahead of simply ripping out "getting tired" in
> this parallel universe where latency doesn't matter, and space
> utilization is everything.

Yes, that's a good point. Also, and I think pretty importantly, this seems essential if we want to allow retail index tuple deletion, which has its own set of advantages.

I don't think you're going to be able to rip out the getting-tired code, though, because presumably we'll have to continue to support existing btree indexes that don't include TIDs for some probably-not-small number of future releases, even if we decide that all future btree indexes (and any that are reindexed) should have TIDs.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Jul 17, 2018 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, that's a good point. Also, and I think pretty importantly, this
> seems essential if we want to allow retail index tuple deletion, which
> has its own set of advantages.

Retail index tuple deletion is clearly something we should have in some form, and is the top reason to pursue the project. I could probably come up with dozens of other reasons if I had to, though. Here are a few of the less obvious ones:

* This improves usage_count of heap page buffers automatically, by recognizing correlated references when there are many duplicates. [1]

* How else is VACUUM/deletion supposed to work with indirect indexes? I think that it's impossible without either retail deletion/retail updates, unless an AEL is okay.

* Global indexes (indexes that simultaneously cover multiple partitions) are just a generalization of this idea, with a partition number attribute before the heap TID attribute.

* The sort order of duplicates actually matters quite a lot to the executor, and to the planner, which this lets us reason about in many cases. [2][3]

* There is nothing to stop us from exploiting this within tidbitmap.c/bitmap index scans. We can build disjunct lists of sorted TIDs, one per value. Merging lists like this together is typically very cheap, even when there are relatively many distinct lists. We could probably do that on-the-fly.

* When building a bitmap for a low cardinality value from a secondary index, why bother even visiting the leaf pages? We can make a lossy bitmap from a subset of the internal pages.

* Techniques like bitmap semi-join, which are important for star-schema queries, are dependent on the merging and intersection of fact table bitmaps being cheap (basically, you build several bitmaps through nestloop joins with some dimension table on the outer side, and some fact table index on the inner side -- that's relatively easy to parallelize well). Tom's "Improve OR conditions on joined columns" patch sorts a list of TIDs [4], and I suspect that there is some high-level connection there.

* Heap prefetching for nbtree index scans is a lot more effective than it might otherwise be.

I really could go on, but you get the idea. The merits of each of these ideas are debatable, but the fact that there are so many ideas that are at least plausible-sounding seems significant to me.

> I don't think you're going to be able to rip out the getting-tired
> code, though, because presumably we'll have to continue support
> existing btree indexes that don't include TIDs for some
> probably-not-small number of future releases, even if we decide that
> all future btree indexes (and any that are reindexed) should have
> TIDs.

That's definitely true. There is no way that it's okay to ignore this issue with pg_upgrade.

[1] https://postgr.es/m/CAH2-WzkrKKW88Yq4-BLN5N3DrFv93P8r+ti-KfbTeR08P8ckBQ@mail.gmail.com
[2] https://postgr.es/m/20170707234119.GN17566@telsasoft.com
[3] https://postgr.es/m/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com
[4] https://postgr.es/m/22002.1487099934%40sss.pgh.pa.us

-- Peter Geoghegan
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Simon Riggs
Date:
On 17 July 2018 at 23:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> I've done plenty of research into the history of this hack. It was
>> your work, but it does actually make sense in the context of today's
>> nbtree code. It is essential with scankey-wise duplicates, since
>> groveling through hundreds or even thousands of pages full of
>> duplicates to find free space (and avoid a page split) is going to
>> have a very serious downside for latency.
>
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6
>
> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

Looking at this in terms of CPU cost for a single insert: insertion at the end of the list of duplicates was O(N), which was much improved by the O(k) behavior. Keeping the index entries in order is O(log N). So there is a higher cost to keeping the entries in order, though that becomes a win because of the reduced bloat in cases where we have high numbers of deletes. That is a win in insertion time as well, since by packing the index more tightly we cause fewer costly splits and less I/O.

If we knew that we were never going to do deletes/non-HOT updates from the table we could continue to use the existing mechanism, otherwise we will be better off to use sorted index entries. However, it does appear as if keeping entries in sorted order would be a win on concurrency from reduced block contention on the first few blocks of the index key, so it may also be a win in cases where there are heavy concurrent inserts but no deletes.
Bulk loading will improve because adding a new data block via COPY will cause all of the resulting index entries to be added with more adjacency, so a block full of duplicate entries could be sorted and then merged efficiently into the index.

I hope we can see a patch that just adds the sorting-by-TID property so we can evaluate that aspect before we try to add other more advanced index ideas.

-- Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
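The O(log N) cost Simon describes can be sketched with Python's bisect module standing in for the binary search an ordered index would do (the (key, tid) tuples are illustrative; this is not nbtree's representation):

```python
from bisect import insort, bisect_left

# With heap TID as a tie-breaker, each duplicate has exactly one legal
# slot, found by an O(log N) binary search instead of an O(N) (or
# randomized O(k)) walk along a run of equal keys.
entries = []
for tid in [105, 12, 99, 47]:
    insort(entries, (42, tid))      # key 42; the TID breaks the tie

assert entries == [(42, 12), (42, 47), (42, 99), (42, 105)]

# Bulk loading: a batch of new duplicates from one heap block can be
# sorted once, then merged into the index with good adjacency.
batch = sorted((42, tid) for tid in [60, 3])
assert bisect_left(entries, batch[0]) == 0   # (42, 3) merges in at the front
```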
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Jul 17, 2018 at 10:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If we knew that we were never going to do deletes/non-HOT updates from
> the table we could continue to use the existing mechanism, otherwise
> we will be better off to use sorted index entries. However, it does
> appear as if keeping entries in sorted order would be a win on
> concurrency from reduced block contention on the first few blocks of
> the index key, so it may also be a win in cases where there are heavy
> concurrent inserts but no deletes.

I think so too. I also expect a big reduction in the number of FPIs in the event of many duplicates.

> I hope we can see a patch that just adds the sorting-by-TID property
> so we can evaluate that aspect before we try to add other more
> advanced index ideas.

I can certainly see why that's desirable. Unfortunately, it isn't that simple. If I want to sort on heap TID as a tie-breaker, I cannot cut any corners. That is, it has to be just another column, at least as far as the implementation is concerned (heap TID will have a different representation in internal pages and leaf high keys, but nonetheless it's essentially just another column in the keyspace). This means that if I don't have suffix truncation, I'll regress performance in many important cases that have no compensating benefit (e.g. pgbench). There is no point in trying to assess that.

It is true that I could opt to only "logically truncate" the heap TID attribute during a leaf page split (i.e. there'd only be "logical truncation", which is to say there'd only be the avoidance of adding a heap TID to the new high key, and never true physical truncation of user attributes). But doing only that much saves very little code, since the logic for assessing whether or not it's safe to avoid adding a new heap attribute (whether or not we logically truncate) still has to involve an insertion scankey. It seems much more natural to do everything at once.
Again, the heap TID attribute is more or less just another attribute. Also, the code for doing physical suffix truncation already exists from the covering/INCLUDE index commit.

I'm currently improving the logic for picking a page split point in light of suffix truncation, which I've been working on for weeks now. I had something that did quite well with the initial index sizes for TPC-C and TPC-H, but then realized I'd totally regressed the motivating example with many duplicates that I started this thread with. I now have something that does both things well, which I'm trying to simplify.

Another thing to bear in mind is that page split logic for suffix truncation also helps space utilization on the leaf level. I can get the main TPC-C order_line pkey about 7% smaller with true suffix truncation, even though the internal page index tuples can never be any smaller due to alignment, and even though there are no duplicates that would otherwise make the implementation "get tired". Can I really fix space utilization in a piecemeal fashion? I strongly doubt it.

-- Peter Geoghegan
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncationmatters)
From
Peter Eisentraut
Date:
Do you plan to submit this patch to the upcoming commit fest perhaps? I have done some testing on it and it seems worth pursuing further. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Tue, Aug 28, 2018 at 11:32 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Do you plan to submit this patch to the upcoming commit fest perhaps? I
> have done some testing on it and it seems worth pursuing further.

I should make sure that this makes it into the September 'fest. Thanks for reminding me. I've been distracted by other responsibilities, but I think that this project is well worthwhile. I intend to spend lots of time on it, as I think that it has real strategic importance.

I would be most grateful if you signed up to review the patch. I've been testing it with quite a variety of real-world data, which seems like the way to go until the code really matures. As I indicated to Simon on August 2nd, it seems like I should further refine my current working draft of the next version to have less magic. I have a cumbersome test suite that proves that I have something that improves fan-out on TPC-C and TPC-H indexes by quite a bit (e.g. the main TPC-C order_line pkey is about 7% smaller after initial preload, despite not even having smaller pivot tuples due to alignment). It also proves that the benefits of not "getting tired" in the event of many duplicates are preserved. We need to have both at once. The code to pick a split point during a page split is rather tricky. That's what's holding the next version up.

I can post what I have right now on the other thread, to give you a better idea of the direction of the patch. You'll have to hold your nose when you look at the code that picks a split point, though. Let me know if you think that that makes sense. I wouldn't want you to spend too much time on old-ish code.

-- Peter Geoghegan
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Simon Riggs
Date:
On 2 August 2018 at 21:32, Peter Geoghegan <pg@bowt.ie> wrote:
> If I want to sort on heap TID as a tie-breaker, I cannot cut any
> corners. That is, it has to be just another column, at least as far as
> the implementation is concerned (heap TID will have a different
> representation in internal pages and leaf high keys, but nonetheless
> it's essentially just another column in the keyspace). This means that
> if I don't have suffix truncation, I'll regress performance in many
> important cases that have no compensating benefit (e.g. pgbench).
> There is no point in trying to assess that.
If you include heap TID as a column the suffix will be unique and cannot benefit from suffix truncation.
It would be better to avoid including the heap TID as a column, since it is already there. Or the other way around, don't include it as payload if it has to be a column.
Alternatively, include only the heap block number. That would make it non-unique, but never more than 240 duplicates. So it would allow suffix truncation, and yet still avoid the multi-page split effect.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
From
Peter Geoghegan
Date:
On Wed, Aug 29, 2018 at 11:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If you include heap TID as a column the suffix will be unique and
> cannot benefit from suffix truncation.

Right. During a page split, we must generate a new high key that's greater than or equal to all items on the left side (where the new high key goes), and strictly less than all items on the right side (which the new high key will be used as a downlink to). We'll only include heap TID explicitly in the new high key when there is no other way to get a value that is strictly less than all tuples on the right side, and only at the leaf level (we don't truncate during internal page splits, and the rules automatically work at non-leaf levels because values come from the leaf level). The original L&Y invariant that I propose to restore requires that downlinks be strictly less than all items on the child page they point to. Truncated attributes logically retain the value minus infinity, which is often what makes the true L&Y invariant hold.

> It would be better to avoid including the heap TID as a column, since
> it is already there. Or the other way around, don't include it as
> payload if it has to be a column.

The high level principle to bear in mind is that pivot tuples (high keys at the leaf level, and all internal page tuples) only need to correctly guide searches. There is no obligation for pivot tuples to be able to retrieve the original values, for something like an index-only scan. We could even make their representation into a flat binary string someday -- this is called "key normalization" [1]. Likewise, pivot tuples are good candidates for automatic prefix compression because there is scarcely any need to decompress them (not working on that for v12, though). Nothing changes about the direct representation of 99%+ of tuples (those at the leaf level that point to the heap -- non-pivot tuples).
We need an alternative representation for heap TID in pivot tuples for historical reasons (actually, I think what we have here would even work best in a green field situation). Not including a special heap TID representation in pivots, but including everything else, is what I've called "logical truncation" on this thread. If you compare a serial primary key case with my patch and without, you typically see no difference in the size of the index. It is possible to end up with a larger index overall, but that should be extremely rare, and only affect very small indexes.

> Alternatively, include only the heap block number. That would make it
> non-unique, but never more than 240 duplicates. So it would allow
> suffix truncation, and yet still avoid the multi-page split effect.

I'm not sure what you mean by the multi-page split effect -- all internal pages are split recursively, during a leaf page split. In general, free space management needs to remain as simple as possible, and splitting the same page twice in one go is a big no-no. I conservatively enforce the 1/3-of-a-page restriction with a bit of extra slop for an extra heap TID.

There is one choke point where everything takes place -- during leaf page splits. Nothing else changes, apart from index scan logic, which may have to search with an extra heap TID column, plus one or two other small things along those lines.

[1] https://wiki.postgresql.org/wiki/Key_normalization

-- Peter Geoghegan
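The split-time rule described above (keep attributes up to and including the first one that distinguishes the two halves; everything after that is truncated and logically reads as minus infinity) can be sketched in a few lines. This is a simplification under the assumption that heap TID is simply the last column of every tuple; the function name and tuples are illustrative, not nbtree's API:

```python
def truncate_pivot(lastleft, firstright):
    """Choose a pivot (new high key) for a leaf split: keep attributes up
    to and including the first one that tells the halves apart. Truncated
    attributes logically read as minus infinity, so the pivot compares
    >= every left-half tuple and strictly < every right-half tuple."""
    n = len(firstright)
    for i in range(n - 1):                 # user attributes only
        if lastleft[i] != firstright[i]:
            return firstright[:i + 1]      # truncate the rest, incl. TID
    # Only the heap TID (last column here) distinguishes the tuples:
    # keep the left-hand tuple's TID so the pivot stays strictly below
    # everything on the right.
    return lastleft

# The TID (and the second attribute's successors) can be truncated away:
assert truncate_pivot(('usa', 'ny', 17), ('usa', 'tx', 93)) == ('usa', 'tx')
# All user attributes equal: the TID must be retained as tie-breaker.
assert truncate_pivot(('usa', 'tx', 17), ('usa', 'tx', 93)) == ('usa', 'tx', 17)
```

This is why a serial primary key sees no size change with the patch: adjacent tuples always differ on the first (user) attribute, so the TID never needs to appear in a pivot.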