Re: PG 12 draft release notes - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: PG 12 draft release notes |
Date | |
Msg-id | 20190613002203.eubqihvbguajr3yy@momjian.us Whole thread Raw |
In response to | Re: PG 12 draft release notes (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: PG 12 draft release notes
|
List | pgsql-hackers |
On Wed, Jun 12, 2019 at 03:06:34PM -0700, Peter Geoghegan wrote: > On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian <bruce@momjian.us> wrote: > > First, my apologies in getting to this so late. Peter Geoghegan > > supplied me with slides and a video to study, and I now understand how > > complex the btree improvements are. Here is a video of Peter's > > presentation at PGCon: > > > > https://www.youtube.com/watch?v=p5RaATILoiE > > Thank you for going to the trouble of researching the B-Tree stuff in > detail! I wouldn't ask that of anybody in the position of writing > release notes, so it's appreciated. It is awkward to take the work > that I've done and make it into multiple bullet points; I have a hard > time with that myself. I had become so confused by this item that I needed a few weeks to settle on what was actually going on. > > The over-arching improvement to btree in PG 12 is the ordering of index > > entries by tid so all entries are unique. > > Right. Everything good happens as a direct or indirect result of the > TID-as-column thing. That is really the kernel of the whole thing, > because it means that the implementation now *fully* follows the > Lehman and Yao design. Right. > > 1. Since all index tuples are ordered, you can move from one leaf page > > to the next without keeping a lock on the internal page that references > > it, increasing concurrency. > > I'm not sure what you mean here. We've never had to keep a lock on an > internal page while holding a lock on a leaf page -- such "lock > coupling" was necessary in earlier B-Tree designs, but Lehman and > Yao's algorithm avoids that. Of course, that isn't new. > > I think that you're talking about the way that we now check the high > key during index scans, and find that we don't have to move to the > next leaf page, per this commit: > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c I was wrong. I was thinking of this commit: commit d2086b08b0 Author: Alexander Korotkov <akorotkov@postgresql.org> Date: Sat Jul 28 00:31:40 2018 +0300 Reduce path length for locking leaf B-tree pages during insertion In our B-tree implementation appropriate leaf page for new tuple insertion is acquired using _bt_search() function. This function always returns leaf page locked in shared mode. In order to obtain exclusive lock, caller have to relock the page. This commit makes _bt_search() function lock leaf page immediately in exclusive mode when needed. That removes unnecessary relock and, in turn reduces lock contention for B-tree leaf pages. Our experiments on multi-core systems showed acceleration up to 4.5 times in corner case. but got it confused by an optimization you mentioned in the video where you were talking about the need to perhaps recheck the internal page when moving right. We certainly don't keep the internal page locked. > All of the suffix truncation stuff is good because it makes separator > keys in internal pages smaller, but it's also good because the > separator keys are simultaneously more "discriminating". We tend to > get a nice "clean break" between leaf pages, so checking the high key > before moving right to find additional matches (once we've already > returned some tuples to the executor) is surprisingly effective. It > would have been possible to add this optimization even without the > suffix truncation stuff, but truncation makes it effective. > > If you had to cut one thing from this list, then I would suggest that > it be this item. It's nice, but it's also very obvious, which makes it > hard to explain. I mean, why should there be any ambiguity at all? > Unless we have to return *hundreds* of items to the index scan, then a > simple "select * from foo where bar = ?" style query should only have > to visit one leaf page, even when the constant happens to be on the > boundary of a leaf page (maybe a concurrent page split could make this > happen, but that should be rare). Right. The commit mentioned a 4.5x speedup in a rare benchmark, so I added it lower on the list. > > 2. When inserting a duplicate value in the index, we used to try a few > > pages to see if there was space, then "get tired" and just split a page > > containing duplicates. Now that there are no duplicates (because > > duplicate key fields are sorted by tid) the system knows exactly what > > page the index tuple belongs on, and inserts or splits the page as > > necessary. > > Right -- inserts will descend straight to the correct leaf page, and > the "getting tired" dance isn't used anymore. This makes insertions > faster, but more importantly is a better high level strategy for > storing lots of duplicates. We'll dirty far fewer pages, because > insertions automatically end up inserting around the same place as we > inserted to a moment ago. Insertions of duplicates behave like > serial/auto-incrementing insertions, which was already > fast/well-optimized in various ways. Yes, locality. > It's easy to measure this by looking at index bloat when inserting > lots of duplicates -- I came up with the 16% figure in the talk based > on a simple insert-only test. > > > 3. Pivot tuples are used on internal pages and as min/max indicators on > > leaf pages. These pivot tuples are now trimmed if their trailing key > > fields are not significant. For example, if an index is > > field1/field2/field3, and the page contains values for which field1==5 > > and none that field1==6, there is no need to include field2 and field3 > > in the pivot tuple --- it can just list the pivot as field1==5, > > field2=infinity. This is called suffix truncation. > > Right -- that's exactly how it works. Users may find that indexes with > lots of extra columns at the end won't get so bloated in Postgres 12. > Indexing many columns is typically seen when index-only scans are > important. Of course, you could have made those indexes INCLUDE > indexes on v11, which is actually a closely related idea, but that > makes it impossible to use the trailing/suffix columns in an ORDER BY. > And, you have to know about INCLUDE indexes, and remember to use them. > > (This must be why Oracle can get away with not having INCLUDE indexes.) > > > Page splits used to just split the page in half, which minimizes the > > number of page splits, but sometimes causes lots of wasted space. The > > new code tries to split to reduce the length of pivot tuples, which ends > > up being more efficient in space savings because the pivot tuples are > > shorter, and the leaf pages end up being more tightly packed. This is > > particularly true for ever-increasing keys because we often end up > > creating a new empty page, rather than splitting an existing page. > > Right. We need to be somewhat cleverer about precisely where we split > leaf pages in order to make suffix truncation work well. But, it turns > out that there is another case where being *very* clever about leaf > page split points matters a lot, which is targeted by the "split after > new tuple" optimization. The "split after new tuple" optimization was > just a bonus for me. > > > 4. Vacuum's removal of index tuples in indexes with many duplicates is > > faster since it can more quickly find desired tuples. > > This may be true, but the interesting way that the TID-as-column thing > helps VACUUM is related to locality (spatial and temporal locality). > VACUUM can delete/remove whole pages at a time, though only when > they're completely empty (just one remaining tuple blocks page > deletion. because VACUUM cannot merge non-empty pages together). This > is much more likely to occur when we group like with like. Heap TID is > often correlated with primary key value, or with a timestamp, so we > can easily imagine VACUUM deleting more pages filled with duplicates > because they're grouped together in a way that's logical (instead of > more-or-less random, which is what "getting tired" left us with). > > Even without page deletion occurring, we can reuse ranges in the index > when there are lots of duplicates. We delete some rows in a table, and > VACUUM ultimately removes the rows from both table and indexes -- > including some interesting index that stores lots of duplicates, like > an index on a enum field. VACUUM makes it possible to recycle the heap > TIDs/space *everywhere*, not just in the table, as before. In the > fairly likely event of a future insert that recycles a heap TID also > having the same value for the index with duplicates (say its an enum > value), the dup index tuple can go in exactly the same place as the > earlier, now-deleted dup tuple. The "recycling" works at the index > level, too. In practice this kind of locality is rather common. It's > especially likely with non-HOT updates where the duplicate value isn't > changed, though simple inserts and deletes can see the same benefit. Yes, I can see index space reuse as more closely matching the heap. > Obviously you'll need to boil all that down -- good luck! Attached is an updated patch. I might have missed something, but I think it might be close. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Attachment
pgsql-hackers by date: