Re: PG 12 draft release notes - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: PG 12 draft release notes
Msg-id: CAH2-Wz=Fj5q_wSb=SGXBfdsgFkDWAToa9m+9XBtaY+1N3-vjig@mail.gmail.com
In response to: Re: PG 12 draft release notes (Bruce Momjian <bruce@momjian.us>)
Responses: Re: PG 12 draft release notes
List: pgsql-hackers
On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian <bruce@momjian.us> wrote:
> First, my apologies in getting to this so late. Peter Geoghegan
> supplied me with slides and a video to study, and I now understand how
> complex the btree improvements are. Here is a video of Peter's
> presentation at PGCon:
>
> https://www.youtube.com/watch?v=p5RaATILoiE

Thank you for going to the trouble of researching the B-Tree stuff in
detail! I wouldn't ask that of anybody in the position of writing
release notes, so it's appreciated. It is awkward to take the work that
I've done and make it into multiple bullet points; I have a hard time
with that myself.

> The over-arching improvement to btree in PG 12 is the ordering of
> index entries by tid so all entries are unique.

Right. Everything good happens as a direct or indirect result of the
TID-as-column thing. That is really the kernel of the whole thing,
because it means that the implementation now *fully* follows the Lehman
and Yao design.

> 1. Since all index tuples are ordered, you can move from one leaf
> page to the next without keeping a lock on the internal page that
> references it, increasing concurrency.

I'm not sure what you mean here. We've never had to keep a lock on an
internal page while holding a lock on a leaf page -- that kind of "lock
coupling" was necessary in earlier B-Tree designs, but Lehman and Yao's
algorithm avoids it, and that isn't new. I think that you're talking
about the way that we now check the high key during index scans, and
often find that we don't have to move to the next leaf page at all, per
this commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c

All of the suffix truncation stuff is good because it makes separator
keys in internal pages smaller, but it's also good because the
separator keys are simultaneously more "discriminating". We tend to get
a nice "clean break" between leaf pages, so checking the high key
before moving right to find additional matches (once we've already
returned some tuples to the executor) is surprisingly effective. It
would have been possible to add this optimization even without the
suffix truncation stuff, but truncation is what makes it effective.

If you had to cut one thing from this list, then I would suggest that
it be this item. It's nice, but it's also very obvious, which makes it
hard to explain. I mean, why should there be any ambiguity at all?
Unless we have to return *hundreds* of items to the index scan, a
simple "select * from foo where bar = ?" style query should only have
to visit one leaf page, even when the constant happens to fall on the
boundary of a leaf page (a concurrent page split could make this
happen, but that should be rare).
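It's easy enough to see for yourself. Here is a sketch of the kind of
test I have in mind (the table, index, and numbers are all made up for
illustration):

create table foo (bar int4, filler text);
create index foo_bar_idx on foo (bar);
insert into foo select i, md5(i::text)
from generate_series(1, 1000000) i;
analyze foo;

-- The "Buffers:" line should show a single root-to-leaf descent (a
-- handful of index pages, plus a heap page or two), even when the
-- value searched for happens to fall on a leaf page boundary:
explain (analyze, buffers) select * from foo where bar = 500000;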
> 2. When inserting a duplicate value in the index, we used to try a
> few pages to see if there was space, then "get tired" and just split
> a page containing duplicates. Now that there are no duplicates
> (because duplicate key fields are sorted by tid) the system knows
> exactly what page the index tuple belongs on, and inserts or splits
> the page as necessary.

Right -- inserts descend straight to the correct leaf page, and the
"getting tired" dance isn't used anymore. This makes insertions faster,
but, more importantly, it is a better high-level strategy for storing
lots of duplicates. We dirty far fewer pages, because new duplicates
automatically end up in the same place as the duplicates we inserted a
moment ago. Insertions of duplicates behave like
serial/auto-incrementing insertions, a case that was already fast and
well-optimized in various ways. It's easy to measure this by looking at
index bloat when inserting lots of duplicates -- I came up with the 16%
figure in the talk based on a simple insert-only test.
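A minimal version of that kind of test might look like this (the table
and the number of duplicates are made up here; this isn't the exact
script from the talk):

create table dups (val int4);
create index dups_val_idx on dups (val);
-- 100 distinct values, 10,000 copies of each
insert into dups select i % 100 from generate_series(1, 1000000) i;

-- run the same script on v11 and v12 and compare; the v12 index
-- should come out noticeably smaller
select pg_size_pretty(pg_relation_size('dups_val_idx'));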
> 3. Pivot tuples are used on internal pages and as min/max indicators
> on leaf pages. These pivot tuples are now trimmed if their trailing
> key fields are not significant. For example, if an index is
> field1/field2/field3, and the page contains values for which
> field1==5 and none that field1==6, there is no need to include field2
> and field3 in the pivot tuple --- it can just list the pivot as
> field1==5, field2=infinity. This is called suffix truncation.

Right -- that's exactly how it works. Users may find that indexes with
lots of extra columns at the end won't get so bloated in Postgres 12.
Indexing many columns is typically seen when index-only scans are
important. Of course, you could have made those indexes INCLUDE indexes
on v11, which is actually a closely related idea, but that makes it
impossible to use the trailing/suffix columns in an ORDER BY. And you
have to know about INCLUDE indexes, and remember to use them. (This
must be why Oracle can get away with not having INCLUDE indexes.)
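To make the trade-off concrete (the names here are made up, and either
index covers the same index-only scans):

create table t (c1 int4, c2 int4, c3 int4);
create index t_multi_idx on t (c1, c2, c3);
create index t_include_idx on t (c1) include (c2, c3);

-- either index can give you an index-only scan here:
select c2, c3 from t where c1 = 42;

-- but only the multicolumn index can return rows in this order
-- without a sort, because INCLUDE columns are not part of the key:
select * from t where c1 = 42 order by c1, c2, c3;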
> Page splits used to just split the page in half, which minimizes the
> number of page splits, but sometimes causes lots of wasted space. The
> new code tries to split to reduce the length of pivot tuples, which
> ends up being more efficient in space savings because the pivot
> tuples are shorter, and the leaf pages end up being more tightly
> packed. This is particularly true for ever-increasing keys because we
> often end up creating a new empty page, rather than splitting an
> existing page.

Right. We need to be somewhat cleverer about precisely where we split
leaf pages in order to make suffix truncation work well. But it turns
out that there is another case where being *very* clever about leaf
page split points matters a lot, which is targeted by the "split after
new tuple" optimization. That optimization was just a bonus for me.

> 4. Vacuum's removal of index tuples in indexes with many duplicates
> is faster since it can more quickly find desired tuples.

This may be true, but the interesting way that the TID-as-column thing
helps VACUUM is related to locality (both spatial and temporal
locality). VACUUM can delete whole pages at a time, though only when
they're completely empty (even one remaining tuple blocks page
deletion, because VACUUM cannot merge non-empty pages together). That
is much more likely to happen when we group like with like. Heap TID is
often correlated with primary key value, or with a timestamp, so we can
easily imagine VACUUM deleting more pages filled with duplicates
because they're grouped together in a way that's logical, instead of
more-or-less random, which is what "getting tired" left us with.

Even without page deletion occurring, we can reuse ranges in the index
when there are lots of duplicates. We delete some rows in a table, and
VACUUM ultimately removes the rows from both the table and its indexes
-- including some interesting index that stores lots of duplicates,
like an index on an enum field. VACUUM makes it possible to recycle the
heap TIDs/space *everywhere*, not just in the table, as before. In the
fairly likely event that a future insert which recycles a heap TID also
has the same value for the index with duplicates (say it's an enum
value), the duplicate index tuple can go in exactly the same place as
the earlier, now-deleted duplicate tuple. The "recycling" works at the
index level, too. In practice this kind of locality is rather common.
It's especially likely with non-HOT updates where the duplicate value
isn't changed, though simple inserts and deletes can see the same
benefit.

Obviously you'll need to boil all that down -- good luck!

Thanks
--
Peter Geoghegan