Re: PG 12 draft release notes - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: PG 12 draft release notes
Msg-id CAH2-Wz=Fj5q_wSb=SGXBfdsgFkDWAToa9m+9XBtaY+1N3-vjig@mail.gmail.com
In response to Re: PG 12 draft release notes  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Wed, Jun 12, 2019 at 1:16 PM Bruce Momjian <bruce@momjian.us> wrote:
> First, my apologies in getting to this so late.  Peter Geoghegan
> supplied me with slides and a video to study, and I now understand how
> complex the btree improvements are.  Here is a video of Peter's
> presentation at PGCon:
>
>         https://www.youtube.com/watch?v=p5RaATILoiE

Thank you for going to the trouble of researching the B-Tree stuff in
detail! I wouldn't ask that of anybody in the position of writing
release notes, so it's appreciated. It is awkward to take the work
that I've done and make it into multiple bullet points; I have a hard
time with that myself.

> The over-arching improvement to btree in PG 12 is the ordering of index
> entries by tid so all entries are unique.

Right. Everything good happens as a direct or indirect result of the
TID-as-column thing. That is really the kernel of the whole thing,
because it means that the implementation now *fully* follows the
Lehman and Yao design.
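To make the TID-as-column idea concrete, here is a toy Python sketch
(not PostgreSQL's actual C code): treating the heap TID as a trailing
tiebreaker column gives every index tuple a fully unique key, so the
whole keyspace is totally ordered, as Lehman and Yao assume.

```python
# Sketch: heap TID as an implicit last key column. With it appended,
# "duplicates" are no longer equal, so every tuple has exactly one
# correct position in the index.

from bisect import insort

def index_key(key_fields, heap_tid):
    # heap_tid is a (block, offset) pair; appending it makes
    # otherwise-equal entries distinct and totally ordered
    return (*key_fields, heap_tid)

entries = []
insort(entries, index_key(("apple",), (7, 2)))
insort(entries, index_key(("apple",), (3, 1)))   # duplicate key, earlier TID
insort(entries, index_key(("banana",), (1, 4)))

# duplicates now sort by heap TID, so their physical order is deterministic
assert entries == [("apple", (3, 1)), ("apple", (7, 2)), ("banana", (1, 4))]
```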

> 1.  Since all index tuples are ordered, you can move from one leaf page
> to the next without keeping a lock on the internal page that references
> it, increasing concurrency.

I'm not sure what you mean here. We've never had to keep a lock on an
internal page while holding a lock on a leaf page -- such "lock
coupling" was necessary in earlier B-Tree designs, but Lehman and
Yao's algorithm avoids that. Of course, that isn't new.

I think that you're talking about the way that we now check the high
key during index scans, and find that we don't have to move to the
next leaf page, per this commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=29b64d1de7c77ffb5cb10696693e6ed8a6fc481c

All of the suffix truncation stuff is good because it makes separator
keys in internal pages smaller, but it's also good because the
separator keys are simultaneously more "discriminating". We tend to
get a nice "clean break" between leaf pages, so checking the high key
before moving right to find additional matches (once we've already
returned some tuples to the executor) is surprisingly effective. It
would have been possible to add this optimization even without the
suffix truncation stuff, but truncation makes it effective.

If you had to cut one thing from this list, then I would suggest that
it be this item. It's nice, but it's also very obvious, which makes it
hard to explain. I mean, why should there be any ambiguity at all?
Unless we have to return *hundreds* of items to the index scan, a
simple "select * from foo where bar = ?" style query should only have
to visit one leaf page, even when the constant happens to be on the
boundary of a leaf page (maybe a concurrent page split could make this
happen, but that should be rare).
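The core of the commit above can be sketched like this (a simplified
model, ignoring the minus-infinity semantics of truncated attributes):
once we've returned matches from a leaf page, we compare the scan key
against the page's high key, and only move right if matches could
possibly continue there.

```python
# Sketch of the "check the high key before moving right" idea: the
# high key is the leaf page's upper bound, so if the scan key sorts
# strictly below it, no qualifying tuples can be on the next page.

def must_move_right(scan_key, high_key):
    # only when scan_key >= high_key could equal matches spill over
    # onto the right sibling page
    return scan_key >= high_key

# a "clean break" high key like ("b",) rules out moving right for "apple"
assert must_move_right(("apple",), ("b",)) == False
assert must_move_right(("b",), ("b",)) == True
```

Suffix truncation is what makes the clean-break high keys common in
practice; the check itself is trivial.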

> 2.  When inserting a duplicate value in the index, we used to try a few
> pages to see if there was space, then "get tired" and just split a page
> containing duplicates.  Now that there are no duplicates (because
> duplicate key fields are sorted by tid) the system knows exactly what
> page the index tuple belongs on, and inserts or splits the page as
> necessary.

Right -- inserts will descend straight to the correct leaf page, and
the "getting tired" dance isn't used anymore. This makes insertions
faster but, more importantly, it's a better high-level strategy for
storing lots of duplicates. We'll dirty far fewer pages, because
insertions automatically end up inserting around the same place as we
inserted to a moment ago. Insertions of duplicates behave like
serial/auto-incrementing insertions, which was already
fast/well-optimized in various ways.

It's easy to measure this by looking at index bloat when inserting
lots of duplicates -- I came up with the 16% figure in the talk based
on a simple insert-only test.
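A hedged sketch of why "getting tired" goes away: with the TID as a
tiebreaker, an insert's full key identifies exactly one slot, so we
can binary-search straight to it instead of probing sibling pages for
free space.

```python
# Sketch: a leaf page as a sorted list of (key, heap_tid) tuples.
# The insert position is fully determined; no probing needed.

from bisect import bisect_left

sorted_leaf = [("red", (1, 1)), ("red", (1, 2)), ("red", (5, 1))]

new_tuple = ("red", (1, 3))          # a new duplicate with heap TID (1, 3)
slot = bisect_left(sorted_leaf, new_tuple)
sorted_leaf.insert(slot, new_tuple)

# the duplicate lands next to its TID neighbors, like a serial insert
assert sorted_leaf[2] == ("red", (1, 3))
```

Because heap TIDs of fresh insertions tend to be adjacent, successive
duplicate inserts keep hitting the same leaf page, which is why far
fewer pages get dirtied.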

> 3.  Pivot tuples are used on internal pages and as min/max indicators on
> leaf pages.  These pivot tuples are now trimmed if their trailing key
> fields are not significant.  For example, if an index is
> field1/field2/field3, and the page contains values for which field1==5
> and none that field1==6, there is no need to include field2 and field3
> in the pivot tuple --- it can just list the pivot as field1==5,
> field2=infinity.  This is called suffix truncation.

Right -- that's exactly how it works. Users may find that indexes with
lots of extra columns at the end won't get so bloated in Postgres 12.
Indexing many columns is typically seen when index-only scans are
important. Of course, you could have made those indexes INCLUDE
indexes on v11, which is actually a closely related idea, but that
makes it impossible to use the trailing/suffix columns in an ORDER BY.
And, you have to know about INCLUDE indexes, and remember to use them.

(This must be why Oracle can get away with not having INCLUDE indexes.)
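Bruce's field1/field2/field3 example can be sketched as follows (a toy
model, not the real nbtsplitloc.c/_bt_truncate logic): the separator
keeps only the prefix of the first right-page key that is needed to
distinguish it from the last left-page key.

```python
# Sketch of suffix truncation: build the shortest separator that
# distinguishes the last key on the left page from the first key on
# the right page; the attributes dropped from the end behave like
# minus infinity when the pivot is compared.

def truncate_suffix(last_left, first_right):
    sep = []
    for l, r in zip(last_left, first_right):
        sep.append(r)
        if l != r:
            break   # this attribute already distinguishes the pages
    return tuple(sep)

# field1 differs (5 vs 6), so field2 and field3 are truncated away
assert truncate_suffix((5, 9, 9), (6, 1, 1)) == (6,)
# first two fields are equal, so keep them plus the distinguishing third
assert truncate_suffix((5, 9, 3), (5, 9, 7)) == (5, 9, 7)
```

With the heap TID tiebreaker, two adjacent keys always differ
somewhere, so a distinguishing prefix always exists.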

> Page splits used to just split the page in half, which minimizes the
> number of page splits, but sometimes causes lots of wasted space.  The
> new code tries to split to reduce the length of pivot tuples, which ends
> up being more efficient in space savings because the pivot tuples are
> shorter, and the leaf pages end up being more tightly packed.  This is
> particularly true for ever-increasing keys because we often end up
> creating a new empty page, rather than splitting an existing page.

Right. We need to be somewhat cleverer about precisely where we split
leaf pages in order to make suffix truncation work well. But, it turns
out that there is another case where being *very* clever about leaf
page split points matters a lot, which is targeted by the "split after
new tuple" optimization. The "split after new tuple" optimization was
just a bonus for me.

> 4.  Vacuum's removal of index tuples in indexes with many duplicates is
> faster since it can more quickly find desired tuples.

This may be true, but the interesting way that the TID-as-column thing
helps VACUUM is related to locality (spatial and temporal locality).
VACUUM can delete/remove whole pages at a time, though only when
they're completely empty (just one remaining tuple blocks page
deletion, because VACUUM cannot merge non-empty pages together). This
is much more likely to occur when we group like with like. Heap TID is
often correlated with primary key value, or with a timestamp, so we
can easily imagine VACUUM deleting more pages filled with duplicates
because they're grouped together in a way that's logical (instead of
more-or-less random, which is what "getting tired" left us with).
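Here's a toy model (my own illustration, not VACUUM's real logic) of
why grouping like with like matters: a page can only be deleted when
every tuple on it is dead, so deleting a contiguous run of duplicates
frees whole pages, while the same duplicates scattered by "getting
tired" leave no page empty.

```python
# Sketch: pages are lists of index tuples; count how many pages a
# VACUUM-style pass could delete, given that page deletion requires
# the page to be completely empty (no merging of non-empty pages).

def deletable_pages(pages, doomed):
    # a page survives if any of its tuples is still live
    return sum(1 for page in pages if all(t in doomed for t in page))

doomed = {("red", tid) for tid in range(4)}   # all "red" duplicates deleted

grouped   = [[("red", 0), ("red", 1)], [("red", 2), ("red", 3)]]
scattered = [[("red", 0), ("blue", 0)], [("red", 1), ("blue", 1)],
             [("red", 2), ("blue", 2)], [("red", 3), ("blue", 3)]]

assert deletable_pages(grouped, doomed) == 2    # whole pages reclaimed
assert deletable_pages(scattered, doomed) == 0  # every page still pinned
```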

Even without page deletion occurring, we can reuse ranges in the index
when there are lots of duplicates. We delete some rows in a table, and
VACUUM ultimately removes the rows from both the table and its indexes
-- including some interesting index that stores lots of duplicates,
like an index on an enum field. VACUUM makes it possible to recycle
the heap TIDs/space *everywhere*, not just in the table, as before. In
the fairly likely event that a future insert that recycles a heap TID
also has the same value for the index with duplicates (say it's an
enum value), the duplicate index tuple can go in exactly the same
place as the earlier, now-deleted duplicate tuple. The "recycling"
works at the index level, too. In practice this kind of locality is
rather common. It's especially likely with non-HOT updates where the
duplicate value isn't changed, though simple inserts and deletes can
see the same benefit.

Obviously you'll need to boil all that down -- good luck!

Thanks
-- 
Peter Geoghegan


