Re: Why does L&Y Blink Tree need lock coupling? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Why does L&Y Blink Tree need lock coupling?
Date
Msg-id CAH2-Wz=VVPLZLNjGJ2oFtagQe7Y5eSd9+7OBX9MeSPc+jg4Gbw@mail.gmail.com
Whole thread Raw
In response to Why does L&Y Blink Tree need lock coupling?  (Oliver Yang <olilent2ctw@gmail.com>)
Responses Re: Why does L&Y Blink Tree need lock coupling?  (Oliver Yang <olilent2ctw@gmail.com>)
List pgsql-hackers
On Sun, Dec 11, 2022 at 5:38 PM Oliver Yang <olilent2ctw@gmail.com> wrote:
> The README in nbtree mentions that L&Y algorithm must couple
> locks when moving right during ascent for insertion.  However,
> it's hard to see why that's necessary.

You're not the first person to ask about this exact point in the last
few years. The last time somebody asked me about it via private email.
I'm surprised that there is even that level of interest.

It's not necessary to couple locks on internal levels of the tree in
nbtree, of course. Which is why we don't couple locks in
_bt_getstackbuf().

Note that we have two "move right" functions here --
_bt_getstackbuf(), and _bt_moveright(). Whereas the L&Y pseudocode has
one function for both things. Perhaps just because it's only abstract
pseudocode -- it needs to be understood in context.

You have to be realistic about how faithfully a real world system can
be expected to implement something like L&Y. Some of the assumptions
made by the paper just aren't reasonable, especially the assumption
about atomic page reads (I always thought that that assumption was
odd). Plus nbtree does plenty of things that L&Y don't consider,
including things that are obviously related to the stuff they do talk
about. For example, nbtree has left links, while L&Y does not (nor
does Lanin & Shasha, though they do have something weirdly similar
that they call "out links").

> Since L&Y mostly
> discussed concurrent insertions and searches, what can go wrong
> if inserters only acquire one lock at a time?

You have to consider that we don't match on the separator key during
the ascent of the B-tree structure following a split. That's another a
big difference between nbtree and the paper -- we store a block number
in our descent stack instead.

Imagine if PostgreSQL's nbtree did match on a separator key, like
Lehman and Yao. Think about this scenario:

* We descend the tree and follow a downlink, while remembering the
associated separator key on our descent stack. It is a simple 3 level
B-Tree.

* We split a leaf page, and have to relocate the separator 1 level up
(in level 1).

* The high key of the internal page on level 1 exactly matches our
separator key -- so it must be that the separator key is to the right.

* We release our lock on the original parent, and then lock and read
its right sibling. But where do we insert our new separator key?

We cannot match the original separator key because it doesn't exist in
this other internal page to the right -- there is no first separator
key in any internal page, including when it isn't the root of the
tree. Actually, you could say that there is a separator key associated
with the downlink, but it's only a negative infinity sentinel key.
Negative infinity keys represent "absolute negative infinity" when
they're from the root of the entire B-Tree, but in any other internal
page it represents "relative negative infinity" -- it's only lower
than everything in that particular subtree only.

At the very least it seems risky to assume that it's safe to match on
the separator key without lock coupling so that we see that the
downlink really does come after the matches-descent-stack high key
separator in the original parent page. You could probably make it work
if you had to, but it's annoying to explain, and not actually that
valuable -- moving right within _bt_getstackbuf() is a rare case in
general (and trying to relocate the downlink that happens to become
the first downlink following a concurrent internal page split is even
rarer).

In PostgreSQL it's not annoying to understand why it's okay, because
it's obviously okay to just match on the downlink/block number
directly, which is how it has always worked. It only becomes a problem
when you try to understand what Lehman and Yao meant. It's unfortunate
that they say "at most 3 locks", and therefore draw attention to this
not-very-important issue. Lehman and Yao probably found it easier to
say "let's try to keep our paper simple by making the move right
routine couple locks in very rare cases where it is actually necessary
to move right".

> The Lanin&ShaSha paper cited in README also agrees that B-link
> structure allows inserts and searches to lock only one node at a
> time although it's not apparent in L&Y itself.

But the Lanin & Shasha paper has a far more optimistic approach. They
make rather bold claims about how many locks they can get away with
holding at any one time. That makes it significantly different to L&Y
as well as nbtree (nbtree is far closer to L&Y than it is to Lanin &
Shasha).

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Speedup generation of command completion tags
Next
From: Thomas Munro
Date:
Subject: Re: Tree-walker callbacks vs -Wdeprecated-non-prototype