Re: Why does L&Y Blink Tree need lock coupling? - Mailing list pgsql-hackers

From Oliver Yang
Subject Re: Why does L&Y Blink Tree need lock coupling?
Date
Msg-id CALJbhHO3dsu8tg5s8VyKiAvCx-fju9SEG7b=NLA8brNZoTpaqQ@mail.gmail.com
Whole thread Raw
In response to Re: Why does L&Y Blink Tree need lock coupling?  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Why does L&Y Blink Tree need lock coupling?  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Sun, Dec 11, 2022 at 6:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Dec 11, 2022 at 5:38 PM Oliver Yang <olilent2ctw@gmail.com> wrote:
> > The README in nbtree mentions that L&Y algorithm must couple
> > locks when moving right during ascent for insertion.  However,
> > it's hard to see why that's necessary.
>
> You're not the first person to ask about this exact point in the last
> few years. The last time somebody asked me about it via private email.
> I'm surprised that there is even that level of interest.
>
> It's not necessary to couple locks on internal levels of the tree in
> nbtree, of course. Which is why we don't couple locks in
> _bt_getstackbuf().
>
> Note that we have two "move right" functions here --
> _bt_getstackbuf(), and _bt_moveright(). Whereas the L&Y pseudocode has
> one function for both things. Perhaps just because it's only abstract
> pseudocode -- it needs to be understood in context.
>
> You have to be realistic about how faithfully a real world system can
> be expected to implement something like L&Y. Some of the assumptions
> made by the paper just aren't reasonable, especially the assumption
> about atomic page reads (I always thought that that assumption was
> odd). Plus nbtree does plenty of things that L&Y don't consider,
> including things that are obviously related to the stuff they do talk
> about. For example, nbtree has left links, while L&Y does not (nor
> does Lanin & Shasha, though they do have something weirdly similar
> that they call "out links").
>
> > Since L&Y mostly
> > discussed concurrent insertions and searches, what can go wrong
> > if inserters only acquire one lock at a time?
>
> You have to consider that we don't match on the separator key during
> the ascent of the B-tree structure following a split. That's another a
> big difference between nbtree and the paper -- we store a block number
> in our descent stack instead.
>
> Imagine if PostgreSQL's nbtree did match on a separator key, like
> Lehman and Yao. Think about this scenario:
>
> * We descend the tree and follow a downlink, while remembering the
> associated separator key on our descent stack. It is a simple 3 level
> B-Tree.
>
> * We split a leaf page, and have to relocate the separator 1 level up
> (in level 1).
>
> * The high key of the internal page on level 1 exactly matches our
> separator key -- so it must be that the separator key is to the right.
>
> * We release our lock on the original parent, and then lock and read
> its right sibling. But where do we insert our new separator key?
>
> We cannot match the original separator key because it doesn't exist in
> this other internal page to the right -- there is no first separator
> key in any internal page, including when it isn't the root of the
> tree. Actually, you could say that there is a separator key associated
> with the downlink, but it's only a negative infinity sentinel key.
> Negative infinity keys represent "absolute negative infinity" when
> they're from the root of the entire B-Tree, but in any other internal
> page it represents "relative negative infinity" -- it's only lower
> than everything in that particular subtree only.
>
> At the very least it seems risky to assume that it's safe to match on
> the separator key without lock coupling so that we see that the
> downlink really does come after the matches-descent-stack high key
> separator in the original parent page. You could probably make it work
> if you had to, but it's annoying to explain, and not actually that
> valuable -- moving right within _bt_getstackbuf() is a rare case in
> general (and trying to relocate the downlink that happens to become
> the first downlink following a concurrent internal page split is even
> rarer).
>
> In PostgreSQL it's not annoying to understand why it's okay, because
> it's obviously okay to just match on the downlink/block number
> directly, which is how it has always worked. It only becomes a problem
> when you try to understand what Lehman and Yao meant. It's unfortunate
> that they say "at most 3 locks", and therefore draw attention to this
> not-very-important issue. Lehman and Yao probably found it easier to
> say "let's try to keep our paper simple by making the move right
> routine couple locks in very rare cases where it is actually necessary
> to move right".

As you suggested, the coupling lock during moveright could be
avoided by tracking downlink instead of the separator key during
descent.  In a sense, this isn't a fundamental issue and L&Y
paper could be easily tweaked to track downlink so that it
doesn't require coupling lock in moveright.

However, it's hard to see why coupling lock is needed during
ascent from child level to parent level in the L&Y setting.  What can
go wrong if L&Y's algorithm releases lock on child page before
acquiring lock on its parent?  The correctness proof in L&Y
doesn't use the assumption of coupling lock anywhere.  It appears
that a lock at a time is sufficient in principle.

> > The Lanin&ShaSha paper cited in README also agrees that B-link
> > structure allows inserts and searches to lock only one node at a
> > time although it's not apparent in L&Y itself.
>
> But the Lanin & Shasha paper has a far more optimistic approach. They
> make rather bold claims about how many locks they can get away with
> holding at any one time. That makes it significantly different to L&Y
> as well as nbtree (nbtree is far closer to L&Y than it is to Lanin &
> Shasha).

The direct quote from section 1.2 of Lanin & Shasha
paper: "Although it is not apparent in itself, the B-link
structure allows inserts and searches to lock only one node at a
time."  It seems to be an assertion on the property of the L&Y
algorithm.  It doesn't seem to be related the optimistic approach
employed in Lanin & Shasha own algorithm.


Best,

Hong



pgsql-hackers by date:

Previous
From: Ilya Gladyshev
Date:
Subject: Re: Progress report of CREATE INDEX for nested partitioned tables
Next
From: Alvaro Herrera
Date:
Subject: Re: PGDOCS - Logical replication GUCs - added some xrefs