Thread: Should the nbtree page split REDO routine's locking work more like the locking on the primary?

While reviewing an amcheck patch of Andrey Borodin's, I noticed that
it had a problem that I tied back to btree_xlog_split()'s loose
approach to locking buffers compared to the primary [1] (i.e. compared
to _bt_split()). This created a problem the proposed new check that is
not unlike the problem that backwards scans running on standbys had
with "concurrent" page deletions -- that was a legitimate bug that was
fixed in commit 9a9db08a.

I'm starting to think that we should bite the bullet and not release
all same-level locks within btree_xlog_split() until the very end,
along with the existing right sibling page whose left link we need to
update. In other words, "couple" the locks in the manner of
_bt_split(), though only for same-level pages (just like
btree_xlog_unlink_page() after commit 9a9db08a). That would make it
okay to commit Andrey's patch, but it also seems like a good idea on
general principle. (Note that I'm not proposing cross-level lock
coupling on replicas, which seems unnecessary. Actually it's not
really possible to do that because cross-level locks span multiple
atomic actions/WAL records on the primary.)

Presumably the lock coupling on standbys will have some overhead, but
that seems essentially the same as the overhead on the primary. The
obvious case to test (to evaluate the overhead of being more
conservative in btree_xlog_split()) is a workload where we continually
split the rightmost page. That's not actually relevant, though, since
there is no right sibling to update when we split the rightmost page.

My sense is that the current approach to locking taken within
btree_xlog_split() is kind of an accident, not something that was
pursued as a special optimization for the REDO routine at some point.
Commit 3bbf668d certainly creates that impression. But I might have
missed something.

[1] postgr.es/m/CAH2-Wzm3=SLwu5=z8qG6UBpCemZW3dUNXWbX-cpXCgb=y3OhZw@mail.gmail.com
-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> I'm starting to think that we should bite the bullet and not release
> all same-level locks within btree_xlog_split() until the very end,
> along with the existing right sibling page whose left link we need to
> update.

+1 for making this more like what happens in original execution ("on the
primary", to use your wording).  Perhaps what you suggest here is still
not enough like the original execution, but it sounds closer.

> My sense is that the current approach to locking taken within
> btree_xlog_split() is kind of an accident, not something that was
> pursued as a special optimization for the REDO routine at some point.
> Commit 3bbf668d certainly creates that impression. But I might have
> missed something.

As the commit message for 3bbf668d explains, the initial situation for
all the replay code was that it executed by itself in crash recovery and
didn't need to bother with locks at all.  I think that it did take some
locks even then, but that was because of code sharing with the primary
execution path rather than being something we wanted.  Once we invented
Hot Standby that situation had to be improved.  It seems to me that the
goal now needs to be to replicate the primary-execution buffer locking
behavior in any case where we can't prove that something simpler is safe.
3bbf668d did not claim to achieve that everywhere, and it didn't; it
doesn't surprise me that there's work left to be done.

            regards, tom lane



On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +1 for making this more like what happens in original execution ("on the
> primary", to use your wording).  Perhaps what you suggest here is still
> not enough like the original execution, but it sounds closer.

It won't be the same as the original execution, exactly -- I am only
thinking of holding on to same-level page locks (the original page,
its new right sibling, and the original right sibling). I suppose that
it's possible to go further than this in one rarer case (when clearing
incomplete split flag one level down), but for the most part it isn't
even possible to follow original execution's approach to locking in
every detail. Clearly it's not okay for the startup process to hold
buffer locks across replay of the first and second phase of a split,
but that's what it would take to follow original execution 100%
faithfully -- there are two WAL records involved.

I am quite confident that there won't be any remaining problems
provided we follow the original execution's approach to locking within
each level of the tree -- that's enough. Anything that runs during
recovery won't care about cross-level differences, aside from the
obvious (scans may have to move right to recover from concurrent
splits).

> As the commit message for 3bbf668d explains, the initial situation for
> all the replay code was that it executed by itself in crash recovery and
> didn't need to bother with locks at all.  I think that it did take some
> locks even then, but that was because of code sharing with the primary
> execution path rather than being something we wanted.

Makes sense.

-- 
Peter Geoghegan



On Thu, Aug 6, 2020 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > +1 for making this more like what happens in original execution ("on the
> > primary", to use your wording).  Perhaps what you suggest here is still
> > not enough like the original execution, but it sounds closer.
>
> It won't be the same as the original execution, exactly -- I am only
> thinking of holding on to same-level page locks (the original page,
> its new right sibling, and the original right sibling).

I pushed a commit that reorders the lock acquisitions within
btree_xlog_unlink_page() -- they're now consistent with _bt_split()
(at least among sibling pages involved in the page split).

Thanks
-- 
Peter Geoghegan




> 8 авг. 2020 г., в 03:28, Peter Geoghegan <pg@bowt.ie> написал(а):
>
> On Thu, Aug 6, 2020 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> +1 for making this more like what happens in original execution ("on the
>>> primary", to use your wording).  Perhaps what you suggest here is still
>>> not enough like the original execution, but it sounds closer.
>>
>> It won't be the same as the original execution, exactly -- I am only
>> thinking of holding on to same-level page locks (the original page,
>> its new right sibling, and the original right sibling).
>
> I pushed a commit that reorders the lock acquisitions within
> btree_xlog_unlink_page() -- they're now consistent with _bt_split()
> (at least among sibling pages involved in the page split).

Sounds great, thanks!

Best regards, Andrey Borodin.