While reviewing an amcheck patch of Andrey Borodin's, I noticed that
it had a problem that I tied back to btree_xlog_split()'s loose
approach to locking buffers compared to the primary [1] (i.e. compared
to _bt_split()). This created a problem for the proposed new check that is
not unlike the problem that backwards scans running on standbys had
with "concurrent" page deletions -- that was a legitimate bug that was
fixed in commit 9a9db08a.
I'm starting to think that we should bite the bullet and not release
any of the same-level locks within btree_xlog_split() until the very
end, including the lock on the existing right sibling page whose left
link we need to update. In other words, "couple" the locks in the manner of
_bt_split(), though only for same-level pages (just like
btree_xlog_unlink_page() after commit 9a9db08a). That would make it
okay to commit Andrey's patch, but it also seems like a good idea on
general principle. (Note that I'm not proposing cross-level lock
coupling on replicas, which seems unnecessary. Actually it's not
really possible to do that because cross-level locks span multiple
atomic actions/WAL records on the primary.)
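To make that concrete, here is a rough sketch of the locking order I
have in mind for btree_xlog_split(). The variable names and block IDs
are only illustrative (I haven't checked that they match what
nbtxlog.c uses today); the point is just where the
UnlockReleaseBuffer() calls end up:

    Buffer      lbuf;
    Buffer      rbuf;
    Buffer      sbuf = InvalidBuffer;

    /* Reconstruct the new right half of the split from the WAL record */
    rbuf = XLogInitBufferForRedo(record, 1);
    /* ... initialize and fill in the new right page ... */

    /* Re-apply the split to the original (left) page */
    if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)
    {
        /* ... shrink the left page, point its right link at the new page ... */
    }

    /*
     * Proposed change: don't release lbuf/rbuf here.  Keep them locked
     * while fixing the old right sibling's left link, so that the locks
     * are coupled in the same way that _bt_split() couples them on the
     * primary.
     */
    if (XLogRecHasBlockRef(record, 2) &&
        XLogReadBufferForRedo(record, 2, &sbuf) == BLK_NEEDS_REDO)
    {
        /* ... point the old right sibling's btpo_prev at the new page ... */
    }

    /* Only now release all of the same-level locks, together */
    if (BufferIsValid(sbuf))
        UnlockReleaseBuffer(sbuf);
    if (BufferIsValid(lbuf))
        UnlockReleaseBuffer(lbuf);
    UnlockReleaseBuffer(rbuf);

Whether the old right sibling is really registered as block 2 doesn't
matter for this discussion; what matters is that all of the
UnlockReleaseBuffer() calls move to the very end of the routine.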
Presumably the lock coupling on standbys will have some overhead, but
that seems essentially the same as the overhead on the primary. The
obvious case to test (to evaluate the overhead of being more
conservative in btree_xlog_split()) is a workload where we continually
split the rightmost page. That's not actually relevant, though, since
there is no right sibling to update when we split the rightmost page.
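(In code terms: the old right sibling only gets registered in the WAL
record when the original page wasn't the rightmost page on its level,
so a REDO routine along the lines of the sketch above would simply
skip the extra step -- and the extra lock -- under that workload.
Again, the block ID is just illustrative:

    /* nothing extra to lock when the original page was the rightmost page */
    if (XLogRecHasBlockRef(record, 2))
    {
        /* ... fix the old right sibling's left link ... */
    }
)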
My sense is that the current approach to locking taken within
btree_xlog_split() is kind of an accident, not something that was
pursued as a special optimization for the REDO routine at some point.
Commit 3bbf668d certainly creates that impression. But I might have
missed something.
[1] postgr.es/m/CAH2-Wzm3=SLwu5=z8qG6UBpCemZW3dUNXWbX-cpXCgb=y3OhZw@mail.gmail.com
--
Peter Geoghegan