Thread: Should the nbtree page split REDO routine's locking work more like the locking on the primary?
Should the nbtree page split REDO routine's locking work more like the locking on the primary?
From
Peter Geoghegan
Date:
While reviewing an amcheck patch of Andrey Borodin's, I noticed that it had a problem that I tied back to btree_xlog_split()'s loose approach to locking buffers compared to the primary [1] (i.e. compared to _bt_split()). This created a problem the proposed new check that is not unlike the problem that backwards scans running on standbys had with "concurrent" page deletions -- that was a legitimate bug that was fixed in commit 9a9db08a. I'm starting to think that we should bite the bullet and not release all same-level locks within btree_xlog_split() until the very end, along with the existing right sibling page whose left link we need to update. In other words, "couple" the locks in the manner of _bt_split(), though only for same-level pages (just like btree_xlog_unlink_page() after commit 9a9db08a). That would make it okay to commit Andrey's patch, but it also seems like a good idea on general principle. (Note that I'm not proposing cross-level lock coupling on replicas, which seems unnecessary. Actually it's not really possible to do that because cross-level locks span multiple atomic actions/WAL records on the primary.) Presumably the lock coupling on standbys will have some overhead, but that seems essentially the same as the overhead on the primary. The obvious case to test (to evaluate the overhead of being more conservative in btree_xlog_split()) is a workload where we continually split the rightmost page. That's not actually relevant, though, since there is no right sibling to update when we split the rightmost page. My sense is that the current approach to locking taken within btree_xlog_split() is kind of an accident, not something that was pursued as a special optimization for the REDO routine at some point. Commit 3bbf668d certainly creates that impression. But I might have missed something. [1] postgr.es/m/CAH2-Wzm3=SLwu5=z8qG6UBpCemZW3dUNXWbX-cpXCgb=y3OhZw@mail.gmail.com -- Peter Geoghegan
Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > I'm starting to think that we should bite the bullet and not release > all same-level locks within btree_xlog_split() until the very end, > along with the existing right sibling page whose left link we need to > update. +1 for making this more like what happens in original execution ("on the primary", to use your wording). Perhaps what you suggest here is still not enough like the original execution, but it sounds closer. > My sense is that the current approach to locking taken within > btree_xlog_split() is kind of an accident, not something that was > pursued as a special optimization for the REDO routine at some point. > Commit 3bbf668d certainly creates that impression. But I might have > missed something. As the commit message for 3bbf668d explains, the initial situation for all the replay code was that it executed by itself in crash recovery and didn't need to bother with locks at all. I think that it did take some locks even then, but that was because of code sharing with the primary execution path rather than being something we wanted. Once we invented Hot Standby that situation had to be improved. It seems to me that the goal now needs to be to replicate the primary-execution buffer locking behavior in any case where we can't prove that something simpler is safe. 3bbf668d did not claim to achieve that everywhere, and it didn't; it doesn't surprise me that there's work left to be done. regards, tom lane
Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?
From
Peter Geoghegan
Date:
On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > +1 for making this more like what happens in original execution ("on the > primary", to use your wording). Perhaps what you suggest here is still > not enough like the original execution, but it sounds closer. It won't be the same as the original execution, exactly -- I am only thinking of holding on to same-level page locks (the original page, its new right sibling, and the original right sibling). I suppose that it's possible to go further than this in one rarer case (when clearing incomplete split flag one level down), but for the most part it isn't even possible to follow original execution's approach to locking in every detail. Clearly it's not okay for the startup process to hold buffer locks across replay of the first and second phase of a split, but that's what it would take to follow original execution 100% faithfully -- there are two WAL records involved. I am quite confident that there won't be any remaining problems provided we follow the original execution's approach to locking within each level of the tree -- that's enough. Anything that runs during recovery won't care about cross-level differences, aside from the obvious (scans may have to move right to recover from concurrent splits). > As the commit message for 3bbf668d explains, the initial situation for > all the replay code was that it executed by itself in crash recovery and > didn't need to bother with locks at all. I think that it did take some > locks even then, but that was because of code sharing with the primary > execution path rather than being something we wanted. Makes sense. -- Peter Geoghegan
Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?
From
Peter Geoghegan
Date:
On Thu, Aug 6, 2020 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > +1 for making this more like what happens in original execution ("on the > > primary", to use your wording). Perhaps what you suggest here is still > > not enough like the original execution, but it sounds closer. > > It won't be the same as the original execution, exactly -- I am only > thinking of holding on to same-level page locks (the original page, > its new right sibling, and the original right sibling). I pushed a commit that reorders the lock acquisitions within btree_xlog_unlink_page() -- they're now consistent with _bt_split() (at least among sibling pages involved in the page split). Thanks -- Peter Geoghegan
Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?
From
"Andrey M. Borodin"
Date:
> 8 авг. 2020 г., в 03:28, Peter Geoghegan <pg@bowt.ie> написал(а): > > On Thu, Aug 6, 2020 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote: >> On Thu, Aug 6, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> +1 for making this more like what happens in original execution ("on the >>> primary", to use your wording). Perhaps what you suggest here is still >>> not enough like the original execution, but it sounds closer. >> >> It won't be the same as the original execution, exactly -- I am only >> thinking of holding on to same-level page locks (the original page, >> its new right sibling, and the original right sibling). > > I pushed a commit that reorders the lock acquisitions within > btree_xlog_unlink_page() -- they're now consistent with _bt_split() > (at least among sibling pages involved in the page split). Sounds great, thanks! Best regards, Andrey Borodin.