Thread: Avoiding superfluous buffer locking during nbtree backwards scans
Attached patch teaches nbtree backwards scans to avoid needlessly relocking a previously read page/buffer at the point where we need to consider reading the next page (the page to the left).

Currently, we do this regardless of whether or not we already decided to end the scan, back when we read the page within _bt_readpage. We'll relock the page we've just read (and just unlocked), just to follow its left link, while making sure to deal with concurrent page splits correctly. But why bother relocking the page, or with thinking about concurrent splits, if we can already plainly see that we cannot possibly need to find matching tuples on the left sibling page? The patch just adds a simple precheck, which works in the obvious way. Seems like this was a missed opportunity for commit 2ed5b87f96.

On HEAD, the following query requires 24 buffer hits (I'm getting a standard/forward index scan for this):

select
abalance
from
pgbench_accounts
where
aid in (1, 500, 1000, 1500, 2000, 3000) order by aid asc;

However, we don't see that with the backwards scan variant:

select
abalance
from
pgbench_accounts
where
aid in (1, 500, 1000, 1500, 2000, 3000) order by aid desc;

We actually see 30 buffer hits for this on HEAD. Whereas with the patch, both variants get only 24 buffer hits -- there's parity, at least in cases like these two.

Note that we only "achieve parity" here because we happen to be using a SAOP, requiring multiple primitive index scans, each of which ends with its own superfluous lock acquisition. Backwards scans remain at a disadvantage with regard to buffer locks acquired in other cases -- cases that happen to involve scanning several sibling leaf pages in sequence (no change there). It's probably possible to fix those more complicated cases too. But that would require a significantly more complicated approach. I haven't touched existing comments in _bt_readnextpage that contemplate this possibility.

--
Peter Geoghegan
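To illustrate the precheck idea described above, here is a minimal sketch of the kind of early-exit test involved, assuming it lives in the backwards-scan step-page path and consults the moreLeft flag that _bt_readpage already maintains (the names and exact placement are assumptions, not the patch itself):

    /*
     * Hypothetical sketch only: bail out before relocking the just-read
     * page when _bt_readpage already determined that no matches can
     * possibly exist on the left sibling page.
     */
    if (ScanDirectionIsBackward(dir) && !so->currPos.moreLeft)
    {
        /* No need to follow the left link, or to think about page splits */
        BTScanPosUnpinIfPinned(so->currPos);
        BTScanPosInvalidate(so->currPos);
        return false;
    }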
On Fri, 5 Jul 2024 at 18:48, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch teaches nbtree backwards scans to avoid needlessly
> relocking a previously read page/buffer at the point where we need to
> consider reading the next page (the page to the left).

+1, LGTM.

This changes the backward scan code in _bt_readpage to have an approximately equivalent handling as the forward scan case for end-of-scan cases, which is an improvement IMO.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> +1, LGTM.
>
> This changes the backward scan code in _bt_readpage to have an
> approximately equivalent handling as the forward scan case for
> end-of-scan cases, which is an improvement IMO.

Pushed just now. Thanks for the review!

--
Peter Geoghegan
On Fri, Jul 5, 2024 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On HEAD, the following query requires 24 buffer hits (I'm getting a
> standard/forward index scan for this):
>
> select
> abalance
> from
> pgbench_accounts
> where
> aid in (1, 500, 1000, 1500, 2000, 3000) order by aid asc;
>
> However, we don't see that with the backwards scan variant:
>
> select
> abalance
> from
> pgbench_accounts
> where
> aid in (1, 500, 1000, 1500, 2000, 3000) order by aid desc;
>
> We actually see 30 buffer hits for this on HEAD. Whereas with the
> patch, both variants get only 24 buffer hits -- there's parity, at
> least in cases like these two.

I'll now present a similar example that shows the remaining issue that my patch (which became my commit 3f44959f) didn't address, and how the issue is fixed by Matthias' more recent follow-up patch:

$ pgbench -i -s 1

Now run:

explain (analyze, buffers, costs off, summary off)
select abalance from pgbench_accounts order by aid;

With a warm cache, this query gets a full index scan requiring 1915 buffer hits. However, we still see worse performance for an equivalent backwards scan:

explain (analyze, buffers, costs off, summary off)
select abalance from pgbench_accounts order by aid desc;

On master, the latter query gets 2188 buffer hits -- so clearly my commit 3f44959f left money on the table. These extra buffer hits just don't make sense. Matthias' patch (I tested v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch) will result in both variants requiring 1915 buffer hits. (Plus, of course, query execution should be faster with Matthias' new approach.)

Though we do need special extra steps for dealing with concurrent page splits during backwards scans (steps involving a reread of the original page when its sibling link turns out to be stale), there hasn't been a concurrent split here -- which is why I find the inconsistency to be unnatural.

Matthias' patch fixes the problem by passing _bt_walk_left the left link that will now have been stashed in the local scan state back when the page was read by _bt_readpage, so that it (_bt_walk_left) can optimistically *start* there (and *not* start at the page that has already been read, as on the master branch). Of course, the left link might be stale, but it was always inherently necessary for _bt_walk_left to be able to deal with that. Technically we're being more optimistic about it now, but that optimism is extremely well justified (concurrent page splits are rare).

Arguably, this latest patch from Matthias actually makes the code simpler. It makes the backwards scan case more similar to the forwards scan case. This is another missed opportunity for commit 2ed5b87f96, I'd say -- that 2015 commit left things in a slightly odd in-between state, which this patch fully corrects.

Note: my example was deliberately constructed to use an index scan backwards, rather than an index-only scan backwards. While both types of backwards scan will benefit in the same way (less buffer lock traffic), you'll only see a reduction in buffers hit for an EXPLAIN (ANALYZE, BUFFERS) involving a plain index scan backwards. This is due to the fact that the mechanism added by commit 2ed5b87f96 will only drop a leaf page buffer pin early for plain index scans, never for index-only scans. My example also wouldn't make the difference apparent with an unlogged table, for the same reason -- this difference isn't really important, but seems worth noting to avoid any confusion.
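To make the stashed-left-link idea concrete, here is a rough sketch of the flow, under the assumption that _bt_readpage copies the page's sibling links into the scan's currPos state while it still holds the page lock (the field names here are illustrative, not necessarily the ones used in the patch):

    /* while _bt_readpage still holds a lock on the page it is reading: */
    so->currPos.prevPage = opaque->btpo_prev;   /* stash left link for later */
    so->currPos.nextPage = opaque->btpo_next;

    /*
     * Later, when stepping left, start directly at the stashed left link
     * instead of relocking the just-read page to refetch it.  The step-left
     * code still has to cope with the (rare) case where a concurrent page
     * split made the stashed link stale.
     */
    blkno = so->currPos.prevPage;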
A nice side benefit of this approach is that it'll make it easier to add better coverage for _bt_walk_left. The case where _bt_walk_left notices a stale left link will still be very rare (we're really not being all that optimistic in doing things this way), but it'll now be easier to hit on purpose.

The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch looks reasonable to me. I don't think that there are any impediments to committing it sometime this week.

--
Peter Geoghegan
On Wed, 16 Oct 2024 at 20:03, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Oct 11, 2024 at 7:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached is v5
>
> Now I'm attaching a v6, which further polishes things. Current plan is to
> commit something very close to this in the next day or two.
>
> v6 is mostly just further comment polishing. But it also merges together
> the two existing _bt_readnextpage loops (for forward and backward scan
> directions) into one single loop that handles everything. This is
> definitely a win for brevity, and might also be a win for clarity --
> but I'm not 100% sure which way I prefer it just yet.

(patch v6)
> -   BlockNumber btps_scanPage;      /* latest or next page to be scanned */
> +   BlockNumber btps_nextScanPage;  /* next page to be scanned */
> +   BlockNumber btps_lastCurrPage;  /* page whose sibling link was copied into
> +                                    * btps_scanPage */

This reference to btps_scanPage in the comment needs to be updated to its new name.

(from your v5 mail)
> * Now nbtree has only one PredicateLockPage call, inside _bt_readpage.
> This saves us an extra BufferGetBlockNumber call. (v4 pushed
> PredicateLockPage calls down one layer, v5 pushes them down another
> layer still.)

(and seen in patch v6)
> +       /* allow next page to be processed by parallel worker */
> +       if (scan->parallel_scan)
> +       {
> +           if (ScanDirectionIsForward(dir))
> +               _bt_parallel_release(scan, so->currPos.nextPage,
> +                                    so->currPos.currPage);
> +           else
> +               _bt_parallel_release(scan, so->currPos.prevPage,
> +                                    so->currPos.currPage);
> +       }
> +
> +       PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);

I'm a little bit concerned about this change: I'm not familiar with predicate locks, PLP, or anything related to SSI, but previously we called PLP in parallel scans while we still had hold of the scan, while we now call PLP only after letting other backends do things, allowing PLPs for this scan to happen concurrently with other backends of this scan. In principle, that looks like an improvement in parallelism by reducing the work done in the critical path.

However, before, we would always get predicate locks in ascending (or descending for backward scans) value order, but now that strict keyspace order access has been released, an unfortunately timed descheduling of the process between _bt_p_release and PLP's guts means lock acquisition would be only approximately in index leaf page order. If the lock acquisition order matters in the predicate lock system, then this will likely be a new cause of lock conflicts/deadlocks.

If that's a real issue (by which I mean predicate locking is indeed sensitive to acquisition order) then this call to PredicateLockPage must happen before _bt_p_release, so that users don't get conflicts caused only by bad timing issues in single-directional index scans.

Apart from these two comments, LGTM.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
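For clarity, the reordering being suggested would look roughly like the following, a sketch derived from the quoted v6 hunk only; whether it is actually needed is the open question above:

    /* sketch of the suggested ordering: take the predicate lock first ... */
    PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);

    /* ... and only then let other workers advance the parallel scan */
    if (scan->parallel_scan)
    {
        if (ScanDirectionIsForward(dir))
            _bt_parallel_release(scan, so->currPos.nextPage,
                                 so->currPos.currPage);
        else
            _bt_parallel_release(scan, so->currPos.prevPage,
                                 so->currPos.currPage);
    }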
On Thu, Oct 17, 2024 at 5:13 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I'm a little bit concerned about this change: I'm not familiar with
> predicate locks, PLP, or anything related to SSI, but previously we
> called PLP in parallel scans while we still had hold of the scan,
> while we now call PLP only after letting other backends do things,
> allowing PLPs for this scan to happen concurrently with other backends
> of this scan. In principle, that looks like an improvement in
> parallelism by reducing the work done in the critical path.

It sort of is, but not really. FWIW, there were two thoughts behind this:

1. Try to avoid gratuitous BufferGetBlockNumber() calls (use a stashed BlockNumber instead), since nbtree has a tendency to do too many of those (something that Andres complained about not long ago).

2. Try to do as little as possible while the backend has seized the scan, on general principle. If only to set a good example going forward -- loudly advertising when the scan is seized should also help.

I doubt it matters very much if we call PredicateLockPage() while the parallel scan is seized -- I wouldn't expect avoiding it (waiting until the parallel scan has been released to call PredicateLockPage) will measurably help performance. But I can't see how it could possibly be incorrect, either. There is no reason to think that the precise order that we call PredicateLockPage() for each page matters, in general. Just as it doesn't matter if the index is scanned forwards or backwards.

> Apart from these two comments, LGTM.

Pushed something very close to v6 several hours ago.

Thanks
--
Peter Geoghegan
Hi, thanks for working on these improvements.

I noticed an unexpected behavior where the index scan continues instead of stopping, even when it detects that there are no tuples that match the conditions. (I observed this while reviewing the skip scan patch, though it isn't directly related to this issue.)

On 2024-10-12 08:29, Peter Geoghegan wrote:
> On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * We now reset currPos state (including even its moreLeft/moreRight
> fields) within _bt_parallel_seize, automatically and regardless of any
> other details.

IIUC, the above change is the root cause. The commit 1bd4bc8 adds a reset of the currPos state in _bt_parallel_seize(). However, this change can overwrite currPos.moreRight which should be preserved before calling _bt_readnextpage().

* Test case

-- Prepare
DROP TABLE IF EXISTS test;
CREATE TABLE test (smallint smallint, bool bool);
INSERT INTO test (SELECT -20000+i%40000, random()>0.5 FROM generate_series(1, 1_000_000) s(i));
CREATE INDEX test_smallint ON test (smallint);
VACUUM ANALYZE test;

-- Check the number of pages of the index
=# SELECT relpages FROM pg_class WHERE relname = 'test_smallint';
 relpages
----------
      937
(1 row)

-- Test
=# SET max_parallel_workers = 0;
=# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT COUNT(*) FROM test WHERE smallint < -10000;
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=5170.23..5170.24 rows=1 width=8) (actual time=71.352..71.402 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=934
   ->  Gather  (cost=5170.01..5170.22 rows=2 width=8) (actual time=71.344..71.395 rows=1 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 2
         Workers Launched: 0
         Buffers: shared hit=934
         ->  Partial Aggregate  (cost=4170.01..4170.02 rows=1 width=8) (actual time=71.199..71.199 rows=1 loops=1)
               Output: PARTIAL count(*)
               Buffers: shared hit=934
               ->  Parallel Index Only Scan using test_smallint on public.test  (cost=0.42..3906.27 rows=105495 width=0) (actual time=0.062..49.137 rows=250000 loops=1)
                     Output: "smallint"
                     Index Cond: (test."smallint" < '-10000'::integer)
                     Heap Fetches: 0
                     Buffers: shared hit=934
                     -- This is not the result I expected. Almost all relpages are being read to retrieve only 25% of the tuples.
                     -- Without commit 1bd4bc8, the number was '236' in my environment.
 Planning Time: 0.105 ms
 Execution Time: 71.454 ms
(18 rows)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On 2024-11-08 07:42, Peter Geoghegan wrote:
> Attached revision v2 pushes the fix further in this direction. It
> explicitly documents that parallel _bt_readnextpage callers are
> allowed to use their so->currPos state (including
> blkno/nextPage/prevPage) to help _bt_readnextpage to decide whether it
> can end the scan right away. However, parallel index scan callers
> still won't be allowed to use this same so->currPos state to decide
> what page to go to next -- _bt_readnextpage really will need to seize
> the scan to get an authoritative blkno to make that part safe
> (actually, one parallel scan _bt_readnextpage caller still seizes the
> scan for itself, which is the one caller where _bt_readnextpage fully
> trusts caller's blkno + lastcurrblkno).

Thank you! I've confirmed that the v2 patch fixed the bug. As you mentioned, I also feel that the v2 patch is now easier to understand.

Since I couldn't understand the reason, I have a question: is the following deletion related to this change?

@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)

        /*
         * Should not mark parallel scan done when there's still a pending
-        * primitive index scan (defensive)
+        * primitive index scan
         */

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> Thank you! I've confirmed that the v2 patch fixed the bug. As you
> mentioned, I also feel that the v2 patch is now easier to understand.

Great. I pushed something quite similar to v2 just now. Thanks!

> Since I couldn't understand the reason, I have a question: is the
> following deletion related to this change?
>
> @@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
>
>         /*
>          * Should not mark parallel scan done when there's still a
> pending
> -        * primitive index scan (defensive)
> +        * primitive index scan
>          */

That's a good question.

Prior to this bugfix, the check of so->needPrimScan from within _bt_parallel_done() was defensive; it wasn't strictly necessary. It could have been "Assert(!so->needPrimScan)" instead (I guess that I didn't make it into an assert like this because _bt_parallel_done worked a little like this, prior to Postgres 17, when we had a primscan counter instead of the current so->needPrimScan flag). But that's no longer the case with the bugfix in place; the so->needPrimScan check really is strictly necessary now.

It's hard to see why this is. Notice that _bt_parallel_seize() will just return false when another primitive index scan is required. Prior to this bugfix, we'd seize the parallel scan within _bt_steppage, which could only succeed when "!so->needPrimScan" (which was actually verified by an assertion that just got removed). With this bug fix, nothing stops the so->needPrimScan flag from still being set from back when we called _bt_readpage for the so->currPos we're using. And so, and as I said, the check of so->needPrimScan from within _bt_parallel_done() became strictly necessary (not just defensive) -- since so->needPrimScan definitely can be set when we call _bt_parallel_done, and we definitely don't want to *really* end the whole top-level scan when it is set (we must never confuse the end of one primitive index scan with the end of the whole top-level parallel index scan).

--
Peter Geoghegan
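In other words, the now-essential check amounts to an early return near the top of _bt_parallel_done(), along these lines (a simplified sketch; the real function does more than shown here):

    void
    _bt_parallel_done(IndexScanDesc scan)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;

        /*
         * Should not mark parallel scan done when there's still a pending
         * primitive index scan -- that would end the whole top-level
         * parallel scan, not just the current primitive scan
         */
        if (so->needPrimScan)
            return;

        /* ... otherwise mark the shared parallel scan state as finished ... */
    }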
On 2024-11-09 03:40, Peter Geoghegan wrote:
> On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>> Since I couldn't understand the reason, I have a question: is the
>> following deletion related to this change?
>>
>> @@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
>>
>>         /*
>>          * Should not mark parallel scan done when there's still a
>> pending
>> -        * primitive index scan (defensive)
>> +        * primitive index scan
>>          */
>
> That's a good question.
>
> Prior to this bugfix, the check of so->needPrimScan from within
> _bt_parallel_done() was defensive; it wasn't strictly necessary. It
> could have been "Assert(!so->needPrimScan)" instead (I guess that I
> didn't make it into an assert like this because _bt_parallel_done
> worked a little like this, prior to Postgres 17, when we had a
> primscan counter instead of the current so->needPrimScan flag). But
> that's no longer the case with the bugfix in place; the
> so->needPrimScan check really is strictly necessary now.
>
> It's hard to see why this is. Notice that _bt_parallel_seize() will
> just return false when another primitive index scan is required. Prior
> to this bugfix, we'd seize the parallel scan within _bt_steppage,
> which could only succeed when "!so->needPrimScan" (which was actually
> verified by an assertion that just got removed). With this bug fix,
> nothing stops the so->needPrimScan flag from still being set from back
> when we called _bt_readpage for the so->currPos we're using. And so,
> and as I said, the check of so->needPrimScan from within
> _bt_parallel_done() became strictly necessary (not just defensive) --
> since so->needPrimScan definitely can be set when we call
> _bt_parallel_done, and we definitely don't want to *really* end the
> whole top-level scan when it is set (we must never confuse the end of
> one primitive index scan with the end of the whole top-level parallel
> index scan).

I understand, thanks to your explanation.

Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(), _bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is called with so->needPrimScan=true. Prior to this bugfix, _bt_parallel_seize() was called after _bt_readpage() sets so->needPrimScan=true, and it just returned false without calling _bt_parallel_done().

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Sun, Nov 10, 2024 at 9:53 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> I understand, thanks to your explanation.

Cool.

> Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
> _bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is called
> with so->needPrimScan=true. Prior to this bugfix, _bt_parallel_seize() was
> called after _bt_readpage() sets so->needPrimScan=true, and it just returned
> false without calling _bt_parallel_done().

You influenced me to add something about this to my follow-up commit caca6d8d:

--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2230,8 +2230,9 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
                            !so->currPos.moreRight : !so->currPos.moreLeft))
        {
                /* most recent _bt_readpage call (for lastcurrblkno) ended scan */
+               Assert(so->currPos.currPage == lastcurrblkno && !seized);
                BTScanPosInvalidate(so->currPos);
-               _bt_parallel_done(scan);
+               _bt_parallel_done(scan);        /* iff !so->needPrimScan */
                return false;
        }

I added "iff !so->needPrimScan" to draw attention to the fact that we don't necessarily really end the parallel scan when _bt_parallel_done is called.

--
Peter Geoghegan
On 2024-11-11 12:19, Peter Geoghegan wrote:
> On Sun, Nov 10, 2024 at 9:53 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>> I understand, thanks to your explanation.
>
> Cool.
>
>> Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
>> _bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is called
>> with so->needPrimScan=true. Prior to this bugfix, _bt_parallel_seize() was
>> called after _bt_readpage() sets so->needPrimScan=true, and it just returned
>> false without calling _bt_parallel_done().
>
> You influenced me to add something about this to my follow-up commit
> caca6d8d:
>
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -2230,8 +2230,9 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
>                             !so->currPos.moreRight : !so->currPos.moreLeft))
>         {
>                 /* most recent _bt_readpage call (for lastcurrblkno) ended scan */
> +               Assert(so->currPos.currPage == lastcurrblkno && !seized);
>                 BTScanPosInvalidate(so->currPos);
> -               _bt_parallel_done(scan);
> +               _bt_parallel_done(scan);        /* iff !so->needPrimScan */
>                 return false;
>         }
>
> I added "iff !so->needPrimScan" to draw attention to the fact that we
> don't necessarily really end the parallel scan when _bt_parallel_done
> is called.

Thanks! The change made it easier for me to understand.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On 2024-11-13 00:55, Peter Geoghegan wrote:
> On Sun, Nov 10, 2024 at 11:36 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>> Thanks! The change made it easier for me to understand.
>
> As follow-up to all of the recent work in this area, I'd like to add
> this wrapper function to return the next item from so->currPos.
>
> The wrapper function has extra assertions, compared to what we do
> already. It's slightly more defensive, and IMV slightly clearer.

Thanks, I agree with adding the function for refactoring and including assertions for moreLeft or moreRight.

One thing I was concerned about is that "if (scan->xs_want_itup)" was changed to "if (so->currTuples)". However, this isn't an issue because so->currTuples is only initialized if scan->xs_want_itup is set to true in btrescan(), and it improves consistency with other functions for index-only scans.

I also confirmed that make check-world passes with the patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
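As a rough illustration of the kind of wrapper being discussed, returning the next item from so->currPos with some extra defensive assertions (the function name, assertion set, and field accesses here are assumptions based on the thread, not necessarily the committed code):

    static void
    _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
    {
        BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];

        /* extra defensive checks, per the discussion above */
        Assert(BTScanPosIsValid(so->currPos));
        Assert(so->currPos.itemIndex >= so->currPos.firstItem);
        Assert(so->currPos.itemIndex <= so->currPos.lastItem);

        /* return next item, per amgettuple contract */
        scan->xs_heaptid = currItem->heapTid;
        if (so->currTuples)     /* in practice the same as scan->xs_want_itup */
            scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
    }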
On Tue, Nov 12, 2024 at 10:14 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> On 2024-11-13 00:55, Peter Geoghegan wrote:
> > The wrapper function has extra assertions, compared to what we do
> > already. It's slightly more defensive, and IMV slightly clearer.
>
> Thanks, I agree with adding the function for refactoring and including
> assertions for moreLeft or moreRight.

Pushed. Thanks again

--
Peter Geoghegan