From | Peter Geoghegan |
---|---|
Subject | Returning nbtree posting list TIDs in DESC order during backwards scans |
Date | |
Msg-id | CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com |
List | pgsql-hackers |
Currently, nbtree always returns individual posting list TIDs to the scan in ASC order -- even during a backwards scan (more on why the deduplication patch deliberately did things that way later). But allowing backwards scans to return TIDs in "somewhat descending order" like this results in needless unpinning and re-pinning of the same heap page. This seems worth fixing.

Patch goals
===========

nbtree scans ought to guarantee that they'll return all TIDs in whatever order they appear on the page, when read either front-to-back or back-to-front (whichever order matches the scan's current direction). Individual posting list TIDs should also be returned in either front-to-back or back-to-front order, again according to the scan's direction. In general, the order that TIDs are returned in ought never to be affected by the use of deduplication -- including during backwards scans.

The attached patch teaches nbtree's _bt_readpage function to return TIDs from a posting list in DESC order during backwards scans, bringing nbtree in line with the ideal described in the previous paragraph. This approach seems more principled to me, and is likely to have a small performance benefit. I'll submit this to the first commitfest for Postgres 19.

Test case
=========

Here's an example of the problem on HEAD, which the patch fixes:

-- Setup table with low cardinality index:
create table duplicate_test(dup int4);
create index on duplicate_test (dup);
insert into duplicate_test
  select val from generate_series(1, 18) val,
                  generate_series(1, 1000) dups_per_val;

-- Forwards scan, gets 23 buffer hits:
explain analyze select ctid, * from duplicate_test
  where dup < 5 order by dup;

-- Equivalent backwards scan, gets 42 buffer hits on HEAD:
explain analyze select ctid, * from duplicate_test
  where dup < 5 order by dup desc;

With the patch applied, *both* queries get the expected 23 buffer hits.

I am generally suspicious of cases where backwards scans are at a gratuitous disadvantage relative to equivalent forwards scans. See previous work in commits c9c0589f and 1bd4bc85 for earlier examples of fixing problems that have that same quality to them. This might be the last such patch that I'll ever have to write.

Sorting so->killedItems[] in _bt_killitems
==========================================

Making this change in _bt_readpage creates a problem in _bt_killitems: it currently expects posting list TIDs in ascending order (its loop invariant relies on that), which is why the deduplication patch didn't just do things like this in _bt_readpage to begin with. The patch deals with that new problem by qsort'ing the so->killedItems[] array (at the top of _bt_killitems) such that the items always appear exactly once, in the expected ASC order.

Another benefit of the qsort approach in _bt_killitems is that it gracefully handles scrollable cursor scans that happen to scroll back and forth on the same leaf page. Such scans might add the same dead TID to the so->killedItems[] array multiple times, in neither ASC nor DESC order. That also confuses the loop inside _bt_killitems, in a way that needlessly prevents setting LP_DEAD bits (this is a preexisting problem, not a problem created by the patch's changes to _bt_readpage). Having duplicate TIDs in so->killedItems[] like this isn't particularly likely, but it seems worth handling properly all the same, on general principle.
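To make the sort-then-deduplicate step concrete, here is a minimal standalone sketch (not code from the patch). The DemoTid struct and helper names are invented stand-ins: the real so->killedItems[] stores indexes into so->currPos.items rather than bare TIDs, and real TIDs are ItemPointerData compared with ItemPointerCompare(). The technique is the same, though: one qsort in ASC TID order, then one pass that squeezes out duplicates in place.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's ItemPointerData (a heap TID) */
typedef struct DemoTid
{
    uint32_t    block;          /* heap block number */
    uint16_t    offset;         /* line pointer offset within block */
} DemoTid;

/* qsort comparator: ASC by block, then offset (what ItemPointerCompare does) */
static int
demo_tid_cmp(const void *a, const void *b)
{
    const DemoTid *ta = (const DemoTid *) a;
    const DemoTid *tb = (const DemoTid *) b;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

/* Sort TIDs in ASC order and remove duplicates in place; returns new count */
static int
demo_sort_unique(DemoTid *tids, int ntids)
{
    int         nout = 0;

    if (ntids == 0)
        return 0;
    qsort(tids, ntids, sizeof(DemoTid), demo_tid_cmp);
    for (int i = 1; i < ntids; i++)
    {
        if (demo_tid_cmp(&tids[nout], &tids[i]) != 0)
            tids[++nout] = tids[i];
    }
    return nout + 1;
}

int
main(void)
{
    /* Out-of-order, duplicated TIDs, as a scrollable cursor might leave them */
    DemoTid     tids[] = {{7, 3}, {2, 1}, {7, 3}, {2, 5}, {2, 1}};
    int         n = demo_sort_unique(tids, 5);

    for (int i = 0; i < n; i++)
        printf("(%u,%u)\n", (unsigned) tids[i].block, (unsigned) tids[i].offset);
    /* prints (2,1) (2,5) (7,3): each TID exactly once, in ASC order */
    return 0;
}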
I tend to doubt that the overhead of the qsort will ever really matter, but if it does then we can always limit it to backwards scans (meaning limit it to cases where so->currPos.dir indicates that _bt_readpage processed the page in the backwards scan direction), and drop the goal of dealing with duplicate so->killedItems[] TIDs gracefully.

Note that the qsort is completely useless during standard forwards scans. It's hard to tell those apart from scrollable cursor scans that briefly changed direction "within a page", though, so I'm inclined to just always do the qsort, rather than trying to cleverly avoid it. That's what happens in this v1 of the patch (we qsort unconditionally in _bt_killitems).
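For reference, the backwards-scans-only alternative described above might look something like the following sketch. The btkillitems_cmp comparator name is invented here for illustration; so->currPos.dir, so->killedItems, so->numKilled, ScanDirectionIsBackward(), qsort_arg(), and ItemPointerCompare() all exist in the tree today. Again, this is a sketch of the alternative, not what v1 does:

/*
 * Hypothetical comparator (name invented for this sketch): orders
 * so->killedItems[] entries, which are indexes into so->currPos.items,
 * by the heap TID that each item points to.
 */
static int
btkillitems_cmp(const void *a, const void *b, void *arg)
{
    BTScanOpaque so = (BTScanOpaque) arg;
    BTScanPosItem *itema = &so->currPos.items[*(const int *) a];
    BTScanPosItem *itemb = &so->currPos.items[*(const int *) b];

    return ItemPointerCompare(&itema->heapTid, &itemb->heapTid);
}

/* ... at the top of _bt_killitems ... */

    /*
     * Only pay for the sort when _bt_readpage processed the page in the
     * backwards direction; a forwards _bt_readpage already left the
     * entries in ASC TID order (this variant gives up on handling
     * duplicate entries left behind by scrollable cursors)
     */
    if (ScanDirectionIsBackward(so->currPos.dir))
        qsort_arg(so->killedItems, so->numKilled, sizeof(int),
                  btkillitems_cmp, so);

--
Peter Geoghegan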