Returning nbtree posting list TIDs in DESC order during backwards scans - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Returning nbtree posting list TIDs in DESC order during backwards scans
Msg-id: CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com
Currently, nbtree always returns individual posting list TIDs to the
scan in ASC order -- even during a backwards scan (more on why the
deduplication patch deliberately did things that way later). But
allowing backwards scans to return TIDs in "somewhat descending order"
like this results in needless unpinning and re-pinning of the same
heap page. This seems worth fixing.

Patch goals
===========

nbtree scans ought to guarantee that they'll return all TIDs in
whatever order they appear on the page, when read either
front-to-back, or back-to-front (whichever order matches the scan's
current direction). Individual posting list TIDs should also be
returned in either front-to-back or back-to-front order, also
according to the scan's direction. In general, the order that TIDs are
returned in ought to never be affected by the use of deduplication --
including during backwards scans.

The attached patch teaches nbtree's _bt_readpage function to return
TIDs from a posting list in DESC order during backwards scans,
bringing nbtree in line with the ideal described in the previous
paragraph.
This approach seems more principled to me, and is likely to have a
small performance benefit.

I'll submit this to the first commitfest for Postgres 19.

Test case
=========

Here's an example of the problem on HEAD that the patch fixes:

-- Setup table with low cardinality index:
create table duplicate_test(dup int4);
create index on duplicate_test (dup);
insert into duplicate_test
select val from generate_series(1, 18) val,
                generate_series(1, 1000) dups_per_val;

-- Forwards scan, gets 23 buffer hits:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup;

-- Equivalent backwards scan, gets 42 buffer hits on HEAD:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup desc;

With the patch applied, *both* queries get the expected 23 buffer hits.

I am generally suspicious of cases where backwards scans are at a
gratuitous disadvantage relative to equivalent forwards scans. See
previous work in commit c9c0589f and commit 1bd4bc85 for earlier
examples of fixing problems that have that same quality to them. This
might be the last such patch that I'll ever have to write.

Sorting so->killedItems[] in _bt_killitems
==========================================

Making this change in _bt_readpage creates a problem in _bt_killitems:
it currently expects posting list TIDs in ascending order (the loop
invariant relies on that), which is why the deduplication patch didn't
just do things like this in _bt_readpage to begin with. The patch
deals with that new problem by qsort'ing the so->killedItems[] array
(at the top of _bt_killitems) such that the items always appear
exactly once, in the expected ASC order.

Another benefit of the qsort approach in _bt_killitems is that it'll
gracefully handle scrollable cursor scans that happen to scroll back
and forth on the same leaf page. This might result in the same dead
TID being added to the so->killedItems[] array multiple times, in
neither ASC nor DESC order. That also confuses the loop inside
_bt_killitems, in a way that needlessly prevents setting LP_DEAD bits
(this is a preexisting problem, not a problem created by the changes
that the patch makes to _bt_readpage). Having duplicate TIDs in
so->killedItems[] like this isn't particularly likely, but it seems
worth handling properly all the same, on general principle.

I tend to doubt that the overhead of the qsort will ever really
matter, but if it does then we can always limit it to backwards scans
(meaning limit it to cases where so->currPos.dir indicates that
_bt_readpage processed the page in the backwards scan direction), and
drop the goal of dealing with duplicate so->killedItems[] TIDs
gracefully.

Note that the qsort is completely useless during standard forwards
scans. It's hard to tell those apart from scrollable cursor scans that
briefly changed directions "within a page", though, so I'm inclined to
just always do the qsort, rather than trying to cleverly avoid it.
That's what happens in this v1 of the patch (we qsort unconditionally
in _bt_killitems).

-- 
Peter Geoghegan

