Re: Returning nbtree posting list TIDs in DESC order during backwards scans - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Returning nbtree posting list TIDs in DESC order during backwards scans
Date
Msg-id CAH2-Wznq0zWu5bMEmaFZ+s3mkNpHZm-nAD=7yx_eS4t3=jZEUA@mail.gmail.com
Whole thread Raw
In response to Re: Returning nbtree posting list TIDs in DESC order during backwards scans  (Victor Yegorov <vyegorov@gmail.com>)
Responses Re: Returning nbtree posting list TIDs in DESC order during backwards scans
List pgsql-hackers
On Wed, Dec 3, 2025 at 10:18 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> Patch looks fine, applies and compiles cleanly, passes tests.

This patch was trickier than initially expected. I paid no attention
to the possible downside of changing the posting list iteration code
in _bt_readpage (i.e. from scan posting lists in descending order for
backwards scans), which was an oversight.

It seems that we're very sensitive to how the compiler allocates
registers within _bt_readpage, which led to v4 of the patch (and
presumably all earlier versions) not storing itemIndex in a register.
With v4, the compiler spilled the itemIndex comparison to the stack
(at least on my machine, which uses GCC 15.2 from Debian unstable),
and so had to read itemIndex from memory on each loop iteration. This
regressed pgbench's SELECT workload by about 1% of total TPS (on my
Zen 3 CPU).

Attached v5 avoids the regression by tweaking _bt_readpage. I will
commit this version soon (I really mean it this time!).

I'm not sure why these changes have the intended effect -- I mostly
figured all this out through trial and error. Though I can say that my
testing showed that _not_ changing the posting list iteration code in
the _bt_readpage forwards scan loop makes the crucial difference. The
other tweaks probably aren't strictly necessary, but they seem to make
the compiler generate ever so slightly faster code (at least with
pgbench SELECT), so I kept them in.

I also gave up on the idea of using a bitmapset for v5 -- the issue
with regressing _bt_readpage discouraged me from adding code that
allocates memory (more than once per scan) within btgettuple, which is
a performance-critical path. We can simply sort and unique-ify the
killedItems array from within _bt_killitems instead -- which is
largely the same way that my original v1 did it. That way it won't
matter what order we append to killItems in, relative to the scan
direction. (The downside of sticking with an array for killedItems is
that we can still run out of array space in extreme cases involving
scrollable cursors that return many killable tuples, and repeatedly
append the same TID to killedItems[], but that doesn't seem like much
of a loss to me -- that almost never happens in practice.)

> I'd like to point out a missing space after the dot in the 2nd para of the commit message,
> falls out of style.

Fixed that too.

Thanks

--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Jim Mlodgenski
Date:
Subject: INOUT params with expanded objects
Next
From: Tom Lane
Date:
Subject: Re: Solaris versus our NLS files