Thread: Re: strange perf regression with data checksums
Hi Tomas,

> While running some benchmarks comparing 17 and 18, I ran into a simple
> workload where 18 throughput drops by ~80%. After pulling my hair for a
> couple hours I realized the change that triggered this is 04bec894a04c,
> which set checksums on by default. Which is very bizarre, because the
> workload is read-only and fits into shared buffers.
>
> [...]
>
> But why would it depend on checksums at all? This read-only test should
> be entirely in-memory, so how come it's affected?

These are interesting results.

Just wanted to clarify: did you make sure that all the hint bits were set before executing the benchmark?

I'm not claiming that hint bits are necessarily the reason for the observed behavior, but when something is off with presumably read-only queries this is the first reason that comes to mind. At least we should make sure hint bits are excluded from the equation. If memory serves, VACUUM FULL and CHECKPOINT after filling the table and creating the index should do the trick.

--
Best regards,
Aleksander Alekseev
On 5/9/25 14:53, Aleksander Alekseev wrote:
> Hi Tomas,
>
>> While running some benchmarks comparing 17 and 18, I ran into a simple
>> workload where 18 throughput drops by ~80%. After pulling my hair for a
>> couple hours I realized the change that triggered this is 04bec894a04c,
>> which set checksums on by default. Which is very bizarre, because the
>> workload is read-only and fits into shared buffers.
>>
>> [...]
>>
>> But why would it depend on checksums at all? This read-only test should
>> be entirely in-memory, so how come it's affected?
>
> These are interesting results.
>
> Just wanted to clarify: did you make sure that all the hint bits were
> set before executing the benchmark?
>
> I'm not claiming that hint bits are necessarily the reason for the
> observed behavior but when something is off with presumably read-only
> queries this is the first reason that comes to mind. At least we
> should make sure hint bits are excluded from the equation. If memory
> serves, VACUUM FULL and CHECKPOINT after filling the table and
> creating the index should do the trick.

Good question. I haven't checked that explicitly, but it's a tiny data set (15MB) and I observed this even on long benchmarks with tens of millions of queries. So the hint bits should have been set.

Also, I should have mentioned the query does an index-only scan, and the pin/unpin calls are on index pages, not on the heap.

regards

--
Tomas Vondra
Hi,

> > I'm not claiming that hint bits are necessarily the reason for the
> > observed behavior but when something is off with presumably read-only
> > queries this is the first reason that comes to mind. At least we
> > should make sure hint bits are excluded from the equation. If memory
> > serves, VACUUM FULL and CHECKPOINT after filling the table and
> > creating the index should do the trick.
>
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

There is one more thing I would check. As I recall, perf shows only on-CPU time, while the backends may actually be sleeping on locks most of the time. If that is the case, perf will not show you an accurate picture.

To check this myself, I create gdb.script with a single GDB command:

```
bt
```

And execute:

```
gdb --batch --command=gdb.script -p (backend_pid_here)
```

... 10+ times or so. If what you are observing is actually lock contention and the backend sleeps on a lock most of the time, 8/10 or so stacktraces will show you this.

I assume, of course, that the benchmark is done on release builds with Asserts disabled, etc.

BTW, do you believe this is a problem related exclusively to NUMA CPUs with 90+ cores, or can I reproduce it on SMT as well?

--
Best regards,
Aleksander Alekseev
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

We don't actually need to call BufferGetLSNAtomic() from _bt_readpage during index-only scans (nor during bitmap index scans). We can just not call BufferGetLSNAtomic() at all (except during plain index scans), with no possible downside.

In general, we call BufferGetLSNAtomic() to stash the page LSN within so->currPos.lsn, for later. so->currPos.lsn provides us with a way to detect whether the page was modified during the period in which we dropped our pin on the leaf page -- plain index scans cannot set LP_DEAD bits on dead index tuples within _bt_killitems() if the page has changed. But index-only scans never drop the pin on the leaf page to begin with, and so don't even use so->currPos.lsn (bitmap index scans *never* call _bt_killitems(), and so obviously have no possible use for so->currPos.lsn either).

--
Peter Geoghegan
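To make the role of so->currPos.lsn concrete, here is a minimal sketch of the recheck it enables. This is illustrative only, not the actual _bt_killitems() code; rel and so are assumed to be the index relation and the scan's BTScanOpaque.

```c
/*
 * Sketch: recheck the leaf page's LSN after re-acquiring a pin that was
 * previously dropped.  If the LSN moved, the page was modified while
 * unpinned (possibly by VACUUM), so setting LP_DEAD bits is unsafe.
 */
Buffer      buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
{
    /* Page changed since _bt_readpage saved its LSN: give up */
    _bt_relbuf(rel, buf);
    return;
}

/* ... otherwise it is safe to mark known-dead index tuples LP_DEAD ... */
_bt_relbuf(rel, buf);
```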
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
> For index-only scans, yes.

Great.

> The regular index scans however still have this issue, although it's not
> as visible as for IOS.

We can do somewhat better with plain index scans than my initial v1 prototype, without any major difficulties. There's more low-hanging fruit.

We could also move the call to BufferGetLSNAtomic (that currently takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin. That way we'd only need to call BufferGetLSNAtomic for those leaf pages that will actually need to have some index tuples returned to the scan (through the btgettuple interface). In other words, we only need to call BufferGetLSNAtomic for pages that _bt_readpage returns "true" for when called. There are plenty of leaf pages that _bt_readpage will return "false" for, especially during large range scans and skip scans.

It's easy to see why this extension to my v1 POC is correct: the whole point of dropping the leaf page pin is that we don't block VACUUM when btgettuple returns -- but btgettuple isn't going to return until the next call to _bt_readpage that returns "true" actually takes place (or until the whole scan ends).

--
Peter Geoghegan
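A rough sketch of the restructuring described above: the LSN is captured only at the point where the pin is actually about to be dropped. The so->dropPin flag used here is the per-scan flag discussed later in the thread; the helper body below is a sketch, not the patch itself.

```c
/*
 * Sketch: release the lock on so->currPos.buf, and drop the pin too when
 * the scan is allowed to.  The page LSN is read only in the pin-dropping
 * case, since that is the only case where _bt_killitems will need it.
 */
static void
drop_lock_and_maybe_pin_sketch(Relation rel, BTScanOpaque so)
{
    if (!so->dropPin)
    {
        /* Keep the pin: just release the buffer lock */
        _bt_unlockbuf(rel, so->currPos.buf);
        return;
    }

    /* Save the LSN so that a later _bt_killitems can detect changes */
    so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);

    /* Drop both the lock and the pin */
    _bt_relbuf(rel, so->currPos.buf);
    so->currPos.buf = InvalidBuffer;
}
```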
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> I think we can do better - something like
>
> #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
>     lsn = PageGetLSN(page);
> #else
>     buf_state = LockBufHdr(bufHdr);
>     lsn = PageGetLSN(page);
>     UnlockBufHdr(bufHdr, buf_state);
> #endif
>
> All perf relevant systems support reading 8 bytes without a chance of
> tearing...

Even assuming that this scheme works perfectly, I'd still like to pursue the idea I had about fixing this in nbtree.

The relevant nbtree code seems more elegant if we avoid calling BufferGetLSNAtomic() unless and until its return value might actually be useful. It's quite a lot easier to understand the true purpose of so->currPos.lsn with this new structure.

--
Peter Geoghegan
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
> Same effect as v1 for IOS, with regular index scans I see this:
>
> 64 clients: 0.7M tps
> 96 clients: 1.5M tps
>
> So very similar improvement as for IOS.

Just to be clear, you're saying my v2 appears to fix the problem fully, for both plain index scans and index-only scans? (You haven't mentioned bitmap index scans at all, I think, even though they are relevant to v2.)

I'd be surprised if my v2 *fully* fixed the issue with plain index scans. It's only going to avoid calls to BufferGetLSNAtomic() following _bt_readpage calls that return false, but I doubt that your particular pgbench-variant test queries really look like that.

*Looks again* Oh, wait. This is another one of those benchmarks with an index scan that returns no rows whatsoever (just like on the other recent thread involving regressions tied to the introduction of a new skip support function to nbtree). It fully makes sense that my v2 would fix that particular problem, even with plain index scans. But...what about slightly different queries, that actually do return some rows? Those will look different.

--
Peter Geoghegan
On 5/19/25 22:29, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Same effect as v1 for IOS, with regular index scans I see this:
>>
>> 64 clients: 0.7M tps
>> 96 clients: 1.5M tps
>>
>> So very similar improvement as for IOS.
>
> Just to be clear, you're saying my v2 appears to fix the problem
> fully, for both plain index scans and index-only scans? (You haven't
> mentioned bitmap index scans at all, I think, even though they are
> relevant to v2.)

With v2 the regression disappears, both for index-only scans and regular index scans. I haven't tried anything with bitmap scans - I hit the regression mostly by accident, on a workload that does IOS only. I may try constructing something with bitmap scans, but I didn't have time for that right now.

> I'd be surprised if my v2 *fully* fixed the issue with plain index
> scans. It's only going to avoid calls to BufferGetLSNAtomic()
> following _bt_readpage calls that return false, but I doubt that your
> particular pgbench-variant test queries really look like that.

Not sure. My workload is pretty simple, and similar to what I used in the other nbtree regression (with malloc overhead), except that it's not using partitioning.

  pgbench -i -s 1 test

  psql test <<EOF
  update pgbench_accounts set bid = aid;
  create index on pgbench_accounts(bid);
  vacuum full;
  analyze;
  EOF

  # select.sql
  set enable_indexonlyscan = off;
  select count(*) from pgbench_accounts where bid = 1

I don't know what "fully fix" means in this context. I see a massive improvement with v2, I have no idea if that's the best we could do.

> *Looks again* Oh, wait. This is another one of those benchmarks with
> an index scan that returns no rows whatsoever (just like on the other
> recent thread involving regressions tied to the introduction of a new
> skip support function to nbtree). Fully makes sense that my v2
> would fix that particular problem, even with plain index scans.
> But...what about slightly different queries, that actually do return
> some rows? Those will look different.

Actually, this particular query does return rows (one, to be precise). But you're right - it seems sensitive to how many rows are returned, and at some point the contention goes away and there's no regression.

I need to do proper automated testing, to get reliable results. I've been doing manual testing, but it's easy to make mistakes that way.

Do you have any suggestions what cases you'd like me to test?

regards

--
Tomas Vondra
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
> With v2 the regression disappears, both for index-only scans and regular
> index scans. I haven't tried anything with bitmap scans - I hit the
> regression mostly by accident, on a workload that does IOS only. I may
> try constructing something with bitmap scans, but I didn't have time for
> that right now.

I imagine bitmap index scans will be similar to plain index scans.

> I don't know what "fully fix" means in this context. I see a massive
> improvement with v2, I have no idea if that's the best we could do.

You expected there to be *zero* performance impact from enabling checksums for this workload, since it is a pure read-only workload. That's what I meant by "fully fix".

> But you're right - it seems sensitive to how many rows are returned, and
> at some point the contention goes away and there's no regression.
>
> I need to do proper automated testing, to get reliable results. I've
> been doing manual testing, but it's easy to make mistakes that way.
>
> Do you have any suggestions what cases you'd like me to test?

Nothing comes to mind. Again, just be aware that completely avoiding the call to BufferGetLSNAtomic is only possible when:

* Scan is an index-only scan, OR
* Scan is a bitmap index scan, OR
* Scan is a plain index scan, reading a page that _bt_readpage() returned "false" for when called.

In other words, plain index scans that read a lot of tuples might receive no benefit whatsoever. It's possible that it already matters less there anyway. It's also possible that there is some unforeseen benefit from merely *delaying* the call to BufferGetLSNAtomic. But in all likelihood these "unfixed" plain index scans are just as fast with v2 as they are when run on master/baseline.

--
Peter Geoghegan
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I imagine bitmap index scans will be similar to plain index scans.

To be clear, I meant that the magnitude of the problem will be similar. I do not mean that they aren't fixed by my v2. They should be fully fixed by v2.

--
Peter Geoghegan
Hi,

On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we can do better - something like
> >
> > #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
> >     lsn = PageGetLSN(page);
> > #else
> >     buf_state = LockBufHdr(bufHdr);
> >     lsn = PageGetLSN(page);
> >     UnlockBufHdr(bufHdr, buf_state);
> > #endif
> >
> > All perf relevant systems support reading 8 bytes without a chance of
> > tearing...
>
> Even assuming that this scheme works perfectly, I'd still like to
> pursue the idea I had about fixing this in nbtree.
>
> The relevant nbtree code seems more elegant if we avoid calling
> BufferGetLSNAtomic() unless and until its return value might actually
> be useful. It's quite a lot easier to understand the true purpose of
> so->currPos.lsn with this new structure.

I'm not against that - ISTM we should do both.

Greetings,

Andres Freund
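For concreteness, a sketch of what folding that #ifdef into BufferGetLSNAtomic() might look like, based on the function's general shape; treat the details as an assumption rather than a proposal from this thread.

```c
/*
 * Sketch: BufferGetLSNAtomic() with the 8-byte single-copy-atomicity fast
 * path suggested above.  On platforms that can read an 8-byte value
 * without tearing, the buffer header spinlock is unnecessary.
 */
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    char       *page = BufferGetPage(buffer);
    XLogRecPtr  lsn;

    /*
     * If hint bits aren't WAL-logged (no data checksums, no wal_log_hints)
     * or the buffer is backend-local, no locking was ever needed.
     */
    if (!XLogHintBitIsNeeded() || BufferIsLocal(buffer))
        return PageGetLSN(page);

    Assert(BufferIsValid(buffer));
    Assert(BufferIsPinned(buffer));

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    lsn = PageGetLSN(page);
#else
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
        uint32      buf_state;

        buf_state = LockBufHdr(bufHdr);
        lsn = PageGetLSN(page);
        UnlockBufHdr(bufHdr, buf_state);
    }
#endif

    return lsn;
}
```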
Hi,

We had an RMT meeting yesterday, and we briefly discussed whether the v2 patch should go into PG18. We concluded we probably should try to do that.

On the one hand, it's not a PG18-specific bug, in the sense that older releases are affected the same way. But it requires data checksums to be enabled, and a lot of systems run with checksums=off, simply because that's the default. But 18 will (likely) change that.

We didn't get any complaints about this (at least I'm not aware of any), but I suppose that's simply because people didn't have a comparison, and treated it as "normal". But with the changed default, it'll be easier to spot once they upgrade to PG18.

So better to get this in now, otherwise we may have to wait until PG19, because of ABI (the patch adds a field into BTScanPosData, but maybe it'd be possible to add it into padding, not sure).

regards

--
Tomas Vondra
On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> So better to get this in now, otherwise we may have to wait until PG19,
> because of ABI (the patch adds a field into BTScanPosData, but maybe
> it'd be possible to add it into padding, not sure).

I agree. I can get this into shape for commit today.

Does anybody object to my proceeding with committing the patch on the master branch/putting it in Postgres 18? (FWIW I could probably fit the new field into some BTScanPosData alignment padding, but I don't favor back patching.)

I consider my patch to be low risk. There's a kind of symmetry to how things work with the patch in place, which IMV makes things simpler.

--
Peter Geoghegan
On Wed, Jun 4, 2025 at 10:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
>> So better to get this in now, otherwise we may have to wait until PG19,
>> because of ABI (the patch adds a field into BTScanPosData, but maybe
>> it'd be possible to add it into padding, not sure).
>
> I agree. I can get this into shape for commit today.

Attached is v3, which is functionally the same as v2. It has improved comments, and a couple of extra assertions. Plus there's a real commit message now.

I also relocated the code that sets so.drop_pin from _bt_preprocess_keys to btrescan. That seems like the more logical place for it.

My current plan is to commit this in the next couple of days.

--
Peter Geoghegan
Attachment
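For readers following along, the decision being moved into btrescan looks roughly like the assignment below. The exact set of conditions is the patch's; the ones shown are a plausible sketch (using the so->dropPin spelling from later messages), not a quotation.

```c
/*
 * Sketch (in btrescan): decide once per rescan whether leaf-page pins can
 * be dropped eagerly.  The individual conditions are illustrative.
 */
so->dropPin = (!scan->xs_want_itup &&           /* not an index-only scan */
               IsMVCCSnapshot(scan->xs_snapshot) &&
               RelationNeedsWAL(scan->indexRelation) &&
               scan->heapRelation != NULL);      /* not a bitmap index scan */
```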
On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My current plan is to commit this in the next couple of days.

Pushed.

It would be great if we could also teach BufferGetLSNAtomic() to just call PageGetLSN() (at least on PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY platforms), but my sense is that that's not going to happen in Postgres 18.

--
Peter Geoghegan
Hello Peter,
06.06.2025 19:33, Peter Geoghegan wrote:
> On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> My current plan is to commit this in the next couple of days.
>
> Pushed.
Please look at the following script, which falsifies an assertion
introduced with e6eed40e4:
create user u;
grant all on schema public to u;
set session authorization u;
create table tbl1 (a int);
create function f1() returns int language plpgsql as $$ invalid $$;
select sum(1) over (partition by 1 order by objid)
from pg_shdepend
left join pg_stat_sys_indexes on refclassid = relid
left join tbl1 on true
limit 1;
(I've simplified an assert-triggering query generated by SQLsmith.)
TRAP: failed Assert("!BTScanPosIsPinned(so->currPos)"), File: "nbtutils.c", Line: 3379, PID: 1621028
ExceptionalCondition at assert.c:52:13
_bt_killitems at nbtutils.c:3380:3
_bt_steppage at nbtsearch.c:2134:5
_bt_next at nbtsearch.c:1560:7
btgettuple at nbtree.c:276:6
index_getnext_tid at indexam.c:640:25
index_getnext_slot at indexam.c:729:10
IndexNext at nodeIndexscan.c:131:9
ExecScanFetch at execScan.h:126:10
...
Best regards,
Alexander
Hello, Peter!

> I also relocated the code that sets so.drop_pin from
> _bt_preprocess_keys to btrescan. That seems like the more logical
> place for it.

I was rebasing [0] and noticed dropPin is not initialized in btbeginscan, which seems suspicious to me. Is that intended?

[0]: https://commitfest.postgresql.org/patch/5151/
Hi Alexander,

On Mon, Jun 9, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Please look at the following script, which falsifies an assertion
> TRAP: failed Assert("!BTScanPosIsPinned(so->currPos)"), File: "nbtutils.c", Line: 3379, PID: 1621028

I can reproduce this. I think that it's an old bug.

The problem is related to mark/restore (used by merge joins), where nbtree can sometimes independently hold a second pin on leaf pages (albeit very briefly, inside _bt_steppage). The cause of the failure of my recently-added assertion is that it assumes that so->dropPin always implies !BTScanPosIsPinned(so->currPos) at the start of _bt_killitems.

I don't think that that assumption is unreasonable, even in light of this assertion problem. I think that the real problem is the way that _bt_killitems releases resources. _bt_killitems doesn't try to leave so->currPos as it was at the start of its call -- this is nothing new. The overall effect is that during so->dropPin _bt_steppage calls, we *generally* don't call IncrBufferRefCount in the "bump pin on current buffer for assignment to mark buffer" path, but *will* call IncrBufferRefCount when we happened to need to call _bt_killitems. Calling _bt_killitems shouldn't have these side-effects.

I can make the assertion failure go away by teaching _bt_killitems to leave so->currPos in the same state that it was in when _bt_killitems was called. If there was no pin on the page when _bt_killitems was called, then there should be no pin held when it returns. See the attached patch.

As I said, I think that this is actually an old bug. After all, _bt_killitems is required to test so->currPos.lsn against the same page's current LSN if the page pin was dropped since _bt_readpage was first called -- regardless of any other details. I don't think that it'll consistently do that in this hard-to-test mark/restore path on any Postgres version, so there might be a risk of failing to detect an unsafe concurrent TID recycling hazard.

I've always thought that the whole idiom of testing BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if it makes sense to totally replace those calls/tests with similar tests of the new so->dropPin field.

--
Peter Geoghegan
Attachment
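The fix described above amounts to making _bt_killitems pin-neutral: its pin state on exit matches its pin state on entry. A simplified sketch of that shape follows (illustrative helper, not the attached patch).

```c
/*
 * Sketch: _bt_killitems leaves so->currPos's pin exactly as it found it.
 * If it has to re-acquire a pin (because the scan dropped it earlier), it
 * releases that pin again before returning.
 */
static void
kill_items_pin_neutral_sketch(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    Relation    rel = scan->indexRelation;
    bool        pinned_on_entry = BTScanPosIsPinned(so->currPos);
    Buffer      buf;

    if (pinned_on_entry)
    {
        /* Pin never dropped: the page cannot have changed under us */
        buf = so->currPos.buf;
        _bt_lockbuf(rel, buf, BT_READ);
    }
    else
    {
        /* Pin was dropped: re-pin and verify the page is unchanged */
        buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
        if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
        {
            _bt_relbuf(rel, buf);
            return;             /* page changed; don't set LP_DEAD bits */
        }
    }

    /* ... set LP_DEAD bits on known-dead index tuples here ... */

    if (pinned_on_entry)
        _bt_unlockbuf(rel, buf);    /* keep the caller's pin */
    else
        _bt_relbuf(rel, buf);       /* drop the pin we acquired */
}
```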
On Mon, Jun 9, 2025 at 8:48 AM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> I was rebasing [0] and noticed dropPin is not initialized in
> btbeginscan, which seems suspicious to me.
> Is that intended?

That's pretty normal. We don't have access to the scan descriptor within btbeginscan. This is also why we do things like allocate so->currTuples within btrescan. We don't yet know if the scan will be an index-only scan when btbeginscan is called.

--
Peter Geoghegan
On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've always thought that the whole idiom of testing
> BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
> it makes sense to totally replace those calls/tests with similar tests
> of the new so->dropPin field.

It's definitely possible to avoid testing BTScanPosIsPinned like this entirely. The attached v2 patch shows how. v2 fully removes BTScanPosUnpinIfPinned, and limits the use of BTScanPosIsPinned to assertions.

I think that this approach makes our buffer pin resource management strategy much clearer, particularly in cases involving mark/restore. This might help with Tomas Vondra's index prefetching patch -- I know that Tomas found the mark/restore aspects of his latest approach challenging.

In v2, I added an assertion at the top of _bt_steppage that independently verifies the same conditions as the ones that are already tested whenever _bt_steppage calls _bt_killitems (these include the assertion that Alexander showed could fail):

static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    BlockNumber blkno,
                lastcurrblkno;

    Assert(BTScanPosIsValid(so->currPos));
    if (!so->dropPin)
        Assert(BTScanPosIsPinned(so->currPos));
    else
        Assert(!BTScanPosIsPinned(so->currPos));
    ...

That way, if there are remaining issues like the ones that Alexander reported, we don't rely on reaching _bt_killitems to get an assertion failure.

(Note that adding these assertions necessitated making _bt_firstpage call _bt_readnextpage instead of its _bt_steppage wrapper. _bt_steppage is now "Only called when _bt_drop_lock_and_maybe_pin was called for so->currPos", which doesn't apply to the case where _bt_firstpage previously called _bt_steppage. This brings _bt_firstpage in line with _bt_readnextpage itself -- now we only call _bt_steppage for a page that we returned to the scan via btgettuple/a page that might have had its pin released by _bt_drop_lock_and_maybe_pin.)

v2 also completely removes all IncrBufferRefCount() calls from nbtree. These were never really necessary, since we don't actually need to hold more than one pin at a time on the same page at any point in cases involving mark/restore (nor any other case). We can always get by with no more than one buffer pin on the same page. There were 2 IncrBufferRefCount() calls total, both of which v2 removes:

The first IncrBufferRefCount call removed was in _bt_steppage. We copy so->currPos into so->markPos when leaving the page in _bt_steppage. We're invalidating so->currPos in passing here, so we don't *need* to play games with IncrBufferRefCount -- we can just not release the original pin, based on the understanding that the pin is "transferred" to so->markPos (so->currPos can therefore be invalidated without having its pin unpinned).

The second IncrBufferRefCount call removed was in btrestrpos. When we restore a mark, we need to keep around the original mark, in case it needs to be restored a little later -- the so->markPos mark must not go away. But that in itself doesn't necessitate holding multiple buffer pins on the same page at once. We can keep the old mark around, without keeping a separate buffer pin for so->markPos. We can just set so->markItemIndex instead -- that's an alternative (and more efficient) way of representing the mark (the so->markPos representation isn't truly needed). This is okay, since our new so->currPos is for the same page as the mark -- that's how so->markItemIndex is supposed to be used.

--
Peter Geoghegan
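To illustrate the "transfer the pin" idea from the first point above, here is a sketch of what the mark-saving path in _bt_steppage can look like without IncrBufferRefCount. The field names are real BTScanOpaque fields, but the fragment is a sketch of the technique rather than the patch's exact code.

```c
/*
 * Sketch: when a mark is pending as we leave the current leaf page, hand
 * the existing pin over to so->markPos instead of taking a second pin on
 * the same buffer with IncrBufferRefCount().
 */
if (so->markItemIndex >= 0)
{
    /* Copy the whole saved position, including the buffer reference */
    memcpy(&so->markPos, &so->currPos, sizeof(BTScanPosData));
    so->markPos.itemIndex = so->markItemIndex;
    so->markItemIndex = -1;

    /*
     * Exactly one pin remains, and it now belongs to so->markPos.
     * so->currPos no longer owns it, so invalidate its buffer reference
     * rather than calling ReleaseBuffer().
     */
    so->currPos.buf = InvalidBuffer;
}
```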