Thread: Re: strange perf regression with data checksums

Re: strange perf regression with data checksums

From
Aleksander Alekseev
Date:
Hi Tomas,

> While running some benchmarks comparing 17 and 18, I ran into a simple
> workload where 18 throughput drops by ~80%. After pulling my hair for a
> couple hours I realized the change that triggered this is 04bec894a04c,
> which set checksums on by default. Which is very bizarre, because the
> workload is read-only and fits into shared buffers.
>
> [...]
>
> But why would it depend on checksums at all? This read-only test should
> be entirely in-memory, so how come it's affected?

These are interesting results.

Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?

I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.

-- 
Best regards,
Aleksander Alekseev



Re: strange perf regression with data checksums

From
Tomas Vondra
Date:
On 5/9/25 14:53, Aleksander Alekseev wrote:
> Hi Tomas,
> 
>> While running some benchmarks comparing 17 and 18, I ran into a simple
>> workload where 18 throughput drops by ~80%. After pulling my hair for a
>> couple hours I realized the change that triggered this is 04bec894a04c,
>> which set checksums on by default. Which is very bizarre, because the
>> workload is read-only and fits into shared buffers.
>>
>> [...]
>>
>> But why would it depend on checksums at all? This read-only test should
>> be entirely in-memory, so how come it's affected?
> 
> These are interesting results.
> 
> Just wanted to clarify: did you make sure that all the hint bits were
> set before executing the benchmark?
> 
> I'm not claiming that hint bits are necessarily the reason for the
> observed behavior but when something is off with presumably read-only
> queries this is the first reason that comes to mind. At least we
> should make sure hint bits are excluded from the equation. If memory
> serves, VACUUM FULL and CHECKPOINT after filling the table and
> creating the index should do the trick.
> 

Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.

Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.


regards

-- 
Tomas Vondra




Re: strange perf regression with data checksums

From
Aleksander Alekseev
Date:
Hi,

> > I'm not claiming that hint bits are necessarily the reason for the
> > observed behavior but when something is off with presumably read-only
> > queries this is the first reason that comes to mind. At least we
> > should make sure hint bits are excluded from the equation. If memory
> > serves, VACUUM FULL and CHECKPOINT after filling the table and
> > creating the index should do the trick.
>
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

There is one more thing I would check. As I recall perf shows only
on-CPU time while actually the backends may be sleeping on the locks
most of the time. If this is the case perf will not show you the
accurate picture.

In order to check this personally I create gdb.script with a single GDB command:

```
bt
```

And execute:

```
gdb --batch --command=gdb.script -p (backend_pid_here)
```

... 10+ times or so. If what you are observing is actually a lock
contention and the backend sleeps on a lock most of the time, 8/10 or
so stacktraces will show you this.

I assume of course that the benchmark is done on release builds with
disabled Asserts, etc.

BTW do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores or I can reproduce it on SMT as well?

-- 
Best regards,
Aleksander Alekseev



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
> Good question. I haven't checked that explicitly, but it's a tiny data
> set (15MB) and I observed this even on long benchmarks with tens of
> millions of queries. So the hint bits should have been set.
>
> Also, I should have mentioned the query does an index-only scan, and the
> pin/unpin calls are on index pages, not on the heap.

We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.

In general, we call BufferGetLSNAtomic() to stash the page LSN within
so->currPos.lsn, for later. so->currPos.lsn provides us with a way to
detect whether the page was modified during the period in which we
dropped our pin on the leaf page -- plain index scans cannot set
LP_DEAD bits on dead index tuples within _bt_killitems() if the page
has changed. But, index-only scans never drop the pin on the leaf page
to begin with, and so don't even use so->currPos.lsn (bitmap index
scans *never* call _bt_killitems(), and so obviously have no possible
use for so->currPos.lsn, either).

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
> For index-only scans, yes.

Great.

> The regular index scan however still have this issue, although it's not
> as visible as for IOS.

We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.

We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface). In other words, we only
need to call BufferGetLSNAtomic for pages that _bt_readpage returns
"true" for when called. There are plenty of leaf pages that
_bt_readpage will return "false" for, especially during large range
scans, and skip scans.

It's easy to see why this extension to my v1 POC is correct: the whole
point of dropping the leaf page pin is that we don't block VACUUM when
btgettuple returns -- but btgettuple isn't going to return until the
next call to _bt_readpage that returns "true" actually takes place (or
until the whole scan ends).

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> I think we can do better - something like
>
> #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
>     lsn = PageGetLSN(page);
> #else
>     buf_state = LockBufHdr(bufHdr);
>     lsn = PageGetLSN(page);
>     UnlockBufHdr(bufHdr, buf_state);
> #endif
>
> All perf relevant systems support reading 8 bytes without a chance of
> tearing...

Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.

The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
> Same effect as v1 for IOS, with regular index scans I see this:
>
> 64 clients: 0.7M tps
> 96 clients: 1.5M tps
>
> So very similar improvement as for IOS.

Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)

I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.

*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
skip support support function to nbtree). Fully makes sense that my v2
would fix that particular problem, even with plain index scans.
But...what about slightly different queries, that actually do return
some rows? Those will look different.

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Tomas Vondra
Date:
On 5/19/25 22:29, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Same effect as v1 for IOS, with regular index scans I see this:
>>
>> 64 clients: 0.7M tps
>> 96 clients: 1.5M tps
>>
>> So very similar improvement as for IOS.
> 
> Just to be clear, you're saying my v2 appears to fix the problem
> fully, for both plain index scans and index-only scans? (You haven't
> mentioned bitmap index scans at all, I think, even though they are
> relevant to v2.)

With v2 the regression disappears, both for index-only scans and regular
index scans. I haven't tried anything with bitmap scans - I hit the
regression mostly by accident, on a workload that does IOS only. I may
try constructing something with bitmap scans, but I didn't have time for
that right now.

> 
> I'd be surprised if my v2 *fully* fixed the issue with plain index
> scans. It's only going to avoid calls to BufferGetLSNAtomic()
> following _bt_readpage calls that return false, but I doubt that your
> particular pgbench-variant test queries really look like that.
> 


Not sure. My workload is pretty simple, and similar to what I used in
the other nbtree regression (with malloc overhead), except that it's not
using partitioning.

pgbench -i -s 1 test
psql test <<EOF
update pgbench_accounts set bid = aid;
create index on pgbench_accounts(bid);
vacuum full;
analyze;
EOF

# select.sql
set enable_indexonlyscan = off;
select count(*) from pgbench_accounts where bid = 1

I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.

> *Looks again* Oh, wait. This is another one of those benchmarks with
> an index scan that returns no rows whatsoever (just like on the other
> recent thread involving regressions tied to the introduction of a new
> skip support support function to nbtree). Fully makes sense that my v2
> would fix that particular problem, even with plain index scans.
> But...what about slightly different queries, that actually do return
> some rows? Those will look different.
> 

Actually, this particular query does return rows (one, to be precise).

But you're right - it seems sensitive to how many rows are returned, and
at some point the contention goes away and there's no regression.

I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.

Do you have any suggestions what cases you'd like me to test?


regards

-- 
Tomas Vondra




Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
> With v2 the regression disappears, both for index-only scans and regular
> index scans. I haven't tried anything with bitmap scans - I hit the
> regression mostly by accident, on a workload that does IOS only. I may
> try constructing something with bitmap scans, but I didn't have time for
> that right now.

I imagine bitmap index scans will be similar to plain index scans.

> I don't know what "fully fix" means in this context. I see a massive
> improvement with v2, I have no idea if that's the best we could do.

You expected there to be *zero* performance impact from enabling
checksums for this workload, since it is a pure read-only workload.
That's what I meant by "fully fix".

> But you're right - it seems sensitive to how many rows are returned, and
> at some point the contention goes away and there's no regression.
>
> I need to do proper automated testing, to get reliable results. I've
> been doing manual testing, but it's easy to make mistakes that way.
>
> Do you have any suggestions what cases you'd like me to test?

Nothing comes to mind. Again, just be aware that we can only
completely avoid calling BufferGetLSNAtomic is only possible when:

* Scan is an index-only scan

OR

* Scan is a bitmap index scan

OR

* Scan is a plain index scan, reading a page that _bt_readpage()
returned "false" for when called.

In other words, plain index scans that read a lot of tuples might
receive no benefit whatsoever. It's possible that it already matters
less there anyway. It's also possible that there is some unforeseen
benefit from merely *delaying* the call to BufferGetLSNAtomic. But in
all likelihood these "unfixed" plain index scans are just as fast with
v2 as they are when run on master/baseline.


--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I imagine bitmap index scans will be similar to plain index scans.

To be clear, I meant that the magnitude of the problem will be
similar. I do not mean that they aren't fixed by my v2. They should be
fully fixed by v2.

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Andres Freund
Date:
Hi,

On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
> On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we can do better - something like
> >
> > #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
> >     lsn = PageGetLSN(page);
> > #else
> >     buf_state = LockBufHdr(bufHdr);
> >     lsn = PageGetLSN(page);
> >     UnlockBufHdr(bufHdr, buf_state);
> > #endif
> >
> > All perf relevant systems support reading 8 bytes without a chance of
> > tearing...
> 
> Even assuming that this scheme works perfectly, I'd still like to
> pursue the idea I had about fixing this in nbtree.
> 
> The relevant nbtree code seems more elegant if we avoid calling
> BufferGetLSNAtomic() unless and until its return value might actually
> be useful. It's quite a lot easier to understand the true purpose of
> so->currPos.lsn with this new structure.

I'm not against that - ISTM we should do both.

Greetings,

Andres Freund



Re: strange perf regression with data checksums

From
Tomas Vondra
Date:
Hi,

We had a RMT meeting yesterday, and we briefly discussed whether the v2
patch should go into PG18. And we concluded we probably should try to do
that.

On the one hand, it's not a PG18-specific bug, in the sense that older
releases are affected the same way. But it requires data checksums to be
enabled, and a lot of systems runs with checksums=off, simply because
that's the default. But 18 will (likely) change that.

We didn't get any complaints about this (at least I'm not aware of any),
but I suppose that's simply because people didn't have a comparison, and
treated it as "normal". But with the default changes, it'll be easier to
spot once they upgrade to PG18.

So better to get this in now, otherwise we may have to wait until PG19,
because of ABI (the patch adds a field into BTScanPosData, but maybe
it'd be possible to add it into padding, not sure).


regards

-- 
Tomas Vondra




Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> So better to get this in now, otherwise we may have to wait until PG19,
> because of ABI (the patch adds a field into BTScanPosData, but maybe
> it'd be possible to add it into padding, not sure).

I agree. I can get this into shape for commit today.

Does anybody object to my proceeding with committing the patch on the
master branch/putting it in Postgres 18?  (FWIW I could probably fit
the new field into some BTScanPosData alignment padding, but I don't
favor back patching.)

I consider my patch to be low risk. There's a kind of symmetry to how
things work with the patch in place, which IMV makes things simpler.

--
Peter Geoghegan



Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Wed, Jun 4, 2025 at 10:21 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra <tomas@vondra.me> wrote:
> > So better to get this in now, otherwise we may have to wait until PG19,
> > because of ABI (the patch adds a field into BTScanPosData, but maybe
> > it'd be possible to add it into padding, not sure).
>
> I agree. I can get this into shape for commit today.

Attached is v3, which is functionally the same as v2. It has improved
comments, and a couple of extra assertions. Plus there's a real commit
message now.

I also relocated the code that sets so.drop_pin from
_bt_preprocess_keys to btrescan. That seems like the more logical
place for it.

My current plan is to commit this in the next couple of days.

--
Peter Geoghegan

Attachment

Re: strange perf regression with data checksums

From
Peter Geoghegan
Date:
On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My current plan is to commit this in the next couple of days.

Pushed.

It would be great if we could also teach BufferGetLSNAtomic() to just
call PageGetLSN() (at least on PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
platforms), but my sense is that that's not going to happen in
Postgres 18.

--
Peter Geoghegan