Home > mailing lists

Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: Buffer locking is special (hints, checksums, AIO writes)
Date	November 20, 2025 23:51:33
Msg-id	z5gkpjfxa4rg3zylnnrpga77ezts5rrgi7xr34yydcmqeathop@hbeyhwlm2i5f Whole thread Raw
In response to	Re: Buffer locking is special (hints, checksums, AIO writes) (Greg Burd <greg@burd.me>)
Responses	Re: Buffer locking is special (hints, checksums, AIO writes)
List	pgsql-hackers

Tree view

Hi,

On 2025-11-20 14:08:57 -0500, Greg Burd wrote:
> After talking to you about these ideas at PGConf in NYC I've been
> anxiously awaiting this patch set. Thanks for dedicating the energy and
> time to get it to this stage.

Thanks!

> High level feedback after reading the patches/email/commit messages is
> that it looks to get you to where you wanted to be, unblocking AIO
> writes. I think they'll end up being faster than what's in place now,
> even before you get to the AIO piece.  Certainly removing the copy of
> each page to do a checksum will help.  Opening the door to a future
> where we can have super-pinned/locked pages is also a net win.

I actually had planned to write about the performance effects, in particular
of 0012, a bit more:

It's worth pointing out that the new way of setting hint bits is inherently
more expensive than what we did before - upgrading a lock to a different lock
level isn't free, compared to doing, well, nothing.

For paths that set the hint bits of a whole page, like a seqscan, that cost is
more than amortized by the batched approach introduced in 0011. Those get
faster with the patch, both when already hinted and when not.

However, there are paths that aren't easily amenable to that approach, like
e.g. an ordered index scan referencing unhinted tuples. There we only ever
access a single tuple and release the upgraded lock after every tuple. If the
index scan is perfectly correlated with the table and every tuple is unhinted,
that's a decent amount of additional work.

I've spent a lot of time micro-optimizing that workload, to avoid any
significiant regressions. An extreme stress-test started out being about 20%
slower than today, as of my current local version, it's a bit faster (~1%) on
one of my machines and a bit slower (~2%) on another. Partially that was
achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
path for updating a single hint bit, avoiding redundant reads of the lock
state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
locking code.  The latter is a bit of a cheat though - things would be even
faster if we went with the old way of setting hint bits, but with the
independent optimizations applied.

I think that's ok though:

1) the old way of setting hint bits is a pretty dirty hack that causes issues
   in quite a few places.

2) by definition, having to set hint bits is an ephemeral state, once the hint
   bits are set, the difference vanishes

3) no normal workload shows the difference - my stress test does
   SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
   on a perfectly correlated table with very narrow rows, i.e. an index scan
   of the whole table, where none of the scan results are ever used. Once one
   actually uses the resulting rows, the performance difference completely
   vanishes.

4) as part of the index prefetching work, we might get the infrastructure to
   actually batch the hint-bit setting in this case too.

I see some mild performance gains in workloads like pgbench [-S], but nothing
to write home about. Which I think is about what I would expect - there's a
minor efficiency gain due to the private refcount changes and not having state
tracking for two error recovery mechanisms (lwlocks' and private refcount).

Non-all-visible seqscans do see some performance gain due to 0011, whether the
table is hinted or not. But it's again something that's mostly noticeable in
microbenchmarks, as e.g. tuple deforming or qual evaluation has a much bigger
impact.

> Everything before/after 0008 was rather easy to digest and understand
> and I found nothing really to call out at this stage
>
> 0008 is understandable too, it's just sizable. While it is large, I find
> it well laid out and more readable than before.  I gave the locking code
> a good look, it seems correct AFAICT.

I hope so :), the locking logic it's largely the same as lwlock.c, with some
exceptions due to the added lock level and differences in error handling
state.

I don't really see a good way to split 0008 unfortunately...  I previously had
split 0012 into four patches (core changes, heapam changes, the rest of the
adaptions to setting hint bits in various places, adding assertions relying on
the different lock levels), but I found it pretty unwieldly from a "comment
management" perspective, because comments that need to be rewritten are
temporarily wrong, or would need to be modified in yet another path.  But I'm
open to going back to that approach.

I guess I could pull out the addition of UnlockBuffer() and the "redirection"
to it from LockBuffer() into a separate patch.

> Keep going, I'll be happy to dedicate time to testing and digging into
> the commits as you get this into a final state.  I look forward to
> extending/enhancing this code once integrated.

Cool!

I think 0001, 0002, 0003, 0005 and 0009 should be mergeable pretty soon.  I've
some further polishing to do for 0006 and 0007, but I think they could go in
well ahead of the rest.

For 0008, in addition to what's noted in the commit message, I think there
needs to be an additional section in src/backend/storage/buffer/README (or
such), explaining that buffer content locks used to be lwlocks but aren't
anymore for xyz reasons.  I suspect it'd also be good to have a few references
from lwlock.c to bufmgr.c to make sure the code is co-evolved.

For 0012, I think it might make sense to pull out some of the changes to
fsm_vacuum_page() out into a separate commit. Basically changing the code to
acquire a lock on the page - I still can't quite believe that somebody thought
it's sane to update the page without even bothering with a share lock.

0014 needs a lot more polishing, there's references to the hint bit / locking
interactions all over. Pretty hard to find all the references :(.

Greetings,

Andres Freund

pgsql-hackers by date:

From: Greg Burd
Date: 20 November 2025, 23:45:22
Subject: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

From: David Rowley
Date: 20 November 2025, 23:58:38
Subject: Re: another autovacuum scheduling thread

Re: Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers

Previous

Next