
From: Andres Freund
Subject: Re: AIO writes vs hint bits vs checksums
Date:
Msg-id: xbxhku5bpfyzq4b3kng32dtwhjtq6e4cmfjxxiabo434p6wadi@to4dfk4i4mok
In response to: Re: AIO writes vs hint bits vs checksums (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer()
> > instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> > immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the amount
> > of changes could be pretty small. The missing operation we don't have today
> > is reading the share-lock count on the lock in step 3. That seems simple to
> > add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that
>    interesting. A buffer that is heavily contended with share locks might not
>    get down to zero share lockers for quite a while. That's not a problem for
>    individual FlushBuffer() calls, but it could very well add up to a decent
>    sized delay for something like a checkpoint that has to flush a lot of
>    buffers.
>
>    Waiting for all pre-existing share lockers is easier said than done. We
>    don't record the acquisition order anywhere and a share-lock release won't
>    wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
>    exactly be free and the cost would be borne by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
>    that we'd want to set something hint-bit-like without any lock, as we
>    e.g. currently do for freespace/. That would work with something like the
>    approach I chose here, but not if we rely on lwlocks.
>
>    E.g. with a bit of work we could actually do sequential scans without
>    blocking concurrent buffer modifications by carefully ordering when
>    pd_lower is modified and making sure that tuple header fields are written
>    in a sensible order.
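
To make the quoted flush-side protocol a bit more concrete, here's a rough
sketch in bufmgr.c terms. LWLockShareLockerCount(),
LWLockUpgradeToExclusive() and LWLockDowngradeToShared() are purely
hypothetical - lwlock.c has no such operations today, which is exactly the
missing piece mentioned above:

/*
 * Sketch of the quoted flush-side protocol.  The three LWLock* helpers for
 * steps 3/4 are hypothetical; IO start/termination details are elided.
 */
static void
FlushBufferSketch(BufferDesc *buf)
{
    LWLock *content_lock = BufferDescriptorGetContentLock(buf);

    LWLockAcquire(content_lock, LW_SHARED);         /* step 1 */

    /* step 2: set BM_IO_IN_PROGRESS under the buffer header lock */

    if (LWLockShareLockerCount(content_lock) > 1)   /* step 3, hypothetical */
    {
        /* step 4: wait out pre-existing share lockers / hint bit setters */
        LWLockUpgradeToExclusive(content_lock);     /* hypothetical */
        LWLockDowngradeToShared(content_lock);      /* hypothetical */
    }

    /* step 5: compute the checksum and write out the page */

    /* step 6: clear BM_IO_IN_PROGRESS */
    LWLockRelease(content_lock);
}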

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.

We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
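
As a minimal sketch of what such an API could look like - the function name
BufferSetHintBits16() and the skip-if-IO-in-progress policy are assumptions
for the sake of illustration, not the actual patch:

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical single-hint API.  Because the store happens while the buffer
 * header spinlock is held, a concurrent FlushBuffer() can't start copying /
 * checksumming the page in the middle of the write, so there's no need to
 * set BM_SETTING_HINTS or to do resowner.c accounting for it.
 */
void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    Assert(!BufferIsLocal(buffer)); /* local buffers elided for brevity */

    buf_state = LockBufHdr(bufHdr);

    /*
     * Assumed policy: if the buffer is currently being written out, just
     * skip the hint - hint bits are only an optimization and can be set on
     * a later access.
     */
    if (!(buf_state & BM_IO_IN_PROGRESS))
        *ptr = val;

    UnlockBufHdr(bufHdr, buf_state);

    /* the caller still marks the buffer dirty (MarkBufferDirtyHint()), as today */
}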

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
   differences
c) used an unlogged table

and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower.

With the batch optimization enabled, the patchset is 7.5% faster.


I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an index-only scan (IOS), the performance difference is a slowdown of 0.65%.

But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan using
enable_indexonlyscan, I get a slowdown of 5.0%.


Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.

As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.

Greetings,

Andres Freund


