Re: AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO writes vs hint bits vs checksums
Date
Msg-id j526lwnbudqijis6e4qtiwsmr3ldy4c6jghti3cqkeo34xrh63@hmidtdwn6vxp
In response to Re: AIO writes vs hint bits vs checksums  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: AIO writes vs hint bits vs checksums
List pgsql-hackers
Hi,

Thanks for the quick review!


On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> On 24/09/2024 18:55, Andres Freund wrote:
> > What I'd instead like to propose is to implement the right to set hint bits as
> > a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> > named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> > BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> > BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to set
> > hint bits in SetHintBits() unfortunately leads to a significant performance
> > regression. While the performance is unaffected for OLTPish workloads like
> > pgbench (both read and write), sequential scans of unhinted tables regress
> > significantly, due to the per-tuple lock acquisition this would imply.
>
> It'd be nice to implement this without having to set and clear
> BM_SETTING_HINTS every time.

True - but it's worth noting that we only need to set BM_SETTING_HINTS if
there are any hint bits that need to be set. So in the most common paths we
won't need to deal with it at all.
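To make the idea concrete, here is a toy single-file model (not actual PostgreSQL code; the names mirror the proposal but the implementation is an assumption) of guarding hint-bit setting with a flag bit in an atomic buffer-state word, next to an IO-in-progress bit:

```c
/* Toy model of BM_SETTING_HINTS / BM_IO_IN_PROGRESS coordination via a
 * single atomic state word.  Illustrative only, not PostgreSQL's code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BM_IO_IN_PROGRESS (1u << 0)
#define BM_SETTING_HINTS  (1u << 1)

typedef struct { _Atomic uint32_t state; } BufferDesc;

/* Try to get permission to set hint bits; fails while IO is running. */
static bool
try_start_setting_hints(BufferDesc *buf)
{
    uint32_t old = atomic_load(&buf->state);
    for (;;)
    {
        if (old & BM_IO_IN_PROGRESS)
            return false;       /* the writer owns the page contents now */
        if (atomic_compare_exchange_weak(&buf->state, &old,
                                         old | BM_SETTING_HINTS))
            return true;
        /* CAS failed: 'old' was reloaded, loop and re-check */
    }
}

static void
finish_setting_hints(BufferDesc *buf)
{
    atomic_fetch_and(&buf->state, (uint32_t) ~BM_SETTING_HINTS);
}

/* StartBufferIO analogue: BM_IO_IN_PROGRESS may only be set while nobody
 * holds BM_SETTING_HINTS (a real implementation would wait, not fail). */
static bool
try_start_io(BufferDesc *buf)
{
    uint32_t old = atomic_load(&buf->state);
    for (;;)
    {
        if (old & (BM_SETTING_HINTS | BM_IO_IN_PROGRESS))
            return false;
        if (atomic_compare_exchange_weak(&buf->state, &old,
                                         old | BM_IO_IN_PROGRESS))
            return true;
    }
}
```

The point of the fast path above is that a scan which finds no hint bits to set never touches the flag at all.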



> Could we put the overhead on the FlushBuffer()
> instead?
>
> How about something like this:
>
> To set hint bits:
>
> 1. Lock page in SHARED mode.
> 2. Read BM_IO_IN_PROGRESS
> 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> 4. Unlock page
>
> To flush a buffer:
>
> 1. Lock page in SHARED mode
> 2. Set BM_IO_IN_PROGRESS
> 3. Read the lock count on the buffer lock, to see if we're the only locker.
> 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> immediately downgrade back to share mode.
> 5. calculate CRC, flush the buffer
> 6. Clear BM_IO_IN_PROGRESS and unlock page.
>
> This goes back to the idea of adding LWLock support for this, but the amount
> of changes could be pretty small. The missing operation we don't have today
> is reading the share-lock count on the lock in step 3. That seems simple to
> add.
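The quoted protocol could be modeled single-threaded like this (all names are illustrative assumptions, not PostgreSQL's API): a reader may set hint bits under a share lock only while the in-progress flag is clear, and the flusher sets the flag first and then decides whether it must wait out pre-existing share lockers.

```c
/* Toy model of the proposed flush-side protocol.  Illustrative only. */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    int  share_count;       /* current share lockers, including ourselves */
    bool io_in_progress;    /* set by the flusher before computing the CRC */
} PageLock;

/* Reader steps 2-3: check the flag while holding the share lock. */
static bool
may_set_hints(const PageLock *lk)
{
    return !lk->io_in_progress;
}

/* Flusher steps 2-4: set the flag, then decide whether it must briefly
 * upgrade to exclusive mode to wait for pre-existing share lockers. */
static bool
flush_must_wait(PageLock *lk)
{
    lk->io_in_progress = true;
    return lk->share_count > 1;   /* anyone besides the flusher itself? */
}
```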

I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:

1) The worst-case latency impact seemed to make them not that
   interesting. A buffer that is heavily contended with share locks might not
   get down to zero share lockers for quite a while. That's not a problem for
   individual FlushBuffer() calls, but it could very well add up to a
   decent-sized delay for something like a checkpoint that has to flush a lot
   of buffers.

   Waiting for all pre-existing share lockers is easier said than done. We
   don't record the acquisition order anywhere, and a share-lock release won't
   wake anybody if the lock count doesn't reach zero. Changing that wouldn't
   exactly be free, and the cost would be borne by all lwlock users.

2) They are based on holding an lwlock. But it's actually quite conceivable
   that we'd want to set something hint-bit-like without any lock, as we
   e.g. currently do for freespace/. That would work with something like the
   approach I chose here, but not if we rely on lwlocks.

   E.g. with a bit of work we could actually do sequential scans without
   blocking concurrent buffer modifications by carefully ordering when
   pd_lower is modified and making sure that tuple header fields are written
   in a sensible order.



> Acquiring the exclusive lock in step 4 is just a way to wait for all the
> existing share-lockers to release the lock. You wouldn't need to block new
> share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
> it currently ignores share-lockers. So doing this "properly" would require
> more changes to LWLocks. Briefly acquiring an exclusive lock seems
> acceptable though.

I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something
like this, because share lockers can acquire/release the lock without waking
anybody. And changing that would be rather expensive for everyone...


> > But: We can address this and improve performance over the status quo! Today
> > we perform tuple visibility determination one-by-one, even when checking the
> > visibility of an entire page worth of tuples.
>
> Yes, batching the visibility checks makes sense in any case.

Cool. The wins by doing that are noticeable...
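
One way batching can help, sketched below, is to resolve the transaction status once per xid seen on the page instead of once per tuple. HeapTupleStub, is_committed(), and the one-entry cache are all invented for the illustration; none of this is PostgreSQL's actual code.

```c
/* Hypothetical sketch of batched visibility checks for one page's tuples,
 * caching the last xid's status to avoid repeated (expensive) lookups. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t xmin; bool visible; } HeapTupleStub;

/* stand-in for an expensive commit-status lookup; counts invocations */
static int lookups;
static bool is_committed(uint32_t xid) { lookups++; return xid % 2 == 0; }

static void
check_page_visibility(HeapTupleStub *tuples, int ntuples)
{
    uint32_t cached_xid = 0;
    bool     cached_result = false;
    bool     have_cache = false;

    for (int i = 0; i < ntuples; i++)
    {
        uint32_t xid = tuples[i].xmin;

        if (!have_cache || xid != cached_xid)
        {
            cached_result = is_committed(xid);
            cached_xid = xid;
            have_cache = true;
        }
        tuples[i].visible = cached_result;
    }
}
```

Since tuples on one page very often share a small number of inserting transactions, even this one-entry cache skips most lookups in the common case.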

Greetings,

Andres Freund


