Re: AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: AIO writes vs hint bits vs checksums
Date:
Msg-id: j526lwnbudqijis6e4qtiwsmr3ldy4c6jghti3cqkeo34xrh63@hmidtdwn6vxp
In response to: Re: AIO writes vs hint bits vs checksums (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
Hi,

Thanks for the quick review!

On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> On 24/09/2024 18:55, Andres Freund wrote:
> > What I'd instead like to propose is to implement the right to set hint
> > bits as a bit in each buffer's state, similar to BM_IO_IN_PROGRESS.
> > Tentatively I named this BM_SETTING_HINTS. It's only allowed to set
> > BM_SETTING_HINTS when BM_IO_IN_PROGRESS isn't already set, and
> > StartBufferIO has to wait for BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to
> > set hint bits in SetHintBits(), unfortunately leads to a significant
> > performance regression. While the performance is unaffected for OLTPish
> > workloads like pgbench (both read and write), sequential scans of
> > unhinted tables regress significantly, due to the per-tuple lock
> > acquisition this would imply.
>
> It'd be nice to implement this without having to set and clear
> BM_SETTING_HINTS every time.

True - but it's worth noting that we only need to set BM_SETTING_HINTS if
there are any hint bits that need to be set. So in the most common paths we
won't need to deal with it at all.

> Could we put the overhead on the FlushBuffer() instead?
>
> How about something like this:
>
> To set hint bits:
>
> 1. Lock page in SHARED mode.
> 2. Read BM_IO_IN_PROGRESS
> 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> 4. Unlock page
>
> To flush a buffer:
>
> 1. Lock page in SHARED mode
> 2. Set BM_IO_IN_PROGRESS
> 3. Read the lock count on the buffer lock, to see if we're the only locker.
> 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
>    immediately downgrade back to share mode.
> 5. Calculate CRC, flush the buffer
> 6. Clear BM_IO_IN_PROGRESS and unlock page.
>
> This goes back to the idea of adding LWLock support for this, but the
> amount of changes could be pretty small.
> The missing operation we don't have today is reading the share-lock count
> on the lock in step 3. That seems simple to add.

I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:

1) The worst-case latency impacts seemed to make them not that interesting.

   A buffer that is heavily contended with share locks might not get down
   to zero share lockers for quite a while. That's not a problem for
   individual FlushBuffer() calls, but it could very well add up to a
   sizable delay for something like a checkpoint that has to flush a lot of
   buffers.

   Waiting for all pre-existing share lockers is easier said than done. We
   don't record the acquisition order anywhere, and a share-lock release
   won't wake anybody if the lock count doesn't reach zero. Changing that
   wouldn't exactly be free, and the cost would be borne by all lwlock
   users.

2) They are based on holding an lwlock. But it's actually quite conceivable
   that we'd want to set something hint-bit-like without holding any lock,
   as we e.g. currently do for freespace/. That would work with something
   like the approach I chose here, but not if we rely on lwlocks.

   E.g. with a bit of work we could actually do sequential scans without
   blocking concurrent buffer modifications, by carefully ordering when
   pd_lower is modified and making sure that tuple header fields are
   written in a sensible order.

> Acquiring the exclusive lock in step 4 is just a way to wait for all the
> existing share-lockers to release the lock. You wouldn't need to block new
> share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need,
> but it currently ignores share-lockers. So doing this "properly" would
> require more changes to LWLocks. Briefly acquiring an exclusive lock seems
> acceptable though.

I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something
like this, because share lockers can acquire/release the lock without
waking anybody.
And changing that would be rather expensive for everyone...

> > But: We can address this and improve performance over the status quo!
> > Today we determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples.
>
> Yes, batching the visibility checks makes sense in any case.

Cool. The wins from doing that are noticeable...

Greetings,

Andres Freund