Re: AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: AIO writes vs hint bits vs checksums
Date:
Msg-id: xbxhku5bpfyzq4b3kng32dtwhjtq6e4cmfjxxiabo434p6wadi@to4dfk4i4mok
In response to: Re: AIO writes vs hint bits vs checksums (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer() instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> >    immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the
> > amount of changes could be pretty small. The missing operation we don't
> > have today is reading the share-lock count on the lock in step 3. That
> > seems simple to add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that interesting.
>    A buffer that is heavily contended with share locks might not get down
>    to zero share lockers for quite a while. That's not a problem for
>    individual FlushBuffer() calls, but it could very well add up to a decent
>    sized delay for something like a checkpoint that has to flush a lot of
>    buffers.
>
>    Waiting for all pre-existing share lockers is easier said than done. We
>    don't record the acquisition order anywhere and a share-lock release
>    won't wake anybody if the lockcount doesn't reach zero. Changing that
>    wouldn't exactly be free and the cost would be borne by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
>    that we'd want to set something hint-bit-like without any lock, as we
>    e.g. currently do for freespace/. That would work with something like
>    the approach I chose here, but not if we rely on lwlocks.
>
>    E.g. with a bit of work we could actually do sequential scans without
>    blocking concurrent buffer modifications by carefully ordering when
>    pd_lower is modified and making sure that tuple header fields are
>    written in a sensible order.

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours.

When setting hint bits for many tuples on a page, the overhead of my approach
is negligible, but when doing it for individual tuples it's a bit less
negligible. We can reduce the efficiency difference substantially by adding a
bufmgr.c API that sets hint bits on a page. That function can set the hint
bits while holding the buffer header lock, and therefore doesn't need to set
BM_SETTING_HINTS and thus also doesn't need to do resowner.c accounting.
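To illustrate the rough shape I have in mind - this is just a sketch, not
what the patch actually does; BufferSetHintBitsBatch() and HintBitUpdate are
made-up names and the update representation is invented, only LockBufHdr() /
UnlockBufHdr(), BM_IO_IN_PROGRESS and MarkBufferDirtyHint() are existing
bufmgr machinery - something like:

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/*
 * Illustration only: apply a batch of hint-bit updates to a shared (not
 * local) buffer while holding the buffer header spinlock.  Because
 * FlushBuffer() sets BM_IO_IN_PROGRESS under that same spinlock, checking it
 * here is enough to avoid modifying the page while its checksum is being
 * computed, without BM_SETTING_HINTS and without resowner.c accounting.
 */
typedef struct HintBitUpdate
{
	uint16		offset;			/* byte offset of the infomask field in the page */
	uint16		set_bits;		/* bits to OR into that field */
} HintBitUpdate;

static void
BufferSetHintBitsBatch(Buffer buffer, const HintBitUpdate *updates, int nupdates)
{
	BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
	Page		page = BufferGetPage(buffer);
	uint32		buf_state;
	bool		did_set = false;

	buf_state = LockBufHdr(bufHdr);

	if (!(buf_state & BM_IO_IN_PROGRESS))
	{
		/* no write IO in progress, safe to modify the page contents */
		for (int i = 0; i < nupdates; i++)
		{
			uint16	   *field = (uint16 *) (page + updates[i].offset);

			*field |= updates[i].set_bits;
		}
		did_set = true;
	}

	UnlockBufHdr(bufHdr, buf_state);

	/* hint-bit-only dirtying, done outside the spinlock */
	if (did_set)
		MarkBufferDirtyHint(buffer, true);
}

The point being that the caller hands over all the hint bits it wants to set
on a page in one call, so the header-lock round trip and the
MarkBufferDirtyHint() call are paid once per page rather than once per tuple.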
To see the worst case overhead, I a) disabled the "batch" optimization,
b) disabled checksums, as those would otherwise hide small efficiency
differences, and c) used an unlogged table, and then measured the performance
difference for a previously-unhinted sequential scan of a narrow table that
immediately discards all tuples due to OFFSET - afaict the worst case for the
proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower. With
the batch optimization enabled, the patchset is 7.5% faster.

I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an IOS, the performance difference is a slowdown of 0.65%. But the
difference being so small is partially just due to IOS being considerably
slower than a plain index scan when all tuples need to be fetched from the
table (we need to address that...). Forcing a non-IOS scan using
enable_indexonlyscan, I get a slowdown of 5.0%.

Given that this is an intentionally pathological workload - there's no IO and
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with
that. As soon as I make it a query that actually uses the data, the
difference vanishes in the noise.

Greetings,

Andres Freund