Thread: AIO writes vs hint bits vs checksums
Hi,

Currently we modify pages while just holding a share lock, for hint bit writes. Writing a buffer out only requires a share lock. Because of that we can't compute checksums and write out pages in-place, as a concurrent hint bit write can easily corrupt the checksum.

That's not great, but not awful for our current synchronous buffered IO, as we only ever have a single page being written out at a time. However, it becomes a problem even if we just want to write out in chunks larger than a single page - we'd need to reserve not just one BLCKSZ sized buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still tolerable.

With AIO this becomes a considerably bigger issue:

a) We can't just have one write in progress at a time, but many

b) To be able to implement AIO using workers, the "source" or "target" memory of the IO needs to be in shared memory

Besides that, the need to copy the buffers makes checkpoints with AIO noticeably slower when checksums are enabled - it's not the checksum but the copy that's the biggest source of the slowdown.

So far the AIO patchset has solved this by introducing a set of "bounce buffers", which can be acquired and used as the source/target of IO when doing it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce buffers becomes a performance issue (that's how I did it at first, and it was hard to avoid regressions) or we reserve bounce buffers for each backend, in which case the memory overhead for instances with a relatively small amount of shared_buffers and/or many connections can be significant.

Which led me down the path of trying to avoid the need for the copy in the first place: What if we don't modify pages while they're undergoing IO?

The naive approach would be to not set hint bits with just a shared lock - but that doesn't seem viable at all. For performance we rely on hint bits being set, and in many cases we'll only encounter the page in shared mode. We could implement a conditional lock upgrade to an exclusive lock and do so while setting hint bits, but that'd obviously be concerning from a concurrency point of view.

What I suspect we might want instead is something in between a share and an exclusive lock, which is taken while setting a hint bit and which conflicts with having an IO in progress.

At first blush it might sound attractive to introduce this at the level of lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more complicated, which would imply overhead for other users, while the precise semantics would be fairly specific to buffer locking. A variant of this would be to generalize lwlock.c to allow implementing different kinds of locks more easily. But that's a significant project on its own and doesn't really seem necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when BM_IO_IN_PROGRESS isn't already set, and StartBufferIO has to wait for BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set hint bits in SetHintBits(), unfortunately leads to a significant performance regression. While the performance is unaffected for OLTPish workloads like pgbench (both read and write), sequential scans of unhinted tables regress significantly, due to the per-tuple lock acquisition this would imply.
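For concreteness, here is a loose sketch of what acquiring and releasing that permission could look like at the buffer-state level. The flag value, function names and details are illustrative assumptions for this mail, not code from the actual patchset (which, among other things, also has to coordinate with the buffer header spinlock and find a free bit in the state word):

#include "postgres.h"
#include "storage/buf_internals.h"

/* hypothetical flag; where exactly to fit it into the state word is a separate question */
#define BM_SETTING_HINTS (1U << 21)

/* Try to get the right to set hint bits; fails if a write is in flight. */
static bool
BufferBeginSetHintBits(BufferDesc *desc)
{
	uint32		old_state = pg_atomic_read_u32(&desc->state);

	for (;;)
	{
		if (old_state & BM_IO_IN_PROGRESS)
			return false;		/* buffer is being written out, leave it alone */

		if (pg_atomic_compare_exchange_u32(&desc->state, &old_state,
										   old_state | BM_SETTING_HINTS))
			return true;		/* we now hold the right to set hint bits */

		/* the failed CAS refreshed old_state; retry */
	}
}

/* Give the right back; StartBufferIO() would wait for this before starting IO. */
static void
BufferFinishSetHintBits(BufferDesc *desc)
{
	pg_atomic_fetch_and_u32(&desc->state, ~BM_SETTING_HINTS);
	/* would also need to wake up a StartBufferIO() waiting on the flag */
}

SetHintBits() would only actually modify the tuple header when BufferBeginSetHintBits() succeeds, and silently skip the hint otherwise - doing that once per tuple is the per-tuple overhead described above.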
But: We can address this and improve performance over the status quo! Today we determine tuple visibility one-by-one, even when checking the visibility of an entire page worth of tuples. That's not exactly free. I've prototyped checking visibility of an entire page of tuples at once, and it indeed speeds up visibility checks substantially (in some cases seqscans are over 20% faster!).

Once we have page-level visibility checks we can get the right to set hint bits once for an entire page instead of doing it for every tuple - with that in place the "new approach" of setting hint bits only with BM_SETTING_HINTS wins.

Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set on the page, and we don't mark a page dirty on hot standby. Which often will result in hint bits not persistently set on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint bits being set), just to then immediately log another WAL record for the same page (e.g. for pruning), which is obviously wasteful. With a different interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.

Does this sound like a reasonable idea? Counterpoints?

If it does sound reasonable, I'll clean up my pile of hacks into something semi-understandable...

Greetings,

Andres Freund
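To illustrate the shape of such a page-level check, here is a minimal sketch; the function name, signature and the dense output array are assumptions made purely for illustration, and the actual prototype may well differ:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

/*
 * Check the visibility of a whole page worth of tuples in one go, returning
 * the number of visible tuples and their offsets in vistuples_dense.  The
 * point is that per-page work - like acquiring the right to set hint bits -
 * only has to happen once per page rather than once per tuple.
 */
static int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
							int ntuples, HeapTupleData *tuples,
							OffsetNumber *vistuples_dense)
{
	int			nvis = 0;

	for (int i = 0; i < ntuples; i++)
	{
		HeapTuple	tup = &tuples[i];

		if (HeapTupleSatisfiesVisibility(tup, snapshot, buffer))
			vistuples_dense[nvis++] =
				ItemPointerGetOffsetNumber(&tup->t_self);
	}

	return nvis;
}

A caller like heap_prepare_pagescan() could fill the tuples array once per page and afterwards work off the dense array of visible offsets.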
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> So far the AIO patchset has solved this by introducing a set of "bounce
> buffers", which can be acquired and used as the source/target of IO when doing
> it in-place into shared buffers isn't viable.
>
> I am worried about that solution however, as either acquisition of bounce
> buffers becomes a performance issue (that's how I did it at first, it was hard
> to avoid regressions) or we reserve bounce buffers for each backend, in which
> case the memory overhead for instances with a relatively small amount of
> shared_buffers and/or many connections can be significant.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the
> visibility of an entire page worth of tuples. That's not exactly free. I've
> prototyped checking visibility of an entire page of tuples at once and it
> indeed speeds up visibility checks substantially (in some cases seqscans are
> over 20% faster!).

Nice! It sounds like you refactored the relationship between heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint bit setting upward or the iterate-over-tuples downward. Is that about right?

> Once we have page-level visibility checks we can get the right to set hint
> bits once for an entire page instead of doing it for every tuple - with that
> in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the page-level change w/o the BM_SETTING_HINTS change?

> Having a page level approach to setting hint bits has other advantages:
>
> E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
> on the page and we don't mark a page dirty on hot standby. Which often will
> result in hint bits not persistently set on replicas until the page is frozen.

Nice way to improve that.

> Does this sound like a reasonable idea? Counterpoints?

I guess the main part left to discuss is index scans or other scan types where we'd either not do page-level visibility or we'd do page-level visibility including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't show up so readily in index scan profiles, but the cost is still there. How should we think about comparing the distributed cost of the buffer header manipulations during index scans vs. the costs of bounce buffers?

Thanks,
nm
Hi,

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > So far the AIO patchset has solved this by introducing a set of "bounce
> > buffers", which can be acquired and used as the source/target of IO when doing
> > it in-place into shared buffers isn't viable.
> >
> > I am worried about that solution however, as either acquisition of bounce
> > buffers becomes a performance issue (that's how I did it at first, it was hard
> > to avoid regressions) or we reserve bounce buffers for each backend, in which
> > case the memory overhead for instances with a relatively small amount of
> > shared_buffers and/or many connections can be significant.
> >
> > But: We can address this and improve performance over the status quo! Today we
> > determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples. That's not exactly free. I've
> > prototyped checking visibility of an entire page of tuples at once and it
> > indeed speeds up visibility checks substantially (in some cases seqscans are
> > over 20% faster!).
>
> Nice! It sounds like you refactored the relationship between
> heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
> bit setting upward or the iterate-over-tuples downward. Is that about right?

I've tried about five variations, so I don't have one answer to this yet :).

One problem is that having repeated loops doing PageGetItemId(), PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be hidden by allowing for better superscalar execution, but not entirely.

I've been somewhat confused by the compiler-generated code around ItemId handling for a while; it looks way more expensive than it should, and it regularly is a bottleneck due to the sheer number of instructions being executed, leading to the code being frontend bound. But I never quite looked into it deeply enough to figure out what's causing it / how to fix it.

> > Once we have page-level visibility checks we can get the right to set hint
> > bits once for an entire page instead of doing it for every tuple - with that
> > in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> > wins.
>
> How did page-level+BM_SETTING_HINTS performance compare to performance of the
> page-level change w/o the BM_SETTING_HINTS change?

Just ran that. There probably is a performance difference, but it's small (<0.5%), making it somewhat hard to be certain. It looks like the main reason for that is that ConditionVariableBroadcast() on the iocv shows up even though nobody is waiting.

I've been fighting that with AIO as well, so maybe it's time to figure out the memory ordering rules that'd allow checking that without a full spinlock acquisition. If we figure it out, we perhaps should use the chance to get rid of BM_PIN_COUNT_WAITER...

> > Does this sound like a reasonable idea? Counterpoints?
>
> I guess the main part left to discuss is index scans or other scan types where
> we'd either not do page-level visibility or we'd do page-level visibility
> including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
> show up so readily in index scan profiles, but the cost is still there.

I could indeed not make it show up in some simple index-lookup-heavy workloads. I need to try some more extreme cases though (e.g. fetching all tuples in a table via an index, or having very long HOT chains).

If it's not visible cost-wise compared to all the other costs of index scans - does it matter? If it's not visible, it's either because it proportionally is very small or because it's completely hidden by superscalar execution.

Another thing I forgot to mention that probably fits into the "tradeoffs" bucket: Because BM_SETTING_HINTS would be just a single bit, one backend setting hint bits would block out another backend setting hint bits. In most situations that'll be fine, or even faster than not doing so due to reducing cache line ping-pong, but in cases of multiple backends doing index lookups to different unhinted tuples on the same page it could be a bit worse. But I suspect that's fine, because it's hard to believe that you could have enough index lookups to unhinted tuples for that to be a bottleneck - something has to produce all those unhinted tuples after all, and that's rather more expensive. And for single-tuple visibility checks the window in which hint bits are set is very small.

> How should we think about comparing the distributed cost of the buffer
> header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be borne as long as postgres is up, whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't really show up. If there are cases where the cost of the more expensive hint bit logic does show up, it'll get a lot harder to weigh.

So far my prototype uses the path that avoids hint bits being set while IO is going on all the time, not just when checksums are enabled. We could change that, I guess. However, our habit of modifying buffers while IO is going on is causing issues with filesystem level checksums as well, as evidenced by the fact that debug_io_direct = data on btrfs causes filesystem corruption. So I tend to think it'd be better to just stop doing that altogether (we also do that for WAL, when writing out a partial page, but a potential fix there would be different, I think).

A thing I forgot to mention: Bounce buffers are kind of an architectural dead end, in that we wouldn't need them in that form if we get to a threaded postgres.

Zooming out (a lot) more: I like the idea of having a way to get the permission to perform some kinds of modifications on a page without an exclusive lock. While obviously a lot more work, I do think there's some potential to have some fast-paths that perform work on a page level without blocking out readers. E.g. some simple cases of insert could correctly be done without blocking out readers (by ordering the update of the max page offset

Greetings,

Andres Freund
On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
> Just ran that. There probably is a performance difference, but it's small
> (<0.5%), making it somewhat hard to be certain. It looks like the main reason
> for that is that ConditionVariableBroadcast() on the iocv shows up even though
> nobody is waiting.

. o O { Gotta fix that. Memory barriers might be enough to check for an empty wait list? And even in the slow path, atomic wait lists or something better than spinlocks... }

> However, our habit of modifying buffers while IO is going on is
> causing issues with filesystem level checksums as well, as evidenced by the
> fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> tend to think it'd be better to just stop doing that altogether (we also do
> that for WAL, when writing out a partial page, but a potential fix there would
> be different, I think).

+many. Interesting point re the WAL variant. For the record, here's some discussion and a repro for that problem, which Andrew currently works around in a build farm animal with mount options:

https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
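The fast path being mused about here might look roughly like the following. Whether the barrier shown is actually sufficient ordering is exactly the open question, so treat this purely as an illustration of the idea; the wrapper name is made up:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"
#include "storage/proclist.h"

/*
 * Hypothetical fast-path wrapper: skip the spinlock entirely when the wait
 * list is observably empty.  The full barrier is meant to order the caller's
 * prior state change (e.g. clearing BM_IO_IN_PROGRESS) before the emptiness
 * check, mimicking what taking the spinlock provides today.
 */
static void
ConditionVariableBroadcastIfWaiters(ConditionVariable *cv)
{
	pg_memory_barrier();

	if (proclist_is_empty(&cv->wakeup))
		return;					/* nobody can be waiting, nothing to do */

	ConditionVariableBroadcast(cv);		/* existing slow path */
}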
On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:
> On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> > On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > > Besides that, the need to copy the buffers makes checkpoints with AIO
> > > noticeably slower when checksums are enabled - it's not the checksum but the
> > > copy that's the biggest source of the slowdown.

How big is that copy's contribution to the slowdown there? A measurable CPU overhead on writes likely does outweigh the unmeasurable overhead on index scans, but ...

> > > Does this sound like a reasonable idea? Counterpoints?
> > How should we think about comparing the distributed cost of the buffer
> > header manipulations during index scans vs. the costs of bounce buffers?
>
> Well, the cost of bounce buffers would be borne as long as postgres is up,
> whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
> really show up.

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad decision. From what I've heard so far of the performance effects, if it were me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce buffer removal as a distinct project after the main AIO capability. Bounce buffers have an implementation. They aren't harming other design decisions. The AIO project is big, so I'd want to err on the side of not designating other projects as its prerequisites.

> Zooming out (a lot) more: I like the idea of having a way to get the
> permission to perform some kinds of modifications on a page without an
> exclusive lock. While obviously a lot more work, I do think there's some
> potential to have some fast-paths that perform work on a page level without
> blocking out readers. E.g. some simple cases of insert could correctly be done
> without blocking out readers (by ordering the update of the max page offset

True.
Andres Freund <andres@anarazel.de> wrote:

> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
>
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

An alternative approach: introduce a flag that tells that the checksum is being computed, and disallow setting hint bits when that flag is set. As long as the checksum computation takes much less time than the IO, fewer hint bit updates should be rejected.

Of course, SetHintBits() would have to update the checksum too. But if it could determine where the hint bit is located in the buffer, and if "some intermediate state" of the computation was maintained for each page in shared buffers, then the checksum update might be cheaper than the initial computation. But I'm not sure I understand the algorithm well enough.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote:
> Andres Freund <andres@anarazel.de> wrote:
>
> > What I'd instead like to propose is to implement the right to set hint bits as
> > a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> > named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> > BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> > BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to set
> > hint bits in SetHintBits() unfortunately leads to a significant performance
> > regression. While the performance is unaffected for OLTPish workloads like
> > pgbench (both read and write), sequential scans of unhinted tables regress
> > significantly, due to the per-tuple lock acquisition this would imply.
>
> An alternative approach: introduce a flag that tells that the checksum is
> being computed, and disallow setting hint bits when that flag is set. As long
> as the checksum computation takes much less time than the IO, fewer hint
> bit updates should be rejected.

Well, the checksum actually should not be computed during the IO, so the IO would still disallow hint bit updates :-(

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
> > However, our habit of modifying buffers while IO is going on is
> > causing issues with filesystem level checksums as well, as evidenced by the
> > fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> > tend to think it'd be better to just stop doing that altogether (we also do
> > that for WAL, when writing out a partial page, but a potential fix there would
> > be different, I think).
>
> +many. Interesting point re the WAL variant. For the record, here's
> some discussion and a repro for that problem, which Andrew currently
> works around in a build farm animal with mount options:
>
> https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com

Here's an interesting new development in that area, this time from OpenZFS, which committed its long-awaited O_DIRECT support a couple of weeks ago[1] and seems to have taken a different direction since that last discussion.

Clearly it has the same checksum stability problem as BTRFS and PostgreSQL itself, so an O_DIRECT mode with the goal of avoiding copying and caching must confront that and break *something*, or accept something like bounce buffers and give up the zero-copy goal. Curiously, they seem to have landed on two different solutions with three different possible behaviours: (1) on FreeBSD, temporarily make the memory non-writeable; (2) on Linux, they couldn't do that, so they have an extra checksum verification on write.

I haven't fully grokked all this yet, or even tried it, and it's not released or anything, but it looks a bit like all three behaviours are bad for our current hint bit design: on FreeBSD, setting a hint bit might crash (?) if a write is in progress in another process, and on Linux, depending on zfs_vdev_direct_write_verify, either the concurrent write might fail (= checkpointer failing on EIO because someone concurrently set a hint bit) or a later read might fail (= file is permanently corrupted and you don't find out until later, like btrfs). I plan to look more closely soon and see if I understood that right...

[1] https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda
On 30/10/2024 04:21, Andres Freund wrote:
> Attached is a, unfortunately long, series of patches implementing what I
> described upthread.

Review of the preparatory patches:

> 0001 Add very basic test for kill_prior_tuples
>
> We currently don't exercise this patch for gist and hash, which seems
> somewhat criminal.

Interesting to use the isolationtester for this. There's just one session, so you're just using it to define reusable steps with handy names. I'm fine with that, but please add a comment to explain it. I wonder if it'd be more straightforward to make it a regular pg_regress test though. There would be some repetition, but would it be so bad?

You forgot to add the new test to 'isolation_schedule'.

typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something approximately reasonable"

> 0002 heapam: Move logic to handle HEAP_MOVED into a helper function
>
> A prep patch, which consolidates HEAP_MOVED handling into a helper. This
> isn't strictly necessary, but I got bothered by it while writing this patch
> series.

+1

> +/*
> + * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
> + * adjust hint bits. See the comment for SetHintBits() for more background.
> + *
> + * This helper returns false if the row ought to be invisible, true otherwise.
> + */
> +static inline bool
> +HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
> +{
> +	TransactionId xvac;
> +
> +	/* only used by pre-9.0 binary upgrades */
> +	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
> +		return true;
> +
> +	...
> +}

This is so unlikely these days that this function probably should not be inlined anymore.

> 0003 bufmgr: Add BufferLockHeldByMe()
>
> Useful for writing assertions in later patches.

> +	if (mode == BUFFER_LOCK_EXCLUSIVE)
> +		lwmode = LW_EXCLUSIVE;
> +	else if (mode == BUFFER_LOCK_SHARE)
> +		lwmode = LW_SHARED;
> +	else
> +	{
> +		Assert(false);
> +		pg_unreachable();
> +		lwmode = LW_EXCLUSIVE;	/* assuage compiler */
> +	}

I just learned a new word :-). This is fine, but to avoid the assuaging, you could also do this:

if (mode == BUFFER_LOCK_EXCLUSIVE)
	lwmode = LW_EXCLUSIVE;
else
{
	Assert(mode == BUFFER_LOCK_SHARE);
	lwmode = LW_SHARED;
}

> 0004 heapam: Use exclusive lock on old page in CLUSTER
>
> When I started looking into this I wanted to make sure that it's actually
> safe to skip setting hint bits in all places. One place that's *not* safe
> is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
>
> This might not strictly be required, but it seemed very fragile as
> is. While using an exclusive lock does seem like it makes it a bit safer,
> it's not a very good fix, as we still dirty the old buffer, which seems
> like a shame when one uses VACUUM FULL.

> +	/*
> +	 * To be able to guarantee that we can set the hint bit, acquire an
> +	 * exclusive lock on the old buffer. We need the hint bits to be set
> +	 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
> +	 * heap_freeze_tuple() will get confused.
> +	 *

Hmm, I don't see any comments in the reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits to be set. How does it get confused?

> 0005 heapam: Only set tuple's block once per page in pagemode
>
> Minor, but measurable, optimization.

+1, this makes sense independently of all the other patches.

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 1748eafa100..d0cb4b1e29b 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
>  		linesleft = scan->rs_ntuples;
>  		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
>
> +		/* set block once per page, instead of doing so for every tuple */
> +		BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
> +
>  		/* lineindex now references the next or previous visible tid */
>  continue_page:
>
> @@ -995,7 +998,7 @@ continue_page:
>
>  		tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
>  		tuple->t_len = ItemIdGetLength(lpp);
> -		ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
> +		tuple->t_self.ip_posid = lineoff;
>
>  		/* skip any tuples that don't match the scan key */
>  		if (key != NULL &&

Should probably use ItemPointerSetBlockNumber and ItemPointerSetOffsetNumber.

> 0006 heapam: Add batch mode mvcc check and use it in page mode
>
> This is a good bit faster on its own, but is required to avoid a
> performance regression later, when setting hint bits only when no IO is
> going on at the same time.

The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd make any difference how you pass those arguments.

> +			if (likely(valid))
> +			{
> +				vistuples_dense[nvis] = tup->t_self.ip_posid;
> +				nvis++;
> +			}

Is the likely(valid) hint important for performance here? It would be a bad hint for a table with lots of dead tuples. I feel we'd better rely on the branch predictor for this one.

> 0007 bufmgr: Make it easier to change number of buffer state bits
>
> This just makes it easier to change the division of bits in
> BufferDesc->state.

Looks good to me.

--
Heikki Linnakangas
Neon (https://neon.tech)
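For reference, with the itemptr.h macros that suggestion refers to, the two hunks above would presumably end up looking something like this:

	/* once per page */
	ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);

	/* once per returned tuple */
	ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);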
On 24/09/2024 18:55, Andres Freund wrote:
> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
>
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear BM_SETTING_HINTS every time. Could we put the overhead on the FlushBuffer() instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and immediately downgrade back to share mode.
5. Calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the amount of changes could be pretty small. The missing operation we don't have today is reading the share-lock count on the lock in step 3. That seems simple to add.

Acquiring the exclusive lock in step 4 is just a way to wait for all the existing share-lockers to release the lock. You wouldn't need to block new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but it currently ignores share-lockers. So doing this "properly" would require more changes to LWLocks. Briefly acquiring an exclusive lock seems acceptable though.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the
> visibility of an entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.

--
Heikki Linnakangas
Neon (https://neon.tech)
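A sketch of what the flush side of that protocol could look like is below. The share-locker-count read in step 3 is exactly the operation that doesn't exist today, so LWLockSharedLockerCount() is a made-up stand-in, and the rest is likewise illustrative rather than real bufmgr.c code:

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static void
FlushBufferWithStableHints(BufferDesc *buf)
{
	LWLock	   *lock = BufferDescriptorGetContentLock(buf);

	LWLockAcquire(lock, LW_SHARED);				/* step 1 */

	/* step 2: set BM_IO_IN_PROGRESS, as StartBufferIO() does today */

	if (LWLockSharedLockerCount(lock) > 1)		/* step 3: hypothetical API */
	{
		/*
		 * Step 4: wait out pre-existing share lockers, who may have seen
		 * BM_IO_IN_PROGRESS as unset and could still be writing hint bits.
		 * Anyone locking the page from now on sees BM_IO_IN_PROGRESS and
		 * refrains from setting hints.
		 */
		LWLockRelease(lock);
		LWLockAcquire(lock, LW_EXCLUSIVE);
		LWLockRelease(lock);
		LWLockAcquire(lock, LW_SHARED);
	}

	/* step 5: compute the checksum and issue the write */

	/* step 6: clear BM_IO_IN_PROGRESS (TerminateBufferIO()) and unlock */
	LWLockRelease(lock);
}

Since LWLocks have no in-place upgrade, the "upgrade and immediately downgrade" of step 4 is spelled here as release-and-reacquire, which is one of the places where the devil would be in the details.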
Hi,

On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> On 24/09/2024 18:55, Andres Freund wrote:
> > What I'd instead like to propose is to implement the right to set hint bits as
> > a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> > named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> > BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> > BM_SETTING_HINTS to be unset to start IO.
> >
> > Naively implementing this, by acquiring and releasing the permission to set
> > hint bits in SetHintBits() unfortunately leads to a significant performance
> > regression. While the performance is unaffected for OLTPish workloads like
> > pgbench (both read and write), sequential scans of unhinted tables regress
> > significantly, due to the per-tuple lock acquisition this would imply.
>
> It'd be nice to implement this without having to set and clear
> BM_SETTING_HINTS every time.

True - but it's worth noting that we only need to set BM_SETTING_HINTS if there are any hint bits that need to be set. So in the most common paths we won't need to deal with it at all.

> Could we put the overhead on the FlushBuffer()
> instead?
>
> How about something like this:
>
> To set hint bits:
>
> 1. Lock page in SHARED mode.
> 2. Read BM_IO_IN_PROGRESS
> 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> 4. Unlock page
>
> To flush a buffer:
>
> 1. Lock page in SHARED mode
> 2. Set BM_IO_IN_PROGRESS
> 3. Read the lock count on the buffer lock, to see if we're the only locker.
> 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> immediately downgrade back to share mode.
> 5. calculate CRC, flush the buffer
> 6. Clear BM_IO_IN_PROGRESS and unlock page.
>
> This goes back to the idea of adding LWLock support for this, but the amount
> of changes could be pretty small. The missing operation we don't have today
> is reading the share-lock count on the lock in step 3. That seems simple to
> add.

I've played around with a bunch of ideas like this. There are two main reasons I didn't like them that much in the end:

1) The worst case latency impacts seemed to make them not that interesting. A buffer that is heavily contended with share locks might not get down to zero share lockers for quite a while. That's not a problem for individual FlushBuffer() calls, but it could very well add up to a decent sized delay for something like a checkpoint that has to flush a lot of buffers.

Waiting for all pre-existing share lockers is easier said than done. We don't record the acquisition order anywhere, and a share-lock release won't wake anybody if the lock count doesn't reach zero. Changing that wouldn't exactly be free, and the cost would be borne by all lwlock users.

2) They are based on holding an lwlock. But it's actually quite conceivable that we'd want to set something hint-bit-like without any lock, as we e.g. currently do for freespace/. That would work with something like the approach I chose here, but not if we rely on lwlocks.

E.g. with a bit of work we could actually do sequential scans without blocking concurrent buffer modifications, by carefully ordering when pd_lower is modified and making sure that tuple header fields are written in a sensible order.

> Acquiring the exclusive lock in step 4 is just a way to wait for all the
> existing share-lockers to release the lock. You wouldn't need to block new
> share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
> it currently ignores share-lockers. So doing this "properly" would require
> more changes to LWLocks. Briefly acquiring an exclusive lock seems
> acceptable though.

I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something like this, because share lockers can acquire/release the lock without waking anybody. And changing that would be rather expensive for everyone...

> > But: We can address this and improve performance over the status quo! Today we
> > determine tuple visibility one-by-one, even when checking the
> > visibility of an entire page worth of tuples.
>
> Yes, batching the visibility checks makes sense in any case.

Cool. The wins by doing that are noticeable...

Greetings,

Andres Freund
Hi,

Thanks for the quick review!

On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:
> On 30/10/2024 04:21, Andres Freund wrote:
> > Attached is a, unfortunately long, series of patches implementing what I
> > described upthread.
>
> Review of the preparatory patches:
>
> > 0001 Add very basic test for kill_prior_tuples
> >
> > We currently don't exercise this patch for gist and hash, which seems
> > somewhat criminal.
>
> Interesting to use the isolationtester for this. There's just one session,
> so you're just using it to define reusable steps with handy names.

Yea. I had started out writing it as a pg_regress style test and it quickly got very hard to understand.

> I'm fine with that, but please add a comment to explain it.

Makes sense.

> I wonder if it'd be more straightforward to make it a regular pg_regress
> test though. There would be some repetition, but would it be so bad?

I found it to be quite bad. If testing just one AM it's ok-ish, but once you test 2-3 it gets very long and repetitive. I guess we could use functions or such to make it a bit less painful - but at that point, is it actually simpler?

> You forgot to add the new test to 'isolation_schedule'.

Oops.

> typos:
> "inex" -> "index"
> "does something approximately reasonble" -> "do something approximately
> reasonable"

Oops^2.

> > +/*
> > + * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
> > + * adjust hint bits. See the comment for SetHintBits() for more background.
> > + *
> > + * This helper returns false if the row ought to be invisible, true otherwise.
> > + */
> > +static inline bool
> > +HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
> > +{
> > +	TransactionId xvac;
> > +
> > +	/* only used by pre-9.0 binary upgrades */
> > +	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
> > +		return true;
> > +
> > +	...
> > +}
>
> This is so unlikely these days that this function probably should not be
> inlined anymore.

Because I put the check for moved into the function, it is important that it's inlined. The compiler can understand that the rest of the function is unlikely to be executed and put it together with other less likely to be executed code.

> > 0003 bufmgr: Add BufferLockHeldByMe()
> >
> > Useful for writing assertions in later patches.
>
> > +	if (mode == BUFFER_LOCK_EXCLUSIVE)
> > +		lwmode = LW_EXCLUSIVE;
> > +	else if (mode == BUFFER_LOCK_SHARE)
> > +		lwmode = LW_SHARED;
> > +	else
> > +	{
> > +		Assert(false);
> > +		pg_unreachable();
> > +		lwmode = LW_EXCLUSIVE;	/* assuage compiler */
> > +	}
>
> I just learned a new word :-).

Heh.

> This is fine, but to avoid the assuaging, you could also do this:
>
> if (mode == BUFFER_LOCK_EXCLUSIVE)
> 	lwmode = LW_EXCLUSIVE;
> else
> {
> 	Assert(mode == BUFFER_LOCK_SHARE);
> 	lwmode = LW_SHARED;
> }

I have no opinion about which version is better...

> > 0004 heapam: Use exclusive lock on old page in CLUSTER
> >
> > When I started looking into this I wanted to make sure that it's actually
> > safe to skip setting hint bits in all places. One place that's *not* safe
> > is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
> >
> > This might not strictly be required, but it seemed very fragile as
> > is. While using an exclusive lock does seem like it makes it a bit safer,
> > it's not a very good fix, as we still dirty the old buffer, which seems
> > like a shame when one uses VACUUM FULL.
>
> > +	/*
> > +	 * To be able to guarantee that we can set the hint bit, acquire an
> > +	 * exclusive lock on the old buffer. We need the hint bits to be set
> > +	 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
> > +	 * heap_freeze_tuple() will get confused.
> > +	 *
>
> Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
> rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits
> to be set.

There's this comment in heap_prepare_freeze_tuple():

 * It is assumed that the caller has checked the tuple with
 * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
 * (else we should be removing the tuple, not freezing it).

I don't remember the precise details of what all goes wrong; I think there was more than one avenue. One example is stuff like this:

	/*
	 * If the tuple has been updated, check the old-to-new mapping hash table.
	 */
	if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
		  HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
		!HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
		!(ItemPointerEquals(&(old_tuple->t_self),
							&(old_tuple->t_data->t_ctid))))

If the original tuple had an xmax but we didn't unset it in HeapTupleSatisfiesVacuum(), due to not being allowed to write a hint bit, we'll behave differently here. But at the same time heapam_relation_copy_for_cluster() will treat the tuple as dead and call rewrite_heap_dead_tuple().

> How does it get confused?

Corrupted visibility information after a CLUSTER.

> > 0006 heapam: Add batch mode mvcc check and use it in page mode
> >
> > This is a good bit faster on its own, but is required to avoid a
> > performance regression later, when setting hint bits only when no IO is
> > going on at the same time.
>
> The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd
> make any difference how you pass those arguments.

I unfortunately found it to make a difference :(, even after trying to account for binary layout differences by randomizing section order via linker flags.

> > +			if (likely(valid))
> > +			{
> > +				vistuples_dense[nvis] = tup->t_self.ip_posid;
> > +				nvis++;
> > +			}
>
> Is the likely(valid) hint important for performance here? It would be a bad
> hint for a table with lots of dead tuples. I feel we'd better rely on the
> branch predictor for this one.

It did :(

Greetings,

Andres Freund
On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
> What I suspect we might want instead is something in between a share
> and an
> exclusive lock, which is taken while setting a hint bit and which
> conflicts
> with having an IO in progress.

I am starting to wonder if shared content locks are really the right concept at all. It makes sense for simple mutexes, but we are already more complex than that, and you are suggesting adding on to that complexity. Which I agree is a good idea, I'm just wondering if we could go even further.

The README states that a shared lock is necessary for visibility checking, but can we just be more careful with the ordering and atomicity of visibility changes in the page?

 * carefully order reads and writes of xmin/xmax/hints (would that create too many read barriers in the tqual.c code?)
 * write line pointer after tuple is written

We would still have pins and cleanup locks to prevent data removal. We'd have the logic you suggest that would prevent modification during IO. And there would still need to be exclusive content locks so that two inserts don't try to allocate the same line pointer, or lock the same tuple.

If PD_ALL_VISIBLE is set it's even simpler.

Am I missing some major hazards?

Regards,

Jeff Davis
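To illustrate the "write line pointer after tuple is written" ordering, here is a toy sketch, under the assumption that the new item is appended as the page's last line pointer; readers would need matching read barriers, further barriers might be needed between the individual header updates, and the real PageAddItem() does far more - so this is only meant to make the idea concrete:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

static void
page_add_tuple_ordered(Page page, Item tuple, Size size, OffsetNumber offnum)
{
	PageHeader	phdr = (PageHeader) page;
	Size		alignedSize = MAXALIGN(size);
	Size		upper = phdr->pd_upper - alignedSize;
	ItemId		itemId = PageGetItemId(page, offnum);

	/* 1. copy the tuple data into the hole */
	memcpy((char *) page + upper, tuple, size);

	/* 2. make sure the tuple contents are visible to other backends ... */
	pg_write_barrier();

	/* 3. ... before the line pointer and pd_lower advertise its existence */
	ItemIdSetNormal(itemId, upper, size);
	phdr->pd_upper = (LocationIndex) upper;
	phdr->pd_lower += sizeof(ItemIdData);	/* offnum assumed to be the new last slot */
}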
Hi,

On 2024-10-30 10:47:35 -0700, Jeff Davis wrote:
> On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
> > What I suspect we might want instead is something in between a share
> > and an
> > exclusive lock, which is taken while setting a hint bit and which
> > conflicts
> > with having an IO in progress.
>
> I am starting to wonder if shared content locks are really the right
> concept at all. It makes sense for simple mutexes, but we are already
> more complex than that, and you are suggesting adding on to that
> complexity.

What I am proposing isn't making the content lock more complicated, it's orthogonal to the content lock.

> Which I agree is a good idea, I'm just wondering if we could go even
> further.
>
> The README states that a shared lock is necessary for visibility
> checking, but can we just be more careful with the ordering and
> atomicity of visibility changes in the page?
>
> * carefully order reads and writes of xmin/xmax/hints (would
> that create too many read barriers in the tqual.c code?)
> * write line pointer after tuple is written

It's possible, but it'd be a lot of work. And you wouldn't need to just do this for heap, but for all the indexes too, to make progress on the don't-set-hint-bits-during-io front. So I don't think it makes sense to tie these things together.

I do think that it's an argument for not importing all the complexity into lwlock.c though.

> We would still have pins and cleanup locks to prevent data removal.

As-is cleanup locks only work in coordination with content locks. While cleanup is ongoing we need to prevent anybody from starting to look at the page - without acquiring something like a shared lock that's not easy.

> We'd have the logic you suggest that would prevent modification during
> IO. And there would still need to be exclusive content locks so that
> two inserts don't try to allocate the same line pointer, or lock the
> same tuple.
>
> If PD_ALL_VISIBLE is set it's even simpler.
>
> Am I missing some major hazards?

I don't think anything fundamental, but it's a decidedly nontrivial amount of work.

Greetings,

Andres Freund
Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer()
> > instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> > immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the amount
> > of changes could be pretty small. The missing operation we don't have today
> > is reading the share-lock count on the lock in step 3. That seems simple to
> > add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that
> interesting. A buffer that is heavily contended with share locks might not
> get down to zero share lockers for quite a while. That's not a problem for
> individual FlushBuffer() calls, but it could very well add up to a decent
> sized delay for something like a checkpoint that has to flush a lot of
> buffers.
>
> Waiting for all pre-existing share lockers is easier said than done. We
> don't record the acquisition order anywhere and a share-lock release won't
> wake anybody if the lock count doesn't reach zero. Changing that wouldn't
> exactly be free and the cost would be borne by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
> that we'd want to set something hint-bit-like without any lock, as we
> e.g. currently do for freespace/. That would work with something like the
> approach I chose here, but not if we rely on lwlocks.
>
> E.g. with a bit of work we could actually do sequential scans without
> blocking concurrent buffer modifications by carefully ordering when
> pd_lower is modified and making sure that tuple header fields are written
> in a sensible order.

I still don't like this idea a whole lot - but perhaps we could reduce the overhead of my proposal some, to get closer to yours.

When setting hint bits for many tuples on a page the overhead of my approach is negligible, but when doing it for individual tuples it's a bit less negligible. We can reduce the efficiency difference substantially by adding a bufmgr.c API that sets hints on a page. That function can set the hint bit while holding the buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and thus also doesn't need to do resowner.c accounting.

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as they otherwise would hide small efficiency differences
c) used an unlogged table
and measured the performance difference for a previously-unhinted sequential scan of a narrow table that immediately discards all tuples due to OFFSET - afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower. With the batch optimization enabled, the patchset is 7.5% faster.

I also looked at the performance impact on scans that cannot use the batched approach. The worst case I could think of was a large ordered indexscan of a previously unhinted table.

For an index-only scan (IOS), the performance difference is a slowdown of 0.65%. But the difference being so small is partially just due to IOS being considerably slower than a plain index scan when all tuples need to be fetched from the table (we need to address that...).

Forcing a non-IOS scan using enable_indexonlyscan, I get a slowdown of 5.0%.

Given that this is an intentionally pathological workload - there's no IO / the workload is CPU bound, yet the data is only ever accessed once, with a query that doesn't ever actually look at the data - I'm quite happy with that. As soon as I make it a query that actually uses the data, the difference vanishes in the noise.

Greetings,

Andres Freund
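For illustration, the bufmgr.c helper mentioned above - setting a hint while holding the buffer header spinlock, so that neither BM_SETTING_HINTS nor resowner.c accounting is needed for a one-off hint - might look roughly like this. The name and exact shape are assumptions rather than the actual patch, and local buffers are ignored for brevity:

#include "postgres.h"
#include "storage/buf_internals.h"

/*
 * Set bits in a 16-bit hint field (e.g. a tuple's t_infomask) in a shared
 * buffer, skipping the update if the buffer is currently being written out.
 */
void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
{
	BufferDesc *desc = GetBufferDescriptor(buffer - 1);
	uint32		buf_state = LockBufHdr(desc);

	/* only touch the page if no write is in flight */
	if (!(buf_state & BM_IO_IN_PROGRESS))
		*ptr |= val;

	UnlockBufHdr(desc, buf_state);

	/* the caller would still need MarkBufferDirtyHint()-style handling */
}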