Re: AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO writes vs hint bits vs checksums |
Date | |
Msg-id | jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5@jhtrriubncjb Whole thread Raw |
In response to | AIO writes vs hint bits vs checksums (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO writes vs hint bits vs checksums
Re: AIO writes vs hint bits vs checksums |
List | pgsql-hackers |
Hi, On 2024-09-24 12:43:40 -0700, Noah Misch wrote: > On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote: > > So far the AIO patchset has solved this by introducing a set of "bounce > > buffers", which can be acquired and used as the source/target of IO when doing > > it in-place into shared buffers isn't viable. > > > > I am worried about that solution however, as either acquisition of bounce > > buffers becomes a performance issue (that's how I did it at first, it was hard > > to avoid regressions) or we reserve bounce buffers for each backend, in which > > case the memory overhead for instances with relatively small amount of > > shared_buffers and/or many connections can be significant. > > > But: We can address this and improve performance over the status quo! Today we > > determine tuple visiblity determination one-by-one, even when checking the > > visibility of an entire page worth of tuples. That's not exactly free. I've > > prototyped checking visibility of an entire page of tuples at once and it > > indeed speeds up visibility checks substantially (in some cases seqscans are > > over 20% faster!). > > Nice! It sounds like you refactored the relationship between > heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint > bit setting upward or the iterate-over-tuples downward. Is that about right? I've tried about five variations, so I don't have one answer to this yet :). One problem is that having repeated loops doing PageGetItemId(), PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be hidden by allowing for better superscalar execution, but not entirely. I've been somewhat confused by the compiler generated code around ItemId handling for a while, it looks way more expensive than it should - it regularly is a bottleneck due to the sheer number of instructions being executed leading to being frontend bound. But never quite looked into it deeply enough to figure out what's causing it / how to fix it. > > Once we have page-level visibility checks we can get the right to set hint > > bits once for an entire page instead of doing it for every tuple - with that > > in place the "new approach" of setting hint bits only with BM_SETTING_HINTS > > wins. > > How did page-level+BM_SETTING_HINTS performance compare to performance of the > page-level change w/o the BM_SETTING_HINTS change? Just ran that. There probably is a performance difference, but it's small (<0.5%) making it somewhat hard to be certain. It looks like the main reason for that is ConditionVariableBroadcast() on the iocv shows up even though nobody is waiting. I've been fighting that with AIO as well, so maybe it's time to figure out the memory ordering rules that'd allow to check that without a full spinlock acquisition. If we figure it out, we perhaps should use the chance to get rid of BM_PIN_COUNT_WAITER... > > Does this sound like a reasonable idea? Counterpoints? > > I guess the main part left to discuss is index scans or other scan types where > we'd either not do page-level visibility or we'd do page-level visibility > including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't > show up so readily in index scan profiles, but the cost is still there. I could indeed not make it show up in some simple index lookup heavy workloads. I need to try some more extreme cases though (e.g. fetching all tuples in a table via an index or having very long HOT chains). If it's not visible cost-wise compared to all the other costs of index scans - does it matter? If it's not visible it's either because it proportionally is very small or because it's completely hidden by superscalar execution. Another thing I forgot to mention that probably fits into the "tradeoffs" bucket: Because BM_SETTING_HINTS would be just a single bit, one backend setting hint bits would block out another backend setting hint bits. In most situations that'll be fine, or even faster than not doing so due to reducing cache line ping-pong, but in cases of multiple backends doing index lookups to different unhinted tuples on the same page it could be a bit worse. But I suspect that's fine because it's hard to believe that you could have enough index lookups to unhinted tuples for that to be a bottleneck - something has to produce all those unhinted tuples after all, and that's rather more expensive. And for single-tuple visibility checks the window in which hint bits are set is very small. > How should we think about comparing the distributed cost of the buffer > header manipulations during index scans vs. the costs of bounce buffers? Well, the cost of bounce buffers would be born as long as postgres is up, whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't really show up. If there are cases where the cost of the more expensive hint bit logic does show up, it'll get a lot harder to weigh. So far my prototype uses the path that avoids hint bits being set while IO is going on all the time, not just when checksums are enabled. We could change that, I guess. However, our habit of modifying buffers while IO is going on is causing issues with filesystem level checksums as well, as evidenced by the fact that debug_io_direct = data on btrfs causes filesystem corruption. So I tend to think it'd be better to just stop doing that alltogether (we also do that for WAL, when writing out a partial page, but a potential fix there would be different, I think). A thing I forgot to mention: Bounce buffers are kind of an architectural dead end, in that we wouln't need them in that form if we get to a threaded postgres. Zooming out (a lot) more: I like the idea of having a way to get the permission to perform some kinds of modifications on a page without an exlusive lock. While obviously a lot more work, I do think there's some potential to have some fast-paths that perform work on a page level without blocking out readers. E.g. some simple cases of insert could correctly be done without blocking out readers (by ordering the update of the max page offset Greetings, Andres Freund
pgsql-hackers by date: