AIO writes vs hint bits vs checksums - Mailing list pgsql-hackers
From: Andres Freund
Subject: AIO writes vs hint bits vs checksums
Date:
Msg-id: stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
List: pgsql-hackers
Hi,

Currently we modify pages while holding just a share lock, for hint bit writes. Writing a buffer out also only requires a share lock. Because of that we can't compute checksums and write out pages in-place, as a concurrent hint bit write can easily corrupt the checksum.

That's not great, but not awful, for our current synchronous buffered IO, as we only ever have a single page being written out at a time.

However, it becomes a problem even if we just want to write out in chunks larger than a single page - we'd need to reserve not just one BLCKSZ sized buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still tolerable.

With AIO this becomes a considerably bigger issue:

a) We can't just have one write in progress at a time, but many.

b) To be able to implement AIO using workers, the "source" or "target" memory of the IO needs to be in shared memory.

Besides that, the need to copy the buffers makes checkpoints with AIO noticeably slower when checksums are enabled - it's not the checksum but the copy that is the biggest source of the slowdown.

So far the AIO patchset has solved this by introducing a set of "bounce buffers", which can be acquired and used as the source/target of IO when doing it in-place into shared buffers isn't viable. I am worried about that solution, however: either acquisition of bounce buffers becomes a performance issue (that's how I did it at first, and it was hard to avoid regressions), or we reserve bounce buffers for each backend, in which case the memory overhead for instances with a relatively small amount of shared_buffers and/or many connections can be significant.

Which led me down the path of trying to avoid the need for the copy in the first place: What if we don't modify pages while they're undergoing IO?

The naive approach would be to not set hint bits with just a share lock - but that doesn't seem viable at all. For performance we rely on hint bits being set, and in many cases we'll only encounter the page in shared mode. We could implement a conditional lock upgrade to an exclusive lock and do so while setting hint bits, but that'd obviously be concerning from a concurrency point of view.

What I suspect we might want instead is something in between a share and an exclusive lock, which is taken while setting a hint bit and which conflicts with having an IO in progress.

At first blush it might sound attractive to introduce this at the level of lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more complicated, which would imply overhead for other users, while the precise semantics would be fairly specific to buffer locking. A variant of this would be to generalize lwlock.c to allow implementing different kinds of locks more easily. But that's a significant project on its own and doesn't really seem necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I named this BM_SETTING_HINTS. Setting BM_SETTING_HINTS is only allowed while BM_IO_IN_PROGRESS isn't set, and StartBufferIO has to wait for BM_SETTING_HINTS to be unset before starting IO.

Naively implementing this, by acquiring and releasing the permission to set hint bits in SetHintBits(), unfortunately leads to a significant performance regression. While performance is unaffected for OLTP-ish workloads like pgbench (both read and write), sequential scans of unhinted tables regress significantly, due to the per-tuple lock acquisition this would imply.
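To make that interlock concrete, here is a minimal, self-contained sketch using C11 atomics. Only the flag names BM_SETTING_HINTS and BM_IO_IN_PROGRESS come from the proposal; the standalone state word, the function names, and the spin-wait are illustrative stand-ins for the per-buffer state atomic in BufferDesc and the existing IO wait infrastructure:

#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

#define BM_IO_IN_PROGRESS (1u << 0)   /* modeled after the existing flag */
#define BM_SETTING_HINTS  (1u << 1)   /* the proposed new flag */

typedef struct BufferStateSketch
{
    atomic_uint state;                /* stand-in for BufferDesc.state */
} BufferStateSketch;

/*
 * Try to acquire the right to set hint bits.  Setting hint bits is
 * always optional, so if IO is in progress (or, in this toy model,
 * another backend already holds the right) we simply give up.
 */
static bool
hint_bits_try_acquire(BufferStateSketch *buf)
{
    unsigned    old = atomic_load(&buf->state);

    while (!(old & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)))
    {
        if (atomic_compare_exchange_weak(&buf->state, &old,
                                         old | BM_SETTING_HINTS))
            return true;
        /* CAS failure reloaded 'old'; recheck the flags and retry */
    }
    return false;
}

static void
hint_bits_release(BufferStateSketch *buf)
{
    atomic_fetch_and(&buf->state, ~BM_SETTING_HINTS);
}

/*
 * StartBufferIO-like entry point: wait until nobody is setting hint
 * bits or doing IO, then mark the IO as in progress.  Real code would
 * sleep on the buffer's condition variable instead of spinning.
 */
static void
start_buffer_io(BufferStateSketch *buf)
{
    unsigned    old = atomic_load(&buf->state);

    for (;;)
    {
        if (old & (BM_SETTING_HINTS | BM_IO_IN_PROGRESS))
        {
            sched_yield();      /* placeholder for a proper wait */
            old = atomic_load(&buf->state);
        }
        else if (atomic_compare_exchange_weak(&buf->state, &old,
                                              old | BM_IO_IN_PROGRESS))
            return;
    }
}

static void
complete_buffer_io(BufferStateSketch *buf)
{
    atomic_fetch_and(&buf->state, ~BM_IO_IN_PROGRESS);
}

With this shape, checksum computation and the write can safely happen in-place, since no hint bits can be set between start_buffer_io() and complete_buffer_io().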
But: We can address this and improve performance over the status quo! Today we determine tuple visibility one-by-one, even when checking the visibility of an entire page worth of tuples. That's not exactly free. I've prototyped checking the visibility of an entire page of tuples at once, and it indeed speeds up visibility checks substantially (in some cases seqscans are over 20% faster!).

Once we have page-level visibility checks, we can acquire the right to set hint bits once for an entire page instead of doing it for every tuple - with that in place, the "new approach" of setting hint bits only with BM_SETTING_HINTS wins (a sketch follows below).

Having a page-level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set on the page, and we don't mark the page dirty on a hot standby. That often results in hint bits not being persistently set on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint bits being set), just to then immediately log another WAL record for the same page (e.g. for pruning), which is obviously wasteful. With a different interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.
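To make the page-level idea concrete, here's a sketch continuing the toy model above (reusing BufferStateSketch and the hint_bits_* helpers). The tuple layout and the is_committed() lookup are purely illustrative stand-ins for the heap tuple header hint bits and HeapTupleSatisfiesVisibility(); the point is just that the hint-setting right is acquired once per page:

/* Toy tuple: a real heap tuple's hint bits live in t_infomask. */
typedef struct TupleSketch
{
    unsigned    xmin;           /* inserting transaction id */
    bool        hint_committed; /* stand-in for HEAP_XMIN_COMMITTED */
} TupleSketch;

/* Toy stand-in for a transaction status lookup (pg_xact et al). */
static bool
is_committed(unsigned xid)
{
    return xid != 0;            /* arbitrary rule, just for the sketch */
}

/*
 * Determine visibility for all tuples on a "page" at once, taking the
 * hint-setting right a single time.  If IO is in progress, visibility
 * is still computed; we merely skip persisting the hints, just as
 * SetHintBits() may give up today.
 */
static void
page_check_visibility(BufferStateSketch *buf,
                      TupleSketch *tuples, int ntuples,
                      bool *visible)
{
    /* one permit acquisition per page, not per tuple */
    bool        may_set_hints = hint_bits_try_acquire(buf);

    for (int i = 0; i < ntuples; i++)
    {
        bool        committed = tuples[i].hint_committed ||
                                is_committed(tuples[i].xmin);

        if (committed && may_set_hints)
            tuples[i].hint_committed = true;

        visible[i] = committed; /* toy rule: visible iff committed */
    }

    if (may_set_hints)
        hint_bits_release(buf);
}

Besides amortizing the permission check, a batched entry point like this would also be a natural place to later combine the hint-bit WAL logging with other WAL records for the same page.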
Does this sound like a reasonable idea? Counterpoints?

If it does sound reasonable, I'll clean up my pile of hacks into something semi-understandable...

Greetings,

Andres Freund