Re: AIO v2.5 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | ym3dqpa4xcvoeknewcw63x77vnqdosbqcetjinb2zfoh65k55m@m4ozmwhr6lk6 Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
Hi, On 2025-03-28 08:54:42 -0400, Andres Freund wrote: > On 2025-03-27 20:22:23 -0700, Noah Misch wrote: > > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote: > > > - We need to register a local callback for shared buffer reads, which don't > > > need them today . That's a small bit of added overhead. It's a shame to do > > > so for counters that approximately never get incremented. > > > > Fair concern. An idea is to let the complete_shared callback change the > > callback list associated with the IO, so it could change > > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW. The > > latter would differ from the former only in having the extra local callback. > > Could that help? I think the only overhead is using more PGAIO_HCB numbers. > > I think changing the callback could work - I'll do some measurements in a > coffee or two, but I suspect the overhead is not worth being too worried about > for now. There's a different aspect that worries me slightly more, see > further down. > ... > Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a > local hashtable. That's more expensive than an indirect function call > (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if > not we can apply a mini-cache for the current database, which is surely the > only thing that ever matters for performance. I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a difference of either having the unnecessary complete_local callback or having the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are a few hits inside pgstat_get_entry_ref(), but not enough to matter. Hence I think this isn't worth worrying about, at least for now. I think we have far bigger fish to fry at this point than such a small performance difference. I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not mention the overhead. I'm still inclined to think that it's better to call it in the shared completion callback. I also fixed support and added tests for ignore_checksum_failure, that also needs to be determined at the start of the IO, not in the completion. Once more there were no tests, of course. I spent the last 6 hours on the stupid error/warning messages around this, somewhat ridiculous. The number of combinations is annoyingly large. It's e.g. plausible to use ignore_checksum_failure=on and zero_damaged_pages=on at the same time for recovery. The same buffer could both be ignored *and* zeroed. Or somebody could use ignore_checksum_failure=on but then still encounter a page that is invalid. But I finally got to a point where the code ends up readable, without undue duplication. It would, leaving some nasty hack aside, require a errhint_internal() - but I can't imagine a reason against introducing that, given we have it for the errmsg and errhint. Here's the relevant code: /* * Treat a read that had both zeroed buffers *and* ignored checksums as a * special case, it's too irregular to be emitted the same way as the other * cases. */ if (zeroed_any && ignored_any) { Assert(zeroed_any && ignored_any); Assert(nblocks > 1); /* same block can't be both zeroed and ignored */ Assert(result.status != PGAIO_RS_ERROR); affected_count = zeroed_or_error_count; ereport(elevel, errcode(ERRCODE_DATA_CORRUPTED), errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s", affected_count, checkfail_count, first, last, rpath.str), affected_count > 1 ? errdetail("Block %u held first zeroed page.", first + first_off) : 0, errhint("See server log for details about the other %u invalid blocks.", affected_count + checkfail_count - 1)); return; } /* * The other messages are highly repetitive. To avoid duplicating a long * and complicated ereport(), gather the translated format strings * separately and then do one common ereport. */ if (result.status == PGAIO_RS_ERROR) { Assert(!zeroed_any); /* can't have invalid pages when zeroing them */ affected_count = zeroed_or_error_count; msg_one = _("invalid page in block %u of relation %s"); msg_mult = _("%u invalid pages among blocks %u..%u of relation %s"); det_mult = _("Block %u held first invalid page."); hint_mult = _("See server log for the other %u invalid blocks."); } else if (zeroed_any && !ignored_any) { affected_count = zeroed_or_error_count; msg_one = _("invalid page in block %u of relation %s; zeroing out page"); msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s"); det_mult = _("Block %u held first zeroed page."); hint_mult = _("See server log for the other %u zeroed blocks."); } else if (!zeroed_any && ignored_any) { affected_count = checkfail_count; msg_one = _("ignoring checksum failure in block %u of relation %s"); msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s"); det_mult = _("Block %u held first ignored page."); hint_mult = _("See server log for the other %u ignored blocks."); } else pg_unreachable(); ereport(elevel, errcode(ERRCODE_DATA_CORRUPTED), affected_count == 1 ? errmsg_internal(msg_one, first + first_off, rpath.str) : errmsg_internal(msg_mult, affected_count, first, last, rpath.str), affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0, affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0); Does that approach make sense? What do you think about using "zeroing invalid page in block %u of relation %s" instead of "invalid page in block %u of relation %s; zeroing out page" I thought about instead translating "ignoring", "ignored", "zeroing", "zeroed", etc separately, but I have doubts about how well that would actually translate. Greetings, Andres Freund
pgsql-hackers by date: