Re: AIO v2.5 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: AIO v2.5
Msg-id: 5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2025-03-26 21:07:40 -0400, Andres Freund wrote:
> TODO
> ...
> - Add an explicit test for the checksum verification in the completion callback
>
> There is an existing test for testing an invalid page due to page header
> verification in test_aio, but not for checksum failures.
>
> I think it's indirectly covered (e.g. in amcheck), but seems better to test
> it explicitly.

Ah, for crying out loud. As it turns out, no, we do not have *ANY* tests for this on the server side. Not a single one.

I'm somewhat apoplectic. data_checksums is a really complicated feature, which we just started *turning on by default*, without a single test of the failure behaviour, when detecting failures is the one thing the feature is supposed to do.

I now wrote some tests. And I both regret doing so (because it found problems, which would have been apparent long ago if the feature had come with *any* tests; had I gone the same way, I could have just pushed stuff) and am glad I did (because I dislike pushing broken stuff). I have to admit, I was tempted to just ignore this issue and not say anything about tests for checksum failures anymore.

Problems:

1) PageIsVerifiedExtended() emits a WARNING. Just as with ZERO_ON_ERROR, we don't want to emit it in
   a) io workers
   b) another backend, if it completes the IO

   This isn't hard to address: we can add PIV_LOG_LOG (or something like that) to emit the message at a different log level, plus an out-parameter to trigger sending a warning / adjusting the warning/error message we already emit once the issuer completes the IO.

2) With io workers (and "foreign completors", in rare cases), checksum failures would be attributed wrongly, as PageIsVerifiedExtended() reports all stats to MyDatabaseId.

   As it turns out, this is already borked on master for shared relations, and has been for as long as pg_stat_database.checksum_failures has existed, see [1].

   This isn't too hard to fix: adjust the signature of PageIsVerifiedExtended() to pass in the database oid.
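To make the shape of the fixes for 1) and 2) concrete, here's a standalone mock (this is *not* the actual PostgreSQL code: PIV_LOG_LOG and the `checksum_failure` out-parameter are the hypothetical additions proposed above, the database-oid parameter follows the same idea, and the XOR "checksum" is a toy stand-in):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag names modeled on PostgreSQL's PIV_* flags; PIV_LOG_LOG is the
 * hypothetical new flag for io workers / foreign completors. */
#define PIV_LOG_WARNING 0x01    /* message at WARNING, for the issuing backend */
#define PIV_LOG_LOG     0x02    /* message at LOG, for non-issuing processes */
#define PIV_REPORT_STAT 0x04

/*
 * Mock of the proposed signature: emit the verification message at the
 * caller-chosen level, report the failure kind through an out-parameter so
 * the issuer can re-raise a proper WARNING when it completes the IO, and
 * take the database oid explicitly instead of using MyDatabaseId.
 */
static bool
mock_page_is_verified(const uint8_t *page, size_t len,
                      uint8_t expected_checksum, uint32_t dboid,
                      int flags, bool *checksum_failure)
{
    uint8_t sum = 0;

    *checksum_failure = false;
    for (size_t i = 0; i < len; i++)
        sum ^= page[i];

    if (sum != expected_checksum)
    {
        *checksum_failure = true;
        if (flags & PIV_LOG_WARNING)
            fprintf(stderr, "WARNING: page verification failed in db %u\n", dboid);
        else if (flags & PIV_LOG_LOG)
            fprintf(stderr, "LOG: page verification failed in db %u\n", dboid);
        return false;
    }
    return true;
}
```

The point of the out-parameter is that an io worker can verify with PIV_LOG_LOG and stash `checksum_failure` in the IO's result, and the issuing backend later decides whether to turn that into a WARNING or ERROR.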
   But see also 3).

3) We can't pgstat_report_checksum_failure() during the completion callback, as it *sometimes* allocates memory.

   Aside from the allocation-in-critical-section asserts, I think this is *extremely* unlikely to actually cause a problem in practice. But we don't want to rely on that, obviously.

Addressing the first two is pretty simple and/or needs to be done anyway, since it's a currently existing bug, as discussed in [1].

Addressing 3) is not at all trivial. Here's what I've thought of so far:

Approach I)

My first thoughts were around trying to make the relevant pgstat infrastructure either not need to allocate memory, or handle memory allocation failures gracefully. Unfortunately that seems not really viable:

The most successful approach I tried was to report stats directly to the dshash table, and only report stats if there's already an entry (which there just about always will be, except for a very short period after stats have been reset). Unfortunately that fails because, to access the shared memory with the stats data, we need to call dsa_get_address(), which can fail if the relevant dsm segment wasn't already mapped in the current process (it allocates memory in the process of mapping in the segment). There's no API to do that without erroring out.

That aspect rules out a number of other approaches that sounded like they could work - e.g. we could increase the refcount of the relevant pgstat entry before issuing IO, ensuring that it's around by the time we need to report. But that wouldn't get around the issue of needing to map in the dsm segment.

Approach II)

Don't report the error in the completion callback. The obvious place would be to do it where we'll raise the warning/error in the issuing process. The big disadvantage is that that could lead to under-counting checksum errors:

a) A read stream does 2+ concurrent reads for the same relation, and more than one encounters checksum errors.
   When processing the results for the first failed read, we raise an error and thus won't process the results of the second+ reads with errors.

b) A read is started asynchronously, but before the backend gets around to processing the result of the IO, it errors out during that other work (possibly due to a cancellation). Because the backend never looked at the results of the IOs, the checksum errors don't get accounted for.

b) doesn't overly bother me, but a) seems problematic.

Approach III)

Accumulate checksum errors in two backend-local variables (one for database-specific errors, one for errors on shared relations), to be flushed by the backend that issued the IO during the next pgstat_report_start().

Two disadvantages:

- Accumulation of errors will be delayed until the next pgstat_report_start(). That seems acceptable; after all, we do so for a lot of other stats.

- We need to register a local callback for shared buffer reads, which don't need one today. That's a small bit of added overhead. It's a shame to do so for counters that approximately never get incremented.

Approach IV)

Embrace piercing abstractions / generic infrastructure and put two atomic variables (one for shared relations, one for the backend's database) in some backend-specific shared memory (e.g. the backend's PgAioBackend or PGPROC), and update them in the completion callback. Flush the variables to the shared stats in pgstat_report_start() or such.

This would avoid the need for the local completion callback, and would also allow us to introduce a function to see the number of "unflushed" checksum errors. It also doesn't require transporting the number of errors between the shared callback and the local callback - but we might want to have that for the error message anyway.

I wish the new-to-18 pgstat_backend() were designed in a way to make this possible in a nice way.
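For illustration, Approach IV could be sketched roughly as below (a standalone mock using C11 atomics; in PostgreSQL the two counters would live in per-backend shared memory such as PgAioBackend or PGPROC and use pg_atomic_uint32, and all the names here are made up):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* In the real thing these would sit in the backend's slot in shared memory,
 * so that io workers / foreign completors can update them. */
static _Atomic uint32_t pending_checksum_failures_db;       /* backend's database */
static _Atomic uint32_t pending_checksum_failures_shared;   /* shared relations */

/*
 * Called from the completion callback, possibly in an io worker or another
 * backend. Only touches the atomics, never allocates memory, so it is safe
 * in a critical section.
 */
static void
completion_count_checksum_failure(bool on_shared_rel)
{
    if (on_shared_rel)
        atomic_fetch_add(&pending_checksum_failures_shared, 1);
    else
        atomic_fetch_add(&pending_checksum_failures_db, 1);
}

/*
 * Called by the owning backend, e.g. from pgstat_report_start(): drain both
 * counters so they can be handed to regular pgstat reporting (which is
 * allowed to allocate at that point). Returns the per-database count and
 * stores the shared-relation count through the out-parameter.
 */
static uint32_t
flush_pending_checksum_failures(uint32_t *shared_out)
{
    *shared_out = atomic_exchange(&pending_checksum_failures_shared, 0);
    return atomic_exchange(&pending_checksum_failures_db, 0);
}
```

The exchange-with-zero makes the flush lossless even if a completion callback increments a counter concurrently: that increment simply survives until the next flush.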
But unfortunately pgstat_backend() puts the backend-specific data in the dshash table / dynamic shared memory, rather than in a MaxBackends + NUM_AUX sized array in plain shared memory. As explained in I), we can't rely on having the entire array mapped. Leaving the issue from this email aside, that also adds a fair bit of overhead to other cases.

Does anybody have better ideas?

I think II), III) and IV) are all relatively simple to implement. The most complicated bit is that a bit of bit-squeezing is necessary to fit the number of checksum errors (in addition to the number of otherwise invalid pages) into the available space for error data. It's doable. We could also just increase the size of PgAioResult.

I've implemented II), but I'm not sure the disadvantages are acceptable.

Greetings,

Andres Freund

[1] https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg%40e4gt7npsohuz