Re: AIO v2.5 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6 Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
Hi, On 2025-04-01 17:47:51 -0400, Andres Freund wrote: > There are three different types of failures in the test_aio test so far: And a fourth, visible after I enabled liburing support for skink. https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-04-03%2007%3A06%3A19&stg=pg_upgrade-check (ignore the pg_upgrade and oauth failures, they're independent, I've raised them separately) 4a) 2025-04-03 10:58:32.978 UTC [2486740][client backend][3/6:0] LOG: short read injection point called, is enabled: 0 ==2486740== VALGRINDERROR-BEGIN ==2486740== Invalid read of size 2 ==2486740== at 0x59C8AC: PageIsNew (bufpage.h:237) ==2486740== by 0x59C8AC: PageIsVerified (bufpage.c:108) ==2486740== by 0x567870: buffer_readv_complete_one (bufmgr.c:6873) ==2486740== by 0x567870: buffer_readv_complete (bufmgr.c:6996) ==2486740== by 0x567870: shared_buffer_readv_complete (bufmgr.c:7153) ==2486740== by 0x55DDB2: pgaio_io_call_complete_shared (aio_callback.c:256) ==2486740== by 0x55D6F1: pgaio_io_process_completion (aio.c:512) ==2486740== by 0x55F53A: pgaio_uring_drain_locked (method_io_uring.c:370) ==2486740== by 0x55F7B8: pgaio_uring_wait_one (method_io_uring.c:449) ==2486740== by 0x55C702: pgaio_io_wait (aio.c:587) ==2486740== by 0x55C8B0: pgaio_wref_wait (aio.c:900) ==2486740== by 0x8639240: read_rel_block_ll (test_aio.c:440) ==2486740== by 0x3B915C: ExecInterpExpr (execExprInterp.c:953) ==2486740== by 0x3B4E4E: ExecInterpExprStillValid (execExprInterp.c:2299) ==2486740== by 0x3F7E97: ExecEvalExprNoReturn (executor.h:445) ==2486740== Address 0x8fa400e is in a rw- anonymous segment ==2486740== ==2486740== VALGRINDERROR-END The reason for this is that this test unpins the buffer (from the backend's view), before waiting for the IO. While the AIO subsystem holds a pin, UnpinBufferNoOwner() marked the buffer as inaccessible: /* * Mark buffer non-accessible to Valgrind. * * Note that the buffer may have already been marked non-accessible * within access method code that enforces that buffers are only * accessed while a buffer lock is held. */ VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ); I think to fix this we need to mark buffers as accessible around the PageIsVerified() call in buffer_readv_complete_one(), IFF they're not pinned by the backend. Unfortunately, this is complicated by the fact that local buffers do not have valgrind integration :(, so we should only do that for local buffers, as otherwise the local buffer stays inaccessible the next time it is pinned. 4b) That's not all though, after getting past this failure, I see uninitialized memory errors for reads into temporary buffers: ==3334031== VALGRINDERROR-BEGIN ==3334031== Conditional jump or move depends on uninitialised value(s) ==3334031== at 0xD7C859: PageIsVerified (bufpage.c:108) ==3334031== by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876) ==3334031== by 0xD385D1: buffer_readv_complete (bufmgr.c:7002) ==3334031== by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210) ==3334031== by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306) ==3334031== by 0xD24720: pgaio_io_reclaim (aio.c:644) ==3334031== by 0xD24400: pgaio_io_process_completion (aio.c:521) ==3334031== by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382) ==3334031== by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461) ==3334031== by 0xD245E0: pgaio_io_wait (aio.c:587) ==3334031== by 0xD24FFE: pgaio_wref_wait (aio.c:900) ==3334031== by 0xD2F471: WaitReadBuffers (bufmgr.c:1695) ==3334031== by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898) ==3334031== by 0x8B4861: heap_fetch_next_buffer (heapam.c:654) ==3334031== by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016) ==3334031== by 0x8B594F: heap_getnextslot (heapam.c:1375) ==3334031== by 0xB28AA4: table_scan_getnextslot (tableam.h:1031) ==3334031== by 0xB29177: SeqNext (nodeSeqscan.c:81) ==3334031== by 0xB28F75: ExecScanFetch (execScan.h:126) ==3334031== by 0xB28FDD: ExecScanExtended (execScan.h:170) The reason for this one is, I think, that valgrind doesn't understand io_uring sufficiently. Which isn't surprising, io_uring's nature of an in-memory queue of commands is somewhat hard to intercept by tools like valgrind and rr. The best fix for that one would, I think, be to have method_io_uring() iterate over the IOV and mark the relevant regions as defined? That does fix the issue at least and does seem to make sense? Not quite sure if we should mark the entire IOV is efined or just the portion that was actually read - the latter is additional fiddly code, and it's not clear it's likely to be helpful? 4c) Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after PageIsVerified() causes the *next* read into the same buffer in an IO worker to fail: ==3339904== Syscall param pread64(buf) points to unaddressable byte(s) ==3339904== at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64) ==3339904== by 0x5B3B6AC: __syscall_cancel (cancellation.c:75) ==3339904== by 0x5B93C83: pread (pread64.c:25) ==3339904== by 0xD274F4: pg_preadv (pg_iovec.h:56) ==3339904== by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137) ==3339904== by 0xD2A6D7: IoWorkerMain (method_worker.c:538) ==3339904== by 0xC91E26: postmaster_child_launch (launch_backend.c:290) ==3339904== by 0xC99594: StartChildProcess (postmaster.c:3972) ==3339904== by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403) ==3339904== by 0xC958A8: PostmasterMain (postmaster.c:1381) ==3339904== by 0xB69622: main (main.c:227) ==3339904== Address 0x7f936d386000 is in a rw- anonymous segment Because, from the view of the IO worker, that memory is still marked NOACCESS, even if it since has been marked accessible in the backend. We could adress this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not being in an IO worker, but it seems better to instead explicitly mark the region accessible in the worker, before executing the IO. In a first hack, I did that in pgaio_io_perform_synchronously(), but that is likely too broad. I don't think the same scenario exists when IOs are executed synchronously in the foreground. Questions: 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't need to have special-case logic for that. But it also makes the change less localized and more "impactful", who knows what kind of skullduggery we have been getting away with unnoticed. I haven't written the code up yet, but I don't think it'd be all that much code to add valgrind support to localbuf. 2) Any better ideas to handle the above issues than what I outlined? Greetings, Andres Freund
pgsql-hackers by date: