Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250403194023.c4.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:
> 4b)
> 
> That's not all though, after getting past this failure, I see uninitialized
> memory errors for reads into temporary buffers:
> 
> ==3334031== VALGRINDERROR-BEGIN
> ==3334031== Conditional jump or move depends on uninitialised value(s)
> ==3334031==    at 0xD7C859: PageIsVerified (bufpage.c:108)
> ==3334031==    by 0xD381CA: buffer_readv_complete_one (bufmgr.c:6876)
> ==3334031==    by 0xD385D1: buffer_readv_complete (bufmgr.c:7002)
> ==3334031==    by 0xD38D2E: local_buffer_readv_complete (bufmgr.c:7210)
> ==3334031==    by 0xD265FA: pgaio_io_call_complete_local (aio_callback.c:306)
> ==3334031==    by 0xD24720: pgaio_io_reclaim (aio.c:644)
> ==3334031==    by 0xD24400: pgaio_io_process_completion (aio.c:521)
> ==3334031==    by 0xD28D3D: pgaio_uring_drain_locked (method_io_uring.c:382)
> ==3334031==    by 0xD2905F: pgaio_uring_wait_one (method_io_uring.c:461)
> ==3334031==    by 0xD245E0: pgaio_io_wait (aio.c:587)
> ==3334031==    by 0xD24FFE: pgaio_wref_wait (aio.c:900)
> ==3334031==    by 0xD2F471: WaitReadBuffers (bufmgr.c:1695)
> ==3334031==    by 0xD2BCF4: read_stream_next_buffer (read_stream.c:898)
> ==3334031==    by 0x8B4861: heap_fetch_next_buffer (heapam.c:654)
> ==3334031==    by 0x8B4FFA: heapgettup_pagemode (heapam.c:1016)
> ==3334031==    by 0x8B594F: heap_getnextslot (heapam.c:1375)
> ==3334031==    by 0xB28AA4: table_scan_getnextslot (tableam.h:1031)
> ==3334031==    by 0xB29177: SeqNext (nodeSeqscan.c:81)
> ==3334031==    by 0xB28F75: ExecScanFetch (execScan.h:126)
> ==3334031==    by 0xB28FDD: ExecScanExtended (execScan.h:170)
> 
> 
> The reason for this one is, I think, that valgrind doesn't understand io_uring
> sufficiently. Which isn't surprising, io_uring's nature of an in-memory queue
> of commands is somewhat hard to intercept by tools like valgrind and rr.
> 
> The best fix for that one would, I think, be to have method_io_uring() iterate
> over the IOV and mark the relevant regions as defined?  That does fix the
> issue at least and does seem to make sense?

Makes sense.  Valgrind knows that read() makes its target bytes "defined".  It
probably doesn't have an io_uring equivalent for that.

I expect we only need this for local buffers, and it's unclear to me how the
fix for (4a) didn't fix this.  Before bufmgr Valgrind integration (1e0dfd1 of
2020-07) there was no explicit handling of shared_buffers.  I suspect that
worked because the initial mmap() of shared memory was considered "defined"
(zeros), and steps like PageAddItem() copy only defined bytes into buffers.
Hence, shared_buffers remained defined without explicit Valgrind client
requests.  This example uses local buffers.  Storage for those comes from
MemoryContextAlloc() in GetLocalBufferStorage().  That memory starts
undefined, but it becomes defined at PageInit() or read().  Hence, I expected
the fix for (4a) to make the buffer defined after io_uring read.  What makes
the outcome different?

In the general case, we could want client requests as follows:

- If completor==definer and has not dropped pin:
  - Make defined before verifying page.  That's all.  It might be cleaner to
    do this when first retrieving a return value from io_uring, since this
    just makes up for what Valgrind already does for readv().

- If completor!=definer or has dropped pin:
  - Make NOACCESS in definer when definer cedes its own pin.
  - For io_method=worker, make UNDEFINED before starting readv().  It might be
    cleanest to do this when the worker first acts as the owner of the AIO
    subsystem pin, if that's a clear moment earlier than readv().
  - Make DEFINED in completor before verifying page.  It might be cleaner to
    do this when the completor first retrieves a return value from io_uring,
    since this just makes up for what Valgrind already does for readv().
  - Make NOACCESS in completor after verifying page.  Similarly, it might be
    cleaner to do this when the completor releases the AIO subsystem pin.

> Not quite sure if we should mark
> the entire IOV is efined or just the portion that was actually read - the
> latter is additional fiddly code, and it's not clear it's likely to be helpful?

Seems fine to do the simpler way if that saves fiddly code.

> 4c)
> 
> Unfortunately, once 4a) is addressed, the VALGRIND_MAKE_MEM_NOACCESS() after
> PageIsVerified() causes the *next* read into the same buffer in an IO worker
> to fail:
> 
> ==3339904== Syscall param pread64(buf) points to unaddressable byte(s)
> ==3339904==    at 0x5B3B687: __internal_syscall_cancel (cancellation.c:64)
> ==3339904==    by 0x5B3B6AC: __syscall_cancel (cancellation.c:75)
> ==3339904==    by 0x5B93C83: pread (pread64.c:25)
> ==3339904==    by 0xD274F4: pg_preadv (pg_iovec.h:56)
> ==3339904==    by 0xD2799A: pgaio_io_perform_synchronously (aio_io.c:137)
> ==3339904==    by 0xD2A6D7: IoWorkerMain (method_worker.c:538)
> ==3339904==    by 0xC91E26: postmaster_child_launch (launch_backend.c:290)
> ==3339904==    by 0xC99594: StartChildProcess (postmaster.c:3972)
> ==3339904==    by 0xC99EE3: maybe_adjust_io_workers (postmaster.c:4403)
> ==3339904==    by 0xC958A8: PostmasterMain (postmaster.c:1381)
> ==3339904==    by 0xB69622: main (main.c:227)
> ==3339904==  Address 0x7f936d386000 is in a rw- anonymous segment
> 
> Because, from the view of the IO worker, that memory is still marked NOACCESS,
> even if it since has been marked accessible in the backend.
> 
> 
> We could adress this by conditioning the VALGRIND_MAKE_MEM_NOACCESS() on not
> being in an IO worker, but it seems better to instead explicitly mark the
> region accessible in the worker, before executing the IO.

Sounds good.  Since the definer gave the AIO subsystem a pin on the worker's
behalf, it's like the worker is doing an implicit pin and explicit unpin.

> In a first hack, I did that in pgaio_io_perform_synchronously(), but that is
> likely too broad.  I don't think the same scenario exists when IOs are
> executed synchronously in the foreground.
> 
> 
> Questions:
> 
> 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't
>    need to have special-case logic for that. But it also makes the change less
>    localized and more "impactful", who knows what kind of skullduggery we have
>    been getting away with unnoticed.
> 
>    I haven't written the code up yet, but I don't think it'd be all that much
>    code to add valgrind support to localbuf.

It would be the right thing long-term, and it's not a big deal if it causes
some false positives initially.  So if you're leaning that way, that's good.

> 2) Any better ideas to handle the above issues than what I outlined?

Not here, unless the discussion under (4b) differs usefully from what you
planned.



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: Using read stream in autoprewarm
Next
From: Andrew Dunstan
Date:
Subject: Re: Non-text mode for pg_dumpall