Re: AIO v2.5 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: AIO v2.5
Date:
Msg-id: havtpn6xuwpybrveymzvvym4cqi37jxhcsfququk4lioyasvsq@eitpmihs2akh
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
Responses: Re: AIO v2.5
List: pgsql-hackers

Hi,

On 2025-03-24 11:43:47 -0400, Andres Freund wrote:
> > I didn't write a test to prove it, but I'm suspecting we'll reach the above
> > ERROR with this sequence:
> >
> >   CREATE TEMP TABLE foo ...;
> >   [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> >   DROP TABLE foo;
> 
> That seems plausible.  I'll try to write a test after this email.

FWIW, a test did indeed confirm that.  Luckily:

> > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
> > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> > the particular rel) before InvalidateLocalBuffer().  Or use something like the
> > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> > corresponding bufmgr code.
> 
> Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
> me.

This did indeed resolve the issue.
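
For illustration, the fix is roughly of the following shape (a sketch only, not the actual patch; it assumes the buffer header carries the AIO wait reference io_wref and that the pgaio_wref_valid()/pgaio_wref_wait() helpers are used to wait on it):

static void
InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
{
    /*
     * IO for this buffer may still be in flight, e.g. if the command that
     * started the read failed with an ERROR and the temp table is now being
     * dropped.  Wait for the IO to complete before invalidating the buffer,
     * instead of erroring out.
     */
    if (pgaio_wref_valid(&bufHdr->io_wref))
        pgaio_wref_wait(&bufHdr->io_wref);

    /* ... existing invalidation logic ... */
}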

I've extended the test suite to cover that and a number of other things. Working
on sending out a new version...


> > We call read_stream_get_block() while in batchmode, so the stream callback
> > needs to be ready for that.  A complicated case is
> > collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> > I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
> > recipe for corner cases reaching ERROR "starting batch while batch already in
> > progress".  Are there mitigating factors?
> 
> Ugh, yes, you're right.  heap_vac_scan_next_block() is also affected.
> 
> I don't think "starting batch while batch already in progress" is the real
> issue though - it seems easy enough to avoid starting another batch inside
> the callback, partly because the current cases seem unlikely to need to do
> batchable IO there. What worries me more is that code might block while there's
> unsubmitted IO - which seems entirely plausible.
> 
> 
> I can see a few approaches:
> 
> 1) Declare that all read stream callbacks have to be careful and cope with
>    batch mode
> 
>    I'm not sure how viable that is; not starting batches seems OK, but
>    ensuring that the code doesn't block is a different story.
> 
> 
> 2) Have read stream users opt-in to batching
> 
>    Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
>    to implement and to add to the callsites where that's fine.
> 
> 
> 3) Teach read stream to "look ahead" far enough to determine all the blocks
>    that could be issued in a batch outside of batchmode
> 
>    I think that's probably not a great idea; it'd lead us to look further
>    ahead than we really need to, which could increase "unfairness" in
>    e.g. parallel sequential scan.
> 
> 
> 4) Just defer using batch mode for now
> 
>    It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
>    but there's no need to immediately solve this.
> 
> 
> I think regardless of what we go for, it's worth splitting
>   "aio: Basic read_stream adjustments for real AIO"
> into the actually basic parts (i.e. introducing sync_mode) and the not
> actually so basic parts (i.e. batching).
> 
> 
> I suspect that 2) would be the best approach. Only the read stream user knows
> what it needs to do in the callback.

I still think 2) would be the best option.

Writing a patch for that.

If a callback may sometimes need to block, it can still opt into
READ_STREAM_USE_BATCHING by submitting all staged IO before blocking.
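
As a rough illustration (hypothetical callback and struct names, not from the
patch), such a callback could look like:

static BlockNumber
my_next_block_cb(ReadStream *stream,
                 void *callback_private_data,
                 void *per_buffer_data)
{
    MyScanState *scan = (MyScanState *) callback_private_data;

    if (scan->needs_blocking_work)
    {
        /*
         * About to block (e.g. on a lock or on separate buffer IO): submit
         * whatever IO the surrounding batch has staged so far, so we never
         * sleep on top of unsubmitted IO.
         */
        pgaio_submit_staged();

        /* ... potentially blocking work ... */
    }

    if (scan->next_block >= scan->last_block)
        return InvalidBlockNumber;

    return scan->next_block++;
}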

The hardest part is explaining the flag. Here's my current attempt:

/* ---
 * Opt-in to using AIO batchmode.
 *
 * Submitting IO in larger batches can be more efficient than doing so
 * one-by-one, particularly for many small reads. It does, however, require
 * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
 * batching (cf. pgaio_enter_batchmode()). Basically, the callback may not:
 * a) block without first calling pgaio_submit_staged(), unless a
 *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
 *    never acquired in a nested fashion
 * b) directly or indirectly start another batch via pgaio_enter_batchmode()
 *
 * As this requires care and is nontrivial in some cases, batching is only
 * used with explicit opt-in.
 * ---
 */
#define READ_STREAM_USE_BATCHING 0x08
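
A call site whose callback follows those rules would then just pass the flag
when starting the stream, e.g. (hypothetical call site; READ_STREAM_MAINTENANCE
is only an example of combining flags, and bstrategy/rel/scan_state are
placeholders):

stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
                                    READ_STREAM_USE_BATCHING,
                                    bstrategy,
                                    rel,
                                    MAIN_FORKNUM,
                                    my_next_block_cb,
                                    scan_state,
                                    0);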


Greetings,

Andres Freund


