Re: AIO v2.5 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: AIO v2.5
Date:
Msg-id: havtpn6xuwpybrveymzvvym4cqi37jxhcsfququk4lioyasvsq@eitpmihs2akh
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
Responses: Re: AIO v2.5
List: pgsql-hackers
Hi,

On 2025-03-24 11:43:47 -0400, Andres Freund wrote:
> > I didn't write a test to prove it, but I'm suspecting we'll reach the above
> > ERROR with this sequence:
> >
> > CREATE TEMP TABLE foo ...;
> > [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
> > DROP TABLE foo;
>
> That seems plausible. I'll try to write a test after this email.

FWIW, a test did indeed confirm that. Luckily:

> > DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I
> > think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> > the particular rel) before InvalidateLocalBuffer(). Or use something like the
> > logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> > corresponding bufmgr code.
>
> Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
> me.

This did indeed resolve the issue. I've extended the testsuite to test for
that and a bunch more things. Working on sending out a new version...

> > We call read_stream_get_block() while in batchmode, so the stream callback
> > needs to be ready for that. A complicated case is
> > collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> > I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a
> > recipe for corner cases reaching ERROR "starting batch while batch already in
> > progress". Are there mitigating factors?
>
> Ugh, yes, you're right. heap_vac_scan_next_block() is also affected.
>
> I don't think "starting batch while batch already in progress" is the real
> issue though - it seems easy enough to avoid starting another batch inside,
> partially because current cases seem unlikely to need to do batchable IO
> inside. What worries me more is that code might block while there's
> unsubmitted IO - which seems entirely plausible.
> I can see a few approaches:
>
> 1) Declare that all read stream callbacks have to be careful and cope with
>    batch mode
>
>    I'm not sure how viable that is, not starting batches seems ok, but
>    ensuring that the code doesn't block is a different story.
>
> 2) Have read stream users opt-in to batching
>
>    Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
>    to implement and to add to the callsites where that's fine.
>
> 3) Teach read stream to "look ahead" far enough to determine all the blocks
>    that could be issued in a batch outside of batchmode
>
>    I think that's probably not a great idea, it'd lead us to looking further
>    ahead than we really need to, which could increase "unfairness" in
>    e.g. parallel sequential scan.
>
> 4) Just defer using batch mode for now
>
>    It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
>    but there's no need to immediately solve this.
>
> I think regardless of what we go for, it's worth splitting
>   "aio: Basic read_stream adjustments for real AIO"
> into the actually basic parts (i.e. introducing sync_mode) from the not
> actually so basic parts (i.e. batching).
>
> I suspect that 2) would be the best approach. Only the read stream user knows
> what it needs to do in the callback.

I still think 2) would be the best option. Writing a patch for that.

If a callback may sometimes need to block, it can still opt into
READ_STREAM_USE_BATCHING, by submitting all staged IO before blocking.

The hardest part is to explain the flag. Here's my current attempt:

/* ---
 * Opt-in to using AIO batchmode.
 *
 * Submitting IO in larger batches can be more efficient than doing so
 * one-by-one, particularly for many small reads. It does, however, require
 * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
 * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
 * a) block without first calling pgaio_submit_staged(), unless a
 *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
 *    never acquired in a nested fashion
 * b) directly or indirectly start another batch pgaio_enter_batchmode()
 *
 * As this requires care and is nontrivial in some cases, batching is only
 * used with explicit opt-in.
 * ---
 */
#define READ_STREAM_USE_BATCHING 0x08

Greetings,

Andres Freund