Re: AIO v2.5 - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | CAEze2Wh-6VqUEgqjhmYK_v+ca+98DPGpt1nj08QceF4Ka0mf6g@mail.gmail.com Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
On Wed, 9 Jul 2025 at 16:59, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-07-09 13:26:09 +0200, Matthias van de Meent wrote: > > I've been going through the new AIO code as an effort to rebase and > > adapt Neon to PG18. In doing so, I found the following > > items/curiosities: > > > > 1. In aio/README.md, the following code snippet is found: > > > > [...] > > pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1); > > [...] > > > > I believe it would be clearer if it took a reference to the buffer: > > > > pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffer, 1); > > > > The main reason here is that common practice is to have a `Buffer > > buffer;` whereas a Buffer * is more commonly plural. > > It's also just simply wrong as-is :/. Interpreting the buffer id as a pointer > obviously makes no sense... Given that the snippet didn't contain type indications for buffer upto that point, technically the buffer variable could've been defined as `Buffer* buffer;` which would've been type-correct. That would be very confusing however, hence the suggested change. After your mail, I also noticed the later snippet which should be updated, too: ``` -smgrstartreadv(ioh, operation->smgr, forknum, blkno, - BufferGetBlock(buffer), 1); +void *page = BufferGetBlock(buffer); +smgrstartreadv(ioh, operation->smgr, forknum, blkno, + &page, 1); ``` > > 3. I noticed that there is AIO code for writev-related operations > > (specifically, pgaio_io_start_writev is exposed, as is > > PGAIO_OP_WRITEV), but no practical way to excercise that code: it's > > not called from anywhere in the project, and there is no way for > > extensions to register the relevant callbacks required to make writev > > work well on buffered contents. Is that intentional? > > Yes. We obviously do want to support writes eventually, and it didn't seem > useful to not have the most basic code for writes in the AIO infrastructure. > > You could still use it to e.g. write out temporary file data or such. Yes, though IIUC that would require an implementation of at least PgAioTargetInfo for such a use case (it's definitely not a SMGR target), which currently isn't available and can't be registered dynamically by an extension. Or maybe did I miss something? > > and there is no way for extensions to register the relevant callbacks > > required to make writev work well on buffered contents. Is that intentional? > > FWIW, the problem with writev for buffered IO is not so much with the AIO > infrastructure, but with buffer locking and sync.c... Maybe "buffered contents" wasn't the right phrasing, but my main point remains - I can't seem to find a way to make use of the pgaio_start_writev API in the 18 branch that doesn't require patching core code or heavily relying on the behaviour of various pgaio internals: The only valid pgaio target is PGAIO_TID_SMGR, which effectively forces you to use buffered relations as IO target (either shared, or backend-local/temp). Then, there are no callbacks defined for writev, so the user will effectively have to do setup and cleanup on their own, making worker-based IO practically impossible, and requiring the user to handle all lock acquisition/releases in their own process' time instead of in the AIO paths - requiring significantly more involved error handling. (PS. I'm not quite 100% sure that it is impossible to use, just that there are rather few handles available for using this part of the new tool, and it seems completely untested in the PG18 branch) ----- Something else I've just noticed is the use of int32 in PgAIOHandle->result. In sync and worker mode, pg_preadv and pg_pwritev return ssize_t, which most modern systems can't fit in int32 (the output was int before, then size_t, then ssize_t: [0]). While not directly an issue in default PG18 due to the use of 1GB relation segments capping the max IO size for SMGR-managed IOs (and various other code-level constraints), this may have more issues when an extension starts bulk-reading data on a system compiled with RELSEG_SIZE >= 2GB; I can't find any protective checks against overflows in downcasting the IO result. I did notice Linux seems to guarantee it won't read more than 0x7FFF_F000 in any one operation, but I can't find any similar guarantees for e.g. MacOS or Windows (well, our win32 port limits pg_pread/writev to 1GB, which seems sufficient for this exact case). If we're keeping int32, then it would be great if the considerations for why it's OK to use 32 bits to store the ssize_t vlaue would be included in the docs/comments. Kind regards, Matthias van de Meent Databricks [0] https://postgr.es/m/flat/1672202.1703441340%40sss.pgh.pa.us
pgsql-hackers by date: