Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date:
Msg-id: CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Streaming I/O, vectored I/O (WIP)
List: pgsql-hackers
On Fri, Jan 12, 2024 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Ok. It feels surprising to have three steps. I understand that you need
> > two steps, one to start the I/O and another to wait for them to finish,
> > but why do you need separate Prepare and Start steps? What can you do in
> > between them? (You explained that. I'm just saying that that's my
> > initial reaction when seeing that API. It is surprising.)

[...]

> OK, I'm going to try the two-step version (again) with interfaces
> along the lines you sketched out... more soon.

Here's the two-step version. The streaming_read.c API is unchanged, but
the bufmgr.c API now has only the following extra functions:

  bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
  WaitReadBuffers(ReadBuffersOperation *op)

That is, the PrepareReadBuffer() step is gone. StartReadBuffers()
updates *nblocks to the number actually processed, which is always at
least one. If it returns true, you must call WaitReadBuffers(). When it
finds a 'hit' (a block that needs no I/O), that one final (or only)
buffer is processed, but no more: StartReadBuffers() always conceptually
starts 0 or 1 I/Os. Example: if you ask for 16 blocks and it finds two
misses followed by a hit, it'll set *nblocks = 3, smgrprefetch() the two
missing blocks, and smgrreadv() them in WaitReadBuffers(). The caller
can't really tell that the third block was a hit. The only case it can
distinguish is when the first block is a hit; then it returns false and
sets *nblocks = 1.

This arrangement, where the results include the 'boundary' block that
ends the readable range, avoids the double-lookup problem we discussed
upthread. I think it should probably also be able to handle multiple
consecutive 'hits' at the start of a sequence, but in this version I
kept it simpler. It could never handle more than one hit after an I/O
range, though, because it can't guess whether the block after that will
be a hit or a miss. If that block turned out to be a miss, we wouldn't
want to start a second I/O, so unless we decide we're happy unpinning
and re-looking-up next time, it's better to give up at that point.
Hence the idea of including the hit as a bonus block on the end. It
took me a long time, but I eventually worked my way around to
preferring this over the three-step version.

streaming_read.c now has to do a bit more work, including sometimes
'ungetting' a block (i.e. deferring one that the callback has requested
until next time) to resolve some circularities that come up with flow
control. But I suspect you'd finish up having to deal with 'short'
reads anyway: in the asynchronous future, the StartReadBuffers() of a
three-step version (as the second step) might also come up short when
it fails to get enough BM_IO_IN_PROGRESS flags, so some version of
these problems has to be handled either way. Thoughts?

I am still thinking about how to improve the coding in
streaming_read.c, i.e. to simplify and beautify the main control loop
and improve the flow control logic. And looking for interesting test
cases to hit various conditions in it and try to break it. And trying
to figure out how this read-coalescing and parallel seq scan's block
allocator might interfere with each other to produce non-ideal patterns
of system calls. Below, after two quick illustrative sketches, are some
example strace results generated by a couple of simple queries. See CF
#4426 for pg_buffercache_invalidate().
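First, a minimal sketch of how a caller might drive the two-step API.
The "..." arguments above are not spelled out in this mail, so the
relation/fork/strategy parameters and the MAX_READ_BLOCKS cap below are
assumptions for illustration, not the patch's actual interface:

    /* Hypothetical caller loop; the argument list is guessed, only the
     * nblocks in/out convention and the return-true-means-wait contract
     * are as described above. */
    BlockNumber next = 0;

    while (next < total_blocks)
    {
        Buffer      buffers[MAX_READ_BLOCKS];
        int         nblocks = Min(total_blocks - next, MAX_READ_BLOCKS);
        ReadBuffersOperation op;

        /* Pins buffers and clamps nblocks to what was actually
         * processed (always >= 1, possibly ending with one 'hit'). */
        if (StartReadBuffers(rel, MAIN_FORKNUM, next, buffers, &nblocks,
                             strategy, &op))
            WaitReadBuffers(&op);   /* false: first block was a hit, no I/O */

        /* buffers[0 .. nblocks - 1] are now valid and pinned */
        next += nblocks;
    }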
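And a purely illustrative shape for the 'unget' idea mentioned above;
every name here is invented, not the actual streaming_read.c code:

    /* If the callback hands back a block we can't include in the current
     * operation (it would require starting a second I/O), stash it and
     * hand it out first next time, instead of unpinning it and
     * re-looking it up. */
    typedef struct StreamingRead
    {
        BlockNumber (*callback) (struct StreamingRead *stream);
        BlockNumber unget_blocknum;     /* InvalidBlockNumber if empty */
        /* ... */
    } StreamingRead;

    static BlockNumber
    streaming_read_next_block(StreamingRead *stream)
    {
        if (stream->unget_blocknum != InvalidBlockNumber)
        {
            BlockNumber result = stream->unget_blocknum;

            stream->unget_blocknum = InvalidBlockNumber;
            return result;
        }
        return stream->callback(stream);
    }

    static void
    streaming_read_unget_block(StreamingRead *stream, BlockNumber blocknum)
    {
        Assert(stream->unget_blocknum == InvalidBlockNumber);
        stream->unget_blocknum = blocknum;
    }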
=== Sequential scan example ===

create table big as select generate_series(1, 10000000);
select count(*) from big;

pread64(81, ...) = 8192     <-- starts small
pread64(81, ...) = 16384
pread64(81, ...) = 32768
pread64(81, ...) = 65536
pread64(81, ...) = 131072   <-- fully ramped-up size reached
preadv(81, ...) = 131072    <-- more Vs seen as buffers fill up/fragment
preadv(81, ...) = 131072
...repeating...
preadv(81, ...) = 131072
preadv(81, ...) = 131072
pread64(81, ...) = 8192     <-- end fragment

create table small as select generate_series(1, 100000);
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('small')
   and relblocknumber % 3 != 0;        -- kick out every 3rd block
select count(*) from small;

preadv(88, ...) = 16384     <-- just the 2-block fragments we need to load
preadv(88, ...) = 16384
preadv(88, ...) = 16384

=== Bitmap heapscan example ===

create table heap (i int primary key);
insert into heap select generate_series(1, 1000000);
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('heap');
select count(i) from heap
 where i in (10, 1000, 10000, 100000)
    or i in (20, 200, 2000, 200000);

pread64(75, ..., 8192, 0) = 8192
fadvise64(75, 32768, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 65536, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 32768) = 8192
pread64(75, ..., 8192, 65536) = 8192
fadvise64(75, 360448, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 3620864, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 7241728, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 360448) = 8192
pread64(75, ..., 8192, 3620864) = 8192
pread64(75, ..., 8192, 7241728) = 8192
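The bitmap heapscan trace is the kernel-visible shape of issuing
prefetch hints ahead of synchronous reads. As a standalone sketch of
that pattern (not PostgreSQL's actual code; the 8kB block size is
hard-coded here):

    #define _XOPEN_SOURCE 600       /* for posix_fadvise */
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    /* Shows up in strace as fadvise64(fd, offset, 8192, POSIX_FADV_WILLNEED);
     * asks the kernel to start pulling the block into the page cache. */
    static void
    hint_block(int fd, unsigned long blkno)
    {
        posix_fadvise(fd, (off_t) blkno * BLOCK_SIZE, BLOCK_SIZE,
                      POSIX_FADV_WILLNEED);
    }

    /* Shows up in strace as pread64(fd, ..., 8192, offset); with luck the
     * earlier hint means this no longer blocks on the disk. */
    static ssize_t
    read_block(int fd, char *buf, unsigned long blkno)
    {
        return pread(fd, buf, BLOCK_SIZE, (off_t) blkno * BLOCK_SIZE);
    }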
Attachments:
- v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v5-0002-Provide-API-for-streaming-reads-of-relations.patch
- v5-0003-Use-streaming-reads-in-pg_prewarm.patch
- v5-0004-WIP-Use-streaming-reads-in-heapam-scans.patch
- v5-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v5-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patch