Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Streaming I/O, vectored I/O (WIP) |
Date | |
Msg-id | CA+hUKGJabs5ug9H8KGeoMtOyx-erHuVBSdHwpm3fdwVV4qjzoQ@mail.gmail.com Whole thread Raw |
In response to | Re: Streaming I/O, vectored I/O (WIP) (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: Streaming I/O, vectored I/O (WIP)
Re: Streaming I/O, vectored I/O (WIP) |
List | pgsql-hackers |
On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I am planning to review this patch set, so started going through 0001, > I have a question related to how we are issuing smgrprefetch in > StartReadBuffers() Thanks! > + /* > + * In theory we should only do this if PrepareReadBuffers() had to > + * allocate new buffers above. That way, if two calls to > + * StartReadBuffers() were made for the same blocks before > + * WaitReadBuffers(), only the first would issue the advice. > + * That'd be a better simulation of true asynchronous I/O, which > + * would only start the I/O once, but isn't done here for > + * simplicity. Note also that the following call might actually > + * issue two advice calls if we cross a segment boundary; in a > + * true asynchronous version we might choose to process only one > + * real I/O at a time in that case. > + */ > + smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len); > } > > This is always issuing smgrprefetch starting with the input blockNum, > shouldn't we pass the first blockNum which we did not find in the > Buffer pool? So basically in the loop above this call where we are > doing PrepareReadBuffer() we should track the first blockNum for which > the found is not true and pass that blockNum into the smgrprefetch() > as a first block right? I think you'd be right if StartReadBuffers() were capable of processing a sequence consisting of a hit followed by misses, but currently it always gives up after the first hit. That is, it always processes some number of misses (0-16) and then at most one hit. So for now the variable would always turn out to be the same as blockNum. The reason is that I wanted to allows "full sized" read system calls to form. If you said "hey please read these 16 blocks" (I'm calling that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits, then it could only form a read of 14 blocks, but there might be more blocks that could be read after those. We would have some arbitrary shorter read system calls, when we wanted to make them all as big as possible. So in the current patch you say "hey please read these 16 blocks" and it returns saying "only read 1", you call again with 15 and it says "only read 1", and you call again and says "read 16!" (assuming 2 more were readable after the original range we started with). Then physical reads are maximised. Maybe there is some nice way to solve that, but I thought this way was the simplest (and if there is some instruction-cache-locality/tight-loop/perf reason why we should work harder to find ranges of hits, it could be for later). Does that make sense?
pgsql-hackers by date: