Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date:
Msg-id: CA+hUKG+ay_ht+Ge4BUKG_01+SuQF7Z93JMmy1an+0syGM6o5SQ@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Streaming I/O, vectored I/O (WIP)
List: pgsql-hackers
On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok. It feels surprising to have three steps. I understand that you need
> two steps, one to start the I/O and another to wait for them to finish,
> but why do you need separate Prepare and Start steps? What can you do in
> between them? (You explained that. I'm just saying that that's my
> initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well. First, some more detail about how a two-step version would work:

* we only want to start one I/O in a StartReadBuffers() call, because otherwise it is hard/impossible for the caller to cap I/O concurrency
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/ ("H" = hit, "I" = miss, I/O) in one call
* in the asynchronous version, "I" in that pattern means we got BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that the buffer is not valid and not BM_IO_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS until the later Complete/Wait call (and then the answer might change, but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that StartReadBuffers() didn't accept
* that's actually quite easy: you just use the rest to create a new range in the next slot

Previously I thought the requirement to deal with buffering future work that StartReadBuffers() couldn't accept yet was a pain, and life became so much simpler once I deleted all that and exposed PrepareReadBuffer() to the calling code. Perhaps I just hadn't done a good enough job of it. The other disadvantage you reminded me of was the duplicate buffer lookup in certain unlucky patterns, which I had forgotten about in my previous email. But I guess it's not fatal to the idea, and there is a potential partial mitigation (see below).
A third thing was the requirement for communication between StartReadBuffers() and CompleteReadBuffers(), for which I originally had an "opaque" object that the caller had to keep around to hold private state. It seemed nice to go back to talking just about buffer numbers, but that's not really an argument for anything... OK, I'm going to try the two-step version (again) with interfaces along the lines you sketched out... more soon.

> I see. When you're about to zero the page, there's not much point in
> splitting the operation into Prepare/Start/Complete stages anyway.
> You're not actually doing any I/O. Perhaps it's best to have a separate
> "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
> function that does the same as
> ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page generates I/O, if it has to evict a dirty buffer. Nothing in this code does anything fancy about that, but imagine some hypothetical future thing that manages to do that asynchronously: then we might want to take advantage of the ability to stream even a zeroed page, ie do something ahead of time. Just a thought for another day, and perhaps that is just an argument for including zeroed pages in the streaming read API, but it doesn't mean that the bufmgr.c API can't be as you say.

> One weakness is that if StartReadBufferRange() finds that the range is
> "chopped up", it needs to return and throw away the work it had to do to
> look up the next buffer. So in the extreme case that every other block
> in the range is in the buffer cache, each call would look up two buffers
> in the buffer cache, startBlk and startBlk + 1, but only return one
> buffer to the caller.

Yeah, right. This was one of the observations that influenced my PrepareReadBuffer() three-step thing, which I'd forgotten. To spell that out with an example, suppose the buffer pool contains every odd-numbered block.
Successive StartReadBuffers() calls would process "HMHm", "MHm", "MHm", ... where "m" represents a miss that we can't do anything with, for a block we'll have to look up in the buffer pool again in the next call. With the PrepareReadBuffer() design, that miss just starts a new range and we don't have to look it up again. Hmm, I suppose that could be mitigated somewhat with ReadRecentBuffer(), if we can find somewhere decent to store the buffer number.

BTW it was while thinking about and testing cases like that that I found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/ extremely useful. It can kick out every second page, or any other range-chopping scenario you can express in a WHERE clause. I would quite like to get that tool into the tree...