Re: Trying out read streams in pgvector (an extension) - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Trying out read streams in pgvector (an extension) |
| Date | |
| Msg-id | CA+hUKGJLT2JvWLEiBXMbkSSc5so_Y7=N+S2ce7npjLw8QL3d5w@mail.gmail.com |
| In response to | Re: Trying out read streams in pgvector (an extension) (Melanie Plageman <melanieplageman@gmail.com>) |
| Responses | Re: Trying out read streams in pgvector (an extension); Re: Trying out read streams in pgvector (an extension) |
| List | pgsql-hackers |
On Fri, Nov 21, 2025 at 4:28 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> I'm not totally opposed to this. My rationale for making it an error
> is that the developer could have test cases where all the buffers are
> consumed but the code is written such that that won't always happen.
> Then if a real production query doesn't consume all the buffers, it
> could return wrong results (I think). That will mean the user can't
> complete their query until the extension author releases a new version
> of their code. But I'm not sure what the right answer is here.

Focusing on making sure v19 has a good interface for this, and abandoning thoughts of back-patching a bandaid (and the constraints that leads to) for now... I think it'd be better if that were the consumer's choice. I don't want the consumer to be required to drain the stream before resuming, as that'd be an unprincipled stall. For example, if new WAL arrives over the network, then I think it should be possible for recovery's WAL-powered stream of heap pages to resume looking ahead even if recovery hasn't drained the existing stream completely.

Peter G (CC'd) and I discussed some problems he had in the index prefetching work, and I tried to extend this a bit to give the semantics he wanted, in point 2 below. It's simple in itself, but might lead to some tricky questions higher up. Posted for experimentation. It'll be interesting to see if this goes somewhere.

1. read_stream_resume() as before, but with a new explicit read_stream_pause(): if a block number callback would like to report a temporary lack of information, it should return read_stream_pause(stream), not InvalidBlockNumber. Then, after read_stream_resume(stream) is called, the next read_stream_next_buffer() enters the lookahead loop again. While paused, if the consumer drains all the existing buffers in the stream and then asks for one more, it will receive InvalidBuffer; but if the _resume() call is made sooner, the consumer won't ever know about the temporary lack of buffers in the stream.

2. read_stream_yield(): while streaming heap pages that come from TIDs on index pages, Peter didn't like that the executor lost control of how much work was done by the lookahead loop underneath read_stream_next_buffer(). The consumer might have a heap page with some tuples that could be emitted right now, but the block number callback might be evaluating arbitrarily expensive filter qual expressions far ahead, and they might prefer to emit more tuples now before doing an unbounded amount of work finding more. This interface allows some limited coroutine-like multitasking, where the block number callback can return read_stream_yield(stream) to hand control back to the consumer periodically if it knows the consumer could already do something else. It works by pausing the stream and resuming it in the next read_stream_next_buffer() call, but that's an internal detail. (A rough sketch of the intended usage follows below.)

Some half-baked thoughts about the resulting flow control:

Yielding control periodically just when it happens to be possible within the constraints of the volcano executor is an interesting thing to think about. You can only yield if you already have a tuple to emit. There is no saying when control will return to you, and the node you yield to might immediately block on I/O even though you could have been doing useful CPU work.
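Before going further, here is roughly what the usage described in points 1 and 2 looks like. This is a minimal sketch, not code from the attached patch: MyScanState, the helper functions, and the threshold are invented for illustration, only read_stream_next_buffer() exists in the current API, and the signatures of the pause/resume/yield calls are inferred from the description above.

```c
/*
 * Sketch only: MyScanState and the helper functions (have_more_work,
 * next_heap_block, wait_for_more_work, process_page) are invented for
 * illustration.  read_stream_next_buffer() is the existing API; the
 * pause/resume/yield calls are the proposed additions described above.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"

#define MY_YIELD_THRESHOLD 32	/* arbitrary, for illustration */

typedef struct MyScanState
{
	int			blocks_since_yield; /* lookahead work since last yield */
	/* ... whatever per-scan state the extension keeps ... */
} MyScanState;

/* Invented helpers standing in for extension-specific logic. */
extern bool have_more_work(MyScanState *scan);
extern bool wait_for_more_work(MyScanState *scan);
extern BlockNumber next_heap_block(MyScanState *scan);
extern void process_page(MyScanState *scan, Buffer buf);

/* Block number callback, as passed to read_stream_begin_relation(). */
static BlockNumber
my_block_number_cb(ReadStream *stream, void *callback_private_data,
				   void *per_buffer_data)
{
	MyScanState *scan = (MyScanState *) callback_private_data;

	/* Point 1: temporarily out of information -> pause, don't terminate. */
	if (!have_more_work(scan))
		return read_stream_pause(stream);

	/* Point 2: hand control back to the consumer every so often. */
	if (++scan->blocks_since_yield > MY_YIELD_THRESHOLD)
	{
		scan->blocks_since_yield = 0;
		return read_stream_yield(stream);
	}

	return next_heap_block(scan);
}

/* Consumer loop: decides for itself when to resume a paused stream. */
static void
my_consume(MyScanState *scan, ReadStream *stream)
{
	for (;;)
	{
		Buffer		buf = read_stream_next_buffer(stream, NULL);

		if (!BufferIsValid(buf))
		{
			/* Paused (or finished) and fully drained. */
			if (!wait_for_more_work(scan))
				break;			/* really done */
			read_stream_resume(stream);
			continue;
		}

		process_page(scan, buf);
		ReleaseBuffer(buf);
	}
}
```

The InvalidBuffer handling in the consumer loop is the point: whether a drained, paused stream means end-of-scan or "wait and resume" is the consumer's decision rather than something the stream enforces.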
You probably need an event-driven, node-hopping executor to fix that mismatch in general, but on the flip side, I can think of one bet that I'd take: if you already have a tuple to emit, AND index scans themselves (not only the referenced heap pages) were also streamed, AND a hypothetical read_stream_next_buffer_no_wait(btree_stream) said the next index page you need is not ready yet, then you should yield (sketched below). You're gambling that other plan nodes will have better luck running without an I/O stall, because you yourself have a ~0% chance of avoiding one.

Yielding just because you've scanned N index pages/tuples/whatever is harder to think about. The stream shouldn't get far ahead unless it has recently been useful for I/O concurrency (though optimal distance heuristics are an open problem), but in this case a single invocation of the block number callback can call ReadBuffer() an arbitrary number of times, filtering out all the index tuples as it rampages through the whole index, IIUC. I see why you might want to yield periodically if you can, but I also wonder how much that can really help if you still have to pick up where you left off next time; I guess it depends on the distribution of matches.

It's also clear that any cold-cache testing done with direct I/O enabled will stall abominably as long as that level calls ReadBuffer(), possibly confusing matters.
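To sketch that bet: assuming index pages really were streamed and the hypothetical read_stream_next_buffer_no_wait() existed (taken here to return InvalidBuffer when the next page would require waiting for I/O), the decision rule might look something like this. IndexDrivenState and the helpers are invented, and none of this is in the attached patch.

```c
/*
 * Sketch only: IndexDrivenState and the helpers are invented, and
 * read_stream_next_buffer_no_wait() is the hypothetical call mentioned
 * above, assumed to return InvalidBuffer when the next index page would
 * require waiting for I/O.
 */
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"

typedef struct IndexDrivenState
{
	ReadStream *index_stream;	/* stream of index pages (also hypothetical) */
	int			tuples_ready_to_emit;	/* tuples the consumer could emit now */
	int			pending_heap_blocks;	/* extracted but not yet returned */
	/* ... */
} IndexDrivenState;

/* Invented helpers for illustration only. */
extern BlockNumber pop_pending_heap_block(IndexDrivenState *st);
extern void extract_heap_blocks(IndexDrivenState *st, Buffer index_buf);

/* Block number callback for the heap-page stream. */
static BlockNumber
index_driven_heap_cb(ReadStream *heap_stream, void *callback_private_data,
					 void *per_buffer_data)
{
	IndexDrivenState *st = (IndexDrivenState *) callback_private_data;

	for (;;)
	{
		Buffer		index_buf;

		/* Drain heap blocks already extracted from earlier index pages. */
		if (st->pending_heap_blocks > 0)
			return pop_pending_heap_block(st);

		/* Would the next index page stall on I/O? */
		index_buf = read_stream_next_buffer_no_wait(st->index_stream, NULL);
		if (!BufferIsValid(index_buf))
		{
			/*
			 * The bet: if the consumer already has at least one tuple it
			 * could emit, yield and gamble that some other plan node can
			 * make progress without an I/O stall, because we can't.
			 */
			if (st->tuples_ready_to_emit > 0)
				return read_stream_yield(heap_stream);

			/* Nothing to emit anyway, so just wait for the index page. */
			index_buf = read_stream_next_buffer(st->index_stream, NULL);
			if (!BufferIsValid(index_buf))
				return InvalidBlockNumber;	/* index exhausted */
		}

		/* Extract matching heap blocks (maybe none) and release the page. */
		extract_heap_blocks(st, index_buf);
		ReleaseBuffer(index_buf);
	}
}
```

The only interesting branch is the one guarded by tuples_ready_to_emit; everything else is plumbing, and a real version would also need the index stream to distinguish "not ready yet" from "exhausted".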