Re: Trying out read streams in pgvector (an extension) - Mailing list pgsql-hackers
| From | Melanie Plageman |
|---|---|
| Subject | Re: Trying out read streams in pgvector (an extension) |
| Date | |
| Msg-id | CAAKRu_ZGhnWZXOyEyZ2r47g-F7U8asMRA6U8YZw3h=2rR=m_hQ@mail.gmail.com Whole thread Raw |
| In response to | Re: Trying out read streams in pgvector (an extension) (Thomas Munro <thomas.munro@gmail.com>) |
| Responses |
Re: Trying out read streams in pgvector (an extension)
|
| List | pgsql-hackers |
On Wed, Nov 12, 2025 at 5:47 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I suppose an alternative design would be that _next_buffer() returns > InvalidBuffer only once (= the block number callback returns > InvalidBlock once) and then automatically resumes (= it restores the > distance) and then you can call read_stream_next_buffer() again (= the > block number callback will be called again to fill the stream up with > new buffers before waiting for the first one to be ready to give to > you if it isn't already). That would have the advantage of not > requiring a new function at all and make the patch even shorter, but I > don't know, I guess I thought that would be a bit more fragile in some > way, less explicit. Hmm, would it actually be better if it worked > like that? We discussed off-list and decided that changing existing functionality in an unexpected way is undesirable. So, it is better we stick with adding read_stream_resume. However, in talking about read_stream_resume() further, Thomas and I also thought of potential issues with it: If read_stream_resume() is called before the read stream user callback has ever returned InvalidBlockNumber, 1) The value of resume_distance will be the original value of distance from read_stream_begin_relation(). You don't want to reset the distance to that value. 2) There may be inflight or completed buffers that have yet to be yielded which will be returned the next time read_stream_next_buffer() is invoked. If the user resets the state the callback is using to return blocks and expects the next invocation of read_stream_next_buffer() to return buffers with those blocks, they will be disappointed. If we try to address this by requiring that stream->distance is 0 when read_stream_resume() is called, that won't work because while it is set to 0 when the callback returns InvalidBlockNumber, there may still be unreturned buffers in the stream. If the user wants to use read_stream_reset() to exhaust the stream before calling read_stream_resume(), read_stream_reset() sets stream->distance to 1 at the end, so read_stream_resume() couldn't detect if reset() was correctly called first or if the distance is > 0 because the stream is still in progress. To make sure 1) distance isn't reset to a resume_distance from read_stream_begin_relation() and 2) unexpected buffers aren't returned from the read stream, we could error out in read_stream_resume() if pinned_buffers > 0. And in read_stream_reset(), we would save distance in resume_distance before clearing distance. That would allow calling read_stream_resume() either if you called read_stream_reset() or if you exhausted the stream yourself. See rough attached patch for a sketch of this. It would be nicer if we could error out if read_stream_next_buffer() didn't return InvalidBuffer, but we can't do that if we want to allow calling read_stream_reset() followed by read_stream_resume() because distance won't be 0. I tried this with a modified pgvector with an hnsw read stream user and it seemed to work correctly. - Melanie
Attachment
pgsql-hackers by date: