Re: Trying out read streams in pgvector (an extension) - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Trying out read streams in pgvector (an extension) |
| Msg-id | 74f57c98-ed0c-4a7d-8474-b038e7ee3baf@vondra.me |
| In response to | Re: Trying out read streams in pgvector (an extension) (Peter Geoghegan <pg@bowt.ie>) |
| List | pgsql-hackers |
On 12/9/25 23:38, Peter Geoghegan wrote:
> On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Yielding just because you've scanned N index pages/tuples/whatever is
>> harder to think about. The stream shouldn't get far ahead unless it's
>> recently been useful for I/O concurrency (though optimal distance
>> heuristics are an open problem), but in this case a single invocation
>> of the block number callback can call ReadBuffer() an arbitrary number
>> of times, filtering out all the index tuples as it rampages through
>> the whole index IIUC. I see why you might want to yield periodically
>> if you can, but I also wonder how much that can really help if you
>> still have to pick up where you left off next time.
>
> I think of it as a necessary precaution against pathological behavior
> where the amount of memory used to cache matching tuples/TIDs gets out
> of hand. There's no specific reason to expect that to happen (or no
> good reason). But I'm pretty sure that it'll prove necessary to pay
> non-zero attention to how much work has been done since the last time
> we returned a tuple (when there's a tuple available to return).
>

I think it's not all that difficult to hit such runaway cases, loading an
arbitrary number of batches ahead. That's exactly why I had to come up
with the read_stream_reset approach in the first place, actually.

The simplest example is an index scan on a correlated index. We skip
duplicate blocks, so to find the next block to pass to the stream, we may
have to load multiple leaf pages. And we may need multiple such blocks to
satisfy the prefetch distance (= the number of I/Os).

An even more extreme example is an IOS with just a couple of
not-all-visible pages (that need prefetching). You hit the first one, then
the distance ramps up, and at some point there are no more not-all-visible
pages. But we try to maintain the distance, and we end up reading all the
remaining leaf pages.

I'm sure we need to protect against these issues, which is why we have
INDEX_SCAN_MAX_BATCHES. IIRC you also suggested we'll need some internal
limit to keep the number of buffer pins under control, right?

I agree there may be other important reasons to "pause", e.g. based on how
much work was done since the last time the index scan returned a tuple.
But I'm not sure what exactly to look at, because most of the "work" is
happening outside the index scan, no?

>> I guess it
>> depends on the distribution of matches.
>
> To be clear, I haven't done any kind of modelling of the problems in
> this area. Once I do that (in 2026), I'll be able to say more about
> the requirements. Maybe Tomas could take a look sooner?
>

I don't have an explicit formal model of the problems, but it should not
be difficult to come up with "adversarial cases" for the prefetching
patches, for example. That is, ways to construct datasets / indexes that
end up looking very far (many leaf pages) ahead. Is that what you meant by
modelling, or did you mean some sort of formal model of how far to
prefetch for a given dataset?

AFAICS the "ideal" prefetch distance would be such that a page gets loaded
into shared buffers right before the scan actually needs it (after
processing the earlier index entries). Let's say we know:

* T1 - the cost of "processing" one index tuple (the average time between
  calls to table_index_fetch_tuple?)

* T2 - the cost of one I/O, i.e. how long it takes to read a block

We want to keep the look-ahead distance at some K such that

    K * T1 > T2

but not much more than that.
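As a back-of-the-envelope illustration of that inequality (my sketch, not
something from the thread, with entirely made-up values for T1 and T2):
the smallest K satisfying K * T1 > T2 is floor(T2 / T1) + 1.

```c
/*
 * Illustrative sketch only: given a measured per-tuple processing cost
 * T1 and per-block I/O cost T2, compute the smallest look-ahead distance
 * K with K * T1 > T2. The values below are invented for the example.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double t1 = 5.0;    /* assumed: usec to process one index tuple */
    double t2 = 100.0;  /* assumed: usec to read one block */

    /* smallest K such that K * T1 > T2 */
    int k = (int) floor(t2 / t1) + 1;

    printf("K = %d (%.0f usec of processing hides %.0f usec of I/O)\n",
           k, k * t1, t2);
    return 0;
}
```

With those numbers this prints K = 21, i.e. the scan would have to stay
about 21 index tuples ahead for the read to complete just in time.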
But I'm not sure where to get T1 and T2, as T1 depends on what happens
outside the index scan, and T2 is hardware dependent (and not actually
linear / additive). Or am I confused and you asked about something else?

regards

--
Tomas Vondra