Re: Trying out read streams in pgvector (an extension) - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Trying out read streams in pgvector (an extension) |
| Msg-id | 74f57c98-ed0c-4a7d-8474-b038e7ee3baf@vondra.me |
| In response to | Re: Trying out read streams in pgvector (an extension) (Peter Geoghegan <pg@bowt.ie>) |
| List | pgsql-hackers |
On 12/9/25 23:38, Peter Geoghegan wrote:
> On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Yielding just because you've scanned N index pages/tuples/whatever is
>> harder to think about. The stream shouldn't get far ahead unless it's
>> recently been useful for I/O concurrency (though optimal distance
>> heuristics are an open problem), but in this case a single invocation
>> of the block number callback can call ReadBuffer() an arbitrary number
>> of times, filtering out all the index tuples as it rampages through
>> the whole index IIUC. I see why you might want to yield periodically
>> if you can, but I also wonder how much that can really help if you
>> still have to pick up where you left off next time.
>
> I think of it as a necessary precaution against pathological behavior
> where the amount of memory used to cache matching tuples/TIDs gets out
> of hand. There's no specific reason to expect that to happen (or no
> good reason). But I'm pretty sure that it'll prove necessary to pay
> non-zero attention to how much work has been done since the last time
> we returned a tuple (when there's a tuple available to return).
>

I think it's not all that difficult to hit such runaway cases, loading an
arbitrary number of batches ahead. That's exactly why I had to come up
with the read_stream_reset approach in the first place, actually.

The simplest example is an index scan on a correlated index. We skip
duplicate blocks, so to find the next block to pass to the stream, we may
have to load multiple leaf pages. And we may need multiple such blocks to
satisfy the prefetch distance (= the number of I/Os).

An even more extreme example is an IOS with just a couple of
not-all-visible pages (that need prefetching). You hit the first one, then
the distance ramps up, and at some point there are no more not-all-visible
pages. But we try to maintain the distance, and we end up reading all the
remaining leaf pages.

I'm sure we need to protect against these issues, which is why we have
INDEX_SCAN_MAX_BATCHES. IIRC you also suggested we'll need some internal
limit to keep the number of buffer pins under control, right?

I agree there may be other important reasons to "pause", e.g. based on how
much work was done since the last time the index scan returned a tuple.
But I'm not sure what exactly to look at, because most of the "work" is
happening outside the index scan, no?

>> I guess it
>> depends on the distribution of matches.
>
> To be clear, I haven't done any kind of modelling of the problems in
> this area. Once I do that (in 2026), I'll be able to say more about
> the requirements. Maybe Tomas could take a look sooner?
>

I don't have an explicit formal model of the problems, but it should not
be difficult to come up with "adversarial cases" for the prefetching
patches, for example. That is, ways to construct datasets / indexes that
end up looking very far (many leaf pages) ahead. Is that what you meant by
modelling, or did you mean some sort of formal model of how far to
prefetch for a given dataset?

AFAICS the "ideal" prefetch distance would be such that a page gets loaded
into shared buffers right before the scan actually needs it (after
processing the earlier index entries). Let's say we know:

* T1 - the cost of "processing" one index tuple (the average time between
  calls to table_index_fetch_tuple?)

* T2 - the cost of one I/O, i.e. how long it takes to read a block

We want to keep the look-ahead distance at some K such that

    K * T1 > T2

but not much more than that.
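As a back-of-the-envelope illustration of that inequality (my sketch, not
something from the thread, with entirely made-up values for T1 and T2):
the smallest K satisfying K * T1 > T2 is floor(T2 / T1) + 1.

```c
/*
 * Illustrative sketch only: given a measured per-tuple processing cost
 * T1 and per-block I/O cost T2, compute the smallest look-ahead distance
 * K with K * T1 > T2. The values below are invented for the example.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double t1 = 5.0;    /* assumed: usec to process one index tuple */
    double t2 = 100.0;  /* assumed: usec to read one block */

    /* smallest K such that K * T1 > T2 */
    int k = (int) floor(t2 / t1) + 1;

    printf("K = %d (%.0f usec of processing hides %.0f usec of I/O)\n",
           k, k * t1, t2);
    return 0;
}
```

With those numbers this prints K = 21, i.e. the scan would have to stay
about 21 index tuples ahead for the read to complete just in time.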
But I'm not sure where to get T1 and T2, as T1 depends on what happens
outside the index scan, and T2 is hardware dependent (and not actually
linear / additive). Or am I confused and you asked about something else?

regards

--
Tomas Vondra