On Fri, Apr 11, 2025 at 5:50 AM James Hunter <james.hunter.pg@gmail.com> wrote:
> I am looking at the pre-streaming code, in PG 17, as I am not familiar
> with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
> maintained two shared TBM iterators, for PQ. One of the iterators was
> the actual, "fetch" iterator; the other was the "prefetch" iterator,
> which kept some distance ahead of the "fetch" iterator (to hide read
> latency).
We're talking at cross-purposes.
The new streaming BHS isn't just issuing probabilistic hints about
future access obtained from a second iterator. It has just one shared
iterator connected up to the workers' ReadStreams. Each worker pulls
a disjoint set of blocks out of its stream, possibly running a bunch
of IOs in the background as required. The stream replaces the old
ReadBuffer() call, and the old PrefetchBuffer() call and a bunch of
dubious iterator synchronisation logic are deleted. These are now
real IOs running in the background and for the *exact* blocks you will
consume; posix_fadvise() was just a stepping stone towards AIO that
tolerated sloppy synchronisation, including being entirely wrong. If
you additionally teach the iterator to work in batches, as my 0001
patch (which I didn't propose for v18) showed, then one worker might
end up processing (say) 10 blocks at end-of-scan while all the other
workers have finished the node, and maybe the whole query. That'd be
unfair. "Ramp-down" ... 8, 4, 2, 1 has been used in one or two other
places in parallel-aware nodes with internal batching as a kind of
fudge to help them finish CPU work around the same time if you're
lucky, and my 0002 patch shows that NOT working here. I suspect the
concept itself is defunct: it no longer narrows the CPU work
completion time range across workers at all well due to the elastic
streams sitting in between. Any naive solution that requires
cooperating with, or waiting for, another worker to hand over the
final scraps of work originally allocated to it is probably a
deadlock risk. (I don't mean the IO completion part; that all just
works fine as you say, and a lot of engineering went into the buffer
manager to make that true, for AIO but also in the preceding
decades... What I mean here is: how do you even know which block to
read?) Essays have been written on the topic if you are interested.
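
To make that concrete, here is a heavily simplified sketch of the
shape of the thing. This is not the actual PG 18 code: it assumes the
PG 17-era tidbitmap API (tbm_shared_iterate() returning a
TBMIterateResult *) and the read_stream.h API, and the names
BHSStreamState, bhs_stream_next_block() and bhs_consume() are
invented for illustration:

#include "postgres.h"

#include "nodes/tidbitmap.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

typedef struct BHSStreamState
{
    TBMSharedIterator *shared_iterator; /* the ONE iterator, shared by all workers */
} BHSStreamState;

/*
 * Block number callback: each worker's ReadStream pulls the next block
 * out of the single shared iterator, so the workers consume disjoint
 * sets of blocks and the stream issues real IOs for exactly those
 * blocks.
 */
static BlockNumber
bhs_stream_next_block(ReadStream *stream,
                      void *callback_private_data,
                      void *per_buffer_data)
{
    BHSStreamState *state = (BHSStreamState *) callback_private_data;
    TBMIterateResult *tbmres = tbm_shared_iterate(state->shared_iterator);

    if (tbmres == NULL)
        return InvalidBlockNumber;  /* bitmap exhausted; stream drains */

    /* Stash what the consumer needs to see along with the buffer. */
    *(bool *) per_buffer_data = tbmres->recheck;

    return tbmres->blockno;
}

/* Consumer loop: this replaces the old ReadBuffer() + PrefetchBuffer(). */
static void
bhs_consume(Relation rel, BHSStreamState *state)
{
    ReadStream *stream;
    Buffer      buffer;
    void       *per_buffer_data;

    stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                        NULL,   /* no strategy, for brevity */
                                        rel,
                                        MAIN_FORKNUM,
                                        bhs_stream_next_block,
                                        state,
                                        sizeof(bool));

    while ((buffer = read_stream_next_buffer(stream, &per_buffer_data)) !=
           InvalidBuffer)
    {
        bool        recheck = *(bool *) per_buffer_data;

        /* ... process the heap page, honouring recheck ... */
        ReleaseBuffer(buffer);
    }

    read_stream_end(stream);
}

If you instead teach that callback to pull (say) a small batch of
block numbers per visit to the shared iterator, to cut locking and
improve IO combining, that's exactly where the end-of-scan fairness
problem above shows up: the last batch can leave one worker holding
blocks that no other worker can take over.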
All the rest of our conversation makes no sense without that context :-)
> > I admit this all sounds kinda complicated and maybe there is a much
> > simpler way to achieve the twin goals of maximising I/O combining AND
> > parallel query fairness.
>
> I tend to think that the two goals are so much in conflict, that it's
> not worth trying to apply cleverness to get them to agree on things...
I don't give up so easily :-)