
From: Thomas Munro
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date:
Msg-id: CA+hUKGKi8WG1HEZAQBC8PJrmfaf+mLug3PN3ytqxKYm5ghEwCA@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (James Hunter <james.hunter.pg@gmail.com>)
List: pgsql-hackers
On Fri, Apr 11, 2025 at 5:50 AM James Hunter <james.hunter.pg@gmail.com> wrote:
> I am looking at the pre-streaming code, in PG 17, as I am not familiar
> with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
> maintained two shared TBM iterators, for PQ. One of the iterators was
> the actual, "fetch" iterator; the other was the "prefetch" iterator,
> which kept some distance ahead of the "fetch" iterator (to hide read
> latency).

We're talking at cross-purposes.

The new streaming BHS isn't just issuing probabilistic hints about
future access obtained from a second iterator.  It has just one shared
iterator connected up to the workers' ReadStreams.  Each worker pulls
a disjoint set of blocks out of its stream, possibly running a bunch
of IOs in the background as required.  The stream replaces the old
ReadBuffer() call, and the old PrefetchBuffer() call and a bunch of
dubious iterator synchronisation logic are deleted.  These are now
real IOs running in the background, for the *exact* blocks you will
consume; posix_fadvise() was just a stepping stone towards AIO that
tolerated sloppy synchronisation, including being entirely wrong.
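
To make that shape concrete, here is a rough sketch of the idea
(simplified for illustration, not the committed code; the struct and
member names around the read_stream.h API are stand-ins):

    #include "nodes/tidbitmap.h"
    #include "storage/read_stream.h"

    /* Hypothetical per-worker scan state, names invented for this sketch. */
    typedef struct MyScanState
    {
        TBMSharedIterator *shared_iterator; /* the one iterator, in shmem */
        ReadStream        *stream;          /* this worker's read stream */
    } MyScanState;

    /*
     * Block number callback for this worker's ReadStream.  The stream calls
     * this whenever it wants to start more IOs; we just pull the next block
     * for *this* worker out of the single shared iterator.
     */
    static BlockNumber
    my_bhs_next_block(ReadStream *stream,
                      void *callback_private_data,
                      void *per_buffer_data)
    {
        MyScanState *scan = callback_private_data;
        TBMIterateResult *tbmres = tbm_shared_iterate(scan->shared_iterator);

        if (tbmres == NULL)
            return InvalidBlockNumber;      /* this worker is done */

        return tbmres->blockno;
    }

    /*
     * The executor then consumes only the buffers this worker asked for,
     * with the IOs already running (or finished) in the background:
     *
     *     buffer = read_stream_next_buffer(scan->stream, NULL);
     */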

If you additionally teach the iterator to work in batches, as my 0001
patch (which I didn't propose for v18) showed, then one worker might
end up processing (say) 10 blocks at end-of-scan while all the other
workers have finished the node, and maybe the whole query.  That'd be
unfair.  "Ramp-down" ... 8, 4, 2, 1 has been used in one or two other
places in parallel-aware nodes with internal batching, as a kind of
fudge to help them finish their CPU work at around the same time if
you're lucky, and my 0002 patch shows that NOT working here.
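
For reference, that ramp-down fudge in its usual form is nothing more
than something like this (an illustrative sketch only; the helper
name, the cap and the divisor are invented here):

    /*
     * Halve the batch size as the scan approaches the end, so the final
     * allocations go ... 8, 4, 2, 1 and no worker is left holding one
     * last big batch while the others sit idle.
     */
    #define MAX_BATCH_BLOCKS 64

    static int
    choose_batch_size(BlockNumber blocks_remaining)
    {
        int     batch = MAX_BATCH_BLOCKS;

        while (batch > 1 && (BlockNumber) batch * 4 > blocks_remaining)
            batch /= 2;

        return batch;
    }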
I suspect the concept itself is defunct: it no longer narrows the CPU
work completion time range across workers at all well, due to the
elastic streams sitting in between.  And any naive solution that
requires cooperating with, or waiting for, another worker to hand over
the final scraps of work originally allocated to it is probably a
deadlock risk (I don't mean the IO completion part; that all works
just fine, as you say, and a lot of engineering went into the buffer
manager to make that true, for AIO but also in the preceding
decades... what I mean here is: how do you even know which block to
read?).  Essays have been written on the topic if you are interested.

All the rest of our conversation makes no sense without that context :-)

> > I admit this all sounds kinda complicated and maybe there is a much
> > simpler way to achieve the twin goals of maximising I/O combining AND
> > parallel query fairness.
>
> I tend to think that the two goals are so much in conflict, that it's
> not worth trying to apply cleverness to get them to agree on things...

I don't give up so easily :-)


