On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order.  A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node.  During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools.  Or something.  IDK, just some thoughts...

And BTW there are papers about NUMA-aware hash joins (but they mostly
just remind me that I have to reboot the prefetching patch long before
getting to any of that...), for example:
https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf
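
To make the batch-selection idea quoted above a bit more concrete, here is
a minimal sketch, not PostgreSQL code. SketchBatch, its fields and
sketch_choose_batch() are all hypothetical names invented for illustration,
and it assumes that each in-progress batch somehow records the NUMA node
holding its hash table and that a worker knows its own node. It keeps the
existing preference for an unclaimed batch, then prefers helping on the
same node, and only then falls back to plain round-robin order:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-batch state; NOT PostgreSQL's ParallelHashJoinBatch. */
typedef struct SketchBatch
{
    int  home_node;  /* assumed: NUMA node holding this batch's hash table;
                      * only meaningful once a worker has claimed the batch */
    int  nworkers;   /* workers currently attached to this batch */
    bool done;       /* batch has been fully processed */
} SketchBatch;

/*
 * Pick the next batch for a worker running on NUMA node my_node, scanning
 * in round-robin order from start.  Returns a batch index, or -1 if every
 * batch is done.
 *
 * Preference order:
 *   1. the first unclaimed batch (no contention at all)
 *   2. the first in-progress batch whose hash table is on my_node
 *   3. the first in-progress batch in round-robin order
 */
static int
sketch_choose_batch(const SketchBatch *batches, int nbatch,
                    int start, int my_node)
{
    int same_node = -1;
    int any = -1;

    assert(nbatch > 0 && start >= 0 && start < nbatch);

    for (int i = 0; i < nbatch; i++)
    {
        int idx = (start + i) % nbatch;
        const SketchBatch *b = &batches[idx];

        if (b->done)
            continue;
        if (b->nworkers == 0)
            return idx;             /* work alone if possible */
        if (same_node < 0 && b->home_node == my_node)
            same_node = idx;        /* remember first same-node candidate */
        if (any < 0)
            any = idx;              /* remember round-robin fallback */
    }

    return same_node >= 0 ? same_node : any;
}
```

Whether a scan like this would pay for itself is an open question; it
ignores how the node of a batch would be tracked in shared memory and how
workers would be pinned, which is probably where the real work is.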