Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: Add parallel-aware hash joins.
Date
Msg-id 4001.1514678419@sss.pgh.pa.us
In response to Re: pgsql: Add parallel-aware hash joins.  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: pgsql: Add parallel-aware hash joins.
List pgsql-committers
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Dec 31, 2017 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... This isn't quite 100% reproducible on gaur/pademelon,
>> but it seems like it fails more often than not, so I can poke into it
>> if you can say what info would be helpful.

> Right.  That's apparently unrelated and is the last build-farm issue
> on my list (so far).  I had noticed that certain BF animals are prone
> to that particular failure, and they mostly have architectures that I
> don't have, so a few things probably just come out differently sized.  At
> first I thought I'd tweak the tests so that the parameters were always
> stable, and I got as far as installing Debian on qemu-system-ppc (it
> took a looong time to compile PostgreSQL), but that seems a bit cheap
> and flimsy... better to fix the size estimation error.

"Size estimation error"?  Why do you think it's that?  We have exactly
the same plan in both cases.

My guess is that what's happening is that one worker or the other ends
up processing the whole scan, or the vast majority of it, so that that
worker's hash table has to hold substantially more than half of the
tuples and thereby is forced to up the number of batches.  I don't see
how you can expect to estimate that situation exactly; or if you do,
you'll be pessimizing the plan for cases where the split is more nearly
equal.
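
For concreteness, here's a toy calculation of how that scenario plays out;
the numbers are invented and the per-worker memory model is only the rough
idea, not the executor's actual accounting:

/* Toy illustration of the skewed-split scenario sketched above.
 * All numbers are invented; this is not the executor's accounting. */
#include <stdio.h>

int
main(void)
{
    long        work_mem = 4L * 1024 * 1024; /* 4MB per participant */
    long        tuple_size = 64;             /* made-up tuple width */
    long        ntuples = 100000;            /* total inner tuples */
    long        even_bytes = (ntuples / 2) * tuple_size;
    long        skew_bytes = (ntuples * 95 / 100) * tuple_size;
    long        nbatch;

    for (nbatch = 1; even_bytes / nbatch > work_mem; nbatch *= 2)
        ;
    printf("even split:   %ld batch(es)\n", nbatch);

    for (nbatch = 1; skew_bytes / nbatch > work_mem; nbatch *= 2)
        ;
    printf("skewed split: %ld batch(es)\n", nbatch);
    return 0;
}

With an even split each worker's share fits in its budget; once one worker
has to hold nearly all of the tuples, the batch count doubles.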

By this theory, the reason why certain BF members are more prone to the
failure is that they're single-processor machines, and perhaps have
kernels with relatively long scheduling quanta, so that it's more likely
that the worker that gets scheduled first is able to read the whole input
to the hash step.

> I assume that what happens here is the planner's size estimation code
> sometimes disagrees with Parallel Hash's chunk-based memory
> accounting, even though in this case we had perfect tuple count and
> tuple size information.  In an earlier version of the patch set I
> refactored the planner to be chunk-aware (even for parallel-oblivious
> hash join), but later in the process I tried to simplify and shrink
> the patch set and avoid making unnecessary changes to non-Parallel
> Hash code paths.  I think I'll need to make the planner aware of the
> maximum amount of fragmentation possible when parallel-aware
> (something like: up to one tuple's worth at the end of each chunk, and
> up to one whole wasted chunk per participating backend).  More soon.
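
Just to put a rough number on the bound you're describing, here's a
back-of-the-envelope sketch; the tuple counts are made up, and the only
real constant is the 32kB chunk size (HASH_CHUNK_SIZE), so take it as an
illustration rather than proposed planner code:

/* Back-of-the-envelope version of the worst-case fragmentation bound
 * described above: up to one tuple wasted at the end of each chunk,
 * plus up to one whole chunk per participating backend.  Only the
 * chunk size corresponds to anything real (HASH_CHUNK_SIZE). */
#include <stdio.h>

#define CHUNK_SIZE (32 * 1024L)

int
main(void)
{
    long        ntuples = 100000;   /* made-up */
    long        tuple_size = 64;    /* made-up */
    int         nparticipants = 3;  /* leader + 2 workers */
    long        payload = ntuples * tuple_size;
    long        nchunks = (payload + CHUNK_SIZE - 1) / CHUNK_SIZE;
    long        worst_waste = nchunks * tuple_size +
        nparticipants * CHUNK_SIZE;

    printf("payload:     %ld bytes\n", payload);
    printf("worst waste: %ld bytes (%.1f%%)\n",
           worst_waste, 100.0 * worst_waste / payload);
    return 0;
}

With these particular numbers the slop comes out under two percent, though
of course that depends entirely on the tuple width.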

I'm really dubious that trying to model the executor's space consumption
exactly is a good idea, even if it did fix this specific problem.
That would expend extra planner cycles and pose a continuing maintenance
gotcha.

            regards, tom lane

