Re: [HACKERS] Parallel Hash take II - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Parallel Hash take II
Msg-id CA+TgmoaK=YuQvKfu5qz3oWTnGshBjBVpEnFX99V1bvFgfWKBJg@mail.gmail.com
In response to Re: [HACKERS] Parallel Hash take II  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu, Nov 16, 2017 at 6:42 PM, Andres Freund <andres@anarazel.de> wrote:
> What I'm basically worried about is that the *only* reason for some
> plans to choose parallelism is the difference in effective work_mem:
> the parallel plan gets (max_parallel_workers_per_gather + 1) * work_mem.
> That might push queries to use parallelism even if it's not actually
> beneficial in reducing runtime.

I don't get it.  If we switch to using parallelism and the runtime
doesn't go down, that just means the costing is wrong.  The costing is
supposed to reflect the runtime.

It's true that adding parallel hash may tend to increase the amount of
memory that gets used in practice.  But it seems to me that any plan
that uses memory and is sometimes chosen over a non-memory-using plan
does that.
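
For concreteness, the budget in question is just this arithmetic (a
sketch of the formula being discussed, not the actual executor code):

    /* Sketch only: the shared hash table's memory budget under the
     * scheme being discussed, where the leader and each planned worker
     * contribute one work_mem's worth of space. */
    size_t
    parallel_hash_mem_budget(size_t work_mem_bytes, int planned_workers)
    {
        return work_mem_bytes * (size_t) (planned_workers + 1);
    }

So with work_mem = 4MB and 4 planned workers, the shared table could
grow to roughly 20MB, versus 4MB for the non-parallel hash.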

> Thomas' earlier comparison of this behaviour with e.g. parallel
> oblivious hash nodes does *NOT* seem apt to me. There's currently
> effectively no cost pressure for/against parallelism for those (even if
> there potentially should be). Which means they do not favor parallel
> queries solely because they're allowed to use more memory, and thus it's
> far less likely that every one of those nodes uses the maximum allotted
> work_mem.

I agree that there is no cost pressure for or against parallelism.
The design I have been pursuing with parallel query up to this
point is that cost represents execution time, so minimizing cost means
minimizing execution time, and that's the goal.  If we want, we can
put a tax on parallel query plans, so that they're only chosen when
they are *substantially* cheaper than non-parallel plans, or even so
that a plan with a few workers is preferred over a plan with more
workers unless the gains are sufficiently substantial.  But I don't
think the right way to do that is to prevent a parallel hash from
using as much memory as a non-parallel hash in the same spot would
use.

Rather, what we could do is, for example, have a knob that multiplies
the cost of a partial path by a floating-point value when we insert a
Gather/Gather Merge node.  If you want to always pick the cheapest
path regardless of whether it uses parallelism, set the GUC to 1.0.
If you only want to pick a parallel query path if it's twice as cheap
as the best non-parallel path, set the GUC to 2.0.
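
Roughly like this - parallel_cost_multiplier is just an illustrative
name for the knob, not an existing GUC:

    /* Sketch, not actual planner code: scale the partial path's cost
     * when the Gather / Gather Merge node is put on top of it.
     * A value of 1.0 keeps today's behavior; 2.0 means a parallel
     * plan has to look twice as cheap before it wins. */
    double
    cost_gather_total(double partial_path_total_cost,
                      double gather_overhead_cost,
                      double parallel_cost_multiplier)
    {
        return partial_path_total_cost * parallel_cost_multiplier
               + gather_overhead_cost;
    }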

> I think it's wrong to just multiply the amount of work_mem that way, and
> it'll bite us. Introducing a separate GUC, perhaps inheriting from
> work_mem if set to -1, that limits the amount of memory inside a
> parallel node seems saner. That value then would not be multiplied by
> the chosen worker number.

I don't mind having an option to override the amount of memory that
parallel hash is allowed to use, but I'm also not yet convinced that
we have a real problem that needs solving.
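
If we did add such an override, the inheritance Andres describes would
be trivial; parallel_work_mem here is a hypothetical name, not an
existing GUC:

    /* Sketch of the proposed override: a separate limit on memory used
     * inside a parallel node, falling back to work_mem when set to -1. */
    int
    effective_parallel_work_mem(int parallel_work_mem_kb, int work_mem_kb)
    {
        return (parallel_work_mem_kb == -1) ? work_mem_kb
                                            : parallel_work_mem_kb;
    }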

> As far as I understand it we currently cost a gather's startup cost
> once, not per worker.

Yes - because cost is supposed to measure execution time, and the
workers start all at once, not sequentially.  The startup time for the
workers as a whole doesn't get much longer as the number of workers
rises.  It might be that the formula should be 1000 + 50/worker or
something rather than a constant, but I doubt it matters very much.
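
That is, something on the order of the following, with a small
per-worker increment instead of the flat parallel_setup_cost charge we
make today; the constants are only the hypothetical ones above:

    /* Illustrative only: a per-worker tweak to the Gather startup
     * charge, rather than a single constant. */
    double
    gather_startup_cost(int nworkers)
    {
        return 1000.0 + 50.0 * nworkers;
    }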

> That startup cost effectively includes redundant
> work like materializing a table, building parallel oblivious hashtables,
> re-sorting the same table for a mergejoin, etc.

Right, because that work all contributes to total execution time.
Actually, when the same work is going to be redone in multiple
workers, we should really inflate the cost, because if two workers
each sort the same table, it takes each of them longer than it would
take a single worker to sort the table for itself.  That's one of the
biggest problems with parallel costing right now, from what I've seen:
we're too willing to do the same work over in each of 4-6
participants, not realizing that they're going to contend for buffer
locks and I/O bandwidth.
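
To make that concrete, the inflation could take a form like the one
below; the shape of the penalty is only a guess for illustration, not
a worked-out proposal:

    /* Sketch: charge more for work that every participant repeats,
     * since N copies of the same sort or scan contend for buffer
     * locks and I/O bandwidth and so each copy runs slower. */
    double
    redundant_work_cost(double single_copy_cost, int nparticipants,
                        double contention_penalty)
    {
        return single_copy_cost
               * (1.0 + contention_penalty * (nparticipants - 1));
    }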

> That's fine if your goal is solely to answer a single query as fast as
> possible, even if doubling the resources will only give you a minimal
> cost advantage.  If instead your goal is to optimize both for individual
> query performance and overall system throughput, that's obviously not
> good.

I agree.

> I think we should cost the startup cost of parallel nodes more like
> startup_cost * (max_parallel_workers_per_gather * parallel_resource_efficiency + 1)
>
> which would allow tuning query performance so that parallelism is used
> even for tiny benefits (parallel_resource_efficiency = 0), and conversely
> so that it only gets used if the overall loss of efficiency is small
> (parallel_resource_efficiency = 0.9 or such).

No, I think that's wrong.  I think you need to apply the adjustment at
the end of the (parallel) planning process, not incrementally to each
node.  Otherwise, the costs assigned to the individual nodes become
some unholy amalgam of execution time and total resource expenditure.
I think that will cause strange and stupid plan choices.  See also
notes above.
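
For reference, the quoted adjustment amounts to the per-node charge
below, as opposed to the single top-level multiplier described
earlier; parallel_resource_efficiency is a proposed knob, not an
existing GUC:

    /* Sketch of the quoted proposal: scale up the startup work that
     * each additional worker would repeat, applied at every parallel
     * node rather than once at the Gather. */
    double
    proposed_parallel_startup_cost(double startup_cost,
                                   int max_parallel_workers_per_gather,
                                   double parallel_resource_efficiency)
    {
        return startup_cost
               * (max_parallel_workers_per_gather
                  * parallel_resource_efficiency + 1.0);
    }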

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

