From: Andres Freund
Subject: Re: [HACKERS] Parallel Hash take II
Msg-id: 20171116234216.oljmhk2fgs2axtl7@alap3.anarazel.de
In response to: Re: [HACKERS] Parallel Hash take II  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Parallel Hash take II  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2017-11-15 14:09:13 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 1:35 PM, Andres Freund <andres@anarazel.de> wrote:
> > But this does bug me, and I think it's what made me pause here to make a
> > bad joke.  The way that parallelism treats work_mem makes it even more
> > useless of a config knob than it was before.  Parallelism, especially
> > after this patch, shouldn't compete / be benchmarked against a
> > single-process run with the same work_mem. To make it "fair" you need to
> > compare parallelism against a single threaded run with work_mem *
> > max_parallelism.
> 
> I don't really know how to do a fair comparison between a parallel
> plan and a non-parallel plan.  Even if the parallel plan contains zero
> nodes that use work_mem, it might still use more memory than the
> non-parallel plan, because a new backend uses a bunch of memory.  If
> you really want a comparison that is fair on the basis of memory
> usage, you have to take that into account somehow.

That's not quite what I'm concerned about.  Consider something
(completely artificial) like:

tpch_5[18786][1]=# SET work_mem = '50MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*) FROM orders JOIN customer ON (o_custkey = c_custkey) WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05' GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
                                                        QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=77344.16..77344.19 rows=10 width=27)
   ->  Sort  (cost=77344.16..77379.68 rows=14206 width=27)
         Sort Key: (count(*)) DESC
         ->  HashAggregate  (cost=76895.12..77037.18 rows=14206 width=27)
               Group Key: customer.c_name
               ->  Hash Join  (cost=35347.04..76824.09 rows=14206 width=19)
                     Hash Cond: (orders.o_custkey = customer.c_custkey)
                     ->  Bitmap Heap Scan on orders  (cost=302.04..41599.74 rows=14206 width=4)
                           Recheck Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
                           ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..298.49 rows=14206 width=0)
                                 Index Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
                     ->  Hash  (cost=25670.00..25670.00 rows=750000 width=23)
                           ->  Seq Scan on customer  (cost=0.00..25670.00 rows=750000 width=23)
(13 rows)

tpch_5[18786][1]=# SET work_mem = '10MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*) FROM orders JOIN customer ON (o_custkey = c_custkey) WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05' GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
                                                           QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=82847.92..82847.94 rows=10 width=27)
   ->  Sort  (cost=82847.92..82883.43 rows=14206 width=27)
         Sort Key: (count(*)) DESC
         ->  HashAggregate  (cost=82398.87..82540.93 rows=14206 width=27)
               Group Key: customer.c_name
               ->  Merge Join  (cost=42580.44..82327.84 rows=14206 width=19)
                     Merge Cond: (customer.c_custkey = orders.o_custkey)
                     ->  Index Scan using i_c_custkey on customer  (cost=0.42..37663.43 rows=750000 width=23)
                     ->  Sort  (cost=42579.54..42615.05 rows=14206 width=4)
                           Sort Key: orders.o_custkey
                           ->  Bitmap Heap Scan on orders  (cost=302.04..41599.74 rows=14206 width=4)
                                 Recheck Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
                                 ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..298.49 rows=14206 width=0)
                                       Index Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))
(14 rows)

Note how the plan switched from a hash join to a merge join solely on the
basis of different work_mem settings, and that there are obviously
different costs associated with the two plans.

What I'm basically worried about is that for some plans the *only*
reason to choose parallelism is that the parallel plan effectively gets
(max_parallel_workers_per_gather + 1) * work_mem to work with. That
might push queries to use parallelism even if it's not actually
beneficial in reducing runtime.
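
To make that concrete with the settings from the example above, here's a
back-of-the-envelope sketch (plain arithmetic, not actual planner code;
the 4-worker count is just an assumed max_parallel_workers_per_gather):

#include <stdio.h>

int
main(void)
{
    /* values from the example above; work_mem is stored in kB */
    int         work_mem = 10 * 1024;                   /* 10MB */
    int         max_parallel_workers_per_gather = 4;    /* assumed setting */

    /* a single-process hash join is bounded by work_mem alone */
    long        serial_limit = work_mem;

    /*
     * A shared parallel hash table is effectively bounded by work_mem
     * times the number of participants (leader + workers).
     */
    long        parallel_limit =
        (long) (max_parallel_workers_per_gather + 1) * work_mem;

    printf("serial limit:   %ld kB\n", serial_limit);    /* 10240 kB */
    printf("parallel limit: %ld kB\n", parallel_limit);  /* 51200 kB, i.e. 50MB */

    return 0;
}

So at work_mem = 10MB and four workers the parallel plan effectively gets
the same 50MB that made the serial plan above pick the hash join.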


Thomas' earlier comparison of this behaviour with e.g. parallel
oblivious hash nodes does *NOT* seem apt to me. There's currently
effectively no cost pressure for/against parallelism for those (even if
there potentially should be). That means they do not favor parallel
queries solely because they're allowed to use more memory, and thus it's
far less likely that each of those nodes uses the maximum allotted
work_mem.


I think it's wrong to just multiply the amount of work_mem that way, and
it'll bite us. Introducing a separate GUC, perhaps inheriting from
work_mem if set to -1, that limits the amount of memory inside a
parallel node seems saner. That value then would not be multiplied by
the chosen worker number.
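
A minimal sketch of what I mean, assuming a hypothetical
parallel_work_mem GUC and helper (neither exists today):

/* existing GUC variable (kB) */
extern int  work_mem;

/* hypothetical GUC (kB): -1 means "inherit work_mem" */
int         parallel_work_mem = -1;

/*
 * Hypothetical helper: memory limit for a single parallel-aware node,
 * deliberately *not* multiplied by the number of workers.
 */
static int
parallel_node_work_mem(void)
{
    return (parallel_work_mem == -1) ? work_mem : parallel_work_mem;
}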


> But even then, the parallel plan is also almost certainly consuming
> more CPU cycles to produce the same results.  Parallelism is all about
> trading away efficiency for execution time.  Not just because of
> current planner and executor limitations, but intrinsically, parallel
> plans are less efficient.  The globally optimal solution on a system
> that is short on either memory or CPU cycles is to turn parallelism
> off.

I do think that our parallelism isn't properly tunable on that front.  I
think we really need something like a 'parallel_resource_efficiency'
[0..1] GUC.

As far as I understand it we currently cost a Gather's startup cost
once, not per worker. That startup cost effectively includes redundant
work like materializing a table, building parallel-oblivious hashtables,
re-sorting the same table for a mergejoin, etc.

That's fine if your goal is solely to return the results of a single
query as fast as possible, even if doubling the resources will only give
you a minimal cost advantage.  If instead your goal is to optimize both
for individual query performance and overall system throughput, that's
obviously not good.

I think we should cost the startup cost of parallel nodes more like
startup_cost * (max_parallel_workers_per_gather * parallel_resource_efficiency + 1)

which'd allow tuning queries to use parallelism even for tiny benefits
(parallel_resource_efficiency = 0), and conversely tuning it so
parallelism only gets used if the overall loss of efficiency is small
(parallel_resource_efficiency = 0.9 or such).
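
In rough pseudo-planner terms, assuming the proposed
parallel_resource_efficiency GUC existed (it does not today):

extern int      max_parallel_workers_per_gather;   /* existing GUC */
extern double   parallel_resource_efficiency;      /* hypothetical GUC, 0.0..1.0 */

/*
 * Charge the redundant per-worker startup work (materialization,
 * parallel-oblivious hash builds, sorts, ...) once per worker, scaled by
 * how much the user values overall efficiency over single-query latency.
 */
static double
parallel_startup_cost(double startup_cost)
{
    return startup_cost *
        (max_parallel_workers_per_gather * parallel_resource_efficiency + 1.0);
}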

(skipping over a lot of details of such a proposal)

Now, currently that'd have the weakness that we'd sometimes end up not
using parallelism because, at the determined degree of parallelism, it's
not beneficial, even though a lower degree of parallelism would still be
worthwhile. But given that we currently don't plan for multiple degrees
of parallelism, that seems the right thing to do if you care about
efficiency / overall throughput.


Greetings,

Andres Freund

