
From Robert Haas
Subject parallel joins, and better parallel explain
Date
Msg-id CA+TgmoagReuGcfduoZ1iUMrCU3aR+MoRn7X+55DiEkyF32m_8g@mail.gmail.com
Responses Re: parallel joins, and better parallel explain  (Simon Riggs <simon@2ndQuadrant.com>)
Re: parallel joins, and better parallel explain  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Attached find a patch that does (mostly) two things.  First, it allows
the optimizer to generate plans where a Nested Loop or Hash Join
appears below a Gather node.  This is a big improvement on what we
have today, where only a sequential scan can be parallelized; with
this patch, entire join problems can be parallelized, as long as they
don't need a Merge Join (see below for more on this).  Second, it
improves the output of EXPLAIN when parallel workers are used.  With
this patch, EXPLAIN (ANALYZE, VERBOSE) displays not only totals for
all workers, as it does currently in master, but also per-worker
statistics.  Putting these two things together, you can get spiffy
stuff like this - thanks to Thom Brown for the query and sample data:

rhaas=# explain (analyze, verbose, costs off) SELECT count(*)  FROM
contacts NATURAL JOIN countries WHERE continent = 'Africa';
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=602.527..602.527 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=0.243..531.129 rows=1185951 loops=1)
         Number of Workers: 2
         ->  Hash Join (actual time=0.206..396.106 rows=395317 loops=3)
               Hash Cond: (contacts.country = countries.country)
               Worker 0: actual time=0.260..485.785 rows=486461 loops=1
               Worker 1: actual time=0.260..483.459 rows=480065 loops=1
               ->  Parallel Seq Scan on public.contacts (actual time=0.034..143.824 rows=1666667 loops=3)
                     Output: contacts.id, contacts.first_name, contacts.last_name, contacts.age, contacts.country
                     Worker 0: actual time=0.035..176.784 rows=2051492 loops=1
                     Worker 1: actual time=0.038..174.175 rows=2021506 loops=1
               ->  Hash (actual time=0.064..0.064 rows=59 loops=3)
                     Output: countries.country
                     Buckets: 1024  Batches: 1  Memory Usage: 11kB
                     Worker 0: actual time=0.070..0.070 rows=59 loops=1
                     Worker 1: actual time=0.069..0.069 rows=59 loops=1
                     ->  Seq Scan on public.countries (actual time=0.019..0.051 rows=59 loops=3)
                           Output: countries.country
                           Filter: (countries.continent = 'Africa'::text)
                           Rows Removed by Filter: 190
                           Worker 0: actual time=0.025..0.056 rows=59 loops=1
                           Worker 1: actual time=0.025..0.058 rows=59 loops=1
 Planning time: 0.247 ms
 Execution time: 603.285 ms
(25 rows)

The general theory of operation of this patch is that a Parallel Seq
Scan can be thought of as a "partial" path - that is, it can be run in
multiple workers and will produce part of the results in each worker;
when Gather is performed on those results, we get a complete result
set.  For reasons that should be fairly clear on short reflection, a
join between a partial path for one of the two relations and an
ordinary path for the other produces a partial path for the result;
joining two partial paths would produce wrong answers, because each
worker would see only its own slice of each input, so matching rows
assigned to different workers would never meet.  Thus, we
proceed by generating partial paths for each baserel and joinrel,
which can either be gathered to employ parallelism at that level, or
used to build partial paths for higher-level joinrels which can then
be gathered in turn.
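
To illustrate the plan shapes, here are two hypothetical skeletons
(sketches only, not output from the patch, using made-up tables a and
b).  A partial path can be gathered directly at the base relation:

 Gather
   ->  Parallel Seq Scan on a

or it can first be joined against an ordinary, complete path for the
other relation, producing a partial path for the joinrel, which is
then gathered above the join:

 Gather
   ->  Hash Join
         ->  Parallel Seq Scan on a
         ->  Hash
               ->  Seq Scan on b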

As mentioned above, this patch doesn't try to generate Merge Join
paths at present.  That could be changed, but the plans would probably
not be very good in most cases; the problem is of course that only one
of the two paths can be partial.  So, if you had a sort on both sides,
each worker would have to sort part of one relation (which sounds
fine) and all of the other one (which sounds bad).  You might
conceivably win with a Sort on one side and an Index Scan on the other
side, but probably not very often.  The Gather at the top of the plan
tree is order-destroying, so it doesn't even help for the merge
ordering to match the final query ordering.  I'll put some code in to
try partial Merge Join plans if there is a big hue and cry for it, but
personally my feeling is that it would be smarter to forget it for now
and write the code once we have some better options on the executor
side.  See the "parallelism + sorting" thread for some discussion of
what kinds of executor nodes would be useful in being able to generate
better parallel merge joins.  Some of that work might even get done in
time for 9.6, which would be nice.
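
For concreteness, a partial Merge Join of that sort would look
roughly like this (a hypothetical sketch, again with made-up tables,
not output from the patch):

 Gather
   ->  Merge Join
         Merge Cond: (a.x = b.x)
         ->  Sort
               ->  Parallel Seq Scan on a   (each worker sorts only its chunk)
         ->  Sort
               ->  Seq Scan on b            (each worker re-sorts all of b)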

I thought for a while that I might need some code to prevent the
generation of parallel plans in some cases that involve volatile
functions.  For example, consider this query:

select (length(continent) - 3) % 10, count(*)
  from contacts natural join countries
 where (length(continent) - 3) % 10 = substr(timeofday()::text, 23, 1)::integer
 group by 1;
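
timeofday() is indeed classified as volatile in the catalogs, which
you can confirm with a quick (illustrative) look at pg_proc:

rhaas=# select provolatile from pg_proc where proname = 'timeofday';
 provolatile
-------------
 v
(1 row)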

Without parallelism, this gets evaluated (on my machine, anyway) as a
hash join with countries on the inner side; the volatile but
parallel-safe filter condition is applied to the seq scan that feeds
the hash join.  Thus the decision as to whether each row of the
countries table is in or out gets made just once.  If this gets run
with a parallel hash join, each worker will make its own decision
about which rows to include in the hash table, and they probably won't
all make the same decisions.  This means that the parallel hash join
could return a combination of rows that could never be returned by any
serial plan.  That sounds bad, until you realize that if you disable
hash and merge joins and materialization, this can also be run as a
nested loop plan, in which case - if you can persuade the optimizer to
put the countries table on the inner side of the join - you can get
the executor to evaluate the filter condition on countries once for
every row in the contacts table, and of course there's nothing at all
that will make it give the same answer for each row each time.
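
For the record, that serial nested loop variant can be coaxed out
with the usual planner switches, something like the following sketch
(you may still need to persuade the planner to put countries on the
inner side):

set enable_hashjoin = off;
set enable_mergejoin = off;
set enable_material = off;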

After mulling it over a bit and studying the documentation, I realized
that marking a function "volatile" guarantees that it won't be
executed *less than once per base table row*.  The use of a parallel
plan here doesn't violate that rule.  "volatile" never guarantees that
the function won't be evaluated *more than once per base table row*.
So now I think this is all fine.  If we're in a serial plan, a
volatile filter condition will get evaluated as little as once per
row, but maybe as many times as some nested loop around that scan
executes.  In a parallel plan, it's the same thing, except that the
number of loops might be based on the number of parallel workers
rather than the number of rows in some outer table.  That doesn't seem
like an important distinction.
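
To make that concrete with a made-up sketch: if f() is some volatile
function, then in a query like

select count(*) from big b, small s where b.id = s.id and f(s.y);

a serial nested loop with small on the inner side can evaluate f()
against every row of small once for each row of big, while a parallel
hash join evaluates it once per row of small in each worker.  Both
are "more than once per base table row", and neither violates the
guarantee.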

This patch expands on and subsumes the patch I posted on the Parallel
Append thread.  It therefore also does what was discussed over there:
pulls up Gather on top of any Append nodes generated by inheritance
expansion, which is a good idea for all the reasons already discussed
on that thread.  Also as discussed on that thread, the cost model here
is still not particularly smart and probably needs improvement in a
variety of ways.  Amit Kapila and others at EnterpriseDB will be
looking further into that, and I hope for more feedback from other
interested parties as well, but I don't think this patch needs to fix
everything that's wrong with parallel costing itself.  Even
identifying those problems will probably take a fair amount of time
and research, and not having this functionality will not make finding
those problems any easier - probably the opposite.
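
The pulled-up shape is roughly this (a hypothetical sketch for an
inheritance parent with children p1 and p2):

 Gather
   ->  Append
         ->  Parallel Seq Scan on p1
         ->  Parallel Seq Scan on p2

i.e. a single Gather above the Append rather than a separate Gather
over each child.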

The EXPLAIN portion of the patch should perhaps be separated out and
reviewed/committed separately.  I developed the two together because I
found the current EXPLAIN output inadequate for understanding how work
was divided up between the leader and workers.  The EXPLAIN changes
made it possible to figure that out.  More improvements are possible,
but I think this is a big step up over what we have now, so I'm
anxious to press forward with it.  Let me know if it's helpful to
split this part out for separate review; I've left it together for now
both because it makes testing the rest of the patch easier and because
it's 9:30pm on the day before Thanksgiving.  (Happy Thanksgiving, BTW,
if you live someplace where that's a thing, as I do!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

