Re: parallel joins, and better parallel explain - Mailing list pgsql-hackers

From Paul Ramsey
Subject Re: parallel joins, and better parallel explain
Date
Msg-id CACowWR1u_Wi0i14GTf97JHvz2gcaEgKYf-u_NuZcEmz0F-CAYQ@mail.gmail.com
Whole thread Raw
In response to Re: parallel joins, and better parallel explain  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Dec 2, 2015 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Oops.  The new version I've attached should fix this.

I've been trying to see if parallel join has any effect on PostGIS
spatial join queries, which are commonly CPU bound. (My tests [1] on
simple parallel scan were very positive, though quite limited in that
they only parallelized such a small part of the work).

Like Amit, I found the current patches are missing a change to
src/include/nodes/relation.h, but just adding in "Relids
extra_lateral_rels" to JoinPathExtraData allowed a warning-free build.

The assumptions on parallel code in generally are that setting up
parallel workers is very costly compared to the work to be done, so to
get PostGIS code to parallelize I've been reduced to shoving
parallel_setup_cost very low (1) and parallel_tuple_cost as well.
Otherwise I just end up with ordinary plans. I did redefine all the
relevant functions as "parallel safe" and upped their declared costs
significantly.

I set up a 8000 record spatial table, with a spatial index, and did a
self-join on it.

explain analyze
select a.gid, b.gid from vada a join vada b
on st_intersects(a.geom, b.geom)
where a.gid != b.gid;

With no parallelism, I got this:

set max_parallel_degree = 0;
                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.15..227332.48 rows=1822243 width=8) (actual
 
time=0.377..5528.461 rows=52074 loops=1)  ->  Seq Scan on vada a  (cost=0.00..1209.92 rows=8792 width=1189)
(actual time=0.027..5.004 rows=8792 loops=1)  ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71 rows=1
width=1189) (actual time=0.351..0.622 rows=6 loops=8792)        Index Cond: (a.geom && geom)        Filter: ((a.gid <>
gid)AND _st_intersects(a.geom, geom))        Rows Removed by Filter: 3Planning time: 3.976 msExecution time: 5533.573
ms


With parallelism, I got this:
                                                          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.15..226930.05 rows=1822243 width=8) (actual
 
time=0.840..5462.029 rows=52074 loops=1)  ->  Gather  (cost=0.00..807.49 rows=8792 width=1189) (actual
time=0.335..39.326 rows=8792 loops=1)        Number of Workers: 1        ->  Parallel Seq Scan on vada a
(cost=0.00..806.61rows=5861
 
width=1189) (actual time=0.015..10.167 rows=4396 loops=2)  ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71
rows=1
width=1189) (actual time=0.353..0.609 rows=6 loops=8792)        Index Cond: (a.geom && geom)        Filter: ((a.gid <>
gid)AND _st_intersects(a.geom, geom))        Rows Removed by Filter: 3Planning time: 4.019 msExecution time: 5467.126
ms

Given that it's a CPU-bound process, I was hoping for something closer
to the results of the sequence tests, about 50% time reduction, based
on the two cores in my test machine.

In general either the parallel planner is way too conservative (it
seems), or we need to radically increase the costs of our PostGIS
functions (right now, most "costly" functions are cost 100, but I had
to push costs up into the 100000 range to get parallelism to kick in
sometimes). Some guidelines on cost setting would be useful, something
that says, "this function run against this kind of data is cost level
1, compare the time your function takes on 'standard' data to the
baseline function to get a cost ratio to use in the function
definition"

ATB,

P.



[1]  https://gist.github.com/pramsey/84e7a39d83cccae692f8



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: Another XML build issue
Next
From: Jim Nasby
Date:
Subject: Re: Proposal: custom compression methods