Re: parallel joins, and better parallel explain - Mailing list pgsql-hackers
From: Paul Ramsey
Subject: Re: parallel joins, and better parallel explain
Msg-id: CACowWR1u_Wi0i14GTf97JHvz2gcaEgKYf-u_NuZcEmz0F-CAYQ@mail.gmail.com
In response to: Re: parallel joins, and better parallel explain (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Wed, Dec 2, 2015 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Oops. The new version I've attached should fix this.

I've been trying to see whether parallel join has any effect on PostGIS spatial join queries, which are commonly CPU bound. (My tests [1] on simple parallel scan were very positive, though quite limited, in that they only parallelized such a small part of the work.)

Like Amit, I found the current patches are missing a change to src/include/nodes/relation.h, but just adding "Relids extra_lateral_rels" to JoinPathExtraData allowed a warning-free build.

The general assumption in the parallel code is that setting up parallel workers is very costly compared to the work to be done, so to get PostGIS code to parallelize I've been reduced to shoving parallel_setup_cost very low (1), and parallel_tuple_cost as well. Otherwise I just end up with ordinary plans. I did redefine all the relevant functions as "parallel safe" and upped their declared costs significantly.

I set up an 8000-record spatial table with a spatial index, and did a self-join on it:
explain analyze
select a.gid, b.gid
from vada a join vada b
on st_intersects(a.geom, b.geom)
where a.gid != b.gid;

With no parallelism (set max_parallel_degree = 0), I got this:

                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.15..227332.48 rows=1822243 width=8) (actual time=0.377..5528.461 rows=52074 loops=1)
   ->  Seq Scan on vada a  (cost=0.00..1209.92 rows=8792 width=1189) (actual time=0.027..5.004 rows=8792 loops=1)
   ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71 rows=1 width=1189) (actual time=0.351..0.622 rows=6 loops=8792)
         Index Cond: (a.geom && geom)
         Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
         Rows Removed by Filter: 3
 Planning time: 3.976 ms
 Execution time: 5533.573 ms

With parallelism, I got this:

                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.15..226930.05 rows=1822243 width=8) (actual time=0.840..5462.029 rows=52074 loops=1)
   ->  Gather  (cost=0.00..807.49 rows=8792 width=1189) (actual time=0.335..39.326 rows=8792 loops=1)
         Number of Workers: 1
         ->  Parallel Seq Scan on vada a  (cost=0.00..806.61 rows=5861 width=1189) (actual time=0.015..10.167 rows=4396 loops=2)
   ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71 rows=1 width=1189) (actual time=0.353..0.609 rows=6 loops=8792)
         Index Cond: (a.geom && geom)
         Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
         Rows Removed by Filter: 3
 Planning time: 4.019 ms
 Execution time: 5467.126 ms

Given that it's a CPU-bound process, I was hoping for something closer to the results of the sequential scan tests, about a 50% time reduction, based on the two cores in my test machine.
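For reference, a sketch of the session setup described above. The exact values (beyond parallel_setup_cost = 1, which the text states) are my guesses, and PARALLEL SAFE is the spelling that shipped in 9.6; the patch under test may have used a different mechanism:

```sql
-- Make parallel workers look nearly free so the planner will try them.
SET parallel_setup_cost = 1;      -- default 1000: startup cost of a worker set
SET parallel_tuple_cost = 0.001;  -- default 0.1: per-tuple cost of worker-to-leader transfer
SET max_parallel_degree = 2;      -- 9.6-devel GUC, later renamed max_parallel_workers_per_gather

-- Mark a PostGIS function parallel-safe and raise its declared cost
-- (illustrative values; the post says costs were pushed up "significantly").
ALTER FUNCTION _st_intersects(geometry, geometry) COST 100000 PARALLEL SAFE;
```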
In general, either the parallel planner is way too conservative (it seems), or we need to radically increase the costs of our PostGIS functions. Right now, most "costly" functions are declared at cost 100, but I had to push costs up into the 100000 range to get parallelism to kick in at all.

Some guidelines on cost setting would be useful: something that says, "this function run against this kind of data is cost level 1; compare the time your function takes on 'standard' data to the baseline function to get a cost ratio to use in the function definition."

ATB,

P.

[1] https://gist.github.com/pramsey/84e7a39d83cccae692f8
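One concrete form the calibration guideline above could take, sketched in psql. The functions, data, and timings here are hypothetical, purely to show the ratio arithmetic:

```sql
-- Hypothetical calibration run: time the expensive function and a
-- near-free baseline over the same rows, then use the ratio as COST.
\timing on
SELECT count(*) FROM vada WHERE _st_intersects(geom, geom);  -- suppose ~5000 ms
SELECT count(*) FROM vada WHERE gid IS NOT NULL;             -- suppose ~5 ms

-- Ratio is ~1000x the baseline, so declare the cost accordingly:
ALTER FUNCTION _st_intersects(geometry, geometry) COST 1000;
```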