From: Robert Haas
Subject: parallel joins, and better parallel explain
Msg-id: CA+TgmoagReuGcfduoZ1iUMrCU3aR+MoRn7X+55DiEkyF32m_8g@mail.gmail.com
List: pgsql-hackers

Attached find a patch that does (mostly) two things. First, it allows the optimizer to generate plans where a Nested Loop or Hash Join appears below a Gather node. This is a big improvement on what we have today, where only a sequential scan can be parallelized; with this patch, entire join problems can be parallelized, as long as they don't need a Merge Join (see below for more on this). Second, it improves the output of EXPLAIN when parallel workers are used. With this patch, EXPLAIN (ANALYZE, VERBOSE) displays not only totals for all workers, as it does currently in master, but also per-worker statistics. Putting these two things together, you can get spiffy stuff like this - thanks to Thom Brown for the query and sample data:

rhaas=# explain (analyze, verbose, costs off) SELECT count(*) FROM contacts NATURAL JOIN countries WHERE continent = 'Africa';
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=602.527..602.527 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=0.243..531.129 rows=1185951 loops=1)
         Number of Workers: 2
         ->  Hash Join (actual time=0.206..396.106 rows=395317 loops=3)
               Hash Cond: (contacts.country = countries.country)
               Worker 0: actual time=0.260..485.785 rows=486461 loops=1
               Worker 1: actual time=0.260..483.459 rows=480065 loops=1
               ->  Parallel Seq Scan on public.contacts (actual time=0.034..143.824 rows=1666667 loops=3)
                     Output: contacts.id, contacts.first_name, contacts.last_name, contacts.age, contacts.country
                     Worker 0: actual time=0.035..176.784 rows=2051492 loops=1
                     Worker 1: actual time=0.038..174.175 rows=2021506 loops=1
               ->  Hash (actual time=0.064..0.064 rows=59 loops=3)
                     Output: countries.country
                     Buckets: 1024  Batches: 1  Memory Usage: 11kB
                     Worker 0: actual time=0.070..0.070 rows=59 loops=1
                     Worker 1: actual time=0.069..0.069 rows=59 loops=1
                     ->  Seq Scan on public.countries (actual time=0.019..0.051 rows=59 loops=3)
                           Output: countries.country
                           Filter: (countries.continent = 'Africa'::text)
                           Rows Removed by Filter: 190
                           Worker 0: actual time=0.025..0.056 rows=59 loops=1
                           Worker 1: actual time=0.025..0.058 rows=59 loops=1
 Planning time: 0.247 ms
 Execution time: 603.285 ms
(25 rows)

The general theory of operation of this patch is that a Parallel Seq Scan can be thought of as a "partial" path - that is, it can be run in multiple workers and will produce part of the results in each worker; when Gather is performed on those results, we get a complete result set. For reasons that should be fairly clear on short reflection, a join between a partial path for one of the two relations and an ordinary path for the other produces a partial path for the result; joining two partial paths would produce wrong answers. Thus, we proceed by generating partial paths for each baserel and joinrel, which can either be gathered to employ parallelism at that level, or used to build partial paths for higher-level joinrels which can then be gathered in turn.
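To make that concrete, here's a sketch - not taken from the patch or from Thom's data, and it assumes a hypothetical third table continents(continent, hemisphere) alongside the schema above - of how a whole join tree can now sit under a single Gather:

rhaas=# explain (costs off) SELECT count(*) FROM contacts NATURAL JOIN countries NATURAL JOIN continents WHERE hemisphere = 'Southern';

One shape the planner can now consider (illustrative only - the actual choice depends on costing) is:

 Aggregate
   ->  Gather
         ->  Hash Join
               ->  Hash Join
                     ->  Parallel Seq Scan on contacts
                     ->  Hash
                           ->  Seq Scan on countries
               ->  Hash
                     ->  Seq Scan on continents

The parallel scan is a partial path, each join of a partial path with an ordinary path yields another partial path, and the single Gather at the top turns the final partial path into a complete result.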
As mentioned above, this patch doesn't try to generate Merge Join paths at present. That could be changed, but the plans would probably not be very good in most cases; the problem is of course that only one of the two paths can be partial. So, if you had a sort on both sides, each worker would have to sort part of one relation (which sounds fine) and all of the other one (which sounds bad). You might conceivably win with a Sort on one side and an Index Scan on the other side, but probably not very often. The Gather at the top of the plan tree is order-destroying, so it doesn't even help for the merge ordering to match the final query ordering. I'll put some code in to try partial Merge Join plans if there is a big hue and cry for it, but personally my feeling is that it would be smarter to forget it for now and write the code once we have some better options on the executor side. See the "parallelism + sorting" thread for some discussion of what kinds of executor nodes would be useful for generating better parallel merge joins. Some of that work might even get done in time for 9.6, which would be nice.

I thought for a while that I might need some code to prevent the generation of parallel plans in some cases that involve volatile functions. For example, consider this query:

select (length(continent) - 3) % 10, count(*)
from contacts natural join countries
where (length(continent) - 3) % 10 =
      substr(timeofday()::text, 23, 1)::integer
group by 1;

Without parallelism, this gets evaluated (on my machine, anyway) as a hash join with countries on the inner side; the volatile but parallel-safe filter condition is applied to the seq scan that feeds the hash join. Thus the decision as to whether each row of the countries table is in or out gets made just once. If this gets run with a parallel hash join, each worker will make its own decision about which rows to include in the hash table, and they probably won't all make the same decisions. This means that the parallel hash join could return a combination of rows that could never be returned by any serial plan. That sounds bad, until you realize that if you disable hash and merge joins and materialization, this can also be run as a nested loop plan, in which case - if you can persuade the optimizer to put the countries table on the inner side of the join - you can get the executor to evaluate the filter condition on countries once for every row in the contacts table, and of course there's nothing at all that will make it give the same answer for each row each time. (A sketch of how to force that plan appears below.)

After mulling it over a bit and studying the documentation, I realized that marking a function "volatile" guarantees that it won't be executed *less than once per base table row*. The use of a parallel plan here doesn't violate that rule. "volatile" never guarantees that the function won't be evaluated *more than once per base table row*. So now I think this is all fine. If we're in a serial plan, a volatile filter condition will get evaluated as little as once per row, but maybe as many times as some nested loop around that scan executes. In a parallel plan, it's the same thing, except that the number of loops might be based on the number of parallel workers rather than the number of rows in some outer table. That doesn't seem like an important distinction.
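If you want to reproduce that serial nested loop behavior, something along these lines should do it - a sketch rather than tested output, and depending on your statistics you may need further persuasion to get countries onto the inner side:

set enable_hashjoin = off;
set enable_mergejoin = off;
set enable_material = off;
explain (verbose, costs off)
select (length(continent) - 3) % 10, count(*)
from contacts natural join countries
where (length(continent) - 3) % 10 =
      substr(timeofday()::text, 23, 1)::integer
group by 1;

With those settings the filter condition ends up on the inner scan of countries, so it gets re-evaluated on every pass over that relation - once per contacts row - which is exactly the more-than-once-per-row evaluation that makes the parallel case unremarkable.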
This patch expands on and subsumes the patch I posted on the Parallel Append thread. It therefore also does what was discussed over there: it pulls up Gather on top of any Append nodes generated by inheritance expansion, which is a good idea for all the reasons already discussed on that thread. Also as discussed on that thread, the cost model here is still not particularly smart and probably needs improvement in a variety of ways. Amit Kapila and others at EnterpriseDB will be looking further into that, and I hope for more feedback from other interested parties as well, but I don't think this patch needs to fix everything that's wrong with parallel costing by itself. Even identifying those problems will probably take a fair amount of time and research, and not having this functionality will not make finding those problems any easier - probably the opposite.

The EXPLAIN portion of the patch should perhaps be separated out and reviewed/committed separately. I developed the two together because I found the current EXPLAIN output inadequate for understanding how work was divided up between the leader and workers; the EXPLAIN changes made it possible to figure that out. More improvements are possible, but I think this is a big step up over what we have now, so I'm anxious to press forward with it. Let me know if it's helpful to split this part out for separate review; I've left it together for now both because it makes testing the rest of the patch easier and because it's 9:30pm on the day before Thanksgiving. (Happy Thanksgiving, BTW, if you live someplace where that's a thing, as I do!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company