Re: Parallelize correlated subqueries that execute within each worker - Mailing list pgsql-hackers
From | Antonin Houska |
---|---|
Subject | Re: Parallelize correlated subqueries that execute within each worker |
Date | |
Msg-id | 20283.1675701655@antos Whole thread Raw |
In response to | Re: Parallelize correlated subqueries that execute within each worker (James Coleman <jtc331@gmail.com>) |
Responses |
Re: Parallelize correlated subqueries that execute within each worker
|
List | pgsql-hackers |
James Coleman <jtc331@gmail.com> wrote: > On Sat, Jan 21, 2023 at 10:07 PM James Coleman <jtc331@gmail.com> wrote: > > ... > > While working through Tomas's comment about a conditional in the > > max_parallel_hazard_waker being guaranteed true I realized that in the > > current version of the patch the safe_param_ids tracking in > > is_parallel_safe isn't actually needed any longer. That seemed > > suspicious, and so I started digging, and I found out that in the > > current approach all of the tests pass with only the changes in > > clauses.c. I don't believe that the other changes aren't needed; > > rather I believe there isn't yet a test case exercising them, but I > > realize that means I can't prove they're needed. I spent some time > > poking at this, but at least with my current level of imagination I > > haven't stumbled across a query that would exercise these checks. > > I played with this a good bit more yesterday, I'm now a good bit more > confident this is correct. I've cleaned up the patch; see attached for > v7. > > Here's some of my thought process: > The comments in src/include/nodes/pathnodes.h:2953 tell us that > PARAM_EXEC params are used to pass values around from one plan node to > another in the following ways: > 1. Values down into subqueries (for outer references in subqueries) > 2. Up out of subqueries (for the results of a subplan) > 3. From a NestLoop plan node into its inner relation (when the inner > scan is parameterized with values from the outer relation) > > Case (2) is already known to be safe (we currently add these params to > safe_param_ids in max_parallel_hazard_walker when we encounter a > SubPlan node). > > I also believe case (3) is already handled. We don't build partial > paths for joins when joinrel->lateral_relids is non-empty, and join > order calculations already require that parameterization here go the > correct way (i.e., inner depends on outer rather than the other way > around). > > That leaves us with only case (1) to consider in this patch. Another > way of saying this is that this is really the only thing the > safe_param_ids tracking is guarding against. For params passed down > into subqueries we can further distinguish between init plans and > "regular" subplans. We already know that params from init plans are > safe (at the right level). So we're concerned here with a way to know > if the params passed to subplans are safe. We already track required > rels in ParamPathInfo, so it's fairly simple to do this test. > > Which this patch we do in fact now see (as expected) rels with > non-empty lateral_relids showing up in generate_[useful_]gather_paths. > And the partial paths can now have non-empty required outer rels. I'm > not able to come up with a plan that would actually be caught by those > checks; I theorize that because of the few places we actually call > generate_[useful_]gather_paths we are in practice already excluding > those, but for now I've left these as a conditional rather than an > assertion because it seems like the kind of guard we'd want to ensure > those methods are safe. Maybe we can later (in separate patches) relax the restrictions imposed on partial path creation a little bit, so that more parameterized partial paths are created. One particular case that should be rejected by your checks is a partial index path, which can be parameterized, but I couldn't construct a query that makes your checks fire. Maybe the reason is that a parameterized index path is mostly used on the inner side of a NL join, however no partial path can be used there. (The problem is that each worker evaluating the NL join would only see a subset of the inner relation, which whould lead to incorrect results.) So I'd also choose conditions rather than assert statements. Following are my (minor) findings: In generate_gather_paths() you added this test /* * Delay gather path creation until the level in the join tree where all * params used in a worker are generated within that worker. */ if (!bms_is_subset(required_outer, rel->relids)) return; but I'm not sure if required_outer can contain anything of rel->relids. How about using bms_is_empty(required) outer, or even this? if (required_outer) return; Similarly, /* We can't pass params to workers. */ if (!bms_is_subset(PATH_REQ_OUTER(cheapest_partial_path), rel->relids)) might look like if (!bms_is_empty(PATH_REQ_OUTER(cheapest_partial_path))) or if (PATH_REQ_OUTER(cheapest_partial_path)) In particular, build_index_paths() does the following when setting outer_relids (which eventually becomes (path->param_info->ppi_req_outer): /* Enforce convention that outer_relids is exactly NULL if empty */ if (bms_is_empty(outer_relids)) outer_relids = NULL; Another question is whether in this call simple_gather_path = (Path *) create_gather_path(root, rel, cheapest_partial_path, rel->reltarget, required_outer, rowsp); required_outer should be passed to create_gather_path(). Shouldn't it rather be PATH_REQ_OUTER(cheapest_partial_path) that you test just above? Again, build_index_paths() initializes outer_relids this way outer_relids = bms_copy(rel->lateral_relids); but then it may add some more relations: /* OK to include this clause */ index_clauses = lappend(index_clauses, iclause); outer_relids = bms_add_members(outer_relids, rinfo->clause_relids); So I think that PATH_REQ_OUTER(cheapest_partial_path) in generate_gather_paths() can eventually contain more relations than required_outer, and therefore it's safer to check the first. Similar comments might apply to generate_useful_gather_paths(). Here I also suggest to move this test /* We can't pass params to workers. */ if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids)) continue; to the top of the loop because it's relatively cheap. -- Antonin Houska Web: https://www.cybertec-postgresql.com
pgsql-hackers by date: