Thread: [HACKERS] Parameterization of partial path
When looking at try_partial_hashjoin_path and try_partial_nestloop_path functions I'm failing to understand the comment "Parameterized partial paths are not supported.". It's clear to me that GatherPath can't have parameters because repeated execution of parallel plan with adjusted parameters is not supported. But the fact that generic partial path has parameters should not be a problem if those parameters are satisfied below the nearest GatherPath node higher in the plan tree. Do I miss anything of the concept? Actually this check if (!bms_is_subset(inner_paramrels, outer_path->parent->relids)) return; in try_partial_nestloop_path only rejects special case where the inner path references relations outside the join, but still allows the outer path to have parameters outside. As for try_partial_hashjoin_path, the comment "If the inner path is parameterized ..." seems to be copy & pasted from try_partial_nestloop_path, but I think it does not fit hash join. From hash_inner_and_outer I judge that neither side of hash join can be parameterized by the other: /** If either cheapest-total path is parameterized by the other rel, we* can't use a hashjoin. (There's no use looking foralternative* input paths, since these should already be the least-parameterized* available paths.)*/ if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) ||PATH_PARAM_BY_REL(cheapest_total_inner, outerrel))return; Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the same checks that try_hashjoin_path and try_nestloop_path do respectively? -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote: > When looking at try_partial_hashjoin_path and try_partial_nestloop_path > functions I'm failing to understand the comment "Parameterized partial paths > are not supported.". > > It's clear to me that GatherPath can't have parameters because repeated > execution of parallel plan with adjusted parameters is not supported. Actually, I think in theory that is fine. You'd start up and shut down workers for every execution, which is likely to stink in terms of performance, but it should work OK. The only problem is that it'll only work if you pass the values of the parameters down to the worker processes, which the code currently doesn't. Bind parameters sent by the user are passed down, but there's no provision to pass down parameters populated at execution time. Amit has written code for that, though, and it's been posted here. It just hasn't gotten into the tree yet for one reason or another. > But the > fact that generic partial path has parameters should not be a problem if those > parameters are satisfied below the nearest GatherPath node higher in the plan > tree. Do I miss anything of the concept? Yes, I think you're missing a key point. A parameterized path would normally be used for a query like small LJ (big1 IJ big2 ON big1.x = big2.x) ON big1.y = small.y. The plan might look like this: Nested Loop Left Join -> Seq Scan on small -> Nested Loop Join -> Index Scan on big1 -> Index Scan on big2 In essence, we're repeating the join between big1 and big2 for every row in small. That seems like a bad strategy until you realize that each join will scan only a tiny fraction of each of those tables. If you couldn't pass a parameter down from the scan of small to the scans on big1 and big2, you'd end up with a plan that scans one or both of those tables in their entirety. Ouch. Now, this plan can be parallelized just fine. The sequential scan on small can be replaced with a Parallel Seq Scan. That works fine, and the planner is already capable of generating such plans. However, at no point in that do you get a parameterized *partial* path. You generate regular old parameterized paths for big1 and big2 and join then to produce a parameterized path for (big1 big2), and then you join that via a nested loop with the non-parameterized partial path for small, and you get a partial but unparameterized path for (small big1 big2). Then you push a Gather node on top and you're done. Suppose we did generate a partial plan for (big1 big2). It would look like this: Nested Loop Join -> Parallel Index Scan on big1 -> Index Scan on big2 Now, how could you join that in a meaningful way to small so as to come up with anything sensible? You really can't. Consider a plan like this: Gather -> Nested Loop Left Join -> Seq Scan on small -> Nested Loop Join -> Partial Index Scan on big1 -> Index Scan on big2 That's clearly nonsense. The partial index scan is supposed to return a subset of the result rows to each worker, but there's no reason the workers have to be basing their work on the same row from 'small'. Which of their values would the parallel index scan supposedly be using? This isn't a matter of some implementation detail that we're currently missing; such a plan is just inherently nonsensical. > Actually this check > > if (!bms_is_subset(inner_paramrels, outer_path->parent->relids)) > return; > > in try_partial_nestloop_path only rejects special case where the inner path > references relations outside the join, but still allows the outer path to have > parameters outside. Right. We build a partial join path by joining a partial path on the outer side with a non-partial path on the inner side. If the join is a nested loop, the inner side can be parameterized, but all of those parameters have to be provided by the outer side; if not, we get the nonsensical situation illustrated above. > As for try_partial_hashjoin_path, the comment "If the inner path is > parameterized ..." seems to be copy & pasted from try_partial_nestloop_path, > but I think it does not fit hash join. From hash_inner_and_outer I judge that > neither side of hash join can be parameterized by the other: > > /* > * If either cheapest-total path is parameterized by the other rel, we > * can't use a hashjoin. (There's no use looking for alternative > * input paths, since these should already be the least-parameterized > * available paths.) > */ > if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) || > PATH_PARAM_BY_REL(cheapest_total_inner, outerrel)) > return; > > Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the > same checks that try_hashjoin_path and try_nestloop_path do respectively? No, because there is no point in producing parameterized partial paths for the reasons mentioned above. I think you're right that the comment in try_partial_hashjoin_path is not quite right. What it should really be saying is that a hash join between a partial path and a parameterized path is bound to produce a parameterized partial path, and those aren't useful for anything, so we shouldn't create them. Maybe someday somebody will figure out a context in which a partial parameterized path is actually good for something, but until then we shouldn't generate them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 9, 2017 at 11:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote: >> When looking at try_partial_hashjoin_path and try_partial_nestloop_path >> functions I'm failing to understand the comment "Parameterized partial paths >> are not supported.". >> >> It's clear to me that GatherPath can't have parameters because repeated >> execution of parallel plan with adjusted parameters is not supported. > > Actually, I think in theory that is fine. You'd start up and shut > down workers for every execution, which is likely to stink in terms of > performance, but it should work OK. The only problem is that it'll > only work if you pass the values of the parameters down to the worker > processes, which the code currently doesn't. > Another thing is that currently, we use the existing DSM for rescan. If we really want to pass the params during rescan we might need some work so that if we need more memory to pass the params at the time of rescan, then we should be able to do that. Now that we have DSA, I think we can use that to pass the params during rescan if required. However here the important point is that I have not come across any plan (in the benchmarks like TPC-H) where it could be beneficial to pass params during rescan, so not sure it is worth the effort to invent something for that. Now, it is possible that I am missing something, but if someone can present a use case, then I think we can try to support such parallel plans. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote: > > When looking at try_partial_hashjoin_path and try_partial_nestloop_path > > functions I'm failing to understand the comment "Parameterized partial paths > > are not supported.". > > > > It's clear to me that GatherPath can't have parameters because repeated > > execution of parallel plan with adjusted parameters is not supported. > > Actually, I think in theory that is fine. You'd start up and shut > down workers for every execution, which is likely to stink in terms of > performance, but it should work OK. The only problem is that it'll > only work if you pass the values of the parameters down to the worker > processes, which the code currently doesn't. Bind parameters sent by > the user are passed down, but there's no provision to pass down > parameters populated at execution time. Amit has written code for > that, though, and it's been posted here. It just hasn't gotten into > the tree yet for one reason or another. ok, I also thought it's about missing implementation rather than principal issue. > > But the > > fact that generic partial path has parameters should not be a problem if those > > parameters are satisfied below the nearest GatherPath node higher in the plan > > tree. Do I miss anything of the concept? > > Yes, I think you're missing a key point. A parameterized path would > normally be used for a query like small LJ (big1 IJ big2 ON big1.x = > big2.x) ON big1.y = small.y. The plan might look like this: Thanks for detailed explanation. I think the missing point in my thoughts was that the only way to satisfy parameters of a relation (so that Gather does not have to pass any parameters) is to put the relation on the inner side of a join. However the inner side must provide the whole result set, not just part of it, else the result is wrong. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at