Thread: [HACKERS] Parameterization of partial path

[HACKERS] Parameterization of partial path

From
Antonin Houska
Date:
When looking at try_partial_hashjoin_path and try_partial_nestloop_path
functions I'm failing to understand the comment "Parameterized partial paths
are not supported.".

It's clear to me that GatherPath can't have parameters because repeated
execution of parallel plan with adjusted parameters is not supported. But the
fact that generic partial path has parameters should not be a problem if those
parameters are satisfied below the nearest GatherPath node higher in the plan
tree. Do I miss anything of the concept?

Actually this check

if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))  return;

in try_partial_nestloop_path only rejects special case where the inner path
references relations outside the join, but still allows the outer path to have
parameters outside.

As for try_partial_hashjoin_path, the comment "If the inner path is
parameterized ..." seems to be copy & pasted from try_partial_nestloop_path,
but I think it does not fit hash join. From hash_inner_and_outer I judge that
neither side of hash join can be parameterized by the other:

/** If either cheapest-total path is parameterized by the other rel, we* can't use a hashjoin.  (There's no use looking
foralternative* input paths, since these should already be the least-parameterized* available paths.)*/ 
if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) ||PATH_PARAM_BY_REL(cheapest_total_inner, outerrel))return;

Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the
same checks that try_hashjoin_path and try_nestloop_path do respectively?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Parameterization of partial path

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote:
> When looking at try_partial_hashjoin_path and try_partial_nestloop_path
> functions I'm failing to understand the comment "Parameterized partial paths
> are not supported.".
>
> It's clear to me that GatherPath can't have parameters because repeated
> execution of parallel plan with adjusted parameters is not supported.

Actually, I think in theory that is fine.  You'd start up and shut
down workers for every execution, which is likely to stink in terms of
performance, but it should work OK.  The only problem is that it'll
only work if you pass the values of the parameters down to the worker
processes, which the code currently doesn't. Bind parameters sent by
the user are passed down, but there's no provision to pass down
parameters populated at execution time.  Amit has written code for
that, though, and it's been posted here.  It just hasn't gotten into
the tree yet for one reason or another.

> But the
> fact that generic partial path has parameters should not be a problem if those
> parameters are satisfied below the nearest GatherPath node higher in the plan
> tree. Do I miss anything of the concept?

Yes, I think you're missing a key point.  A parameterized path would
normally be used for a query like small LJ (big1 IJ big2 ON big1.x =
big2.x) ON big1.y = small.y.  The plan might look like this:

Nested Loop Left Join
-> Seq Scan on small
-> Nested Loop Join -> Index Scan on big1 -> Index Scan on big2

In essence, we're repeating the join between big1 and big2 for every
row in small.  That seems like a bad strategy until you realize that
each join will scan only a tiny fraction of each of those tables.  If
you couldn't pass a parameter down from the scan of small to the scans
on big1 and big2, you'd end up with a plan that scans one or both of
those tables in their entirety.  Ouch.

Now, this plan can be parallelized just fine.  The sequential scan on
small can be replaced with a Parallel Seq Scan.  That works fine, and
the planner is already capable of generating such plans.  However, at
no point in that do you get a parameterized *partial* path.  You
generate regular old parameterized paths for big1 and big2 and join
then to produce a parameterized path for (big1 big2), and then you
join that via a nested loop with the non-parameterized partial path
for small, and you get a partial but unparameterized path for (small
big1 big2).  Then you push a Gather node on top and you're done.

Suppose we did generate a partial plan for (big1 big2).  It would look
like this:

Nested Loop Join
-> Parallel Index Scan on big1
-> Index Scan on big2

Now, how could you join that in a meaningful way to small so as to
come up with anything sensible?   You really can't.  Consider a plan
like this:

Gather
-> Nested Loop Left Join -> Seq Scan on small -> Nested Loop Join   -> Partial Index Scan on big1   -> Index Scan on
big2

That's clearly nonsense.  The partial index scan is supposed to return
a subset of the result rows to each worker, but there's no reason the
workers have to be basing their work on the same row from 'small'.
Which of their values would the parallel index scan supposedly be
using?  This isn't a matter of some implementation detail that we're
currently missing; such a plan is just inherently nonsensical.

> Actually this check
>
> if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
>    return;
>
> in try_partial_nestloop_path only rejects special case where the inner path
> references relations outside the join, but still allows the outer path to have
> parameters outside.

Right.  We build a partial join path by joining a partial path on the
outer side with a non-partial path on the inner side.  If the join is
a nested loop, the inner side can be parameterized, but all of those
parameters have to be provided by the outer side; if not, we get the
nonsensical situation illustrated above.

> As for try_partial_hashjoin_path, the comment "If the inner path is
> parameterized ..." seems to be copy & pasted from try_partial_nestloop_path,
> but I think it does not fit hash join. From hash_inner_and_outer I judge that
> neither side of hash join can be parameterized by the other:
>
> /*
>  * If either cheapest-total path is parameterized by the other rel, we
>  * can't use a hashjoin.  (There's no use looking for alternative
>  * input paths, since these should already be the least-parameterized
>  * available paths.)
>  */
> if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) ||
>         PATH_PARAM_BY_REL(cheapest_total_inner, outerrel))
>         return;
>
> Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the
> same checks that try_hashjoin_path and try_nestloop_path do respectively?

No, because there is no point in producing parameterized partial paths
for the reasons mentioned above.  I think you're right that the
comment in try_partial_hashjoin_path is not quite right.  What it
should really be saying is that a hash join between a partial path and
a parameterized path is bound to produce a parameterized partial path,
and those aren't useful for anything, so we shouldn't create them.

Maybe someday somebody will figure out a context in which a partial
parameterized path is actually good for something, but until then we
shouldn't generate them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parameterization of partial path

From
Amit Kapila
Date:
On Thu, Feb 9, 2017 at 11:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote:
>> When looking at try_partial_hashjoin_path and try_partial_nestloop_path
>> functions I'm failing to understand the comment "Parameterized partial paths
>> are not supported.".
>>
>> It's clear to me that GatherPath can't have parameters because repeated
>> execution of parallel plan with adjusted parameters is not supported.
>
> Actually, I think in theory that is fine.  You'd start up and shut
> down workers for every execution, which is likely to stink in terms of
> performance, but it should work OK.  The only problem is that it'll
> only work if you pass the values of the parameters down to the worker
> processes, which the code currently doesn't.
>

Another thing is that currently, we use the existing DSM for rescan.
If we really want to pass the params during rescan we might need some
work so that if we need more memory to pass the params at the time of
rescan, then we should be able to do that.  Now that we have DSA, I
think we can use that to pass the params during rescan if required.
However here the important point is that I have not come across any
plan (in the benchmarks like TPC-H) where it could be beneficial to
pass params during rescan, so not sure it is worth the effort to
invent something for that.  Now, it is possible that I am missing
something, but if someone can present a use case, then I think we can
try to support such parallel plans.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parameterization of partial path

From
Antonin Houska
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah@cybertec.at> wrote:
> > When looking at try_partial_hashjoin_path and try_partial_nestloop_path
> > functions I'm failing to understand the comment "Parameterized partial paths
> > are not supported.".
> >
> > It's clear to me that GatherPath can't have parameters because repeated
> > execution of parallel plan with adjusted parameters is not supported.
>
> Actually, I think in theory that is fine.  You'd start up and shut
> down workers for every execution, which is likely to stink in terms of
> performance, but it should work OK.  The only problem is that it'll
> only work if you pass the values of the parameters down to the worker
> processes, which the code currently doesn't. Bind parameters sent by
> the user are passed down, but there's no provision to pass down
> parameters populated at execution time.  Amit has written code for
> that, though, and it's been posted here.  It just hasn't gotten into
> the tree yet for one reason or another.

ok, I also thought it's about missing implementation rather than principal
issue.

> > But the
> > fact that generic partial path has parameters should not be a problem if those
> > parameters are satisfied below the nearest GatherPath node higher in the plan
> > tree. Do I miss anything of the concept?
>
> Yes, I think you're missing a key point.  A parameterized path would
> normally be used for a query like small LJ (big1 IJ big2 ON big1.x =
> big2.x) ON big1.y = small.y.  The plan might look like this:

Thanks for detailed explanation. I think the missing point in my thoughts was
that the only way to satisfy parameters of a relation (so that Gather does not
have to pass any parameters) is to put the relation on the inner side of a
join. However the inner side must provide the whole result set, not just part
of it, else the result is wrong.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at