Thread: Parallel Append can break run-time partition pruning
I had a report from the wilds that run-time partition pruning was not working in certain cases. After some investigation and obtaining a mockup of the actual case, I discovered that the problem was down to accumulate_append_subpath() hitting the case where it does not pull up a parallel-aware Append whose first_partial_path index is > 0, i.e. one that also contains non-partial subpaths.

What's actually happening is that the plan is left with a nested Append, and in this particular case the top-level Append only has a single subpath, so the code added in 8edd0e794 (Suppress Append and MergeAppend plan nodes that have a single child) causes the nested Append to be pulled up to become the main Append. This causes run-time pruning to break since we only attach the pruning information to the top-level Append.

The most simplified test case I can find to demonstrate this issue is:

create table list (a int, b int) partition by list(a);
create table list_12 partition of list for values in(1,2) partition by list(a);
create table list_12_1 partition of list_12 for values in(1);
create table list_12_2 partition of list_12 for values in(2);

insert into list select 2,0 from generate_Series(1,1000000) x;
vacuum analyze list;

explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;

-- force the 2nd subnode of the Append to be non-parallel.
alter table list_12_1 set (parallel_workers=0);

explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;

The results of this in master are:

postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
(11 rows)

postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_2 list_2 (actual rows=0 loops=3)
               Filter: ((b > 0) AND (a = $0))
               Rows Removed by Filter: 333333
(12 rows)

Notice that we don't get "(never executed)" for list_12_2 in the 2nd case.

I'm a bit divided on what the correct fix is. If I blame Parallel Append for not trying hard enough to pull up the lower Append in accumulate_append_subpath(), then clearly the parallel append code is to blame. However, perhaps run-time pruning should be tagging on PartitionPruneInfo to more than just top-level Appends. Fixing the latter case, code-wise, is about as simple as removing the "rel->reloptkind == RELOPT_BASEREL &&" line from create_append_plan(). Certainly, if the outer Append hadn't been a single-subpath Append, then we wouldn't have pulled up the lower-level Append, so perhaps we should be run-time pruning lower-level ones too.
What do other people think?

(copying in Robert and Amit K due to their work on Parallel Append, Tom as I seem to remember him complaining about accumulate_append_subpath() at some point, and Amit L because... partitioning...)

David
On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I'm a bit divided on what the correct fix is. If I blame Parallel Append for not trying hard enough to pull up the lower Append in accumulate_append_subpath(), then clearly the parallel append code is to blame.

I spent some time trying to understand how Append parallelism works and I am tempted to agree with you that there might be problems with how accumulate_append_subpath() interacts with parallelism. Maybe it would be better to disregard a non-parallel-aware partial Append if it requires us to fail to flatten a child Append. I have attached a PoC fix to show that. While a nested Append is not really a problem in general, it appears to me that our run-time pruning code is not in a position to work correctly with them, or at least not with how things stand today...

> However, perhaps run-time pruning should be tagging on PartitionPruneInfo to more than just top-level Appends. Fixing the latter case, code-wise, is about as simple as removing the "rel->reloptkind == RELOPT_BASEREL &&" line from create_append_plan(). Certainly, if the outer Append hadn't been a single-subpath Append, then we wouldn't have pulled up the lower-level Append, so perhaps we should be run-time pruning lower-level ones too.

While looking at this, I observed that the PartitionPruneInfo of the top-level Append (the one that later gets thrown out) contains bogus information:

{PARTITIONPRUNEINFO
:prune_infos ((
{PARTITIONEDRELPRUNEINFO
:rtindex 1
:present_parts (b 0)
:nparts 1
:subplan_map 0
:subpart_map 1

One of these should be -1.

{PARTITIONEDRELPRUNEINFO
:rtindex 2
:present_parts (b)
:nparts 2
:subplan_map -1 -1
:subpart_map -1 -1

subplan_map values are not correct, because the subpaths list that would have been passed would not include paths of lower-level partitions, as the flattening didn't occur.

))
:other_subplans (b)
}

I guess the problem is that we let an Append be nested, but don't account for that in how the parent Append's partitioned_rels list is constructed. The top-level Append's partitioned_rels should not have contained the sub-partitioned table's RTI if the latter got its own Append. Maybe if we want to make run-time pruning work with nested Appends, we need to fix how partitioned_rels is gathered.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
> While looking at this, I observed that the PartitionPruneInfo of the top-level Append (the one that later gets thrown out) contains bogus information:
>
> {PARTITIONPRUNEINFO
> :prune_infos ((
> {PARTITIONEDRELPRUNEINFO
> :rtindex 1
> :present_parts (b 0)
> :nparts 1
> :subplan_map 0
> :subpart_map 1
>
> One of these should be -1.
>
> {PARTITIONEDRELPRUNEINFO
> :rtindex 2
> :present_parts (b)
> :nparts 2
> :subplan_map -1 -1
> :subpart_map -1 -1
>
> subplan_map values are not correct, because the subpaths list that would have been passed would not include paths of lower-level partitions, as the flattening didn't occur.

It's not great that we're generating that, but as far as I can see, it's not going to cause any misbehaviour. It'll cause a small slowdown in run-time pruning due to perhaps having to perform an additional call to find_matching_subplans_recurse() during execution. In this case, it'll never find any subnodes that match due to both maps having all -1 element values.

Since f2343653f5, we're not using partitioned_rels for anything else, so we should likely fix this so that we don't add the item to partitioned_rels when we don't pull up the sub-Append. I think we should hold off on fixing that until we decide if any adjustments need to be made to the sub-Append pullup code.

David
On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm a bit divided on what the correct fix is. If I blame Parallel Append for not trying hard enough to pull up the lower Append in accumulate_append_subpath(), then clearly the parallel append code is to blame.
>
> I spent some time trying to understand how Append parallelism works and I am tempted to agree with you that there might be problems with how accumulate_append_subpath() interacts with parallelism. Maybe it would be better to disregard a non-parallel-aware partial Append if it requires us to fail to flatten a child Append. I have attached a PoC fix to show that. While a nested Append is not really a problem in general, it appears to me that our run-time pruning code is not in a position to work correctly with them, or at least not with how things stand today...

Thanks for taking a look at this. I've now looked at this in more detail and below is my understanding of what's going on.

It seems, in this case, that on the following line:

    accumulate_append_subpath(cheapest_partial_path,
                              &partial_subpaths, NULL);

we don't manage to pull up the sub-Append due to passing a NULL pointer for the final special_subpaths argument. This results in just taking the child's Append path verbatim, i.e. a nested Append.

Later, when we do:

    else if (nppath == NULL ||
             (cheapest_partial_path != NULL &&
              cheapest_partial_path->total_cost < nppath->total_cost))
    {
        /* Partial path is cheaper or the only option. */
        Assert(cheapest_partial_path != NULL);
        accumulate_append_subpath(cheapest_partial_path,
                                  &pa_partial_subpaths,
                                  &pa_nonpartial_subpaths);

we do pass a non-NULL special_subpaths argument to allow the sub-Append to be pulled up.

So, now we have 2 paths, one with a nested Append and one with a flattened Append. Both paths have the same cost, but because we call add_partial_path() for the nested Append version first, the logic in add_partial_path() accepts that path. However, in the subsequent call of add_partial_path(), the one for the non-nested Append, that path is rejected due to its total cost being too similar to that of the existing partial path. We just end up keeping the nested Append as the cheapest partial path... That path, since in the example case it only has a single subpath, is pulled up into the main Append by the logic added in 8edd0e794.

I think you've realised this and that's why your PoC patch just rejected the first path when it's unable to do the pullup. We'll get a better path later when we allow mixed partial and non-partial paths. (We'll never fail to do a pullup when calling accumulate_append_subpath() for "nppath", since that's a non-parallel path and accumulate_append_subpath() will always pull Append paths up when they're not parallel aware.)

I wonder if the fix should be more something along the lines of trying to merge things so we only generate a single partial path. That way we wouldn't be at the mercy of the logic in add_partial_path() to accept or reject the path based on the order the paths are added.

David
Hi David,

On Tue, Apr 21, 2020 at 12:03 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > I'm a bit divided on what the correct fix is. If I blame Parallel Append for not trying hard enough to pull up the lower Append in accumulate_append_subpath(), then clearly the parallel append code is to blame.
> >
> > I spent some time trying to understand how Append parallelism works and I am tempted to agree with you that there might be problems with how accumulate_append_subpath() interacts with parallelism. Maybe it would be better to disregard a non-parallel-aware partial Append if it requires us to fail to flatten a child Append. I have attached a PoC fix to show that. While a nested Append is not really a problem in general, it appears to me that our run-time pruning code is not in a position to work correctly with them, or at least not with how things stand today...
>
> Thanks for taking a look at this. I've now looked at this in more detail and below is my understanding of what's going on.
>
> It seems, in this case, that on the following line:
>
>     accumulate_append_subpath(cheapest_partial_path,
>                               &partial_subpaths, NULL);
>
> we don't manage to pull up the sub-Append due to passing a NULL pointer for the final special_subpaths argument. This results in just taking the child's Append path verbatim, i.e. a nested Append.
>
> Later, when we do:
>
>     else if (nppath == NULL ||
>              (cheapest_partial_path != NULL &&
>               cheapest_partial_path->total_cost < nppath->total_cost))
>     {
>         /* Partial path is cheaper or the only option. */
>         Assert(cheapest_partial_path != NULL);
>         accumulate_append_subpath(cheapest_partial_path,
>                                   &pa_partial_subpaths,
>                                   &pa_nonpartial_subpaths);
>
> we do pass a non-NULL special_subpaths argument to allow the sub-Append to be pulled up.
>
> So, now we have 2 paths, one with a nested Append and one with a flattened Append. Both paths have the same cost, but because we call add_partial_path() for the nested Append version first, the logic in add_partial_path() accepts that path. However, in the subsequent call of add_partial_path(), the one for the non-nested Append, that path is rejected due to its total cost being too similar to that of the existing partial path. We just end up keeping the nested Append as the cheapest partial path... That path, since in the example case it only has a single subpath, is pulled up into the main Append by the logic added in 8edd0e794.
>
> I think you've realised this and that's why your PoC patch just rejected the first path when it's unable to do the pullup.

Right.

> We'll get a better path later when we allow mixed partial and non-partial paths.

Yes, but only if parallel-aware Append is allowed (pa_subpaths_valid). So it's possible that the Append may not participate in any parallelism whatsoever if we reject partial Append on failing to fold a child Append, which does somewhat suck.

> (We'll never fail to do a pullup when calling accumulate_append_subpath() for "nppath", since that's a non-parallel path and accumulate_append_subpath() will always pull Append paths up when they're not parallel aware.)
>
> I wonder if the fix should be more something along the lines of trying to merge things so we only generate a single partial path. That way we wouldn't be at the mercy of the logic in add_partial_path() to accept or reject the path based on the order the paths are added.

So as things stand, a parallel-aware partial Append (Parallel Append) path competes with a non-parallel partial Append path on cost grounds. As far as I can see, it's only the latter that can contain among its subpaths a (nested) Append which can be problematic. Given that, our choice between the two types of partial Append paths becomes based on something that is not cost, but is that okay?

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Tue, 21 Apr 2020 at 15:03, David Rowley <dgrowleyml@gmail.com> wrote:
> I wonder if the fix should be more something along the lines of trying to merge things so we only generate a single partial path. That way we wouldn't be at the mercy of the logic in add_partial_path() to accept or reject the path based on the order the paths are added.

I took a shot at doing things this way.

First, I'll recap the problem this is trying to solve: add_paths_to_append_rel() attempts to create two separate partial Append paths. I'll describe both of these below:

Path1: This path is generated regardless of whether Parallel Append is enabled and contains all the cheapest partial paths from each child relation. If parallel append is enabled this will become a Parallel Append. If it's not, then a non-parallel Append will be created containing the list of partial subpaths. Here's an example from select_parallel.out:

                          QUERY PLAN
--------------------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 1
         ->  Partial Aggregate
               ->  Append
                     ->  Parallel Seq Scan on a_star a_star_1
                     ->  Parallel Seq Scan on b_star a_star_2
                     ->  Parallel Seq Scan on c_star a_star_3
                     ->  Parallel Seq Scan on d_star a_star_4
                     ->  Parallel Seq Scan on e_star a_star_5
                     ->  Parallel Seq Scan on f_star a_star_6

Path2: We only ever consider this one when enable_parallel_append == true and the append rel's consider_parallel == true. When this path is generated, it'll always be a Parallel Append. This path may contain a mix of partial paths for child rels and parallel_safe child paths, the latter of which will only be visited by a single worker.

The problem is that path1 does not pull up child Appends when the child Append path contains a mix of partial and parallel safe paths (i.e. a sub-path2, per above). Since we create path2 in addition to path1, the costs come out the same even though path1 couldn't pull up the sub-Append paths. Unfortunately, since the costs are the same, path1 is preferred because it's added first. add_partial_path() just rejects path2 based on it being too similar in cost to the existing path1.

In the attached, I'm trying to solve this by only creating 1 partial Append path in the first place. This path will always try to use the cheapest partial path, or the cheapest parallel safe path, if parallel append is allowed and that path is cheaper than the cheapest partial path.

I believe the attached gives us what we want and additionally, since it should now always pull up the sub-Appends, there's no need to consider adjusting partitioned_rels based on whether the pull-up occurred or not. Those should now be right in all cases. This should also fix the run-time pruning issue too since in my original test case it'll pull up the sub-Append, which means that the code added in 8edd0e794 is no longer going to do anything with it as the top-level Append will never contain just 1 subpath.

I'm reasonably certain that this is correct, but I did find it a bit mind-bending considering all the possible cases, so it could do with some more eyes on it. I've not really done a final polish of the comments yet. I'll do that if the patch is looking promising.

The output of the original test with the attached is as follows:

postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
(11 rows)

postgres=# -- force the 2nd subnode of the Append to be non-parallel.
postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
                             QUERY PLAN
--------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
               Filter: ((b > 0) AND (a = $0))
(11 rows)

Notice we get "(never executed)" in both cases.

David
On Wed, Apr 22, 2020 at 12:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 21 Apr 2020 at 15:03, David Rowley <dgrowleyml@gmail.com> wrote:
> > I wonder if the fix should be more something along the lines of trying to merge things so we only generate a single partial path. That way we wouldn't be at the mercy of the logic in add_partial_path() to accept or reject the path based on the order the paths are added.
>
> In the attached, I'm trying to solve this by only creating 1 partial Append path in the first place. This path will always try to use the cheapest partial path, or the cheapest parallel safe path, if parallel append is allowed and that path is cheaper than the cheapest partial path.
>
> I believe the attached gives us what we want and additionally, since it should now always pull up the sub-Appends, there's no need to consider adjusting partitioned_rels based on whether the pull-up occurred or not. Those should now be right in all cases. This should also fix the run-time pruning issue too since in my original test case it'll pull up the sub-Append, which means that the code added in 8edd0e794 is no longer going to do anything with it as the top-level Append will never contain just 1 subpath.
>
> I'm reasonably certain that this is correct, but I did find it a bit mind-bending considering all the possible cases, so it could do with some more eyes on it. I've not really done a final polish of the comments yet. I'll do that if the patch is looking promising.

Thanks for the patch. It's good to see that unfolded sub-Appends will not occur with the new code structure, or so one hopes. Also, I am finding it somewhat easier to understand how partial Appends get built due to the smaller code footprint after patching.

One thing I remain concerned about is that it appears like we are no longer leaving the choice between parallel and non-parallel Append to the cost machinery, which is currently the case. AFAICS with patched, as long as parallel Append is enabled and allowed, it will be chosen over a non-parallel Append as the partial path.

Regarding the patch, I had been assuming that the "pa" in pa_subpaths_valid stands for "parallel append", so using the variable as is in the new code structure would be misleading. Maybe parallel_subpaths_valid?

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
> One thing I remain concerned about is that it appears like we are no longer leaving the choice between parallel and non-parallel Append to the cost machinery, which is currently the case. AFAICS with patched, as long as parallel Append is enabled and allowed, it will be chosen over a non-parallel Append as the partial path.

Given the same set of paths, when would a non-parallel append be cheaper than a parallel one? I don't see anything in cost_append() that could cause the costs to come out higher for the parallel version. However, I might have misunderstood something. Can you give an example of a case that you think might change?

The cost comparison is still there for the cheapest parallel safe normal path vs the cheapest partial path, so, when each of those paths is allowed, we will still end up with the cheapest paths for each child.

> Regarding the patch, I had been assuming that the "pa" in pa_subpaths_valid stands for "parallel append", so using the variable as is in the new code structure would be misleading. Maybe parallel_subpaths_valid?

Yeah, I had wondered if it would be better to rename it, I'd just not thought too hard on what to call it yet.

David
David Rowley <dgrowleyml@gmail.com> writes:
> Given the same set of paths, when would a non-parallel append be cheaper than a parallel one?

Well, anytime the parallel startup cost is significant, for starters. But maybe we account for that at some other point, like when building the Gather?

            regards, tom lane
On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Given the same set of paths, when would a non-parallel append be cheaper than a parallel one?
>
> Well, anytime the parallel startup cost is significant, for starters. But maybe we account for that at some other point, like when building the Gather?

Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost in any of the Append costing code. Those are only applied when we cost Gather / GatherMerge. At the point Amit and I are talking about, we're only comparing two Append paths. No Gather / GatherMerge in sight yet, so any additional costs from those are not applicable.

If there was some reason that a Parallel Append could come out more expensive, then maybe we could just create a non-parallel Append using the same subpath list and add_partial_path() it. I just don't quite see how that would ever win though. I'm willing to be proven wrong though.

David
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, anytime the parallel startup cost is significant, for starters. But maybe we account for that at some other point, like when building the Gather?

> Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost in any of the Append costing code. Those are only applied when we cost Gather / GatherMerge. At the point Amit and I are talking about, we're only comparing two Append paths. No Gather / GatherMerge in sight yet, so any additional costs from those are not applicable.

Right, so really the costs of partial and non-partial paths are not commensurable, and comparing them directly is just misleading. I trust we're not throwing away non-partial paths on that basis?

            regards, tom lane
On Thu, 23 Apr 2020 at 11:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Well, anytime the parallel startup cost is significant, for starters. But maybe we account for that at some other point, like when building the Gather?
>
> > Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost in any of the Append costing code. Those are only applied when we cost Gather / GatherMerge. At the point Amit and I are talking about, we're only comparing two Append paths. No Gather / GatherMerge in sight yet, so any additional costs from those are not applicable.
>
> Right, so really the costs of partial and non-partial paths are not commensurable, and comparing them directly is just misleading. I trust we're not throwing away non-partial paths on that basis?

There is a case in both master and in the patch where we compare the cost of the cheapest path in the partial_pathlist against the cheapest parallel safe path in the pathlist. However, in this case, the pathlist path will be used in an Append or Parallel Append below a Gather, so those parallel_(setup|tuple)_costs will be applied regardless. The non-parallel Append in this case still requires a Gather since it is still using multiple workers to execute the subpaths, e.g. the plan I posted in [1].

The code comparing the path costs is:

    else if (nppath == NULL ||
             (cheapest_partial_path != NULL &&
              cheapest_partial_path->total_cost < nppath->total_cost))

[1] https://www.postgresql.org/message-id/CAApHDvqcOD3ObPgPAeU+3qyFL_wzE5kmczw70qMAh7qJ-3wuzw@mail.gmail.com
On Wed, Apr 22, 2020 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
> If there was some reason that a Parallel Append could come out more expensive, then maybe we could just create a non-parallel Append using the same subpath list and add_partial_path() it. I just don't quite see how that would ever win though. I'm willing to be proven wrong though.

I think you're talking about the thing that this comment is trying to explain:

    /*
     * Consider a parallel-aware append using a mix of partial and non-partial
     * paths.  (This only makes sense if there's at least one child which has
     * a non-partial path that is substantially cheaper than any partial path;
     * otherwise, we should use the append path added in the previous step.)
     */

Like, suppose there are two relations A and B, and we're appending them. A has no indexes, so we can only choose between a Seq Scan and a Parallel Seq Scan. B has a GIST index that is well-suited to the query, but GIST indexes don't support parallel scans. So we've got three choices:

1. Don't use parallelism at all. Then, we can do a normal Append with a Seq Scan on a and an Index Scan on b. ("If we found unparameterized paths for all children, build an unordered, unparameterized Append path for the rel.")

2. Use parallelism for both relations. Then, we can do a Gather over a Parallel Append (or a regular Append, if Parallel Append is disabled) with a Parallel Seq Scan on a and a Parallel Seq Scan on b. As compared with #1, this should win for a, but it might lose heavily for b, because switching from an index scan to a Seq Scan could be a big loser. ("Consider an append of unordered, unparameterized partial paths. Make it parallel-aware if possible.")

3. Use parallelism for a but not for b. The only way to do this is a Parallel Append, because there's no other way to mix partial and non-partial paths at present. This lets us get the benefit of a Parallel Seq Scan on a while still being able to do a non-parallel GIST Index Scan on b. This has a good chance of being better than #2, but it's fundamentally a costing decision, because if the table is small enough or the query isn't very selective, #2 will actually be faster, just on the raw power of more workers and less random I/O. ("Consider a parallel-aware append using a mix of partial and non-partial paths.")

It seems to me that all three strategies are viable. The third one is much less likely to be used now that we have parallel index scans for btree and parallel bitmap heap scans, but I believe it can be a winner if you have the right case. You want to think about cases where there are parallel plans available for everything in the tree, but at least some of the children have much better non-parallel plans.

Note that for strategy #2 we always prefer Parallel Append to non-Parallel Append on the optimistic assumption that Parallel Append will always be better; we only use regular Append if Parallel Append is disabled. But for strategy #3 there is no such choice to be made: a regular Append would not be valid. If placed under a Gather, it would execute the non-partial paths more than once; if not placed under a Gather, we'd have partial paths without any Gather above them, which is an invalid plan shape. So you can't "just use a regular Append" in case #3.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 23 Apr 2020 at 14:37, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 22, 2020 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > If there was some reason that a Parallel Append could come out more expensive, then maybe we could just create a non-parallel Append using the same subpath list and add_partial_path() it. I just don't quite see how that would ever win though. I'm willing to be proven wrong though.
>
> I think you're talking about the thing that this comment is trying to explain:
>
>     /*
>      * Consider a parallel-aware append using a mix of partial and non-partial
>      * paths.  (This only makes sense if there's at least one child which has
>      * a non-partial path that is substantially cheaper than any partial path;
>      * otherwise, we should use the append path added in the previous step.)
>      */
>
> Like, suppose there are two relations A and B, and we're appending them. A has no indexes, so we can only choose between a Seq Scan and a Parallel Seq Scan. B has a GIST index that is well-suited to the query, but GIST indexes don't support parallel scans. So we've got three choices:
>
> 1. Don't use parallelism at all. Then, we can do a normal Append with a Seq Scan on a and an Index Scan on b. ("If we found unparameterized paths for all children, build an unordered, unparameterized Append path for the rel.")
>
> 2. Use parallelism for both relations. Then, we can do a Gather over a Parallel Append (or a regular Append, if Parallel Append is disabled) with a Parallel Seq Scan on a and a Parallel Seq Scan on b. As compared with #1, this should win for a, but it might lose heavily for b, because switching from an index scan to a Seq Scan could be a big loser. ("Consider an append of unordered, unparameterized partial paths. Make it parallel-aware if possible.")
>
> 3. Use parallelism for a but not for b. The only way to do this is a Parallel Append, because there's no other way to mix partial and non-partial paths at present. This lets us get the benefit of a Parallel Seq Scan on a while still being able to do a non-parallel GIST Index Scan on b. This has a good chance of being better than #2, but it's fundamentally a costing decision, because if the table is small enough or the query isn't very selective, #2 will actually be faster, just on the raw power of more workers and less random I/O. ("Consider a parallel-aware append using a mix of partial and non-partial paths.")

Thanks for those examples. I ran this situation through the code but used a hash index instead of GIST. The 3 settings which give us control over this plan are enable_parallel_append, enable_indexscan, and enable_bitmapscan. enable_bitmapscan must be included since we can still get a parallel bitmap scan with a hash index.

For completeness, I just tried each of the 8 combinations of the GUCs, and I've noted as a comment which of your cases each plan corresponds to. There are 4 cases since #2 works with both parallel and non-parallel Append. Naturally, the aim is that the patched version does not change the behaviour.

-- Test case
create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
insert into listp select x,y from generate_Series(1,2) x, generate_Series(1,1000000) y;
create index on listp2 using hash(b);
vacuum analyze listp;

explain (costs off) select * from listp where b = 1;
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1;
SET enable_indexscan = on;
SET enable_bitmapscan = off;
explain (costs off) select * from listp where b = 1; -- case #3, Mixed scan of parallel and non-parallel paths with a Parallel Append
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1; -- case #2 with Parallel Append
SET enable_indexscan = on;
SET enable_bitmapscan = on;
SET enable_parallel_append = off;
explain (costs off) select * from listp where b = 1;
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1; -- case #2 with non-Parallel Append
SET enable_indexscan = on;
SET enable_bitmapscan = off;
explain (costs off) select * from listp where b = 1; -- case #1, best serial plan
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1;

The results, patched/unpatched, are the same.

David
On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
> Regarding the patch, I had been assuming that the "pa" in pa_subpaths_valid stands for "parallel append", so using the variable as is in the new code structure would be misleading. Maybe parallel_subpaths_valid?

I started making another pass over this patch. I did the renaming you've mentioned plus the other two List variables for the subpaths.

I did notice that I had forgotten to properly disable parallel append when it was disallowed by the rel's consider_parallel flag. For us to have done something wrong, we'd have needed a parallel safe path in the pathlist and also have needed the rel to be consider_parallel == false. It was easy enough to make up a case for that by sticking a parallel restricted function in the target list. I've fixed that issue in the attached and also polished up the comments a little and removed the unused variable that I had forgotten about.

For now, I'd still like to get a bit more confidence that the only noticeable change in the outcome here is that we're now pulling up sub-Appends in all cases. I've read the code a number of times and just can't quite see any room for anything changing. My tests per Robert's case all matched what the previous version did, but I'm still only about 93% on this. Given that I'm aiming to fix a bug in master, v11 and v12 here, I need to get that confidence level up to about the 100% mark.

I've attached the v2 patch.

David
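As a rough illustration of the parallel restricted function trick mentioned above (this sketch is not part of the original email; the function name is invented), reusing the "list" table from the first message:

-- Hypothetical illustration only.  A PARALLEL RESTRICTED function must be
-- evaluated in the leader, so putting one in the target list is a simple way
-- of constructing the consider_parallel == false situation described above
-- while the children still have parallel-safe paths of their own.
create function restricted_fn(int) returns int
    as 'select $1' language sql parallel restricted;

explain (costs off)
select restricted_fn(a) from list where a = (select 1) and b > 0;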
On Thu, Apr 23, 2020 at 7:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
> > Regarding the patch, I had been assuming that the "pa" in pa_subpaths_valid stands for "parallel append", so using the variable as is in the new code structure would be misleading. Maybe parallel_subpaths_valid?
>
> I started making another pass over this patch. I did the renaming you've mentioned plus the other two List variables for the subpaths.
>
> I did notice that I had forgotten to properly disable parallel append when it was disallowed by the rel's consider_parallel flag. For us to have done something wrong, we'd have needed a parallel safe path in the pathlist and also have needed the rel to be consider_parallel == false. It was easy enough to make up a case for that by sticking a parallel restricted function in the target list. I've fixed that issue in the attached and also polished up the comments a little and removed the unused variable that I had forgotten about.

Thanks for updating the patch.

> For now, I'd still like to get a bit more confidence that the only noticeable change in the outcome here is that we're now pulling up sub-Appends in all cases.

Yeah, I would have liked us to have enough confidence to remove the following text in the header comment of accumulate_append_subpath:

    ...  If the latter is
    * NULL, we don't flatten the path at all (unless it contains only partial
    * paths).

> I've read the code a number of times and just can't quite see any room for anything changing. My tests per Robert's case all matched what the previous version did, but I'm still only about 93% on this. Given that I'm aiming to fix a bug in master, v11 and v12 here, I need to get that confidence level up to about the 100% mark.

I think I have managed to convince myself (still < 100% though) that it's not all that bad that we won't be leaving it up to add_partial_path() to decide between a Parallel Append whose subpaths set consists only of partial paths and another whose subpaths set consists of a mix of partial and non-partial paths. That's because we will be building only the latter if that one looks cheaper to begin with. If Parallel Append is disabled, there could only ever be one partial Append path for add_partial_path() to consider, one whose subpaths set consists only of partial paths.

On the patch itself:

+     * We also build a set of paths for each child by trying to use the
+     * cheapest partial path, or the cheapest parallel safe normal path
+     * either when that is cheaper, or if parallel append is not allowed.
      */
-    if (pa_subpaths_valid)
+    if (parallel_subpaths_valid)
     {

In the comment above, ", or parallel append is not allowed" should be "provided parallel append is allowed". Or how about writing it as follows:

    /*
     * Add the child's cheapest partial path, or if parallel append is
     * allowed, its cheapest parallel safe normal path if that one is
     * cheaper, to the partial Append path we are constructing for the
     * parent.
     */

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Thu, 23 Apr 2020 at 22:37, David Rowley <dgrowleyml@gmail.com> wrote:
> For now, I'd still like to get a bit more confidence that the only noticeable change in the outcome here is that we're now pulling up sub-Appends in all cases. I've read the code a number of times and just can't quite see any room for anything changing. My tests per Robert's case all matched what the previous version did, but I'm still only about 93% on this. Given that I'm aiming to fix a bug in master, v11 and v12 here, I need to get that confidence level up to about the 100% mark.

I looked at this again and I don't think what I've got is right.

The situation that I'm concerned about is that we generate a Parallel Append path for a sub-partitioned table and that path has a mix of partial and parallel safe paths. When we perform add_paths_to_append_rel() for the top-level partition, if for some reason we didn't do a Parallel Append, then we could pull up the mixed set of partial and parallel safe paths from the child Append. We can't execute a parallel_safe subpath under a normal Append that's below a Gather, as multiple workers could execute the same parallel safe scan, which won't lead to anything good, probably wrong results at best.

Now, we'll only ever have a Parallel Append for the sub-partitioned table if enable_parallel_append is on and the rel's consider_parallel is switched on too, but if for some reason the top-level partitioned table had consider_parallel set to off, then we could end up pulling up the subpaths from a Parallel Append path, which could contain a mix of partial and parallel safe paths, and we'd then throw those into a non-Parallel Append partial path! The parallel safe path just can't go in one of those, as multiple workers could then execute that path. Only Parallel Append knows how to execute those safely, as it ensures only 1 worker touches it.

Now, looking at the code today, it does seem that a child rel will do parallel only if the parent rel does, but I don't think it's a great idea to assume that will never change. Additionally, it seems fragile to also assume the value of the enable_parallel_append global variable would stay the same between calls too.

I wonder if we could fix this by, when we call add_paths_to_append_rel() for the top-level partitioned table, just recursively getting all the child rels and their children too. accumulate_append_subpath() would then never see Appends or MergeAppends as we'd just be dealing with scan-type Paths. add_paths_to_append_rel() for sub-partitioned tables wouldn't have to do very much then. I'll need to see how that idea would fit in with partition-wise joins. My guess is, not very well. There's also apply_scanjoin_target_to_paths() which calls add_paths_to_append_rel() too.

The only other idea I have so far is to either not generate the partial path using the partial_subpaths list in add_paths_to_append_rel() when pa_subpaths_valid and pa_nonpartial_subpaths != NIL, i.e. only create the first of those partial paths if we're not going to create the 2nd one. Or even just swap the order in which we add_partial_path() them so that add_partial_path() does not reject the path with the flattened Appends.

David
On Sat, Apr 25, 2020 at 7:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I looked at this again and I don't think what I've got is right.

Apart from the considerations which you raised, I think there's also a costing issue here. For instance, suppose we have an Append with three children. Two of them have partial paths which will complete after consuming 1 second of CPU time, and using a partial path is unquestionably the best strategy for those children. The third one has a partial path which will complete after using 10 seconds of CPU time and a non-partial path which will complete after using 8 seconds of CPU time. What strategy is best?

If we pick the non-partial path, the time that the whole thing takes to complete will be limited by the non-partial path, which, since it cannot use parallelism, will take 8 seconds of wall clock time. If we pick the partial path, we have a total of 12 seconds of CPU time which we can finish in 6 wall clock seconds with 2 participants, 4 seconds with 3 participants, or 3 seconds with 4 participants. This is clearly a lot better. Incidentally, the planner knows about the fact that you can't finish an Append faster than you can finish its non-partial participants. See append_nonpartial_cost().

Now, I don't think anything necessarily gets broken here by your patch as written. Planner decisions are made on estimated cost, which is a proxy for wall clock time, not CPU time. Right now, we only prefer the non-partial path if we think it can be executed in less time by ONE worker than the partial path could be executed by ALL workers:

    /*
     * Either we've got only a non-partial path, or we think that
     * a single backend can execute the best non-partial path
     * faster than all the parallel backends working together can
     * execute the best partial path.

So, in the above example, we wouldn't even consider the non-partial path. Even say we just have 2 workers. The partial path will have a cost of 6, which is less than 8, so the non-partial path gets rejected. But as the comment goes on to note:

     * It might make sense to be more aggressive here. Even if
     * the best non-partial path is more expensive than the best
     * partial path, it could still be better to choose the
     * non-partial path if there are several such paths that can
     * be given to different workers. For now, we don't try to
     * figure that out.

This kind of strategy is particularly appealing when the number of Append children is large compared to the number of workers. For instance, if you have an Append with 100 children and you are planning with 4 workers, it's probably a pretty good idea to be very aggressive about picking the path that uses the least resources, which the current algorithm would not do. You're unlikely to end up with idle workers, because you have so many plans to which they can be allocated. However, it's possible: it could be that there's one child which is way bigger than all the others and the really important thing is to get a partial path for that child, so that it can be attacked in parallel, even if that means that overall resource consumption is higher. As the number of children decreases relative to the number of workers, having partial paths becomes increasingly appealing. To take a degenerate case, suppose you have 4 workers but only 2 children. Partial paths should look really appealing, because the alternative is leaving workers unused.

I *think* your patch is based around the idea of merging the turn-the-append-of-partial-paths-into-a-parallel-append case with the consider-a-mix-of-partial-and-nonpartial-paths-for-parallel-append case. That seems possible to do given that the current heuristic is to compare the raw path costs, but I think that it wouldn't work if we wanted to be more aggressive about considering the mixed strategy and letting the costing machinery sort out which way is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert forwarded me a link to an email sent to the -general list, where the reported problem seems to be the same problem that was being discussed here:

https://www.postgresql.org/message-id/flat/d0f6d811-8946-eb9f-68e2-1a8a7f80ff21%40a-kretschmer.de

Going over the last few emails, it seems that David held off from committing the patch because of the lack of confidence in its robustness. With the patch, a sub-partitioned child's partial Append's child paths will *always* be pulled up into the parent's set of partial child paths, thus preventing the nesting of Appends, which the run-time pruning code can't currently cope with. The lack of confidence seems to be due to the fact that always pulling up a sub-Append's child paths into the parent partial Append's child paths *may* cause the latter to behave wrongly under parallelism, and the new code structure will prevent add_partial_path() from being the arbitrator of whether such a path is really the best in terms of cost.

If we can't be confident in that approach, maybe we should consider making the run-time pruning code cope with nested Appends?

--
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, 16 Oct 2020 at 03:01, Amit Langote <amitlangote09@gmail.com> wrote:
>
> Going over the last few emails, it seems that David held off from committing the patch because of the lack of confidence in its robustness. With the patch, a sub-partitioned child's partial Append's child paths will *always* be pulled up into the parent's set of partial child paths, thus preventing the nesting of Appends, which the run-time pruning code can't currently cope with. The lack of confidence seems to be due to the fact that always pulling up a sub-Append's child paths into the parent partial Append's child paths *may* cause the latter to behave wrongly under parallelism, and the new code structure will prevent add_partial_path() from being the arbitrator of whether such a path is really the best in terms of cost.
>
> If we can't be confident in that approach, maybe we should consider making the run-time pruning code cope with nested Appends?

I've been thinking about this and my thoughts are:

There are other cases where we don't pull up sub-MergeAppend nodes anyway, so I think we should just make run-time pruning work for more than just the top-level Append / MergeAppend.

The case I'm thinking about is the code added in 959d00e9dbe for ordered Append scans. It's possible a sub-partitioned table has partitions which don't participate in the same ordering. We need to keep the sub-MergeAppend in those cases... well, at least when there's more than 1 partition remaining after plan-time pruning.

I've attached the patch which, pending a final look, I'm proposing to commit for master only. I just don't quite think this is a bug fix, and certainly, due to the invasiveness of the proposed patch, that means no backpatch.

I fixed all the stuff about the incorrectly set partitioned_rels list. What I ended up with there is making it accumulate_append_subpath's job to also pull up the sub-path's partitioned_rels fields when pulling up a nested Append/MergeAppend. If there's no pullup, then we don't care about the sub-path's partitioned_rels. That's for it to deal with. With that, I think that run-time pruning will only get the RT indexes for partitions that we actually have sub-paths for. That's pretty good, so I added an Assert() to verify that in make_partitionedrel_pruneinfo(). (I hope I don't regret that later.)

This does mean we need to maintain a different partitioned_rels list for each Append path we consider. So there's a number (six) of these now between add_paths_to_append_rel() and generate_orderedappend_paths(). To try to minimise the impact of that, I've changed the field so that instead of being a List of IntLists, it's just a List of Relids. The top-level list just contains a single element until you get a UNION ALL that selects from a partitioned table on each side of the union. Merging sub-path partitioned rel RT indexes into the existing element is now pretty cheap as it just uses bms_add_members() rather than the list_concat we'd have had to have used if it was still a List of IntLists.

After fixing up how partitioned_rels is built, I saw there were no usages of RelOptInfo.partitioned_child_rels, so I got rid of it.

I did another couple of little cleanups and wrote some regression tests to test all this.

Overall I'm fairly happy with this, especially getting rid of a partitioned table related field from RelOptInfo.

David
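To illustrate the ordered Append case David refers to (this sketch is not from the thread; the table names, data and expected plan shape are invented), a sub-partitioned child whose list partitions interleave on the sort column keeps its own Merge Append under the parent's ordered Append; this is exactly the kind of nested node the run-time pruning code now needs to handle:

-- Illustrative sketch only.
create table mlparted (a int) partition by range (a);
create table mlparted1 partition of mlparted for values from (1) to (10);
create table mlparted2 partition of mlparted
    for values from (10) to (20) partition by list (a);
create table mlparted2a partition of mlparted2 for values in (10, 12, 14, 16, 18);
create table mlparted2b partition of mlparted2 for values in (11, 13, 15, 17, 19);
create index on mlparted (a);

insert into mlparted select x % 19 + 1 from generate_series(1, 100000) x;
vacuum analyze mlparted;

-- mlparted's range partitions don't overlap, so an ordered Append can be used
-- at the top level, but mlparted2's list partitions interleave on "a", so its
-- ordered path is a Merge Append, which can't be flattened into the parent's
-- Append without destroying the sort order.  One would expect a plan shaped
-- roughly like:
--   Limit
--     -> Append
--          -> Index Only Scan using mlparted1_a_idx on mlparted1
--          -> Merge Append
--               -> Index Only Scan using mlparted2a_a_idx on mlparted2a
--               -> Index Only Scan using mlparted2b_a_idx on mlparted2b
explain (costs off) select a from mlparted order by a limit 10;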
On Sun, Oct 25, 2020 at 10:06 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 16 Oct 2020 at 03:01, Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Going over the last few emails, it seems that David held off from committing the patch because of the lack of confidence in its robustness. With the patch, a sub-partitioned child's partial Append's child paths will *always* be pulled up into the parent's set of partial child paths, thus preventing the nesting of Appends, which the run-time pruning code can't currently cope with. The lack of confidence seems to be due to the fact that always pulling up a sub-Append's child paths into the parent partial Append's child paths *may* cause the latter to behave wrongly under parallelism, and the new code structure will prevent add_partial_path() from being the arbitrator of whether such a path is really the best in terms of cost.
> >
> > If we can't be confident in that approach, maybe we should consider making the run-time pruning code cope with nested Appends?
>
> I've been thinking about this and my thoughts are:

Thanks for working on this.

> There are other cases where we don't pull up sub-MergeAppend nodes anyway, so I think we should just make run-time pruning work for more than just the top-level Append / MergeAppend.
>
> The case I'm thinking about is the code added in 959d00e9dbe for ordered Append scans. It's possible a sub-partitioned table has partitions which don't participate in the same ordering. We need to keep the sub-MergeAppend in those cases... well, at least when there's more than 1 partition remaining after plan-time pruning.

Ah, I guess that case would also likewise fail to use runtime pruning properly.

> I've attached the patch which, pending a final look, I'm proposing to commit for master only. I just don't quite think this is a bug fix, and certainly, due to the invasiveness of the proposed patch, that means no backpatch.
>
> I fixed all the stuff about the incorrectly set partitioned_rels list. What I ended up with there is making it accumulate_append_subpath's job to also pull up the sub-path's partitioned_rels fields when pulling up a nested Append/MergeAppend. If there's no pullup, then we don't care about the sub-path's partitioned_rels. That's for it to deal with. With that, I think that run-time pruning will only get the RT indexes for partitions that we actually have sub-paths for. That's pretty good, so I added an Assert() to verify that in make_partitionedrel_pruneinfo(). (I hope I don't regret that later.)

So AIUI, the fix here is to make any given Append/MergeAppend's part_prune_info only contain PartitionedRelPruneInfos of the (sub-)partitioned tables whose partitions' subplans are directly in appendplans/mergeplans, such that the partition indexes can be mapped to the subplan indexes. That does imply present_parts must be non-empty, so the Assert seems warranted.

> This does mean we need to maintain a different partitioned_rels list for each Append path we consider. So there's a number (six) of these now between add_paths_to_append_rel() and generate_orderedappend_paths(). To try to minimise the impact of that, I've changed the field so that instead of being a List of IntLists, it's just a List of Relids. The top-level list just contains a single element until you get a UNION ALL that selects from a partitioned table on each side of the union. Merging sub-path partitioned rel RT indexes into the existing element is now pretty cheap as it just uses bms_add_members() rather than the list_concat we'd have had to have used if it was still a List of IntLists.

The refactoring seemed complicated on a first look, but overall looks good.

> After fixing up how partitioned_rels is built, I saw there were no usages of RelOptInfo.partitioned_child_rels, so I got rid of it.

+1

> I did another couple of little cleanups and wrote some regression tests to test all this.
>
> Overall I'm fairly happy with this, especially getting rid of a partitioned table related field from RelOptInfo.

Some comments:

+     * For partitioned tables, we accumulate a list of the partitioned RT
+     * indexes for the subpaths that are directly under this Append.

Maybe: For partitioned tables, accumulate a list of the RT indexes of partitioned tables in the tree whose partitions' subpaths are directly under this Append.

+ * lists for each Append Path that we create as accumulate_append_subpath
+ * sometimes can't flatten sub-Appends into the top-level Append.

How about expanding that reason a little bit as: ...can't flatten sub-Appends into the top-level Append, which prevents the former's partitioned_rels from being pulled into the latter's.

+ * most one element which is a RelIds of the partitioned relations which there

s/RelIds/Relids

+ * are subpaths for.  In this case we just add the RT indexes for the
+ * partitioned tables for the subpath we're pulling up to the single entry in
+ * 'partitioned_rels'.

How about: In this case, we just pull the RT indexes contained in the sub-Append/MergeAppend's partitioned_rels into the single entry of *partitioned_rels, which belongs to the parent Append.

  * relid_subpart_map maps relid of a non-leaf partition to the index
- * in 'partitioned_rels' of that rel (which will also be the index in
- * the returned PartitionedRelPruneInfo list of the info for that
+ * in 'partrelids' of that rel (which will also be the index in the
+ * returned PartitionedRelPruneInfo list of the info for that

...the index in 'partrelids', which in the new code is a bitmap set, sounds a bit odd. Maybe just mention the index in the list of PartitionedRelPruneInfos, as: relid_subpart_map maps relid of a given non-leaf partition in 'partrelids' to the index of its PartitionedRelPruneInfo in the returned list.

+    /*
+     * Ensure there were no stray PartitionedRelPruneInfo generated for
+     * partitioned tables that had no sub-paths for.
+     */
+    Assert(!bms_is_empty(present_parts));

Maybe you meant: ...for partitioned tables for which we had neither leaf subpaths nor sub-PartitionedRelPruneInfos.

+    List       *partitioned_rels;   /* List of Relids for each non-leaf
+                                     * partitioned table in the partition
+                                     * tree. One for each partition hierarchy.
+                                     */
     List       *subpaths;           /* list of component Paths */

How about describing partitioned_rels as follows: List of Relids set containing RT indexes of non-leaf tables for each partition hierarchy whose paths are in 'subpaths'

--
Amit Langote
EDB: http://www.enterprisedb.com
On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
> Some comments:

Thanks for having a look at this.

I've made some adjustments to those comments and pushed.

David
On Mon, Nov 2, 2020 at 9:51 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
> > Some comments:
>
> Thanks for having a look at this.
>
> I've made some adjustments to those comments and pushed.

Thank you.

--
Amit Langote
EDB: http://www.enterprisedb.com
On Mon, Nov 02, 2020 at 01:50:57PM +1300, David Rowley wrote:
> On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
> > Some comments:
>
> Thanks for having a look at this.
>
> I've made some adjustments to those comments and pushed.

Commit a929e17e5 doesn't appear in the v14 release notes, but I wanted to mention that it appears to allow fixing a rowcount mis-estimate for us, which I had reported before:

https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com
https://www.postgresql.org/message-id/20170415002322.GA24216@telsasoft.com
https://www.postgresql.org/message-id/20170524211730.GM31097@telsasoft.com

And others have reported before:

https://www.postgresql.org/message-id/flat/7DF51702-0F6A-4571-80BB-188AAEF260DA@gmail.com
https://www.postgresql.org/message-id/SG2PR01MB29673BE6F7AA24424FDBFF60BC670%40SG2PR01MB2967.apcprd01.prod.exchangelabs.com

For years, our reports have included a generated WHERE clause for each table being queried, to allow each table's partitions to be properly pruned/excluded (not just one table, as happened if we used a single WHERE clause). That worked, but then the planner underestimates the rowcount, since it doesn't realize that the conditions are redundant ("equality classes" do not handle inequality conditions).

In v14, one WHERE clause per table still gives an underestimate; but now multiple WHERE clauses aren't required, because a single WHERE clause excludes partitions from each table, and the rowcount from the elided partitions is excluded from the Append rowcount at plan time.

Thanks for this feature!

-- 
Justin
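For readers unfamiliar with the workaround Justin describes, here is a rough sketch (schema, table names and dates are invented for illustration, not taken from his reports) of the redundant per-table filters such reports used to generate, versus the single filter that he says now suffices:

-- Illustrative only: two tables partitioned the same way, joined on the
-- partition key.
create table fact1 (ts timestamptz, v int) partition by range (ts);
create table fact1_202101 partition of fact1
    for values from ('2021-01-01') to ('2021-02-01');
create table fact1_202102 partition of fact1
    for values from ('2021-02-01') to ('2021-03-01');
create table fact2 (ts timestamptz, v int) partition by range (ts);
create table fact2_202101 partition of fact2
    for values from ('2021-01-01') to ('2021-02-01');
create table fact2_202102 partition of fact2
    for values from ('2021-02-01') to ('2021-03-01');

-- The old workaround: repeat the time filter for each table so that both
-- tables' partitions can be pruned, at the cost of the planner counting the
-- redundant condition's selectivity twice and underestimating the join's
-- row count.
select *
from fact1 join fact2 using (ts)
where fact1.ts >= '2021-01-01' and fact1.ts < '2021-02-01'
  and fact2.ts >= '2021-01-01' and fact2.ts < '2021-02-01';

-- What Justin reports working in v14: a single filter, with the other
-- table's partitions excluded by pruning rather than by a duplicated clause.
select *
from fact1 join fact2 using (ts)
where fact1.ts >= '2021-01-01' and fact1.ts < '2021-02-01';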