Re: Parallel Append can break run-time partition pruning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Parallel Append can break run-time partition pruning
Date
Msg-id CA+HiwqHjzbSj4MbrBTA34b8Un-rz56f8twhj_Sur1agfduNYMw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Append can break run-time partition pruning  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Parallel Append can break run-time partition pruning
List pgsql-hackers
On Sun, Oct 25, 2020 at 10:06 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 16 Oct 2020 at 03:01, Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Going over the last few emails, it seems that David held off from
> > committing the patch, because of the lack of confidence in its
> > robustness.   With the patch, a sub-partitioned child's partial
> > Append's child paths will *always* be pulled up into the parent's set
> > of partial child paths thus preventing the nesting of Appends, which
> > the run-time pruning code can't currently cope with.  The lack of
> > confidence seems to be due to the fact that always pulling up a
> > sub-Append's child paths into the parent partial Append's child paths
> > *may* cause the latter to behave wrongly under parallelism and the new
> > code structure will prevent add_partial_path() from being the
> > arbitrator of whether such a path is really the best in terms of cost.
> >
> > If we can't be confident in that approach, maybe we should consider
> > making the run-time pruning code cope with nested Appends?
>
> I've been thinking about this and my thoughts are:

Thanks for working on this.

> There are other cases where we don't pullup sub-Merge Append nodes
> anyway, so I think we should just make run-time pruning work for more
> than just the top-level Append/Merge Append.
>
> The case I'm thinking about is the code added in 959d00e9dbe for
> ordered Append scans.  It's possible a sub-partitioned table has
> partitions which don't participate in the same ordering.  We need to
> keep the sub-Merge Append in those cases... well, at least when
> there's more than 1 partition remaining after plan-time pruning.

Ah, I guess that case would also likewise fail to use runtime pruning properly.

> I've attached the patch which, pending a final look, I'm proposing to
> commit for master only.  I just don't quite think this is a bug fix,
> and certainly, due to the invasiveness of the proposed patch, that
> means no backpatch.
>
> I fixed all the stuff about the incorrectly set partitioned_rels list.
> What I ended up with there is making it accumulate_append_subpath's
> job to also pullup the sub-paths partitioned_rels fields when pulling
> up a nested Append/MergeAppend.   If there's no pullup, there then we
> don't care about the sub-path's partitioned_rels.  That's for it to
> deal with.  With that, I think that run-time pruning will only get the
> RT indexes for partitions that we actually have sub-paths for. That's
> pretty good, so I added an Assert() to verify that in
> make_partitionedrel_pruneinfo().  (I hope I don't regret that later)

So AIUI, the fix here is to make any given Append/MergeAppend's
part_prune_info only contain PartitionedRelPruneInfos of the (sub-)
partitioned tables whose partitions' subplans are directly in
appendplan/mergeplans, such that the partition indexes can be mapped
to the subplan indexes.  That does imply present_parts must be
non-empty, so the Assert seems warranted.

> This does mean we need to maintain a different partitioned_rels list
> for each Append path we consider.  So there's a number (six) of these
> now between add_paths_to_append_rel() and
> generate_orderedappend_paths().  To try to minimise the impact of
> that, I've changed the field so that instead of being a List of
> IntLists, it's just a List of Relids.  The top-level list just
> contains a single element until you get a UNION ALL that selects from
> a partitioned table on each side of the union.  Merging sub-path
> partitioned rel RT indexes into the existing element is now pretty
> cheap as it just uses bms_add_members() rather the list_concat we'd
> have had to have used if it was still a List of IntLists.

The refactoring seemed complicated on a first look, but overall looks good.

> After fixing up how partitioned_rels is built, I saw there were no
> usages of RelOptInfo.partitioned_child_rels, so I got rid of it.

+1

> I did another couple of little cleanups and wrote some regression
> tests to test all this.
>
> Overall I'm fairly happy with this, especially getting rid of a
> partitioned table related field from RelOptInfo.

Some comments:

+    * For partitioned tables, we accumulate a list of the partitioned RT
+    * indexes for the subpaths that are directly under this Append.

Maybe:

For partitioned tables, accumulate a list of the RT indexes of
partitioned tables in the tree whose partitions' subpaths are directly
under this Append.

+    * lists for each Append Path that we create as accumulate_append_subpath
+    * sometimes can't flatten sub-Appends into the top-level Append.

How about expanding that reason a little bit as:

...can't flatten sub-Appends into the top-level Append which prevents
the former's partitioned_rels from being pulled into the latter's.

+ * most one element which is a RelIds of the partitioned relations which there

s/RelIds/Relids

+ * are subpaths for.  In this case we just add the RT indexes for the
+ * partitioned tables for the subpath we're pulling up to the single entry in
+ * 'partitioned_rels'.

How about:

In this case, we just pull the RT indexes contained in
sub-Append/MergeAppend's partitioned_rels into the single entry of
*partitioned_rels, which belongs to the parent Append.

         * relid_subpart_map maps relid of a non-leaf partition to the index
-        * in 'partitioned_rels' of that rel (which will also be the index in
-        * the returned PartitionedRelPruneInfo list of the info for that
+        * in 'partrelids' of that rel (which will also be the index in the
+        * returned PartitionedRelPruneInfo list of the info for that

...the index in 'partrelids', which in the new code is a bitmap set,
sounds a bit odd.  Maybe just mention the index in the list of
PartitionedRelPruneInfos as:

relid_subpart_map maps relid of a given non-leaf partition in
'partrelids' to the index of its PartitionedRelPruneInfo in the
returned list.

+       /*
+        * Ensure there were no stray PartitionedRelPruneInfo generated for
+        * partitioned tables that had no sub-paths for.
+        */
+       Assert(!bms_is_empty(present_parts));

Maybe you meant:

...for partitioned tables for which we had neither leaf subpaths nor
sub-PartitionedRelPruneInfos.

+   List       *partitioned_rels;   /* List of Relids for each non-leaf
+                                    * partitioned table in the partition
+                                    * tree. One for each partition hierarchy.
+                                    */
    List       *subpaths;       /* list of component Paths */

How about describing partitioned_rels as follows:

List of Relids set containing RT indexes of non-leaf tables for each
partition hierarchy whose paths are in 'subpaths'

-- 
Amit Langote
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "tsunakawa.takay@fujitsu.com"
Date:
Subject: RE: extension patch of CREATE OR REPLACE TRIGGER
Next
From: Craig Ringer
Date:
Subject: Re: Internal key management system