UNION ALL on partitioned tables won't use indices. - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject UNION ALL on partitioned tables won't use indices.
Date
Msg-id 20131024.193953.233464126.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
Responses Re: UNION ALL on partitioned tables won't use indices.  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hello, Sorry that it's been a while..

1. Observed symptom

As you know, UNION ALL accompanied with ORDER BY uses indexes if
available.

> uniontest=# EXPLAIN SELECT * FROM c11 UNION ALL SELECT * FROM c12 ORDER BY a;
>                                QUERY PLAN                                
> ---------------------------------------------------------------------------
>  Merge Append  (cost=0.59..10214.10 rows=200000 width=16)
>    Sort Key: c11.a
>    ->  Index Scan using ... on c11  (cost=0.29..3857.04 rows=100000 width=16)
>    ->  Index Scan using ... on c12  (cost=0.29..3857.04 rows=100000 width=16)

So do simple queries on partitioned (inheritance) tables.

> uniontest=# EXPLAIN SELECT * FROM p1 ORDER BY a;
>                                 QUERY PLAN                                
> ------------------------------------------------------------------------------
>  Merge Append  (cost=0.73..11392.19 rows=200001 width=16)
>    Sort Key: p1.a
>    ->  Index Scan using ... on p1  (cost=0.12..8.14 rows=1 width=44)
>    ->  Index Scan using ... on c11  (cost=0.29..3857.04 rows=100000 width=16)
>    ->  Index Scan using ... on c12  (cost=0.29..3857.04 rows=100000 width=16)

Nevertheless, UNION ALL on partitioned tables doesn't. This is
quite unfavourable behavior especially having LIMIT.

>uniontest=# EXPLAIN ANALYZE SELECT * FROM p1
>            UNION ALL SELECT * FROM p2 ORDER BY a LIMIT 10;
>                              QUERY PLAN                       
>------------------------------------------------------------------------
> Limit   (actual time=182.732..182.735 rows=10 loops=1)
>   ->  Sort  (actual time=182.729..182.730 rows=10 loops=1)
>     Sort Key: p1.a
>     Sort Method: top-N heapsort  Memory: 25kB
>     ->  Append  (actual time=0.029..108.109 rows=400000 loops=1)
>       ->  Seq Scan on p1  (actual time=0.001..0.001 rows=0 loops=1)
>       ->  Seq Scan on c11  (actual time=0.027..19.074 rows=100000 loops=1)
>       ->  Seq Scan on c12  (actual time=0.014..16.653 rows=100000 loops=1)
>       ->  Seq Scan on p2  (actual time=0.000..0.000 rows=0 loops=1)
>       ->  Seq Scan on c21  (actual time=0.012..16.677 rows=100000 loops=1)
>       ->  Seq Scan on c22  (actual time=0.012..16.794 rows=100000 loops=1)
> Total runtime: 182.857 ms


2. The cause

In grouping_planner, flattern_simple_union_all creates
appendrelinfos for each subqueries then expand_inherited_tables
furthur expands the parent tables in each subqueries. Finally,
this sequence leaves following structure. Where rte[2] and [3]
are subqueries abandoned on the way pulling up rte[4] and [5].

rte[1] Subquery "SELECT*1", inh = 1  +- appinfo[0] - rte[4] Relation "p1", inh = 1  |               +- appinfo[2] -
rte[6] Relation "p1", inh = 0  |               +- appinfo[3] - rte[7]  Relation "c11", inh = 0  |               +-
appinfo[4]- rte[8]  Relation "c12", inh = 0  +- appinfo[1] - rte[5] Relation "p2", inh = 1                  +-
appinfo[5]- rte[9]  Relation "p1", inh = 0                  +- appinfo[6] - rte[10] Relation "c11", inh = 0
    +- appinfo[7] - rte[11] Relation "c12", inh = 0
 

On the other hand, ec member finally has exprs only for varno =
1, 4 and 5, in other words, it lacks the members for the most
descendant RTEs.  This is because add_child_rel_equivalences()
always inhibits add_eq_member from registering the child_rel's
relid on EC. Conequently these grandchild relations does not find
index pathkeys for themselves.


3. Measures

I could thought up three approaches for the problem.

One is to simplly modifying here to give child flag in the
parameters of add_eq_member accordig to whether the child_rel is
inh or not. This seems to work fine although leaves two levels of
MergeAppends (PATCH-1). So the additional patch is attached to
collapse these MergeAppends (PATCH-2). This gives the same plan
as PATCH-3.

> uniontest=# explain analyze select * from p1 union all
>                             select * from p2 order by a limit 10;
>                                QUERY PLAN                       
> ------------------------------------------------------------------------
>  Limit  (actual time=0.208..0.224 rows=10 loops=1)
>    ->  Merge Append  (actual time=0.205..0.218 rows=10 loops=1)
>      Sort Key: p1.a
>      ->  Merge Append  (actual time=0.110..0.120 rows=10 loops=1)
>        Sort Key: p1.a
>        ->  Index Scan .. on p1  (actual time=0.006..0.006 rows=0 loops=1)
>        ->  Index Scan .. on c11  (actual time=0.054..0.060 rows=10 loops=1)
>        ->  Index Scan .. on c12  (actual time=0.046..0.046 rows=1 loops=1)
>      ->  Merge Append  (actual time=0.093..0.093 rows=1 loops=1)
>        Sort Key: p2.a
>        ->  Index Scan .. on p2  (actual time=0.002..0.002 rows=0 loops=1)
>        ->  Index Scan .. on c21  (actual time=0.047..0.047 rows=1 loops=1)
>        ->  Index Scan .. on c22  (actual time=0.043..0.043 rows=1 loops=1)
>  Total runtime: 0.448 ms


The second is to collapse the appendrel structure shown above to
have only single level inheritance. This also seems to work
fine. This makes the plan with single level
MergeAppend. (PATCH-3)

> uniontest=# EXPLAIN ANALYZE SELECT * FROM p1 UNION ALL
>             SELECT * FROM p2 ORDER BY a LIMIT 10;
>                                   QUERY PLAN                 
> -------------------------------------------------------------------
>  Limit  (actual time=0.095..0.107 rows=10 loops=1)
>    ->  Merge Append  (actual time=0.092..0.102 rows=10 loops=1)
>      Sort Key: p1.a
>      ->  Index Scan ... on p1 (actual time=0.005..0.005 rows=0 loops=1)
>      ->  Index Scan ... on c11  (actual time=0.023..0.030 rows=10 loops=1)
>      ->  Index Scan ... on c12  (actual time=0.020..0.020 rows=1 lops=1)
>      ->  Index Scan ... on p2  (actual time=0.001..0.001 rows=0 loops=1)
>      ->  Index Scan ... on c21  (actual time=0.019..0.019 rows=1 loops=1)
>      ->  Index Scan ... on c22  (actual time=0.019..0.019 rows=1 loops=1)
>  Total runtime: 0.241 ms



The last one is most strait-forward in some sense, and conversely
is most ad hoc measure. Modifying expand_inherited_rtentry() to
pull up adding appendrels if needed, using the similar manner to
PATCH-3. This is added as PATCH-4.


what do you think about this?


Four patches following are attached to this message.

1. union_inh_idx_typ1_v1_20131024.patch     .. PATCH-1 described above.
2. union_inh_idx_typ1_add_v1_20131024.patch .. PATCH-2 described above.
3. union_inh_idx_typ2_v1_20131024.patch     .. PATCH-3 described above.
4. union_inh_idx_typ3_v1_20131024.patch     .. PATCH-4 described above.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 711b161..734ed47 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1940,6 +1940,7 @@ add_child_rel_equivalences(PlannerInfo *root,                Expr       *child_expr;
 Relids        new_relids;                Relids        new_nullable_relids;
 
+                bool        has_children = false;                child_expr = (Expr *)
adjust_appendrel_attrs(root,
@@ -1969,9 +1970,15 @@ add_child_rel_equivalences(PlannerInfo *root,
     child_rel->relids);                }
 
+                /*
+                 * Does this child_rel have children? If and only if so, tell
+                 * add_eq_member to register new_relids to cur_ec.
+                 */
+                has_children =
+                    root->simple_rte_array[child_rel->relid]->inh;                (void) add_eq_member(cur_ec,
child_expr,                                    new_relids, new_nullable_relids,
 
-                                     true, cur_em->em_datatype);
+                                     !has_children, cur_em->em_datatype);            }        }    }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 64b1705..0e3cf4b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -944,7 +944,8 @@ create_merge_append_path(PlannerInfo *root,    MergeAppendPath *pathnode =
makeNode(MergeAppendPath);   Cost        input_startup_cost;    Cost        input_total_cost;
 
-    ListCell   *l;
+    ListCell   *l, *lm;
+    bool        collapse_subpaths = false;    pathnode->path.pathtype = T_MergeAppend;    pathnode->path.parent =
rel;
@@ -953,6 +954,47 @@ create_merge_append_path(PlannerInfo *root,    pathnode->path.pathkeys = pathkeys;
pathnode->subpaths= subpaths;
 
+
+    /*
+     * If subpaths containes MergeAppendPaths already ordered on the pathkeys
+     * of the creating node, they can be expanded onto this node.
+     */
+    foreach (lm, subpaths)
+    {
+        Path *spath = (Path *) lfirst(lm);
+
+        if (IsA(spath, MergeAppendPath) &&
+            pathkeys_contained_in(pathkeys, spath->pathkeys))
+        {
+            collapse_subpaths = true;
+            break;
+        }
+    }
+
+    if (collapse_subpaths)
+    {
+        subpaths = NIL;
+
+        foreach (lm, pathnode->subpaths)
+        {
+            MergeAppendPath *mpath = (MergeAppendPath *) lfirst(lm);
+            ListCell *lcm;
+            
+            if (IsA(mpath, MergeAppendPath) &&
+                pathkeys_contained_in(pathkeys, mpath->path.pathkeys))
+            {
+                foreach (lcm, mpath->subpaths)
+                {
+                    Path *smpath = (Path*) lfirst (lcm);
+                    subpaths = lappend(subpaths, smpath);
+                }
+            }
+            else
+                subpaths = lappend(subpaths, subpaths);
+        }
+        pathnode->subpaths = subpaths;
+    }
+    /*     * Apply query-wide LIMIT if known and path is for sole base relation.     * (Handling this at this low
levelis a bit klugy.) 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d8aa35d..8167583 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -404,6 +404,15 @@ subquery_planner(PlannerGlobal *glob, Query *parse,    expand_inherited_tables(root);    /*
+     * Collapse multilevel inheritances into fewer levels.
+     *
+     * UNION ALL containing subselects on inherited tables leaves multilevel
+     * inheritance after calling expand_inherited_tables().  Fold them in
+     * order to make MergeAppend plan available for such queries.
+     */
+    collapse_appendrels(root);
+
+    /*     * Set hasHavingQual to remember if HAVING clause is present.  Needed     * because preprocess_expression
willreduce a constant-true condition to     * an empty qual list ... but "HAVING TRUE" is not a semantic no-op.
 
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e249628..c7933b5 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -57,6 +57,11 @@ typedef struct    AppendRelInfo *appinfo;} adjust_appendrel_attrs_context;
+typedef struct {
+    AppendRelInfo    *child_appinfo;
+    Index              target_rti;
+} transvars_merge_context;
+static Plan *recurse_set_operations(Node *setOp, PlannerInfo *root,                       double tuple_fraction,
               List *colTypes, List *colCollations,
 
@@ -98,6 +103,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,                      List
*input_plans,                     List *refnames_tlist);static List *generate_setop_grouplist(SetOperationStmt *op,
List*targetlist);
 
+static Node *transvars_merge_mutator(Node *node,
+                                     transvars_merge_context *ctx);static void expand_inherited_rtentry(PlannerInfo
*root,RangeTblEntry *rte,                         Index rti);static void make_inh_translation_list(Relation
oldrelation,
@@ -1211,6 +1218,131 @@ expand_inherited_tables(PlannerInfo *root)}/*
+ * collapse_appendrels
+ *      Pulling up the descendents by recursively combining successive
+ *      translations.
+ *
+ *      Although the same thing could be done in adjust_appendrel_attrs(),
+ *      doing it here all through at once is more efficient than individually
+ *      (and maybe repetitively) done there.
+ */
+void
+collapse_appendrels(PlannerInfo *root)
+{
+    Index               last_parent_relid = 1;
+    AppendRelInfo    *last_parent = NULL;
+    ListCell         *lcchild;
+    ListCell         *lcparent;
+    bool             full_search = false;
+
+    foreach(lcchild, root->append_rel_list)
+    {
+        AppendRelInfo    *ch_appinfo         = (AppendRelInfo *)lfirst(lcchild);
+        Index             ch_parent_relid = ch_appinfo->parent_relid;
+        
+        if (last_parent_relid != ch_parent_relid)
+        {
+            /*
+             * Find the parent for the current child appinfo if parent relid
+             * is different from that of preveous child.
+             */
+            do
+            {
+                /*
+                 * For most cases, the relations are referred to as the parent
+                 * in their apperarence order in rtable from
+                 * append_rel_list. So start searching for the parent appinfos
+                 * from just after the last parent. If the incremental search
+                 * was failed, retry searching the entire list and exit on
+                 * failure.
+                 */
+                if (!last_parent)
+                {
+                    /*
+                     * If this is the first time or the preveous search was
+                     * failed, set up for full search.
+                     */
+                    lcparent = list_head(root->append_rel_list);
+                    last_parent_relid = 1;
+                    full_search = true;
+                }
+
+                last_parent = NULL;
+                for_each_cell(lcparent, lcparent)
+                {
+                    /*
+                     * Children's and parents' apprelinfos are bonded via
+                     * rtable relations. So two apprelinfos are in
+                     * parent-child relationship when the child_relid of the
+                     * parent is equal to the parent_relid of the child.
+                     */
+                    AppendRelInfo *papp = (AppendRelInfo*)lfirst(lcparent);
+                    if (papp->child_relid == ch_parent_relid)
+                    {
+                        last_parent = papp;
+                        last_parent_relid = ch_parent_relid;
+                        
+                        /* Search from the next cell next time. */
+                        lcparent = lnext(lcparent);
+                        full_search = false;
+                        break;        /* Parent found */
+                    }
+                }
+                /* Retry only when incremental search was failed. */
+            } while (!full_search && !last_parent);
+        }
+
+        /*
+         * Replace child translated_vars so as to be a direct children of the
+         * topmost relation.
+         */
+        if (last_parent)
+        {
+            transvars_merge_context ctx;
+
+            ctx.child_appinfo = ch_appinfo;
+            ctx.target_rti  = last_parent->child_relid;
+
+            ch_appinfo->translated_vars =
+                (List*)expression_tree_mutator(
+                    (Node*)last_parent->translated_vars,
+                    transvars_merge_mutator, &ctx);
+            ch_appinfo->parent_relid = last_parent->parent_relid;
+            ch_appinfo->parent_reltype = last_parent->parent_reltype;
+        }
+    }
+}
+
+/*
+ * Replace Var nodes with corresponding child nodes.
+ */
+static Node *
+transvars_merge_mutator(Node *node, transvars_merge_context *ctx)
+{
+    if (node == NULL)
+        return NULL;
+
+    if (IsA(node, Var))
+    {
+        Var *oldv = (Var*)node;
+
+        if (!oldv->varlevelsup && oldv->varno == ctx->target_rti)
+        {
+            if (oldv->varattno >
+                list_length(ctx->child_appinfo->translated_vars))
+                elog(ERROR,
+                     "attribute %d of relation \"%s\" does not exist",
+                     oldv->varattno,
+                     get_rel_name(ctx->child_appinfo->parent_reloid));
+            return (Node*)copyObject(list_nth(ctx->child_appinfo->translated_vars,
+                                              oldv->varattno - 1));
+        }
+    }
+    return expression_tree_mutator(node,
+                                   transvars_merge_mutator, (void*)ctx);
+}
+
+/* * expand_inherited_rtentry *        Check whether a rangetable entry represents an inheritance set. *        If so,
addentries for all the child tables to the query's
 
diff --git a/src/include/optimizer/prep.h b/src/include/optimizer/prep.h
index 0934e63..7a83938 100644
--- a/src/include/optimizer/prep.h
+++ b/src/include/optimizer/prep.h
@@ -49,6 +49,7 @@ extern Plan *plan_set_operations(PlannerInfo *root, double tuple_fraction,                    List
**sortClauses);externvoid expand_inherited_tables(PlannerInfo *root);
 
+extern void collapse_appendrels(PlannerInfo *root);extern Node *adjust_appendrel_attrs(PlannerInfo *root, Node *node,
                    AppendRelInfo *appinfo); 
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e249628..3cba2a1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -57,6 +57,11 @@ typedef struct    AppendRelInfo *appinfo;} adjust_appendrel_attrs_context;
+typedef struct {
+    AppendRelInfo    *child_appinfo;
+    Index             target_rti;
+} transvars_merge_context;
+static Plan *recurse_set_operations(Node *setOp, PlannerInfo *root,                       double tuple_fraction,
               List *colTypes, List *colCollations,
 
@@ -98,6 +103,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,                      List
*input_plans,                     List *refnames_tlist);static List *generate_setop_grouplist(SetOperationStmt *op,
List*targetlist);
 
+static Node *transvars_merge_mutator(Node *node,
+                                     transvars_merge_context *ctx);static void expand_inherited_rtentry(PlannerInfo
*root,RangeTblEntry *rte,                         Index rti);static void make_inh_translation_list(Relation
oldrelation,
@@ -1210,6 +1217,34 @@ expand_inherited_tables(PlannerInfo *root)    }}
+static Node *
+transvars_merge_mutator(Node *node, transvars_merge_context *ctx)
+{
+    if (node == NULL)
+        return NULL;
+
+    if (IsA(node, Var))
+    {
+        Var *oldv = (Var*)node;
+
+        if (!oldv->varlevelsup && oldv->varno == ctx->target_rti)
+        {
+            if (oldv->varattno >
+                    list_length(ctx->child_appinfo->translated_vars))
+                elog(ERROR,
+                     "attribute %d of relation \"%s\" does not exist",
+                     oldv->varattno,
+                     get_rel_name(ctx->child_appinfo->parent_reloid));
+            
+            return (Node*)copyObject(
+                list_nth(ctx->child_appinfo->translated_vars,
+                         oldv->varattno - 1));
+        }
+    }
+    return expression_tree_mutator(node,
+                                   transvars_merge_mutator, (void*)ctx);
+}
+/* * expand_inherited_rtentry *        Check whether a rangetable entry represents an inheritance set.
@@ -1237,6 +1272,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)    List
*inhOIDs;   List       *appinfos;    ListCell   *l;
 
+    AppendRelInfo *parent_appinfo = NULL;    /* Does RT entry allow inheritance? */    if (!rte->inh)
@@ -1301,6 +1337,21 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
oldrc->isParent= true;    /*
 
+     * If parent relation is appearing in a subselect of UNION ALL, it has
+     * furthur parent appendrelinfo. Save it to pull up inheritance children
+     * later.
+     */
+    foreach(l, root->append_rel_list)
+    {
+        AppendRelInfo *appinfo = (AppendRelInfo *)lfirst(l);
+        if(appinfo->child_relid == rti)
+        {
+            parent_appinfo = appinfo;
+            break;
+        }
+    }
+    
+    /*     * Must open the parent relation to examine its tupdesc.  We need not lock     * it; we assume the rewriter
alreadydid.     */
 
@@ -1378,6 +1429,28 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)        }        /*
+         * Pull up this appinfo onto just above of the parent. The parent
+         * relation has its own parent when it appears as a subquery of UNION
+         * ALL. Pulling up these children gives a chance to consider
+         * MergeAppend on whole the UNION ALL tree.
+         */
+
+        if (parent_appinfo)
+        {
+            transvars_merge_context ctx;
+
+            ctx.child_appinfo = appinfo;
+            ctx.target_rti      = rti;
+
+            appinfo->parent_relid = parent_appinfo->parent_relid;
+            appinfo->parent_reltype = parent_appinfo->parent_reltype;
+            appinfo->parent_reloid = parent_appinfo->parent_reloid;
+            appinfo->translated_vars = (List*)expression_tree_mutator(
+                parent_appinfo->translated_vars,
+                transvars_merge_mutator, &ctx);
+        }
+
+        /*         * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.         */        if (oldrc)

pgsql-hackers by date:

Previous
From: Haribabu kommi
Date:
Subject: Ident context leak during reloading of conf files when no ident information is present in the file
Next
From: Thom Brown
Date:
Subject: Re: 9.4 regression