Home > mailing lists
Re: [Proposal] Table partition + join pushdown - Mailing list pgsql-hackers

From	Kouhei Kaigai
Subject	Re: [Proposal] Table partition + join pushdown
Date	November 10, 2015 17:59:30
Msg-id	9A28C8860F777E439AA12E8AEA7694F80116E350@BPXM15GP.gisp.nec.co.jp Whole thread Raw
In response to	Re: [Proposal] Table partition + join pushdown (Taiki Kondo <tai-kondo@yk.jp.nec.com>)
Responses	Re: [Proposal] Table partition + join pushdown
List	pgsql-hackers
Tree view
Hi, I put my comments towards the patch as follows.

Overall comments
----------------
* I think the enhancement in copyfuncs.c shall be in the separate patch; it is more graceful manner. At this moment,
hereis less than 20 Path delivered type definition. It is much easier works than entire Plan node support as we did
recently.(How about other folk's opinion?) 

* Can you integrate the attached test cases as regression test? It is more generic way, and allows people to detect
problemsif relevant feature gets troubled in the future updates. 

* Naming of "join pushdown" is a bit misleading because other component also uses this term, but different purpose. I'd
liketo suggest try_pullup_append_across_join. Any ideas from native English speaker? 

Patch review
------------

At try_join_pushdown:
+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }
It checks whether the cheapest_total_path is AppendPath at the head
of this function. It ought to be a loop to walk on the pathlist of
RelOptInfo, because multiple path-nodes might be still alive but
also might not be cheapest_total_path.


+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if one of the path-node
within pathlist of inner_rel are at least supported.


+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to introduce why this feature uses ParamPathInfos here.
It seems to me a good hack to attach additional qualifiers on the
underlying inner scan node, even if it is not a direct child of
inner relation.
However, people may have different opinion.

+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                   RelOptInfo *outer_rel)
+{
+   Index       parent_relid =
+                   find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List        *clauses_parent = get_actual_clauses(join_clauses);
+   List        *clauses_child = NIL;
+   ListCell    *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node    *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child);
+}

Is ChangeVarNodes() right routine to replace var-node of parent relation
by relevant var-node of child relation?
It may look sufficient, however, nobody can ensure varattno of child
relation is identical with parent relation's one.
For example, which attribute number shall be assigned on 'z' here? CREATE TABLE tbl_parent(x int); CREATE TABLE
tbl_child(yint) INHERITS(tbl_parent); ALTER TABLE tbl_parent ADD COLUMN z int; 

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,               /*
            * Ignore child members unless they match the rel being                * sorted. 
+                *
+                * If this is called from make_sort_from_pathkeys(),
+                * relids may be NULL. In this case, we must not ignore child
+                * members because inner/outer plan of pushed-down merge join is
+                * always child table.                */
-               if (em->em_is_child &&
+               if (relids != NULL &&
+                   em->em_is_child &&                   !bms_equal(em->em_relids, relids))                   continue;

It is a little bit hard to understand why this modification is needed.
Could you add source code comment that focus on the reason why.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Wednesday, October 21, 2015 8:07 PM
> To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI
> Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博)
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello, KaiGai-san and Horiguchi-san.
>
> I created v2 patch. Please find attached.
> I believe this patch will fix the most of issues mentioned by
> Horiguchi-san except naming.
>
> In this v2 patch, scan node which is originally inner relation of
> Join node must be SeqScan (or SampleScan). This limitation is
> due to implementation of try_join_pushdown(), which copies Path nodes
> to attach new filtering conditions converted from CHECK() constraints.
>
> It uses copyObject() for this purpose, so I must implement copy functions
> for scan Path nodes like IndexPath, BitmapHeapPath, TidPath and so on.
>
>
> By the way, let me introduce the performance of this feature.
> Here are the results I tested in my environment.
> These results were got by "pushdown_test.v1.large.sql"
> running on the environment that "work_mem" set to "1536kB".
> (This file is also attached in this mail.)
>
> [Normal]
>                                                              QUERY PLAN
> ----------------------------------------------------------------------------
> ---------------------------------------------------------
>  Hash Join  (cost=1851.02..14638.11 rows=300004 width=20) (actual
> time=88.188..453.926 rows=299992 loops=1)
>    Hash Cond: (check_test_div.id = inner_t.id)
>    ->  Append  (cost=0.00..4911.03 rows=300004 width=20) (actual
> time=0.089..133.456 rows=300003 loops=1)
>          ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20)
> (actual time=0.003..0.003 rows=0 loops=1)
>          ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.085..40.741 rows=100001 loops=1)
>          ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.023..29.213 rows=100001 loops=1)
>          ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.021..28.592 rows=100001 loops=1)
>    ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual
> time=87.970..87.970 rows=60001 loops=1)
>          Buckets: 32768  Batches: 2  Memory Usage: 1446kB
>          ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8)
> (actual time=0.030..39.133 rows=60001 loops=1)
>  Planning time: 0.867 ms
>  Execution time: 470.269 ms
> (12 rows)
>
> [With this feature]
>                                                              QUERY PLAN
> ----------------------------------------------------------------------------
> ---------------------------------------------------------
>  Append  (cost=0.01..10651.37 rows=300004 width=20) (actual
> time=55.548..377.615 rows=299992 loops=1)
>    ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual
> time=0.017..0.017 rows=0 loops=1)
>          Hash Cond: (inner_t.id = check_test_div.id)
>          ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never
> executed)
>          ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003
> rows=0 loops=1)
>                Buckets: 1024  Batches: 1  Memory Usage: 8kB
>                ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1
> width=20) (actual time=0.002..0.002 rows=0 loops=1)
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual
> time=55.530..149.205 rows=100001 loops=1)
>          Hash Cond: (check_test_div_0.id = inner_t.id)
>          ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.058..34.268 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual
> time=55.453..55.453 rows=20001 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)
> Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8)
> (actual time=0.031..43.590 rows=20001 loops=1)
>                      Filter: ((id % 3) = 0)
>                      Rows Removed by Filter: 40000
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual
> time=27.942..97.582 rows=99996 loops=1)
>          Hash Cond: (check_test_div_1.id = inner_t.id)
>          ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.030..25.514 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual
> time=27.890..27.890 rows=20000 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)
> Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8)
> (actual time=0.014..21.688 rows=20000 loops=1)
>                      Filter: ((id % 3) = 1)
>                      Rows Removed by Filter: 40001
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual
> time=27.651..97.755 rows=99995 loops=1)
>          Hash Cond: (check_test_div_2.id = inner_t.id)
>          ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001
> width=20) (actual time=0.026..25.620 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual
> time=27.599..27.599 rows=20000 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)
> Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8)
> (actual time=0.017..21.307 rows=20000 loops=1)
>                      Filter: ((id % 3) = 2)
>                      Rows Removed by Filter: 40001
>  Planning time: 1.876 ms
>  Execution time: 394.007 ms
> (33 rows)
>
>
> The value of "Batches" is 2 on Hash node in normal,
> but these values are 1 on all Hash nodes in "with this feature".
>
> This means that the hash table is not split because of this feature.
>
> Therefore, PostgreSQL with this feature is faster than the normal one in this
> case.
> (470.269 ms @ normal vs 394.007 ms @ this feature)
>
> I think this point is large benefit of this feature.
>
>
> Best regards,
> --
> Taiki Kondo
>
> NEC Solution Innovators, Ltd.
>
>
>
>
>
> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
> Sent: Thursday, October 15, 2015 10:21 AM
> To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
> Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> > Sent: Thursday, October 08, 2015 5:28 PM
> > To: Kyotaro HORIGUCHI
> > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hello, Horiguchi-san.
> >
> > Thank you for your comment.
> >
> > > I got some warning on compilation on unused variables and wrong
> > > arguemtn type.
> >
> > OK, I'll fix it.
> >
> > > I failed to have a query that this patch works on. Could you let me
> > > have some specific example for this patch?
> >
> > Please find attached.
> > And also make sure that setting of work_mem is '64kB' (not 64MB).
> >
> > If there is the large work_mem enough to create hash table for
> > relation after appending, its cost may be better than pushed-down
> > plan's cost, then planner will not choose pushed-down plan this patch makes.
> > So, to make this patch working fine, work_mem size must be smaller
> > than the hash table size for relation after appending.
> >
> > > This patch needs more comments. Please put comment about not only
> > > what it does but also the reason and other things for it.
> >
> > OK, I'll put more comments in the code.
> > But it will take a long time, maybe...
> >
> People (including me) can help. Even though your English capability is not enough,
> it is significant to put intention of the code.
>
> > > -- about namings
> > >
> > > Names for functions and variables are needed to be more appropriate,
> > > in other words, needed to be what properly informs what they are.
> > > The followings are the examples of such names.
> >
> > Thank you for your suggestion.
> >
> > I also think these names are not good much.
> > I'll try to make names better , but it maybe take a long time...
> > Of course, I will use your suggestion as reference.
> >
> > > "added_restrictlist"'s widely distributed as many function arguemnts
> > > and JoinPathExtraData makes me feel dizzy..
> >
> > "added_restrictinfo" will be deleted from almost functions other than
> > try_join_pushdown() in next (v2) patch because the place of filtering
> > using this info will be changed from Join node to Scan node and not
> > have to place it into other than try_join_pushdown().
> >
> This restrictinfo intends to filter out obviously unrelated rows in this join,
> due to the check constraint of other side of the join.
> So, correct but redundant name is:
>   restrictlist_to_drop_unrelated_rows_because_of_check_constraint
>
> How about 'restrictlist_by_constraint' instead?
>
> > > In make_restrictinfos_from_check_constr, the function returns
> > > modified constraint predicates correspond to vars under hashjoinable
> > > join clauses. I don't think expression_tree_mutator is suitable to
> > > do that since it could allow unwanted result when constraint
> > > predicates or join clauses are not simple OpExpr's.
> >
> > Do you have any example of this situation?
> > I am trying to find unwanted results you mentioned, but I don't have
> > any results in this timing. I have a hunch that it will allow unwanted
> > results because I have thought only about very simple situation for
> > this function.
> >
> check_constraint_mutator makes modified restrictlist with relacing Var node only
> when join clause is hash-joinable.
> It implies <expr> = <expr> form, thus we can safely replace the expression by
> the other side.
>
> Of course, we still have cases we cannot replace expressions simply.
> - If function (or function called by operators) has volatile attribute
>   (who use volatile function on CHECK constraint of partitioning?)
> - If it is uncertain whether expression returns always same result.
>   (is it possible to contain SubLink in the constraint?)
>
> I'd like to suggest to use white-list approach in this mutator routine.
> It means that only immutable expression node are allowed to include the modified
> restrictlist.
>
> Things to do is:
>
> check_constraint_mutator(...)
> {
>   if (node == NULL)
>     return NULL;
>   if (IsA(node, Var))
>   {
>      :
>   }
>   else if (node is not obviously immutable)
>   {
>     context->is_mutated = false;  <-- prohibit to make if expression
>   }                                   contains uncertain node.
>   return expression_tree_mutator(...)
> }
>
>
> > > Otherwise could you give me clear explanation on what it does?
> >
> > This function transfers CHECK() constraint to filter expression by
> > following procedures.
> > (1) Get outer table's CHECK() constraint by using get_relation_constraints().
> > (2) Walk through expression tree got in (1) by using expression_tree_mutator()
> >     with check_constraint_mutator() and change only outer's Var node to
> >     inner's one according to join clause.
> >
> > For example, when CHECK() constraint of table A is "num % 4 = 0" and
> > join clause between table A and B is "A.num = B.data", then we can get
> > "B.data % 4 = 0" for filtering purpose.
> >
> > This also accepts more complex join clause like "A.num = B.data * 2",
> > then we can get "(B.data * 2) % 4 = 0".
> >
> > In procedure (2), to decide whether to use each join clause for
> > changing Var node or not, I implement check_constraint_mutator() to
> > judge whether join clause is hash-joinable or not.
> >
> > Actually, I want to judge whether OpExpr as top expression tree of
> > join clause means "=" or not, but I can't find how to do it.
> >
> > If you know how to do it, please let me know.
> >
> >
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project KaiGai Kohei
> <kaigai@ak.jp.nec.com>
>
>
> > -----Original Message-----
> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> > Sent: Tuesday, October 06, 2015 8:35 PM
> > To: tai-kondo@yk.jp.nec.com
> > Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com;
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hello.
> >
> > I tried to read this and had some random comments on this.
> >
> > -- general
> >
> > I got some warning on compilation on unused variables and wrong arguemtn type.
> >
> > I failed to have a query that this patch works on. Could you let me
> > have some specific example for this patch?
> >
> > This patch needs more comments. Please put comment about not only what
> > it does but also the reason and other things for it.
> >
> >
> > -- about namings
> >
> > Names for functions and variables are needed to be more appropriate,
> > in other words, needed to be what properly informs what they are. The
> > followings are the examples of such names.
> >
> > "added_restrictlist"'s widely distributed as many function arguemnts
> > and JoinPathExtraData makes me feel dizzy.. create_mergejoin_path
> > takes it as "filtering_clauses", which looks far better.
> >
> > try_join_pushdown() is also the name with much wider meaning. This
> > patch tries to move hashjoins on inheritance parent to under append
> > paths. It could be generically called 'pushdown'
> > but this would be better be called such like 'transform appended
> > hashjoin' or 'hashjoin distribution'. The latter would be better.
> > (The function name would be try_distribute_hashjoin for the
> > case.)
> >
> > The name make_restrictinfos_from_check_contr() also tells me wrong
> > thing. For example,
> > extract_constraints_for_hashjoin_distribution() would inform me about
> > what it returns.
> >
> >
> > -- about what make_restrictinfos_from_check_constr() does
> >
> > In make_restrictinfos_from_check_constr, the function returns modified
> > constraint predicates correspond to vars under hashjoinable join
> > clauses. I don't think expression_tree_mutator is suitable to do that
> > since it could allow unwanted result when constraint predicates or join clauses
> are not simple OpExpr's.
> >
> > Could you try more simple and straight-forward way to do that?
> > Otherwise could you give me clear explanation on what it does?
> >
> > regards,
> >
> > --
> > Kyotaro Horiguchi
> > NTT Open Source Software Center
> >
> >
> >
pgsql-hackers by date:
From: Adam Brightwell
Date: 10 November 2015, 17:20:53
Subject: Re: bootstrap pg_shseclabel in relcache initialization
From: Tom Lane
Date: 10 November 2015, 18:07:59
Subject: Re: Per-table log_autovacuum_min_duration is actually documented
Re: [Proposal] Table partition + join pushdown - Mailing list pgsql-hackers

Previous

Next