Thread: [Proposal] Table partition + join pushdown
Hi all,

I saw the email about the idea from KaiGai-san[1], and I worked on implementing it.

I have now implemented a part of this idea, so I would like to propose this feature.

The attached patch just shows my concept of this feature. It works fine for EXPLAIN, but it sadly returns wrong results for other operations.


Table partition + join pushdown
===============================

Motivation
----------
To make the join logic work more effectively, it is important to make the relations smaller.

Especially in hash join, it is worthwhile to make the inner relation smaller, because a smaller inner relation can be stored in a smaller hash table. This means that memory usage can be reduced when joining with big tables.

Design
------
This was described in the email from KaiGai-san, so I quote it here...

---- begin quotation ----
Let's assume a table which is partitioned into four portions, and the individual child relations have a constraint on the hash value of the ID field.

  tbl_parent
   + tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
   + tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
   + tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
   + tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tries to join another relation with tbl_parent using an equivalence condition, like X = tbl_parent.ID, we know that inner tuples which do not satisfy the condition
  hash_func(X) % 4 = 0
can never be joined to the tuples in tbl_child_0. So we can omit loading these tuples into the inner hash table beforehand, which potentially allows us to split the inner hash table.

The current typical plan structure is below:

  HashJoin
    -> Append
      -> SeqScan on tbl_child_0
      -> SeqScan on tbl_child_1
      -> SeqScan on tbl_child_2
      -> SeqScan on tbl_child_3
    -> Hash
      -> SeqScan on other_table

It may be rewritable to:

  Append
    -> HashJoin
      -> SeqScan on tbl_child_0
      -> Hash ... Filter: hash_func(X) % 4 = 0
        -> SeqScan on other_table
    -> HashJoin
      -> SeqScan on tbl_child_1
      -> Hash ... Filter: hash_func(X) % 4 = 1
        -> SeqScan on other_table
    -> HashJoin
      -> SeqScan on tbl_child_2
      -> Hash ... Filter: hash_func(X) % 4 = 2
        -> SeqScan on other_table
    -> HashJoin
      -> SeqScan on tbl_child_3
      -> Hash ... Filter: hash_func(X) % 4 = 3
        -> SeqScan on other_table
---- end quotation ----

In the quotation above, the filter is set at the Hash node. But I implemented it so that the filter is set at the SeqScan node under the Hash node, because in my opinion, filtering tuples is the scanner's job.

  Append
    -> HashJoin
      -> SeqScan on tbl_child_0
      -> Hash
        -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
    -> HashJoin
      -> SeqScan on tbl_child_1
      -> Hash
        -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
    -> HashJoin
      -> SeqScan on tbl_child_2
      -> Hash
        -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
    -> HashJoin
      -> SeqScan on tbl_child_3
      -> Hash
        -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
There are 3 new internal (static) functions implementing this feature. try_hashjoin_pushdown(), the main function of this feature, is called from try_hashjoin_path() and tries to push the HashPath down under the AppendPath.

To do so, this function performs the following operations:

1. Check whether this hash join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations, copy the inner path's
   RelOptInfo and make a new SeqScan path from it. Here, get the CHECK()
   constraints from the OUTER path, and convert their Var nodes according
   to the join condition. Also convert the Var nodes in the join condition
   itself.
3. Create new HashPath nodes between each sub-path of the AppendPath and
   the inner path made above.
4. When operations 1 to 3 are done for each sub-path, create a new
   AppendPath whose sub-paths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(), and get_var_nodes_recurse() is called from get_replaced_clause_constr(). These 2 functions support the operations above.
(I may revise this part to use expression_tree_walker() and expression_tree_mutator().)

The attached patch has the following limitations:

o It only works for the hash-join operation.
  (I want to support not only hash join but also other join logic.)
o Join conditions must be the "=" operator on int4 variables.
o The inner path must be a SeqScan.
  (I want to support other path nodes.)
o For now, the planner may not choose this plan, because the estimated
  costs are usually larger than for the original (non-pushdown) plan.

Also, 1 internal (static) function, get_relation_constraints() defined in plancat.c, is changed to global. This function is called from get_replaced_clause_constr() to get the CHECK() constraints.


Usage
-----
To use this feature, create partitioned tables and a small table to join, and run a SELECT joining these tables.

For your convenience, I attach DDL and DML scripts, and also the result of EXPLAIN.


Any comments are welcome. But, first of all, I need your advice on correcting this patch's behavior. At least, I think it has to expand the array of RangeTblEntry and the other arrays defined in PlannerInfo to register new RelOptInfos for the new Path nodes mentioned above. Or is it a better choice to modify the query parser to implement this feature further?

Remarks:
[1] http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.
Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and implementation, even though I have no arguments against its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(), constructs the RelOptInfo of the join-rel between the inner-rel and a subpath of the Append node. It is an entirely wrong implementation.

I can understand that we (may) have no RelOptInfo for the joinrel between tbl_child_0 and other_table when the planner investigates a join path between the Append path and other_table.

> HashJoin
>   -> Append
>     -> SeqScan on tbl_child_0
>     -> SeqScan on tbl_child_1
>     -> SeqScan on tbl_child_2
>     -> SeqScan on tbl_child_3
>   -> Hash
>     -> SeqScan on other_table
>
How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table from
  try_hashjoin_pushdown() or somewhere similar. make_join_rel() internally
  constructs a RelOptInfo for the supplied pair of relations, so the
  relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the join
  logic, which makes it easier to support pushing down other join logic,
  including nested-loop or custom joins.
- It may be an idea to add an extra argument to make_join_rel() to pass
  the expressions to be applied for tuple filtering during construction
  of the inner hash table.

* Why is only SeqScan supported?

I think it is the role of the Hash node to filter out inner tuples obviously unrelated to the join (if the CHECK constraint of the outer relation gives that information), because this join pushdown may be able to support multi-stacked pushdown.

For example, what if the planner considers a path to join this Append path with another relation, and the join clause contains a reference to X?

> Append
>   -> HashJoin
>     -> SeqScan on tbl_child_0
>     -> Hash ... Filter: hash_func(X) % 4 = 0
>       -> SeqScan on other_table
>   -> HashJoin
>     -> SeqScan on tbl_child_1
>     -> Hash ... Filter: hash_func(X) % 4 = 1
>       -> SeqScan on other_table
>   -> HashJoin
>     -> SeqScan on tbl_child_2
>     -> Hash ... Filter: hash_func(X) % 4 = 2
>       -> SeqScan on other_table
>   -> HashJoin
>     -> SeqScan on tbl_child_3
>     -> Hash ... Filter: hash_func(X) % 4 = 3
>       -> SeqScan on other_table
>
It may be a good challenge to consider additional join pushdown even when the subpaths of the Append are HashJoins, not SeqScans, like:

  Append
    -> HashJoin
      -> HashJoin
        -> SeqScan on tbl_child_0
        -> Hash ... Filter: hash_func(X) % 4 = 0
          -> SeqScan on other_table
      -> Hash ... Filter: hash_func(X) % 4 = 0
        -> SeqScan on another_table
    -> HashJoin
      -> HashJoin
        -> SeqScan on tbl_child_1
        -> Hash ... Filter: hash_func(X) % 4 = 1
          -> SeqScan on other_table
      -> Hash ... Filter: hash_func(X) % 4 = 1
        -> SeqScan on another_table
    -> HashJoin
      -> HashJoin
        -> SeqScan on tbl_child_2
        -> Hash ... Filter: hash_func(X) % 4 = 2
          -> SeqScan on other_table
      -> Hash ... Filter: hash_func(X) % 4 = 2
        -> SeqScan on another_table
    -> HashJoin
      -> HashJoin
        -> SeqScan on tbl_child_3
        -> Hash ... Filter: hash_func(X) % 4 = 3
          -> SeqScan on other_table
      -> Hash ... Filter: hash_func(X) % 4 = 3
        -> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only the Hash node can have the filter clauses.

* Way to handle expression nodes

All this patch supports is a CHECK() constraint with an equality operation on the INT4 data type. You can learn about various useful pieces of PostgreSQL infrastructure. For example:

- expression_tree_mutator() is useful to make a copy of an expression
  node with a small modification.
- pull_varnos() is useful to check which relations are referenced by an
  expression node.
- RestrictInfo->can_join is useful to check whether the clause is a
  binary operator or not.

Anyway, reusing existing infrastructure is the best way to build a reliable feature and to keep the implementation simple.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Thursday, August 13, 2015 6:30 PM
> To: pgsql-hackers@postgresql.org
> Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
> Subject: [HACKERS] [Proposal] Table partition + join pushdown
Hello, KaiGai-san.

Thank you for your comments, and sorry for the late response.

The attached patch is completely rewritten from the previous patch[1], following your suggestion[2]. Please find it attached.

This patch contains the following implementation choices, but I cannot determine whether they are correct or wrong.

1. Cost estimation
In this patch, an additional row filter is implemented for hash join, merge join and nested loop. I implemented the cost estimation for this filter by imitating other filter-related code, but I am not sure this implementation is correct.

2. Workaround for a failing assertion in allpaths.c
In standard_join_search(), we expect to have a single rel at the final level. But this expectation is broken by the join pushdown feature, because it searches, at the final level, for combinations not searched by the original standard_join_search(). Therefore, once trying the join pushdown succeeds, the assertion in allpaths.c fails.

So I implemented a workaround that temporarily sets root->join_rel_level to NULL while trying the join pushdown, but I think this implementation may be wrong.

3. Searching pathkeys for merge join
When the join pushdown feature chooses a merge join for the pushed-down join operation, the planner fails to create the merge join node because it is unable to find pathkeys for this merge join. I found this is caused by skipping child tables when finding pathkeys. I expect that this skipping is done to make the planner faster, and I changed the planner not to skip child tables when finding pathkeys for a merge join. But I am not sure this expectation is correct.

Any comments/suggestions are welcome.

Remarks:
[1] http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2] http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.
-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Thursday, September 24, 2015 8:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello, KaiGai-san.
>
> Thank you for your comment, and sorry for late response.
>
> The attached patch is completely rewritten from previous patch[1], at your
> suggestion[2].
> Please find attached.
>
Thanks for your work. Let me introduce the purpose of this work briefly, because the last submission was in August.

His work intends (1) to save resource consumption in table joins right now, and (2) to provide infrastructure for one parallel-join scenario once the Funnel node becomes capable of it.

Even if we construct partitioned tables, we currently cannot use the partitioning to filter out candidate rows of a join. As a result, the hash table may grow larger than necessary, causing an unnecessary nBatch increase. Below is the scenario this project tries to tackle.

When a join takes a partitioned table on one side, we usually need to scan the entire partitioned table unless we can drop particular child tables.

  XXXXJoin cond (x = y)
    -> Append
      -> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
      -> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
      -> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
      -> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
    -> Hash
      -> SeqScan on other_table

However, the CHECK() constraints assigned to the child tables give us a hint about which rows on the other side can never be related to this join. For example, all the rows in other_table to be joined with tbl_child_0 must have a hash_func(y) that is a multiple of 4. We may be able to omit the unrelated rows from the hash table in this case, which eventually allows us to reduce the size of the hash table.
In the case of INNER JOIN, we can rewrite the query execution plan as below:

  Append
    -> HashJoin cond (x = y)
      -> SeqScan on tbl_child_0
      -> Hash ... Filter: hash_func(y) % 4 = 0
        -> SeqScan on other_table
    -> HashJoin cond (x = y)
      -> SeqScan on tbl_child_1
      -> Hash ... Filter: hash_func(y) % 4 = 1
        -> SeqScan on other_table
    -> HashJoin cond (x = y)
      -> SeqScan on tbl_child_2
      -> Hash ... Filter: hash_func(y) % 4 = 2
        -> SeqScan on other_table
    -> HashJoin cond (x = y)
      -> SeqScan on tbl_child_3
      -> Hash ... Filter: hash_func(y) % 4 = 3
        -> SeqScan on other_table

Unrelated rows are filtered out of the hash table beforehand, which allows us to avoid splitting the hash table when its size reaches the work_mem limitation. This join pushdown is valuable for hash join, and for merge join if the MJ takes an unsorted relation and the number of rows to be sorted is a performance factor. Also, once Funnel becomes capable of running Append on background workers, it is helpful for running NestLoop in parallel as well.

How about the opinion from third parties? I'm a bit biased, of course.

OK, below are brief comments on the patch.

* Suppose we focus on only HashJoin in the first version?

This patch also adds support for NestLoop and MergeJoin; however, NestLoop has no valuable scenario without parallel execution capability, and the most valuable scenario for MergeJoin is the reduction of rows prior to the Sort. Once the input rows get sorted, it is less attractive to filter out rows.

* MultiExecHash() once puts the slot on outer_slot, then moves it to inner_slot

This patch adds set_hash_references() to replace the varno in the expressions of Hash->filterqual with OUTER_VAR. Why not INNER_VAR? If the Var nodes were initialized to reference inner_slot, you would not need to re-assign the slot.

I'll try to have a deeper investigation later.

> This patch contains following implementation, but I can't determine this is
> correct or wrong.
>
> 1. Cost estimation
> In this patch, additional row filter is implemented for Hash, Merge join
> and Nested Loop.
> I implemented cost estimation feature for this filter by watching other
> parts of filters, but I am not sure this implementation is correct.
>
@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
      * not all of the quals may get evaluated at each tuple.)
      */
     startup_cost += qp_qual_cost.startup;
+    startup_cost += filter_qual_cost.startup +
+        filter_qual_cost.per_tuple * inner_path_rows;
     cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
     run_cost += cpu_per_tuple * hashjointuples;

It seems to me this is not a fair estimation, because inner_path_rows means the number of rows after filtering, but the filter qual shall be applied to all the inner input rows.

> 2. Workaround for failing assertion at allpaths.c
> In standard_join_search(), we expect to have a single rel at the final level.
> But this expectation is disappointed by join pushdown feature, because this
> will search for the combinations not searched by original
> standard_join_search() at the final level. Therefore, once trying join
> pushdown is succeeded, failing assertion occurs in allpaths.c.
>
> So I implemented workaround by temporary set NULL to root->join_rel_level
> while trying join pushdown, but I think this implementation may be wrong.
>
It was my off-list suggestion. standard_join_search() expects the root of the partitioned tables to appear, but the child tables are out of its scope. Once we try to push the join down under the Append, we need to consider joins between the inner table and every outer child table; however, these should not be visible to the standard_join_search() context. From the standpoint of standard_join_search(), it gets an AppendPath that represents a join of tables A and B, even if A contains 100 children and the join was pushed down beneath the AppendPath. So, setting NULL on root->join_rel_level is a reasonable way to avoid unexpected RelOptInfo additions by build_join_rel().
"to avoid assertion" is one fact, however, intension of the code is avoid pollution of the global data structure. ;-) > 3. Searching pathkeys for Merge Join > When join pushdown feature chooses merge join for pushed-down join operation, > planner fails to create merge join node because it is unable to find pathkeys > for this merge join. I found this is caused by skipping child table in finding > pathkeys. > > I expect that it is for making planner faster, and I implemented that > planner doesn't skip child table in finding pathkeys for merge join. > But I am not sure this expectation is correct. > I like to recommend to omit MergeJoin support at the first version. Thanks, > Any comments/suggestions are welcome. > > > Remarks : > [1] > http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@ > BPXM01GP.gisp.nec.co.jp > [2] > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8011345B > 6@BPXM15GP.gisp.nec.co.jp > > Best regards, > -- > Taiki Kondo > > NEC Solution Innovators, Ltd. > > -----Original Message----- > From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com] > Sent: Tuesday, August 18, 2015 5:47 PM > To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org > Cc: Iwaasa Akio(岩浅 晃郎) > Subject: RE: [Proposal] Table partition + join pushdown > > Hello Kondo-san, > > I briefly checked your patch. Let me put some comments about its design and > implementation, even though I have no arguments towards its concept. :-) > > * Construction of RelOptInfo > > In your patch, try_hashjoin_pushdown() called by try_hashjoin_path() constructs > RelOptInfo of the join-rel between inner-rel and a subpath of Append node. It > is entirely wrong implementation. > > I can understand we (may) have no RelOptInfo for the joinrel between > tbl_child_0 and other_table, when planner investigates a joinpath to process join > Append path with the other_table. 
>
> > HashJoin
> >   -> Append
> >     -> SeqScan on tbl_child_0
> >     -> SeqScan on tbl_child_1
> >     -> SeqScan on tbl_child_2
> >     -> SeqScan on tbl_child_3
> >   -> Hash
> >     -> SeqScan on other_table
>
> How about these alternatives?
>
> - Call make_join_rel() on the pair of tbl_child_X and other_table
>   from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
>   constructs a RelOptInfo for the supplied pair of relations, so the
>   relevant RelOptInfo shall be properly constructed.
> - make_join_rel() also calls add_paths_to_joinrel() for all the
>   join logics, so it makes it easier to support pushing down other join
>   logics, including nested-loop or custom-join.
> - It may be an idea to add an extra argument to make_join_rel() to
>   inform it of the expressions to be applied for tuple filtering on
>   construction of the inner hash table.
>
> * Why only SeqScan is supported
>
> I think it is the role of the Hash node to filter out inner tuples obviously
> unrelated to the join (if the CHECK constraint of the outer relation gives the
> information), because this join-pushdown may be able to support multi-stacked
> pushdown.
>
> For example, what if the planner considers a path to join this Append path with
> another relation, and the join clause contains a reference to X?
>
> > Append
> >   -> HashJoin
> >     -> SeqScan on tbl_child_0
> >     -> Hash ... Filter: hash_func(X) % 4 = 0
> >       -> SeqScan on other_table
> >   -> HashJoin
> >     -> SeqScan on tbl_child_1
> >     -> Hash ... Filter: hash_func(X) % 4 = 1
> >       -> SeqScan on other_table
> >   -> HashJoin
> >     -> SeqScan on tbl_child_2
> >     -> Hash ... Filter: hash_func(X) % 4 = 2
> >       -> SeqScan on other_table
> >   -> HashJoin
> >     -> SeqScan on tbl_child_3
> >     -> Hash ... Filter: hash_func(X) % 4 = 3
> >       -> SeqScan on other_table
>
> It may be a good challenge to consider additional join pushdown, even if the
> subpaths of the Append are HashJoin, not SeqScan, like:
>
> Append
>   -> HashJoin
>     -> HashJoin
>       -> SeqScan on tbl_child_0
>       -> Hash ... Filter: hash_func(X) % 4 = 0
>         -> SeqScan on other_table
>     -> Hash ... Filter: hash_func(X) % 4 = 0
>       -> SeqScan on another_table
>   -> HashJoin
>     -> HashJoin
>       -> SeqScan on tbl_child_1
>       -> Hash ... Filter: hash_func(X) % 4 = 1
>         -> SeqScan on other_table
>     -> Hash ... Filter: hash_func(X) % 4 = 1
>       -> SeqScan on another_table
>   -> HashJoin
>     -> HashJoin
>       -> SeqScan on tbl_child_2
>       -> Hash ... Filter: hash_func(X) % 4 = 2
>         -> SeqScan on other_table
>     -> Hash ... Filter: hash_func(X) % 4 = 2
>       -> SeqScan on another_table
>   -> HashJoin
>     -> HashJoin
>       -> SeqScan on tbl_child_3
>       -> Hash ... Filter: hash_func(X) % 4 = 3
>         -> SeqScan on other_table
>     -> Hash ... Filter: hash_func(X) % 4 = 3
>       -> SeqScan on another_table
>
> In this case, the underlying nodes are not always SeqScan. So only the Hash
> node can have filter clauses.
>
> * Way to handle expression nodes
>
> All this patch supports is a CHECK() constraint with an equality operation on
> the INT4 data type. You can learn about various useful pieces of PostgreSQL
> infrastructure. For example:
>
> - expression_tree_mutator() is useful to make a copy of an expression
>   node with small modifications.
> - pull_varnos() is useful to check which relations are referenced
>   by an expression node.
> - RestrictInfo->can_join is useful to check whether the clause is
>   a binary operator, or not.
>
> Anyway, reuse of the existing infrastructure is the best way to build a
> reliable feature and to keep the implementation simple.
>
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project KaiGai Kohei
> <kaigai@ak.jp.nec.com>
Hello, KaiGai-san

Thank you for your introduction and comments.

> * Suppose we focus on only HashJoin in the first version?
> This patch also adds support for NestLoop and MergeJoin; however, NestLoop
> has no valuable scenario without parallel execution capability, and the
> most valuable scenario for MergeJoin is the reduction of rows prior to Sort.
> Once the input rows get sorted, it is less attractive to filter out rows.

I agree that handling NestLoop doesn't make sense at this stage.
But I think that handling MergeJoin still makes sense.

In my v1 patch, the additional filter is used for qualification at the same place as the join filter, as in NestLoop. That is indeed not useful; I agree with you on this point.

I think, as you also mentioned, that a large factor in the cost estimation for MergeJoin is the Sort under the MergeJoin, so I believe additional filtering at the Sort is a good choice for this situation, in the same way as at the Hash under a HashJoin.

Furthermore, I think the better way is that the additional filtering be "added" to the Scan node under each child (pushed-down) Join node, because then we don't have to implement additional qualification at the Join nodes; we only have to concatenate the original and additional RestrictInfos for filtering.

As a mere idea for realizing this, I think we have to implement copyObject() for Scan path nodes and use ppi_clauses for this purpose.

What is your opinion?

> * MultiExecHash() once puts the slot on outer_slot, then moves it to inner_slot
> This patch adds set_hash_references() to replace the varno in the expression
> of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
> If the Var nodes were initialized to reference inner_slot, you wouldn't need
> to re-assign the slot.

The node under the Hash node is connected as the OUTER node. This implementation probably follows set_dummy_tlist_references(), which is commonly used by Material, Sort, Unique, SetOp, and Hash.
And I faced a problem when implementing EXPLAIN for the additional filter. I implemented it as you mentioned above, and then an error occurred while running EXPLAIN. I think EXPLAIN expects an expression's varno to match the position the underlying node is connected to; i.e. if it is connected as OUTER, the varno must be OUTER_VAR.

> It seems to me it is not a fair estimation because inner_path_rows means the
> number of rows remaining after the filter, but filter_qual shall be applied
> to all the inner input rows.

OK. I'll fix it.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.
> > * Suppose we focus on only HashJoin in the first version?
> > This patch also adds support for NestLoop and MergeJoin; however, NestLoop
> > has no valuable scenario without parallel execution capability, and the
> > most valuable scenario for MergeJoin is reduction of rows prior to Sort.
> > Once input rows get sorted, it is less attractive to filter out rows.
>
> I agree that handling NestLoop doesn't make sense at this stage.
> But I think that handling MergeJoin still makes sense at this stage.
>
> In my v1 patch, I implemented the additional filter so that it is used for
> qualification at the same place as the join filter, the same as NestLoop.
> It is not useful indeed. I agree with you on this point.
>
> I think, as you also mentioned, that a large factor of the cost estimation
> for MergeJoin is the Sort under the MergeJoin, so I believe additional
> filtering at the Sort is a good choice for this situation, in the same way
> as at the Hash under a HashJoin.
>
> Furthermore, I think it is a better way that the additional filtering be
> "added" to the Scan node under each child (pushed-down) Join node, because
> we don't have to implement additional qualification at the Join nodes;
> we only have to simply concatenate the original and additional
> RestrictInfos for filtering.
>
> As a mere idea, for realizing this, I think we have to implement
> copyObject() for Scan path nodes and use ppi_clauses for this purpose.
>
> What is your opinion?
>
You are referring to this part of create_scan_plan(), aren't you?

    /*
     * If this is a parameterized scan, we also need to enforce all the join
     * clauses available from the outer relation(s).
     *
     * For paranoia's sake, don't modify the stored baserestrictinfo list.
     */
    if (best_path->param_info)
        scan_clauses = list_concat(list_copy(scan_clauses),
                                   best_path->param_info->ppi_clauses);

If the inner scan of the join under the Append node has param_info, its
qualifier shall be implicitly attached to the scan node.
So, if it is legal, I'd like to take this approach because it is less
invasive than enhancing the Hash node.

You mentioned copyObject() to make a duplicate of the underlying scan path.
Actually, copyObject() support is not the minimum requirement, because all
you need to do here is a flat copy of the original path node, then set
param_info.
(Be careful to check that the original path is not parameterized.)

ParamPathInfo is declared as below:

    typedef struct ParamPathInfo
    {
        NodeTag     type;
        Relids      ppi_req_outer;  /* rels supplying parameters used by path */
        double      ppi_rows;       /* estimated number of result tuples */
        List       *ppi_clauses;    /* join clauses available from outer rels */
    } ParamPathInfo;

You may need to set the additional filter on ppi_clauses, the number of
rows after filtering on ppi_rows, and NULL on ppi_req_outer.
However, I'm not 100% certain whether NULL is a legal value for
ppi_req_outer. If somebody can comment on that, it would be helpful.

> > * MultiExecHash() once puts the slot on outer_slot then moves it to
> > inner_slot. This patch adds set_hash_references() to replace the varno in
> > the expression of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
> > If the Var nodes were initialized to reference inner_slot, you wouldn't
> > need to re-assign the slot.
>
> The node under the Hash node is connected as the OUTER node. This
> implementation may come from the implementation of
> set_dummy_tlist_references(), commonly used by Material, Sort, Unique,
> SetOp, and Hash.
>
> And I faced a problem when I was implementing EXPLAIN for the additional
> filter. I implemented it the same way you mentioned above, then an error
> occurred when running EXPLAIN. I think EXPLAIN expects the expression's
> varno to match the position the underlying node is connected to; i.e. if
> it is connected to OUTER, the varno must be OUTER_VAR.
>
Ah, OK. It is a trade-off matter, indeed.
> > It seems to me it is not a fair estimation, because inner_path_rows means
> > the number of rows already filtered out, but filter_qual shall be applied
> > to all the inner input rows.
>
> OK. I'll fix it.
>
> Best regards,
> --
> Taiki Kondo
>
> NEC Solution Innovators, Ltd.
>
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Tuesday, September 29, 2015 11:46 AM
> To: Taiki Kondo
> Cc: Akio Iwaasa; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> > Sent: Thursday, September 24, 2015 8:06 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hello, KaiGai-san.
> >
> > Thank you for your comment, and sorry for the late response.
> >
> > The attached patch is completely rewritten from the previous patch[1],
> > following your suggestion[2].
> > Please find it attached.
> >
> Thanks for your work, and let me introduce the purpose of the work briefly,
> because the last submission was in August.
>
> His work intends (1) to save resource consumption on table joins, and
> (2) to provide an infrastructure for one parallel-join scenario once the
> Funnel node becomes capable of it.
>
> Even if we construct partitioned tables, they currently cannot be utilized
> to filter out candidate rows of a join. As a result, the size of the Hash
> table may grow more than necessary, and that causes unnecessary nBatch
> increases.
>
> Below is the scenario this project tries to tackle. When a join takes a
> partitioned table on one side, we usually need to scan the entire
> partitioned table unless we can drop particular child tables.
> XXXXJoin cond (x = y)
>  -> Append
>    -> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
>    -> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
>    -> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
>    -> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
>  -> Hash
>    -> SeqScan on other_table
>
> However, the CHECK() constraints assigned on the child tables give us a
> hint about which rows on the other side can never be related to this join.
> For example, all the rows in other_table to be joined with tbl_child_0
> must have a multiple of 4 as hash_func(y). We may be able to omit the
> unrelated rows from the hash table in this case, and that eventually
> allows us to reduce the size of the hash table.
>
> In case of INNER JOIN, we can rewrite the query execution plan as below.
>
> Append
>  -> HashJoin cond (x = y)
>    -> SeqScan on tbl_child_0
>    -> Hash ... Filter: hash_func(y) % 4 = 0
>      -> SeqScan on other_table
>  -> HashJoin cond (x = y)
>    -> SeqScan on tbl_child_1
>    -> Hash ... Filter: hash_func(y) % 4 = 1
>      -> SeqScan on other_table
>  -> HashJoin cond (x = y)
>    -> SeqScan on tbl_child_2
>    -> Hash ... Filter: hash_func(y) % 4 = 2
>      -> SeqScan on other_table
>  -> HashJoin cond (x = y)
>    -> SeqScan on tbl_child_3
>    -> Hash ... Filter: hash_func(y) % 4 = 3
>      -> SeqScan on other_table
>
> Unrelated rows are filtered out of the Hash table beforehand, which allows
> us to avoid hash table splits when its size reaches the work_mem
> limitation.
>
> This join pushdown is valuable for hash join, and for merge join if the MJ
> takes an unsorted relation and the number of rows to be sorted is a
> performance factor. Also, once Funnel becomes capable of running Append on
> background workers, it is also helpful for running NestLoop in parallel.
>
> How about opinions from third parties? I'm a bit biased, of course.
>
> OK, below are brief comments on the patch.
>
> * Suppose we focus on only HashJoin in the first version?
> This patch also adds support for NestLoop and MergeJoin; however, NestLoop
> has no valuable scenario without parallel execution capability, and the
> most valuable scenario for MergeJoin is reduction of rows prior to Sort.
> Once input rows get sorted, it is less attractive to filter out rows.
>
> * MultiExecHash() once puts the slot on outer_slot then moves it to
> inner_slot. This patch adds set_hash_references() to replace the varno in
> the expression of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
> If the Var nodes were initialized to reference inner_slot, you wouldn't
> need to re-assign the slot.
>
> I'll try to have a deeper investigation later.
>
> > This patch contains the following implementation, but I can't determine
> > whether it is correct or wrong.
> >
> > 1. Cost estimation
> > In this patch, an additional row filter is implemented for Hash join,
> > Merge join and Nested Loop.
> > I implemented the cost estimation for this filter by imitating other
> > parts that handle filters, but I am not sure this implementation is
> > correct.
> >
> @@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>       * not all of the quals may get evaluated at each tuple.)
>       */
>      startup_cost += qp_qual_cost.startup;
> +    startup_cost += filter_qual_cost.startup +
> +        filter_qual_cost.per_tuple * inner_path_rows;
>      cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
>      run_cost += cpu_per_tuple * hashjointuples;
>
> It seems to me it is not a fair estimation, because inner_path_rows means
> the number of rows already filtered out, but filter_qual shall be applied
> to all the inner input rows.
>
> > 2. Workaround for a failing assertion in allpaths.c
> > In standard_join_search(), we expect to have a single rel at the final
> > level. But this expectation is disappointed by the join pushdown feature,
> > because it will search for combinations not searched by the original
> > standard_join_search() at the final level.
> > Therefore, once trying join pushdown succeeds, a failing assertion
> > occurs in allpaths.c.
> >
> > So I implemented a workaround by temporarily setting NULL to
> > root->join_rel_level while trying the join pushdown, but I think this
> > implementation may be wrong.
> >
> It is my off-list suggestion. standard_join_search expects that the root
> of the partitioned tables will appear, but the child tables are out of its
> scope. Once we try to push down the join under the Append, we need to
> consider table joins between the inner table and every outer child table;
> however, that should not be visible to the standard_join_search context.
> From the standpoint of standard_join_search, it gets an AppendPath that
> represents a join of tables A and B, even if A contains 100 children and
> the join was pushed down on behalf of the AppendPath.
> So, it is a reasonable way to set NULL on root->join_rel_level to avoid
> unexpected RelOptInfo additions by build_join_rel().
> "To avoid the assertion" is one fact; however, the intention of the code
> is to avoid pollution of the global data structure. ;-)
>
> > 3. Searching pathkeys for Merge Join
> > When the join pushdown feature chooses merge join for the pushed-down
> > join operation, the planner fails to create the merge join node because
> > it is unable to find pathkeys for this merge join. I found this is
> > caused by skipping child tables in finding pathkeys.
> >
> > I expect that this is for making the planner faster, and I implemented
> > it so that the planner doesn't skip child tables in finding pathkeys for
> > merge join. But I am not sure this expectation is correct.
> >
> I'd like to recommend omitting MergeJoin support in the first version.
>
> Thanks,
>
> > Any comments/suggestions are welcome.
> > Remarks :
> > [1]
> > http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
> > [2]
> > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp
> >
> > Best regards,
> > --
> > Taiki Kondo
> >
> > NEC Solution Innovators, Ltd.
> >
> > -----Original Message-----
> > From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
> > Sent: Tuesday, August 18, 2015 5:47 PM
> > To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
> > Cc: Iwaasa Akio(岩浅 晃郎)
> > Subject: RE: [Proposal] Table partition + join pushdown
> >
> > Hello Kondo-san,
> >
> > I briefly checked your patch. Let me put some comments about its design
> > and implementation, even though I have no arguments against its
> > concept. :-)
> >
> > * Construction of RelOptInfo
> >
> > In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(),
> > constructs the RelOptInfo of the join-rel between the inner rel and a
> > subpath of the Append node. It is an entirely wrong implementation.
> >
> > I can understand that we (may) have no RelOptInfo for the join-rel
> > between tbl_child_0 and other_table when the planner investigates a
> > join path to join the Append path with other_table.
> >
> > > HashJoin
> > >  -> Append
> > >    -> SeqScan on tbl_child_0
> > >    -> SeqScan on tbl_child_1
> > >    -> SeqScan on tbl_child_2
> > >    -> SeqScan on tbl_child_3
> > >  -> Hash
> > >    -> SeqScan on other_table
> >
> > How about these alternatives?
> >
> > - Call make_join_rel() on the pair of tbl_child_X and other_table
> >   from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
> >   constructs a RelOptInfo for the supplied pair of relations, so the
> >   relevant RelOptInfo shall be properly constructed.
> > - make_join_rel() also calls add_paths_to_joinrel() for all the
> >   join logic, so it makes it easier to support pushing down other join
> >   logic, including nested-loop or custom joins.
> > - It may be an idea to add an extra argument to make_join_rel() to
> >   pass the expressions to be applied for tuple filtering on
> >   construction of the inner hash table.
> >
> > * Why is only SeqScan supported?
> >
> > I think it is the role of the Hash node to filter out inner tuples
> > obviously unrelated to the join (if the CHECK constraint of the outer
> > relation gives that information), because this join pushdown may be able
> > to support multi-stacked pushdowns.
> >
> > For example, what if the planner considers a path to join this
> > Append path with another relation, and the join clause contains a
> > reference to X?
> >
> > > Append
> > >  -> HashJoin
> > >    -> SeqScan on tbl_child_0
> > >    -> Hash ... Filter: hash_func(X) % 4 = 0
> > >      -> SeqScan on other_table
> > >  -> HashJoin
> > >    -> SeqScan on tbl_child_1
> > >    -> Hash ... Filter: hash_func(X) % 4 = 1
> > >      -> SeqScan on other_table
> > >  -> HashJoin
> > >    -> SeqScan on tbl_child_2
> > >    -> Hash ... Filter: hash_func(X) % 4 = 2
> > >      -> SeqScan on other_table
> > >  -> HashJoin
> > >    -> SeqScan on tbl_child_3
> > >    -> Hash ... Filter: hash_func(X) % 4 = 3
> > >      -> SeqScan on other_table
> >
> > It may be a good challenge to consider additional join pushdown, even if
> > the subpaths of the Append are HashJoins, not SeqScans, like:
> >
> > Append
> >  -> HashJoin
> >    -> HashJoin
> >      -> SeqScan on tbl_child_0
> >      -> Hash ... Filter: hash_func(X) % 4 = 0
> >        -> SeqScan on other_table
> >    -> Hash ... Filter: hash_func(X) % 4 = 0
> >      -> SeqScan on another_table
> >  -> HashJoin
> >    -> HashJoin
> >      -> SeqScan on tbl_child_1
> >      -> Hash ... Filter: hash_func(X) % 4 = 1
> >        -> SeqScan on other_table
> >    -> Hash ... Filter: hash_func(X) % 4 = 1
> >      -> SeqScan on another_table
> >  -> HashJoin
> >    -> HashJoin
> >      -> SeqScan on tbl_child_2
> >      -> Hash ... Filter: hash_func(X) % 4 = 2
> >        -> SeqScan on other_table
> >    -> Hash ... Filter: hash_func(X) % 4 = 2
> >      -> SeqScan on another_table
> >  -> HashJoin
> >    -> HashJoin
> >      -> SeqScan on tbl_child_3
> >      -> Hash ... Filter: hash_func(X) % 4 = 3
> >        -> SeqScan on other_table
> >    -> Hash ... Filter: hash_func(X) % 4 = 3
> >      -> SeqScan on another_table
> >
> > In this case, the underlying nodes are not always SeqScans. So, only the
> > Hash node can have the filter clauses.
> >
> > * Way to handle expression nodes
> >
> > All this patch supports is a CHECK() constraint with an equality
> > operation on the int4 data type. You can learn about various useful
> > pieces of PostgreSQL infrastructure. For example, ...
> > - expression_tree_mutator() is useful for making a copy of an
> >   expression node with small modifications.
> > - pull_varnos() is useful for checking which relations are referenced
> >   by an expression node.
> > - RestrictInfo->can_join is useful for checking whether the clause is
> >   a binary operator or not.
> >
> > Anyway, reuse of the existing infrastructure is the best way to build a
> > reliable feature and to keep the implementation simple.
> >
> > Thanks,
> > --
> > NEC Business Creation Division / PG-Strom Project KaiGai Kohei
> > <kaigai@ak.jp.nec.com>
> >
> > > -----Original Message-----
> > > From: pgsql-hackers-owner@postgresql.org
> > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> > > Sent: Thursday, August 13, 2015 6:30 PM
> > > To: pgsql-hackers@postgresql.org
> > > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
> > > Subject: [HACKERS] [Proposal] Table partition + join pushdown
> > >
> > > Hi all,
> > >
> > > I saw the email about the idea from KaiGai-san[1], and I worked to
> > > implement this idea.
> > >
> > > Now, I have implemented a part of this idea, so I want to propose this
> > > feature.
> > >
> > > Patch attached just shows my concept of this feature.
> > > It works fine for EXPLAIN, but it returns wrong results for other
> > > operations, sadly.
> > > [Original proposal quoted in full; trimmed, as it is identical to the
> > > first message of this thread.]
> > > Remarks :
> > > [1]
> > > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp
> > >
> > > Best regards,
> > > --
> > > Taiki Kondo
> > >
> > > NEC Solution Innovators, Ltd.
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>
Hello, KaiGai-san. Thank you for your comment, and sorry for late response. > If inner-scan of the join under the append node has param_info, its qualifier shall be implicitly attached to the scannode. So, if it is legal, I'd like to have this approach because it is less invasive than enhancement of Hash node. > > You mention about copyObject() to make a duplication of underlying scan-path. > Actually, copyObject() support is not minimum requirement, because all you need to do here is flat copy of the originalpath-node, then put param_info. > (Be careful to check whether the original path is not parametalized.) OK. I'll try implementation by a method you mentioned. Best regards, -- Taiki Kondo NEC Solution Innovators, Ltd. -----Original Message----- From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com] Sent: Wednesday, September 30, 2015 11:19 PM To: Kondo Taiki(近藤 太樹) Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown > > * Suppose we focus on only HashJoin in the first version? > > This patch also add support on NestLoop and MergeJoin, however, > > NestLoop has no valuable scenario without parallel execution > > capability, and the most valuable scenario on MergeJoin is reduction of rows prior to Sort. > > Once input rows gets sorted, it is less attractive to filter out rows. > > I agree that handling for NestLoop doesn't make sense in this timing. > But I think that handling for MergeJoin still makes sense in this timing. > > In my v1 patch, I implemented that the additional filter is used for > qualification at same place as join filter, same as NestLoop. > It is not useful indeed. I agree with you at this point. > > I think, and you also mentioned, large factor of cost estimation for > MergeJoin is Sort under MergeJoin, so I believe additional filtering > at Sort is a good choice for this situation, as same way at Hash under > HashJoin. 
> > Furthermore, I think it is better way that the additional filtering > shall be "added" to Scan node under each child (pushed-down) Join > nodes, because we don't have to implement additional qualification at > Join nodes and we only have to implement simply concatenating original > and additional RestrictInfos for filtering. > > As a mere idea, for realizing this way, I think we have to implement > copyObject() for Scan path nodes and use ppi_clauses for this usage. > > What is your opinion? > You are saying this part at create_scan_plan(), aren't you. /* * If this is a parameterized scan, we also need to enforce all the join * clauses available from the outer relation(s). * * For paranoia's sake, don't modify the stored baserestrictinfo list. */ if (best_path->param_info) scan_clauses = list_concat(list_copy(scan_clauses), best_path->param_info->ppi_clauses); If inner-scan of the join under the append node has param_info, its qualifier shall be implicitly attached to the scan node.So, if it is legal, I'd like to have this approach because it is less invasive than enhancement of Hash node. You mention about copyObject() to make a duplication of underlying scan-path. Actually, copyObject() support is not minimum requirement, because all you need to do here is flat copy of the original path-node,then put param_info. (Be careful to check whether the original path is not parametalized.) ParamPathInfo is declared as below: typedef struct ParamPathInfo { NodeTag type; Relids ppi_req_outer; /* rels supplying parameters used by path */ double ppi_rows; /* estimatednumber of result tuples */ List *ppi_clauses; /* join clauses available from outer rels */ } ParamPathInfo; You may need to set the additional filter on ppi_clauses, number of rows after the filtering on ppi_rows and NULL on ppi_req_outer. However, I'm not 100% certain whether NULL is legal value on ppi_req_outer. If somebody can comment on, it is helpful. 
> > * MultiExecHash() once put slot on outer_slot then move it to > > inner_slot This patch add set_hash_references() to replace varno in > > the expression of Hash->filterqual to OUTER_VAR. Why not INNER_VAR? > > If Var nodes would be initialized t oreference inner_slot, you don't > > need to re-assign slot. > > The node under Hash node is connected as the OUTER node. This > implementation may be from implementation of > set_dummy_tlist_references() commonly used by Material, Sort, Unique, > SetOp, and Hash. > > And I was faced a problem when I was implementing EXPLAIN for the additional filter. > I implemented same as you mentioned above, then error occurred in running EXPLAIN. > I think EXPLAIN expects expression's varno is same as the position > that the under node is connected to; i.e. if it is connected to OUTER, > varno must be OUTER_VAR. > Ah, OK. It is a trade-off matter, indeed. > > It seems to me it is not a fair estimation because inner_path_rows > > means number of rows already filtered out, but filter_qual shall be > > applied to all the inner input rows. > > OK. I'll fix it. > > > Best regards, > -- > Taiki Kondo > > NEC Solution Innovators, Ltd. > > > > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai > Sent: Tuesday, September 29, 2015 11:46 AM > To: Taiki Kondo > Cc: Akio Iwaasa; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > > -----Original Message----- > > From: pgsql-hackers-owner@postgresql.org > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo > > Sent: Thursday, September 24, 2015 8:06 PM > > To: Kaigai Kouhei(海外 浩平) > > Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > > > Hello, KaiGai-san. > > > > Thank you for your comment, and sorry for late response. 
> > > > The attached patch is completely rewritten from previous patch[1], > > at your suggestion[2]. > > Please find attached. > > > Thanks for your work, and let me introduce purpose of the work > briefly, because the last submission was August. > > His work intends (1) to save resource consumption on tables join at > this moment, and (2) to provide an infrastructure of one parallel join > scenario once Funnel node gets capable. > > Even if we construct partition tables, it is unavailable to utilize to > filter out candidate rows of join. In the result, size of Hash table > may grow more than necessity and it causes unnecessary nBatch increase. > > Below is the scenario this project tries to tackle. In case when > tables join takes partitioned table on one side, usually, we once need > to run entire partitioned table unless we cannot drop particular child tables. > > XXXXJoin cond (x = y) > -> Append > -> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0) > -> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1) > -> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2) > -> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3) > -> Hash > -> SeqScan on other_table > > However, CHECK() constraint assigned on child tables give us hint > which rows in other side are never related to this join. > For example, all the rows in other_table to be joined with tbl_child_0 > should have multiple number of 4 on hash_func(y). We may be able to > omit unrelated rows from the hash-table in this case, then it > eventually allows to reduce the size of hash table. > > In case of INNER_JOIN, we can rewrite the query execution plan as below. > > Append > -> HashJoin cond (x = y) > -> SeqScan on tbl_child_0 > -> Hash ... Filter: hash_func(y) % 4 = 0 > -> SeqScan on other_table > -> HashJoin cond (x = y) > -> SeqScan on tbl_child_1 > -> Hash ... Filter: hash_func(y) % 4 = 1 > -> SeqScan on other_table > -> HashJoin cond (x = y) > -> SeqScan on tbl_child_2 > -> Hash ... 
Filter: hash_func(y) % 4 = 2 > -> SeqScan on other_table > -> HashJoin cond (x = y) > -> SeqScan on tbl_child_3 > -> Hash ... Filter: hash_func(y) % 4 = 3 > -> SeqScan on other_table > > Unrelated rows of Hash table is preliminarily, it allows to avoid hash > table split when its size reaches to work_mem limitation. > > This join-pushdown is valuable on hash-join and merge-join if MJ takes > unsorted relation and number of rows to be sorted is performance factor. > Also, once Funnel gets capable to run Append on background worker, it > is also helpful to run NestLoop in parallel. > > How about the opinion from third parties? I'm a bit biased, of course. > > > OK, below is the brief comment to patch. > > * Suppose we focus on only HashJoin in the first version? > This patch also add support on NestLoop and MergeJoin, however, > NestLoop has no valuable scenario without parallel execution > capability, and the most valuable scenario on MergeJoin is reduction of rows prior to Sort. > Once input rows gets sorted, it is less attractive to filter out rows. > > * MultiExecHash() once put slot on outer_slot then move it to > inner_slot This patch add set_hash_references() to replace varno in > the expression of Hash->filterqual to OUTER_VAR. Why not INNER_VAR? > If Var nodes would be initialized t oreference inner_slot, you don't > need to re-assign slot. > > I'll try to have deeper investigation, later. > > > This patch contains following implementation, but I can't determine > > this is correct or wrong. > > > > 1. Cost estimation > > In this patch, additional row filter is implemented for Hash, Merge > > join and > Nested > > Loop. > > I implemented cost estimation feature for this filter by watching > > other parts of filters, but I am not sure this implementation is > > correct. > > > > @@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path, > * not all of the quals may get evaluated at each tuple.) 
> */ > startup_cost += qp_qual_cost.startup; > + startup_cost += filter_qual_cost.startup + > + filter_qual_cost.per_tuple * inner_path_rows; > cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple; > run_cost += cpu_per_tuple * hashjointuples; > > It seems to me this is not a fair estimation, because inner_path_rows > means the number of rows remaining after the filter, but filter_qual has to be > applied to all the inner input rows. > > > > 2. Workaround for failing assertion at allpaths.c > > In standard_join_search(), we expect to have a single rel at the final level. > > But this expectation is broken by the join pushdown feature, > > because it searches for combinations not searched by the original > > standard_join_search() at the final level. Therefore, once trying > > join pushdown succeeds, the assertion in allpaths.c fails. > > > > So I implemented a workaround by temporarily setting > > root->join_rel_level to NULL while trying join pushdown, but I think this implementation may be wrong. > > > That was my off-list suggestion. standard_join_search expects that the root of > the partitioned tables will appear, but child tables are out of its scope. > Once we try to push down the join under the Append, we need to > consider joins between the inner table and every outer child table; > however, these should not be visible to the standard_join_search context. > From the standpoint of standard_join_search, it gets an AppendPath that > represents a join of tables A and B, even if A contains 100 children and the > join was pushed down beneath the AppendPath. > So, it is a reasonable way to set root->join_rel_level to NULL to > avoid unexpected RelOptInfo additions by build_join_rel(). > "to avoid the assertion" is one fact; however, the intention of the code is to > avoid pollution of the global data structure. ;-) > > > > 3.
Searching pathkeys for Merge Join > > When the join pushdown feature > > chooses merge join for the pushed-down join operation, the planner fails to > > create the merge join node because it is unable to find pathkeys for > > this merge join. I found this is caused by skipping child tables when > > finding pathkeys. > > > > I expect that this is for making the planner faster, so I implemented it > > such that the planner does not skip child tables when finding pathkeys for merge join. > > But I am not sure this expectation is correct. > > > I'd like to recommend omitting MergeJoin support in the first version. > > Thanks, > > > Any comments/suggestions are welcome. > > > > > > Remarks : > > [1] > > http://www.postgresql.org/message-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp > > [2] > > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp > > > > Best regards, > > -- > > Taiki Kondo > > > > NEC Solution Innovators, Ltd. > > > > -----Original Message----- > > From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com] > > Sent: Tuesday, August 18, 2015 5:47 PM > > To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org > > Cc: Iwaasa Akio(岩浅 晃郎) > > Subject: RE: [Proposal] Table partition + join pushdown > > > > Hello Kondo-san, > > > > I briefly checked your patch. Let me put some comments about its > > design and implementation, even though I have no arguments against > > its concept. :-) > > > > * Construction of RelOptInfo > > > > In your patch, try_hashjoin_pushdown() called by try_hashjoin_path() constructs > > a RelOptInfo for the join-rel between the inner-rel and a subpath of the Append > > node. It is an entirely wrong implementation. > > > > I can understand we (may) have no RelOptInfo for the joinrel between > > tbl_child_0 and other_table when the planner investigates a joinpath to > > join the Append path with other_table.
> > > > > HashJoin > > > -> Append > > > -> SeqScan on tbl_child_0 > > > -> SeqScan on tbl_child_1 > > > -> SeqScan on tbl_child_2 > > > -> SeqScan on tbl_child_3 > > > -> Hash > > > -> SeqScan on other_table > > > > > How about these alternatives? > > > > - calls make_join_rel() to the pair of tbl_child_X and other_table > > by try_hashjoin_pushdown() or somewhere. make_join_rel() internally > > construct a RelOptInfo for the supplied pair of relations, so > > relevant RelOptInfo shall be properly constructed. > > - make_join_rel() also calls add_paths_to_joinrel() towards all the > > join logic, so it makes easier to support to push down other join > > logic including nested-loop or custom-join. > > - It may be an idea to add an extra argument to make_join_rel() to > > inform expressions to be applied for tuple filtering on > > construction of inner hash table. > > > > * Why only SeqScan is supported > > > > I think it is the role of Hash-node to filter out inner tuples > > obviously unrelated to the join (if CHECK constraint of outer > > relation gives information), because this join-pushdown may be able to support multi-stacked pushdown. > > > > For example, if planner considers a path to join this Append-path > > with another relation, and join clause contains a reference to X? > > > > > Append > > > -> HashJoin > > > -> SeqScan on tbl_child_0 > > > -> Hash ... Filter: hash_func(X) % 4 = 0 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_1 > > > -> Hash ... Filter: hash_func(X) % 4 = 1 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_2 > > > -> Hash ... Filter: hash_func(X) % 4 = 2 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_3 > > > -> Hash ... 
Filter: hash_func(X) % 4 = 3 > > > -> SeqScan on other_table > > > > > It may be a good challenge to consider additional join pushdown, > > even if subpaths of Append are HashJoin, not SeqScan, like: > > > > Append > > -> HashJoin > > -> HashJoin > > -> SeqScan on tbl_child_0 > > -> Hash ... Filter: hash_func(X) % 4 = 0 > > -> SeqScan on other_table > > -> Hash ... Filter: hash_func(X) % 4 = 0 > > -> SeqScan on another_table > > -> HashJoin > > -> HashJoin > > -> SeqScan on tbl_child_1 > > -> Hash ... Filter: hash_func(X) % 4 = 1 > > -> SeqScan on other_table > > -> Hash ... Filter: hash_func(X) % 4 = 1 > > -> SeqScan on another_table > > -> HashJoin > > -> HashJoin > > -> SeqScan on tbl_child_2 > > -> Hash ... Filter: hash_func(X) % 4 = 2 > > -> SeqScan on other_table > > -> Hash ... Filter: hash_func(X) % 4 = 2 > > -> SeqScan on another_table > > -> HashJoin > > -> HashJoin > > -> SeqScan on tbl_child_3 > > -> Hash ... Filter: hash_func(X) % 4 = 3 > > -> SeqScan on other_table > > -> Hash ... Filter: hash_func(X) % 4 = 3 > > -> SeqScan on another_table > > > > In this case, underlying nodes are not always SeqScan. So, only > > Hash-node can have filter clauses. > > > > > > * Way to handle expression nodes > > > > All this patch supported is CHECK() constraint with equal operation > > on INT4 > data > > type. You can learn various useful infrastructure of PostgreSQL. For example, ... > > - expression_tree_mutator() is useful to make a copy of expression > > node with small modification > > - pull_varnos() is useful to check which relations are referenced > > by the expression node. > > - RestrictInfo->can_join is useful to check whether the clause is > > binary operator, or not. > > > > > > Anyway, reuse of existing infrastructure is the best way to make a > > reliable infrastructure and to keep the implementation simple. 
> > > > Thanks, > > -- > > NEC Business Creation Division / PG-Strom Project KaiGai Kohei > > <kaigai@ak.jp.nec.com> > > > > > > > -----Original Message----- > > > From: pgsql-hackers-owner@postgresql.org > > > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki > > > Kondo > > > Sent: Thursday, August 13, 2015 6:30 PM > > > To: pgsql-hackers@postgresql.org > > > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎) > > > Subject: [HACKERS] [Proposal] Table partition + join pushdown > > > > > > Hi all, > > > > > > I saw the email about the idea from KaiGai-san[1], and I worked to > > > implement this idea. > > > > > > Now, I have implemented a part of this idea, so I want to propose > > > this feature. > > > > > > Patch attached just shows my concept of this feature. > > > It works fine for EXPLAIN, but it returns wrong result for other > > > operations, > > sadly. > > > > > > > > > > > > Table partition + join pushdown > > > =============================== > > > > > > Motivation > > > ---------- > > > To make join logic working more effectively, it is important to > > > make the size of relations smaller. > > > > > > Especially in Hash-join, it is meaningful to make the inner > > > relation smaller, because smaller inner relation can be stored > > > within smaller hash > table. > > > This means that memory usage can be reduced when joining with big tables. > > > > > > > > > Design > > > ------ > > > It was mentioned by the email from KaiGai-san. > > > So I quote below here... > > > > > > ---- begin quotation --- > > > Let's assume a table which is partitioned to four portions, and > > > individual child relations have constraint by hash-value of its ID > > > field. > > > > > > tbl_parent > > > + tbl_child_0 ... CHECK(hash_func(id) % 4 = 0) > > > + tbl_child_1 ... CHECK(hash_func(id) % 4 = 1) > > > + tbl_child_2 ... CHECK(hash_func(id) % 4 = 2) > > > + tbl_child_3 ... 
CHECK(hash_func(id) % 4 = 3) > > > > > > If someone tried to join another relation with tbl_parent using > > > equivalence condition, like X = tbl_parent.ID, we know inner > > > tuples that does not satisfies the condition > > > hash_func(X) % 4 = 0 > > > shall be never joined to the tuples in tbl_child_0. > > > So, we can omit to load these tuples to inner hash table > > > preliminary, then it potentially allows to split the inner hash-table. > > > > > > Current typical plan structure is below: > > > > > > HashJoin > > > -> Append > > > -> SeqScan on tbl_child_0 > > > -> SeqScan on tbl_child_1 > > > -> SeqScan on tbl_child_2 > > > -> SeqScan on tbl_child_3 > > > -> Hash > > > -> SeqScan on other_table > > > > > > It may be rewritable to: > > > > > > Append > > > -> HashJoin > > > -> SeqScan on tbl_child_0 > > > -> Hash ... Filter: hash_func(X) % 4 = 0 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_1 > > > -> Hash ... Filter: hash_func(X) % 4 = 1 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_2 > > > -> Hash ... Filter: hash_func(X) % 4 = 2 > > > -> SeqScan on other_table > > > -> HashJoin > > > -> SeqScan on tbl_child_3 > > > -> Hash ... Filter: hash_func(X) % 4 = 3 > > > -> SeqScan on other_table > > > ---- end quotation --- > > > > > > In the quotation above, it was written that filter is set at Hash node. > > > But I implemented that filter is set at SeqScan node under Hash node. > > > In my opinion, filtering tuples is work of Scanner. > > > > > > Append > > > -> HashJoin > > > -> SeqScan on tbl_child_0 > > > -> Hash > > > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0 > > > -> HashJoin > > > -> SeqScan on tbl_child_1 > > > -> Hash > > > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1 > > > -> HashJoin > > > -> SeqScan on tbl_child_2 > > > -> Hash > > > -> SeqScan on other_table ... 
Filter: hash_func(X) % 4 = 2 > > > -> HashJoin > > > -> SeqScan on tbl_child_3 > > > -> Hash > > > -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3 > > > > > > > > > API > > > --- > > > There are 3 new internal (static) functions to implement this feature. > > > try_hashjoin_pushdown(), which is main function of this feature, > > > is called from try_hashjoin_path(), and tries to push down > > > HashPath under the AppendPath. > > > > > > To do so, this function does following operations. > > > > > > 1. Check if this Hash-join can be pushed down under AppendPath > > > 2. To avoid an influence on other Path making operation, > > > copy inner path's RelOptInfo and make new SeqScan path from it. > > > At here, get CHECK() constraints from OUTER path, and convert its > > > Var node according to join condition. And also convert Var nodes > > > in join condition itself. > > > 3. Create new HashPath nodes between each sub-paths of AppendPath and > > > inner path made above. > > > 4. When operations from 1 to 3 are done for each sub-paths, > > > create new AppendPath which sub-paths are HashPath nodes made above. > > > > > > get_replaced_clause_constr() is called from > > > try_hashjoin_pushdown(), and get_var_nodes_recurse() is called from get_replaced_cluase_constr(). > > > These 2 functions help above operations. > > > (I may revise this part to use expression_tree_walker() and > > > expression_tree_mutator().) > > > > > > In patch attached, it has the following limitations. > > > o It only works for hash-join operation. > > > (I want to support not only hash-join but also other logic.) > > > o Join conditions must be "=" operator with int4 variables. > > > o Inner path must be SeqScan. > > > (I want to support other path-node.) > > > o And now, planner may not choose this plan, > > > because estimated costs are usually larger than original > > > (non-pushdown) > > plan. 
> > > > > > And also 1 internal (static) function, get_relation_constraints() > > > defined in plancat.c, is changed to be global. This function will be > > > called from > > > get_replaced_clause_constr() to get CHECK() constraints. > > > > > > Usage > > > ----- > > > To use this feature, create partitioned tables and a small table to > > > join, and run a SELECT that joins these tables. > > > > > > For your convenience, I attach DDL and DML scripts. > > > And I also attach the result of EXPLAIN. > > > > > > Any comments are welcome. But first, I need your advice to > > > correct this patch's behavior. > > > > > > At least, I think it has to expand the array of RangeTblEntry and > > > other arrays defined in PlannerInfo to register new RelOptInfos > > > for the new Path nodes mentioned above. > > > Or is it a better choice to modify the query parser to implement this > > > feature further? > > > > > > Remarks : > > > [1] > > > http://www.postgresql.org/message-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp > > > > > > Best regards, > > > -- > > > Taiki Kondo > > > > > > NEC Solution Innovators, Ltd. > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To > make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
Hello. I tried to read this and I have some random comments on it. -- general I got some warnings on compilation about unused variables and a wrong argument type. I failed to come up with a query that this patch works on. Could you give me some specific example for this patch? This patch needs more comments. Please put comments about not only what the code does but also the reason and other background for it. -- about namings Names for functions and variables need to be more appropriate; in other words, they need to properly convey what they are. The following are examples of such names. "added_restrictlist", widely distributed as many function arguments and in JoinPathExtraData, makes me feel dizzy. create_mergejoin_path takes it as "filtering_clauses", which looks far better. try_join_pushdown() is also a name with a much wider meaning. This patch tries to move hashjoins on an inheritance parent to under append paths. It could be generically called 'pushdown', but this would better be called something like 'transform appended hashjoin' or 'hashjoin distribution'. The latter would be better. (The function name would be try_distribute_hashjoin in that case.) The name make_restrictinfos_from_check_constr() also tells me the wrong thing. For example, extract_constraints_for_hashjoin_distribution() would inform me about what it returns. -- about what make_restrictinfos_from_check_constr() does In make_restrictinfos_from_check_constr, the function returns modified constraint predicates corresponding to Vars under hashjoinable join clauses. I don't think expression_tree_mutator is suitable to do that, since it could allow unwanted results when constraint predicates or join clauses are not simple OpExpr's. Could you try a more simple and straightforward way to do that? Otherwise, could you give me a clear explanation of what it does? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, Horiguchi-san. Thank you for your comments. > I got some warning on compilation on unused variables and wrong > arguemtn type. OK, I'll fix it. > I failed to have a query that this patch works on. Could you let > me have some specific example for this patch? Please find attached. Also, please make sure that work_mem is set to '64kB' (not 64MB). If work_mem is large enough to create the hash table for the relation after appending, its cost may be better than the pushed-down plan's cost, and then the planner will not choose the pushed-down plan this patch makes. So, to see this patch working, the work_mem size must be smaller than the hash table size for the relation after appending. > This patch needs more comments. Please put comment about not only > what it does but also the reason and other things for it. OK, I'll put more comments in the code. But it may take a long time... > -- about namings > > Names for functions and variables are needed to be more > appropriate, in other words, needed to be what properly informs > what they are. The followings are the examples of such names. Thank you for your suggestions. I also think these names are not very good. I'll try to make the names better, but it may take a long time... Of course, I will use your suggestions as a reference. > "added_restrictlist"'s widely distributed as many function > arguemnts and JoinPathExtraData makes me feel > dizzy.. "added_restrictinfo" will be deleted from almost all functions other than try_join_pushdown() in the next (v2) patch, because the place of the filtering using this info will be changed from the Join node to the Scan node, so there is no need to pass it anywhere other than try_join_pushdown(). > In make_restrictinfos_from_check_constr, the function returns > modified constraint predicates correspond to vars under > hashjoinable join clauses. I don't think expression_tree_mutator > is suitable to do that since it could allow unwanted result when > constraint predicates or join clauses are not simple OpExpr's.
Do you have any example of this situation? I am trying to find the unwanted results you mentioned, but I don't have any such results at this time. I have a hunch that it may allow unwanted results, because I have thought only about very simple situations for this function. > Otherwise could you give me clear explanation on what it does? This function transforms a CHECK() constraint into a filter expression by the following procedure. (1) Get the outer table's CHECK() constraint by using get_relation_constraints(). (2) Walk through the expression tree obtained in (1) by using expression_tree_mutator() with check_constraint_mutator(), and change only the outer's Var nodes to the inner's ones according to the join clause. For example, when the CHECK() constraint of table A is "num % 4 = 0" and the join clause between tables A and B is "A.num = B.data", then we can get "B.data % 4 = 0" for filtering purposes. This also accepts a more complex join clause like "A.num = B.data * 2"; then we can get "(B.data * 2) % 4 = 0". In procedure (2), to decide whether or not to use each join clause for changing Var nodes, I implemented check_constraint_mutator() to judge whether the join clause is hash-joinable or not. Actually, I want to judge whether the OpExpr at the top of the expression tree of the join clause means "=" or not, but I can't find how to do it. If you know how to do it, please let me know. Best regards, -- Taiki Kondo NEC Solution Innovators, Ltd. -----Original Message----- From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] Sent: Tuesday, October 06, 2015 8:35 PM To: tai-kondo@yk.jp.nec.com Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
Hello, thank you for the example. I could see this patch working with it. > > In make_restrictinfos_from_check_constr, the function returns > > modified constraint predicates correspond to vars under > > hashjoinable join clauses. I don't think expression_tree_mutator > > is suitable to do that since it could allow unwanted result when > > constraint predicates or join clauses are not simple OpExpr's. > > Do you have any example of this situation? As a rather simple case on the test environment made by the provided script, the following query, explain analyze select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id; Makes the mutation fail then result in an assertion failure. | TRAP: FailedAssertion("!(list_length(check_constr) == list_length(result))", File: "joinpath.c", Line: 1608) This is because both 'check_test_div.id + 1' and inner_t.id don't match the var-side of the constraints. I don't see clearly what to do for the situation for now but this is the one of the most important function for this feature and should be cleanly designed. At Thu, 8 Oct 2015 08:28:04 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote in <12A9442FBAE80D4E8953883E0B84E0885F9913@BPXM01GP.gisp.nec.co.jp> > Hello, Horiguchi-san. > > Thank you for your comment. > > > I got some warning on compilation on unused variables and wrong > > arguemtn type. > > OK, I'll fix it. > > > I failed to have a query that this patch works on. Could you let > > me have some specific example for this patch? > > Please find attached. > And also make sure that setting of work_mem is '64kB' (not 64MB). > > If there is the large work_mem enough to create hash table for > relation after appending, its cost may be better than pushed-down > plan's cost, then planner will not choose pushed-down plan this patch makes. > So, to make this patch working fine, work_mem size must be smaller than > the hash table size for relation after appending. 
> > > This patch needs more comments. Please put comment about not only > > what it does but also the reason and other things for it. > > OK, I'll put more comments in the code. > But it will take a long time, maybe... Sure. > > -- about namings > > > > Names for functions and variables are needed to be more > > appropriate, in other words, needed to be what properly informs > > what they are. The followings are the examples of such names. > > Thank you for your suggestion. > > I also think these names are not good much. > I'll try to make names better , but it maybe take a long time... > Of course, I will use your suggestion as reference. Thanks. > > "added_restrictlist"'s widely distributed as many function > > arguemnts and JoinPathExtraData makes me feel > > dizzy.. > > "added_restrictinfo" will be deleted from almost functions > other than try_join_pushdown() in next (v2) patch because > the place of filtering using this info will be changed > from Join node to Scan node and not have to place it > into other than try_join_pushdown(). I'm looking forward to see it. > > In make_restrictinfos_from_check_constr, the function returns > > modified constraint predicates correspond to vars under > > hashjoinable join clauses. I don't think expression_tree_mutator > > is suitable to do that since it could allow unwanted result when > > constraint predicates or join clauses are not simple OpExpr's. > > Do you have any example of this situation? > I am trying to find unwanted results you mentioned, but I don't have > any results in this timing. I have a hunch that it will allow unwanted > results because I have thought only about very simple situation for > this function. As mentioned above. > > Otherwise could you give me clear explanation on what it does? > > This function transfers CHECK() constraint to filter expression by following > procedures. > (1) Get outer table's CHECK() constraint by using get_relation_constraints(). 
> (2) Walk through expression tree got in (1) by using expression_tree_mutator() > with check_constraint_mutator() and change only outer's Var node to > inner's one according to join clause. > > For example, when CHECK() constraint of table A is "num % 4 = 0" and > join clause between table A and B is "A.num = B.data", > then we can get "B.data % 4 = 0" for filtering purpose. > > This also accepts more complex join clause like "A.num = B.data * 2", > then we can get "(B.data * 2) % 4 = 0". > > In procedure (2), to decide whether to use each join clause for changing > Var node or not, I implement check_constraint_mutator() to judge whether > join clause is hash-joinable or not. Thanks for the explanation. I think that the function has been made considering only rather plain cases. We should put more thought into making the logic clearer so that we can define the desired/possible capabilities and limitations clearly. > Actually, I want to judge whether OpExpr as top expression tree of > join clause means "=" or not, but I can't find how to do it. > > If you know how to do it, please let me know. I don't see how for now, either :p But we should at least give more consideration to the mechanism for obtaining the expressions. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo > Sent: Thursday, October 08, 2015 5:28 PM > To: Kyotaro HORIGUCHI > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎); > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > Hello, Horiguchi-san. > > Thank you for your comment. > > > I got some warning on compilation on unused variables and wrong > > arguemtn type. > > OK, I'll fix it. > > > I failed to have a query that this patch works on. Could you let > > me have some specific example for this patch? > > Please find attached. > And also make sure that setting of work_mem is '64kB' (not 64MB). > > If there is the large work_mem enough to create hash table for > relation after appending, its cost may be better than pushed-down > plan's cost, then planner will not choose pushed-down plan this patch makes. > So, to make this patch working fine, work_mem size must be smaller than > the hash table size for relation after appending. > > > This patch needs more comments. Please put comment about not only > > what it does but also the reason and other things for it. > > OK, I'll put more comments in the code. > But it will take a long time, maybe... > People (including me) can help. Even though your English capability is not enough, it is significant to put intention of the code. > > -- about namings > > > > Names for functions and variables are needed to be more > > appropriate, in other words, needed to be what properly informs > > what they are. The followings are the examples of such names. > > Thank you for your suggestion. > > I also think these names are not good much. > I'll try to make names better , but it maybe take a long time... > Of course, I will use your suggestion as reference. > > > "added_restrictlist"'s widely distributed as many function > > arguemnts and JoinPathExtraData makes me feel > > dizzy.. 
> > "added_restrictinfo" will be deleted from almost functions > other than try_join_pushdown() in next (v2) patch because > the place of filtering using this info will be changed > from Join node to Scan node and not have to place it > into other than try_join_pushdown(). > This restrictinfo intends to filter out rows that are obviously unrelated to this join, due to the check constraint on the other side of the join. So, a correct but redundant name would be: restrictlist_to_drop_unrelated_rows_because_of_check_constraint How about 'restrictlist_by_constraint' instead? > > In make_restrictinfos_from_check_constr, the function returns > > modified constraint predicates correspond to vars under > > hashjoinable join clauses. I don't think expression_tree_mutator > > is suitable to do that since it could allow unwanted result when > > constraint predicates or join clauses are not simple OpExpr's. > > Do you have any example of this situation? > I am trying to find unwanted results you mentioned, but I don't have > any results in this timing. I have a hunch that it will allow unwanted > results because I have thought only about very simple situation for > this function. > check_constraint_mutator makes the modified restrictlist by replacing Var nodes only when the join clause is hash-joinable. That implies an <expr> = <expr> form, thus we can safely replace the expression with the other side. Of course, we still have cases where we cannot replace expressions simply. - If the function (or a function called by an operator) has the volatile attribute (though who uses a volatile function in a CHECK constraint for partitioning?) - If it is uncertain whether the expression always returns the same result. (Is it possible for the constraint to contain a SubLink?) I'd like to suggest using a white-list approach in this mutator routine. It means that only immutable expression nodes are allowed into the modified restrictlist. The thing to do is:

check_constraint_mutator(...)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, Var))
    {
        :
    }
    else if (node is not obviously immutable)
    {
        /* prohibit making the expression if it contains an uncertain node */
        context->is_mutated = false;
    }
    return expression_tree_mutator(...);
}

> > Otherwise could you give me clear explanation on what it does? > > This function transfers CHECK() constraint to filter expression by following > procedures. > (1) Get outer table's CHECK() constraint by using get_relation_constraints(). > (2) Walk through expression tree got in (1) by using expression_tree_mutator() > with check_constraint_mutator() and change only outer's Var node to > inner's one according to join clause. > > For example, when CHECK() constraint of table A is "num % 4 = 0" and > join clause between table A and B is "A.num = B.data", > then we can get "B.data % 4 = 0" for filtering purpose. > > This also accepts more complex join clause like "A.num = B.data * 2", > then we can get "(B.data * 2) % 4 = 0". > > In procedure (2), to decide whether to use each join clause for changing > Var node or not, I implement check_constraint_mutator() to judge whether > join clause is hash-joinable or not. > > Actually, I want to judge whether OpExpr as top expression tree of > join clause means "=" or not, but I can't find how to do it. > > If you know how to do it, please let me know. > > Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> > -----Original Message----- > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > Sent: Tuesday, October 06, 2015 8:35 PM > To: tai-kondo@yk.jp.nec.com > Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
Hello, Horiguchi-san.

Sorry for the late reply.

> explain analyze
> select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id;
>
> makes the mutation fail, resulting in an assertion failure.
>
> | TRAP: FailedAssertion("!(list_length(check_constr) ==
> | list_length(result))", File: "joinpath.c", Line: 1608)
>
> This is because both 'check_test_div.id + 1' and inner_t.id don't
> match the var-side of the constraints.

Thank you for picking up a failure example. This is exactly a bug. I'll fix it.

> I don't see clearly what to do for the situation for now, but this
> is one of the most important functions for this feature and
> should be cleanly designed.

Yes, this function is one of the important features of this patch.

This function makes new filtering conditions from CHECK() constraints.
This is to reduce the number of rows, making the hash table smaller (or
making sorting faster for MergeJoin) so it fits into a smaller work_mem
environment.

Maybe I must collect realistic examples of CHECK() constraints, as used for
table partitioning, to design this more cleanly.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Thursday, October 08, 2015 7:04 PM
To: tai-kondo@yk.jp.nec.com
Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, thank you for the example. I could see this patch working with it.

> > In make_restrictinfos_from_check_constr, the function returns
> > modified constraint predicates corresponding to Vars under hashjoinable
> > join clauses. I don't think expression_tree_mutator is suitable to
> > do that, since it could allow an unwanted result when constraint
> > predicates or join clauses are not simple OpExpr's.
>
> Do you have any example of this situation?
As a rather simple case, on the test environment made by the provided script,
the following query:

explain analyze
select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id;

makes the mutation fail, resulting in an assertion failure.

| TRAP: FailedAssertion("!(list_length(check_constr) ==
| list_length(result))", File: "joinpath.c", Line: 1608)

This is because both 'check_test_div.id + 1' and inner_t.id don't match the
var-side of the constraints.

I don't see clearly what to do for the situation for now, but this is one of
the most important functions for this feature and should be cleanly designed.

At Thu, 8 Oct 2015 08:28:04 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com>
wrote in <12A9442FBAE80D4E8953883E0B84E0885F9913@BPXM01GP.gisp.nec.co.jp>
> Hello, Horiguchi-san.
>
> Thank you for your comment.
>
> > I got some warnings on compilation about unused variables and a wrong
> > argument type.
>
> OK, I'll fix it.
>
> > I failed to have a query that this patch works on. Could you let me
> > have some specific example for this patch?
>
> Please find attached.
> And also make sure that the setting of work_mem is '64kB' (not 64MB).
>
> If work_mem is large enough to create the hash table for the relation
> after appending, its cost may be better than the pushed-down plan's cost,
> and then the planner will not choose the pushed-down plan this patch makes.
> So, to make this patch work, the work_mem size must be smaller than the
> hash table size for the relation after appending.
>
> > This patch needs more comments. Please put comments about not only
> > what it does but also the reason and other things for it.
>
> OK, I'll put more comments in the code.
> But it will take a long time, maybe...

Sure.

> > -- about namings
> >
> > Names for functions and variables need to be more appropriate,
> > in other words, need to be what properly informs what they are.
> > The followings are examples of such names.
>
> Thank you for your suggestion.
> I also think these names are not so good.
> I'll try to make the names better, but it may take a long time...
> Of course, I will use your suggestions as reference.

Thanks.

> > "added_restrictlist"'s widely distributed as many function arguments
> > and JoinPathExtraData makes me feel dizzy..
>
> "added_restrictinfo" will be deleted from almost all functions other than
> try_join_pushdown() in the next (v2) patch because the place of filtering
> using this info will be changed from the Join node to the Scan node, so
> there is no need to place it anywhere other than try_join_pushdown().

I'm looking forward to seeing it.

> > In make_restrictinfos_from_check_constr, the function returns
> > modified constraint predicates corresponding to Vars under hashjoinable
> > join clauses. I don't think expression_tree_mutator is suitable to
> > do that, since it could allow an unwanted result when constraint
> > predicates or join clauses are not simple OpExpr's.
>
> Do you have any example of this situation?
> I am trying to find the unwanted results you mentioned, but I don't have
> any results at this time. I have a hunch that it will allow unwanted
> results, because I have thought only about a very simple situation for
> this function.

As mentioned above.

> > Otherwise could you give me clear explanation on what it does?
>
> This function transfers a CHECK() constraint to a filter expression by the
> following procedure.
> (1) Get the outer table's CHECK() constraint by using get_relation_constraints().
> (2) Walk through the expression tree obtained in (1) by using
>     expression_tree_mutator() with check_constraint_mutator(), and change
>     only the outer's Var node to the inner's one according to the join clause.
>
> For example, when the CHECK() constraint of table A is "num % 4 = 0" and
> the join clause between table A and B is "A.num = B.data", then we can get
> "B.data % 4 = 0" for filtering purposes.
>
> This also accepts a more complex join clause like "A.num = B.data * 2",
> then we can get "(B.data * 2) % 4 = 0".
>
> In procedure (2), to decide whether to use each join clause for changing a
> Var node or not, I implemented check_constraint_mutator() to judge whether
> the join clause is hash-joinable or not.

Thanks for the explanation. I think that the function has been made
considering only rather plain cases. We should put more thought into making
the logic clearer, so that we can define the desired/possible capability and
limitations clearly.

> Actually, I want to judge whether the OpExpr at the top of the join
> clause's expression tree means "=" or not, but I can't find how to do it.
>
> If you know how to do it, please let me know.

I don't see for now, too :p

But we at least should put more consideration on the mechanism to obtain the
expressions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, KaiGai-san and Horiguchi-san.

I created the v2 patch. Please find attached.
I believe this patch fixes most of the issues mentioned by Horiguchi-san,
except naming.

In this v2 patch, the scan node that is originally the inner relation of the
Join node must be a SeqScan (or SampleScan). This limitation comes from the
implementation of try_join_pushdown(), which copies Path nodes to attach the
new filtering conditions converted from CHECK() constraints.

It uses copyObject() for this purpose, so I would have to implement copy
functions for other scan Path nodes like IndexPath, BitmapHeapPath, TidPath
and so on.

By the way, let me introduce the performance of this feature.
Here are the results I got in my environment.
These results were obtained by running "pushdown_test.v1.large.sql" with
"work_mem" set to '1536kB'. (This file is also attached to this mail.)

[Normal]
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1851.02..14638.11 rows=300004 width=20) (actual time=88.188..453.926 rows=299992 loops=1)
   Hash Cond: (check_test_div.id = inner_t.id)
   ->  Append  (cost=0.00..4911.03 rows=300004 width=20) (actual time=0.089..133.456 rows=300003 loops=1)
         ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.085..40.741 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.023..29.213 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.021..28.592 rows=100001 loops=1)
   ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual time=87.970..87.970 rows=60001 loops=1)
         Buckets: 32768  Batches: 2  Memory Usage: 1446kB
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (actual time=0.030..39.133 rows=60001 loops=1)
 Planning time: 0.867 ms
 Execution time: 470.269 ms
(12 rows)

[With this feature]
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.01..10651.37 rows=300004 width=20) (actual time=55.548..377.615 rows=299992 loops=1)
   ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual time=0.017..0.017 rows=0 loops=1)
         Hash Cond: (inner_t.id = check_test_div.id)
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never executed)
         ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1)
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=55.530..149.205 rows=100001 loops=1)
         Hash Cond: (check_test_div_0.id = inner_t.id)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.058..34.268 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=55.453..55.453 rows=20001 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.031..43.590 rows=20001 loops=1)
                     Filter: ((id % 3) = 0)
                     Rows Removed by Filter: 40000
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.942..97.582 rows=99996 loops=1)
         Hash Cond: (check_test_div_1.id = inner_t.id)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.030..25.514 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.890..27.890 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.014..21.688 rows=20000 loops=1)
                     Filter: ((id % 3) = 1)
                     Rows Removed by Filter: 40001
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.651..97.755 rows=99995 loops=1)
         Hash Cond: (check_test_div_2.id = inner_t.id)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.026..25.620 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.599..27.599 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.017..21.307 rows=20000 loops=1)
                     Filter: ((id % 3) = 2)
                     Rows Removed by Filter: 40001
 Planning time: 1.876 ms
 Execution time: 394.007 ms
(33 rows)

The value of "Batches" is 2 on the Hash node in the normal case, but these
values are 1 on all Hash nodes with this feature.

This means that the hash table is not split because of this feature.

Therefore, PostgreSQL with this feature is faster than the normal one in this
case (470.269 ms normal vs. 394.007 ms with this feature).

I think this point is a large benefit of this feature.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Thursday, October 15, 2015 10:21 AM
To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Thursday, October 08, 2015 5:28 PM
> To: Kyotaro HORIGUCHI
> Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello, Horiguchi-san.
>
> Thank you for your comment.
>
> > I got some warnings on compilation about unused variables and a wrong
> > argument type.
>
> OK, I'll fix it.
>
> > I failed to have a query that this patch works on.
> > Could you let me have some specific example for this patch?
>
> Please find attached.
> And also make sure that the setting of work_mem is '64kB' (not 64MB).
>
> If work_mem is large enough to create the hash table for the relation after
> appending, its cost may be better than the pushed-down plan's cost, and
> then the planner will not choose the pushed-down plan this patch makes.
> So, to make this patch work, the work_mem size must be smaller than the
> hash table size for the relation after appending.
>
> > This patch needs more comments. Please put comments about not only
> > what it does but also the reason and other things for it.
>
> OK, I'll put more comments in the code.
> But it will take a long time, maybe...
>
People (including me) can help. Even though your English capability is not
enough, it is significant to put the intention of the code.

> > -- about namings
> >
> > Names for functions and variables need to be more appropriate,
> > in other words, need to be what properly informs what they are.
> > The followings are examples of such names.
>
> Thank you for your suggestion.
>
> I also think these names are not so good.
> I'll try to make the names better, but it may take a long time...
> Of course, I will use your suggestions as reference.
>
> > "added_restrictlist"'s widely distributed as many function arguments
> > and JoinPathExtraData makes me feel dizzy..
>
> "added_restrictinfo" will be deleted from almost all functions other than
> try_join_pushdown() in the next (v2) patch because the place of filtering
> using this info will be changed from the Join node to the Scan node, so
> there is no need to place it anywhere other than try_join_pushdown().
>
This restrictinfo intends to filter out obviously unrelated rows in this
join, due to the check constraint of the other side of the join.
So, a correct but redundant name is:
  restrictlist_to_drop_unrelated_rows_because_of_check_constraint

How about 'restrictlist_by_constraint' instead?
> > In make_restrictinfos_from_check_constr, the function returns
> > modified constraint predicates corresponding to Vars under hashjoinable
> > join clauses. I don't think expression_tree_mutator is suitable to
> > do that, since it could allow an unwanted result when constraint
> > predicates or join clauses are not simple OpExpr's.
>
> Do you have any example of this situation?
> I am trying to find the unwanted results you mentioned, but I don't have
> any results at this time. I have a hunch that it will allow unwanted
> results, because I have thought only about a very simple situation for
> this function.
>
check_constraint_mutator makes the modified restrictlist by replacing a Var
node only when the join clause is hash-joinable.
That implies an <expr> = <expr> form, thus we can safely replace the
expression by the other side.

Of course, we still have cases where we cannot replace expressions simply.
- If a function (or a function called by an operator) has the volatile
  attribute (who uses a volatile function in a CHECK constraint for
  partitioning?)
- If it is uncertain whether the expression always returns the same result.
  (is it possible for the constraint to contain a SubLink?)

I'd like to suggest a white-list approach in this mutator routine.
It means that only immutable expression nodes are allowed into the modified
restrictlist.

The thing to do is:

check_constraint_mutator(...)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, Var))
    {
        :
    }
    else if (node is not obviously immutable)
    {
        /* prohibit making the expression if it contains an uncertain node */
        context->is_mutated = false;
    }
    return expression_tree_mutator(...);
}

> > Otherwise could you give me clear explanation on what it does?
>
> This function transfers a CHECK() constraint to a filter expression by the
> following procedure.
> (1) Get the outer table's CHECK() constraint by using get_relation_constraints().
> (2) Walk through the expression tree obtained in (1) by using
>     expression_tree_mutator() with check_constraint_mutator(), and change
>     only the outer's Var node to the inner's one according to the join clause.
>
> For example, when the CHECK() constraint of table A is "num % 4 = 0" and
> the join clause between table A and B is "A.num = B.data", then we can get
> "B.data % 4 = 0" for filtering purposes.
>
> This also accepts a more complex join clause like "A.num = B.data * 2",
> then we can get "(B.data * 2) % 4 = 0".
>
> In procedure (2), to decide whether to use each join clause for changing a
> Var node or not, I implemented check_constraint_mutator() to judge whether
> the join clause is hash-joinable or not.
>
> Actually, I want to judge whether the OpExpr at the top of the join
> clause's expression tree means "=" or not, but I can't find how to do it.
>
> If you know how to do it, please let me know.
>
> Thanks,

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Sent: Tuesday, October 06, 2015 8:35 PM
> To: tai-kondo@yk.jp.nec.com
> Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello.
>
> I tried to read this and had some random comments on this.
>
> -- general
>
> I got some warnings on compilation about unused variables and a wrong
> argument type.
>
> I failed to have a query that this patch works on. Could you let me
> have some specific example for this patch?
>
> This patch needs more comments. Please put comments about not only what
> it does but also the reason and other things for it.
>
> -- about namings
>
> Names for functions and variables need to be more appropriate,
> in other words, need to be what properly informs what they are. The
> followings are examples of such names.
> "added_restrictlist"'s being widely distributed as many function arguments
> and in JoinPathExtraData makes me feel dizzy.. create_mergejoin_path
> takes it as "filtering_clauses", which looks far better.
>
> try_join_pushdown() is also a name with a much wider meaning. This
> patch tries to move hashjoins on an inheritance parent to under the append
> paths. It could be generically called 'pushdown', but this would better be
> called something like 'transform appended hashjoin' or 'hashjoin
> distribution'. The latter would be better.
> (The function name would be try_distribute_hashjoin for that case.)
>
> The name make_restrictinfos_from_check_constr() also tells me a wrong
> thing. For example, extract_constraints_for_hashjoin_distribution() would
> inform me about what it returns.
>
> -- about what make_restrictinfos_from_check_constr() does
>
> In make_restrictinfos_from_check_constr, the function returns modified
> constraint predicates corresponding to Vars under hashjoinable join
> clauses. I don't think expression_tree_mutator is suitable to do that,
> since it could allow an unwanted result when constraint predicates or join
> clauses are not simple OpExpr's.
>
> Could you try a more simple and straight-forward way to do that?
> Otherwise could you give me a clear explanation of what it does?
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
Attachment
Hi,

I put my comments on the patch as follows.

Overall comments
----------------
* I think the enhancement in copyfuncs.c should be a separate patch; that is
  a more graceful manner. At this moment, there are fewer than 20
  Path-derived type definitions. It is much easier work than the entire Plan
  node support we did recently. (How about other folks' opinions?)
* Can you integrate the attached test cases as a regression test? It is a
  more generic way, and allows people to detect problems if the relevant
  feature gets broken in future updates.
* The naming "join pushdown" is a bit misleading because other components
  also use this term, for a different purpose. I'd like to suggest
  try_pullup_append_across_join. Any ideas from native English speakers?

Patch review
------------
At try_join_pushdown:

+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }

It checks whether the cheapest_total_path is an AppendPath at the head of
this function. It ought to be a loop walking the pathlist of the RelOptInfo,
because multiple path-nodes might still be alive that are not the
cheapest_total_path.

+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if at least one of the path-nodes
within the pathlist of inner_rel is supported.

+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to introduce why this feature uses ParamPathInfos here.
It seems to me a good hack to attach additional qualifiers on the underlying
inner scan node, even if it is not a direct child of the inner relation.
However, people may have different opinions.
+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                    RelOptInfo *outer_rel)
+{
+   Index       parent_relid =
+       find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List       *clauses_parent = get_actual_clauses(join_clauses);
+   List       *clauses_child = NIL;
+   ListCell   *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node   *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child);
+}

Is ChangeVarNodes() the right routine to replace Var nodes of the parent
relation with the relevant Var nodes of the child relation?
It may look sufficient; however, nobody can ensure the varattno of the child
relation is identical to the parent relation's one.
For example, which attribute number shall be assigned to 'z' here?

  CREATE TABLE tbl_parent(x int);
  CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
  ALTER TABLE tbl_parent ADD COLUMN z int;

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
            /*
             * Ignore child members unless they match the rel being
             * sorted.
+            *
+            * If this is called from make_sort_from_pathkeys(),
+            * relids may be NULL. In this case, we must not ignore child
+            * members because the inner/outer plan of a pushed-down merge
+            * join is always a child table.
             */
-           if (em->em_is_child &&
+           if (relids != NULL &&
+               em->em_is_child &&
                !bms_equal(em->em_relids, relids))
                continue;

It is a little bit hard to understand why this modification is needed.
Could you add a source code comment that focuses on the reason why?
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo > Sent: Wednesday, October 21, 2015 8:07 PM > To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI > Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博) > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > Hello, KaiGai-san and Horiguchi-san. > > I created v2 patch. Please find attached. > I believe this patch will fix the most of issues mentioned by > Horiguchi-san except naming. > > In this v2 patch, scan node which is originally inner relation of > Join node must be SeqScan (or SampleScan). This limitation is > due to implementation of try_join_pushdown(), which copies Path nodes > to attach new filtering conditions converted from CHECK() constraints. > > It uses copyObject() for this purpose, so I must implement copy functions > for scan Path nodes like IndexPath, BitmapHeapPath, TidPath and so on. > > > By the way, let me introduce the performance of this feature. > Here are the results I tested in my environment. > These results were got by "pushdown_test.v1.large.sql" > running on the environment that "work_mem" set to "1536kB". > (This file is also attached in this mail.) 
> > [Normal] > QUERY PLAN > ---------------------------------------------------------------------------- > --------------------------------------------------------- > Hash Join (cost=1851.02..14638.11 rows=300004 width=20) (actual > time=88.188..453.926 rows=299992 loops=1) > Hash Cond: (check_test_div.id = inner_t.id) > -> Append (cost=0.00..4911.03 rows=300004 width=20) (actual > time=0.089..133.456 rows=300003 loops=1) > -> Seq Scan on check_test_div (cost=0.00..0.00 rows=1 width=20) > (actual time=0.003..0.003 rows=0 loops=1) > -> Seq Scan on check_test_div_0 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.085..40.741 rows=100001 loops=1) > -> Seq Scan on check_test_div_1 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.023..29.213 rows=100001 loops=1) > -> Seq Scan on check_test_div_2 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.021..28.592 rows=100001 loops=1) > -> Hash (cost=866.01..866.01 rows=60001 width=8) (actual > time=87.970..87.970 rows=60001 loops=1) > Buckets: 32768 Batches: 2 Memory Usage: 1446kB > -> Seq Scan on inner_t (cost=0.00..866.01 rows=60001 width=8) > (actual time=0.030..39.133 rows=60001 loops=1) > Planning time: 0.867 ms > Execution time: 470.269 ms > (12 rows) > > [With this feature] > QUERY PLAN > ---------------------------------------------------------------------------- > --------------------------------------------------------- > Append (cost=0.01..10651.37 rows=300004 width=20) (actual > time=55.548..377.615 rows=299992 loops=1) > -> Hash Join (cost=0.01..1091.04 rows=1 width=20) (actual > time=0.017..0.017 rows=0 loops=1) > Hash Cond: (inner_t.id = check_test_div.id) > -> Seq Scan on inner_t (cost=0.00..866.01 rows=60001 width=8) (never > executed) > -> Hash (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 > rows=0 loops=1) > Buckets: 1024 Batches: 1 Memory Usage: 8kB > -> Seq Scan on check_test_div (cost=0.00..0.00 rows=1 > width=20) (actual time=0.002..0.002 rows=0 loops=1) > -> 
Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual > time=55.530..149.205 rows=100001 loops=1) > Hash Cond: (check_test_div_0.id = inner_t.id) > -> Seq Scan on check_test_div_0 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.058..34.268 rows=100001 loops=1) > -> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual > time=55.453..55.453 rows=20001 loops=1) > Buckets: 32768 (originally 1024) Batches: 1 (originally 1) > Memory Usage: 1038kB > -> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) > (actual time=0.031..43.590 rows=20001 loops=1) > Filter: ((id % 3) = 0) > Rows Removed by Filter: 40000 > -> Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual > time=27.942..97.582 rows=99996 loops=1) > Hash Cond: (check_test_div_1.id = inner_t.id) > -> Seq Scan on check_test_div_1 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.030..25.514 rows=100001 loops=1) > -> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual > time=27.890..27.890 rows=20000 loops=1) > Buckets: 32768 (originally 1024) Batches: 1 (originally 1) > Memory Usage: 1038kB > -> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) > (actual time=0.014..21.688 rows=20000 loops=1) > Filter: ((id % 3) = 1) > Rows Removed by Filter: 40001 > -> Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual > time=27.651..97.755 rows=99995 loops=1) > Hash Cond: (check_test_div_2.id = inner_t.id) > -> Seq Scan on check_test_div_2 (cost=0.00..1637.01 rows=100001 > width=20) (actual time=0.026..25.620 rows=100001 loops=1) > -> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual > time=27.599..27.599 rows=20000 loops=1) > Buckets: 32768 (originally 1024) Batches: 1 (originally 1) > Memory Usage: 1038kB > -> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) > (actual time=0.017..21.307 rows=20000 loops=1) > Filter: ((id % 3) = 2) > Rows Removed by Filter: 40001 > Planning time: 1.876 ms > Execution time: 394.007 ms > (33 rows) > > > The value of 
"Batches" is 2 on the Hash node in the normal case,
> but these values are 1 on all Hash nodes with this feature.
>
> This means that the hash table is not split because of this feature.
>
> Therefore, PostgreSQL with this feature is faster than the normal one in
> this case.
> (470.269 ms @ normal vs 394.007 ms @ this feature)
>
> I think this point is a large benefit of this feature.
>
> Best regards,
> --
> Taiki Kondo
>
> NEC Solution Innovators, Ltd.
>
> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
> Sent: Thursday, October 15, 2015 10:21 AM
> To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
> Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> > Sent: Thursday, October 08, 2015 5:28 PM
> > To: Kyotaro HORIGUCHI
> > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hello, Horiguchi-san.
> >
> > Thank you for your comment.
> >
> > > I got some warnings on compilation about unused variables and a wrong
> > > argument type.
> >
> > OK, I'll fix it.
> >
> > > I failed to have a query that this patch works on. Could you let me
> > > have some specific example for this patch?
> >
> > Please find attached.
> > And also make sure that the setting of work_mem is '64kB' (not 64MB).
> >
> > If work_mem is large enough to create the hash table for the relation
> > after appending, its cost may be better than the pushed-down plan's cost,
> > and then the planner will not choose the pushed-down plan this patch
> > makes.
> > So, to make this patch work, the work_mem size must be smaller
> > than the hash table size for the relation after appending.
> >
> > > This patch needs more comments. Please put comments about not only
> > > what it does but also the reason and other things for it.
> >
> > OK, I'll put more comments in the code.
> > But it will take a long time, maybe...
> >
> People (including me) can help. Even though your English capability is not
> enough, it is significant to put the intention of the code.
>
> > > -- about namings
> > >
> > > Names for functions and variables need to be more appropriate,
> > > in other words, need to be what properly informs what they are.
> > > The followings are examples of such names.
> >
> > Thank you for your suggestion.
> >
> > I also think these names are not so good.
> > I'll try to make the names better, but it may take a long time...
> > Of course, I will use your suggestions as reference.
> >
> > > "added_restrictlist"'s widely distributed as many function arguments
> > > and JoinPathExtraData makes me feel dizzy..
> >
> > "added_restrictinfo" will be deleted from almost all functions other than
> > try_join_pushdown() in the next (v2) patch because the place of filtering
> > using this info will be changed from the Join node to the Scan node, so
> > there is no need to place it anywhere other than try_join_pushdown().
> >
> This restrictinfo intends to filter out obviously unrelated rows in this
> join, due to the check constraint of the other side of the join.
> So, a correct but redundant name is:
>   restrictlist_to_drop_unrelated_rows_because_of_check_constraint
>
> How about 'restrictlist_by_constraint' instead?
>
> > > In make_restrictinfos_from_check_constr, the function returns
> > > modified constraint predicates corresponding to Vars under hashjoinable
> > > join clauses. I don't think expression_tree_mutator is suitable to
> > > do that, since it could allow an unwanted result when constraint
> > > predicates or join clauses are not simple OpExpr's.
> >
> > Do you have any example of this situation?
> > I am trying to find the unwanted results you mentioned, but I don't have
> > any results at this time. I have a hunch that it will allow unwanted
> > results, because I have thought only about a very simple situation for
> > this function.
> >
> check_constraint_mutator makes the modified restrictlist by replacing a Var
> node only when the join clause is hash-joinable.
> That implies an <expr> = <expr> form, thus we can safely replace the
> expression by the other side.
>
> Of course, we still have cases where we cannot replace expressions simply.
> - If a function (or a function called by an operator) has the volatile
>   attribute (who uses a volatile function in a CHECK constraint for
>   partitioning?)
> - If it is uncertain whether the expression always returns the same result.
>   (is it possible for the constraint to contain a SubLink?)
>
> I'd like to suggest a white-list approach in this mutator routine.
> It means that only immutable expression nodes are allowed into the modified
> restrictlist.
>
> The thing to do is:
>
> check_constraint_mutator(...)
> {
>     if (node == NULL)
>         return NULL;
>     if (IsA(node, Var))
>     {
>         :
>     }
>     else if (node is not obviously immutable)
>     {
>         /* prohibit making the expression if it contains an uncertain node */
>         context->is_mutated = false;
>     }
>     return expression_tree_mutator(...);
> }
>
> > > Otherwise could you give me clear explanation on what it does?
> >
> > This function transfers a CHECK() constraint to a filter expression by
> > the following procedure.
> > (1) Get the outer table's CHECK() constraint by using get_relation_constraints().
> > (2) Walk through the expression tree obtained in (1) by using
> >     expression_tree_mutator() with check_constraint_mutator(), and change
> >     only the outer's Var node to the inner's one according to the join clause.
> >
> > For example, when the CHECK() constraint of table A is "num % 4 = 0" and
> > the join clause between table A and B is "A.num = B.data", then we can
> > get "B.data % 4 = 0" for filtering purposes.
> >
> > This also accepts a more complex join clause like "A.num = B.data * 2",
> > then we can get "(B.data * 2) % 4 = 0".
> > > > In procedure (2), to decide whether to use each join clause for > > changing Var node or not, I implement check_constraint_mutator() to > > judge whether join clause is hash-joinable or not. > > > > Actually, I want to judge whether OpExpr as top expression tree of > > join clause means "=" or not, but I can't find how to do it. > > > > If you know how to do it, please let me know. > > > > > Thanks, > -- > NEC Business Creation Division / PG-Strom Project KaiGai Kohei > <kaigai@ak.jp.nec.com> > > > > -----Original Message----- > > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > > Sent: Tuesday, October 06, 2015 8:35 PM > > To: tai-kondo@yk.jp.nec.com > > Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; > > pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > > > Hello. > > > > I tried to read this and had some random comments on this. > > > > -- general > > > > I got some warning on compilation on unused variables and wrong arguemtn type. > > > > I failed to have a query that this patch works on. Could you let me > > have some specific example for this patch? > > > > This patch needs more comments. Please put comment about not only what > > it does but also the reason and other things for it. > > > > > > -- about namings > > > > Names for functions and variables are needed to be more appropriate, > > in other words, needed to be what properly informs what they are. The > > followings are the examples of such names. > > > > "added_restrictlist"'s widely distributed as many function arguemnts > > and JoinPathExtraData makes me feel dizzy.. create_mergejoin_path > > takes it as "filtering_clauses", which looks far better. > > > > try_join_pushdown() is also the name with much wider meaning. This > > patch tries to move hashjoins on inheritance parent to under append > > paths. 
It could be generically called 'pushdown' > > but this would be better be called such like 'transform appended > > hashjoin' or 'hashjoin distribution'. The latter would be better. > > (The function name would be try_distribute_hashjoin for the > > case.) > > > > The name make_restrictinfos_from_check_contr() also tells me wrong > > thing. For example, > > extract_constraints_for_hashjoin_distribution() would inform me about > > what it returns. > > > > > > -- about what make_restrictinfos_from_check_constr() does > > > > In make_restrictinfos_from_check_constr, the function returns modified > > constraint predicates correspond to vars under hashjoinable join > > clauses. I don't think expression_tree_mutator is suitable to do that > > since it could allow unwanted result when constraint predicates or join clauses > are not simple OpExpr's. > > > > Could you try more simple and straight-forward way to do that? > > Otherwise could you give me clear explanation on what it does? > > > > regards, > > > > -- > > Kyotaro Horiguchi > > NTT Open Source Software Center > > > > > >
Hello, KaiGai-san.

Thank you for your reply, and sorry for late response.

I created v3 patch for this feature, and v1 patch for regression tests.
Please find attached.

Reply for your comments is below.

> Overall comments
> ----------------
> * I think the enhancement in copyfuncs.c shall be in the separate
>   patch; it is more graceful manner. At this moment, here is less
>   than 20 Path delivered type definition. It is much easier works
>   than entire Plan node support as we did recently.
>   (How about other folk's opinion?)

I also would like to wait for other fork's opinion.
So I don't divide this part from this patch yet.

> * Can you integrate the attached test cases as regression test?
>   It is more generic way, and allows people to detect problems
>   if relevant feature gets troubled in the future updates.

Ok, done. Please find attached.

> * Naming of "join pushdown" is a bit misleading because other
>   component also uses this term, but different purpose.
>   I'd like to suggest try_pullup_append_across_join.
>   Any ideas from native English speaker?

Thank you for your suggestion.

I change its name to "try_append_pullup_accross_join",
which is matched with the order of the previous name.

However, this change is just temporary.
I also would like to wait for other fork's opinion for the naming.

> Patch review
> ------------
>
> At try_join_pushdown:
> + /* When specified outer path is not an AppendPath, nothing to do here. */
> + if (!IsA(outer_rel->cheapest_total_path, AppendPath))
> + {
> +     elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
> +     return;
> + }
> It checks whether the cheapest_total_path is AppendPath at the head
> of this function. It ought to be a loop to walk on the pathlist of
> RelOptInfo, because multiple path-nodes might be still alive
> but also might not be cheapest_total_path.

Ok, done.

> + switch (inner_rel->cheapest_total_path->pathtype)
> +
> Also, we can construct the new Append node if one of the path-node
> within pathlist of inner_rel are at least supported.

Done.
But, this change will create nested loop between inner_rel's pathlist
and outer_rel's pathlist. It means that planning time is increased more.

I think it is adequate to check only for cheapest_total_path
because checking only for cheapest_total_path is implemented in other
parts, like make_join_rel().

How about your (and also other people's) opinion?

> + if (list_length(inner_rel->ppilist) > 0)
> + {
> +     elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
> +     return;
> + }
> +
> You may need to introduce why this feature uses ParamPathInfos here.
> It seems to me a good hack to attach additional qualifiers on
> the underlying inner scan node, even if it is not a direct child of
> inner relation.
> However, people may have different opinion.

Ok, added comment in source.
Please find from attached patch.

> +static List *
> +convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
> +                                    RelOptInfo *outer_rel)
> +{
> +    Index     parent_relid =
> +        find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
> +    List     *clauses_parent = get_actual_clauses(join_clauses);
> +    List     *clauses_child = NIL;
> +    ListCell *lc;
> +
> +    foreach(lc, clauses_parent)
> +    {
> +        Node *one_clause_child = (Node *) copyObject(lfirst(lc));
> +
> +        ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
> +        clauses_child = lappend(clauses_child, one_clause_child);
> +    }
> +
> +    return make_restrictinfos_from_actual_clauses(root, clauses_child);
> +}
>
> Is ChangeVarNodes() right routine to replace var-node of parent relation
> by relevant var-node of child relation?
> It may look sufficient, however, nobody can ensure varattno of child
> relation is identical with parent relation's one.
> For example, which attribute number shall be assigned on 'z' here?
> CREATE TABLE tbl_parent(x int);
> CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
> ALTER TABLE tbl_parent ADD COLUMN z int;

Maybe you're right, so I agree with you.
I use adjust_appendrel_attrs() instead of ChangeVarNodes() for this purpose.

> --- a/src/backend/optimizer/plan/createplan.c
> +++ b/src/backend/optimizer/plan/createplan.c
> @@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
>          /*
>           * Ignore child members unless they match the rel being
>           * sorted.
> +         *
> +         * If this is called from make_sort_from_pathkeys(),
> +         * relids may be NULL. In this case, we must not ignore child
> +         * members because inner/outer plan of pushed-down merge join is
> +         * always child table.
>           */
> -        if (em->em_is_child &&
> +        if (relids != NULL &&
> +            em->em_is_child &&
>              !bms_equal(em->em_relids, relids))
>              continue;
>
> It is a little bit hard to understand why this modification is needed.
> Could you add source code comment that focus on the reason why.

Ok, added comment in source.
Please find from attached patch.


Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Wednesday, October 21, 2015 8:07 PM
> To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI
> Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博)
> Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hello, KaiGai-san and Horiguchi-san.
>
> I created v2 patch. Please find attached.
> I believe this patch will fix the most of issues mentioned by
> Horiguchi-san except naming.
>
> In this v2 patch, scan node which is originally inner relation of Join
> node must be SeqScan (or SampleScan). This limitation is due to
> implementation of try_join_pushdown(), which copies Path nodes to
> attach new filtering conditions converted from CHECK() constraints.
>
> It uses copyObject() for this purpose, so I must implement copy
> functions for scan Path nodes like IndexPath, BitmapHeapPath, TidPath and so on.
>
>
> By the way, let me introduce the performance of this feature.
> Here are the results I tested in my environment.
> These results were got by "pushdown_test.v1.large.sql"
> running on the environment that "work_mem" set to "1536kB".
> (This file is also attached in this mail.)
>
> [Normal]
>                                   QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=1851.02..14638.11 rows=300004 width=20) (actual time=88.188..453.926 rows=299992 loops=1)
>    Hash Cond: (check_test_div.id = inner_t.id)
>    ->  Append  (cost=0.00..4911.03 rows=300004 width=20) (actual time=0.089..133.456 rows=300003 loops=1)
>          ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
>          ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.085..40.741 rows=100001 loops=1)
>          ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.023..29.213 rows=100001 loops=1)
>          ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.021..28.592 rows=100001 loops=1)
>    ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual time=87.970..87.970 rows=60001 loops=1)
>          Buckets: 32768  Batches: 2  Memory Usage: 1446kB
>          ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (actual time=0.030..39.133 rows=60001 loops=1)
>  Planning time: 0.867 ms
>  Execution time: 470.269 ms
> (12 rows)
>
> [With this feature]
>                                   QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Append  (cost=0.01..10651.37 rows=300004 width=20) (actual time=55.548..377.615 rows=299992 loops=1)
>    ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual time=0.017..0.017 rows=0 loops=1)
>          Hash Cond: (inner_t.id = check_test_div.id)
>          ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never executed)
>          ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
>                Buckets: 1024  Batches: 1  Memory Usage: 8kB
>                ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1)
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=55.530..149.205 rows=100001 loops=1)
>          Hash Cond: (check_test_div_0.id = inner_t.id)
>          ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.058..34.268 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=55.453..55.453 rows=20001 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.031..43.590 rows=20001 loops=1)
>                      Filter: ((id % 3) = 0)
>                      Rows Removed by Filter: 40000
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.942..97.582 rows=99996 loops=1)
>          Hash Cond: (check_test_div_1.id = inner_t.id)
>          ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.030..25.514 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.890..27.890 rows=20000 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.014..21.688 rows=20000 loops=1)
>                      Filter: ((id % 3) = 1)
>                      Rows Removed by Filter: 40001
>    ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.651..97.755 rows=99995 loops=1)
>          Hash Cond: (check_test_div_2.id = inner_t.id)
>          ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.026..25.620 rows=100001 loops=1)
>          ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.599..27.599 rows=20000 loops=1)
>                Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
>                ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.017..21.307 rows=20000 loops=1)
>                      Filter: ((id % 3) = 2)
>                      Rows Removed by Filter: 40001
>  Planning time: 1.876 ms
>  Execution time: 394.007 ms
> (33 rows)
>
>
> The value of "Batches" is 2 on Hash node in normal, but these values
> are 1 on all Hash nodes in "with this feature".
>
> This means that the hash table is not split because of this feature.
>
> Therefore, PostgreSQL with this feature is faster than the normal one
> in this case.
> (470.269 ms @ normal vs 394.007 ms @ this feature)
>
> I think this point is large benefit of this feature.
>
>
> Best regards,
> --
> Taiki Kondo
>
> NEC Solution Innovators, Ltd.
>
>
> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
> Sent: Thursday, October 15, 2015 10:21 AM
> To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
> Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
> > Sent: Thursday, October 08, 2015 5:28 PM
> > To: Kyotaro HORIGUCHI
> > Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
> > pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown
> >
> > Hello, Horiguchi-san.
> >
> > Thank you for your comment.
> >
> > > I got some warning on compilation on unused variables and wrong
> > > arguemtn type.
> >
> > OK, I'll fix it.
> >
> > > I failed to have a query that this patch works on. Could you let
> > > me have some specific example for this patch?
> >
> > Please find attached.
> > And also make sure that setting of work_mem is '64kB' (not 64MB).
> >
> > If there is the large work_mem enough to create hash table for
> > relation after appending, its cost may be better than pushed-down
> > plan's cost, then planner will not choose pushed-down plan this patch makes.
> > So, to make this patch working fine, work_mem size must be smaller
> > than the hash table size for relation after appending.
> >
> > > This patch needs more comments.
Please put comment about not only > > > what it does but also the reason and other things for it. > > > > OK, I'll put more comments in the code. > > But it will take a long time, maybe... > > > People (including me) can help. Even though your English capability is > not enough, it is significant to put intention of the code. > > > > -- about namings > > > > > > Names for functions and variables are needed to be more > > > appropriate, in other words, needed to be what properly informs what they are. > > > The followings are the examples of such names. > > > > Thank you for your suggestion. > > > > I also think these names are not good much. > > I'll try to make names better , but it maybe take a long time... > > Of course, I will use your suggestion as reference. > > > > > "added_restrictlist"'s widely distributed as many function > > > arguemnts and JoinPathExtraData makes me feel dizzy.. > > > > "added_restrictinfo" will be deleted from almost functions other > > than > > try_join_pushdown() in next (v2) patch because the place of > > filtering using this info will be changed from Join node to Scan > > node and not have to place it into other than try_join_pushdown(). > > > This restrictinfo intends to filter out obviously unrelated rows in > this join, due to the check constraint of other side of the join. > So, correct but redundant name is: > restrictlist_to_drop_unrelated_rows_because_of_check_constraint > > How about 'restrictlist_by_constraint' instead? > > > > In make_restrictinfos_from_check_constr, the function returns > > > modified constraint predicates correspond to vars under > > > hashjoinable join clauses. I don't think expression_tree_mutator > > > is suitable to do that since it could allow unwanted result when > > > constraint predicates or join clauses are not simple OpExpr's. > > > > Do you have any example of this situation? > > I am trying to find unwanted results you mentioned, but I don't have > > any results in this timing. 
> > I have a hunch that it will allow
> > unwanted results because I have thought only about very simple
> > situation for this function.
>
> check_constraint_mutator makes modified restrictlist with replacing Var
> node only when join clause is hash-joinable.
> It implies <expr> = <expr> form, thus we can safely replace the
> expression by the other side.
>
> Of course, we still have cases we cannot replace expressions simply.
> - If function (or function called by operators) has volatile attribute
>   (who use volatile function on CHECK constraint of partitioning?)
> - If it is uncertain whether expression returns always same result.
>   (is it possible to contain SubLink in the constraint?)
>
> I'd like to suggest to use white-list approach in this mutator routine.
> It means that only immutable expression node are allowed to include
> the modified restrictlist.
>
> Things to do is:
>
> check_constraint_mutator(...)
> {
>     if (node == NULL)
>         return NULL;
>     if (IsA(node, Var))
>     {
>         :
>     }
>     else if (node is not obviously immutable)
>     {
>         context->is_mutated = false;  <-- prohibit to make restrictlist
>     }                                     if expression contains uncertain node.
>     return expression_tree_mutator(...)
> }
>
> > > Otherwise could you give me clear explanation on what it does?
> >
> > This function transfers CHECK() constraint to filter expression by
> > following procedures.
> > (1) Get outer table's CHECK() constraint by using get_relation_constraints().
> > (2) Walk through expression tree got in (1) by using expression_tree_mutator()
> >     with check_constraint_mutator() and change only outer's Var node to
> >     inner's one according to join clause.
> >
> > For example, when CHECK() constraint of table A is "num % 4 = 0" and
> > join clause between table A and B is "A.num = B.data", then we can
> > get "B.data % 4 = 0" for filtering purpose.
> >
> > This also accepts more complex join clause like "A.num = B.data * 2",
> > then we can get "(B.data * 2) % 4 = 0".
> > > > In procedure (2), to decide whether to use each join clause for > > changing Var node or not, I implement check_constraint_mutator() to > > judge whether join clause is hash-joinable or not. > > > > Actually, I want to judge whether OpExpr as top expression tree of > > join clause means "=" or not, but I can't find how to do it. > > > > If you know how to do it, please let me know. > > > > > Thanks, > -- > NEC Business Creation Division / PG-Strom Project KaiGai Kohei > <kaigai@ak.jp.nec.com> > > > > -----Original Message----- > > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > > Sent: Tuesday, October 06, 2015 8:35 PM > > To: tai-kondo@yk.jp.nec.com > > Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; > > pgsql-hackers@postgresql.org > > Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown > > > > Hello. > > > > I tried to read this and had some random comments on this. > > > > -- general > > > > I got some warning on compilation on unused variables and wrong arguemtn type. > > > > I failed to have a query that this patch works on. Could you let me > > have some specific example for this patch? > > > > This patch needs more comments. Please put comment about not only > > what it does but also the reason and other things for it. > > > > > > -- about namings > > > > Names for functions and variables are needed to be more appropriate, > > in other words, needed to be what properly informs what they are. > > The followings are the examples of such names. > > > > "added_restrictlist"'s widely distributed as many function arguemnts > > and JoinPathExtraData makes me feel dizzy.. create_mergejoin_path > > takes it as "filtering_clauses", which looks far better. > > > > try_join_pushdown() is also the name with much wider meaning. This > > patch tries to move hashjoins on inheritance parent to under append > > paths. 
It could be generically called 'pushdown' > > but this would be better be called such like 'transform appended > > hashjoin' or 'hashjoin distribution'. The latter would be better. > > (The function name would be try_distribute_hashjoin for the > > case.) > > > > The name make_restrictinfos_from_check_contr() also tells me wrong > > thing. For example, > > extract_constraints_for_hashjoin_distribution() would inform me > > about what it returns. > > > > > > -- about what make_restrictinfos_from_check_constr() does > > > > In make_restrictinfos_from_check_constr, the function returns > > modified constraint predicates correspond to vars under hashjoinable > > join clauses. I don't think expression_tree_mutator is suitable to > > do that since it could allow unwanted result when constraint > > predicates or join clauses > are not simple OpExpr's. > > > > Could you try more simple and straight-forward way to do that? > > Otherwise could you give me clear explanation on what it does? > > > > regards, > > > > -- > > Kyotaro Horiguchi > > NTT Open Source Software Center > > > > > >
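The "Batches" improvement quoted above can be modeled outside the planner: each pushed-down HashJoin hashes only the inner rows that can match its partition. A minimal Python sketch, with hypothetical names (the real patch derives each per-partition filter from the partition's CHECK() constraint):

```python
def split_inner_by_partition_constraint(inner_rows, nparts, key):
    """Model of the pushed-down filter: partition k of the outer table has
    CHECK (key % nparts = k), so the hash join against that partition only
    needs inner rows whose join key satisfies key % nparts = k."""
    parts = [[] for _ in range(nparts)]
    for row in inner_rows:
        parts[key(row) % nparts].append(row)
    return parts

# 60001 inner rows as in the quoted plans, split across three partitions
# the way the "Filter: ((id % 3) = k)" conditions do.
inner = list(range(60001))
parts = split_inner_by_partition_constraint(inner, 3, lambda r: r)

# One hash table over all 60001 rows spilled to Batches: 2 in the [Normal]
# plan; each per-partition table holds only about a third of the rows, so
# each fits in work_mem and runs with Batches: 1.
sizes = [len(p) for p in parts]
```

The resulting sizes, 20001, 20000 and 20000 rows, match the rows= figures on the three quoted Hash nodes: each per-partition hash table holds roughly a third of the inner relation.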
Hello, sorry for late response and thank you for the new patch.

At Fri, 20 Nov 2015 12:05:38 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote in <12A9442FBAE80D4E8953883E0B84E08863F115@BPXM01GP.gisp.nec.co.jp>

> I created v3 patch for this feature, and v1 patch for regression tests.
> Please find attached.

I think I understood what you intend to do by substitute_node_with_join_cond. It replaces *all* vars in the check constraint by corresponding expressions containing inner vars. If that is correct, it is predictable whether the check condition can be successfully transformed.

In addition, substitute_node_with_join_cond uses whichever side of the join clauses matches the target var, which is somewhat different from what exactly should be done, even if it works correctly.

For those conditions, substitute_node_with_join_cond and create_rinfo_from_check_constr could be simpler and clearer as following. This refactored code also makes clear what the function does, I believe.

====
create_rinfo_from_check_constr(...)
{
    pull_varattnos(check_constr, outer_rel->relid, &chkattnos);
    replacements = extract_replacements(joininfo, outer_rel->relid, &joinattnos);

    /*
     * exit if the join clauses cannot replace all vars in the check
     * constraint
     */
    if (!bms_is_subset(chkattnos, joinattnos))
        return NULL;

    foreach(lc, check_constr)
    {
        result = lappend(result, expression_tree_mutator(...));
    }
}
====

The attached patch does this. What do you think about this refactoring?

> Reply for your comments is below.
>
> > Overall comments
> > ----------------
> > * I think the enhancement in copyfuncs.c shall be in the separate
> >   patch; it is more graceful manner. At this moment, here is less
> >   than 20 Path delivered type definition. It is much easier works
> >   than entire Plan node support as we did recently.
> >   (How about other folk's opinion?)
>
> I also would like to wait for other fork's opinion.
> So I don't divide this part from this patch yet.

Other fork? It's Me?
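The shape of that refactoring (pull the set of columns the check constraint uses, pull the set the join clauses can replace, and give up unless the former is a subset of the latter) can be sketched in Python. All names below are hypothetical; the real code works on bitmapsets of varattnos:

```python
def rewrite_check_constraint(check_cols, check_fn, join_clauses):
    """Rewrite a CHECK constraint over outer columns into an inner-side filter.

    join_clauses maps each outer column to the inner expression it is equated
    with (a stand-in for extract_replacements()); the subset test models the
    bms_is_subset() early exit: if any constraint column has no equality
    clause, the constraint cannot be transferred, and None is returned.
    """
    replacements = dict(join_clauses)
    if not set(check_cols) <= set(replacements):
        return None
    return lambda inner_row: check_fn(
        {col: replacements[col](inner_row) for col in check_cols})

# CHECK (id % 4 = 0) with join clause outer.id = inner.key:
filt = rewrite_check_constraint(
    ["id"], lambda v: v["id"] % 4 == 0, [("id", lambda r: r["key"])])

# A constraint that also mentions a column with no join clause cannot be
# rewritten, mirroring the early return NULL:
failed = rewrite_check_constraint(
    ["id", "ts"], lambda v: True, [("id", lambda r: r["key"])])
```

Here `filt` accepts inner rows whose key is a multiple of 4 and rejects the rest, while `failed` is None because the `ts` column has no replacement.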
_copyPathFields is independent from all other parts of this patch and looks to be a generic function. I prefer that such independent features be separate patches, too. > > At this moment, here is less > > than 20 Path delivered type definition. It is much easier works > > than entire Plan node support as we did recently. > > (How about other folk's opinion?) It should be doable, but I don't think we should provide all possible _copyPath*. Currently it looks to be enough to provide the function for at least the Paths listed in try_append_pullup_across_join as shown below, and others should not be added if they won't be used for now. T_SeqScan, T_SampleScan, T_IndexScan, T_IndexOnlyScan, T_BitmapHeapScan, T_TidScan, T_Gather I doubt that tid partitioning is used, but there's no reason to refuse to support it. By the way, would you add regression tests for these other types of path? > > * Can you integrate the attached test cases as regression test? > > It is more generic way, and allows people to detect problems > > if relevant feature gets troubled in the future updates. > > Ok, done. Please find attached. > > > * Naming of "join pushdown" is a bit misleading because other > > component also uses this term, but different purpose. > > I'd like to suggest try_pullup_append_across_join. > > Any ideas from native English speaker? > > Thank you for your suggestion. > > I change its name to "try_append_pullup_accross_join", > which is matched with the order of the previous name. > > However, this change is just temporary. > I also would like to wait for other fork's opinion > for the naming. > > > > Patch review > > ------------ > > > > At try_join_pushdown: > > + /* When specified outer path is not an AppendPath, nothing to do here. */ > > + if (!IsA(outer_rel->cheapest_total_path, AppendPath)) > > + { > > + elog(DEBUG1, "Outer path is not an AppendPath.
Do nothing."); > > + return; > > + } > > It checks whether the cheapest_total_path is AppendPath at the head > > of this function. It ought to be a loop to walk on the pathlist of > > RelOptInfo, because multiple path-nodes might be still alive > > but also might not be cheapest_total_path. > > > Ok, done. > > > + switch (inner_rel->cheapest_total_path->pathtype) > > + > > Also, we can construct the new Append node if one of the path-node > > within pathlist of inner_rel are at least supported. > > Done. > But, this change will create nested loop between inner_rel's pathlist > and outer_rel's pathlist. It means that planning time is increased more. > > I think it is adequate to check only for cheapest_total_path > because checking only for cheapest_total_path is implemented in other > parts, like make_join_rel(). > > How about your (and also other people's) opinion? > > > > + if (list_length(inner_rel->ppilist) > 0) > > + { > > + elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown."); > > + return; > > + } > > + > > You may need to introduce why this feature uses ParamPathInfos here. > > It seems to me a good hack to attach additional qualifiers on > > the underlying inner scan node, even if it is not a direct child of > > inner relation. > > However, people may have different opinion. > > Ok, added comment in source. > Please find from attached patch. I suppose that the term 'parameter' used here is strictly defined as the conditions and information about 'parameterized' paths, which relate to restrictions involving another relation. In contrast, the PPI added here contains something totally different from the parameters defined there, since it refers only to the relation itself; in other words, it is not a join condition. I think such kinds of restrictions should be added to baserestrictinfo in RelOptInfo. The condition derived from constraints can be simply added to it. It doesn't need any additional explanation to do so. Any opinions?
> > +static List * > > +convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses, > > + RelOptInfo *outer_rel) { > > + Index parent_relid = > > + find_childrel_appendrelinfo(root, outer_rel)->parent_relid; > > + List *clauses_parent = get_actual_clauses(join_clauses); > > + List *clauses_child = NIL; > > + ListCell *lc; > > + > > + foreach(lc, clauses_parent) > > + { > > + Node *one_clause_child = (Node *) copyObject(lfirst(lc)); > > + > > + ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0); > > + clauses_child = lappend(clauses_child, one_clause_child); > > + } > > + > > + return make_restrictinfos_from_actual_clauses(root, clauses_child); > > +} > > > > Is ChangeVarNodes() right routine to replace var-node of parent relation > > by relevant var-node of child relation? > > It may look sufficient, however, nobody can ensure varattno of child > > relation is identical with parent relation's one. > > For example, which attribute number shall be assigned on 'z' here? > > CREATE TABLE tbl_parent(x int); > > CREATE TABLE tbl_child(y int) INHERITS(tbl_parent); > > ALTER TABLE tbl_parent ADD COLUMN z int; > > Maybe you're right, so I agree with you. > I use adjust_appendrel_attrs() instead of ChangeVarNodes() > for this purpose. > > > > --- a/src/backend/optimizer/plan/createplan.c > > +++ b/src/backend/optimizer/plan/createplan.c > > @@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys, > > /* > > * Ignore child members unless they match the rel being > > * sorted. > > + * > > + * If this is called from make_sort_from_pathkeys(), > > + * relids may be NULL. In this case, we must not ignore child > > + * members because inner/outer plan of pushed-down merge join is > > + * always child table. 
> > */ > > - if (em->em_is_child && > > + if (relids != NULL && > > + em->em_is_child && > > !bms_equal(em->em_relids, relids)) > > continue; > > > > It is a little bit hard to understand why this modification is needed. > > Could you add source code comment that focus on the reason why. > > Ok, added comment in source. > Please find from attached patch. Could you show me an example that executes this code with relids == NULL? I don't know why prepare_sort_from_pathkeys would be called under such a condition with this patch, but the modification apparently changes the behavior of this function. I guess we should find another way to get the same result, or a more general explanation of the validity of doing so. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
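The coverage test in the refactoring sketch earlier in this message (pull_varattnos plus bms_is_subset) can be modeled in a few lines. The sketch below is only an analogy: Python sets stand in for bitmapsets, and the helper name and data shapes are hypothetical, not a proposed API.

```python
# Model of the coverage check: substitute vars in a CHECK constraint
# using hashjoinable join clauses only when *every* constrained var
# has a replacement expression; otherwise give up (return None).
def build_inner_filter(check_vars, join_map):
    # join_map maps an outer var to the inner-side expression it is
    # equated with by a join clause (the "replacements" list).
    if not set(check_vars) <= set(join_map):  # bms_is_subset analogue
        return None
    return {v: join_map[v] for v in check_vars}

# All constrained vars are covered by join clauses: substitution works.
ok = build_inner_filter({"id"}, {"id": "other.x"})
assert ok == {"id": "other.x"}

# A constrained var with no join-clause partner: no valid transform.
assert build_inner_filter({"id", "ts"}, {"id": "other.x"}) is None
```

The point of doing the subset test up front is exactly the comment in the C sketch: it makes the success or failure of the transformation predictable before any mutation is attempted.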
On Fri, Nov 20, 2015 at 9:05 PM, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote: > I created v3 patch for this feature, and v1 patch for regression tests. > Please find attached. > > [blah review and replies] > > Please find from attached patch. This new patch did not actually get a review, moved to next CF. -- Michael
On Tue, Dec 22, 2015 at 8:36 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Nov 20, 2015 at 9:05 PM, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote: >> I created v3 patch for this feature, and v1 patch for regression tests. >> Please find attached. >> >> [blah review and replies] >> >> Please find from attached patch. > > This new patch did not actually get a review, moved to next CF. I think this patch is doomed. Suppose you join A to B on A.x = B.y. The existence of a constraint on table A which says CHECK(P(x)) does not imply that only rows from y where P(y) is true will join. For example, suppose that x and y are numeric columns and P(x) is length(x::text) == 3. Then you could have 1 in one table and 1.0 in the other table; they join, but P(x) is true for one and false for the other. The fundamental problem is that equality according to the join operator need not mean equality for all purposes. 1 and 1.0, as numerics, are equal, but not the same. Since the whole patch is based on this idea, I believe that means this patch is dead in the water. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
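Robert's counterexample is easy to reproduce outside the planner. The sketch below is only an illustration, with Python's Decimal standing in for PostgreSQL's numeric: the two values are equal under the join operator, yet an arbitrary CHECK-style predicate distinguishes them.

```python
from decimal import Decimal

x = Decimal("1")    # value stored in table A
y = Decimal("1.0")  # value stored in table B

# The join condition A.x = B.y holds for these two rows...
assert x == y

# ...but the constraint predicate P(v) = (length(v::text) = 3)
# is false for one of the "equal" values and true for the other.
def p(v):
    return len(str(v)) == 3

assert not p(x)  # str(x) == "1"
assert p(y)      # str(y) == "1.0"
```

So a CHECK(P(x)) on a partition of A cannot in general be turned into a filter P(y) on B's side: equality according to the join operator does not make the values interchangeable inside every expression.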
On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For > example, suppose that x and y are numeric columns and P(x) is > length(x::text) == 3. Then you could have 1 in one table and 1.0 in > the table; they join, but P(x) is true for one and false for the > other. Fwiw, ages ago there was some talk about having a property on functions, "equality preserving" or something like that. If a function, or more likely a <function,operator> tuple, had this property set, then x op y => f(x) op f(y). This would be most useful for things like substring or hash functions, which would allow partial indexes or partition exclusion to be more generally useful. Of course then you really want <f,op1,op2> to indicate that "a op1 b => f(a) op2 f(b)" so you can handle things like <substring,lt,lte> so that "a < b => substring(a,n) <= substring(b,n)", and you need some way to represent the extra arguments to substring, and the whole thing became too complex and got dropped. But perhaps even a simpler property that only worked for equality and single-argument functions would be useful, since it would let us mark hash functions. Or perhaps we only need to mark the few functions that expose properties that don't affect equality, since I think there are actually very few of them. -- greg
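Greg's <f,op1,op2> property — "a op1 b => f(a) op2 f(b)" — can be sketched concretely for his substring example. This is only a model of the implication, not a proposed catalog representation, and the helper name is hypothetical:

```python
def substring(s, n):
    # models SQL substring(s from 1 for n)
    return s[:n]

# The triple <substring, <, <=> holds for strings: a < b implies
# substring(a, n) <= substring(b, n).  That is what would let a
# partition or partial index defined on a prefix still serve range
# predicates on the full column.
for a, b in [("apple", "banana"), ("abc", "abd"), ("ab", "abc")]:
    assert a < b
    assert substring(a, 2) <= substring(b, 2)
```

Note the implication is one-directional and the operator weakens from `<` to `<=` — that is exactly why a plain "equality preserving" flag is the simpler special case Greg falls back to.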
On Tue, Jan 19, 2016 at 7:59 AM, Greg Stark <stark@mit.edu> wrote: > On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> For >> example, suppose that x and y are numeric columns and P(x) is >> length(x::text) == 3. Then you could have 1 in one table and 1.0 in >> the table; they join, but P(x) is true for one and false for the >> other. > > Fwiw, ages ago there was some talk about having a property on > functions "equality preserving" or something like that. If a function, > or more likely a <function,operator> tuple had this property set then > x op y => f(x) op f(y). This would be most useful for things like > substring or hash functions which would allow partial indexes or > partition exclusion to be more generally useful. > > Of course then you really want <f,op1,op2> to indicate that "a op1 b > => f(a) op2 f(b)" so you can handle things like <substring,lt,lte > so > that "a < b => substring(a,n) <= substring(b,n)" and you need some way > to represent the extra arguments to substring and the whole thing > became too complex and got dropped. > > But perhaps even a simpler property that only worked for equality and > single-argument functions would be useful since it would let us mark > hash functions Or perhaps we only need to mark the few functions that > expose properties that don't affect equality since I think there are > actually very few of them. We could certainly mark operators that amount to testing binary equality as such, and this optimization could be used for join operators so marked. But I worry that would become a crutch, with people implementing optimizations that work for such operators and leaving numeric (for example) out in the cold. Of course, we could worry about such problems when and if they happen, and accept the idea of markings for now. However, I'm inclined to think that there's a better way to optimize the case Taiki Kondo and Kouhei Kagai are targeting. 
If we get declarative partitioning, an oft-requested feature that has been worked on by various people over the years and currently by Amit Langote, and specifically if we get hash partitioning, then we'll presumably use the hash function for the default operator class of the partitioning column's datatype to partition the table. Then, if we do a join against some other table and consider a hash join, we'll be using the same hash function on our side, and either the same operator or a compatible operator for some other datatype in the same opfamily on the other side. At that point, if we push down the join, we can add a filter on the inner side of the join that the hash value of the matching column has to map to the partition it's being joined against. And we don't get a recurrence of this problem in that case, because we're not dealing with an arbitrary predicate - we're dealing with a hash function whose equality semantics are defined to be compatible with the join operator. That approach works with any data type that has a default hash operator class, which covers pretty much everything anybody is likely to care about, including numeric. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
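The reason the hash-partitioning variant dodges the numeric problem can be sketched as follows. This is a toy model, not planner code: Python's built-in hash() plays the role of the default-opclass hash function, since it is likewise defined to agree with == across numeric types, and the tables are plain lists.

```python
from decimal import Decimal

NPARTS = 4

def part_of(v):
    # a hash function whose equality semantics match the join operator:
    # values that compare equal are guaranteed the same partition
    return hash(v) % NPARTS

outer = [Decimal("1"), Decimal("2.5"), Decimal("7")]     # partitioned table
inner = [Decimal("1.0"), Decimal("2.50"), Decimal("3")]  # other table

joined = []
for k in range(NPARTS):
    outer_k = [x for x in outer if part_of(x) == k]   # one partition
    inner_k = [y for y in inner if part_of(y) == k]   # pushed-down filter
    joined += [(x, y) for x in outer_k for y in inner_k if x == y]

# No match is lost: 1 joins 1.0 and 2.5 joins 2.50, because the hash
# filter never disagrees with the join operator about equality.
assert len(joined) == 2
```

Contrast this with the CHECK-constraint version: here the inner-side filter is derived from the same hash function that defined the partitions, so it cannot reject a row that the join operator would have accepted.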
> On Tue, Jan 19, 2016 at 7:59 AM, Greg Stark <stark@mit.edu> wrote: > > On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> For > >> example, suppose that x and y are numeric columns and P(x) is > >> length(x::text) == 3. Then you could have 1 in one table and 1.0 in > >> the table; they join, but P(x) is true for one and false for the > >> other. > > > > Fwiw, ages ago there was some talk about having a property on > > functions "equality preserving" or something like that. If a function, > > or more likely a <function,operator> tuple had this property set then > > x op y => f(x) op f(y). This would be most useful for things like > > substring or hash functions which would allow partial indexes or > > partition exclusion to be more generally useful. > > > > Of course then you really want <f,op1,op2> to indicate that "a op1 b > > => f(a) op2 f(b)" so you can handle things like <substring,lt,lte > so > > that "a < b => substring(a,n) <= substring(b,n)" and you need some way > > to represent the extra arguments to substring and the whole thing > > became too complex and got dropped. > > > > But perhaps even a simpler property that only worked for equality and > > single-argument functions would be useful since it would let us mark > > hash functions Or perhaps we only need to mark the few functions that > > expose properties that don't affect equality since I think there are > > actually very few of them. > > We could certainly mark operators that amount to testing binary > equality as such, and this optimization could be used for join > operators so marked. But I worry that would become a crutch, with > people implementing optimizations that work for such operators and > leaving numeric (for example) out in the cold. Of course, we could > worry about such problems when and if they happen, and accept the idea > of markings for now. 
However, I'm inclined to think that there's a > better way to optimize the case Taiki Kondo and Kouhei Kagai are > targeting. > It seems to me Greg's idea intends to reduce CPU cycles by replacing the operator in use. I don't deny that there are valuable scenarios there; however, we intend this feature to reduce the inner hash-table size, to fit GPU RAM for example. > If we get declarative partitioning, an oft-requested feature that has > been worked on by various people over the years and currently by Amit > Langote, and specifically if we get hash partitioning, then we'll > presumably use the hash function for the default operator class of the > partitioning column's datatype to partition the table. Then, if we do > a join against some other table and consider a hash join, we'll be > using the same hash function on our side, and either the same operator > or a compatible operator for some other datatype in the same opfamily > on the other side. At that point, if we push down the join, we can > add a filter on the inner side of the join that the hash value of the > matching column has to map to the partition it's being joined against. > And we don't get a recurrence of this problem in that case, because > we're not dealing with an arbitrary predicate - we're dealing with a > hash function whose equality semantics are defined to be compatible > with the join operator. > > That approach works with any data type that has a default hash > operator class, which covers pretty much everything anybody is likely > to care about, including numeric. > Except for the usage of CHECK constraints, the above description is almost the same as what we are intending. Hash-joinable operators are expected to check equality of both sides at least; thus, we can predict which inner columns shall have identical values once both tuples are joined. Then, we can filter out obviously unmatched rows.
> At that point, if we push down the join, we can > add a filter on the inner side of the join that the hash value of the > matching column has to map to the partition it's being joined against. > Of course, its implementation is not graceful enough, especially at the above point, because this extra filter will change the expected number of rows to be produced by the inner relation, and the relevant cost. Right now, his patch calls cost_seqscan() and others by itself, according to the type of the inner relation. Of course, that is not a portable way if the inner relation is not a simple relation scan. Due to path construction staging, the AppendPath with underlying join paths has to be constructed during the join path investigation step. So, what is a reasonable way to make the inner relation's path node with filters pushed down? It is the ugliest part of the current patch. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
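KaiGai-san's stated goal — shrinking the inner hash table so it fits in limited memory such as GPU RAM — can be illustrated with a toy model. The figures below are illustrative only, and Python's hash() again stands in for the partitioning hash function:

```python
NPARTS = 4
inner = list(range(1_000))  # the inner relation's join keys

# Plain plan: the Hash node under a single HashJoin loads the whole
# inner relation into one hash table.
monolithic = {y for y in inner}

# Pulled-up plan (Append of HashJoins): each HashJoin hashes only the
# inner rows whose hash value maps to its partition, so the peak
# per-node hash-table size drops to roughly 1/NPARTS of the relation.
per_part = [{y for y in inner if hash(y) % NPARTS == k}
            for k in range(NPARTS)]

peak = max(len(t) for t in per_part)
assert len(monolithic) == 1_000
assert peak == 250  # each partition's hash table is 4x smaller
```

The trade-off the costing discussion points at is also visible here: the rewritten plan scans the inner relation once per partition, so the smaller hash tables are paid for with repeated inner scans, and the per-node row estimates and costs must reflect the pushed-down filter.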
On Mon, Jan 25, 2016 at 9:09 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: > Of course, its implementation is not graceful enough, especially, above > point because this extra filter will change expected number of rows to > be produced by inner relation, and relevant cost. > Right now, his patch calls cost_seqscan() and others according to the > type of inner relation by itself. Of course, it is not a portable way, > if inner relation would not be a simple relations scan. > > Due to path construction staging, AppendPath with underlying join paths > has to be constructed on join path investigation steps. So, what is the > reasonable way to make inner relation's path node with filters pushed- > down? > It is the most ugly part of the current patch. I think that it needs to be done only in contexts where we can guarantee that the optimization is correct, like declarative hash partitioning: http://www.postgresql.org/message-id/CA+Tgmob2wfJivFoCDLuUnovJsFTp6QsqxiPpxvNNoGLY+3rjbg@mail.gmail.com As I said upthread, in general your proposal will not work and will lead to wrong answers to queries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company