Thread: Asymmetric partition-wise JOIN
Hello,

The PostgreSQL optimizer currently considers join pairs only between two non-partitioned relations or between two partition leaves. On the other hand, it is harmless, and makes sense, to consider a join pair between a non-partitioned relation and a partition leaf.

See the example below. ptable is partitioned by hash and contains 10M rows; ftable is not partitioned and contains 50 rows. Most values of ptable.fkey will have no matching rows in this join.

create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
insert into ptable (select x % 10000, md5(x::text) from generate_series(1,10000000) x);

create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from generate_series(1,50) x);
vacuum analyze;

postgres=# explain analyze select count(*) from ptable p, ftable f where p.fkey = f.pkey;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=266393.38..266393.39 rows=1 width=8) (actual time=2333.193..2333.194 rows=1 loops=1)
   ->  Hash Join  (cost=2.12..260143.38 rows=2500000 width=0) (actual time=0.056..2330.079 rows=50000 loops=1)
         Hash Cond: (p.fkey = f.pkey)
         ->  Append  (cost=0.00..233335.00 rows=10000000 width=4) (actual time=0.012..1617.268 rows=10000000 loops=1)
               ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796 loops=1)
               ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025 loops=1)
               ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179 loops=1)
         ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.033..0.034 rows=50 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 10kB
               ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.017 rows=50 loops=1)
 Planning Time: 0.286 ms
 Execution Time: 2333.264 ms
(12 rows)

We can manually rewrite this query as follows:

postgres=# explain analyze select count(*) from (
    select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;

Because Append does not process the tuples that have no match in ftable, this query has a cheaper cost and a shorter execution time
(2333 ms --> 1396 ms).

postgres=# explain analyze select count(*) from (
    select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=210478.25..210478.26 rows=1 width=8) (actual time=1396.024..1396.024 rows=1 loops=1)
   ->  Append  (cost=2.12..210353.14 rows=50042 width=0) (actual time=0.058..1393.008 rows=50000 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=2.12..70023.66 rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
               ->  Hash Join  (cost=2.12..69856.40 rows=16726 width=72) (actual time=0.056..571.718 rows=16789 loops=1)
                     Hash Cond: (p.fkey = f.pkey)
                     ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.034..0.035 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=2.12..70027.43 rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
               ->  Hash Join  (cost=2.12..69861.26 rows=16617 width=72) (actual time=0.036..408.626 rows=16578 loops=1)
                     Hash Cond: (p_1.fkey = f_1.pkey)
                     ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422 rows=3333025 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.020..0.020 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_1  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 3"  (cost=2.12..70051.84 rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
               ->  Hash Join  (cost=2.12..69884.85 rows=16699 width=72) (actual time=0.025..406.048 rows=16633 loops=1)
                     Hash Cond: (p_2.fkey = f_2.pkey)
                     ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015 rows=3334179 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.014..0.014 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_2  (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50 loops=1)
 Planning Time: 0.614 ms
 Execution Time: 1396.131 ms
(25 rows)

What are your opinions on this kind of asymmetric partition-wise JOIN support in the optimizer? I think we can harmlessly push down inner joins, and left joins when the partition leaf is on the left side.

Probably, we need to implement two key functionalities:
1. Construction of a RelOptInfo for the join of the non-partitioned table with each of the partition leaves. Instead of JoinPaths, this logic adds an AppendPath that takes the asymmetric partition-wise join paths as sub-paths. The other optimization logic is equivalent to what we are currently doing.
2. Allowing the hash table built from the table scan to be shared across the individual partition leaves. In the above example, the SeqScan on ftable and the relevant Hash path would build an identical hash table for each of the upcoming hash joins. If sibling paths produce equivalent results, it is reasonable to reuse them.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
Hello,

Even though nobody has responded to the thread, I tried to make a prototype of the asymmetric partition-wise join support. This feature joins a non-partitioned relation to each partition of a partitioned relation before the Append. See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int, (1000*random())::int from generate_series(1,1000000) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50, so joining ptable and t1 on a = aid filters out roughly 95% of the rows. On the other hand, t1 is not partitioned and the join keys are not partition keys. So Append must process a million rows first, then HashJoin processes the rows read from the partitioned table, and 95% of them are eventually dropped. In other words, 95% of the work done by Append is a waste of time and CPU cycles.

postgres=# explain select * from ptable, t1 where a = aid;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=2.12..24658.62 rows=49950 width=49)
   Hash Cond: (ptable_p0.a = t1.aid)
   ->  Append  (cost=0.00..20407.00 rows=1000000 width=12)
         ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
         ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
         ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
   ->  Hash  (cost=1.50..1.50 rows=50 width=37)
         ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(8 rows)

The asymmetric partition-wise join allows joining the non-partitioned table to the partitioned table prior to the Append:

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Append  (cost=2.12..19912.62 rows=49950 width=49)
   ->  Hash Join  (cost=2.12..6552.96 rows=16647 width=49)
         Hash Cond: (ptable_p0.a = t1.aid)
         ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6557.29 rows=16658 width=49)
         Hash Cond: (ptable_p1.a = t1.aid)
         ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6552.62 rows=16645 width=49)
         Hash Cond: (ptable_p2.a = t1.aid)
         ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider the join ptable X t1 above to be equivalent to:

(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)

It returns an equivalent result, but the rows are already reduced by the HashJoin in each leaf of the Append, so the CPU cycles consumed by the Append node are cheaper.
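For reference, this identity corresponds to the following hand-written rewrite of the same query (the same trick as in my first mail, applied to this schema):

select * from ptable_p0, t1 where a = aid
union all
select * from ptable_p1, t1 where a = aid
union all
select * from ptable_p2, t1 where a = aid;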
On the other hand, it has a downside: t1 must be read three times, and the hash table must also be built three times. This increases the estimated cost, so the planner may not choose the asymmetric partition-wise join plan.

One idea I have is to let the sibling HashJoins share a hash table that is built once by any of the sibling Hash plans. Right now, this is not implemented yet.

What are your thoughts on this feature?

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
> We can consider the join ptable X t1 above to be equivalent to:
> (ptable_p0 + ptable_p1 + ptable_p2) X t1
> = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> It returns an equivalent result, but the rows are already reduced by the
> HashJoin in each leaf of the Append, so the CPU cycles consumed by the
> Append node are cheaper.
>
> On the other hand, it has a downside: t1 must be read three times, and
> the hash table must also be built three times. This increases the
> estimated cost, so the planner may not choose the asymmetric
> partition-wise join plan.

What if you include the partition constraint as a filter on t1? So you get:

ptable X t1 =
    (ptable_p0 X (σ hash(dist)%3=0 (t1))) +
    (ptable_p1 X (σ hash(dist)%3=1 (t1))) +
    (ptable_p2 X (σ hash(dist)%3=2 (t1)))

Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough, then workers will hopefully cooperate and do a synchronised scan of t1.
4. The filter might enable a selective and efficient plan, like an index scan.

Cons:
1. The filter might not enable a selective and efficient plan, and therefore cause extra work.

(It's a little weird in this example because we don't usually see hash functions in WHERE clauses, but that could just as easily be dist BETWEEN 1 AND 99 or any other partition constraint.)

> One idea I have is to let the sibling HashJoins share a hash table that
> is built once by any of the sibling Hash plans. Right now, this is not
> implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel Repartition. I'm interested in combining intra-node partitioning (where a single plan node repartitions data among workers on the fly) with inter-node partitioning (like PWJ, where partitions are handled by different parts of the plan, considered at planning time); you finish up needing nodes in the plan that 'receive' tuples for each partition, to match up with the PWJ plan structure. That's not entirely unlike CTE references, and not entirely unlike your idea of somehow sharing the same hash table. I ran into a number of problems while thinking about that, which I should write about in another thread.

--
Thomas Munro
https://enterprisedb.com
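In plain SQL, one arm of this rewrite could be spelled with the built-in satisfies_hash_partition() function, which PostgreSQL itself uses in hash-partition constraints. This is only a sketch: it assumes ptable were hash-partitioned on the join column a (modulus 3) rather than on dist, since the trick requires the join key to include the partition key:

select *
from ptable_p0 p
join (select * from t1
      where satisfies_hash_partition('ptable'::regclass, 3, 0, aid)) t
  on p.a = t.aid;
-- only the t1 rows that can have a partner in ptable_p0 (remainder 0)
-- survive the filter, so the per-partition hash table contains no junk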
On Sat, Aug 24, 2019 at 7:02, Thomas Munro <thomas.munro@gmail.com> wrote:
> What if you include the partition constraint as a filter on t1? So you get:
>
> ptable X t1 =
>     (ptable_p0 X (σ hash(dist)%3=0 (t1))) +
>     (ptable_p1 X (σ hash(dist)%3=1 (t1))) +
>     (ptable_p2 X (σ hash(dist)%3=2 (t1)))
>
> Pros:
> 1. The hash tables will not contain unnecessary junk.
> 2. You'll get the right answer if t1 is on the outer side of an outer join.
> 3. If this runs underneath a Parallel Append and t1 is big enough, then
>    workers will hopefully cooperate and do a synchronised scan of t1.
> 4. The filter might enable a selective and efficient plan, like an index scan.
>
> Cons:
> 1. The filter might not enable a selective and efficient plan, and
>    therefore cause extra work.
>
> (It's a little weird in this example because we don't usually see hash
> functions in WHERE clauses, but that could just as easily be dist
> BETWEEN 1 AND 99 or any other partition constraint.)

It requires that the join key include the partition key, and that it be an equality join, doesn't it? If ptable and t1 are joined using ptable.dist = t1.foo, we can indeed distribute t1 to each leaf table with "WHERE hash(foo)%3 = xxx" according to the partition bounds.

When some of the partition leaves are pruned, this is clearly beneficial, because the rows that would only be matched by the pruned child relations are a waste of memory. On the other hand, it eventually consumes almost the same amount of memory to load the inner relation if no leaves are pruned and if we could extend the Hash node to share the hash table with the sibling join nodes.

> > One idea I have is to let the sibling HashJoins share a hash table that
> > is built once by any of the sibling Hash plans. Right now, this is not
> > implemented yet.
>
> Yeah, I've thought a little bit about that in the context of Parallel
> Repartition. I'm interested in combining intra-node partitioning
> (where a single plan node repartitions data among workers on the fly)
> with inter-node partitioning (like PWJ, where partitions are handled
> by different parts of the plan, considered at planning time); you
> finish up needing nodes in the plan that 'receive' tuples for
> each partition, to match up with the PWJ plan structure. That's not
> entirely unlike CTE references, and not entirely unlike your idea of
> somehow sharing the same hash table. I ran into a number of problems
> while thinking about that, which I should write about in another
> thread.

Hmm. Do you mean that the inner path may behave differently according to the partition bounds of the outer path it is joined to? Let me investigate its pros & cons.

The reasons why I think the idea of sharing the same hash table is reasonable in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on a DSM segment, once built, can be shared by all the siblings in all the parallel workers.
2. We can save memory regardless of the join keys and partition keys, even if these are not involved in the query.

On the other hand, below are the downsides. Potentially, combined use of your idea may help with these cases:
3. The distributed inner relation cannot be on the outer side of an OUTER JOIN.
4. The hash table contains rows that would only be matched by pruned partition leaves.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
> On the other hand, it eventually consumes almost the same amount of memory
> to load the inner relation if no leaves are pruned and if we could extend
> the Hash node to share the hash table with the sibling join nodes.

The patch crashes when running the regression tests, per the report of the automatic patch tester. Could you look at that? I have moved the patch to the next CF, waiting on author.
--
Michael
Hello,

This crash was reproduced in our environment as well. It looks to me like adjust_child_relids_multilevel() didn't expect the case where the supplied 'relids' (partially) refer to plain, non-partitioned relations. It tries to build a new 'parent_relids', the set of the appinfo->parent_relid values related to the supplied 'child_relids'. However, the bits in child_relids that refer to plain relations are unintentionally dropped here, so adjust_child_relids_multilevel() recurses infinitely until it hits the stack depth limit.

The attached v2 fixes the problem, and the regression tests now pass.

Best regards,

On Sun, Dec 1, 2019 at 12:24, Michael Paquier <michael@paquier.xyz> wrote:
>
> The patch crashes when running the regression tests, per the report of
> the automatic patch tester. Could you look at that? I have moved the
> patch to the next CF, waiting on author.
> --
> Michael
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
Hi Thomas,

On 12/27/19 2:34 AM, Kohei KaiGai wrote:
> This crash was reproduced in our environment as well. It looks to me like
> adjust_child_relids_multilevel() didn't expect the case where the supplied
> 'relids' (partially) refer to plain, non-partitioned relations. It tries to
> build a new 'parent_relids', the set of the appinfo->parent_relid values
> related to the supplied 'child_relids'. However, the bits in child_relids
> that refer to plain relations are unintentionally dropped here, so
> adjust_child_relids_multilevel() recurses infinitely until it hits the
> stack depth limit.
>
> The attached v2 fixes the problem, and the regression tests now pass.

Any thoughts on the new version of this patch?

Regards,
--
-David
david@pgmasters.net
> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
> The attached v2 fixes the problem, and the regression tests now pass.

This patch no longer applies to HEAD, please submit a rebased version. Marking the entry Waiting on Author in the meantime.

cheers ./daniel
On 12/27/19 12:34 PM, Kohei KaiGai wrote:
> The attached v2 fixes the problem, and the regression tests now pass.

Using your patch, I saw an incorrect estimate of the number of rows at the top node of the plan:

"Append  (cost=270.02..35165.37 rows=40004 width=16)"

The full explain of the query plan is in the attachment, explain_with_asymmetric.sql.

If I disable enable_partitionwise_join, then:

"Hash Join  (cost=270.02..38855.25 rows=10001 width=16)"

Full explain: explain_no_asymmetric.sql.

I thought this was a case of incorrect reuse of the cached norm_selec value, but it is a corner-case problem of the eqjoinsel() routine:

selectivity = 1/size_of_larger_relation;             (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples;  (costsize.c:4607)

i.e. the estimated number of join tuples depends only on the size of the smaller relation. With four hash partitions, each child join is estimated at (1/250000) * 250000 * 10001 = 10001 rows, so the Append above sums them up to 4 * 10001 = 40004, instead of the 10001 estimated for the whole join.

It is not a bug in your patch, but I think you should know about it, because it may affect the planner's decisions.

===
P.S. Test case:

CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a);
-- partition definitions were omitted here; judging by the rows=40004
-- estimate above, four hash partitions were used, e.g.:
CREATE TABLE parts_p0 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE parts_p1 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE parts_p2 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE parts_p3 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 3);
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);

--
regards,
Andrey Lepikhov
Postgres Professional
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
>> The attached v2 fixes the problem, and the regression tests now pass.
>
> This patch no longer applies to HEAD, please submit a rebased version.
> Marking the entry Waiting on Author in the meantime.

Here is a rebased version of the patch on current master (d259afa736).

I rebased it because it is the base of an experimental feature of mine: not breaking the partition-wise join of a relation that has a foreign partition with a local relation, when we have info that the remote server has a foreign table linked to that local relation (by analogy with shippable extensions).

Maybe mark as 'Needs review'?

--
regards,
Andrey Lepikhov
Postgres Professional
> On 21 Aug 2020, at 08:02, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> Here is a rebased version of the patch on current master (d259afa736).
>
> I rebased it because it is the base of an experimental feature of mine: not
> breaking the partition-wise join of a relation that has a foreign partition
> with a local relation, when we have info that the remote server has a foreign
> table linked to that local relation (by analogy with shippable extensions).
>
> Maybe mark as 'Needs review'?

Thanks for the rebase, I've updated the commitfest entry to reflect that it needs a round of review.

cheers ./daniel
On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai <kaigai@heterodb.com> wrote:
> The reasons why I think the idea of sharing the same hash table is reasonable
> in this scenario are:
> 1. We can easily extend the idea for parallel optimization. A hash table on
> a DSM segment, once built, can be shared by all the siblings in all the
> parallel workers.
> 2. We can save memory regardless of the join keys and partition keys, even
> if these are not involved in the query.

+ many, for the sharable hash of the inner table of the join. IMHO, this could be the most interesting and captivating thing about this feature. But it might be a complicated piece; is that still on the plan?

Regards,
Amul
On 21.08.2020 09:02, Andrey V. Lepikhov wrote:
> Here is a rebased version of the patch on current master (d259afa736).
> ...
> Maybe mark as 'Needs review'?

Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a rebased version?

This thread has been inactive for quite some time. Is anyone going to continue working on it?

I see some interest in the idea of a sharable hash, but I don't see even a prototype in this thread. So it is probably a matter for a separate discussion.

Also, I took a look at the code. It looks like it needs some extra work. I am not a big expert in this area, so I'm sorry if the questions are obvious.

1. What would happen if this assumption is not met?

+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removal allows
+ * to retain this kind of AppendPath, regardless of its cost.

2. Why do we wrap the extract_asymmetric_partitionwise_subjoin() call into PG_TRY/PG_CATCH? What errors do we expect?

3. It looks like a crutch. If it isn't, I'd like to see a better comment about why "dynamic programming" is not applicable here. And shouldn't we also handle root->join_cur_level?

+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;

4. This change looks like it can lead to a memory leak for old code. Maybe it is never the case, but again I think it's worth a comment.

- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+     return copyObject(node);

5. extract_asymmetric_partitionwise_subjoin() lacks a comment.

The new status of this patch is: Waiting on Author

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Status update for a commitfest entry.

This entry was inactive during this CF, so I've marked it as returned with feedback. Feel free to resubmit an updated version to a future commitfest.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
> This entry was inactive during this CF, so I've marked it as returned
> with feedback. Feel free to resubmit an updated version to a future
> commitfest.

The attached version is rebased on current master and fixes problems with complex parameterized plans in the 'reparameterize by child' machinery. The problems with the reparameterization machinery can be demonstrated with the TPC-H benchmark.

--
regards,
Andrey Lepikhov
Postgres Professional
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
> This entry was inactive during this CF, so I've marked it as returned
> with feedback. Feel free to resubmit an updated version to a future
> commitfest.

I am returning the patch to the commitfest. My current motivation differs from that of the original author. This patch can open a door for more complex optimizations in the partition-wise join push-down technique. I mean, we can push down a join not only of two partitioned tables with the same partitioning scheme, but also of a partitioned (sharded) table with an arbitrary subplan that is provably independent of local resources.

Example:

CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);

SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s WHERE p.a=s.a;

Hash Join
  Hash Cond: (p.a = a.a)
  ->  Append
        ->  Seq Scan on p1 p_1
        ->  Seq Scan on p2 p_2
        ->  Seq Scan on p3 p_3
  ->  Hash
        ->  Function Scan on generate_series a

But with the asymmetric join feature we get the plan:

Append
  ->  Hash Join
        Hash Cond: (p_1.a = a.a)
        ->  Seq Scan on p1 p_1
        ->  Hash
              ->  Function Scan on generate_series a
  ->  Hash Join
        Hash Cond: (p_2.a = a.a)
        ->  Seq Scan on p2 p_2
        ->  Hash
              ->  Function Scan on generate_series a
  ->  Hash Join
        Hash Cond: (p_3.a = a.a)
        ->  Seq Scan on p3 p_3
        ->  Hash
              ->  Function Scan on generate_series a

In the case of FDW sharding, this means that if we can prove the inner relation is independent of the execution server, we can push down these joins and execute them in parallel.

--
regards,
Andrey Lepikhov
Postgres Professional
Here is the next version of the patch. To flush out problems, I forced this code path during the 'make check' tests. Some bugs were found and fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
Andrey Lepikhov wrote on 2021-05-27 07:27:
> Here is the next version of the patch.
> To flush out problems, I forced this code path during the 'make check'
> tests. Some bugs were found and fixed.

Hi.
I've tested this patch and haven't found issues, but I have some comments.

src/backend/optimizer/path/joinrels.c:

1554
1555 /*
1556  * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
1557  * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
1558  * these JOINs are impossible to build.
1559  */
1560 static List *
1561 extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
1562                                          RelOptInfo *joinrel,
1563                                          AppendPath *append_path,
1564                                          RelOptInfo *inner_rel,
1565                                          JoinType jointype,
1566                                          JoinPathExtraData *extra)
1567 {
1568     List       *result = NIL;
1569     ListCell   *lc;
1570
1571     foreach (lc, append_path->subpaths)
1572     {
1573         Path       *child_path = lfirst(lc);
1574         RelOptInfo *child_rel = child_path->parent;
1575         Relids      child_join_relids;
1576         Relids      parent_relids;
1577         RelOptInfo *child_join_rel;
1578         SpecialJoinInfo *child_sjinfo;
1579         List       *child_restrictlist;

The variable names child_join_rel and child_join_relids seem to be inconsistent with the rest of the file (I see child_joinrelids in try_partitionwise_join(), and child_joinrel in try_partitionwise_join() and get_matching_part_pairs()).

1595         child_join_rel = build_child_join_rel(root,
1596                                               child_rel,
1597                                               inner_rel,
1598                                               joinrel,
1599                                               child_restrictlist,
1600                                               child_sjinfo,
1601                                               jointype);
1602         if (!child_join_rel)
1603         {
1604             /*
1605              * If can't build JOIN between inner relation and one of the outer
1606              * partitions - return immediately.
1607              */
1608             return NIL;
1609         }

When can build_child_join_rel() return NULL? If I read the code correctly, joinrel is created at the beginning of build_child_join_rel() with makeNode(); makeNode() wraps newNode(), and newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(), which would error() on alloc() failure.

1637
1638 static bool
1639 is_asymmetric_join_capable(PlannerInfo *root,
1640                            RelOptInfo *outer_rel,
1641                            RelOptInfo *inner_rel,
1642                            JoinType jointype)
1643 {

The function is missing a comment.

1656     /*
1657      * Don't allow asymmetric JOIN of two append subplans.
1658      * In the case of a parameterized NL join, a reparameterization procedure will
1659      * lead to large memory allocations and a CPU consumption:
1660      * each reparameterize will induce subpath duplication, creating new
1661      * ParamPathInfo instance and increasing of ppilist up to number of partitions
1662      * in the inner. Also, if we have many partitions, each bitmapset
1663      * variable will large and many leaks of such variable (caused by relid
1664      * replacement) will highly increase memory consumption.
1665      * So, we deny such paths for now.
1666      */

Missing word: each bitmapset variable will large => each bitmapset variable will be large

1694     foreach (lc, outer_rel->pathlist)
1695     {
1696         AppendPath *append_path = lfirst(lc);
1697
1698         /*
1699          * MEMO: We assume this pathlist keeps at least one AppendPath that
1700          * represents partitioned table-scan, symmetric or asymmetric
1701          * partition-wise join. It is not correct right now, however, a hook
1702          * on add_path() to give additional decision for path removal allows
1703          * to retain this kind of AppendPath, regardless of its cost.
1704          */
1705         if (IsA(append_path, AppendPath))

What hook do you refer to?

src/backend/optimizer/plan/setrefs.c:

282     /*
283      * Adjust RT indexes of AppendRelInfos and add to final appendrels list.
284      * We assume the AppendRelInfos were built during planning and don't need
285      * to be copied.
286      */
287     foreach(lc, root->append_rel_list)
288     {
289         AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
290         AppendRelInfo *newappinfo;
291
292         /* flat copy is enough since all valuable fields are scalars */
293         newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
294         memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));

You've changed the function to copy the appinfo, so now the comment is incorrect.

src/backend/optimizer/util/appendinfo.c:

588     /* Construct relids set for the immediate parent of the given child. */
589     normal_relids = bms_copy(child_relids);
590     for (cnt = 0; cnt < nappinfos; cnt++)
591     {
592         AppendRelInfo *appinfo = appinfos[cnt];
593
594         parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
595         normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
596     }
597     parent_relids = bms_union(parent_relids, normal_relids);

Do I understand correctly that now parent_relids also contains the relids of relations from the 'global' inner relation, which we join to the children?

--
Best regards,
Alexander Pyhalov,
Postgres Professional
On 18/6/21 15:02, Alexander Pyhalov wrote:
> Hi.
> I've tested this patch and haven't found issues, but I have some comments.

Thank you for the review!

> The variable names child_join_rel and child_join_relids seem to be
> inconsistent with the rest of the file

Fixed.

> When can build_child_join_rel() return NULL?

Fixed.

> Missing word:
> each bitmapset variable will large => each bitmapset variable will be large

Fixed.

> What hook do you refer to?

Removed.

> You've changed the function to copy the appinfo, so now the comment is
> incorrect.

Thanks, fixed.

> Do I understand correctly that now parent_relids also contains the relids of
> relations from the 'global' inner relation, which we join to the children?

Yes.

--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> On 18/6/21 15:02, Alexander Pyhalov wrote:
> > I've tested this patch and haven't found issues, but I have some comments.
> Thank you for the review!
Hi,

during reparameterization of NestLoop path.

CPU and memory huge consumption -> huge consumption of CPU and memory

+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.

at least one of these JOINs are impossible to build -> at least one of these JOINs is impossible to build

+ * Can't imagine situation when join relation already exists. But in
+ * the 'partition_join' regression test it happens.
+ * It may be an indicator of possible problems.

Should a log be added in the above case?

+is_asymmetric_join_capable(PlannerInfo *root,

is_asymmetric_join_capable -> is_asymmetric_join_feasible

Cheers
On 5/7/21 23:15, Zhihong Yu wrote:
> + * Can't imagine situation when join relation already exists. But in
> + * the 'partition_join' regression test it happens.
> + * It may be an indicator of possible problems.
>
> Should a log be added in the above case?

I made an additional analysis of this branch of the code. This situation can happen in the case of a single child, or if we join two plain tables with a partitioned one. Both situations are legal, and I think we don't need to add any log message here. The other mistakes were fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
Andrey Lepikhov wrote on 2021-07-06 12:28:
> I made an additional analysis of this branch of the code. This situation can
> happen in the case of a single child, or if we join two plain tables with a
> partitioned one. Both situations are legal, and I think we don't need to add
> any log message here.
> The other mistakes were fixed.

Hi.
Small typo in a comment in src/backend/optimizer/plan/setrefs.c:

281
282     /*
283      * Adjust RT indexes of AppendRelInfos and add to final appendrels list.
284      * The AppendRelInfos are copied, because as a part of a subplan its could
285      * be visited many times in the case of asymmetric join.
286      */
287     foreach(lc, root->append_rel_list)
288     {

its -> it (or they)?

--
Best regards,
Alexander Pyhalov,
Postgres Professional
On 5/7/21 23:15, Zhihong Yu wrote:
> + * Can't imagine situation when join relation already exists. But in
> + * the 'partition_join' regression test it happens.
> + * It may be an indicator of possible problems.
>
> Should a log be added in the above case?

I worked more on this case and found a more serious mistake. While populating additional paths on an existing RelOptInfo, we could remove some previously generated paths that are pointed to from a higher-level list of subplans, which could cause subplan links to be lost. I prohibit such a situation (you can read the comments in the new version of the patch). Also, choosing the cheapest path after appendrel creation was added. The unstable tests were fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
On Thu, Jul 15, 2021 at 11:32 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I worked more on this case and found a more serious mistake. While populating
> additional paths on an existing RelOptInfo, we could remove some previously
> generated paths that are pointed to from a higher-level list of subplans,
> which could cause subplan links to be lost. I prohibit such a situation (you
> can read the comments in the new version of the patch).
> Also, choosing the cheapest path after appendrel creation was added.
> The unstable tests were fixed.
The patch is failing the regression tests; can you please take a look at that?
partition_join ... FAILED 6328 ms
--
Ibrar Ahmed
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the status to save time for reviewers.

The new status of this patch is: Waiting on Author
On Thu, Sep 09, 2021 at 09:50:46AM +0000, Aleksander Alekseev wrote:
> It looks like this patch needs to be updated. According to
> http://cfbot.cputube.org/ it applies but doesn't pass any tests.
> Changing the status to save time for reviewers.
>
> The new status of this patch is: Waiting on Author

Just to give some more info to work with, I found that this patch makes postgres crash with a segmentation fault.

"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
126             if (shorter->words[i] != longer->words[i])
"""

Attached are the query that triggers the crash and the backtrace.

--
Jaime Casanova
Director of Professional Services
SystemGuards - Consultores de PostgreSQL
On 9/9/21 8:38 PM, Jaime Casanova wrote:
> Just to give some more info to work with, I found that this patch makes
> postgres crash with a segmentation fault.
>
> """
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
> 126             if (shorter->words[i] != longer->words[i])
> """
>
> Attached are the query that triggers the crash and the backtrace.

Thank you for this good catch!

The problem was in the adjust_child_relids_multilevel() routine: the tmp_result variable sometimes points to the original required_outer. This patch adds new ways for the optimizer to generate plans. One such way is that the optimizer reparameterizes an inner path by a plain relation from the outer side (for example, as a result of a join of the plain relation with a partitioned relation). In that case we have to compare tmp_result with the original pointer to detect whether it was changed or not.

The patch in the attachment fixes this problem. An additional regression test is added.

--
regards,
Andrey Lepikhov
Postgres Professional
On 14/9/21 11:37, Andrey V. Lepikhov wrote:
> Thank you for this good catch!
> The problem was in the adjust_child_relids_multilevel() routine: the
> tmp_result variable sometimes points to the original required_outer.
> The patch in the attachment fixes this problem. An additional regression
> test is added.

I thought about it more and realized that it isn't necessary to recurse in the adjust_child_relids_multilevel() routine if required_outer contains only normal_relids. Also, the regression tests were improved a bit.

--
regards,
Andrey Lepikhov
Postgres Professional
Andrey Lepikhov wrote on 2021-09-15 09:31:
> I thought about it more and realized that it isn't necessary to recurse in
> the adjust_child_relids_multilevel() routine if required_outer contains
> only normal_relids.
> Also, the regression tests were improved a bit.

Hi.
The patch no longer applies cleanly, so I rebased it. Attaching the rebased version.

I've looked through it once again and have several questions.

1) In adjust_appendrel_attrs_multilevel(), can it happen that child_relids is a zero-length list (in which case the pfree calls will fail)? It seems not, but should we at least assert this? Note that in adjust_appendrel_attrs() we add logic for nappinfos being 0.

2) In try_asymmetric_partitionwise_join() we state that 'Asymmetric join isn't needed if the append node has only one child'. This is not completely correct. An asymmetric join with one partition can be advantageous when JOIN(A, UNION(B)) is more expensive than UNION(JOIN(A, B)). The latter is true, for example, when we join a partitioned table having foreign partitions with another foreign table and only one partition is left.

Let's take the attached case (foreign_join.sql). With the list_length(append_path->subpaths) > 1 condition in place, we get the following plan:

set enable_partitionwise_join = on;
explain SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2 ON (t1.a = t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Sort  (cost=208.65..208.69 rows=17 width=8)
   Sort Key: t1.a
   ->  Hash Join  (cost=202.60..208.30 rows=17 width=8)
         Hash Cond: (t1.a = t2.b)
         ->  Foreign Scan on ftprt1_p1 t1  (cost=100.00..105.06 rows=125 width=4)
         ->  Hash  (cost=102.39..102.39 rows=17 width=4)
               ->  Foreign Scan on ftprt2_p1 t2  (cost=100.00..102.39 rows=17 width=4)

If we change it to list_length(append_path->subpaths) > 0, we get a foreign join and a cheaper plan:

explain verbose SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2 ON (t1.a = t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=106.15..106.19 rows=17 width=8)
   Output: t1.a, t2.b
   Sort Key: t1.a
   ->  Foreign Scan  (cost=102.26..105.80 rows=17 width=8)
         Output: t1.a, t2.b
         Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
         Remote SQL: SELECT r4.a, r2.b FROM (public.fprt1_p1 r4 INNER JOIN public.fprt2_p1 r2 ON (((r4.a = r2.b)) AND ((r2.c ~~ '%0004')) AND ((r4.a < 250))))

--
Best regards,
Alexander Pyhalov,
Postgres Professional
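For readers without the attachment, the plans above can be reproduced with a setup along the following lines, modeled on the postgres_fdw regression tables ('loopback' is assumed to be a postgres_fdw server pointing back at the same cluster; the actual foreign_join.sql may differ in details):

CREATE TABLE fprt1 (a int, b int, c varchar) PARTITION BY RANGE (a);
CREATE TABLE fprt1_p1 (LIKE fprt1);  -- remote-side storage
CREATE FOREIGN TABLE ftprt1_p1 PARTITION OF fprt1
    FOR VALUES FROM (0) TO (250)
    SERVER loopback OPTIONS (table_name 'fprt1_p1');
CREATE TABLE fprt2 (a int, b int, c varchar) PARTITION BY RANGE (b);
CREATE TABLE fprt2_p1 (LIKE fprt2);
CREATE FOREIGN TABLE ftprt2_p1 PARTITION OF fprt2
    FOR VALUES FROM (0) TO (250)
    SERVER loopback OPTIONS (table_name 'fprt2_p1');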
Hi Alexander,
Hi Andrey,

Thank you for your work on this subject.

On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> The patch no longer applies cleanly, so I rebased it. Attaching the
> rebased version.

Not surprising that the patch doesn't apply after 1.5 years since the last message. Could you please rebase it?

I read the thread and the patch. The patch improves the joining of partitioned tables with non-partitioned relations. Let's denote the non-partitioned relation as A and the partitions as P1 ... PN. The patch allows Append(Join(A, P1), ..., Join(A, PN)) instead of Join(A, Append(P1, ..., PN)). That could be cheaper because it's generally cheaper to join small pieces rather than do one big join. The drawback is the need to scan A multiple times. But is this really necessary and acceptable? Let's consider multiple options.

1) A is a non-table. For instance, A is a function scan. In this case, doing multiple scans of A is not just expensive, but could lead to unexpected side effects. When the user includes a function once in the FROM clause, she expects this function to be evaluated once. I propose that we should materialize the scan of a non-table relation. So, the materialized representation will be scanned multiple times, but the source will be scanned only once. That would be similar to a CTE.
2) A is a table to be scanned with a parameterized path in the inner part of a nested loop join. In this case, there is no big scan of A and nothing to materialize.
3) A is a table to be used in a merge join or in the outer part of a nested loop join. In this case, it would be nice to consider materialization. It's not always good to materialize, because materialization has its additional costs, so I think that could be a cost-based decision.
4) A is used in a hash join. Could we re-use the hashed representation of A between multiple joins? I read upthread that it was proposed to share a hash table between multiple background workers via shared memory. But the first step would be to just share it between multiple join nodes within the same process.

As we consider joining with each partition individually, different join methods could be chosen. As I understand it, the current patch treats the join with each of the partitions as a separate, isolated optimization task. However, if we share resources between the multiple joins, a need arises for some global optimization. For instance, a join type could be expensive when applied to an individual partition, but cheap when applied to all the partitions thanks to saving the common work.

My idea is to treat generated common resources (such as materialized scans) as a property of the path. For instance, if the nested loop join is cheaper than the hash join, but the hash join generates a common hash map of table A, we don't drop the hash join from consideration immediately; we leave it to see how it could help join other partitions. What do you think?

------
Regards,
Alexander Korotkov
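A hedged sketch of the materialization idea from option 1 above, built on the existing create_material_path(); the gating check and function name are illustrative, not from the patch:

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/pathnode.h"

/*
 * Illustrative sketch: if the inner relation is not a plain table --
 * e.g. a function scan with possible side effects -- put a Material
 * node on top of its path, so the source is evaluated once and only
 * the materialized result is rescanned for each partition join.
 */
static Path *
materialize_inner_if_needed(RelOptInfo *inner_rel, Path *inner_path)
{
    if (inner_rel->rtekind != RTE_RELATION) /* function scan, VALUES, ... */
        return (Path *) create_material_path(inner_rel, inner_path);

    return inner_path;
}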
On 15/10/2023 07:18, Alexander Korotkov wrote:
> Hi Alexander,
> Hi Andrey,
>
> Thank you for your work on this subject.
>
> On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
> <a.pyhalov@postgrespro.ru> wrote:
>> The patch no longer applies cleanly, so I rebased it. Attaching the
>> rebased version.
>
> Not surprising that the patch doesn't apply after 1.5 years since the
> last message. Could you please rebase it?
>
> I read the thread and the patch. The patch improves the joining of
> partitioned tables with non-partitioned relations. Let's denote the
> non-partitioned relation as A and the partitions as P1 ... PN. The
> patch allows Append(Join(A, P1), ..., Join(A, PN)) instead of
> Join(A, Append(P1, ..., PN)). That could be cheaper because it's
> generally cheaper to join small pieces rather than do one big join.
> The drawback is the need to scan A multiple times. But is this really
> necessary and acceptable? Let's consider multiple options.
>
> 1) A is a non-table. For instance, A is a function scan. In this case,
> doing multiple scans of A is not just expensive, but could lead to
> unexpected side effects. When the user includes a function once in the
> FROM clause, she expects this function to be evaluated once. I propose
> that we should materialize the scan of a non-table relation. So, the
> materialized representation will be scanned multiple times, but the
> source will be scanned only once. That would be similar to a CTE.
> 2) A is a table to be scanned with a parameterized path in the inner
> part of a nested loop join. In this case, there is no big scan of A
> and nothing to materialize.
> 3) A is a table to be used in a merge join or in the outer part of a
> nested loop join. In this case, it would be nice to consider
> materialization. It's not always good to materialize, because
> materialization has its additional costs, so I think that could be a
> cost-based decision.
> 4) A is used in a hash join. Could we re-use the hashed representation
> of A between multiple joins? I read upthread that it was proposed to
> share a hash table between multiple background workers via shared
> memory. But the first step would be to just share it between multiple
> join nodes within the same process.
>
> As we consider joining with each partition individually, different
> join methods could be chosen. As I understand it, the current patch
> treats the join with each of the partitions as a separate, isolated
> optimization task. However, if we share resources between the multiple
> joins, a need arises for some global optimization. For instance, a
> join type could be expensive when applied to an individual partition,
> but cheap when applied to all the partitions thanks to saving the
> common work.
>
> My idea is to treat generated common resources (such as materialized
> scans) as a property of the path. For instance, if the nested loop
> join is cheaper than the hash join, but the hash join generates a
> common hash map of table A, we don't drop the hash join from
> consideration immediately; we leave it to see how it could help join
> other partitions. What do you think?

Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways to push down more joins to foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which is why this patch was put on hold in core. You have brought new ideas about applying this idea locally.
Moreover, the main issue with the patch was massive memory consumption in the case of many joins and partitions, caused by reparameterization. But now, the postponed reparameterization proposed in thread [1] resolves that problem and gives some insight into the technique of reparameterizing certain fields, like lateral references.
Hence, I think we can restart this work. The first thing to do here (after a rebase, of course) is to figure out, and implement in the cost model, the cases where an asymmetric join would give a significant performance benefit.

[1] Oversight in reparameterize_path_by_child leading to executor crash
https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com

--
regards,
Andrey Lepikhov
Postgres Professional
On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Thanks for such detailed feedback!
> The rationale for this patch was to give the optimizer additional ways
> to push down more joins to foreign servers. And, because of
> asynchronous append, the benefit of that optimization was obvious.
> Unfortunately, we hadn't found other applications for this feature,
> which is why this patch was put on hold in core.
> You have brought new ideas about applying this idea locally. Moreover,
> the main issue with the patch was massive memory consumption in the
> case of many joins and partitions, caused by reparameterization. But
> now, the postponed reparameterization proposed in thread [1] resolves
> that problem and gives some insight into the technique of
> reparameterizing certain fields, like lateral references.
> Hence, I think we can restart this work.
> The first thing to do here (after a rebase, of course) is to figure
> out, and implement in the cost model, the cases where an asymmetric
> join would give a significant performance benefit.

Great! I'm looking forward to the revised patch.

------
Regards,
Alexander Korotkov
On 15/10/2023 17:25, Alexander Korotkov wrote:
> On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> Thanks for such detailed feedback!
>> The rationale for this patch was to give the optimizer additional ways
>> to push down more joins to foreign servers. And, because of
>> asynchronous append, the benefit of that optimization was obvious.
>> Unfortunately, we hadn't found other applications for this feature,
>> which is why this patch was put on hold in core.
>> You have brought new ideas about applying this idea locally. Moreover,
>> the main issue with the patch was massive memory consumption in the
>> case of many joins and partitions, caused by reparameterization. But
>> now, the postponed reparameterization proposed in thread [1] resolves
>> that problem and gives some insight into the technique of
>> reparameterizing certain fields, like lateral references.
>> Hence, I think we can restart this work.
>> The first thing to do here (after a rebase, of course) is to figure
>> out, and implement in the cost model, the cases where an asymmetric
>> join would give a significant performance benefit.
>
> Great! I'm looking forward to the revised patch.

Before preparing a new patch, it would be better to find common ground on the following issue.
So far, this optimization stands apart, proposing an alternative path for a join RelOptInfo when we have an underlying append path on the outer side.
My back-burner idea is to redesign the approach: an asymmetric join doesn't change the partitioning scheme or the bounds of the partitioned side. So, it seems consistent to make it part of the partitionwise-join machinery and implement it within the try_partitionwise_join / generate_partitionwise_join_paths routines.

--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
>> Great! I'm looking forward to the revised patch.
>
> Before preparing a new patch, it would be better to find common ground
> on the following issue.
> So far, this optimization stands apart, proposing an alternative path
> for a join RelOptInfo when we have an underlying append path on the
> outer side.
> My back-burner idea is to redesign the approach: an asymmetric join
> doesn't change the partitioning scheme or the bounds of the partitioned
> side. So, it seems consistent to make it part of the partitionwise-join
> machinery and implement it within the try_partitionwise_join /
> generate_partitionwise_join_paths routines.

I think we need an example where such a join will be faster than a non-partitioned join when both sides are local. It might be possible to come up with such an example without writing any code; the idea would be to rewrite the SQL as a union of joins.

Whenever I have visited this idea, I hit one issue prominently: how would we differentiate the different scans of the non-partitioned relation? Normally we do that using different Relids, but in this case we wouldn't know the number of such relations involved in the query until we start planning such a join, and by then it's too late to add new base relations and assign them new Relids. Of course, I haven't thought hard about it, and I haven't looked at the patch to see whether this problem is solved and how.

--
Best Wishes,
Ashutosh Bapat
On 16/10/2023 23:21, Ashutosh Bapat wrote:
> On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
> Whenever I have visited this idea, I hit one issue prominently: how
> would we differentiate the different scans of the non-partitioned
> relation? Normally we do that using different Relids, but in this case
> we wouldn't know the number of such relations involved in the query
> until we start planning such a join, and by then it's too late to add
> new base relations and assign them new Relids. Of course, I haven't
> thought hard about it, and I haven't looked at the patch to see whether
> this problem is solved and how.

I'm curious, what type of problems are you afraid of here? Why would we need a range table entry for each scan of the non-partitioned relation?

--
regards,
Andrey Lepikhov
Postgres Professional
On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 16/10/2023 23:21, Ashutosh Bapat wrote:
> > On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
> > Whenever I have visited this idea, I hit one issue prominently: how
> > would we differentiate the different scans of the non-partitioned
> > relation? Normally we do that using different Relids, but in this case
> > we wouldn't know the number of such relations involved in the query
> > until we start planning such a join, and by then it's too late to add
> > new base relations and assign them new Relids. Of course, I haven't
> > thought hard about it, and I haven't looked at the patch to see
> > whether this problem is solved and how.
> >
> I'm curious, what type of problems are you afraid of here? Why would we
> need a range table entry for each scan of the non-partitioned relation?

Not an RTE but a RelOptInfo.

Using the same example as Alexander Korotkov: let's say A is the non-partitioned table and P is a partitioned table with partitions P1, P2, ..., Pn. The partitionwise join would need to compute AP1, AP2, ..., APn. Each of these joins may have different properties and thus will require creating paths. In order to save these paths, we need RelOptInfos, which are identified by relids. Let's assume that the relids of these join RelOptInfos are created as the union of the relids of A and of Px (the partition being joined). This is notionally misleading but doable.

But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations.

The relid is also used to track the scans at the executor level. Since we have many scans on A, each possibly using a different plan, we will need different ids for those.

But if you have developed a way to use a single RelOptInfo of A to do all this, maybe we don't need all this. I will take a look at your next version of the patch.

--
Best Wishes,
Ashutosh Bapat
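A hedged sketch of the relid-union identification scheme described above, mirroring what build_join_rel() does for ordinary joins; the function name is illustrative:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/pathnodes.h"

/*
 * Illustrative sketch: the joinrel for (A join Px) would be identified
 * by the union of the participants' relids.  The difficulty raised in
 * the mail is that A's relids are the same for every partition, so all
 * per-partition translations of A's clauses would have to hang off a
 * single RelOptInfo for A.
 */
static Relids
asymmetric_joinrel_relids(RelOptInfo *outer_child, RelOptInfo *inner_rel)
{
    return bms_union(outer_child->relids, inner_rel->relids);
}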
On 17/10/2023 17:09, Ashutosh Bapat wrote:
> On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>>
>> On 16/10/2023 23:21, Ashutosh Bapat wrote:
>>> On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
>>> Whenever I have visited this idea, I hit one issue prominently: how
>>> would we differentiate the different scans of the non-partitioned
>>> relation? Normally we do that using different Relids, but in this
>>> case we wouldn't know the number of such relations involved in the
>>> query until we start planning such a join, and by then it's too late
>>> to add new base relations and assign them new Relids. Of course, I
>>> haven't thought hard about it, and I haven't looked at the patch to
>>> see whether this problem is solved and how.
>>>
>> I'm curious, what type of problems are you afraid of here? Why would
>> we need a range table entry for each scan of the non-partitioned
>> relation?
>
> Not an RTE but a RelOptInfo.
>
> Using the same example as Alexander Korotkov: let's say A is the
> non-partitioned table and P is a partitioned table with partitions P1,
> P2, ..., Pn. The partitionwise join would need to compute AP1, AP2,
> ..., APn. Each of these joins may have different properties and thus
> will require creating paths. In order to save these paths, we need
> RelOptInfos, which are identified by relids. Let's assume that the
> relids of these join RelOptInfos are created as the union of the relids
> of A and of Px (the partition being joined). This is notionally
> misleading but doable.

Ok, now I see your disquiet. In the current patch we build a RelOptInfo for each JOIN(A, Pi) via the build_child_join_rel() routine. And of course, they all have different sets of cheapest paths (that is one more point of optimality). At this point the RelOptInfo of relation A is fully formed, and upper joins use its pathlist "as is", without changes.

> But the clauses of A parameterized by P will produce different
> translations for each of the partitions. I think we will need
> different RelOptInfos (for A) to store these translations.

Does the answer above resolve this issue?

> The relid is also used to track the scans at the executor level. Since
> we have many scans on A, each possibly using a different plan, we will
> need different ids for those.

I don't understand this sentence. In what way does the executor use this index of the RelOptInfo?

--
regards,
Andrey Lepikhov
Postgres Professional
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> > But the clauses of A parameterized by P will produce different
> > translations for each of the partitions. I think we will need
> > different RelOptInfos (for A) to store these translations.
>
> Does the answer above resolve this issue?

Maybe. There are other problematic areas like EvalPlanQual, rescans, and reparameterized paths, which can blow up if we use the same RelOptInfo for different scans of the same relation. It will be good to test those. And A need not be a simple relation; it could be a join as well.

> > The relid is also used to track the scans at the executor level.
> > Since we have many scans on A, each possibly using a different plan,
> > we will need different ids for those.
>
> I don't understand this sentence. In what way does the executor use
> this index of the RelOptInfo?

See Scan::scanrelid.

--
Best Wishes,
Ashutosh Bapat
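For reference, the executor-level relation index mentioned here is the scanrelid field of the Scan node, lightly abridged from src/include/nodes/plannodes.h:

/*
 * Scan node: scanrelid is an index into the range table, so two scans
 * of "the same" relation that need different treatment at executor
 * level normally need different range-table indexes.
 */
typedef struct Scan
{
    Plan    plan;
    Index   scanrelid;      /* relid is index into the range table */
} Scan;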
On 18/10/2023 16:59, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>>
>>> But the clauses of A parameterized by P will produce different
>>> translations for each of the partitions. I think we will need
>>> different RelOptInfos (for A) to store these translations.
>>
>> Does the answer above resolve this issue?
>
> Maybe. There are other problematic areas like EvalPlanQual, rescans,
> and reparameterized paths, which can blow up if we use the same
> RelOptInfo for different scans of the same relation. It will be good
> to test

Yeah, now I get it. This is already the second place where I have seen a reference to a kind of hidden rule that an RTE entry (or RelOptInfo) must correspond to only one plan node. I don't have a quick answer for now - maybe it is a kind of architectural agreement - and I will consider this issue during development.

> those. And A need not be a simple relation; it could be a join as
> well.

For a join RelOptInfo, as well as for any subtree, we have the same logic: the pathlist of that subtree has already been formed during the previous level of the search and will not be changed.

>>
>>> The relid is also used to track the scans at the executor level.
>>> Since we have many scans on A, each possibly using a different plan,
>>> we will need different ids for those.
>>
>> I don't understand this sentence. In what way does the executor use
>> this index of the RelOptInfo?
>
> See Scan::scanrelid.

--
regards,
Andrey Lepikhov
Postgres Professional
On 15/10/2023 13:25, Alexander Korotkov wrote:
> Great! I'm looking forward to the revised patch.

Revising the code and the opinions in this thread before restarting this work, I found two different possible strategies mentioned:

1. 'Common resources' - share the materialised result of the inner table scan (a hash table in the case of HashJoin) and join each partition with it one by one. This is profitable in the case of parallel append, and possibly in other cases, like the one shown in the initial message.
2. 'Individual strategies' - by limiting the AJ feature to cases where the JOIN clause contains a partitioning expression, we can push an additional scan clause into each copy of the inner table scan, reduce the number of tuples scanned, and even prune partitions entirely because of proven empty input.

I see pros and cons in both approaches. The first option is more straightforward, and its benefit is obvious in the case of parallel append. But how can we guarantee the same join type for each join? And why should we ignore the positive effect of using different strategies for different partitions?
The second strategy is more expensive for the optimiser, especially in the multipartition case. But as far as I can predict, it is easier to implement and looks more natural for the architecture. What do you think about that?

--
regards,
Andrei Lepikhov
Postgres Professional
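A rough, hedged cost intuition for comparing the two strategies for a hash join against N partitions; all symbols below are abstract placeholders, nothing here is from the patch:

/*
 * Illustrative only: compare an inner hash built once and reused
 * across all partition joins against per-partition builds over a
 * reduced (clause-filtered) inner.  All costs are abstract numbers
 * supplied by the caller.
 */
static double
shared_resource_cost(double build_once, const double *probe, int nparts)
{
    double  total = build_once;     /* inner hash built a single time */

    for (int i = 0; i < nparts; i++)
        total += probe[i];
    return total;
}

static double
individual_strategy_cost(const double *build_i, const double *probe,
                         int nparts)
{
    double  total = 0.0;

    /* each partition join builds its own, possibly pruned, inner hash */
    for (int i = 0; i < nparts; i++)
        total += build_i[i] + probe[i];
    return total;
}

Which side wins depends on how much the pushed-down partition clause shrinks each per-partition build; with no shrinkage, the shared variant trivially dominates.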
On 18/10/2023 16:59, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
>>> The relid is also used to track the scans at the executor level.
>>> Since we have many scans on A, each possibly using a different plan,
>>> we will need different ids for those.
>>
>> I don't understand this sentence. In what way does the executor use
>> this index of the RelOptInfo?
>
> See Scan::scanrelid.

Hi,

In the attachment, you will find a fresh version of the patch.
I've analysed the danger of reusing the same RelOptInfo index in the executor. In the examples I found (which scared me at first), it is still not a problem, because ExecQual() does all the work in one operation and doesn't overlap with other operations. Of course, it is not a good design, and we will work on this issue. But at least this code can be used in experiments.
Furthermore, I've shared some reflections on this feature. To avoid cluttering the thread, I've published them in [1]. These thoughts provide additional context and considerations for our ongoing work.

[1] https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy

--
regards,
Andrei Lepikhov
Hi!

On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 15/10/2023 13:25, Alexander Korotkov wrote:
> > Great! I'm looking forward to the revised patch.
> Revising the code and the opinions in this thread before restarting
> this work, I found two different possible strategies mentioned:
> 1. 'Common resources' - share the materialised result of the inner
> table scan (a hash table in the case of HashJoin) and join each
> partition with it one by one. This is profitable in the case of
> parallel append, and possibly in other cases, like the one shown in
> the initial message.
> 2. 'Individual strategies' - by limiting the AJ feature to cases where
> the JOIN clause contains a partitioning expression, we can push an
> additional scan clause into each copy of the inner table scan, reduce
> the number of tuples scanned, and even prune partitions entirely
> because of proven empty input.
>
> I see pros and cons in both approaches. The first option is more
> straightforward, and its benefit is obvious in the case of parallel
> append. But how can we guarantee the same join type for each join? And
> why should we ignore the positive effect of using different strategies
> for different partitions?
> The second strategy is more expensive for the optimiser, especially in
> the multipartition case. But as far as I can predict, it is easier to
> implement and looks more natural for the architecture. What do you
> think about that?

Actually, the idea I tried to express is a combination of #1 and #2: build an individual plan for every partition, but take the 'common resources' into account. Let me explain this a bit more.

Right now, we basically consider the following properties during the selection of paths.
1) Cost. The cheaper path wins. There are two criteria, though: startup cost and total cost. So, we can keep both paths with a cheaper startup cost and paths with a cheaper total cost.
2) Pathkeys. We can keep a more expensive path that has pathkeys which are potentially useful later.

My idea is to introduce a new property for path selection.
3) Usage of common resources. A common resource can be a hash representation of a relation, a memoize node over a relation scan, etc. We can exclude the cost of generating a common resource from the path cost, but keep a reference to the common resource along with its generation cost. If one path uses more common resources than another path, it can cost-dominate the other only if it is cheaper together with the cost of its extra common resources. If one path uses fewer or equal common resources than another, it can dominate the other by plain cost comparison.

Using these rules, we can gather a plurality of paths for each child join, taking common resources into account. After that, we can apply some global optimization, finding which common resources are worth generating to reduce the global cost.

However, I understand this is a huge amount of work, given that we would have to introduce new basic optimizer concepts. I get that the main application of this patch is sharding. If we have global tables residing on each shard, we can push down any joins with them. Given that this patch also gives some optimization for the non-sharded case, I think we *probably* can accept its concept even though this optimization is obviously not perfect.

------
Regards,
Alexander Korotkov
Supabase
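A hedged sketch of the dominance rule from point 3, using hypothetical structures that do not exist in today's planner; everything here is illustrative:

#include <stdbool.h>

/*
 * Hypothetical structures (not in PostgreSQL): a path carries the set
 * of common resources it relies on; each resource has a one-time
 * generation cost that is excluded from the path's own cost.
 */
typedef struct CommonResource
{
    int     resid;              /* identity of the shared resource */
    double  generation_cost;    /* one-time cost to build it */
} CommonResource;

typedef struct PathSketch
{
    double          total_cost; /* cost excluding resource generation */
    CommonResource *resources;  /* resources this path depends on */
    int             nresources;
} PathSketch;

/* Sum of generation costs of resources used by b but not by a. */
static double
extra_resource_cost(const PathSketch *a, const PathSketch *b)
{
    double  extra = 0.0;

    for (int i = 0; i < b->nresources; i++)
    {
        bool    shared = false;

        for (int j = 0; j < a->nresources; j++)
        {
            if (a->resources[j].resid == b->resources[i].resid)
            {
                shared = true;
                break;
            }
        }
        if (!shared)
            extra += b->resources[i].generation_cost;
    }
    return extra;
}

/*
 * Dominance rule from the mail: a path needing extra common resources
 * must be cheaper even after paying for them; a path needing no extra
 * resources dominates by plain cost.
 */
static bool
dominates(const PathSketch *a, const PathSketch *b)
{
    return a->total_cost + extra_resource_cost(b, a) < b->total_cost;
}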
On Sun, May 5, 2024 at 5:55 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 18/10/2023 16:59, Ashutosh Bapat wrote:
> > On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
> >>> The relid is also used to track the scans at the executor level.
> >>> Since we have many scans on A, each possibly using a different
> >>> plan, we will need different ids for those.
> >>
> >> I don't understand this sentence. In what way does the executor use
> >> this index of the RelOptInfo?
> >
> > See Scan::scanrelid.
>
> Hi,
>
> In the attachment, you will find a fresh version of the patch.
> I've analysed the danger of reusing the same RelOptInfo index in the
> executor. In the examples I found (which scared me at first), it is
> still not a problem, because ExecQual() does all the work in one
> operation and doesn't overlap with other operations. Of course, it is
> not a good design, and we will work on this issue. But at least this
> code can be used in experiments.
> Furthermore, I've shared some reflections on this feature. To avoid
> cluttering the thread, I've published them in [1]. These thoughts
> provide additional context and considerations for our ongoing work.
>
> [1]
> https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy

I've rebased the patch onto the current master. Also, I didn't like the needFlatCopy argument of reparameterize_path_by_child(); it looked quite awkward. Instead, since we need to copy paths anyway, I've enabled native copy support for paths. Now the caller can do just copyObject() over a path. That looks much cleaner to me. What do you think?

Other notes:
1) I think we need to cover the cases that is_inner_rel_safe_for_asymmetric_join() filters out with regression tests.
2) is_asymmetric_join() looks awkward to me. Should we instead add a flag to JoinPath?
3) I understand that you re-use the same RelOptInfo multiple times, and that it's too late a stage of query processing to add a simple relation into the planner structs. I tried rescans issued by cursors and EvalPlanQual() caused by concurrent updates, but didn't manage to break this. It seems that even if the same relation index is used multiple times in different places of a query, it never gets used simultaneously. But even if this somehow is OK, this is a significant change to the assumptions of the planner/executor data structures. Perhaps we need at least Tom's opinion on this.

------
Regards,
Alexander Korotkov
Supabase
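A hedged sketch of the calling convention described above, assuming the patch's newly added copy support for Path nodes (stock PostgreSQL does not support copyObject() on paths); the wrapper name is illustrative:

#include "postgres.h"
#include "nodes/nodes.h"
#include "optimizer/pathnode.h"

/*
 * Illustrative only: with copy support enabled for paths, the caller
 * copies the path tree up front instead of passing a needFlatCopy flag
 * down into reparameterize_path_by_child().
 */
static Path *
reparameterize_copy_sketch(PlannerInfo *root, Path *path,
                           RelOptInfo *child_rel)
{
    Path   *copy = (Path *) copyObject(path);   /* needs patched copyfuncs */

    return reparameterize_path_by_child(root, copy, child_rel);
}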
On 1/8/2024 20:56, Alexander Korotkov wrote:
> On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> Actually, the idea I tried to express is a combination of #1 and #2:
> build an individual plan for every partition, but take the 'common
> resources' into account. Let me explain this a bit more.

Thanks for keeping your eye on it!

> My idea is to introduce a new property for path selection.
> 3) Usage of common resources. A common resource can be a hash
> representation of a relation, a memoize node over a relation scan,
> etc. We can exclude the cost of generating a common resource from the
> path cost, but keep a reference to the common resource along with its
> generation cost. If one path uses more common resources than another
> path, it can cost-dominate the other only if it is cheaper together
> with the cost of its extra common resources. If one path uses fewer or
> equal common resources than another, it can dominate the other by
> plain cost comparison.

The most challenging part for me is the cost calculation, which is tied to the estimates of other paths. To estimate the effect correctly, we need to remember at least the total number of paths sharing the resource. Also, I wonder whether it can cause corner cases where a prediction error on a shared resource leads to an even worse situation upstream. I think we could start here from an example and a counter-example, but I still can't find them.

> However, I understand this is a huge amount of work, given that we
> would have to introduce new basic optimizer concepts. I get that the
> main application of this patch is sharding. If we have global tables
> residing on each shard, we can push down any joins with them. Given
> that this patch also gives some optimization for the non-sharded case,
> I think we *probably* can accept its concept even though this
> optimization is obviously not perfect.

Yes, right now sharding is the most profitable case. We can push down the parts of a plan that reference only common resources: FunctionScan, ValuesScan, and tables which can be proven to exist everywhere and provide the same output. But for now that is too far from the core code, IMO. So, I am searching for cases that can be helpful on a single instance.

--
regards,
Andrei Lepikhov
Postgres Professional