Thread: Asymmetric partition-wise JOIN

Asymmetric partition-wise JOIN

From: Kohei KaiGai
Hello,

The PostgreSQL optimizer currently considers join pairs only between two non-partitioned relations, or between two partition leaves. However, it is harmless, and makes sense, to also consider a join pair between a non-partitioned relation and a partition leaf.

See the example below. ptable is partitioned by hash and contains 10M rows; ftable is not partitioned and contains 50 rows. Most values of ptable.fkey have no matching row in this join.

create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
insert into ptable (select x % 10000, md5(x::text) from generate_series(1,10000000) x);

create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from generate_series(1,50) x);
vacuum analyze;

postgres=# explain analyze select count(*) from ptable p, ftable f where p.fkey = f.pkey;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=266393.38..266393.39 rows=1 width=8) (actual time=2333.193..2333.194 rows=1 loops=1)
   ->  Hash Join  (cost=2.12..260143.38 rows=2500000 width=0) (actual time=0.056..2330.079 rows=50000 loops=1)
         Hash Cond: (p.fkey = f.pkey)
         ->  Append  (cost=0.00..233335.00 rows=10000000 width=4) (actual time=0.012..1617.268 rows=10000000 loops=1)
               ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796 loops=1)
               ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025 loops=1)
               ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179 loops=1)
         ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.033..0.034 rows=50 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 10kB
               ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.017 rows=50 loops=1)
 Planning Time: 0.286 ms
 Execution Time: 2333.264 ms
(12 rows)

We can manually rewrite this query as follows:

postgres=# explain analyze select count(*) from (
              select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
              select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
              select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;

Because Append no longer processes tuples that have no matching tuple in ftable, this query has a cheaper cost and a shorter execution time.
(2333 ms --> 1396 ms)

postgres=# explain analyze select count(*) from (
              select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
              select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
              select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=210478.25..210478.26 rows=1 width=8) (actual time=1396.024..1396.024 rows=1 loops=1)
   ->  Append  (cost=2.12..210353.14 rows=50042 width=0) (actual time=0.058..1393.008 rows=50000 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=2.12..70023.66 rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
               ->  Hash Join  (cost=2.12..69856.40 rows=16726 width=72) (actual time=0.056..571.718 rows=16789 loops=1)
                     Hash Cond: (p.fkey = f.pkey)
                     ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.034..0.035 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=2.12..70027.43 rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
               ->  Hash Join  (cost=2.12..69861.26 rows=16617 width=72) (actual time=0.036..408.626 rows=16578 loops=1)
                     Hash Cond: (p_1.fkey = f_1.pkey)
                     ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422 rows=3333025 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.020..0.020 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_1  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 3"  (cost=2.12..70051.84 rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
               ->  Hash Join  (cost=2.12..69884.85 rows=16699 width=72) (actual time=0.025..406.048 rows=16633 loops=1)
                     Hash Cond: (p_2.fkey = f_2.pkey)
                     ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015 rows=3334179 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.014..0.014 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_2  (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50 loops=1)
 Planning Time: 0.614 ms
 Execution Time: 1396.131 ms
(25 rows)

What are your opinions on this kind of asymmetric partition-wise JOIN support in the optimizer?
I think we can harmlessly push down inner joins and left joins when the partition leaf is on the left side.

We probably need to implement two key pieces of functionality:
1. Construction of a RelOptInfo for the join between the non-partitioned table and each partition leaf.
    Instead of JoinPaths, this logic adds an AppendPath that takes the asymmetric partition-wise join paths as sub-paths. The rest of the optimization logic stays as it is today.
2. Allowing the hash table built from the non-partitioned table's scan to be shared across the individual partition leaves.
    In the example above, the SeqScan on ftable and the related Hash path build identical hash tables for the upcoming hash joins. If sibling paths produce equivalent results, it is reasonable to reuse the result.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



Re: Asymmetric partition-wise JOIN

From: Kohei KaiGai
Hello,

Even though nobody has responded to the thread, I made a prototype of the asymmetric partition-wise join support.
This feature joins the non-partitioned relation to each partition before the Append.

See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int, (1000*random())::int from generate_series(1,1000000) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, a join of ptable and t1 on a = aid eliminates almost 95% of the rows.
However, t1 is not partitioned and the join keys are not partition keys.
So, Append must process a million rows first; then HashJoin processes the rows read from the partitioned table, and 95% of them are eventually dropped.
In other words, 95% of the work done by Append is a waste of time and CPU cycles.

postgres=# explain select * from ptable, t1 where a = aid;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=2.12..24658.62 rows=49950 width=49)
   Hash Cond: (ptable_p0.a = t1.aid)
   ->  Append  (cost=0.00..20407.00 rows=1000000 width=12)
         ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
         ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
         ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
   ->  Hash  (cost=1.50..1.50 rows=50 width=37)
         ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(8 rows)

The asymmetric partition-wise join allows non-partitioned tables to be joined to the individual partitions prior to the Append.

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Append  (cost=2.12..19912.62 rows=49950 width=49)
   ->  Hash Join  (cost=2.12..6552.96 rows=16647 width=49)
         Hash Cond: (ptable_p0.a = t1.aid)
         ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6557.29 rows=16658 width=49)
         Hash Cond: (ptable_p1.a = t1.aid)
         ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6552.62 rows=16645 width=49)
         Hash Cond: (ptable_p2.a = t1.aid)
         ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
         ->  Hash  (cost=1.50..1.50 rows=50 width=37)
               ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider the join ptable X t1 above to be equivalent to:
  (ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, rows are already reduced by the HashJoin in each individual leaf of the Append, so the CPU cycles consumed by the Append node are cheaper.
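For instance, the identity can be sanity-checked by hand with the tables defined above; both of the following queries should return the same count (a sketch of the manual rewrite, mirroring the UNION ALL form from the first message):

```sql
-- Original join against the partitioned parent.
SELECT count(*) FROM ptable, t1 WHERE a = aid;

-- Distributed form: the same join pushed down to each partition leaf.
SELECT count(*) FROM (
  SELECT * FROM ptable_p0 p, t1 WHERE p.a = t1.aid
  UNION ALL
  SELECT * FROM ptable_p1 p, t1 WHERE p.a = t1.aid
  UNION ALL
  SELECT * FROM ptable_p2 p, t1 WHERE p.a = t1.aid) subqry;
```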

On the other hand, it has a downside: t1 must be read three times and its hash table must also be built three times. This increases the expected cost, so the planner may not choose the asymmetric partition-wise join plan.

One idea I have is to let sibling HashJoins share a hash table that is built once by any of the sibling Hash plans. Right now, this is not implemented yet.

What are your thoughts on this feature?

Best regards,


--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

Attachment

Re: Asymmetric partition-wise JOIN

From: Thomas Munro
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
> We can consider the table join ptable X t1 above is equivalent to:
>   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> It returns an equivalent result, however, rows are already reduced by HashJoin
> in the individual leaf of Append, so CPU-cycles consumed by Append node can
> be cheaper.
>
> On the other hands, it has a downside because t1 must be read 3 times and
> hash table also must be built 3 times. It increases the expected cost,
> so planner
> may not choose the asymmetric partition-wise join plan.

What if you include the partition constraint as a filter on t1?  So you get:

ptable X t1 =
  (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
  (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
  (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
  (ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1.  The hash tables will not contain unnecessary junk.
2.  You'll get the right answer if t1 is on the outer side of an outer join.
3.  If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4.  The filter might enable a selective and efficient plan like an index scan.

Cons:
1.  The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because you don't usually see hash functions in WHERE clauses, but that could just as easily be dist BETWEEN 1 AND 99 or any other partition constraint.)
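For illustration, the per-partition filter could be spelled with the built-in satisfies_hash_partition() helper instead of a literal hash(dist)%N expression. This is a hypothetical sketch: it assumes a 4-way hash-partitioned ptable and that the join key is the partition key dist, with t1 carrying a matching dist column (the example tables above do not have one):

```sql
-- One leg of the rewritten join; repeat with remainders 1..3 and UNION ALL.
SELECT *
FROM ptable_p0 p
JOIN (SELECT * FROM t1
      WHERE satisfies_hash_partition('ptable'::regclass, 4, 0, dist)) s
  ON p.dist = s.dist;
```

satisfies_hash_partition() evaluates the same hash-partition constraint the server uses, so each leg's inner side keeps only the rows that could possibly match that partition.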

> One idea I have is, sibling HashJoin shares a hash table that was built once
> by any of the sibling Hash plan. Right now, it is not implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition.  I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure.  That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table.  I ran into a number of problems
while thinking about that, which I should write about in another
thread.

--
Thomas Munro
https://enterprisedb.com



Re: Asymmetric partition-wise JOIN

From: Kohei KaiGai
On Sat, Aug 24, 2019 at 7:02, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
> > We can consider the table join ptable X t1 above is equivalent to:
> >   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> > It returns an equivalent result, however, rows are already reduced by HashJoin
> > in the individual leaf of Append, so CPU-cycles consumed by Append node can
> > be cheaper.
> >
> > On the other hands, it has a downside because t1 must be read 3 times and
> > hash table also must be built 3 times. It increases the expected cost,
> > so planner
> > may not choose the asymmetric partition-wise join plan.
>
> What if you include the partition constraint as a filter on t1?  So you get:
>
> ptable X t1 =
>   (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
>   (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
>   (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
>   (ptable_p3 X (σ hash(dist)%4=3 (t1)))
>
> Pros:
> 1.  The hash tables will not contain unnecessary junk.
> 2.  You'll get the right answer if t1 is on the outer side of an outer join.
> 3.  If this runs underneath a Parallel Append and t1 is big enough
> then workers will hopefully cooperate and do a synchronised scan of
> t1.
> 4.  The filter might enable a selective and efficient plan like an index scan.
>
> Cons:
> 1.  The filter might not enable a selective and efficient plan, and
> therefore cause extra work.
>
> (It's a little weird in this example because don't usually see hash
> functions in WHERE clauses, but that could just as easily be dist
> BETWEEN 1 AND 99 or any other partition constraint.)
>
It requires that the join key include the partition key and that it be an equality join, doesn't it?
If ptable and t1 are joined using ptable.dist = t1.foo, we can indeed distribute t1 to each leaf table with "WHERE hash(foo)%4 = xxx" according to the partition bounds.

In cases where some partition leaves are pruned, it is clearly beneficial, because rows that would only be referenced by the pruned child relations are a waste of memory.

On the other hand, if no leaves are pruned, and if we could extend the Hash node to share its hash table with the sibling join nodes, it eventually consumes almost the same amount of memory to load the inner relations.

> > One idea I have is, sibling HashJoin shares a hash table that was built once
> > by any of the sibling Hash plan. Right now, it is not implemented yet.
>
> Yeah, I've thought a little bit about that in the context of Parallel
> Repartition.  I'm interested in combining intra-node partitioning
> (where a single plan node repartitions data among workers on the fly)
> with inter-node partitioning (like PWJ, where partitions are handled
> by different parts of the plan, considered at planning time); you
> finish up needing to have nodes in the plan that 'receive' tuples for
> each partition, to match up with the PWJ plan structure.  That's not
> entirely unlike CTE references, and not entirely unlike your idea of
> somehow sharing the same hash table.  I ran into a number of problems
> while thinking about that, which I should write about in another
> thread.
>
Hmm. Do you mean that the inner path may behave differently according to the partition-bounds definition of the outer path it is joined with? Let me investigate its pros and cons.

The reasons why I think the idea of sharing the same hash table is reasonable in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on a DSM segment, once built, can be shared by all the siblings in all the parallel workers.
2. We can save memory regardless of the join keys and partition keys, even if these are not involved in the query.

On the other hand, below are the downsides. Potentially, combined use of your idea may help these cases:
3. The distributed inner relation cannot be the outer side of an OUTER JOIN.
4. The hash table contains rows that are only referenced by pruned partition leaves.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



Re: Asymmetric partition-wise JOIN

From: Michael Paquier
On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
> On the other hands, it eventually consumes almost equivalent amount
> of memory to load the inner relations, if no leafs are pruned, and if we
> could extend the Hash-node to share the hash-table with sibling
> join-nodess.

The patch crashes when running the regression tests, per the report of
the automatic patch tester.  Could you look at that?  I have moved the
patch to the next CF, waiting on author.
--
Michael

Attachment

Re: Asymmetric partition-wise JOIN

From: Kohei KaiGai
Hello,

This crash was reproduced in our environment as well.
It looks to me like adjust_child_relids_multilevel() did not expect the case where the supplied 'relids' (partially) indicate a normal, non-partitioned relation.
It tries to build a new 'parent_relids', the set of appinfo->parent_relid values related to the supplied 'child_relids'. However, bits in child_relids that indicate normal relations are unintentionally dropped here, so adjust_child_relids_multilevel() recurses infinitely until it hits the stack limit.

The attached v2 fixes the problem, and the regression tests finish correctly.

Best regards,

On Sun, Dec 1, 2019 at 12:24, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
> > On the other hands, it eventually consumes almost equivalent amount
> > of memory to load the inner relations, if no leafs are pruned, and if we
> > could extend the Hash-node to share the hash-table with sibling
> > join-nodess.
>
> The patch crashes when running the regression tests, per the report of
> the automatic patch tester.  Could you look at that?  I have moved the
> patch to nexf CF, waiting on author.
> --
> Michael



--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

Attachment

Re: Asymmetric partition-wise JOIN

From: David Steele
Hi Thomas,

On 12/27/19 2:34 AM, Kohei KaiGai wrote:
> This crash was reproduced on our environment also.
> It looks to me adjust_child_relids_multilevel() didn't expect a case
> when supplied 'relids'
> (partially) indicate normal and non-partitioned relation.
> It tries to build a new 'parent_relids' that is a set of
> appinfo->parent_relid related to the
> supplied 'child_relids'. However, bits in child_relids that indicate
> normal relations are
> unintentionally dropped here. Then, adjust_child_relids_multilevel()
> goes to an infinite
> recursion until stack limitation.
> 
> The attached v2 fixed the problem, and regression test finished correctly.

Any thoughts on the new version of this patch?

Regards,
-- 
-David
david@pgmasters.net



Re: Asymmetric partition-wise JOIN

From: Daniel Gustafsson
> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:

> The attached v2 fixed the problem, and regression test finished correctly.

This patch no longer applies to HEAD, please submit a rebased version.
Marking the entry Waiting on Author in the meantime.

cheers ./daniel



Re: Asymmetric partition-wise JOIN

From: "Andrey V. Lepikhov"
On 12/27/19 12:34 PM, Kohei KaiGai wrote:
> The attached v2 fixed the problem, and regression test finished correctly.
Using your patch, I saw an incorrect value for the predicted number of rows at the top node
of the plan: "Append  (cost=270.02..35165.37 rows=40004 width=16)"
The full explain of the query plan is in the attachment
explain_with_asymmetric.sql

If I disable enable_partitionwise_join, then:
"Hash Join  (cost=270.02..38855.25 rows=10001 width=16)"
Full explain: explain_no_asymmetric.sql

I thought this was a case of incorrect usage of cached norm_selec
values, but it is a corner-case problem of the eqjoinsel() routine:

selectivity = 1/size_of_larger_relation; (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607)

i.e., the estimated number of join tuples depends only on the size of the smaller relation.
It is not a bug in your patch, but I think you should know about it because it
may affect planner decisions.
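Concretely, a back-of-the-envelope sketch of these two formulas, assuming the test case below with ~1e6 rows in parts split over four hash partitions (the partition count is an assumption inferred from the estimates) and ~1e4 rows in t0:

```sql
-- Whole-table join: selectivity = 1/1e6, so the estimate collapses to the
-- size of the smaller relation (~10001 rows, matching the Hash Join plan).
SELECT (1.0/1000000) * 1000000 * 10001;

-- Per-partition join: 1/250000 * 250000 * 10001 ~ 10001 rows per partition,
-- and the Append sums four of them (~40004 rows, matching the Append plan).
SELECT 4 * (1.0/250000) * 250000 * 10001;
```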

===
P.S. Test case:
CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a);
-- partitions must exist before the INSERT; e.g. with modulus 4:
CREATE TABLE parts_p0 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE parts_p1 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE parts_p2 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE parts_p3 PARTITION OF parts FOR VALUES WITH (MODULUS 4, REMAINDER 3);
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From: "Andrey V. Lepikhov"
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
> 
>> The attached v2 fixed the problem, and regression test finished correctly.
> 
> This patch no longer applies to HEAD, please submit an rebased version.
> Marking the entry Waiting on Author in the meantime.
Rebased version of the patch on current master (d259afa736).

I rebased it because it is the base of my experimental feature, in which we
don't break partition-wise join between a relation with a foreign partition and
a local relation if we have information that the remote server has a foreign table
linked to the local relation (by analogy with shippable extensions).

Maybe mark as 'Needs review'?

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From: Daniel Gustafsson
> On 21 Aug 2020, at 08:02, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>>> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
>>> The attached v2 fixed the problem, and regression test finished correctly.
>> This patch no longer applies to HEAD, please submit an rebased version.
>> Marking the entry Waiting on Author in the meantime.
> Rebased version of the patch on current master (d259afa736).
>
> I rebased it because it is a base of my experimental feature than we don't break partitionwise join of a relation
> with foreign partition and a local relation if we have info that remote server has foreign table link to the local
> relation (by analogy with shippable extensions).
>
> Maybe mark as 'Needs review'?

Thanks for the rebase, I've updated the commitfest entry to reflect that it
needs a round of review.

cheers ./daniel


Re: Asymmetric partition-wise JOIN

From: Amul Sul
On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai <kaigai@heterodb.com> wrote:
>
> On Sat, Aug 24, 2019 at 7:02, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
> > > We can consider the table join ptable X t1 above is equivalent to:
> > >   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> > > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> > > It returns an equivalent result, however, rows are already reduced by HashJoin
> > > in the individual leaf of Append, so CPU-cycles consumed by Append node can
> > > be cheaper.
> > >
> > > On the other hands, it has a downside because t1 must be read 3 times and
> > > hash table also must be built 3 times. It increases the expected cost,
> > > so planner
> > > may not choose the asymmetric partition-wise join plan.
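[The distribution identity above can be checked with a tiny sketch. This is plain illustrative Python, not planner code; `hash_join()` and the integer "rows" are made up for the example.]

```python
# Toy check of: ptable X t1 == (p0 X t1) + (p1 X t1) + (p2 X t1).
# Rows are plain integers; hash_join() builds a set on the inner
# and probes it with the outer, like a one-column equi-join.

def hash_join(outer, inner):
    table = set(inner)                          # "Hash" node: build side
    return [o for o in outer if o in table]     # probe side

parts = [list(range(r, 10, 3)) for r in range(3)]  # 3 hash partitions of ptable
t1 = [0, 1, 2, 3, 42]

# join above the Append: Append first, then one big join
above = sorted(hash_join([x for p in parts for x in p], t1))
# join below the Append: one small join per partition, then Append
below = sorted(x for p in parts for x in hash_join(p, t1))

assert above == below == [0, 1, 2, 3]
```

Both shapes return the same rows; the difference is only where the reduction happens, which is the point of the cost trade-off above.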
> >
> > What if you include the partition constraint as a filter on t1?  So you get:
> >
> > ptable X t1 =
> >   (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
> >   (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
> >   (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
> >   (ptable_p3 X (σ hash(dist)%4=3 (t1)))
> >
> > Pros:
> > 1.  The hash tables will not contain unnecessary junk.
> > 2.  You'll get the right answer if t1 is on the outer side of an outer join.
> > 3.  If this runs underneath a Parallel Append and t1 is big enough
> > then workers will hopefully cooperate and do a synchronised scan of
> > t1.
> > 4.  The filter might enable a selective and efficient plan like an index scan.
> >
> > Cons:
> > 1.  The filter might not enable a selective and efficient plan, and
> > therefore cause extra work.
> >
> > (It's a little weird in this example because we don't usually see hash
> > functions in WHERE clauses, but that could just as easily be dist
> > BETWEEN 1 AND 99 or any other partition constraint.)
> >
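[The filter-pushdown idea quoted above can be sketched in plain Python; this is illustrative only, with made-up integer rows and `x % MODULUS` standing in for the partition constraint. Each per-sibling hash table only holds t1 rows that can match its partition.]

```python
MODULUS = 4
outer = list(range(1, 100))
t1 = list(range(1, 20))

# hash partitions of the outer table, keyed by remainder
parts = {r: [x for x in outer if x % MODULUS == r] for r in range(MODULUS)}

joined = []
for r, part in parts.items():
    filtered = [x for x in t1 if x % MODULUS == r]  # σ x % 4 = r applied to t1
    table = set(filtered)                           # smaller per-sibling hash table
    joined.extend(x for x in part if x in table)

# same answer as the unfiltered join of outer against t1
assert sorted(joined) == sorted(set(outer) & set(t1))
```

Every matching row lands in exactly one partition, so filtering the build side per sibling changes memory use, not the result.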
> It requires the join-key must include the partition key and also must be
> equality-join, doesn't it?
> If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
> t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
> the partition bounds, indeed.
>
> In case when some of partition leafs are pruned, it is exactly beneficial
> because relevant rows to be referenced by the pruned child relations
> are waste of memory.
>
> On the other hand, it eventually consumes almost an equivalent amount
> of memory to load the inner relations, if no leafs are pruned, and if we
> could extend the Hash-node to share the hash-table with sibling join-nodes.
>
> > > One idea I have is, sibling HashJoin shares a hash table that was built once
> > > by any of the sibling Hash plan. Right now, it is not implemented yet.
> >
> > Yeah, I've thought a little bit about that in the context of Parallel
> > Repartition.  I'm interested in combining intra-node partitioning
> > (where a single plan node repartitions data among workers on the fly)
> > with inter-node partitioning (like PWJ, where partitions are handled
> > by different parts of the plan, considered at planning time); you
> > finish up needing to have nodes in the plan that 'receive' tuples for
> > each partition, to match up with the PWJ plan structure.  That's not
> > entirely unlike CTE references, and not entirely unlike your idea of
> > somehow sharing the same hash table.  I ran into a number of problems
> > while thinking about that, which I should write about in another
> > thread.
> >
> Hmm. Do you intend the inner-path may have different behavior according
> to the partition bounds definition where the outer-path to be joined?
> Let me investigate its pros & cons.
>
> The reasons why I think the idea of sharing the same hash table is reasonable
> in this scenario are:
> 1. We can easily extend the idea for parallel optimization. A hash table on DSM
>     segment, once built, can be shared by all the siblings in all the
> parallel workers.
> 2. We can save the memory consumption regardless of the join-keys and
>     partition-keys, even if these are not involved in the query.
>
> On the other hands, below are the downside. Potentially, combined use of
> your idea may help these cases:
> 3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
> 4. Hash table contains rows to be referenced by only pruned partition leafs.
>

+ many, for the sharable hash of the inner table of the join. IMHO,
this could be the most interesting and captivating thing about this feature.
It might be a complicated piece, though; is that still on the plan?
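[For intuition, the sharable-hash idea amounts to building the inner hash table once and letting every sibling join probe it. A toy sketch only, with made-up tuples; the actual proposal involves a hash table in a DSM segment shared across parallel workers.]

```python
def build_hash(inner):
    """Build the join hash table once, keyed on the join column."""
    table = {}
    for key, payload in inner:
        table.setdefault(key, []).append((key, payload))
    return table

inner = [(1, 'a'), (2, 'b')]
shared = build_hash(inner)          # built once, shared by all sibling joins

parts = [[(1, 'p0')], [(2, 'p1')], [(3, 'p2')]]
matches = [(o, i) for part in parts              # each sibling HashJoin ...
                  for o in part
                  for i in shared.get(o[0], [])]  # ... probes the same table

assert matches == [((1, 'p0'), (1, 'a')), ((2, 'p1'), (2, 'b'))]
```

The savings are the N-1 avoided rebuilds of the inner hash table, at the cost of it holding rows some siblings never need.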

Regards,
Amul



Re: Asymmetric partition-wise JOIN

From
Anastasia Lubennikova
Date:
On 21.08.2020 09:02, Andrey V. Lepikhov wrote:
> On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>>> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
>>
>>> The attached v2 fixed the problem, and regression test finished 
>>> correctly.
>>
>> This patch no longer applies to HEAD, please submit a rebased version.
>> Marking the entry Waiting on Author in the meantime.
> Rebased version of the patch on current master (d259afa736).
>
> I rebased it because it is a base of my experimental feature than we 
> don't break partitionwise join of a relation with foreign partition 
> and a local relation if we have info that remote server has foreign 
> table link to the local relation (by analogy with shippable extensions).
>
> Maybe mark as 'Needs review'?
>
Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a 
rebased version?

This thread was inactive for quite some time. Is anyone going to 
continue working on it?

I see some interest in the idea of sharable hash, but I don't see even a 
prototype in this thread. So, probably, it is a matter of a separate 
discussion.

Also, I took a look at the code. It looks like it needs some extra work. 
I am not a big expert in this area, so I'm sorry if questions are obvious.

1. What would happen if this assumption is not met?

+         * MEMO: We assume this pathlist keeps at least one AppendPath that
+         * represents partitioned table-scan, symmetric or asymmetric
+         * partition-wise join. It is not correct right now, however, a 
hook
+         * on add_path() to give additional decision for path removel 
allows
+         * to retain this kind of AppendPath, regardless of its cost.

2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into 
PG_TRY/PG_CATCH? What errors do we expect?

3. It looks like a crutch. If it isn't, I'd like to see a better comment 
about why "dynamic programming" is not applicable here.
And shouldn't we also handle a root->join_cur_level?

+                /* temporary disables "dynamic programming" algorithm */
+                root->join_rel_level = NULL;

4. This change looks like it can lead to a memory leak for old code. 
Maybe it is never the case, but again I think it is worth a comment.

-    /* If there's nothing to adjust, don't call this function. */
-    Assert(nappinfos >= 1 && appinfos != NULL);
+    /* If there's nothing to adjust, just return a duplication */
+    if (nappinfos == 0)
+        return copyObject(node);

5. extract_asymmetric_partitionwise_subjoin() lacks a comment

The new status of this patch is: Waiting on Author

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Asymmetric partition-wise JOIN

From
Anastasia Lubennikova
Date:
On 09.11.2020 13:53, Anastasia Lubennikova wrote:
> On 21.08.2020 09:02, Andrey V. Lepikhov wrote:
>> On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>>>> On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
>>>
>>>> The attached v2 fixed the problem, and regression test finished 
>>>> correctly.
>>>
>>> This patch no longer applies to HEAD, please submit a rebased version.
>>> Marking the entry Waiting on Author in the meantime.
>> Rebased version of the patch on current master (d259afa736).
>>
>> I rebased it because it is a base of my experimental feature than we 
>> don't break partitionwise join of a relation with foreign partition 
>> and a local relation if we have info that remote server has foreign 
>> table link to the local relation (by analogy with shippable extensions).
>>
>> Maybe mark as 'Needs review'?
>>
> Status update for a commitfest entry.
>
> According to cfbot, the patch fails to apply. Could you please send a 
> rebased version?
>
> This thread was inactive for quite some time. Is anyone going to 
> continue working on it?
>
> I see some interest in the idea of sharable hash, but I don't see even 
> a prototype in this thread. So, probably, it is a matter of a separate 
> discussion.
>
> Also, I took a look at the code. It looks like it needs some extra 
> work. I am not a big expert in this area, so I'm sorry if questions 
> are obvious.
>
> 1. What would happen if this assumption is not met?
>
> +         * MEMO: We assume this pathlist keeps at least one 
> AppendPath that
> +         * represents partitioned table-scan, symmetric or asymmetric
> +         * partition-wise join. It is not correct right now, however, 
> a hook
> +         * on add_path() to give additional decision for path removel 
> allows
> +         * to retain this kind of AppendPath, regardless of its cost.
>
> 2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into 
> PG_TRY/PG_CATCH? What errors do we expect?
>
> 3. It looks like a crutch. If it isn't, I'd like to see a better 
> comment about why "dynamic programming" is not applicable here.
> And shouldn't we also handle a root->join_cur_level?
>
> +                /* temporary disables "dynamic programming" algorithm */
> +                root->join_rel_level = NULL;
>
> 4. This change looks like it can lead to a memory leak for old code. 
> Maybe it is never the case, but again I think it worth a comment.
>
> -    /* If there's nothing to adjust, don't call this function. */
> -    Assert(nappinfos >= 1 && appinfos != NULL);
> +    /* If there's nothing to adjust, just return a duplication */
> +    if (nappinfos == 0)
> +        return copyObject(node);
>
> 5. extract_asymmetric_partitionwise_subjoin() lacks a comment
>
> The new status of this patch is: Waiting on Author
>
Status update for a commitfest entry.

This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Asymmetric partition-wise JOIN

From
"Andrey V. Lepikhov"
Date:
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
> This entry was inactive during this CF, so I've marked it as returned 
> with feedback. Feel free to resubmit an updated version to a future 
> commitfest.
> 
The attached version is rebased on current master and fixes problems with 
complex parameterized plans - the 'reparameterize by child' feature.
Problems with the reparameterization machinery can be demonstrated with 
the TPC-H benchmark.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
"Andrey V. Lepikhov"
Date:
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
> This entry was inactive during this CF, so I've marked it as returned 
> with feedback. Feel free to resubmit an updated version to a future 
> commitfest. 
I return the patch to the commitfest. My current motivation differs from 
that of the original author.
This patch can open a door for more complex optimizations in the 
partitionwise join push-down technique.
I mean, we can push down a join not only of two partitioned tables with 
the same partition schema, but also of a partitioned (sharded) table with 
an arbitrary subplan that is provably independent of local resources.

Example:

CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);

SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s
WHERE p.a=s.a;

  Hash Join
    Hash Cond: (p.a = a.a)
    ->  Append
          ->  Seq Scan on p1 p_1
          ->  Seq Scan on p2 p_2
          ->  Seq Scan on p3 p_3
    ->  Hash
          ->  Function Scan on generate_series a

But with asymmetric join feature we have the plan:

  Append
    ->  Hash Join
          Hash Cond: (p_1.a = a.a)
          ->  Seq Scan on p1 p_1
          ->  Hash
                ->  Function Scan on generate_series a
    ->  Hash Join
          Hash Cond: (p_2.a = a.a)
          ->  Seq Scan on p2 p_2
          ->  Hash
                ->  Function Scan on generate_series a
    ->  Hash Join
          Hash Cond: (p_3.a = a.a)
          ->  Seq Scan on p3 p_3
          ->  Hash
                ->  Function Scan on generate_series a

In the case of FDW-sharding this means that if we can prove the inner 
relation is independent of the executing server, we can push down these 
joins and execute them in parallel.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asymmetric partition-wise JOIN

From
Andrey Lepikhov
Date:
Next version of the patch.
To search for problems, I force-enabled this patch during the 'make check' 
tests. Some bugs were found and fixed.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Alexander Pyhalov
Date:
Andrey Lepikhov писал 2021-05-27 07:27:
> Next version of the patch.
> For searching any problems I forced this patch during 'make check'
> tests. Some bugs were found and fixed.

Hi.
I've tested this patch and haven't found issues, but I have some 
comments.

src/backend/optimizer/path/joinrels.c:

1554
1555 /*
1556  * Build RelOptInfo on JOIN of each partition of the outer relation 
and the inner
1557  * relation. Return List of such RelOptInfo's. Return NIL, if at 
least one of
1558  * these JOINs are impossible to build.
1559  */
1560 static List *
1561 extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
1562                                          RelOptInfo *joinrel,
1563                                          AppendPath *append_path,
1564                                          RelOptInfo *inner_rel,
1565                                          JoinType jointype,
1566                                          JoinPathExtraData *extra)
1567 {
1568         List            *result = NIL;
1569         ListCell        *lc;
1570
1571         foreach (lc, append_path->subpaths)
1572         {
1573                 Path                    *child_path = lfirst(lc);
1574                 RelOptInfo              *child_rel = child_path->parent;
1575                 Relids                  child_join_relids;
1576                 Relids                  parent_relids;
1577                 RelOptInfo              *child_join_rel;
1578                 SpecialJoinInfo *child_sjinfo;
1579                 List                    *child_restrictlist;


Variable names - child_join_rel and child_join_relids seem to be 
inconsistent with rest of the file (I see child_joinrelids in 
try_partitionwise_join() and child_joinrel in try_partitionwise_join() 
and get_matching_part_pairs()).


1595                         child_join_rel = build_child_join_rel(root,
1596                                                               child_rel,
1597                                                               inner_rel,
1598                                                               joinrel,
1599                                                               child_restrictlist,
1600                                                               child_sjinfo,
1601                                                               jointype);
1602                         if (!child_join_rel)
1603                         {
1604                                 /*
1605                                  * If can't build JOIN between inner relation and one of the outer
1606                                  * partitions - return immediately.
1607                                  */
1608                                 return NIL;
1609                         }

When build_child_join_rel() can return NULL?
If I read code correctly, joinrel is created in the begining of 
build_child_join_rel() with makeNode(), makeNode() wraps newNode() and 
newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(), 
which would error() on alloc() failure.

1637
1638 static bool
1639 is_asymmetric_join_capable(PlannerInfo *root,
1640                            RelOptInfo *outer_rel,
1641                            RelOptInfo *inner_rel,
1642                            JoinType jointype)
1643 {

Function misses a comment.

1656         /*
1657          * Don't allow asymmetric JOIN of two append subplans.
1658          * In the case of a parameterized NL join, a reparameterization procedure will
1659          * lead to large memory allocations and a CPU consumption:
1660          * each reparameterize will induce subpath duplication, creating new
1661          * ParamPathInfo instance and increasing of ppilist up to number of partitions
1662          * in the inner. Also, if we have many partitions, each bitmapset
1663          * variable will large and many leaks of such variable (caused by relid
1664          * replacement) will highly increase memory consumption.
1665          * So, we deny such paths for now.
1666          */


Missing word:
each bitmapset variable will large => each bitmapset variable will be 
large


1694         foreach (lc, outer_rel->pathlist)
1695         {
1696                 AppendPath *append_path = lfirst(lc);
1697
1698                 /*
1699                  * MEMO: We assume this pathlist keeps at least one AppendPath that
1700                  * represents partitioned table-scan, symmetric or asymmetric
1701                  * partition-wise join. It is not correct right now, however, a hook
1702                  * on add_path() to give additional decision for path removal allows
1703                  * to retain this kind of AppendPath, regardless of its cost.
1704                  */
1705                 if (IsA(append_path, AppendPath))


What hook do you refer to?

src/backend/optimizer/plan/setrefs.c:

282         /*
283          * Adjust RT indexes of AppendRelInfos and add to final appendrels list.
284          * We assume the AppendRelInfos were built during planning and don't need
285          * to be copied.
286          */
287         foreach(lc, root->append_rel_list)
288         {
289                 AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
290                 AppendRelInfo *newappinfo;
291
292                 /* flat copy is enough since all valuable fields are scalars */
293                 newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
294                 memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));

You've changed function to copy appinfo, so now comment is incorrect.

src/backend/optimizer/util/appendinfo.c:
588         /* Construct relids set for the immediate parent of the given child. */
589         normal_relids = bms_copy(child_relids);
590         for (cnt = 0; cnt < nappinfos; cnt++)
591         {
592                 AppendRelInfo *appinfo = appinfos[cnt];
593
594                 parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
595                 normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
596         }
597         parent_relids = bms_union(parent_relids, normal_relids);

Do I understand correctly that now parent_relids also contains relids of 
relations from 'global' inner relation, which we join to childs?
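[The bms_* logic quoted above can be mimicked with plain sets; the relid values here are hypothetical, chosen only to show that relids with no AppendRelInfo (the "global" inner side) pass through into the result untouched.]

```python
# Mirror of the appendinfo.c snippet: translate child relids to their
# parents, keeping any relid that has no AppendRelInfo entry.
appinfos = [{"parent_relid": 1, "child_relid": 4},
            {"parent_relid": 2, "child_relid": 5}]

child_relids = {4, 5, 7}        # 7 has no appinfo: the non-partitioned inner rel
normal_relids = set(child_relids)               # bms_copy
parent_relids = set()
for ai in appinfos:
    parent_relids.add(ai["parent_relid"])       # bms_add_member
    normal_relids.discard(ai["child_relid"])    # bms_del_member
parent_relids |= normal_relids                  # bms_union

# the untranslated relid 7 survives alongside the translated parents
assert parent_relids == {1, 2, 7}
```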


-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asymmetric partition-wise JOIN

From
Andrey Lepikhov
Date:
On 18/6/21 15:02, Alexander Pyhalov wrote:
> Andrey Lepikhov писал 2021-05-27 07:27:
>> Next version of the patch.
>> For searching any problems I forced this patch during 'make check'
>> tests. Some bugs were found and fixed.
> 
> Hi.
> I've tested this patch and haven't found issues, but I have some comments.
Thank you for review!
> Variable names - child_join_rel and child_join_relids seem to be 
> inconsistent with rest of the file
fixed

> When build_child_join_rel() can return NULL?
Fixed
> Missing word:
> each bitmapset variable will large => each bitmapset variable will be large
Fixed
> What hook do you refer to?
Removed
> You've changed function to copy appinfo, so now comment is incorrect.
Thanks, fixed
> Do I understand correctly that now parent_relids also contains relids of
> relations from 'global' inner relation, which we join to childs?
Yes

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Zhihong Yu
Date:


On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 18/6/21 15:02, Alexander Pyhalov wrote:
> Andrey Lepikhov писал 2021-05-27 07:27:
>> Next version of the patch.
>> For searching any problems I forced this patch during 'make check'
>> tests. Some bugs were found and fixed.
>
> Hi.
> I've tested this patch and haven't found issues, but I have some comments.
Thank you for review!
> Variable names - child_join_rel and child_join_relids seem to be
> inconsistent with rest of the file
fixed

> When build_child_join_rel() can return NULL?
Fixed
> Missing word:
> each bitmapset variable will large => each bitmapset variable will be large
Fixed
> What hook do you refer to?
Removed
> You've changed function to copy appinfo, so now comment is incorrect.
Thanks, fixed
> Do I understand correctly that now parent_relids also contains relids of
> relations from 'global' inner relation, which we join to childs?
Yes

--
regards,
Andrey Lepikhov
Postgres Professional
Hi,

relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.

CPU and memory huge consumption -> huge consumption of CPU and memory 

+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.

at least one of these JOINs are impossible to build. -> at least one of these JOINs is impossible to build.

+            * Can't imagine situation when join relation already exists. But in
+            * the 'partition_join' regression test it happens.
+            * It may be an indicator of possible problems.

Should a log be added in the above case ?

+is_asymmetric_join_capable(PlannerInfo *root,

is_asymmetric_join_capable -> is_asymmetric_join_feasible

Cheers

Re: Asymmetric partition-wise JOIN

From
Andrey Lepikhov
Date:
On 5/7/21 23:15, Zhihong Yu wrote:
> On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
> +            * Can't imagine situation when join relation already 
> exists. But in
> +            * the 'partition_join' regression test it happens.
> +            * It may be an indicator of possible problems.
> Should a log be added in the above case ?
I made an additional analysis of this branch of code. This situation can 
happen in the case of a single child, or if we join two plain tables with 
a partitioned one. Both situations are legal, so I don't think we need to 
add any log message here.
Other mistakes were fixed.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Alexander Pyhalov
Date:
Andrey Lepikhov писал 2021-07-06 12:28:
> On 5/7/21 23:15, Zhihong Yu wrote:
>> On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
>> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
>> +            * Can't imagine situation when join relation already 
>> exists. But in
>> +            * the 'partition_join' regression test it happens.
>> +            * It may be an indicator of possible problems.
>> Should a log be added in the above case ?
> I made additional analysis of this branch of code. This situation can
> happen in the case of one child or if we join two plane tables with
> partitioned. Both situations are legal and I think we don't needed to
> add any log message here.
> Other mistakes were fixed.

Hi.

Small  typo in comment in src/backend/optimizer/plan/setrefs.c:

  281
  282     /*
  283      * Adjust RT indexes of AppendRelInfos and add to final 
appendrels list.
  284      * The AppendRelInfos are copied, because as a part of a 
subplan its could
  285      * be visited many times in the case of asymmetric join.
  286      */
  287     foreach(lc, root->append_rel_list)
  288     {

its  -> it (or they) ?
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Re: Asymmetric partition-wise JOIN

From
Andrey Lepikhov
Date:
On 5/7/21 23:15, Zhihong Yu wrote:
> On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
> +            * Can't imagine situation when join relation already 
> exists. But in
> +            * the 'partition_join' regression test it happens.
> +            * It may be an indicator of possible problems.
> 
> Should a log be added in the above case ?
I worked more on this case and found a more serious mistake. While 
populating additional paths on an existing RelOptInfo, we can remove some 
previously generated paths that are referenced from a higher-level list of 
subplans, which could lose subplan links. I now prohibit such a situation 
(see the comments in the new version of the patch).
Also, the cheapest path is now chosen after appendrel creation.
Unstable tests were fixed.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Ibrar Ahmed
Date:


On Thu, Jul 15, 2021 at 11:32 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 5/7/21 23:15, Zhihong Yu wrote:
> On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
> +            * Can't imagine situation when join relation already
> exists. But in
> +            * the 'partition_join' regression test it happens.
> +            * It may be an indicator of possible problems.
>
> Should a log be added in the above case ?
I worked more on this case and found more serious mistake. During
population of additional paths on the existed RelOptInfo we can remove
some previously generated paths that pointed from a higher-level list of
subplans and it could cause to lost of subplan links. I prohibit such
situation (you can read comments in the new version of the patch).
Also, choosing of a cheapest path after appendrel creation was added.
Unstable tests were fixed.

--
regards,
Andrey Lepikhov
Postgres Professional

The patch is failing the regression tests; can you please take a look?

partition_join ... FAILED 6328 ms

--
Ibrar Ahmed

Re: Asymmetric partition-wise JOIN

From
Aleksander Alekseev
Date:
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any
tests. Changing the status to save time for reviewers.

The new status of this patch is: Waiting on Author

Re: Asymmetric partition-wise JOIN

From
Jaime Casanova
Date:
On Thu, Sep 09, 2021 at 09:50:46AM +0000, Aleksander Alekseev wrote:
> It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass
> any tests. Changing the status to save time for reviewers.
>
> The new status of this patch is: Waiting on Author

Just to give some more info to work on I found this patch made postgres
crash with a segmentation fault.

"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
126            if (shorter->words[i] != longer->words[i])
"""

attached are the query that triggers the crash and the backtrace.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Attachment

Re: Asymmetric partition-wise JOIN

From
"Andrey V. Lepikhov"
Date:
On 9/9/21 8:38 PM, Jaime Casanova wrote:
> On Thu, Sep 09, 2021 at 09:50:46AM +0000, Aleksander Alekseev wrote:
>> It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass
>> any tests. Changing the status to save time for reviewers.
>>
>> The new status of this patch is: Waiting on Author
> 
> Just to give some more info to work on I found this patch made postgres
> crash with a segmentation fault.
> 
> """
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
> 126            if (shorter->words[i] != longer->words[i])
> """
> 
> attached are the query that triggers the crash and the backtrace.
> 

Thank you for this good catch!
The problem was in the adjust_child_relids_multilevel routine. The 
tmp_result variable sometimes points to the original required_outer.
This patch adds new ways in which the optimizer can generate plans. One 
possible way is that the optimizer reparameterizes an inner path by a 
plain relation from the outer side (perhaps as a result of a join between 
the plain relation and a partitioned relation). In this case we have to 
compare tmp_result with the original pointer to tell whether it was 
changed.
The patch in attachment fixes this problem. An additional regression test 
was added.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Andrey Lepikhov
Date:
On 14/9/21 11:37, Andrey V. Lepikhov wrote:
> Thank you for this good catch!
> The problem was in the adjust_child_relids_multilevel routine. The 
> tmp_result variable sometimes points to original required_outer.
> This patch adds new ways which optimizer can generate plans. One 
> possible way is optimizer reparameterizes an inner by a plain relation 
> from the outer (maybe as a result of join of the plain relation and 
> partitioned relation). In this case we have to compare tmp_result with 
> original pointer to realize, it was changed or not.
> The patch in attachment fixes this problem. Additional regression test 
> added.
> 
I thought about it more and realized that it isn't necessary to recurse in 
the adjust_child_relids_multilevel() routine if required_outer contains 
only normal_relids.
Also, the regression tests were improved a bit.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asymmetric partition-wise JOIN

From
Alexander Pyhalov
Date:
Andrey Lepikhov писал 2021-09-15 09:31:
> On 14/9/21 11:37, Andrey V. Lepikhov wrote:
>> Thank you for this good catch!
>> The problem was in the adjust_child_relids_multilevel routine. The 
>> tmp_result variable sometimes points to original required_outer.
>> This patch adds new ways which optimizer can generate plans. One 
>> possible way is optimizer reparameterizes an inner by a plain relation 
>> from the outer (maybe as a result of join of the plain relation and 
>> partitioned relation). In this case we have to compare tmp_result with 
>> original pointer to realize, it was changed or not.
>> The patch in attachment fixes this problem. Additional regression test 
>> added.
>> 
> I thought more and realized there isn't necessary to recurse in the
> adjust_child_relids_multilevel() routine if required_outer contains
> only
> normal_relids.
> Also, regression tests were improved a bit.

Hi.
The patch no longer applies cleanly, so I rebased it. Attaching the 
rebased version.
I've looked through it once again and have several questions.

1) In adjust_appendrel_attrs_multilevel(), can it happen that 
child_relids is a zero-length list (in which case the pfree's will fail)? 
It seems not, but should we at least assert this? Note that in 
adjust_appendrel_attrs() we add logic for nappinfos being 0.

2) In try_asymmetric_partitionwise_join() we state that 'Asymmetric join 
isn't needed if the append node has only one child'. This is not 
completely correct. An asymmetric join with one partition can be 
advantageous when JOIN(A, UNION(B)) is more expensive than UNION(JOIN 
(A, B)). The latter is true, for example, when we join a partitioned 
table with foreign partitions to another foreign table and only one 
partition is left.
Let's take the attached case (foreign_join.sql). With the 
list_length(append_path->subpaths) > 1 check in place, we get the 
following plan:

set enable_partitionwise_join = on;

explain SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2 ON (t1.a 
= t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
  Sort  (cost=208.65..208.69 rows=17 width=8)
    Sort Key: t1.a
    ->  Hash Join  (cost=202.60..208.30 rows=17 width=8)
          Hash Cond: (t1.a = t2.b)
          ->  Foreign Scan on ftprt1_p1 t1  (cost=100.00..105.06 rows=125 
width=4)
          ->  Hash  (cost=102.39..102.39 rows=17 width=4)
                ->  Foreign Scan on ftprt2_p1 t2  (cost=100.00..102.39 
rows=17 width=4)

If we change it to list_length(append_path->subpaths) > 0, we get a 
foreign join and a cheaper plan:

explain verbose SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2 
ON (t1.a = t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
                                                                          
  QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=106.15..106.19 rows=17 width=8)
    Output: t1.a, t2.b
    Sort Key: t1.a
    ->  Foreign Scan  (cost=102.26..105.80 rows=17 width=8)
          Output: t1.a, t2.b
          Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 
t2)
          Remote SQL: SELECT r4.a, r2.b FROM (public.fprt1_p1 r4 INNER 
JOIN public.fprt2_p1 r2 ON (((r4.a = r2.b)) AND ((r2.c ~~ '%0004')) AND 
((r4.a < 250))))


-- 
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachment

Re: Asymmetric partition-wise JOIN

From
Alexander Korotkov
Date:
Hi Alexander,
Hi Andrey,

Thank you for your work on this subject.

On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> The patch does not longer apply cleanly, so I rebased it. Attaching
> rebased version.

Not surprising that the patch doesn't apply after 1.5 years since the
last message.  Could you please rebase it?

I read the thread and the patch.  The patch improves the joining of
partitioned tables with non-partitioned relations.  Let's denote the
non-partitioned relation as A and the partitions as P1 ... PN.  The patch
allows Append(Join(A, P1), ..., Join(A, PN)) instead of Join(A,
Append(P1, ..., PN)).  That could be cheaper because it's generally
cheaper to join small pieces than to do one big join.  The
drawback is the need to scan A multiple times.  But is this really
necessary and acceptable?  Let's consider multiple options.

1) A is a non-table relation.  For instance, A is a function scan.  In this
case, doing multiple scans of A is not just expensive, but could lead to
unexpected side effects.  When the user includes a function once in
the FROM clause, she expects this function to be evaluated once.  I
propose that we should materialize the scan of non-table relations.  So,
the materialized representation will be scanned multiple times, but the
source is only scanned once.  That would be similar to a CTE.
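For illustration only: a materialized CTE already gives this evaluate-once behaviour today, so the proposal is essentially to apply the same trick implicitly.  A sketch using the tables from the first message of this thread:

```sql
-- The function scan is evaluated exactly once; its materialized
-- result is then probed for every per-partition join.
WITH f AS MATERIALIZED (
    SELECT x AS pkey, 'ftable__#' || x::text AS memo
    FROM generate_series(1, 50) AS x
)
SELECT count(*)
FROM ptable p JOIN f ON p.fkey = f.pkey;
```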
2) A is a table to be scanned with a parameterized path on the
inner side of a nested loop join.  In this case, there is no big
scan of A and nothing to materialize.
3) A is a table to be used in a merge join or on the outer side of a
nested loop join.  In this case, it would be nice to consider
materialization.  It's not always good to materialize, because
materialization has its additional costs.  I think that could be a
cost-based decision.
4) A is used in a hash join.  Could we re-use the hashed
representation of A between multiple joins?  I read upthread that it
was proposed to share a hash table between multiple background workers
via shared memory.  But the first step would be to just share it
between multiple join nodes within the same process.

As we consider joining with each partition individually, different
join methods could be chosen.  As I understand it, the current patch
considers joining with each of the partitions as a separate, isolated
optimization task.  However, if we share resources between the
multiple joins, then a need arises for some global optimization.  For
instance, a join type could be expensive when applied to an individual
partition, but cheap when applied to all the partitions thanks to
saving the common work.

My idea is to consider generated common resources (such as
materialized scans) as a property of the path.  For instance, if the
nested loop join is cheaper than the hash join, but the hash join
generates a common hash map of table A, we don't drop the hash join
from consideration immediately and keep it to see how it could
help join other partitions.  What do you think?

------
Regards,
Alexander Korotkov



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 15/10/2023 07:18, Alexander Korotkov wrote:
> Hi Alexander,
> Hi Andrey,
> 
> Thank you for your work on this subject.
> 
> On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
> <a.pyhalov@postgrespro.ru> wrote:
>> The patch does not longer apply cleanly, so I rebased it. Attaching
>> rebased version.
> 
> Not surprising that the patch doesn't apply after 1.5 years since the
> last message.  Could you please rebase it?
> 
> I read the thread and the patch.  The patch improves the joining of
> partitioned tables with non-partitioned relations.  Let's denote
> non-partitioned relation as A, partitions as P1 ... PN.  The patch
> allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A,
> Append(P1, ... PN).  That could be cheaper because it's generally
> cheaper to join small pieces rather than do one big join.  The
> drawback is the need to scan A multiple times.  But is this really
> necessary and acceptable?  Let's consider multiple options.
> 
> 1) A is non-table.  For instance, A is a function scan.  In this case,
> doing multiple scans of A is not just expensive, but could lead to
> unexpected side effects.  When the user includes a function once in
> the FROM clause, she expects this function to be evaluated once.  I
> propose that we should materialize a scan of non-table relations.  So,
> materialized representation will be scanned multiple times, but the
> source only scanned once.  That would be similar to CTE.
> 2) A is the table to be scanned with the parametrized path in the
> inner part of the nested loop join.  In this case, there is no big
> scan of A and nothing to materialize.
> 3) A is the table to be used in merge join or outer part of nested
> loop join.  In this case, it would be nice to consider materialize.
> It's not always good to materialize, because materialization has its
> additional costs.  I think that could be a cost-based decision.
> 4) A is used in the hash join.  Could we re-use the hashed
> representation of A between multiple joins?  I read upthread it was
> proposed to share a hashed table between multiple background workers
> via shared memory.  But the first step would be to just share it
> between multiple join nodes within the same process.
> 
> As we consider joining with each partition individually, there could
> be chosen different join methods.  As I get, the current patch
> considers joining with each of the partitions as a separate isolated
> optimization task.  However, if we share resources between the
> multiple joins, then rises a need for some global optimization.  For
> instance, a join type could be expensive when applied to an individual
> partition, but cheap when applied to all the partitions thanks to
> saving the common work.
> 
> My idea is to consider generated common resources (such as
> materialized scans) as a property of the path.  For instance, if the
> nested loop join is cheaper than the hash join, but the hash join
> generates a common hash map of table A, we don't drop hash join
> immediately from the consideration and leave it to see how it could
> help join other partitions.  What do you think?

Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways 
to push down more joins to foreign servers. And, because of 
asynchronous append, the benefit of that optimization was obvious. 
Unfortunately, we hadn't found other applications for this feature, 
which is why this patch was put on hold.
You have brought new ideas about applying this technique locally. Moreover, 
the main issue with the patch was massive memory consumption in the case 
of many joins and partitions, because of reparameterization. But now, 
postponing reparameterization, as proposed in thread [1], resolves 
that problem and gives some insights into the reparameterization 
technique for some fields, like lateral references.
Hence, I think we can restart this work.
The first thing to do (after a rebase, of course) is to identify the 
cases where an asymmetric join gives a significant performance benefit 
and reflect them in the cost model.

[1] Oversight in reparameterize_path_by_child leading to executor crash
https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Alexander Korotkov
Date:
On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Thanks for such detailed feedback!
> The rationale for this patch was to give the optimizer additional ways
> to push down more joins into foreign servers. And, because of
> asynchronous append, the benefit of that optimization was obvious.
> Unfortunately, we hadn't found other applications for this feature,
> which was why this patch was postponed in the core.
> You have brought new ideas about applying this idea locally. Moreover,
> the main issue of the patch was massive memory consumption in the case
> of many joins and partitions - because of reparameterization. But now,
> postponing the reparameterization proposed in the thread [1] resolves
> that problem and gives some insights into the reparameterization
> technique of some fields, like lateral references.
> Hence, I think we can restart this work.
> The first thing here (after rebase, of course) is to figure out and
> implement in the cost model cases of effectiveness when asymmetric join
> would give significant performance.

Great!  I'm looking forward to the revised patch.

------
Regards,
Alexander Korotkov



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 15/10/2023 17:25, Alexander Korotkov wrote:
> On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> Thanks for such detailed feedback!
>> The rationale for this patch was to give the optimizer additional ways
>> to push down more joins into foreign servers. And, because of
>> asynchronous append, the benefit of that optimization was obvious.
>> Unfortunately, we hadn't found other applications for this feature,
>> which was why this patch was postponed in the core.
>> You have brought new ideas about applying this idea locally. Moreover,
>> the main issue of the patch was massive memory consumption in the case
>> of many joins and partitions - because of reparameterization. But now,
>> postponing the reparameterization proposed in the thread [1] resolves
>> that problem and gives some insights into the reparameterization
>> technique of some fields, like lateral references.
>> Hence, I think we can restart this work.
>> The first thing here (after rebase, of course) is to figure out and
>> implement in the cost model cases of effectiveness when asymmetric join
>> would give significant performance.
> 
> Great!  I'm looking forward to the revised patch
Before preparing a new patch, it would be better to find common 
ground on the following issue:
So far, this optimization stands apart, proposing an alternative path for 
a join RelOptInfo when we have an underlying append path on the outer side.
What I have on the back burner is redesigning the approach: an asymmetric 
join doesn't change the partitioning scheme or the bounds of the 
partitioned side. So, it seems consistent to make it part of the 
partitionwise join machinery and implement it inside the 
try_partitionwise_join / generate_partitionwise_join_paths routines.

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Ashutosh Bapat
Date:
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> >
> > Great!  I'm looking forward to the revised patch
> Before preparing a new patch, it would be better to find the common
> ground in the next issue:
> So far, this optimization stays aside, proposing an alternative path for
> a join RelOptInfo if we have an underlying append path in the outer.
> My back burner is redesigning the approach: asymmetric join doesn't
> change the partitioning scheme and bounds of the partitioned side. So,
> it looks consecutive to make it a part of partitionwise_join machinery
> and implement it as a part of the try_partitionwise_join /
> generate_partitionwise_join_paths routines.
>

I think we need an example where such a join will be faster than a
non-partitioned join when both sides are local. It might be
possible to come up with such an example without writing any code. The
idea would be to rewrite the SQL as a union of joins.
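Using the ptable/ftable schema from the first message, such a hand-written rewrite might look as follows (a sketch only; whether it beats the single big join is exactly what needs measuring):

```sql
-- Join against the partitioned parent, as planned today:
SELECT p.fkey, f.memo
FROM ptable p JOIN ftable f ON p.fkey = f.pkey;

-- The same query as a union of per-partition joins, mimicking
-- the Append(Join(A, P1), ..., Join(A, PN)) shape:
SELECT p.fkey, f.memo
FROM ptable_p0 p JOIN ftable f ON p.fkey = f.pkey
UNION ALL
SELECT p.fkey, f.memo
FROM ptable_p1 p JOIN ftable f ON p.fkey = f.pkey
UNION ALL
SELECT p.fkey, f.memo
FROM ptable_p2 p JOIN ftable f ON p.fkey = f.pkey;
```

Comparing EXPLAIN ANALYZE of the two forms on purely local tables would show whether the transformation can pay off without any planner changes.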

Whenever I visited this idea, I hit one issue prominently: how would
we differentiate different scans of the non-partitioned relation?
Normally we do that using different Relids, but in this case we
wouldn't know the number of such relations involved in the
query until we start planning such a join. It's too late then to add new
base relations and assign them new Relids. Of course, I haven't thought
hard about it, and I haven't looked at the patch to see whether this
problem is solved and how.

--
Best Wishes,
Ashutosh Bapat



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 16/10/2023 23:21, Ashutosh Bapat wrote:
> On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
> Whenever I visited this idea, I hit one issue prominently - how would
> we differentiate different scans of the non-partitioned relation.
> Normally we do that using different Relids but in this case we
> wouldn't be able to know the number of such relations involved in the
> query unless we start planning such a join. It's late to add new base
> relations and assign them new Relids. Of course I haven't thought hard
> about it. I haven't looked at the patch to see whether this problem is
> solved and how.
> 
I'm curious, what kind of problems do you fear here? Why would we need a 
range table entry for each scan of the non-partitioned relation?

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Ashutosh Bapat
Date:
On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 16/10/2023 23:21, Ashutosh Bapat wrote:
> > On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
> > Whenever I visited this idea, I hit one issue prominently - how would
> > we differentiate different scans of the non-partitioned relation.
> > Normally we do that using different Relids but in this case we
> > wouldn't be able to know the number of such relations involved in the
> > query unless we start planning such a join. It's late to add new base
> > relations and assign them new Relids. Of course I haven't thought hard
> > about it. I haven't looked at the patch to see whether this problem is
> > solved and how.
> >
> I'm curious, which type of problems do you afraid here? Why we need a
> range table entry for each scan of non-partitioned relation?
>

Not RTE but RelOptInfo.

Using the same example as Alexander Korotkov, let's say A is the
non-partitioned table and P is a partitioned table with partitions P1,
P2, ..., Pn. The partitionwise join would need to compute AP1, AP2, ...,
APn. Each of these joins may have different properties and thus will
require creating paths. In order to save these paths, we need
RelOptInfos, which are identified by relids. Let's assume that the
relids of these join RelOptInfos are created by the union of the relid
of A and the relid of Px (the partition being joined). This is notionally
misleading but doable.

But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.

The relid is also used to track the scans at the executor level. Since we
have many scans on A, each possibly using a different plan, we will
need different ids for those.

But if you have developed a way to use a single RelOptInfo of A to do
all this, maybe we don't need all that. I will take a look at your next
version of the patch.


--
Best Wishes,
Ashutosh Bapat



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 17/10/2023 17:09, Ashutosh Bapat wrote:
> On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>>
>> On 16/10/2023 23:21, Ashutosh Bapat wrote:
>>> On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
>>> Whenever I visited this idea, I hit one issue prominently - how would
>>> we differentiate different scans of the non-partitioned relation.
>>> Normally we do that using different Relids but in this case we
>>> wouldn't be able to know the number of such relations involved in the
>>> query unless we start planning such a join. It's late to add new base
>>> relations and assign them new Relids. Of course I haven't thought hard
>>> about it. I haven't looked at the patch to see whether this problem is
>>> solved and how.
>>>
>> I'm curious, which type of problems do you afraid here? Why we need a
>> range table entry for each scan of non-partitioned relation?
>>
> 
> Not RTE but RelOptInfo.
> 
> Using the same example as Alexander Korotkov, let's say A is the
> nonpartitioned table and P is partitioned table with partitions P1,
> P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ...
> APn. Each of these joins may have different properties and thus will
> require creating paths. In order to save these paths, we need
> RelOptInfos which are indentified by relids. Let's assume that the
> relids of these join RelOptInfos are created by union of relid of A
> and relid of Px (the partition being joined). This is notionally
> misleading but doable.

Ok, now I see your concern. In the current patch, we build a RelOptInfo 
for each JOIN(A, Pi) with the build_child_join_rel() routine. And of 
course, they all have different sets of cheapest paths (one more 
dimension of optimality). At this point, the RelOptInfo of relation A is 
fully formed, and upper joins use its pathlist as is, without changes.

> But the clauses of A parameterized by P will produce different
> translations for each of the partitions. I think we will need
> different RelOptInfos (for A) to store these translations.

Does the answer above resolve this issue?

> The relid is also used to track the scans at executor level. Since we
> have so many scans on A, each may be using different plan, we will
> need different ids for those.

I don't understand this sentence. In what way does the executor use this 
index of the RelOptInfo?

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Ashutosh Bapat
Date:
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> > But the clauses of A parameterized by P will produce different
> > translations for each of the partitions. I think we will need
> > different RelOptInfos (for A) to store these translations.
>
> Does the answer above resolved this issue?

Maybe. There are other problematic areas like EvalPlanQual, rescans,
and reparameterized paths, which can blow up if we use the same RelOptInfo
for different scans of the same relation. It would be good to test
those. Also, A need not be a simple relation; it could be a join as
well.

>
> > The relid is also used to track the scans at executor level. Since we
> > have so many scans on A, each may be using different plan, we will
> > need different ids for those.
>
> I don't understand this sentence. Which way executor uses this index of
> RelOptInfo ?

See Scan::scanrelid

--
Best Wishes,
Ashutosh Bapat



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 18/10/2023 16:59, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>>
>>> But the clauses of A parameterized by P will produce different
>>> translations for each of the partitions. I think we will need
>>> different RelOptInfos (for A) to store these translations.
>>
>> Does the answer above resolved this issue?
> 
> May be. There are other problematic areas like EvalPlanQual, Rescans,
> reparameterised paths which can blow up if we use the same RelOptInfo
> for different scans of the same relation. It will be good to test

Yeah, now I get it. This is the second place where I have seen a 
reference to a kind of hidden rule that an RTE entry (or RelOptInfo) 
must correspond to only one plan node. I don't have a quick answer for 
now - maybe it is a kind of architectural agreement - and I will 
consider this issue during development.

> those. And also A need not be a simple relation; it could be join as
> well.

For a join RelOptInfo, as well as for any subtree, we have the same 
logic: the pathlist of this subtree is already formed at the 
previous level of the search and will not be changed.

>>
>>> The relid is also used to track the scans at executor level. Since we
>>> have so many scans on A, each may be using different plan, we will
>>> need different ids for those.
>>
>> I don't understand this sentence. Which way executor uses this index of
>> RelOptInfo ?
> 
> See Scan::scanrelid
> 

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 15/10/2023 13:25, Alexander Korotkov wrote:
> Great!  I'm looking forward to the revised patch.
Revising the code and opinions before restarting this work, I found two 
different possible strategies mentioned in the thread:
1. 'Common resources' - share the materialised result of the inner table 
scan (a hash table in the case of HashJoin) to join each partition one 
by one. It gives us a profit in the case of parallel append and possibly 
other cases, like the one shown in the initial message.
2. 'Individual strategies' - by limiting the AJ feature to cases where 
the JOIN clause contains a partitioning expression, we can push an 
additional scan clause into each copy of the inner table scan, reducing 
the number of tuples scanned, and even pruning something because of a 
proven empty input.
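As a hypothetical illustration of strategy 2 (the table and column names below are made up, and a join on the partitioning expression is assumed): for a hash-partitioned outer side with (modulus 3, remainder 0), the copy of the inner scan joined to that partition could carry the matching partition constraint:

```sql
-- Only inner rows whose join key hashes into partition 0 can produce
-- matches, so the other inner rows can be filtered out early:
SELECT *
FROM inner_tab t
WHERE satisfies_hash_partition('parted'::regclass, 3, 0, t.join_key);
```

With such a clause pushed into each inner-scan copy, a proven-empty input could even prune the corresponding child join entirely.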

I see the pros and cons of both approaches. The first option is more 
straightforward, and its outcome is obvious in the case of parallel 
append. But how can we guarantee the same join type for each join? Why 
should we ignore the positive effect of different strategies for 
different partitions?
The second strategy is more expensive for the optimiser, especially in 
the multipartition case. But, as far as I can predict, it is easier to 
implement and looks more natural for the architecture. What do you think 
about that?

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 18/10/2023 16:59, Ashutosh Bapat wrote:
> On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
>>> The relid is also used to track the scans at executor level. Since we
>>> have so many scans on A, each may be using different plan, we will
>>> need different ids for those.
>>
>> I don't understand this sentence. Which way executor uses this index of
>> RelOptInfo ?
> 
> See Scan::scanrelid
> 
Hi,

In the attachment, you will find a fresh version of the patch.
I've analysed the danger of reusing the same RelOptInfo index in the 
executor. In the examples I found (which scared me), it is still not a 
problem, because ExecQual() does all the work in one operation and 
doesn't intersect with other operations. Of course, it is not a good 
design, and we will work on this issue. But at least this code can be 
used for experiments.
Furthermore, I've shared some reflections on this feature. To avoid 
cluttering the thread, I've published them in [1]. These thoughts 
provide additional context and considerations for our ongoing work.

[1] 
https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy

-- 
regards, Andrei Lepikhov

Attachment

Re: Asymmetric partition-wise JOIN

From
Alexander Korotkov
Date:
Hi!

On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 15/10/2023 13:25, Alexander Korotkov wrote:
> > Great!  I'm looking forward to the revised patch.
> Revising the code and opinions before restarting this work, I found two
> different possible strategies mentioned in the thread:
> 1. 'Common Resources' shares the materialised result of the inner table
> scan (a hash table in the case of HashJoin) to join each partition one
> by one. It gives us a profit in the case of parallel append and possibly
> other cases, like the one shown in the initial message.
> 2. 'Individual strategies' - By limiting the AJ feature to cases when
> the JOIN clause contains a partitioning expression, we can push an
> additional scan clause into each copy of the inner table scan, reduce
> the number of tuples scanned, and even prune something because of proven
> zero input.
>
> I see the pros and cons of both approaches. The first option is more
> straightforward, and its outcome is obvious in the case of parallel
> append. But how can we guarantee the same join type for each join? Why
> should we ignore the positive effect of different strategies for
> different partitions?
> The second strategy is more expensive for the optimiser, especially in
> the multipartition case. But as I can predict, it is easier to implement
> and looks more natural for the architecture. What do you think about that?

Actually, the idea I tried to express is a combination of #1 and #2:
build an individual plan for every partition, but consider the 'common
resources'.  Let me explain this a bit more.

Right now, we basically consider the following properties during
path selection.
1) Cost.  The cheaper path wins.  There are two criteria though: startup
cost and total cost.  So, we can keep both paths with cheaper startup
cost and paths with cheaper total cost.
2) Pathkeys.  We can keep a more expensive path which has
pathkeys potentially useful in the future.

My idea is to introduce a new property for path selection.
3) Usage of common resources.  A common resource can be a hash
representation of a relation, a memoize node over a relation scan, etc.
We can exclude the cost of common resource generation from the path cost,
but keep a reference to the common resource along with its generation
cost.  If one path uses more common resources than another path, it can
cost-dominate the other only if it is cheaper together with the cost of
its extra common resources.  If one path uses fewer or equal common
resources than another, it can dominate on cost as usual.

Using these rules, we can gather a plurality of paths for each
child join, taking common resources into account.  After that, we can
apply some global optimization to find which common resources' generation
can reduce the global cost.

However, I understand this is a huge amount of work, given that we have
to introduce new basic optimizer concepts.  I get that the main
application of this patch is sharding.  If we have global tables
residing on each shard, we can push down any joins with them.  Given that
this patch gives some optimization for the non-sharded case, I think we
*probably* can accept its concept even though this optimization is
obviously not perfect.

------
Regards,
Alexander Korotkov
Supabase



Re: Asymmetric partition-wise JOIN

From
Alexander Korotkov
Date:
On Sun, May 5, 2024 at 5:55 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 18/10/2023 16:59, Ashutosh Bapat wrote:
> > On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
> >>> The relid is also used to track the scans at executor level. Since we
> >>> have so many scans on A, each may be using different plan, we will
> >>> need different ids for those.
> >>
> >> I don't understand this sentence. Which way executor uses this index of
> >> RelOptInfo ?
> >
> > See Scan::scanrelid
> >
> Hi,
>
> In the attachment, you will find a fresh version of the patch.
> I've analysed the danger of the same RelOptInfo index for the executor.
> In the examples I found (scared), it is still not a problem because
> ExecQual() does all the jobs at one operation and doesn't intersect with
> over operations. Of course, it is not a good design, and we will work on
> this issue. But at least this code can be used in experiments.
> Furthermore, I've shared some reflections on this feature. To avoid
> cluttering the thread, I've published them in [1]. These thoughts
> provide additional context and considerations for our ongoing work.
>
> [1]
> https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy

I've rebased the patch on the current master.  Also, I didn't like the
needFlatCopy argument to reparameterize_path_by_child().  It looks
quite awkward.  Instead, since we need to copy paths, I've
enabled native copying of paths.  Now we can just do copyObject() over
the path in the caller.  It looks much cleaner to me.  What do you think?

Other notes:

1) I think we need to cover the cases which
is_inner_rel_safe_for_asymmetric_join() filters out with regression
tests.
2) is_asymmetric_join() looks awkward to me.  Should we instead add
a flag to JoinPath?
3) I understand that you re-use the same RelOptInfo multiple times.  It's
too late a stage of query processing to add a simple relation into the
planner structs.  I tried rescans issued by cursors and EvalPlanQual()
caused by concurrent updates, but didn't manage to break this.  It
seems that even if the same relation index is used multiple times in
different places of a query, it never gets used simultaneously.  But
even if this is somehow OK, it is a significant change of assumptions
in the planner/executor data structures.  Perhaps we need at least Tom's
opinion on this.

------
Regards,
Alexander Korotkov
Supabase



Re: Asymmetric partition-wise JOIN

From
Andrei Lepikhov
Date:
On 1/8/2024 20:56, Alexander Korotkov wrote:
> On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> Actually, the idea I tried to express is the combination of #1 and #2:
> to build individual plan for every partition, but consider the 'Common
> Resources'.  Let me explain this a bit more.
Thanks for keeping your eye on it!
> My idea is to introduce a new property for paths selection.
> 3) Usage of common resources.  The common resource can be: hash
> representation of relation, memoize over relation scan, etc.  We can
> exclude the cost of common resource generation from the path cost, but
> keep the reference for the common resource with its generation cost.
> If one path uses more common resources than another path, it could
> cost-dominate another one only if its cheaper together with its extra
> common resources cost.  If one path uses less or equal common
> resources than another, it could normally cost-dominate another one.
The most challenging part for me is the cost calculation, which is 
bound up with the estimations of other paths. To correctly estimate the 
effect, we need to remember at least the total number of paths sharing 
a resource.
Also, I wonder if it can cause corner cases where a prediction error 
on a shared resource causes an even worse situation upstream.
I think we could start here from an example and a counter-example, 
but I still can't find them.

> However, I understand this is huge amount of work given we have to
> introduce new basic optimizer concepts.  I get that the main
> application of this patch is sharding.  If we have global tables
> residing each shard, we can push down any joins with them.  Given this
> patch gives some optimization for non-sharded case, I think we
> *probably* can accept its concept even that it this optimization is
> obviously not perfect.
Yes, right now sharding is the most profitable case. We can push down 
parts of the plan which reference only some common resources: 
FunctionScan, ValuesScan, and tables which can be proven to exist 
everywhere and provide the same output. But for now, that is too far from 
the core code, IMO. So, I am searching for cases that can be helpful on a 
single instance.

-- 
regards,
Andrei Lepikhov
Postgres Professional