Asymmetric partition-wise JOIN - Mailing list pgsql-hackers

From Kohei KaiGai
Subject Asymmetric partition-wise JOIN
Date
Msg-id CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ@mail.gmail.com
Whole thread Raw
Responses Re: Asymmetric partition-wise JOIN
List pgsql-hackers
Hello,

PostgreSQL optimizer right now considers join pairs on only
non-partition - non-partition or
partition-leaf - partition-leaf relations. On the other hands, it is
harmless and makes sense to
consider a join pair on non-partition - partition-leaf.

See the example below. ptable is partitioned by hash, and contains 10M
rows. ftable is not
partitioned and contains 50 rows. Most of ptable::fkey shall not have
matched rows in this
join.

create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
insert into ptable (select x % 10000, md5(x::text) from
generate_series(1,10000000) x);

create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from
generate_series(1,50) x);
vacuum analyze;

postgres=# explain analyze select count(*) from ptable p, ftable f
where p.fkey = f.pkey;
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=266393.38..266393.39 rows=1 width=8) (actual
time=2333.193..2333.194 rows=1 loops=1)
   ->  Hash Join  (cost=2.12..260143.38 rows=2500000 width=0) (actual
time=0.056..2330.079 rows=50000 loops=1)
         Hash Cond: (p.fkey = f.pkey)
         ->  Append  (cost=0.00..233335.00 rows=10000000 width=4)
(actual time=0.012..1617.268 rows=10000000 loops=1)
               ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
loops=1)
               ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25
rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
loops=1)
               ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79
rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
loops=1)
         ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual
time=0.033..0.034 rows=50 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 10kB
               ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50
width=4) (actual time=0.004..0.017 rows=50 loops=1)
 Planning Time: 0.286 ms
 Execution Time: 2333.264 ms
(12 rows)

We can manually rewrite this query as follows:

postgres=# explain analyze select count(*) from (
              select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
              select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
              select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;

Because Append does not process tuples that shall have no matched
tuples in ftable,
this query has cheaper cost and short query execution time.
(2333ms --> 1396ms)

postgres=# explain analyze select count(*) from (
              select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
              select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
              select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=210478.25..210478.26 rows=1 width=8) (actual
time=1396.024..1396.024 rows=1 loops=1)
   ->  Append  (cost=2.12..210353.14 rows=50042 width=0) (actual
time=0.058..1393.008 rows=50000 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=2.12..70023.66
rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
               ->  Hash Join  (cost=2.12..69856.40 rows=16726
width=72) (actual time=0.056..571.718 rows=16789 loops=1)
                     Hash Cond: (p.fkey = f.pkey)
                     ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4)
(actual time=0.034..0.035 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f  (cost=0.00..1.50
rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=2.12..70027.43
rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
               ->  Hash Join  (cost=2.12..69861.26 rows=16617
width=72) (actual time=0.036..408.626 rows=16578 loops=1)
                     Hash Cond: (p_1.fkey = f_1.pkey)
                     ->  Seq Scan on ptable_p1 p_1
(cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
rows=3333025 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
         ->  Subquery Scan on "*SELECT* 3"  (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
               ->  Hash Join  (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
                     Hash Cond: (p_2.fkey = f_2.pkey)
                     ->  Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
 Planning Time: 0.614 ms
 Execution Time: 1396.131 ms
(25 rows)

How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.

Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
    Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
    paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
    In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
    table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
    reuse it.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



pgsql-hackers by date:

Previous
From: Kohei KaiGai
Date:
Subject: Re: How to retain lesser paths at add_path()?
Next
From: Antonin Houska
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)