Asymmetric partition-wise JOIN - Mailing list pgsql-hackers
From | Kohei KaiGai |
---|---|
Subject | Asymmetric partition-wise JOIN |
Date | |
Msg-id | CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ@mail.gmail.com Whole thread Raw |
Responses |
Re: Asymmetric partition-wise JOIN
|
List | pgsql-hackers |
Hello, PostgreSQL optimizer right now considers join pairs on only non-partition - non-partition or partition-leaf - partition-leaf relations. On the other hands, it is harmless and makes sense to consider a join pair on non-partition - partition-leaf. See the example below. ptable is partitioned by hash, and contains 10M rows. ftable is not partitioned and contains 50 rows. Most of ptable::fkey shall not have matched rows in this join. create table ptable (fkey int, dist text) partition by hash (dist); create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0); create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1); create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2); insert into ptable (select x % 10000, md5(x::text) from generate_series(1,10000000) x); create table ftable (pkey int primary key, memo text); insert into ftable (select x, 'ftable__#' || x::text from generate_series(1,50) x); vacuum analyze; postgres=# explain analyze select count(*) from ptable p, ftable f where p.fkey = f.pkey; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual time=2333.193..2333.194 rows=1 loops=1) -> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual time=0.056..2330.079 rows=50000 loops=1) Hash Cond: (p.fkey = f.pkey) -> Append (cost=0.00..233335.00 rows=10000000 width=4) (actual time=0.012..1617.268 rows=10000000 loops=1) -> Seq Scan on ptable_p0 p (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796 loops=1) -> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025 loops=1) -> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179 loops=1) -> Hash (cost=1.50..1.50 rows=50 width=4) (actual time=0.033..0.034 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.017 rows=50 loops=1) Planning Time: 0.286 ms Execution Time: 2333.264 ms (12 rows) We can manually rewrite this query as follows: postgres=# explain analyze select count(*) from ( select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry; Because Append does not process tuples that shall have no matched tuples in ftable, this query has cheaper cost and short query execution time. (2333ms --> 1396ms) postgres=# explain analyze select count(*) from ( select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual time=1396.024..1396.024 rows=1 loops=1) -> Append (cost=2.12..210353.14 rows=50042 width=0) (actual time=0.058..1393.008 rows=50000 loops=1) -> Subquery Scan on "*SELECT* 1" (cost=2.12..70023.66 rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1) -> Hash Join (cost=2.12..69856.40 rows=16726 width=72) (actual time=0.056..571.718 rows=16789 loops=1) Hash Cond: (p.fkey = f.pkey) -> Seq Scan on ptable_p0 p (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796 loops=1) -> Hash (cost=1.50..1.50 rows=50 width=4) (actual time=0.034..0.035 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1) -> Subquery Scan on "*SELECT* 2" (cost=2.12..70027.43 rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1) -> Hash Join (cost=2.12..69861.26 rows=16617 width=72) (actual time=0.036..408.626 rows=16578 loops=1) Hash Cond: (p_1.fkey = f_1.pkey) -> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422 rows=3333025 loops=1) -> Hash (cost=1.50..1.50 rows=50 width=4) (actual time=0.020..0.020 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f_1 (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50 loops=1) -> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84 rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1) -> Hash Join (cost=2.12..69884.85 rows=16699 width=72) (actual time=0.025..406.048 rows=16633 loops=1) Hash Cond: (p_2.fkey = f_2.pkey) -> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015 rows=3334179 loops=1) -> Hash (cost=1.50..1.50 rows=50 width=4) (actual time=0.014..0.014 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f_2 (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50 loops=1) Planning Time: 0.614 ms Execution Time: 1396.131 ms (25 rows) How about your opinions for this kind of asymmetric partition-wise JOIN support by the optimizer? I think we can harmlessly push-down inner-join and left-join if partition-leaf is left side. Probably, we need to implement two key functionalities. 1. Construction of RelOpInfo for join on non-partition table and partition-leafs for each pairs. Instead of JoinPaths, this logic adds AppendPath that takes asymmetric partition-wise join paths as sub-paths. Other optimization logic is equivalent as we are currently doing. 2. Allow to share the hash-table built from table scan distributed to individual partition leafs. In the above example, SeqScan on ftable and relevant Hash path will make identical hash- table for the upcoming hash-join. If sibling paths have equivalent results, it is reasonable to reuse it. Best regards, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei <kaigai@heterodb.com>
pgsql-hackers by date: