Thread: Hash Right join and seq scan
Both tables are hash partition tables , and we have a left out join , optimizer convert to Hash Right Join, but it always try to seq scan on tablexxx 32 paritions. there are almost 250k rows per parition for tablexxxx , so it's slow. As a workaround, I disable hashjoin the it run much fast with index scan on tablexxxx ,nestloop join.
With Hash Right Join, optimizer always use seq scan for outer table ? PGv13.11
-> Hash Right Join (cost=22.50..6760.46 rows=5961 width=78)
Hash Cond: ((aa.partitionkeyid)::text = (b_9.paritionkeyid)::text)
-> Append (cost=0.00..6119.48 rows=149032 width=79)
-> Seq Scan on tablexxxx_p0 aa_2 (cost=0.00..89.71 rows=2471 width=78)
-> Seq Scan on tablexxxx_p1 aa_3 (cost=0.00..88.23 rows=2423 width=78)
-> Seq Scan on tablexxxx_p2 aa_4 (cost=0.00..205.26 rows=5726 width=79)
-> Seq Scan on tablexxxx_p3 aa_5 (cost=0.00..102.92 rows=2892 width=78)
-> Seq Scan on tablexxxx_p4 aa_6 (cost=0.00..170.27 rows=4727 width=78)
...
-> Seq Scan on tablexxxx_p31 aa_33 (cost=0.00..220.59 rows=6159 width=79)
-> Append (cost=0.69..187.64 rows=4034 width=78) (actual time=0.030..0.035 rows=3 loops=3)
index scan .... tableyyyy_p0 b_2
index scan ..... tableyyyy_p1 b_3
....
index scan ... tableyyyy_p31 b_33
With Hash Right Join, optimizer always use seq scan for outer table ? PGv13.11
-> Hash Right Join (cost=22.50..6760.46 rows=5961 width=78)
Hash Cond: ((aa.partitionkeyid)::text = (b_9.paritionkeyid)::text)
-> Append (cost=0.00..6119.48 rows=149032 width=79)
-> Seq Scan on tablexxxx_p0 aa_2 (cost=0.00..89.71 rows=2471 width=78)
-> Seq Scan on tablexxxx_p1 aa_3 (cost=0.00..88.23 rows=2423 width=78)
-> Seq Scan on tablexxxx_p2 aa_4 (cost=0.00..205.26 rows=5726 width=79)
-> Seq Scan on tablexxxx_p3 aa_5 (cost=0.00..102.92 rows=2892 width=78)
-> Seq Scan on tablexxxx_p4 aa_6 (cost=0.00..170.27 rows=4727 width=78)
...
-> Seq Scan on tablexxxx_p31 aa_33 (cost=0.00..220.59 rows=6159 width=79)
-> Append (cost=0.69..187.64 rows=4034 width=78) (actual time=0.030..0.035 rows=3 loops=3)
index scan .... tableyyyy_p0 b_2
index scan ..... tableyyyy_p1 b_3
....
index scan ... tableyyyy_p31 b_33
Thanks,
James
the query is
select ....
from tableyyyy b join table xxxx aa
on b.partitionkeyid=aa.partitionkeyid
where b.id1= $1 and b.id2=$2 and b.rtime between $3 and $4;
looks like optimizer try to "calculate cost for nestloop for scanning all partitions of tablexxx (32 hash partitions) " but actually , it only scan only a few partitions. that make the nestloop cost more than hashjoin with table seq scan cost. optimizer does not the partitioney passed in by tableyyy that got selected based on indexes on other columns. possible to make optimizer to calculate cost with partition pruning? since the join key is hash partition key .
Thanks,
James
James Pang <jamespang886@gmail.com> 於 2024年7月3日週三 下午12:57寫道:
Both tables are hash partition tables , and we have a left out join , optimizer convert to Hash Right Join, but it always try to seq scan on tablexxx 32 paritions. there are almost 250k rows per parition for tablexxxx , so it's slow. As a workaround, I disable hashjoin the it run much fast with index scan on tablexxxx ,nestloop join.
With Hash Right Join, optimizer always use seq scan for outer table ? PGv13.11
-> Hash Right Join (cost=22.50..6760.46 rows=5961 width=78)
Hash Cond: ((aa.partitionkeyid)::text = (b_9.paritionkeyid)::text)
-> Append (cost=0.00..6119.48 rows=149032 width=79)
-> Seq Scan on tablexxxx_p0 aa_2 (cost=0.00..89.71 rows=2471 width=78)
-> Seq Scan on tablexxxx_p1 aa_3 (cost=0.00..88.23 rows=2423 width=78)
-> Seq Scan on tablexxxx_p2 aa_4 (cost=0.00..205.26 rows=5726 width=79)
-> Seq Scan on tablexxxx_p3 aa_5 (cost=0.00..102.92 rows=2892 width=78)
-> Seq Scan on tablexxxx_p4 aa_6 (cost=0.00..170.27 rows=4727 width=78)
...
-> Seq Scan on tablexxxx_p31 aa_33 (cost=0.00..220.59 rows=6159 width=79)
-> Append (cost=0.69..187.64 rows=4034 width=78) (actual time=0.030..0.035 rows=3 loops=3)
index scan .... tableyyyy_p0 b_2
index scan ..... tableyyyy_p1 b_3
....
index scan ... tableyyyy_p31 b_33Thanks,James
the join is "left out join"
James Pang <jamespang886@gmail.com> 於 2024年7月3日週三 下午2:51寫道:
the query isselect ....from tableyyyy b join table xxxx aaon b.partitionkeyid=aa.partitionkeyidwhere b.id1= $1 and b.id2=$2 and b.rtime between $3 and $4;looks like optimizer try to "calculate cost for nestloop for scanning all partitions of tablexxx (32 hash partitions) " but actually , it only scan only a few partitions. that make the nestloop cost more than hashjoin with table seq scan cost. optimizer does not the partitioney passed in by tableyyy that got selected based on indexes on other columns. possible to make optimizer to calculate cost with partition pruning? since the join key is hash partition key .Thanks,JamesJames Pang <jamespang886@gmail.com> 於 2024年7月3日週三 下午12:57寫道:Both tables are hash partition tables , and we have a left out join , optimizer convert to Hash Right Join, but it always try to seq scan on tablexxx 32 paritions. there are almost 250k rows per parition for tablexxxx , so it's slow. As a workaround, I disable hashjoin the it run much fast with index scan on tablexxxx ,nestloop join.
With Hash Right Join, optimizer always use seq scan for outer table ? PGv13.11
-> Hash Right Join (cost=22.50..6760.46 rows=5961 width=78)
Hash Cond: ((aa.partitionkeyid)::text = (b_9.paritionkeyid)::text)
-> Append (cost=0.00..6119.48 rows=149032 width=79)
-> Seq Scan on tablexxxx_p0 aa_2 (cost=0.00..89.71 rows=2471 width=78)
-> Seq Scan on tablexxxx_p1 aa_3 (cost=0.00..88.23 rows=2423 width=78)
-> Seq Scan on tablexxxx_p2 aa_4 (cost=0.00..205.26 rows=5726 width=79)
-> Seq Scan on tablexxxx_p3 aa_5 (cost=0.00..102.92 rows=2892 width=78)
-> Seq Scan on tablexxxx_p4 aa_6 (cost=0.00..170.27 rows=4727 width=78)
...
-> Seq Scan on tablexxxx_p31 aa_33 (cost=0.00..220.59 rows=6159 width=79)
-> Append (cost=0.69..187.64 rows=4034 width=78) (actual time=0.030..0.035 rows=3 loops=3)
index scan .... tableyyyy_p0 b_2
index scan ..... tableyyyy_p1 b_3
....
index scan ... tableyyyy_p31 b_33Thanks,James
Hi James, I think it'd be much easier to help you with investigating this issue if you shared the actual queries, and the full EXPLAIN ANALYZE output both with and without disabled hashjoin. Or even better, share a script that reproduces the issue (creates tables, loads data, runs the queries). BTW you suggested each partition has ~250k rows, but the explain plan snippet you shared does not seem to be consistent with that - it only shows 2500-5000 rows per partition. If you run ANALYZE on the table, does that change the plan? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
we have a daily vacuumdb and analyze job, generally speaking it's done in seconds, sometimes it suddenly running more than tens of minutes with same bind variable values and huge temp space got used and at that time, explain show "Hash Anti join, Hash Right join" with seq scan two tables.
Tomas Vondra <tomas.vondra@enterprisedb.com> 於 2024年7月4日週四 上午1:40寫道:
Hi James,
I think it'd be much easier to help you with investigating this issue if
you shared the actual queries, and the full EXPLAIN ANALYZE output both
with and without disabled hashjoin. Or even better, share a script that
reproduces the issue (creates tables, loads data, runs the queries).
BTW you suggested each partition has ~250k rows, but the explain plan
snippet you shared does not seem to be consistent with that - it only
shows 2500-5000 rows per partition. If you run ANALYZE on the table,
does that change the plan?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, 5 Jul 2024 at 12:50, James Pang <jamespang886@gmail.com> wrote: > we have a daily vacuumdb and analyze job, generally speaking it's done in seconds, sometimes it suddenly running morethan tens of minutes with same bind variable values and huge temp space got used and at that time, explain show "HashAnti join, Hash Right join" with seq scan two tables. There was talk about adding costing for run-time partition pruning factors but nothing was ever agreed, so nothing was done. It's just not that obvious to me how we'd do that. If the Append had 10 partitions as subnodes, with an equality join condition, you could assume we'll only match to 1 of those 10, but we've no idea at plan time which one that'll be and the partitions might drastically vary in size. The best I think we could do is take the total cost of those 10 and divide by 10 to get the average cost. It's much harder for range conditions as those could match anything from 0 to all partitions. The best suggestion I saw for that was to multiply the costs by DEFAULT_INEQ_SEL. I think for now, you might want to lower the random_page_cost or increase effective_cache_size to encourage the nested loop -> index scan plan. Good ranges for effective_cache_size is anywhere between 50 - 75% of your servers's RAM. However, that might not be ideal if your server is under memory pressure from other running processes. It also depends on how large shared_buffers are as a percentage of total RAM. David
David Rowley <dgrowleyml@gmail.com> 於 2024年7月5日週五 上午10:15寫道:
On Fri, 5 Jul 2024 at 12:50, James Pang <jamespang886@gmail.com> wrote:
> we have a daily vacuumdb and analyze job, generally speaking it's done in seconds, sometimes it suddenly running more than tens of minutes with same bind variable values and huge temp space got used and at that time, explain show "Hash Anti join, Hash Right join" with seq scan two tables.
There was talk about adding costing for run-time partition pruning
factors but nothing was ever agreed, so nothing was done. It's just
not that obvious to me how we'd do that. If the Append had 10
partitions as subnodes, with an equality join condition, you could
assume we'll only match to 1 of those 10, but we've no idea at plan
time which one that'll be and the partitions might drastically vary in
size. The best I think we could do is take the total cost of those 10
and divide by 10 to get the average cost. It's much harder for range
conditions as those could match anything from 0 to all partitions. The
best suggestion I saw for that was to multiply the costs by
DEFAULT_INEQ_SEL.
I think for now, you might want to lower the random_page_cost or
increase effective_cache_size to encourage the nested loop -> index
scan plan. Good ranges for effective_cache_size is anywhere between 50
- 75% of your servers's RAM. However, that might not be ideal if your
server is under memory pressure from other running processes. It also
depends on how large shared_buffers are as a percentage of total RAM.
David
-> Nested Loop Anti Join (cost=40.32..132168227.57 rows=224338 width=78)
Join Filter: (lower((p.ctinfo)::text) = lower((w.ctinfo)::text))-> Nested Loop Left Join (cost=39.63..398917.29 rows=299118 width=78)
-> Append (cost=0.56..22.36 rows=8 width=54)
-> Index Scan using wmdata_p0_llid_hhid_stime_idx on wmdata_p0 m_1 (cost=0.5
6..2.79 rows=1 width=54)
....
-> Append (cost=39.07..49312.09 rows=54978 width=78)
-> Bitmap Heap Scan on wmvtee_p0 w.1 (cost=39.07..1491.06 rows=1669 width=78)Recheck Cond: ((m.partitionkeyid)::text = (partitionkeyid)::text)
-> Bitmap Index Scan on wmvtee_p0_partitionkeyid_intid_idx (cost=0.00..38.65 rows=1669 width=0)
Index Cond: ((partitionkeyid)::text = (m.partitionkeyid)::text)
...
-> Append (cost=0.69..516.96 rows=4010 width=78)
-> Index Only Scan using wmpct_p0_partitionkeyid_ctinfo_idx on wmpct_p0 p_1 (cost=0.69..15.78 rows=124 width=78)
...
for nest loop path, since the first one estimated only "8" rows , and they use partitionkeyid as joinkey and all are hash partitions , is it better to estimate cost to 8 (loop times) * 1600 = 12800 (each one loop map to only 1 hash partition bitmap scan ,avg one partition cost), that's much less than 398917.29 of all partitions ? for secondary Nest Loop Anti join could be rows 299118 rows * 15.78(avg index scan cost of one partition) = 4,720,082 that still much less than 132168227.57 ?
for Hash Right join, is it possible to estimate by 8 seq partition scan instead of all 32 hash partitions since the first query estimated 8 rows only ?
extend statistics may help estimate count(partitionkeyid) based on other columns bind variables, but looks like that did not help table join case.
Thanks,
James
On Sat, 6 Jul 2024 at 02:43, James Pang <jamespang886@gmail.com> wrote: > for nest loop path, since the first one estimated only "8" rows , and they use partitionkeyid as joinkey andall are hash partitions , is it better to estimate cost to 8 (loop times) * 1600 = 12800 (each one loop map to only1 hash partition bitmap scan ,avg one partition cost), that's much less than 398917.29 of all partitions ? I'm not really sure where you're getting the numbers from here. The outer side of the deepest nested loop has an 8 row estimate, not the nested loop itself. I'm unsure where the 1600 is from. I only see 1669. As of now, we don't do a great job of costing for partition pruning that will happen during execution. We won't be inventing anything to fix that in existing releases of PostgreSQL, so you'll need to either adjust the code yourself, or find a workaround. You've not shown us your schema, but perhaps enable_partitionwise_join = on might help you. Other things that might help are further lowering random_page_cost or raising effective_cache_size artificially high. It's hard to tell from here how much random I/O is being costed into the index scans. You could determine this by checking if the nested loop plan costs change as a result of doing further increases to effective_cache_size. You could maybe nudge it up enough for it to win over the hash join plan. It is possible that this won't work, however. > for secondary Nest Loop Anti join could be rows 299118 rows * 15.78(avg index scan cost of one partition) = 4,720,082that still much less than 132168227.57 ? > for Hash Right join, is it possible to estimate by 8 seq partition scan instead of all 32 hash partitions sincethe first query estimated 8 rows only ? > extend statistics may help estimate count(partitionkeyid) based on other columns bind variables, but looks likethat did not help table join case. I can't quite follow this. You'll need to better explain where you're getting these numbers for me to be able to understand. David
Sorry for confusion, it's from attached explain output of the SQL. please check attached. my questions is : for nestloop of two partition tables , they use same partition key and equal join on partition key, the cost could be "outer tables estimated rows" * (average index scan of only one partition of inner table) , instead of "outer tables estimated rows" * (index scans of all partitions), is it possible ? or it's still need running time partition pruning enhancement?
random_page_cost = 1.1, seq_page_cost=1.0, effective_cache_size=0.75*physical memory size. set random_page_cost=0.9 make optimizer to choose index scan instead of seq scan.
Thanks,
James
David Rowley <dgrowleyml@gmail.com> 於 2024年7月6日週六 上午8:33寫道:
On Sat, 6 Jul 2024 at 02:43, James Pang <jamespang886@gmail.com> wrote:
> for nest loop path, since the first one estimated only "8" rows , and they use partitionkeyid as joinkey and all are hash partitions , is it better to estimate cost to 8 (loop times) * 1600 = 12800 (each one loop map to only 1 hash partition bitmap scan ,avg one partition cost), that's much less than 398917.29 of all partitions ?
I'm not really sure where you're getting the numbers from here. The
outer side of the deepest nested loop has an 8 row estimate, not the
nested loop itself. I'm unsure where the 1600 is from. I only see
1669.
As of now, we don't do a great job of costing for partition pruning
that will happen during execution. We won't be inventing anything to
fix that in existing releases of PostgreSQL, so you'll need to either
adjust the code yourself, or find a workaround.
You've not shown us your schema, but perhaps enable_partitionwise_join
= on might help you. Other things that might help are further lowering
random_page_cost or raising effective_cache_size artificially high.
It's hard to tell from here how much random I/O is being costed into
the index scans. You could determine this by checking if the nested
loop plan costs change as a result of doing further increases to
effective_cache_size. You could maybe nudge it up enough for it to win
over the hash join plan. It is possible that this won't work, however.
> for secondary Nest Loop Anti join could be rows 299118 rows * 15.78(avg index scan cost of one partition) = 4,720,082 that still much less than 132168227.57 ?
> for Hash Right join, is it possible to estimate by 8 seq partition scan instead of all 32 hash partitions since the first query estimated 8 rows only ?
> extend statistics may help estimate count(partitionkeyid) based on other columns bind variables, but looks like that did not help table join case.
I can't quite follow this. You'll need to better explain where you're
getting these numbers for me to be able to understand.
David
Attachment
Is the query fast with some bind parameters but slow with others? If so, it'd be better to show an explain with 'fast' and 'slow' bind params, rather than the same bind params with enable_*=off. Or is the change because autoanalyze runs on some table and changes the statistics enough to change the plan ? Investigate by setting log_autovacuum_min_duration=0 or by checking pg_stat_all_tables.last_{auto,}{vacuum,analyze}. Maybe your llid/hhid are correlated, and you should CREATE STATISTICS. Or maybe the answer will be to increase the stats target.