BUG #17207: Bad cost estimate of Merge Join despite correct row estimate
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      17207
Logged by:          Simon Perepelitsa
Email address:      sema@sema.in
PostgreSQL version: 13.4
Operating system:   macOS 10.15.7
Description:

Setup:

create table test_users (id serial primary key);
create table test_user_sessions (id serial primary key, user_id int not null references test_users(id));
insert into test_users (id) select generate_series(1, 500000) as id;
insert into test_user_sessions (user_id) values (1), (1), (500000);
vacuum verbose analyze test_users, test_user_sessions;

Query:

explain analyze select test_user_sessions.id
from test_user_sessions
join test_users on user_id = test_users.id;

 Merge Join  (cost=1.49..1.55 rows=3 width=4) (actual time=0.015..71.034 rows=3 loops=1)
   Merge Cond: (test_users.id = test_user_sessions.user_id)
   ->  Index Only Scan using test_users_pkey on test_users  (cost=0.42..12996.42 rows=500000 width=4) (actual time=0.005..40.040 rows=500000 loops=1)
         Heap Fetches: 0
   ->  Sort  (cost=1.05..1.06 rows=3 width=8) (actual time=0.009..0.010 rows=3 loops=1)
         Sort Key: test_user_sessions.user_id
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3 width=8) (actual time=0.004..0.005 rows=3 loops=1)
 Planning Time: 0.106 ms
 Execution Time: 71.061 ms

After "set enable_mergejoin = false":

 Nested Loop  (cost=0.42..14.35 rows=3 width=4) (actual time=0.019..0.025 rows=3 loops=1)
   ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3 width=8) (actual time=0.006..0.006 rows=3 loops=1)
   ->  Index Only Scan using test_users_pkey on test_users  (cost=0.42..4.44 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=3)
         Index Cond: (id = test_user_sessions.user_id)
         Heap Fetches: 0
 Planning Time: 0.078 ms
 Execution Time: 0.040 ms

As you can see, the Merge Join adds just 0.5 cost on top of the Seq Scan, ignoring the high cost of the full index scan (0.42..12996.42). When Merge Join is explicitly disabled, Nested Loop is obviously a much better join plan for such a small table (3 rows). While it is possible for the Merge Join to also finish quickly - if the user_id values are all low numbers - I'm not sure that's a realistic expectation for the default plan.

I also tried rewriting the query as a semi-join with exists/in, but the query plans were exactly the same. I'm not sure why, because in some of my other queries this makes the planner use more optimized Semi Join plan nodes (e.g. Nested Loop Semi Join).

explain analyze select test_user_sessions.id
from test_user_sessions
where exists (select 1 from test_users where user_id = test_users.id);
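For context, the statistics the planner has gathered for the join column can be inspected directly through the standard pg_stats view. A minimal sketch against the setup above (the exact output depends on the ANALYZE run, and is shown later in this thread):

select attname, most_common_vals, most_common_freqs, histogram_bounds
from pg_stats
where tablename = 'test_user_sessions' and attname = 'user_id';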
PG Bug reporting form <noreply@postgresql.org> writes:
> [ seriously awful mergejoin cost estimate ]
> While it is possible for Merge Join to also finish quickly - if user_id are
> all low numbers - I'm not sure if that's a realistic expectation for the
> default plan.

Yeah, it looks like it's incorrectly adjusting the cost based on a belief
that it won't have to scan all of the join input.  I found that the problem
only appears in v11 and later; earlier versions come out with a more
realistic estimate of the cost of the merge.  "git bisect" says it changed at

7d08ce286cd5854d58152e428c28636a616bdc42 is the first bad commit
commit 7d08ce286cd5854d58152e428c28636a616bdc42
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Sep 13 11:12:39 2017 -0400

    Distinguish selectivity of < from <= and > from >=.

I've not yet figured out what the connection is, but a reasonable bet is
that there's an edge case associated with the ranges of the two join keys
being exactly the same.  Anyway, I'll find it and fix it.  Thanks for the
report!

			regards, tom lane
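The commit identified above changes how the planner estimates boundary comparisons. A rough way to observe the < versus <= distinction it draws is to compare the planner's estimated row counts for the two operators at the smaller table's supposed upper bound of 1 (a sketch against the tables from the report; only the estimated rows matter here, and the numbers will vary with statistics):

explain select * from test_users where id <  1;
explain select * from test_users where id <= 1;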
I wrote:
> I've not yet figured out what the connection is, but a reasonable
> bet is that there's an edge case associated with the ranges of the
> two join keys being exactly the same.

Hmm, no, that's not it.  The problem stems from the weird statistics we
have for the smaller table's join column:

 attname | most_common_vals | most_common_freqs | histogram_bounds
---------+------------------+-------------------+------------------
 user_id | {1}              | {0.6666667}       |

We have an MCV item for "1", because it's repeated, but there's no
histogram (since, with only one other value, there'd be no way to make
one).  Despite this, get_variable_range() cheerfully decides that the
range of values of "user_id" is from 1 to 1, and then mergejoinscansel()
concludes that only an insignificant fraction of the bigger table need
be scanned.

The apparent dependency on commit 7d08ce286 is a bit artificial.  We need
to calculate the fraction of the bigger table that has id <= 1 (the
supposed upper bound of the smaller table).  Before that commit, that was
approximated as "id < 1", and we got an estimate of exactly 0.  With
accurate handling of <=, we realize that the estimate should be a shade
larger than zero.  However, there's a bogus-estimate detector at the end
of mergejoinscansel():

    if (*rightstart >= *rightend)
    {
        *rightstart = 0.0;
        *rightend = 1.0;
    }

In v10, we have both rightstart and rightend exactly zero at this point,
so the BS detector trips and restores a sane rightend.  In later versions,
that doesn't happen, so we accept the not-too-sane rightend value.

It seems clear that the appropriate fix is to make get_variable_range()
more cautious.  We could decide that it should always require a histogram
to be present in order to succeed.  However, I'm worried that that would
result in unnecessary loss of accuracy in cases where the MCV list covers
all of the table (which is typically true for enum-like columns).  So I'm
thinking that if it sees an MCV list and no histogram, it should succeed
only if the MCV frequencies sum to 1, with some small allowance for
roundoff error.

			regards, tom lane
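The proposed condition - no histogram, but an MCV list that accounts for essentially all rows - can be approximated from SQL for any column via pg_stats. A sketch, in which the 0.001 slop is a hypothetical stand-in for the "small allowance for roundoff error" mentioned above:

select tablename, attname,
       histogram_bounds is null as no_histogram,
       (select sum(f) from unnest(most_common_freqs) f) as mcv_freq_sum,
       (select sum(f) from unnest(most_common_freqs) f) >= 1.0 - 0.001 as mcv_covers_table
from pg_stats
where tablename = 'test_user_sessions' and attname = 'user_id';

For the reported case, mcv_freq_sum is about 0.67, so under the proposed rule get_variable_range() would decline to report a range and the merge-join scan selectivity would fall back to a sane default.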