BUG #17207: Bad cost estimate of Merge Join despite correct row estimate

From: PG Bug reporting form <noreply@postgresql.org>

The following bug has been logged on the website:

Bug reference:      17207
Logged by:          Simon Perepelitsa
Email address:      sema@sema.in
PostgreSQL version: 13.4
Operating system:   macOS 10.15.7
Description:

Setup

    create table test_users (id serial primary key);
    create table test_user_sessions (id serial primary key, user_id int not null references test_users(id));
    insert into test_users (id) select generate_series(1, 500000) as id;
    insert into test_user_sessions (user_id) values
      (1),
      (1),
      (500000);
    vacuum verbose analyze test_users, test_user_sessions;

Query

    explain analyze
    select test_user_sessions.id from test_user_sessions join test_users on user_id = test_users.id;

    Merge Join  (cost=1.49..1.55 rows=3 width=4) (actual time=0.015..71.034 rows=3 loops=1)
      Merge Cond: (test_users.id = test_user_sessions.user_id)
      ->  Index Only Scan using test_users_pkey on test_users  (cost=0.42..12996.42 rows=500000 width=4) (actual time=0.005..40.040 rows=500000 loops=1)
            Heap Fetches: 0
      ->  Sort  (cost=1.05..1.06 rows=3 width=8) (actual time=0.009..0.010 rows=3 loops=1)
            Sort Key: test_user_sessions.user_id
            Sort Method: quicksort  Memory: 25kB
            ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3 width=8) (actual time=0.004..0.005 rows=3 loops=1)
    Planning Time: 0.106 ms
    Execution Time: 71.061 ms

After set enable_mergejoin = false

    Nested Loop  (cost=0.42..14.35 rows=3 width=4) (actual time=0.019..0.025 rows=3 loops=1)
      ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3 width=8) (actual time=0.006..0.006 rows=3 loops=1)
      ->  Index Only Scan using test_users_pkey on test_users  (cost=0.42..4.44 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=3)
            Index Cond: (id = test_user_sessions.user_id)
            Heap Fetches: 0
    Planning Time: 0.078 ms
    Execution Time: 0.040 ms

As you can see, the Merge Join adds just 0.5 cost on top of the Seq Scan,
ignoring the high cost of the full index scan (0.42..12996.42). If Merge
Join is explicitly disabled, Nested Loop is obviously a much better join
plan for such a small table (3 rows).

While it is possible for the Merge Join to also finish quickly - if the
user_id values are all low numbers - I'm not sure that's a realistic
expectation for the default plan.

I also tried rewriting it as a semi-join with exists/in, but the query
plans were exactly the same. I'm not sure why, because in some of my
other queries this makes the planner use the more optimized "Semi Join"
node types (e.g. Nested Loop Semi Join).

    explain analyze
    select test_user_sessions.id from test_user_sessions where exists (select 1 from test_users where user_id = test_users.id);
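
For reference, an IN form of the same rewrite looks roughly like this (a
sketch of the equivalent query):

    explain analyze
    select test_user_sessions.id from test_user_sessions where user_id in (select id from test_users);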


PG Bug reporting form <noreply@postgresql.org> writes:
> [ seriously awful mergejoin cost estimate ]

> While it is possible for Merge Join to also finish quickly - if user_id are
> all low numbers - I'm not sure if that's a realistic expectation for the
> default plan.

Yeah, it looks like it's incorrectly adjusting based on a belief that
it won't have to scan all of the join input.

I found that the problem only appears in v11 and later; earlier versions
come out with a more realistic estimate of the cost of the merge.
"git bisect" says it changed at

    7d08ce286cd5854d58152e428c28636a616bdc42 is the first bad commit
    commit 7d08ce286cd5854d58152e428c28636a616bdc42
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Wed Sep 13 11:12:39 2017 -0400

        Distinguish selectivity of < from <= and > from >=.

I've not yet figured out what the connection is, but a reasonable
bet is that there's an edge case associated with the ranges of the
two join keys being exactly the same.  Anyway, I'll find it and
fix it.  Thanks for the report!

            regards, tom lane



I wrote:
> I've not yet figured out what the connection is, but a reasonable
> bet is that there's an edge case associated with the ranges of the
> two join keys being exactly the same.

Hmm, no, that's not it.  The problem stems from the weird statistics
we have for the smaller table's join column:

 attname | most_common_vals | most_common_freqs | histogram_bounds 
---------+------------------+-------------------+------------------
 user_id | {1}              | {0.6666667}       | 

We have an MCV item for "1", because it's repeated, but there's no
histogram (since, with only one other value, there'd be no way to
make one).  Despite this, get_variable_range() cheerfully decides
that the range of values of "user_id" is from 1 to 1, and then
mergejoinscansel() concludes that only an insignificant fraction
of the bigger table need be scanned.
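
(For reference, those numbers are the column's pg_stats entries; a query
along these lines shows them:

    select attname, most_common_vals, most_common_freqs, histogram_bounds
    from pg_stats
    where tablename = 'test_user_sessions' and attname = 'user_id';
)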

The apparent dependency on commit 7d08ce286 is a bit artificial.
We need to calculate the fraction of the bigger table that
has id <= 1 (the supposed upper bound of the smaller table).
Before that commit, that was approximated as "id < 1", and we got
an estimate of exactly 0.  With accurate handling of <=, we
realize that the estimate should be a shade larger than zero.
However, there's a bogus-estimate detector at the end of
mergejoinscansel:

    if (*rightstart >= *rightend)
    {
        *rightstart = 0.0;
        *rightend = 1.0;
    }

In v10, we have both rightstart and rightend exactly zero at this
point, so the BS detector trips and restores a sane rightend.
In later versions, that doesn't happen so we accept the not-too-sane
rightend value.

It seems clear that the appropriate fix is to make get_variable_range
more cautious.  We could decide that it should always require a
histogram to be present in order to succeed.  However, I'm worried
that that would result in unnecessary loss of accuracy in cases where
the MCV list covers all of the table (which is typically true for
enum-like columns).  So I'm thinking that if it sees an MCV list and
no histogram, it should succeed only if the MCV frequencies sum to 1,
with some small allowance for roundoff error.
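
Expressed against pg_stats, the criterion would amount to something like
this (just a sketch of the idea in SQL, not the actual change to
get_variable_range):

    -- trust the MCV list as a range bound only if its frequencies
    -- cover (essentially) the whole column
    select attname,
           (select sum(f) from unnest(most_common_freqs) as f) as mcv_freq_sum
    from pg_stats
    where tablename = 'test_user_sessions' and attname = 'user_id';

Here mcv_freq_sum comes out around 0.67, well short of 1, so the MCV
list would not be taken as establishing the column's range.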

            regards, tom lane