BUG #17207: Bad cost estimate of Merge Join despite correct row estimate - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17207: Bad cost estimate of Merge Join despite correct row estimate
Date
Msg-id 17207-5265aefa79e333b4@postgresql.org
Whole thread Raw
Responses Re: BUG #17207: Bad cost estimate of Merge Join despite correct row estimate  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17207
Logged by:          Simon Perepelitsa
Email address:      sema@sema.in
PostgreSQL version: 13.4
Operating system:   macOS 10.15.7
Description:

Setup

    create table test_users (id serial primary key);
    create table test_user_sessions (id serial primary key, user_id int not
null references test_users(id));
    insert into test_users (id) select generate_series(1, 500000) as id;
    insert into test_user_sessions (user_id) values
      (1),
      (1),
      (500000);
    vacuum verbose analyze test_users, test_user_sessions;

Query

    explain analyze
    select test_user_sessions.id from test_user_sessions join test_users on
user_id = test_users.id;

    Merge Join  (cost=1.49..1.55 rows=3 width=4) (actual time=0.015..71.034
rows=3 loops=1)
      Merge Cond: (test_users.id = test_user_sessions.user_id)
      ->  Index Only Scan using test_users_pkey on test_users
(cost=0.42..12996.42 rows=500000 width=4) (actual time=0.005..40.040
rows=500000 loops=1)
            Heap Fetches: 0
      ->  Sort  (cost=1.05..1.06 rows=3 width=8) (actual time=0.009..0.010
rows=3 loops=1)
            Sort Key: test_user_sessions.user_id
            Sort Method: quicksort  Memory: 25kB
            ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3
width=8) (actual time=0.004..0.005 rows=3 loops=1)
    Planning Time: 0.106 ms
    Execution Time: 71.061 ms

After set enable_mergejoin = false

    Nested Loop  (cost=0.42..14.35 rows=3 width=4) (actual time=0.019..0.025
rows=3 loops=1)
      ->  Seq Scan on test_user_sessions  (cost=0.00..1.03 rows=3 width=8)
(actual time=0.006..0.006 rows=3 loops=1)
      ->  Index Only Scan using test_users_pkey on test_users
(cost=0.42..4.44 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=3)
            Index Cond: (id = test_user_sessions.user_id)
            Heap Fetches: 0
    Planning Time: 0.078 ms
    Execution Time: 0.040 ms

As you can see, Merge Join adds just 0.5 cost on top of Seq Scan ignoring
the high cost of full index scan (0.42..12996.42). If explicitly disabled,
Nested Loop is obviously a much better join plan for such a small table (3
rows).

While it is possible for Merge Join to also finish quickly - if user_id are
all low numbers - I'm not sure if that's a realistic expectation for the
default plan.

I also tried rewriting it as a semi-join with exists/in, but the query plans
were exactly the same. Not sure why, because in some of my other queries
this makes the planner use more optimized "Semi-Join" instructions (e.g.
Nested Loop Semi-Join).

    explain analyze
    select test_user_sessions.id from test_user_sessions where exists
(select 1 from test_users where user_id = test_users.id);


pgsql-bugs by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: BUG #17206: the function array_cat(anyarray, anyarray) does not exist
Next
From: Tom Lane
Date:
Subject: Re: BUG #17206: the function array_cat(anyarray, anyarray) does not exist