Re: Query plan prefers hash join when nested loop is much faster - Mailing list pgsql-general

From iulian dragos
Subject Re: Query plan prefers hash join when nested loop is much faster
Date
Msg-id CAMNsu3mu2FGVL=qRC50fHfa+SX3PN7tu4sPVRnFmVbWpvBfYHg@mail.gmail.com
Whole thread Raw
In response to Re: Query plan prefers hash join when nested loop is much faster  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Query plan prefers hash join when nested loop is much faster  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-general


On Tue, Aug 25, 2020 at 12:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 22 Aug 2020 at 00:35, iulian dragos
<iulian.dragos@databricks.com> wrote:
> I am trying to understand why the query planner insists on using a hash join, and how to make it choose the better option, which in this case would be a nested loop.

> |                           ->  Index Scan using test_result_module_result_id_idx on test_result  (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824) |
> |                                 Index Cond: (module_result_id = module_result.id)                                                                                                     |

You might want to check if the pg_stats view reports a realistic
n_distinct value for test_result.module_result_id.  If the
pg_class.retuples is correct for that relation then that would
indicate the n_distinct estimate is about 115000. Going by the number
of rows you've mentioned it would appear a more realistic value for
that would be -0.4. which is 0 - 1 / (500000000 / 200000000.0).
However, that's assuming each module_result  has a test_result.  You
could run a SELECT COUNT(DISTINCT module_result_id) FROM test_result;
to get a better idea.

If ANALYZE is not getting you a good value for n_distinct, then you
can overwrite it. See [1], search for n_distinct.

Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set at 131736.0, but the actual number is much higher: 210104361. I tried to set it manually, but the plan is still the same (both the actual number and a percentage, -0.4, as you suggested):

> ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct=210104361)                                                              
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.205s

 

David

[1] https://www.postgresql.org/docs/current/sql-altertable.html

pgsql-general by date:

Previous
From: Dirk Lattermann
Date:
Subject: Trigger transaction isolation
Next
From: Peter Eisentraut
Date:
Subject: Re: pgbouncer bug?