Thread: Query plan prefers hash join when nested loop is much faster
Hi,
I am trying to understand why the query planner insists on using a hash join, and how to make it choose the better option, which in this case would be a nested loop. I have two tables:
-- about 200 million rows
CREATE TABLE module_result(
id bigserial PRIMARY KEY,
name_id bigint NOT NULL references result_name(id),
run_id integer NOT NULL references run (id),
logs text NOT NULL,
status result_status NOT NULL
);
CREATE INDEX ON module_result (run_id);
-- about 500 million rows
CREATE TABLE test_result(
id bigserial PRIMARY KEY,
name_id bigint NOT NULL references result_name(id),
module_result_id bigint NOT NULL references module_result (id),
seconds float NOT NULL,
failure_msg text, -- Either a <failure>...</failure> or an <error message="... />
status result_status NOT NULL
);
CREATE INDEX ON test_result (module_result_id);
I'm trying to select all test cases that belong to a given run_id, which logically means finding all IDs in module_result that belong to a given run and then selecting the test results for those IDs (each run_id has several module_result rows, which in turn have several test_result rows each).
EXPLAIN ANALYZE SELECT test_result.status, count(test_result.status) as "Count" FROM test_result INNER JOIN module_result ON module_result.id = test_result.module_result_id WHERE module_result.run_id=158523 GROUP BY test_result.status
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Finalize GroupAggregate (cost=7771702.73..7771804.08 rows=3 width=12) (actual time=32341.993..32341.994 rows=2 loops=1) |
| Group Key: test_result.status |
| -> Gather Merge (cost=7771702.73..7771804.02 rows=6 width=12) (actual time=32341.970..32343.222 rows=6 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Partial GroupAggregate (cost=7770702.71..7770803.30 rows=3 width=12) (actual time=32340.278..32340.286 rows=2 loops=3) |
| Group Key: test_result.status |
| -> Sort (cost=7770702.71..7770736.23 rows=13408 width=4) (actual time=32339.698..32339.916 rows=4941 loops=3) |
| Sort Key: test_result.status |
| Sort Method: quicksort Memory: 431kB |
| Worker 0: Sort Method: quicksort Memory: 433kB |
| Worker 1: Sort Method: quicksort Memory: 409kB |
| -> Hash Join (cost=586.15..7769783.54 rows=13408 width=4) (actual time=18112.078..32339.011 rows=4941 loops=3) |
| Hash Cond: (test_result.module_result_id = module_result.id) |
| -> Parallel Seq Scan on test_result (cost=0.00..7145224.72 rows=237703872 width=12) (actual time=0.034..15957.894 rows=190207740 loops=3) |
| -> Hash (cost=438.41..438.41 rows=11819 width=8) (actual time=3.905..3.905 rows=14824 loops=3) |
| Buckets: 16384 Batches: 1 Memory Usage: 708kB |
| -> Index Scan using module_result_run_id_idx on module_result (cost=0.57..438.41 rows=11819 width=8) (actual time=0.017..2.197 rows=14824 loops=3) |
| Index Cond: (run_id = 158523) |
| Planning Time: 0.178 ms |
| Execution Time: 32343.330 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 32.572s (32 seconds), executed in: 32.551s (32 seconds)
This plan takes about 30s to execute. If I turn off seqscan, I get a nested loop join that takes about 0.02s to execute:
set enable_seqscan = off
SET
Time: 0.305s
> explain analyze select test_result.status, count(test_result.status) as "Count" from test_result inner join module_result ON module_result.id = test_result.module_result_id where module_result.run_id=158523 group by test_result.status
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Finalize GroupAggregate (cost=34297042.16..34297143.50 rows=3 width=12) (actual time=15.014..15.015 rows=2 loops=1) |
| Group Key: test_result.status |
| -> Gather Merge (cost=34297042.16..34297143.44 rows=6 width=12) (actual time=15.005..15.850 rows=6 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Partial GroupAggregate (cost=34296042.13..34296142.72 rows=3 width=12) (actual time=12.937..12.940 rows=2 loops=3) |
| Group Key: test_result.status |
| -> Sort (cost=34296042.13..34296075.65 rows=13408 width=4) (actual time=12.339..12.559 rows=4941 loops=3) |
| Sort Key: test_result.status |
| Sort Method: quicksort Memory: 461kB |
| Worker 0: Sort Method: quicksort Memory: 403kB |
| Worker 1: Sort Method: quicksort Memory: 408kB |
| -> Nested Loop (cost=232.74..34295122.96 rows=13408 width=4) (actual time=0.232..11.671 rows=4941 loops=3) |
| -> Parallel Bitmap Heap Scan on module_result (cost=232.17..44321.35 rows=4925 width=8) (actual time=0.218..0.671 rows=4941 loops=3) |
| Recheck Cond: (run_id = 158523) |
| Heap Blocks: exact=50 |
| -> Bitmap Index Scan on module_result_run_id_idx (cost=0.00..229.21 rows=11819 width=0) (actual time=0.592..0.592 rows=14824 loops=1) |
| Index Cond: (run_id = 158523) |
| -> Index Scan using test_result_module_result_id_idx on test_result (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824) |
| Index Cond: (module_result_id = module_result.id) |
| Planning Time: 0.214 ms |
| Execution Time: 15.932 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.235s
I don't think it's recommended to turn off seqscan in production, so I'm looking for a way to make the query planner choose the significantly faster plan. How can I do that? It's probably related to statistics, but they are up to date (I ran ANALYZE several times).
Any pointers would be very helpful,
thank you,
iulian
Your system is preferring sequential scan to using test_result_module_result_id_idx in this case. What type of storage do you use, what type of cache hits do you expect, and what do you have random_page_cost set to? That comes to mind as a significant factor in choosing index scans based on costs.
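For reference, checking and experimenting with this setting per session might look roughly like the sketch below; the 1.1 value is only an illustrative figure sometimes suggested for SSD-backed storage, not a tested recommendation for this instance.

SHOW random_page_cost;
-- Try a lower value for the current session only, then re-run EXPLAIN ANALYZE
SET random_page_cost = 1.1;
-- Revert to the configured value when done experimenting
RESET random_page_cost;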
Hi Michael,
Thanks for the answer. It's an RDS instance using SSD storage and the default `random_page_cost` set to 4.0. I don't expect a lot of repetitive queries here, so I think caching may not be extremely useful. I wonder if the selectivity of the query is wrongly estimated (out of 500 million rows, only a few thousands are returned).
I tried lowering the `random_page_cost` to 1.2 and it didn't make a difference in the query plan.
iulian
I experimented a bit more with different values for this setting. The only way I could make it use the index was to use a value strictly less than `seq_page_cost` (0.8 for instance). That doesn't sound right, though.
effective_cache_size is fairly high as well (32 GB) for an instance with 64 GB of RAM (db.m5.4xlarge).
iulian
On Tue, Aug 25, 2020 at 12:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 22 Aug 2020 at 00:35, iulian dragos
<iulian.dragos@databricks.com> wrote:
> I am trying to understand why the query planner insists on using a hash join, and how to make it choose the better option, which in this case would be a nested loop.
> | -> Index Scan using test_result_module_result_id_idx on test_result (cost=0.57..6911.17 rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824) |
> | Index Cond: (module_result_id = module_result.id) |
You might want to check if the pg_stats view reports a realistic
n_distinct value for test_result.module_result_id. If the
pg_class.reltuples is correct for that relation then that would
indicate the n_distinct estimate is about 115000. Going by the number
of rows you've mentioned it would appear a more realistic value for
that would be -0.4, which is 0 - 1 / (500000000 / 200000000.0).
However, that's assuming each module_result has a test_result. You
could run a SELECT COUNT(DISTINCT module_result_id) FROM test_result;
to get a better idea.
If ANALYZE is not getting you a good value for n_distinct, then you
can overwrite it. See [1], search for n_distinct.
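For reference, inspecting the current estimate might look roughly like this (a sketch; negative n_distinct values mean a fraction of the table's row count, and the COUNT(DISTINCT ...) can be slow on a 500-million-row table):

-- What the planner currently believes about module_result_id
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'test_result'
  AND attname = 'module_result_id';

-- Compare against the real distinct count
SELECT COUNT(DISTINCT module_result_id) FROM test_result;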
Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set to 131736.0, but the actual number is much higher: 210104361. I tried to set it manually (both the actual number and the fractional form, -0.4, as you suggested), but the plan is still the same:
> ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct=210104361)
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.205s
David
[1] https://www.postgresql.org/docs/current/sql-altertable.html
On Tue, Aug 25, 2020 at 12:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 25 Aug 2020 at 22:10, iulian dragos
<iulian.dragos@databricks.com> wrote:
> Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set at 131736.0, but the actual number is much higher: 210104361. I tried to set it manually, but the plan is still the same (both the actual number and a percentage, -0.4, as you suggested):
You'll need to run ANALYZE on the table after doing the ALTER TABLE to
change the n_distinct. The ANALYZE writes the value to pg_statistic.
ALTER TABLE only takes it as far as pg_attribute's attoptions.
ANALYZE reads that column to see if the n_distinct estimate should be
overwritten before writing out pg_statistic.
Ah, rookie mistake. Thanks for clarifying this. Indeed, after I ran ANALYZE the faster plan was selected! Yay!
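Putting the pieces together, the sequence that resolves it presumably looks roughly like this (a sketch; -0.4 reflects roughly 200 million distinct module_result_id values out of 500 million test_result rows, per the figures earlier in the thread):

-- Tell the planner that distinct module_result_id values are ~40% of the row count
ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct = -0.4);
-- Required so the override is actually written to pg_statistic
ANALYZE test_result;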
Just remember if you're hardcoding a positive value that it'll stay
fixed until you change it. If the table is likely to grow, then you
might want to reconsider using a positive value and consider using a
negative value as mentioned in the doc link.
Good point, I went for -0.4 and that seems to be doing the trick!
Thanks a lot for helping out!
David