Thread: Merge Join chooses very slow index scan
I am having problems with a join where the planner picks a merge join and an index scan on one of the tables. Manually disabling merge joins and running the query both ways shows the merge join takes over 10 seconds while a hash join takes less than 100ms. The planner's total cost estimate favors the merge join, but the cost estimate for the index scan part of the plan is greater than the total cost estimate by a factor of 300x. My understanding of how this can occur is that the planner expects it won't actually have to scan all the rows: using the histogram distribution stats, it can know that all the relevant values of the join column will be at the beginning of the scan. But in practice it appears to index scan all the rows, showing massive numbers of page hits. What is also confusing is that the planner's estimate of the number of rows that match the second join condition is accurate and very low, so I would expect it to index scan on that column's index instead.

Pasted at the bottom are the explain plans for the query and some variations I think might be relevant. The table/index names are obfuscated. I ran ANALYZE on all the tables in the query first. All the pages are cached in the explain plans, but we wouldn't expect that to be true in the production system. There are btree indexes on all the columns in both the join conditions and the filters.

Searching, I found this thread, which sounds similar, but there are no NULLs in this table:
http://postgresql.nabble.com/merge-join-killing-performance-td2076433.html

Thanks for your help.

Postgres version info: PostgreSQL 9.1.13 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

-----------------------
Original Query
-----------------------
The estimated cost for the index scan is 898k, but the total cost estimate is 2.6k. The planner has a good row estimate for the index scan (1335), but judging by the number of page hits (8M) it appears it actually scanned the entire table, which has about 8M rows.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
v.id = tid AND type = 'vehicle';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Right Join  (cost=593.28..2634.10 rows=4155 width=619) (actual time=9.150..11464.949 rows=4155 loops=1)
   Merge Cond: (usagestats.tid = v.id)
   Buffers: shared hit=8063988
   ->  Index Scan using usagestats_tid_idx on usagestats  (cost=0.00..898911.91 rows=1335 width=37) (actual time=0.027..11448.789 rows=2979 loops=1)
         Filter: ((type)::text = 'vehicle'::text)
         Buffers: shared hit=8063686
   ->  Sort  (cost=593.28..603.67 rows=4155 width=582) (actual time=9.108..10.429 rows=4155 loops=1)
         Sort Key: v.id
         Sort Method: quicksort  Memory: 1657kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.014..2.917 rows=4155 loops=1)
               Buffers: shared hit=302
 Total runtime: 11466.122 ms
(13 rows)

------------------------
Change the type = 'vehicle' condition to an always-true condition
------------------------
If we change the filter from "type = 'vehicle'" (true for a small fraction of the rows) to "freq > -1" (true for all rows), the plan is the same, but the actual time and page hits are much lower and the query returns fast.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
(v.id = tid AND freq > -1);

                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Right Join  (cost=593.28..2434.79 rows=7733 width=619) (actual time=5.635..59.852 rows=17096 loops=1)
   Merge Cond: (usagestats.tid = v.id)
   Buffers: shared hit=17653
   ->  Index Scan using usagestats_tid_idx on usagestats  (cost=0.00..898914.00 rows=8006976 width=37) (actual time=0.010..34.075 rows=17225 loops=1)
         Filter: (freq > (-1))
         Buffers: shared hit=17351
   ->  Sort  (cost=593.28..603.67 rows=4155 width=582) (actual time=5.617..9.351 rows=17094 loops=1)
         Sort Key: v.id
         Sort Method: quicksort  Memory: 1657kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.009..1.803 rows=4157 loops=1)
               Buffers: shared hit=302
 Total runtime: 62.868 ms
(13 rows)

----------------------
Original query with merge joins disabled
----------------------
If we manually disable merge joins and run the original query, we get a hash join with what seems like a more reasonable index scan on the more selective type column. The total cost estimate is higher than for the merge join plan, but lower than the cost estimate for the index scan in the merge join plan.

BEGIN;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
v.id = tid AND type = 'vehicle';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Right Join  (cost=395.49..5158.10 rows=4155 width=619) (actual time=8.038..20.886 rows=4155 loops=1)
   Hash Cond: (usagestats.tid = v.id)
   Buffers: shared hit=3250
   ->  Index Scan using usagestats_type_idx on usagestats  (cost=0.00..4752.59 rows=1335 width=37) (actual time=0.100..6.770 rows=2979 loops=1)
         Index Cond: ((type)::text = 'vehicle'::text)
         Buffers: shared hit=2948
   ->  Hash  (cost=343.55..343.55 rows=4155 width=582) (actual time=7.908..7.908 rows=4155 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1088kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.021..3.068 rows=4155 loops=1)
               Buffers: shared hit=302
 Total runtime: 21.936 ms
(12 rows)

-----------------------
Miscellaneous stats
-----------------------
SELECT COUNT(*) FROM vehicle;
 count
-------
  4155
(1 row)

SELECT COUNT(*) FROM usagestats;
  count
---------
 8007015
(1 row)

The usagestats table has 501 histogram buckets for the tid column. The max id in the vehicle table is 4155, and the first two buckets of the histogram cover all the values between 1 and 4500.

--
View this message in context: http://postgresql.nabble.com/Merge-Join-chooses-very-slow-index-scan-tp5842523.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
Hi
what is your random_page_cost and seq_page_cost?

2015-03-19 7:23 GMT+01:00 Jake Magner <jakemagner90@gmail.com>:
> I am having problems with a join where the planner picks a merge join and an
> index scan on one of the tables. Manually disabling merge joins and running
> the query both ways shows the merge join takes over 10 seconds while a hash
> join takes less than 100ms.
> ...
random_page_cost = 4
seq_page_cost = 1

Regardless of the choice to use the index scan and random access to the rows, how come the second query, with the freq > -1 condition, accesses far fewer pages with the same index scan even though no rows are filtered out?

Thanks
Jake Magner <jakemagner90@gmail.com> writes:
> I am having problems with a join where the planner picks a merge join and
> an index scan on one of the tables. Manually disabling merge joins and
> running the query both ways shows the merge join takes over 10 seconds
> while a hash join takes less than 100ms.
> ...
> If we change the filter from "type = 'vehicle'" (True for a small fraction
> of the rows) to "freq > -1" (True for all rows) then the plan is the same,
> but the actual time and page hits are much less and the query returns fast.

I think what must be happening is that the planner notes the maximum possible value of v.id and supposes that the mergejoin will stop far short of completion, because v.id spans just a small part of the range of usagestats.tid. Which it does, when you have only the nonselective filter condition on usagestats. However, the executor cannot stop until it has fetched a usagestats row with a tid value larger than the last v.id value; otherwise it can't be sure it has emitted all the required join rows. I'm guessing that the "type = 'vehicle'" condition eliminates all such rows, or at least enough of them that a very large part of the usagestats table has to be scanned to find the first can't-possibly-match row.

I'm not sure there's anything much we can do to improve this situation in Postgres. It seems like a sufficiently bizarre corner case that it wouldn't be appropriate to spend planner cycles checking for it, and I'm not sure how we'd check for it even if we were willing to spend those cycles. You might consider altering the query, inserting some kind of dummy sentinel row in the data, or changing the schema (is it really sensible to keep vehicle usagestats in the same table as the other usagestats?). A brute-force fix would be "enable_mergejoin = off", but that would prevent selecting this plan type even when it actually is a significant win.

			regards, tom lane
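[Editor's note: the stop condition described above can be sketched with a toy simulation. The data sizes are invented and this is Python rather than PostgreSQL's executor code; the point is only that the filter runs inside the inner scan, so filtered-out rows can never act as the "can't-possibly-match" sentinel that lets the merge join stop.]

```python
# Toy simulation of the merge-join stop condition. Invented table sizes;
# an illustration of the behaviour, not PostgreSQL's actual implementation.

outer_ids = list(range(1, 101))  # sorted "vehicle" ids: 1..100
# Inner table sorted by tid; only tids 1..100 have type = 'vehicle'.
inner = [(tid, 'vehicle' if tid <= 100 else 'other') for tid in range(1, 10_001)]

def inner_rows_fetched(keep):
    """Count inner rows fetched before the join can stop.

    The filter `keep` is applied inside the inner scan, so the join only
    sees surviving rows; it can stop only once a *surviving* row's tid
    exceeds the largest outer id."""
    fetched = 0
    max_outer = outer_ids[-1]
    for tid, typ in inner:
        fetched += 1
        if not keep(tid, typ):
            continue  # filtered out: invisible to the join, can't end the scan
        if tid > max_outer:
            break  # first surviving row past the outer range: safe to stop
    return fetched

# Nonselective filter (like freq > -1): stops right after the outer range.
print(inner_rows_fetched(lambda tid, typ: True))              # 101
# Selective filter (like type = 'vehicle'): no surviving sentinel row exists,
# so the scan runs to the end of the inner input.
print(inner_rows_fetched(lambda tid, typ: typ == 'vehicle'))  # 10000
```

With the nonselective filter the scan stops right after the outer range; with the selective one it reads the whole inner input, which matches the ~8M buffer hits in the original plan.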
I wrote:
> [ assorted possible workarounds ]

Actually, an easy fix might be to create a 2-column index on usagestats(type, tid). I think the planner should be able to use that to produce sorted output for the mergejoin, and you'd get the best of both worlds, because the indexscan will stop immediately when it's exhausted the rows with type = 'vehicle'.

			regards, tom lane
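[Editor's note: a toy sketch of why the column order matters, with invented numbers and a sorted Python list standing in for a btree; not PostgreSQL internals. With the filter column first, the matching entries are contiguous and still sorted by tid, so the scan can end as soon as that range is exhausted.]

```python
import bisect

# Toy model: an index is a sorted list of keys; count how many entries a
# scan must read before the merge join can stop.
N = 10_000
MAX_OUTER = 100  # vehicle ids are 1..100; only those tids have type = 'vehicle'
rows = [(tid, 'vehicle' if tid <= MAX_OUTER else 'other') for tid in range(1, N + 1)]

# Index on (tid) with the type filter applied per row: no surviving row has
# tid > MAX_OUTER, so the scan reads every entry (the original slow plan).
entries_tid_only = 0
for tid, typ in sorted(rows):
    entries_tid_only += 1
    if typ == 'vehicle' and tid > MAX_OUTER:
        break

# Index on (type, tid): a range scan seeks directly to the type = 'vehicle'
# block; entries inside it are already sorted by tid, so they can feed the
# merge join, and the scan ends when the block does.
type_tid_index = sorted((typ, tid) for tid, typ in rows)
lo = bisect.bisect_left(type_tid_index, ('vehicle',))
hi = bisect.bisect_right(type_tid_index, ('vehicle', N + 1))
entries_type_tid = hi - lo

print(entries_tid_only, entries_type_tid)  # 10000 100
```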
Thanks Tom, that sounds like what is happening. Some additional comments/questions inline.

Tom Lane-2 wrote
> I think what must be happening is that the planner notes the maximum
> possible value of v.id and supposes that the mergejoin will stop far short
> of completion because v.id spans just a small part of the range of
> usagestats.tid. ... However, the executor cannot stop until it's
> fetched a usagestats row that has a tid value larger than the last v.id
> value; otherwise it can't be sure it's emitted all the required join rows.

Ok, that makes a lot of sense. It is scanning the tid index though, so once it gets past the last value in v.id, isn't it guaranteed that there can be no more required join rows? Even if it sees tid = 5000 and type = 'aircraft', it can know there are no more tids less than 5000. It must be that it waits to do this check until it gets a row that matches the filter; maybe that is an optimization in most cases? The cost of the check seems small enough compared to the cost of looking up the next row that it could be done every time.

Tom Lane-2 wrote
> I'm guessing that the "type = 'vehicle'" condition eliminates all such
> rows, or at least enough of them that a very large part of the usagestats
> table has to be scanned to find the first can't-possibly-match row.

Yes, you are exactly right.

Tom Lane-2 wrote
> I'm not sure there's anything much we can do to improve this situation
> in Postgres. ... You might consider altering the query, or inserting
> some kind of dummy sentinel row in the data, or changing the schema ...
> A brute-force fix would be "enable_mergejoin = off", but that would
> prevent selecting this plan type even when it actually is a significant win.

I agree it may make sense to change the schema, although there are some good reasons to have it this way (I obfuscated the table names). If we partitioned the table on "type", would the planner be able to stop after finishing the 'vehicle' type partition?

Tom Lane-2 wrote
> Actually, an easy fix might be to create a 2-column index on
> usagestats(type, tid). I think the planner should be able to
> use that to produce sorted output for the mergejoin ...

Actually, there is already an index on (type, tid) and it doesn't help. I just tried adding an index on (tid, type) and it partially fixed the issue: judging by the page hits, it still scans all the rows of the compound index, but it no longer needs to go to the actual table. This takes 600ms instead of the 11,500ms of the original query, but still much more than the 60ms when the type = 'vehicle' condition is changed to freq > -1, so it isn't a perfect solution. We could also switch to enum values for the type field, which might reduce the (tid, type) index size enough to make the performance adequate, but it would be best if we could just get it to quit the scan early, so that performance doesn't degrade if the table grows significantly.

Best, Jake
I think I understand now, after reading the notes on the merge join algorithm here:
https://github.com/postgres/postgres/blob/4ea51cdfe85ceef8afabceb03c446574daa0ac23/src/backend/executor/nodeMergejoin.c

The index scanning node doesn't know the max id of the vehicle table, so it can't know when to stop looking for the next tuple that matches the join filter. There doesn't seem to be any easy way to improve that case. I imagine it is a fairly common pattern, though: any time metadata about entities modeled in different tables is stored in one shared table, and the distribution of rows per entity type is skewed, this situation will arise.

Best, Jake
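[Editor's note: the inner-advance loop at the heart of the problem can be paraphrased in a heavily simplified Python sketch. This is loosely based on the high-level description in the nodeMergejoin.c comments, not the actual implementation; it shows the inner-join case with unique keys and ignores mark/restore and NULL handling.]

```python
def merge_join(outer_rows, inner_rows):
    """Heavily simplified merge join over two key-sorted inputs.

    The inner side is a black box: it only yields rows that already passed
    its filter, and this loop has no idea what the maximum outer key is,
    so it keeps pulling inner rows until one catches up with the outer key."""
    inner_iter = iter(inner_rows)
    inner = next(inner_iter, None)
    out = []
    for outer_key, outer_val in outer_rows:
        # Advance the inner side until its key catches up with the outer key.
        while inner is not None and inner[0] < outer_key:
            inner = next(inner_iter, None)
        if inner is not None and inner[0] == outer_key:
            out.append((outer_val, inner[1]))
    return out

pairs = merge_join([(1, 'a'), (3, 'b')], [(1, 'x'), (2, 'y'), (3, 'z')])
print(pairs)  # [('a', 'x'), ('b', 'z')]
```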