Thread: Merge Join chooses very slow index scan
I am having problems with a join where the planner picks a merge join and an index scan on one of the tables. Manually disabling merge joins and running the query both ways shows the merge join takes over 10 seconds while a hash join takes less than 100ms. The planner's total cost estimate favors the merge join, but the cost estimate for the index scan part of the plan is greater than the total cost estimate by a factor of 300x. My understanding of how this can occur is that the planner expects it won't actually have to scan all the rows: using the histogram distribution stats, it can know that all the relevant values of the join column will be at the beginning of the scan. But in practice it appears to index scan all the rows, showing massive numbers of page hits. What is also confusing is that the planner's estimate of the number of rows that match the second join condition is accurate and very low, so I would expect it to index scan on that column's index instead.

Pasted at the bottom are the explain plans for the query and some variations I think might be relevant. The table/index names are obfuscated. I ran ANALYZE on all the tables in the query first. All the pages are cached in the explain plans, but we wouldn't expect that to be true in the production system. There are btree indexes on all the columns in both the join conditions and the filters.

Searching, I found this thread, which sounds similar, but there are no NULLs in this table:
http://postgresql.nabble.com/merge-join-killing-performance-td2076433.html

Thanks for your help.

Postgres version info: PostgreSQL 9.1.13 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

-----------------------
Original Query
-----------------------
The estimated cost for the index scan is 898k, but the total cost estimate is 2.6k. The planner has a good row estimate for the index scan (1335), but judging by the number of page hits (8M) it appears it actually scanned the entire table, which has about 8M rows.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
v.id = tid AND type = 'vehicle';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Right Join  (cost=593.28..2634.10 rows=4155 width=619) (actual time=9.150..11464.949 rows=4155 loops=1)
   Merge Cond: (usagestats.tid = v.id)
   Buffers: shared hit=8063988
   ->  Index Scan using usagestats_tid_idx on usagestats  (cost=0.00..898911.91 rows=1335 width=37) (actual time=0.027..11448.789 rows=2979 loops=1)
         Filter: ((type)::text = 'vehicle'::text)
         Buffers: shared hit=8063686
   ->  Sort  (cost=593.28..603.67 rows=4155 width=582) (actual time=9.108..10.429 rows=4155 loops=1)
         Sort Key: v.id
         Sort Method: quicksort  Memory: 1657kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.014..2.917 rows=4155 loops=1)
               Buffers: shared hit=302
 Total runtime: 11466.122 ms
(13 rows)

------------------------
Change the type = 'vehicle' condition to an always-true condition
------------------------
If we change the filter from "type = 'vehicle'" (true for a small fraction of the rows) to "freq > -1" (true for all rows), the plan is the same, but the actual time and page hits are much lower and the query returns fast.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
(v.id = tid AND freq > -1);

                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Right Join  (cost=593.28..2434.79 rows=7733 width=619) (actual time=5.635..59.852 rows=17096 loops=1)
   Merge Cond: (usagestats.tid = v.id)
   Buffers: shared hit=17653
   ->  Index Scan using usagestats_tid_idx on usagestats  (cost=0.00..898914.00 rows=8006976 width=37) (actual time=0.010..34.075 rows=17225 loops=1)
         Filter: (freq > (-1))
         Buffers: shared hit=17351
   ->  Sort  (cost=593.28..603.67 rows=4155 width=582) (actual time=5.617..9.351 rows=17094 loops=1)
         Sort Key: v.id
         Sort Method: quicksort  Memory: 1657kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.009..1.803 rows=4157 loops=1)
               Buffers: shared hit=302
 Total runtime: 62.868 ms
(13 rows)

----------------------
Original query with merge joins disabled
----------------------
If we manually disable merge joins and run the original query, we get a hash join with what seems like a more reasonable index scan on the more selective type column. The total cost estimate is higher than for the merge join plan, but lower than the cost estimate for the index scan in the merge join plan.

BEGIN;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vehicle v LEFT JOIN usagestats ON
v.id = tid AND type = 'vehicle';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Right Join  (cost=395.49..5158.10 rows=4155 width=619) (actual time=8.038..20.886 rows=4155 loops=1)
   Hash Cond: (usagestats.tid = v.id)
   Buffers: shared hit=3250
   ->  Index Scan using usagestats_type_idx on usagestats  (cost=0.00..4752.59 rows=1335 width=37) (actual time=0.100..6.770 rows=2979 loops=1)
         Index Cond: ((type)::text = 'vehicle'::text)
         Buffers: shared hit=2948
   ->  Hash  (cost=343.55..343.55 rows=4155 width=582) (actual time=7.908..7.908 rows=4155 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1088kB
         Buffers: shared hit=302
         ->  Seq Scan on vehicle v  (cost=0.00..343.55 rows=4155 width=582) (actual time=0.021..3.068 rows=4155 loops=1)
               Buffers: shared hit=302
 Total runtime: 21.936 ms
(12 rows)

-----------------------
Miscellaneous stats
-----------------------
SELECT COUNT(*) FROM vehicle;
 count
-------
  4155
(1 row)

SELECT COUNT(*) FROM usagestats;
  count
---------
 8007015
(1 row)

The usagestats table has 501 histogram buckets for the tid column. The max id in the vehicle table is 4155, and the first two buckets of the histogram cover all the values between 1 and 4500.

--
View this message in context: http://postgresql.nabble.com/Merge-Join-chooses-very-slow-index-scan-tp5842523.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
Hi
what is your random_page_cost and seq_page_cost?

2015-03-19 7:23 GMT+01:00 Jake Magner <jakemagner90@gmail.com>:
> I am having problems with a join where the planner picks a merge join and an
> index scan on one of the tables. Manually disabling merge joins and running
> the query both ways shows the merge join takes over 10 seconds while a hash
> join takes less than 100ms.
> ...
random_page_cost = 4
seq_page_cost = 1

Regardless of the choice to use the index scan and random access to the rows, how come the second query, with the freq > -1 condition, accesses far fewer pages with the same index scan even though no rows are filtered out?

Thanks
Jake Magner <jakemagner90@gmail.com> writes:
> I am having problems with a join where the planner picks a merge join and
> an index scan on one of the tables. Manually disabling merge joins and
> running the query both ways shows the merge join takes over 10 seconds
> while a hash join takes less than 100ms.
> ...
> If we change the filter from "type = 'vehicle'" (True for a small fraction
> of the rows) to "freq > -1" (True for all rows) then the plan is the same,
> but the actual time and page hits are much less and the query returns fast.

I think what must be happening is that the planner notes the maximum possible value of v.id and supposes that the mergejoin will stop far short of completion, because v.id spans just a small part of the range of usagestats.tid. Which it does, when you have only the nonselective filter condition on usagestats. However, the executor cannot stop until it has fetched a usagestats row with a tid value larger than the last v.id value; otherwise it can't be sure it has emitted all the required join rows. I'm guessing that the "type = 'vehicle'" condition eliminates all such rows, or at least enough of them that a very large part of the usagestats table has to be scanned to find the first can't-possibly-match row.

I'm not sure there's anything much we can do to improve this situation in Postgres. It seems like a sufficiently bizarre corner case that it wouldn't be appropriate to spend planner cycles checking for it, and I'm not sure how we'd check for it even if we were willing to spend those cycles. You might consider altering the query, inserting some kind of dummy sentinel row in the data, or changing the schema (is it really sensible to keep vehicle usagestats in the same table as the other usagestats?). A brute-force fix would be "enable_mergejoin = off", but that would prevent selecting this plan type even when it actually is a significant win.

			regards, tom lane
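[Editor's note: the stop condition described above can be sketched with a toy simulation. The data sizes are invented and this is Python rather than PostgreSQL's executor code; the point is only that the filter runs inside the inner scan, so filtered-out rows can never act as the "can't-possibly-match" sentinel that lets the merge join stop.]

```python
# Toy simulation of the merge-join stop condition. Invented table sizes;
# an illustration of the behaviour, not PostgreSQL's actual implementation.

outer_ids = list(range(1, 101))  # sorted "vehicle" ids: 1..100
# Inner table sorted by tid; only tids 1..100 have type = 'vehicle'.
inner = [(tid, 'vehicle' if tid <= 100 else 'other') for tid in range(1, 10_001)]

def inner_rows_fetched(keep):
    """Count inner rows fetched before the join can stop.

    The filter `keep` is applied inside the inner scan, so the join only
    sees surviving rows; it can stop only once a *surviving* row's tid
    exceeds the largest outer id."""
    fetched = 0
    max_outer = outer_ids[-1]
    for tid, typ in inner:
        fetched += 1
        if not keep(tid, typ):
            continue  # filtered out: invisible to the join, can't end the scan
        if tid > max_outer:
            break  # first surviving row past the outer range: safe to stop
    return fetched

# Nonselective filter (like freq > -1): stops right after the outer range.
print(inner_rows_fetched(lambda tid, typ: True))              # 101
# Selective filter (like type = 'vehicle'): no surviving sentinel row exists,
# so the scan runs to the end of the inner input.
print(inner_rows_fetched(lambda tid, typ: typ == 'vehicle'))  # 10000
```

With the nonselective filter the scan stops right after the outer range; with the selective one it reads the whole inner input, which matches the ~8M buffer hits in the original plan.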
I wrote:
> [ assorted possible workarounds ]

Actually, an easy fix might be to create a 2-column index on usagestats(type, tid). I think the planner should be able to use that to produce sorted output for the mergejoin, and you'd get the best of both worlds, because the indexscan will stop immediately when it's exhausted the rows with type = 'vehicle'.

			regards, tom lane
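[Editor's note: a toy sketch of why the column order matters, with invented numbers and a sorted Python list standing in for a btree; not PostgreSQL internals. With the filter column first, the matching entries are contiguous and still sorted by tid, so the scan can end as soon as that range is exhausted.]

```python
import bisect

# Toy model: an index is a sorted list of keys; count how many entries a
# scan must read before the merge join can stop.
N = 10_000
MAX_OUTER = 100  # vehicle ids are 1..100; only those tids have type = 'vehicle'
rows = [(tid, 'vehicle' if tid <= MAX_OUTER else 'other') for tid in range(1, N + 1)]

# Index on (tid) with the type filter applied per row: no surviving row has
# tid > MAX_OUTER, so the scan reads every entry (the original slow plan).
entries_tid_only = 0
for tid, typ in sorted(rows):
    entries_tid_only += 1
    if typ == 'vehicle' and tid > MAX_OUTER:
        break

# Index on (type, tid): a range scan seeks directly to the type = 'vehicle'
# block; entries inside it are already sorted by tid, so they can feed the
# merge join, and the scan ends when the block does.
type_tid_index = sorted((typ, tid) for tid, typ in rows)
lo = bisect.bisect_left(type_tid_index, ('vehicle',))
hi = bisect.bisect_right(type_tid_index, ('vehicle', N + 1))
entries_type_tid = hi - lo

print(entries_tid_only, entries_type_tid)  # 10000 100
```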
Thanks Tom, that sounds like what is happening. Some additional comments/questions inline.

Tom Lane-2 wrote
> I think what must be happening is that the planner notes the maximum
> possible value of v.id and supposes that the mergejoin will stop far short
> of completion because v.id spans just a small part of the range of
> usagestats.tid. ... However, the executor cannot stop until it's
> fetched a usagestats row that has a tid value larger than the last v.id
> value; otherwise it can't be sure it's emitted all the required join rows.

Ok, that makes a lot of sense. It is scanning the tid index though, so once it gets past the last value in v.id, isn't it guaranteed that there can be no more required join rows? Even if it sees tid = 5000 and type = 'aircraft', it can know there are no more tids less than 5000. It must be that it waits to do this check until it gets a row that matches the filter; maybe that is an optimization in most cases? The cost of the check seems small enough compared to the cost of looking up the next row that it could be done every time.

Tom Lane-2 wrote
> I'm guessing that the "type = 'vehicle'" condition eliminates all such
> rows, or at least enough of them that a very large part of the usagestats
> table has to be scanned to find the first can't-possibly-match row.

Yes, you are exactly right.

Tom Lane-2 wrote
> I'm not sure there's anything much we can do to improve this situation
> in Postgres. ... You might consider altering the query, or inserting
> some kind of dummy sentinel row in the data, or changing the schema ...
> A brute-force fix would be "enable_mergejoin = off", but that would
> prevent selecting this plan type even when it actually is a significant win.

I agree it may make sense to change the schema, although there are some good reasons to have it this way (I obfuscated the table names). If we partitioned the table on "type", would the planner be able to stop after finishing the 'vehicle' type partition?

Tom Lane-2 wrote
> Actually, an easy fix might be to create a 2-column index on
> usagestats(type, tid). I think the planner should be able to
> use that to produce sorted output for the mergejoin ...

Actually, there is already an index on (type, tid) and it doesn't help. I just tried adding an index on (tid, type) and it partially fixed the issue: judging by the page hits, it still scans all the rows of the compound index, but it no longer needs to go to the actual table. This takes 600ms instead of the 11,500ms of the original query, but still much more than the 60ms when the type = 'vehicle' condition is changed to freq > -1, so it isn't a perfect solution. We could also switch to enum values for the type field, which might reduce the (tid, type) index size enough to make the performance adequate, but it would be best if we could just get it to quit the scan early, so that performance doesn't degrade if the table grows significantly.

Best, Jake
I think I understand now, after reading the notes on the merge join algorithm here:
https://github.com/postgres/postgres/blob/4ea51cdfe85ceef8afabceb03c446574daa0ac23/src/backend/executor/nodeMergejoin.c

The index scanning node doesn't know the max id of the vehicle table, so it can't know when to stop looking for the next tuple that matches the join filter. There doesn't seem to be any easy way to improve that case. I imagine it is a fairly common pattern, though: any time metadata about entities modeled in different tables is stored in one shared table, and the distribution of rows per entity type is skewed, this situation will arise.

Best, Jake
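[Editor's note: the inner-advance loop at the heart of the problem can be paraphrased in a heavily simplified Python sketch. This is loosely based on the high-level description in the nodeMergejoin.c comments, not the actual implementation; it shows the inner-join case with unique keys and ignores mark/restore and NULL handling.]

```python
def merge_join(outer_rows, inner_rows):
    """Heavily simplified merge join over two key-sorted inputs.

    The inner side is a black box: it only yields rows that already passed
    its filter, and this loop has no idea what the maximum outer key is,
    so it keeps pulling inner rows until one catches up with the outer key."""
    inner_iter = iter(inner_rows)
    inner = next(inner_iter, None)
    out = []
    for outer_key, outer_val in outer_rows:
        # Advance the inner side until its key catches up with the outer key.
        while inner is not None and inner[0] < outer_key:
            inner = next(inner_iter, None)
        if inner is not None and inner[0] == outer_key:
            out.append((outer_val, inner[1]))
    return out

pairs = merge_join([(1, 'a'), (3, 'b')], [(1, 'x'), (2, 'y'), (3, 'z')])
print(pairs)  # [('a', 'x'), ('b', 'z')]
```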