Thread: Optimizer wrongly picks Nested Loop Left Join

Optimizer wrongly picks Nested Loop Left Join

From
"Edin Kadribasic"
Date:
Hi,


I have a query that is giving the optimizer (and me) a great headache. When
it's in a good mood the optimizer chooses Hash Left Join and the query
executes in 13 ms or so, but sometimes (more and more often) it chooses
Nested Loop Left Join and the execution time goes up to 2-30 sec.

The query:
SELECT COUNT(DISTINCT a.tid)
FROM axp_temp_order_match a
LEFT OUTER JOIN (
    SELECT ol.tid, ds.orid
    FROM axp_dayschedule ds
    JOIN axp_order_line ol ON ol.olid = ds.olid
    JOIN axp_order o ON ds.orid = o.orid
    WHERE o.status >= 100
      AND ds.day BETWEEN '2005-05-12' AND '2005-05-12'
      AND ds.used = '1'
) b ON (a.tid = b.tid)
WHERE b.tid IS NULL AND a.sid = 16072;

Good plan:
=========
Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266
rows=1 loops=1)
   ->  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual
time=34.237..34.237 rows=0 loops=1)
         Hash Cond: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
time=0.277..0.512 rows=6 loops=1)
               Index Cond: (sid = 16072)
         ->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual
time=32.777..32.777 rows=0 loops=1)
               ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.208..31.563 rows=284 loops=1)
                     ->  Nested Loop  (cost=0.00..6.05 rows=1 width=4)
(actual time=0.178..20.684 rows=552 loops=1)
                           ->  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.036..3.973 rows=610 loops=1)
                                 Index Cond: (("day" >= '2005-05-12'::date)
AND ("day" <= '2005-05-12'::date))
                                 Filter: (used = B'1'::"bit")
                           ->  Index Scan using axp_order_orid_key on
axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013
rows=1 loops=610)
                                 Index Cond: ("outer".orid = o.orid)
                                 Filter: (status >= 100)
                     ->  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=552)
                           Index Cond: (ol.olid = "outer".olid)
 Total runtime: 34.581 ms

Bad plan (same query, different values):
=======================================
 Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual
time=11969.281..11969.285 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual
time=25.730..11967.180 rows=338 loops=1)
         Join Filter: ("outer".tid = "inner".tid)
         Filter: ("inner".tid IS NULL)
         ->  Index Scan using axp_temp_order_match_idx1 on
axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.027..2.980 rows=471 loops=1)
               Index Cond: (sid = 16092)
         ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
time=0.088..24.350 rows=285 loops=471)
               ->  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual
time=0.067..15.649 rows=317 loops=471)
                     ->  Index Scan using axp_dayschedule_day_idx on
axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
time=0.015..3.557 rows=606 loops=471)
                           Index Cond: (("day" >= '2005-05-13'::date) AND
("day" <= '2005-05-13'::date))
                           Filter: (used = B'1'::"bit")
                     ->  Index Scan using axp_order_line_pk on
axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
time=0.006..0.008 rows=1 loops=285426)
                           Index Cond: (ol.olid = "outer".olid)
               ->  Index Scan using axp_order_orid_key on axp_order o
(cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1
loops=149307)
                     Index Cond: ("outer".orid = o.orid)
                     Filter: (status >= 100)
 Total runtime: 11969.443 ms

Please note that sometimes when I get the "bad plan" in the logfile, I just
re-run the query and the optimizer chooses the more efficient one.
Sometimes it does not.

Any ideas?

postgresql-8.0.2 on 2x3.2 GHz Xeon with 2GB ram Linux 2.6
shared_buffers = 15000
work_mem = 128000
effective_cache_size = 200000
random_page_cost = (tried 1.0 - 4, seemingly without effect on this
particular issue).

Edin




Re: Optimizer wrongly picks Nested Loop Left Join

From
John A Meinel
Date:
Edin Kadribasic wrote:
> Hi,
>
>
> I have a query that is giving the optimizer (and me) a great headache. When
> it's in a good mood the optimizer chooses Hash Left Join and the query
> executes in 13 ms or so, but sometimes (more and more often) it chooses
> Nested Loop Left Join and the execution time goes up to 2-30 sec.
>
> The query:
> SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a LEFT OUTER JOIN (
> SELECT ol.tid, ds.orid FROM axp_dayschedule ds JOIN axp_order_line ol ON
> ol.olid = ds.olid JOIN axp_order o ON ds.orid = o.orid WHERE o.status >= 100
> AND ds.day between '2005-05-12' and '2005-05-12' AND ds.used = '1' ) b ON
> (a.tid = b.tid) WHERE b.tid IS NULL AND a.sid = 16072;
>

Unfortunately, because Hash Join doesn't report the number of rows
(rows=0 always), it's hard to tell how good the estimator is. But I
*can* say that the NestLoop estimation is way off.

>
> Good plan:
> =========
> Aggregate  (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266
> rows=1 loops=1)
>    ->  Hash Left Join  (cost=9.07..220.86 rows=426 width=4) (actual
> time=34.237..34.237 rows=0 loops=1)
>          Hash Cond: ("outer".tid = "inner".tid)
>          Filter: ("inner".tid IS NULL)
>          ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
> time=0.277..0.512 rows=6 loops=1)
>                Index Cond: (sid = 16072)
>          ->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual
> time=32.777..32.777 rows=0 loops=1)
>                ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
> time=0.208..31.563 rows=284 loops=1)
>                      ->  Nested Loop  (cost=0.00..6.05 rows=1 width=4)
> (actual time=0.178..20.684 rows=552 loops=1)
>                            ->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.036..3.973 rows=610 loops=1)
>                                  Index Cond: (("day" >= '2005-05-12'::date)
> AND ("day" <= '2005-05-12'::date))
>                                  Filter: (used = B'1'::"bit")
>                            ->  Index Scan using axp_order_orid_key on
> axp_order o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013
> rows=1 loops=610)
>                                  Index Cond: ("outer".orid = o.orid)
>                                  Filter: (status >= 100)
>                      ->  Index Scan using axp_order_line_pk on
> axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
> time=0.006..0.008 rows=1 loops=552)
>                            Index Cond: (ol.olid = "outer".olid)
>  Total runtime: 34.581 ms
>
> Bad plan (same query different values):
> =======================================
>  Aggregate  (cost=11.54..11.54 rows=1 width=4) (actual
> time=11969.281..11969.285 rows=1 loops=1)
>    ->  Nested Loop Left Join  (cost=0.00..11.53 rows=1 width=4) (actual
> time=25.730..11967.180 rows=338 loops=1)

See here: it estimates that this join will produce only 1 row, which would
be quite fast, but it actually has to produce 338.

>          Join Filter: ("outer".tid = "inner".tid)
>          Filter: ("inner".tid IS NULL)
>          ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
> time=0.027..2.980 rows=471 loops=1)
>                Index Cond: (sid = 16092)
>          ->  Nested Loop  (cost=0.00..9.07 rows=1 width=4) (actual
> time=0.088..24.350 rows=285 loops=471)

Same thing here.

>                ->  Nested Loop  (cost=0.00..6.04 rows=1 width=8) (actual
> time=0.067..15.649 rows=317 loops=471)

And here.

>                      ->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.015..3.557 rows=606 loops=471)

This estimate is way off too, but it is off in both plans.

>                            Index Cond: (("day" >= '2005-05-13'::date) AND
> ("day" <= '2005-05-13'::date))
>                            Filter: (used = B'1'::"bit")
>                      ->  Index Scan using axp_order_line_pk on
> axp_order_line ol  (cost=0.00..3.01 rows=1 width=8) (actual
> time=0.006..0.008 rows=1 loops=285426)

This is probably what is killing you. It is doing a single index lookup
285,426 times (exactly 471 outer rows x 606 dayschedule rows per pass).
The good plan only does it 552 times.

>                            Index Cond: (ol.olid = "outer".olid)
>                ->  Index Scan using axp_order_orid_key on axp_order o
> (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1
> loops=149307)
>                      Index Cond: ("outer".orid = o.orid)
>                      Filter: (status >= 100)
>  Total runtime: 11969.443 ms
>
> Please note that sometimes when I get "bad plan" in the logfile, I just
> re-run the query and the optimizer chooses the more efficient one. Sometime
> it does not.

Your work_mem is quite high relative to your total RAM; hopefully you
don't have many allowed concurrent connections. But that is a side point.

I assume the tables are freshly VACUUM ANALYZEd. Have you tried raising
the statistics targets for the relevant columns? One to look at is
axp_dayschedule(day); its estimate is consistently incorrect. Perhaps a
BETWEEN with the same date on both ends is estimated as overly restrictive
in Postgres? I don't really know.
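
If you want to see what the planner currently believes about that column,
something along these lines works (a sketch; pg_stats is the standard
statistics view, and the table/column names are taken from your plans):

```sql
-- Inspect the gathered statistics for axp_dayschedule.day
SELECT null_frac, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'axp_dayschedule'
  AND attname = 'day';
```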


Also, looking at the original query, maybe I am missing something, but
you are doing a COUNT(DISTINCT) with a LEFT OUTER JOIN.

If it is a LEFT JOIN, isn't that the same as

SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a
  WHERE a.sid = 16072;

I also have to wonder about:

SELECT ... a.tid FROM a LEFT JOIN (...) b ON (a.tid = b.tid)
  WHERE b.tid IS NULL

Isn't that equivalent to

SELECT COUNT(DISTINCT a.tid) FROM a WHERE a.tid IS NULL;

Which is also equivalent to
SELECT CASE WHEN EXISTS (SELECT a.tid FROM a WHERE a.tid IS NULL AND
a.sid = 16072) THEN 1 ELSE 0 END;

I could be wrong, but if a.tid IS NULL (because b.tid IS NULL, and a.tid
= b.tid), DISTINCT can only return 0 or 1 rows, which is the same as
using a CASE statement. You could also put a LIMIT in there, since you
know DISTINCT can only return 1 row:

SELECT COUNT(a.tid) FROM a WHERE a.tid IS NULL AND a.sid = ... LIMIT 1;

However, one final point, COUNT(column) where column is NULL doesn't
count anything. So really your query can be replaced with:

SELECT 0;

Now maybe this is a generated query, and under other circumstances it
gives a different query which actually is important.

The biggest point is that with a COUNT(DISTINCT) on a column from the
left side of a LEFT JOIN, it sounds like you can get rid of the entire
right side of that join.

John
=:->

>
> Any ideas?
>
> postgresql-8.0.2 on 2x3.2 GHz Xeon with 2GB ram Linux 2.6
> shared_buffers = 15000
> work_mem = 128000
> effective_cache_size = 200000
> random_page_cost = (tried 1.0 - 4, seemingly without effect on this
> particular issue).
>
> Edin
>
>
>
>



Re: Optimizer wrongly picks Nested Loop Left Join

From
Tom Lane
Date:
"Edin Kadribasic" <edink@emini.dk> writes:
> I have a query that is giving the optimizer (and me) great headache.

The main problem seems to be that the rowcount estimates for
axp_temp_order_match and axp_dayschedule are way off:

>          ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
> time=0.277..0.512 rows=6 loops=1)
>                Index Cond: (sid = 16072)

>                            ->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.036..3.973 rows=610 loops=1)
>                                  Index Cond: (("day" >= '2005-05-12'::date)
> AND ("day" <= '2005-05-12'::date))
>                                  Filter: (used = B'1'::"bit")

>          ->  Index Scan using axp_temp_order_match_idx1 on
> axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
> time=0.027..2.980 rows=471 loops=1)
>                Index Cond: (sid = 16092)

>                      ->  Index Scan using axp_dayschedule_day_idx on
> axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.015..3.557 rows=606 loops=471)
>                            Index Cond: (("day" >= '2005-05-13'::date) AND
> ("day" <= '2005-05-13'::date))
>                            Filter: (used = B'1'::"bit")

Do you ANALYZE these tables on a regular basis?  If so, it may be
necessary to increase the statistics target to the point where you
get better estimates.
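
For example (a sketch; the targets here are arbitrary starting points,
the 8.0 default being 10):

```sql
ALTER TABLE axp_temp_order_match ALTER COLUMN sid SET STATISTICS 100;
ALTER TABLE axp_dayschedule ALTER COLUMN day SET STATISTICS 100;
ANALYZE axp_temp_order_match;
ANALYZE axp_dayschedule;
```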

> Please note that sometimes when I get "bad plan" in the logfile, I just
> re-run the query and the optimizer chooses the more efficient one.

That's fairly hard to believe, unless you've got autovacuum running
in the background.

            regards, tom lane

Re: Optimizer wrongly picks Nested Loop Left Join

From
Tom Lane
Date:
John A Meinel <john@arbash-meinel.com> writes:
> Unfortunately, because Hash Join doesn't report the number of rows
> (rows=0 always), it's hard to tell how good the estimator is.

This is only a cosmetic problem because you can just look at the number
of rows actually emitted by the Hash node's child; that's always exactly
the number loaded into the hashtable.

(But having said that, it is fixed in CVS tip.)

You may be confused though --- the Hash node is not the Hash Join node.
A zero report from Hash Join does actually mean that it emitted zero
rows.
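
In the good plan above, for example, the line to read is the Hash node's
child:

```
->  Hash  (cost=9.07..9.07 rows=1 width=4) (actual ... rows=0 ...)
      ->  Nested Loop  (cost=0.00..9.07 rows=1 ...) (actual ... rows=284 ...)
```

i.e. 284 rows actually went into the hash table, even though the Hash
node itself reports rows=0.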

            regards, tom lane

Re: Optimizer wrongly picks Nested Loop Left Join

From
"Edin Kadribasic"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> "Edin Kadribasic" <edink@emini.dk> writes:
> > I have a query that is giving the optimizer (and me) great headache.
>
> The main problem seems to be that the rowcount estimates for
> axp_temp_order_match and axp_dayschedule are way off:
>
> >          ->  Index Scan using axp_temp_order_match_idx1 on
> > axp_temp_order_match a  (cost=0.00..209.65 rows=426 width=4) (actual
> > time=0.277..0.512 rows=6 loops=1)
> >                Index Cond: (sid = 16072)
>
> >                            ->  Index Scan using axp_dayschedule_day_idx
on
> > axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> > time=0.036..3.973 rows=610 loops=1)
> >                                  Index Cond: (("day" >=
'2005-05-12'::date)
> > AND ("day" <= '2005-05-12'::date))
> >                                  Filter: (used = B'1'::"bit")
>
> >          ->  Index Scan using axp_temp_order_match_idx1 on
> > axp_temp_order_match a  (cost=0.00..2.45 rows=1 width=4) (actual
> > time=0.027..2.980 rows=471 loops=1)
> >                Index Cond: (sid = 16092)
>
> >                      ->  Index Scan using axp_dayschedule_day_idx on
> > axp_dayschedule ds  (cost=0.00..3.02 rows=1 width=8) (actual
> > time=0.015..3.557 rows=606 loops=471)
> >                            Index Cond: (("day" >= '2005-05-13'::date)
AND
> > ("day" <= '2005-05-13'::date))
> >                            Filter: (used = B'1'::"bit")
>
> Do you ANALYZE these tables on a regular basis?  If so, it may be
> necessary to increase the statistics target to the point where you
> get better estimates.

Increasing the statistics target didn't seem to help, but both of you gave
me an idea of what might be wrong. axp_temp_order_match contains temporary
matches for a search. Just before execution of that query, the new matches
are inserted into the table under a new search id (sid column). Since the
last ANALYZE ran before that insert, the planner grossly underestimates the
number of matches for that sid. As this table is relatively small, running
ANALYZE axp_temp_order_match just before the query does not introduce a
great performance penalty (50 ms), and it reduces the query execution time
from up to 50 s down to ~20 ms.
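
So the fix is just to run the ANALYZE after the insert and before the
query, roughly:

```sql
-- after inserting the new matches under the new sid:
ANALYZE axp_temp_order_match;   -- ~50 ms on this small table
-- ...then run the COUNT(DISTINCT a.tid) query as before,
-- which now gets the fast Hash Left Join plan (~20 ms)
```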

> > Please note that sometimes when I get "bad plan" in the logfile, I just
> > re-run the query and the optimizer chooses the more efficient one.
>
> That's fairly hard to believe, unless you've got autovacuum running
> in the background.

The application had ANALYZE axp_temp_order_match placed in a "slightly"
wrong location: before the large insert was done (1000 rows with a new
sid). So when the app ran the next search, the previous search's rows had
been correctly analyzed, and the query execution time had dropped
dramatically by the time I tried to EXPLAIN ANALYZE the query recorded in
the log file.

Thanks for your help,

Edin