planner and worst case scenario - Mailing list pgsql-performance

From Joseph Shraibman
Subject planner and worst case scenario
Date
Msg-id cbvvju$16u2$1@news.hub.org
List pgsql-performance
Here is my query, that returns one row:
SELECT f1, f2, (SELECT dfield FROM d WHERE d.ukey = f1)
  FROM m
 WHERE status IN (2) AND jid IN (17674)
 ORDER BY pkey DESC LIMIT 25 OFFSET 0;
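
Side note: the plans below are plain EXPLAIN output.  When the query
does finish, EXPLAIN ANALYZE is handy for this sort of problem, since
it prints the actual row counts next to the planner's estimates.  A
sketch, just reusing the names above:

   EXPLAIN ANALYZE
   SELECT f1, f2, (SELECT dfield FROM d WHERE d.ukey = f1)
     FROM m
    WHERE status IN (2) AND jid IN (17674)
    ORDER BY pkey DESC LIMIT 25 OFFSET 0;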

Here was the really bad plan chosen.  This didn't come back for a long
while and had to be cancelled:

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..10493.05 rows=25 width=118)
    ->  Index Scan Backward using m_pkey on m  (cost=0.00..1883712.97 rows=4488 width=118)
          Filter: ((status = 2) AND (jid = 17674))
          SubPlan
            ->  Index Scan using d_pkey on d  (cost=0.00..3.83 rows=1 width=24)
                  Index Cond: (ukey = $0)
(6 rows)
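
Looking at the numbers, the bad choice seems to follow directly from
the rows=4488 estimate: the planner apparently assumed it could stop
the backward scan on m_pkey after roughly 25/4488 of it, so the Limit
cost works out to about 1883712.97 * (25 / 4488) = 10493, the figure
shown above.  With only one row really matching, the scan has to walk
nearly the whole index before the LIMIT is satisfied, which is the
worst case.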

After an ANALYZE the plan was much better:

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
  Limit  (cost=22060.13..22060.19 rows=25 width=119)
    ->  Sort  (cost=22060.13..22067.61 rows=2993 width=119)
          Sort Key: serial
          ->  Index Scan using m_jid_uid_key on m  (cost=0.00..21887.32 rows=2993 width=119)
                Index Cond: (jid = 17674)
                Filter: (status = 2)
                SubPlan
                  ->  Index Scan using d_pkey on d  (cost=0.00..3.83 rows=1 width=24)
                        Index Cond: (ukey = $0)
(9 rows)


The thing is, since there was only 1 row in the (very big) table with
that jid, the ANALYZE didn't pick that row up in its sample, so I'm
figuring a small random change in the statistics is what made it
choose the better plan.
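
One way to check what the sample actually captured (just a sketch
against the standard pg_stats view, using the table/column names from
the query above) is something like:

   SELECT n_distinct, most_common_vals, most_common_freqs
     FROM pg_stats
    WHERE tablename = 'm' AND attname = 'jid';

If 17674 doesn't show up in most_common_vals, the planner presumably
falls back on a generic estimate derived from n_distinct, which would
explain the inflated row guesses.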

Doing: ALTER TABLE m ALTER jid SET STATISTICS 1000;
produced a much more accurate row guess:

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
  Limit  (cost=2909.65..2909.71 rows=25 width=115)
    ->  Sort  (cost=2909.65..2910.64 rows=395 width=115)
          Sort Key: serial
          ->  Index Scan using m_jid_uid_key on m  (cost=0.00..2892.61 rows=395 width=115)
                Index Cond: (jid = 17674)
                Filter: (status = 2)
                SubPlan
                  ->  Index Scan using d_pkey on d  (cost=0.00..3.83 rows=1 width=24)
                        Index Cond: (ukey = $0)
(9 rows)
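
For completeness: SET STATISTICS only raises the sampling target for
the column; the larger sample isn't taken until the next ANALYZE, so
the sequence behind the plan above was presumably:

   ALTER TABLE m ALTER jid SET STATISTICS 1000;  -- raise the target
   ANALYZE m;          -- re-sample so the new target takes effect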


It seems the problem is that the pg planner goes for the plan with the
lowest projected cost, but ignores the worst-case scenario.

I think the odds of this problem happening again are lower since the SET
STATISTICS, but I don't know what triggered the really bad plan in the
first place.  Did pg think that because so many rows would match, the
limit would be filled up soon, so that a more accurate and lower
estimate would cause it to choose the better plan?
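
Until the statistics are fixed, one workaround that is sometimes
suggested for this kind of mis-planned backward index scan (only a
sketch of the idea, untested here, and it assumes pkey is a numeric
column) is to wrap the ORDER BY column in a trivial expression, so the
planner can't use a backward scan of m_pkey to satisfy the sort and is
pushed toward the jid index plus an explicit sort:

   SELECT f1, f2, (SELECT dfield FROM d WHERE d.ukey = f1)
     FROM m
    WHERE status IN (2) AND jid IN (17674)
    ORDER BY pkey + 0 DESC   -- "+ 0" keeps the pkey index out of the sort
    LIMIT 25 OFFSET 0;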
