Thread: PostgreSQL 9.0.7 weird planner decision (row estimates close to reality, but plan suboptimal)

Hi,

Today, on one of the databases under my management, I found a very strange plan for a simple query.
PostgreSQL 9.0.7 on Linux, with:
random_page_cost = 4
seq_page_cost = 1

The query and plan:

db=# EXPLAIN (ANALYZE, COSTS, BUFFERS)  select obj_id, obj_commented,p2o_id FROM blog_post as obj
JOIN person2obj ON p2o_obj_obj_id = obj_id
JOIN person2obj_counters ON p2oc_id = p2o_id
WHERE obj_status_did = 1
AND obj_commented IS NOT NULL
AND obj_commented > now() - '7days'::interval
AND obj_commented > p2o_notified
AND p2o_notify = 't';
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=6546.54..6584.69 rows=3 width=24) (actual time=86.262..1349.266 rows=1770 loops=1)
   Merge Cond: (person2obj_counters.p2oc_id = person2obj.p2o_id)
   Buffers: shared hit=1140491
   ->  Index Scan using pk_person2obj_counters on person2obj_counters  (cost=0.00..47110.95 rows=1212591 width=8) (actual time=0.008..997.948 rows=1212765 loops=1)
         Buffers: shared hit=1108452
   ->  Sort  (cost=6546.42..6546.98 rows=221 width=24) (actual time=85.877..88.373 rows=7870 loops=1)
         Sort Key: person2obj.p2o_id
         Sort Method:  quicksort  Memory: 807kB
         Buffers: shared hit=32039
         ->  Nested Loop  (cost=0.00..6537.82 rows=221 width=24) (actual time=0.097..80.129 rows=7870 loops=1)
               Buffers: shared hit=32039
               ->  Index Scan using i_blog_post_commented_active on blog_post obj  (cost=0.00..225.73 rows=1726 width=16) (actual time=0.028..17.957 rows=6010 loops=1)
                     Index Cond: ((obj_commented IS NOT NULL) AND (obj_commented > (now() - '7 days'::interval)))
                     Buffers: shared hit=6207
               ->  Index Scan using i_person2obj_obj_notified_subscribed on person2obj  (cost=0.00..3.64 rows=1 width=24) (actual time=0.006..0.009 rows=1 loops=6010)
                     Index Cond: ((person2obj.p2o_obj_obj_id = obj.obj_id) AND (obj.obj_commented > person2obj.p2o_notified))
                     Buffers: shared hit=25832
 Total runtime: 1349.767 ms

I don't understand why the database chooses a merge join against a 1.2M-row table.
person2obj_counters has an index on p2oc_id (it's the primary key).

Switching to the fast nested loop plan can be forced with SET random_page_cost = 10:

db=# set random_page_cost to 10;
SET
db=# EXPLAIN (ANALYZE, COSTS, BUFFERS)  select obj_id, obj_commented,p2o_id FROM blog_post as obj
JOIN person2obj ON p2o_obj_obj_id = obj_id
JOIN person2obj_counters ON p2oc_id = p2o_id
WHERE obj_status_did = 1
AND obj_commented IS NOT NULL
AND obj_commented > now() - '7days'::interval
AND obj_commented > p2o_notified
AND p2o_notify = 't';
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..14810.38 rows=3 width=24) (actual time=16.910..115.110 rows=1758 loops=1)
   Buffers: shared hit=57403
   ->  Nested Loop  (cost=0.00..14616.37 rows=221 width=24) (actual time=0.088..82.342 rows=7858 loops=1)
         Buffers: shared hit=32046
         ->  Index Scan using i_blog_post_commented_active on blog_post obj  (cost=0.00..273.70 rows=1725 width=16) (actual time=0.029..16.260 rows=6009 loops=1)
               Index Cond: ((obj_commented IS NOT NULL) AND (obj_commented > (now() - '7 days'::interval)))
               Buffers: shared hit=6222
         ->  Index Scan using i_person2obj_obj_notified_subscribed on person2obj  (cost=0.00..8.30 rows=1 width=24) (actual time=0.007..0.010 rows=1 loops=6009)
               Index Cond: ((person2obj.p2o_obj_obj_id = obj.obj_id) AND (obj.obj_commented > person2obj.p2o_notified))
               Buffers: shared hit=25824
   ->  Index Scan using pk_person2obj_counters on person2obj_counters  (cost=0.00..0.87 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=7858)
         Index Cond: (person2obj_counters.p2oc_id = person2obj.p2o_id)
         Buffers: shared hit=25357
 Total runtime: 115.465 ms
(14 rows)

OK, so here is what I see: the database thinks it will need to join 221 rows from the previous level against the person2obj_counters table.
And somehow Pg manages to calculate that a full index scan over the 1.2M entries of person2obj_counters is faster than 221 nested loop probes.

What looks very suspicious is that the full index scan + merge join part
adds only about 40 points to the total cost (
Merge Join  (cost=6546.54..6584.69 rows=3 width=24) (actual time=86.262..1349.266 rows=1770 loops=1)
...
->  Index Scan using pk_person2obj_counters on person2obj_counters  (cost=0.00..47110.95 rows=1212591 width=8) (actual time=0.008..997.948 rows=1212765 loops=1)
...
   ->  Sort  (cost=6546.42..6546.98 rows=221 width=24) (actual time=85.877..88.373 rows=7870 loops=1)
)... how can that be?
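To make the oddity concrete, here is a back-of-the-envelope check using the numbers from the EXPLAIN output above (my own arithmetic as a sketch, not actual planner internals): the planner seems to charge the merge join for only a tiny fraction of the inner index scan's cost, as if it expected to stop reading the inner side very early, while at runtime it in fact walked all 1.2M index entries.

```python
# Cost figures copied from the first EXPLAIN plan above.
inner_scan_total = 47110.95   # full index scan over pk_person2obj_counters
sort_total       = 6546.98    # sorted outer side (221 estimated rows)
merge_total      = 6584.69    # merge join total cost

# Cost the planner actually attributed to reading the inner index
# (everything in the merge join beyond the sorted outer side):
inner_charged = merge_total - sort_total
print(f"inner cost charged: {inner_charged:.2f}")       # ~37.71 of 47110.95

# As a fraction of the full 1.2M-entry scan:
fraction = inner_charged / inner_scan_total
print(f"fraction of inner index: {fraction:.4%}")       # well under 1%

# Implied number of inner rows the planner expected to touch:
print(f"implied rows read: {fraction * 1_212_591:.0f}")
```

If this reading is right, the planner's cost model assumed the merge would consume only a sliver of pk_person2obj_counters before the outer side was exhausted, which doesn't match the actual rows=1212765 reported for that index scan.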

--
Maxim Boguk
Senior PostgreSQL DBA.

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

LinkedIn profile: http://nz.linkedin.com/in/maximboguk
"If they can send one man to the moon... why can't they send them all?"

МойКруг: http://mboguk.moikrug.ru/
"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."