Thread: Forcing postgresql to use an index

Forcing postgresql to use an index

From: Eugene Morozov
Hello,

I have the following query (autogenerated by Django):

SELECT activity_activityevent.id, activity_activityevent.user_id, activity_activityevent.added_on
FROM activity_activityevent
WHERE activity_activityevent.user_id IN (
   SELECT U0.user_id
   FROM profile U0
   INNER JOIN profile_friends U1
   ON U0.user_id = U1.to_profile_id
   WHERE U1.from_profile_id = 5
)
ORDER BY activity_activityevent.added_on DESC LIMIT 10


When I run EXPLAIN ANALYZE with my default settings (seq scan is on,
random_page_cost = 4) I get the following result:

Limit  (cost=4815.62..4815.65 rows=10 width=202) (actual time=332.938..332.977 rows=10 loops=1)
  ->  Sort  (cost=4815.62..4816.35 rows=292 width=202) (actual time=332.931..332.945 rows=10 loops=1)
        Sort Key: activity_activityevent.added_on
        Sort Method:  top-N heapsort  Memory: 19kB
        ->  Hash IN Join  (cost=2204.80..4809.31 rows=292 width=202) (actual time=12.856..283.916 rows=15702 loops=1)
              Hash Cond: (activity_activityevent.user_id = u0.user_id)
              ->  Seq Scan on activity_activityevent  (cost=0.00..2370.43 rows=61643 width=202) (actual time=0.020..126.129 rows=61643 loops=1)
              ->  Hash  (cost=2200.05..2200.05 rows=380 width=8) (actual time=12.777..12.777 rows=424 loops=1)
                    ->  Nested Loop  (cost=11.20..2200.05 rows=380 width=8) (actual time=0.260..11.594 rows=424 loops=1)
                          ->  Bitmap Heap Scan on profile_friends u1  (cost=11.20..62.95 rows=380 width=4) (actual time=0.228..1.202 rows=424 loops=1)
                                Recheck Cond: (from_profile_id = 5)
                                ->  Bitmap Index Scan on profile_friends_from_profile_id_key  (cost=0.00..11.10 rows=380 width=0) (actual time=0.208..0.208 rows=424 loops=1)
                                      Index Cond: (from_profile_id = 5)
                          ->  Index Scan using profile_pkey on profile u0  (cost=0.00..5.61 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=424)
                                Index Cond: (u0.user_id = u1.to_profile_id)
Total runtime: 333.190 ms

But when I disable seq scan or set random_page_cost to 1.2 (higher
values don't change the plan), Postgres starts using the index and the
query runs twice as fast:

Limit  (cost=9528.36..9528.38 rows=10 width=202) (actual time=165.047..165.090 rows=10 loops=1)
  ->  Sort  (cost=9528.36..9529.09 rows=292 width=202) (actual time=165.042..165.058 rows=10 loops=1)
        Sort Key: activity_activityevent.added_on
        Sort Method:  top-N heapsort  Memory: 19kB
        ->  Nested Loop  (cost=2201.00..9522.05 rows=292 width=202) (actual time=13.074..126.209 rows=15702 loops=1)
              ->  HashAggregate  (cost=2201.00..2204.80 rows=380 width=8) (actual time=12.996..14.131 rows=424 loops=1)
                    ->  Nested Loop  (cost=11.20..2200.05 rows=380 width=8) (actual time=0.263..11.665 rows=424 loops=1)
                          ->  Bitmap Heap Scan on profile_friends u1  (cost=11.20..62.95 rows=380 width=4) (actual time=0.232..1.181 rows=424 loops=1)
                                Recheck Cond: (from_profile_id = 5)
                                ->  Bitmap Index Scan on profile_friends_from_profile_id_key  (cost=0.00..11.10 rows=380 width=0) (actual time=0.210..0.210 rows=424 loops=1)
                                      Index Cond: (from_profile_id = 5)
                          ->  Index Scan using profile_pkey on profile u0  (cost=0.00..5.61 rows=1 width=4) (actual time=0.013..0.016 rows=1 loops=424)
                                Index Cond: (u0.user_id = u1.to_profile_id)
              ->  Index Scan using activity_activityevent_user_id on activity_activityevent  (cost=0.00..18.82 rows=35 width=202) (actual time=0.014..0.130 rows=37 loops=424)
                    Index Cond: (activity_activityevent.user_id = u0.user_id)
Total runtime: 165.323 ms


Can anyone enlighten me? Should I set random_page_cost to 1.2
permanently (I feel this is not a really good idea in my case)?
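
For reference, this is roughly how I'm toggling the settings per
session while testing (nothing permanent yet):

SET enable_seqscan = off;          -- or: SET random_page_cost = 1.2;
EXPLAIN ANALYZE <the query above>; -- compare the resulting plans
RESET enable_seqscan;              -- back to the default afterwards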

Eugene

Re: Forcing postgresql to use an index

From: Grzegorz Jaśkiewicz
Teach it not to generate "IN (subquery)" conditions, as these can be
quite slow on PostgreSQL. Use joins instead.
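
Something along these lines, an untested sketch using the names from
the query above (it assumes each (from_profile_id, to_profile_id) pair
is unique, so the join can't duplicate events):

SELECT ae.id, ae.user_id, ae.added_on
FROM activity_activityevent ae
INNER JOIN profile p ON p.user_id = ae.user_id
INNER JOIN profile_friends f ON f.to_profile_id = p.user_id
WHERE f.from_profile_id = 5        -- only events by friends of profile 5
ORDER BY ae.added_on DESC
LIMIT 10;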

It looks like the planner was wrong about the row count in one place:
Hash IN Join  (cost=2204.80..4809.31 rows=292 width=202) (actual
time=12.856..283.916 rows=15702 loops=1)

I have no idea why; probably more knowledgeable folks will know. But
overall, all the other stats seem to be okay. What is
default_statistics_target set to in your postgresql.conf?

One more thing: what version and platform are you running?

Re: Forcing postgresql to use an index

From: "Kevin Grittner"
Eugene Morozov <eugene@cactus-mouse.com> wrote:

> Can anyone enlighten me? Should I set random_page_cost to 1.2
> permanently (I feel this is not a really good idea in my case)?

For it to pass as many rows as it did in the time that it did, most or
all of the "reads" were cached.  If this is typically the case, at
least for the queries for which performance is most critical, your
change makes sense as a permanent setting.  In fact, you might want to
go even further -- there have been many reports of people getting good
performance on fully-cached systems by dropping both random_page_cost
and seq_page_cost to 0.1, so that the optimizer better estimates the
relative cost of "disk access" versus CPU-based operations.
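
A sketch of that experiment, per session first (only put the values
into postgresql.conf once you're sure the workload really is fully
cached):

SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
EXPLAIN ANALYZE <the query>;   -- do the plan and timing improve?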

-Kevin

Re: Forcing postgresql to use an index

From: Scott Marlowe
On Tue, Sep 8, 2009 at 8:12 AM, Eugene Morozov <eugene@cactus-mouse.com> wrote:
> Hello,
>
> I have the following query (autogenerated by Django):
> [query and both EXPLAIN ANALYZE plans trimmed]
>
> Can anyone enlighten me? Should I set random_page_cost to 1.2
> permanently (I feel this is not a really good idea in my case)?

OK, you need to look a little deeper at what's happening here.  The
pgsql query planner looks at a lot of things to decide whether to use
a seq scan or an index.  If you look at your row estimates versus
actual rows returned, you'll see they're off, sometimes by quite a
bit, particularly the ones near the top of your query plans.  There
are a few things you can do to help out here: increase the default
statistics target and re-analyze, increase effective_cache_size to
reflect the actual size of the data being cached by your OS /
filesystem / pgsql, and then lower random_page_cost.
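
For example (an untested sketch; the numbers are illustrative and need
sizing to your machine):

-- raise the statistics target (the 8.3 default is 10) and re-analyze,
-- either globally in postgresql.conf or per column:
ALTER TABLE activity_activityevent ALTER COLUMN user_id SET STATISTICS 100;
ANALYZE activity_activityevent;

-- in postgresql.conf, reflect what the OS is actually caching:
--   effective_cache_size = 1GB    # ~ shared_buffers + OS page cache
--   random_page_cost = 2          # lower from the default of 4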

Re: Forcing postgresql to use an index

From: Eugene Morozov
Scott Marlowe <scott.marlowe@gmail.com> writes:

> On Tue, Sep 8, 2009 at 8:12 AM, Eugene Morozov <eugene@cactus-mouse.com> wrote:
> OK, you need to look a little deeper at what's happening here.  The
> pgsql query planner looks at a lot of things to decide whether to use
> a seq scan or an index.  If you look at your row estimates versus
> actual rows returned, you'll see they're off, sometimes by quite a
> bit, particularly the ones near the top of your query plans.  There
> are a few things you can do to help out here: increase the default
> statistics target and re-analyze, increase effective_cache_size to
> reflect the actual size of the data being cached by your OS /
> filesystem / pgsql, and then lower random_page_cost.

Thanks to all who answered. Your answers were really helpful. I've
split the query in two (couldn't make Django use a JOIN here) and was
able to speed it up by a factor of 10!
Eugene

Re: Forcing postgresql to use an index

From: Eugene Morozov
Grzegorz Jaśkiewicz <gryzman@gmail.com> writes:

> Teach it not to generate "IN (subquery)" conditions, as these can be
> quite slow on PostgreSQL. Use joins instead.

OK, I've split the query in two (can't make Django generate a JOIN in
this case) and it always uses the index now. This immediately opened
the door to other optimizations. Thanks!
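
Roughly like this in SQL terms (Django runs the first query, collects
the ids, and passes them as literals to the second; the ids shown are
made up):

-- query 1: fetch the friends' ids
SELECT U0.user_id
FROM profile U0
INNER JOIN profile_friends U1 ON U0.user_id = U1.to_profile_id
WHERE U1.from_profile_id = 5;

-- query 2: plug the ids in directly
SELECT id, user_id, added_on
FROM activity_activityevent
WHERE user_id IN (7, 23, 42)
ORDER BY added_on DESC
LIMIT 10;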

>
> It looks like the planner was wrong about the row count in one place:
> Hash IN Join  (cost=2204.80..4809.31 rows=292 width=202) (actual
> time=12.856..283.916 rows=15702 loops=1)
>
> I have no idea why; probably more knowledgeable folks will know. But
> overall, all the other stats seem to be okay. What is
> default_statistics_target set to in your postgresql.conf?
>
> One more thing: what version and platform are you running?

PostgreSQL 8.3.7, Ubuntu 8.10