Index Selection: ORDER BY vs. PRIMARY KEY - Mailing list pgsql-performance

From Thomas F. O'Connell
Subject Index Selection: ORDER BY vs. PRIMARY KEY
Date
Msg-id 5DDEAB6E-36F9-422C-A44F-56E190B9FF55@sitening.com
Whole thread Raw
Responses Re: Index Selection: ORDER BY vs. PRIMARY KEY
List pgsql-performance
I have a query that looks roughly like this (I've removed irrelevant
SELECT clause material and obfuscated names, trying to keep them
consistent where altered in EXPLAIN output):

SELECT u.emma_member_id, h.action_ts
FROM user as u, history as h
WHERE u.user_id = h.user_id
AND h.action_id = '$constant_data'
ORDER BY h.action_ts DESC LIMIT 100 OFFSET 0

The user table has ~25,000 rows. The history table has ~750,000 rows.
Currently, there is an index on history.action_ts and a separate one
on history.action_id. There's also a PRIMARY KEY on user.user_id. If
I run the query as such, I get a plan like this:

                QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------------------
Limit  (cost=0.00..2196.30 rows=100 width=925) (actual
time=947.208..3178.775 rows=3 loops=1)
    ->  Nested Loop  (cost=0.00..83898.65 rows=3820 width=925)
(actual time=947.201..3178.759 rows=3 loops=1)
          ->  Index Scan Backward using h_action_ts_idx on history h
(cost=0.00..60823.53 rows=3820 width=480) (actual
time=946.730..3177.953 rows=3 loops=1)
                Filter: (action_id = $constant_data::bigint)
          ->  Index Scan using user_pkey on user u  (cost=0.00..6.01
rows=1 width=445) (actual time=0.156..0.161 rows=1 loops=3)
                Index Cond: (u.user_id = "outer".user_id)
Total runtime: 3179.143 ms
(7 rows)

If I drop the index on the timestamp field, I get a plan like this:

            QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------------------------------------
Limit  (cost=17041.41..17041.66 rows=100 width=925) (actual
time=201.725..201.735 rows=3 loops=1)
    ->  Sort  (cost=17041.41..17050.96 rows=3820 width=925) (actual
time=201.719..201.722 rows=3 loops=1)
          Sort Key: h.action_ts
          ->  Merge Join  (cost=13488.15..16814.13 rows=3820
width=925) (actual time=7.306..201.666 rows=3 loops=1)
                Merge Cond: ("outer".user_id = "inner".user_id)
                ->  Index Scan using user_pkey on user u
(cost=0.00..3134.82 rows=26802 width=445) (actual time=0.204..151.351
rows=24220 loops=1)
                ->  Sort  (cost=13488.15..13497.70 rows=3820
width=480) (actual time=0.226..0.234 rows=3 loops=1)
                      Sort Key: h.user_id
                      ->  Index Scan using h_action_id_idx on history
h  (cost=0.00..13260.87 rows=3820 width=480) (actual
time=0.184..0.195 rows=3 loops=1)
                            Index Cond: (action_id =
$constant_data::bigint)
Total runtime: 202.089 ms
(11 rows)

Clearly, if the index on the timestamp field is there, postgres wants
to use it for the ORDER BY, even though the performance is worse. How
is this preference made internally? If both indexes exist, will
postgres always prefer the index on an ordered column? If I need the
index on the timestamp field for other queries, is my best bet just
to increase sort_mem for this query?

Here's my version string:
PostgreSQL 8.0.3 on i686-pc-linux-gnu, compiled by GCC 2.95.4

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source: Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)


pgsql-performance by date:

Previous
From: Martin Nickel
Date:
Subject: How can this be?
Next
From: Steve Atkins
Date:
Subject: Re: How can this be?