Re: SELECT ignoring index even though ORDER BY and LIMIT present - Mailing list pgsql-performance

From Jori Jovanovich
Subject Re: SELECT ignoring index even though ORDER BY and LIMIT present
Msg-id AANLkTimEP72yYXjbH6fhm0-P8r0BbEfMZkmy7Ywjyeqy@mail.gmail.com
In response to Re: SELECT ignoring index even though ORDER BY and LIMIT present  (Bob Lunney <bob_lunney@yahoo.com>)
List pgsql-performance
hi,

I'm sorry for not posting this first.

The server is the following and is being used exclusively for this PostgreSQL instance:

PostgreSQL 8.4.2 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit
Amazon EC2 Large Instance, 7.5GB memory, 64-bit

This is what is set in my postgresql.conf file:

max_connections = 100
ssl = true
shared_buffers = 24MB

ANALYZE VERBOSE EVENTS;
INFO:  analyzing "public.events"
INFO:  "events": scanned 30000 of 211312 pages, containing 1725088 live rows and 0 dead rows; 30000 rows in sample, 12151060 estimated total rows

Updating statistics did not affect the results; it's still doing full table scans. (I had already run ANALYZE before posting here, so this was expected.)
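As a hedged aside (not something from the thread itself): if the planner's row estimate for the regex filter is badly off, one low-risk experiment is to raise the statistics target on the filtered column and re-analyze. The target value here is illustrative, not a recommendation.

```sql
-- Sketch only: collect a larger sample for client_uuid, then refresh stats.
-- 100 is the usual default_statistics_target; 1000 is an illustrative bump.
ALTER TABLE events ALTER COLUMN client_uuid SET STATISTICS 1000;
ANALYZE VERBOSE events;
```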

thank you

On Wed, Jun 2, 2010 at 8:49 PM, Bob Lunney <bob_lunney@yahoo.com> wrote:
Jori,

What is the PostgreSQL version/shared_buffers/work_mem/effective_cache_size/default_statistics_target?  Are the statistics for the table up to date?  (Run analyze verbose <tablename> to update them.)  Table and index structure would be nice to know, too.

If all else fails you can set enable_seqscan = off for the session, but that is a Big Hammer for what is probably a smaller problem.
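A sketch of that Big Hammer, scoped so it cannot leak into other queries: SET LOCAL confines the setting to the enclosing transaction, so the planner default is restored automatically at COMMIT. The column list is abbreviated from the queries earlier in the thread.

```sql
BEGIN;
-- Discourage sequential scans for this transaction only.
SET LOCAL enable_seqscan = off;

SELECT id, server_timestamp, client_uuid
  FROM events
 WHERE client_uuid ~* E'^foo bar so what'
 ORDER BY server_timestamp DESC
 LIMIT 20;

COMMIT;  -- enable_seqscan reverts to its previous value here
```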

Bob Lunney

--- On Wed, 6/2/10, Jori Jovanovich <jori@dimensiology.com> wrote:

From: Jori Jovanovich <jori@dimensiology.com>
Subject: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT present
To: pgsql-performance@postgresql.org
Date: Wednesday, June 2, 2010, 4:28 PM


hi,

I have a problem space where the main goal is to search backward in time for events. Time can go back very far into the past, and so the table can get quite large. However, the vast majority of queries are satisfied by relatively recent data. I have an index on the row creation date and I would like almost all of my queries to have a query plan looking something like:

 Limit ...
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=8695 width=177)
         ...

However, PostgreSQL frequently tries to do a full table scan. Whether a scan is chosen often depends on the size of the LIMIT and how selective the WHERE clause is. In practice, the scan is always the wrong answer for my use cases (where "always" is defined to be >99.9%).

Some examples:

(1) A sample query that devolves to a full table scan

  EXPLAIN
   SELECT events.id, events.client_duration, events.message, events.created_by, events.source, events.type, events.event, events.environment,
          events.server_timestamp, events.session_id, events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo bar so what'
 ORDER BY server_timestamp DESC
    LIMIT 20;
                                QUERY PLAN (BAD!)
--------------------------------------------------------------------------
 Limit  (cost=363278.56..363278.61 rows=20 width=177)
   ->  Sort  (cost=363278.56..363278.62 rows=24 width=177)
         Sort Key: server_timestamp
         ->  Seq Scan on events  (cost=0.00..363278.01 rows=24 width=177)
               Filter: (client_uuid ~* '^foo bar so what'::text)


(2) Making the query faster by making the string match LESS specific (odd, seems like it should be MORE)

  EXPLAIN
   SELECT events.id, events.client_duration, events.message, events.created_by, events.source, events.type, events.event, events.environment,
          events.server_timestamp, events.session_id, events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo'
 ORDER BY server_timestamp DESC
    LIMIT 20;
                                QUERY PLAN (GOOD!)                                       
------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1433.14 rows=20 width=177)
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=8695 width=177)
         Filter: (client_uuid ~* '^foo'::text)


(3) Alternatively making the query faster by using a smaller limit

  EXPLAIN
   SELECT events.id, events.client_duration, events.message, events.created_by, events.source, events.type, events.event, events.environment,
          events.server_timestamp, events.session_id, events.reference, events.client_uuid
     FROM events
    WHERE client_uuid ~* E'^foo bar so what'
 ORDER BY server_timestamp DESC
    LIMIT 10;
                                QUERY PLAN (GOOD!)                                       
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..259606.63 rows=10 width=177)
   ->  Index Scan Backward using server_timestamp_idx on events  (cost=0.00..623055.91 rows=24 width=177)
         Filter: (client_uuid ~* '^foo bar so what'::text)


I find myself wishing I could just put a SQL HINT on the query to force the index to be used, but I understand that hints are considered harmful and are therefore not provided in PostgreSQL. What is the recommended way to solve this?
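One hint-free direction, sketched under an assumption the thread does not confirm: if the `~*` predicate is really an anchored, case-insensitive prefix match (as the `^foo...` patterns suggest), an expression index can answer the WHERE clause directly, which is the "even better" case Matthew describes below. The index name is invented, and the query must be rephrased as a LIKE prefix for the planner to use it.

```sql
-- Hedged sketch: btree expression index usable for anchored prefix matches
-- regardless of locale, via the text_pattern_ops operator class.
CREATE INDEX events_client_uuid_lower_idx
    ON events (lower(client_uuid) text_pattern_ops);

-- The predicate must be phrased so it matches the indexed expression:
SELECT id, server_timestamp, client_uuid
  FROM events
 WHERE lower(client_uuid) LIKE lower('foo bar so what') || '%'
 ORDER BY server_timestamp DESC
 LIMIT 20;
```

This trades generality (full regex support) for an index that answers the filter itself, rather than forcing the planner to choose between the timestamp index and a sequential scan.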

thank you very much

On Thu, Jun 3, 2010 at 5:15 AM, Matthew Wakeling <matthew@flymine.org> wrote:
On Wed, 2 Jun 2010, Jori Jovanovich wrote:
(2) Making the query faster by making the string match LESS specific (odd,
seems like it should be MORE)

No, that's the way round it should be. The LIMIT changes it all. Consider if you have a huge table, and half of the entries match your WHERE clause. To fetch the ORDER BY ... LIMIT 20 using an index scan would involve accessing on average only 40 entries from the table referenced by the index. Therefore, the index is quick. However, consider a huge table that only has twenty matching entries. The index scan would need to touch every single row in the table to return the matching rows, so a sequential scan, filter, and sort would be much faster. Of course, if you had an index capable of answering the WHERE clause, that would be even better for that case.
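The argument above can be sketched as a back-of-the-envelope model (my own illustration, not from the thread): a backward index scan keeps reading rows in timestamp order until it has found LIMIT matches, so it touches roughly LIMIT divided by the fraction of rows that match, capped at the table size.

```python
def expected_rows_touched(limit: int, match_fraction: float, table_rows: int) -> int:
    """Rough model: rows an ordered index scan reads before LIMIT is satisfied."""
    if match_fraction <= 0:
        return table_rows  # no matches: the scan reads everything
    return min(table_rows, round(limit / match_fraction))

# Half the table matches (Matthew's first case): the scan stops after ~40 rows.
print(expected_rows_touched(20, 0.5, 12_000_000))

# Only 20 matching rows in 12M (his second case): nearly every row is visited,
# which is when a sequential scan plus sort wins.
print(expected_rows_touched(20, 20 / 12_000_000, 12_000_000))
```

The crossover between the two plans is exactly what the optimizer's cost estimates are trying to locate; when its guess at `match_fraction` is wrong, it picks the wrong side.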

Okay, this makes sense, thank you -- I was thinking about it backwards.

