Re: have: seq scan - want: index scan - Mailing list pgsql-performance

From Samuel Gendler
Subject Re: have: seq scan - want: index scan
Date
Msg-id CAEV0TzDtiE4yre0fYK5r1feWPCPigsX03N_fZbfpfsMiaK6Daw@mail.gmail.com
Whole thread Raw
In response to have: seq scan - want: index scan  (Chris Ruprecht <chris@cdrbill.com>)
List pgsql-performance


On Tue, Oct 16, 2012 at 4:45 PM, Chris Ruprecht <chris@cdrbill.com> wrote:
Hi guys,

PG = 9.1.5
OS = winDOS 2008R8

I have a table that currently has 207 million rows.
there is a timestamp field that contains data.
more data gets copied from another database into this database.
How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.
I tested different indexes - no same behavior.


The query plan says:

"              ->  Seq Scan on audittrailclinical  (cost=0.00..8637598.76 rows=203856829 width=62)"
"                    Filter: (("timestamp" >= '2008-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2012-10-05 00:00:00'::timestamp without time zone))"

which takes forever.

How do I make this do an index scan instead?
I did an "analyze audittrailclinical" to no avail.

analyze says 203 million out of 207 million rows are matched by your timestamp filter, so it is definitely going to favour a sequential scan, since an index scan that matches that many rows will inevitably be slower than simply scanning the table, since it will have to both do the lookups and load the actual records from the table (all of them, basically) in order to determine their visibility to you, so your index scan will just turn sequential access of the table pages into random access and require index lookups as well.   You can possibly verify this by setting enable_seqscan to false and running your analyze again and see how the plan changes, though I don't believe that will necessarily remove all sequential scans, it just reduces their likelihood, so you may see that nothing changes. If the estimate for the number of matching rows is incorrect, you'll want to increase the statistics gathering for that table or just that column.

ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>

where number is between 10 and 1000 and I think the default is 100.  Then re-analyze the table and see if the query plan shows better estimates.  I think 9.2 also supports "index only scans" which eliminate the need to load the matched records in certain circumstances. However, all of the columns used by the query would need to be in the index, and you are using an awful lot of columns between the select clause and the table joins.

Are you lacking indexes on the columns used for joins that would allow more selective index scans on those columns which could then just filter by timestamp?  I'm not much of an expert on the query planner, so I'm not sure what exactly will cause that behaviour, but I'd think that good statistics and useful indexes should allow the rest of the where clause to be more selective of the rows from audittrailclinical unless patientaccount.defaultaccount = 'Y' and patient.dnsortpersonnumber = '347450'  are similarly non-selective, though patient.dnsortpersonnumber would seem like it is probably the strong filter, so make sure you've got indexes and accurate stats on all of the foreign keys that connect patient table and audittrailclinical table.  It'd be useful to see the rest of the explain analyze output so we could see how it is handling the joins and why.  Note that because you have multiple composite foreign keys joining tables in your query, you almost certainly won't those composite keys in a single index.  If you have indexes on those columns but they are single-column indexes, that may be what is causing the planner to try to filter the atc table on the timestamp rather than via the joins.  I'm sure someone more knowledgable than I will be along eventually to correct any misinformation I may have passed along.  Without knowing anything about your schema or the rest of the explain analyze output, I'm mostly just guessing.  There is an entire page devoted to formulating useful mailing list questions, incidentally.  Yours really isn't.  Or if the atc table definition is complete, you are definitely missing potentially useful indexes, since you are joining to that table via encountersid and you don't show an index on that column - yet that is the column that eventually joins out to the patient and patientaccount tables, which have the stronger filters on them.

Incidentally, why the join to the entity table via entitysid?  No columns from that table appear to be used anywhere else in the query.

--sam


pgsql-performance by date:

Previous
From: Martin French
Date:
Subject: Re: Out of shared mem on new box with more mem, 9.1.5 -> 9.1.6
Next
From: Maciek Sakrejda
Date:
Subject: Re: Out of shared mem on new box with more mem, 9.1.5 -> 9.1.6