Re: PG8.2.1 choosing slow seqscan over idx scan - Mailing list pgsql-performance

From Chad Wagner
Subject Re: PG8.2.1 choosing slow seqscan over idx scan
Msg-id 81961ff50701162052s4779103dq3f0314fe2ac17134@mail.gmail.com
In response to Re: PG8.2.1 choosing slow seqscan over idx scan  ("Jeremy Haile" <jhaile@fastmail.fm>)
Responses Re: PG8.2.1 choosing slow seqscan over idx scan
List pgsql-performance
On 1/16/07, Jeremy Haile <jhaile@fastmail.fm> wrote:
> The table is heavily inserted and deleted from.  Recently I had done a
> very large delete.

That's what I suspected. 

> Here is the results of the query you sent me: (sorry it's hard to read)
>
> "transaction_date";0;8;172593;-0.194848
> Just curious - what does that tell us?

Based on the explain plan you posted earlier, we learned the optimizer believed the query would return 322,558 rows (and it was reasonably accurate, too; the actual count was 347,434) for a relatively small time frame (15 hours and 20 minutes).

->  Seq Scan on transaction_facts  (cost=0.00..333928.25 rows=322558
 width=16) (actual time=19917.957..140036.910 rows=347434 loops=1)

Based on the information you just posted, the average row length is 156 bytes.

347,434 rows * 156 bytes ≈ 52MB. That is small enough that it could plausibly be held in your shared buffers, which makes Tom's suggestion very plausible: the index scan may not actually be cheaper -- it may only have looked cheaper because everything it touched was already cached.
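If you want to sanity-check that arithmetic, something like the following should work (pg_relation_size and pg_size_pretty exist as of 8.1, and transaction_facts is the table from your plan):

    SELECT pg_size_pretty(pg_relation_size('transaction_facts'));
    SHOW shared_buffers;

Compare the two: if the table is smaller than shared_buffers (plus whatever the OS caches on top), repeated scans of it are mostly memory reads.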


The cost estimate when you are requesting a "range" of data involves the "correlation" factor, which the PostgreSQL documentation defines as:

"Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is NULL if the column data type does not have a < operator.)"

Which means that as correlation approaches zero (yours is -0.19, which I would call zero), the physical ordering of the data in the data files matches the logical ordering so poorly that a range scan of the btree index would result in many scattered reads (which are slow).  So the optimizer considers whether a "scattered" read will be cheaper than a sequential scan based on a few factors: a) correlation (for range predicates), b) the size of the table, and c) the estimated cardinality (how many rows it thinks you are going to get back).  Based on those factors it may choose either access path (sequential or index).

One of the reasons the sequential scan is slower is that the optimizer doesn't know the data you are requesting is sitting in the cache (and it is VERY unlikely you have the entire table in cache, unless it is a heavily used or very small table, which this one probably is not).  So once it decides a sequential scan is the best plan, it starts chugging away at the disk, pulling in most of the pages in the table and throwing away the ones that do not meet your criteria.

The index scan looks quicker (and may be bogus, as Tom suggested) because it starts chugging away at the index and finds that many of the pages you are interested in are already cached (but it didn't know that; you just got lucky!).

In practice, once you start pulling in roughly 15% or more of the table, it is often cheaper just to read the entire table in rather than pay for scattered reads plus double I/O.  Remember that an index access means reading the index pages PLUS the table pages from disk, while a sequential scan skips the index entirely and reads only the table.
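The knob the planner uses to model that scatter-read penalty is random_page_cost (default 4.0, i.e. a random page fetch is assumed to cost four times a sequential one; 8.2 also added seq_page_cost, default 1.0):

    SHOW random_page_cost;
    SHOW seq_page_cost;
    -- on a mostly-cached system some people experiment per session, e.g.:
    -- SET random_page_cost = 2.0;

I wouldn't change it globally on a hunch, but it does explain where the crossover point comes from.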


I would suggest running explain analyze after restarting the database (and even that may not be good enough, because a large portion of the data files may still be in the OS's buffer cache -- hrm, reboot? unmount?) and seeing how cheap that index access path really is.
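If bouncing the server isn't convenient, one (imperfect) way to at least time the index path is to disable sequential scans for a single session and re-run the query -- note the WHERE clause below is only a stand-in for your actual query:

    BEGIN;
    SET LOCAL enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM transaction_facts
    WHERE transaction_date >= '2007-01-16 00:00'
      AND transaction_date <  '2007-01-16 15:20';
    ROLLBACK;

Keep in mind this tells you nothing about cold-cache behavior; both runs may be hitting memory.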

One thing to be careful of here is that you really need to consider the primary use of the table and the major queries you will be launching against it.  You could improve the correlation by rebuilding the table ordered by the transaction_date column, but that may screw up other range scans.  Another option is partitioning.  I wouldn't do any of this stuff until you find out whether the last tweak you made still holds true: give it a few days, and perhaps test it after a clean reboot of the server.
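For the record, the "rebuild ordered by transaction_date" idea is what CLUSTER does.  A sketch in 8.2 syntax, assuming a hypothetical index transaction_facts_date_idx on transaction_date (CLUSTER rewrites the whole table and takes an exclusive lock, so schedule it carefully):

    CLUSTER transaction_facts_date_idx ON transaction_facts;
    ANALYZE transaction_facts;  -- refresh stats, including correlation

After that, correlation on transaction_date should sit near +1 -- until enough inserts and deletes scramble it again, since CLUSTER is a one-time reordering.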

Let me know if any of this is inaccurate, folks, as I certainly don't profess to be an expert on the internals of PostgreSQL, but I believe it is accurate based on my prior experience with other database products.  :)


--
Chad
http://www.postgresqlforums.com/
