Re: [SQL] [PERFORM] 7.3.1 index use / performance - Mailing list pgsql-general

From Tom Lane
Subject Re: [SQL] [PERFORM] 7.3.1 index use / performance
Date
Msg-id 4828.1042047153@sss.pgh.pa.us
Whole thread Raw
In response to Re: [SQL] [PERFORM] 7.3.1 index use / performance  (Achilleus Mantzios <achill@matrix.gatewaynet.com>)
List pgsql-general
Just to close off the thread, here is the end-result of investigating
Achilleus Mantzios' problem.

------- Forwarded Message

Date:    Wed, 08 Jan 2003 11:54:36 -0500
From:    Tom Lane <tgl@sss.pgh.pa.us>
To:      Achilleus Mantzios <achill@matrix.gatewaynet.com>
Subject: Re: [SQL] [PERFORM] 7.3.1 index use / performance

I believe I see what's going on.  You have a number of silly outlier
values in the report_date column --- quite a few instances of '10007-06-09'
for example.  Depending on whether ANALYZE's random sample happens to
include one of these, the histogram generated by ANALYZE might look like
this (it took about half a dozen tries with ANALYZE to get this result):

dynacom=# analyze noon;
ANALYZE
dynacom=# select histogram_bounds from pg_stats where attname = 'report_date';
                                                      histogram_bounds

-----------------------------------------------------------------------------------------------------------------------------

{1969-06-26,1994-09-24,1996-04-05,1997-07-21,1998-08-27,1999-03-13,1999-11-11,2000-08-18,2001-04-18,2002-01-04,10007-06-09}
(1 row)

in which case we get this:

dynacom=# EXPLAIN select * from noon where
dynacom-# report_date between '2002-01-07' and '2003-01-07';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using noonf_date on noon  (cost=0.00..4.08 rows=1 width=1975)
   Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
(2 rows)

Seeing this histogram, the planner assumes that one-tenth of the table
is uniformly distributed between 2002-01-04 and 10007-06-09, which leads
it to the conclusion that the range between 2002-01-07 and 2003-01-07
probably contains only about one row, which causes it to prefer a scan
on report_date rather than on v_code.

The reason the problem comes and goes is that any given ANALYZE run
might or might not happen across one of the outliers.  When it doesn't,
you get a histogram that leads to reasonably accurate estimates.

There are a couple of things you could do about this.  One is to
increase the statistics target for report_date (see ALTER TABLE SET
STATISTICS) so that a finer-grained histogram is generated for the
report_date column.  The other thing, which is more work but probably
the best answer in the long run, is to fix the outliers, which I imagine
must be incorrect entries.

You could perhaps put a constraint on report_date to prevent bogus
entries from sneaking in in future.

It looks like increasing the stats target would be worth doing also,
if you make many queries using ranges of report_date.

            regards, tom lane

------- End of Forwarded Message

pgsql-general by date:

Previous
From: "Gavin M. Roy"
Date:
Subject: Re: Get PostgreSQL work with Kylix 3 ?
Next
From: Tom Lane
Date:
Subject: Re: Strange Error