Re: [SQL] [PERFORM] 7.3.1 index use / performance - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: [SQL] [PERFORM] 7.3.1 index use / performance |
Date | |
Msg-id | 4828.1042047153@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [SQL] [PERFORM] 7.3.1 index use / performance (Achilleus Mantzios <achill@matrix.gatewaynet.com>) |
List | pgsql-general |
Just to close off the thread, here is the end-result of investigating Achilleus Mantzios' problem. ------- Forwarded Message Date: Wed, 08 Jan 2003 11:54:36 -0500 From: Tom Lane <tgl@sss.pgh.pa.us> To: Achilleus Mantzios <achill@matrix.gatewaynet.com> Subject: Re: [SQL] [PERFORM] 7.3.1 index use / performance I believe I see what's going on. You have a number of silly outlier values in the report_date column --- quite a few instances of '10007-06-09' for example. Depending on whether ANALYZE's random sample happens to include one of these, the histogram generated by ANALYZE might look like this (it took about half a dozen tries with ANALYZE to get this result): dynacom=# analyze noon; ANALYZE dynacom=# select histogram_bounds from pg_stats where attname = 'report_date'; histogram_bounds ----------------------------------------------------------------------------------------------------------------------------- {1969-06-26,1994-09-24,1996-04-05,1997-07-21,1998-08-27,1999-03-13,1999-11-11,2000-08-18,2001-04-18,2002-01-04,10007-06-09} (1 row) in which case we get this: dynacom=# EXPLAIN select * from noon where dynacom-# report_date between '2002-01-07' and '2003-01-07'; QUERY PLAN --------------------------------------------------------------------------------------------- Index Scan using noonf_date on noon (cost=0.00..4.08 rows=1 width=1975) Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date)) (2 rows) Seeing this histogram, the planner assumes that one-tenth of the table is uniformly distributed between 2002-01-04 and 10007-06-09, which leads it to the conclusion that the range between 2002-01-07 and 2003-01-07 probably contains only about one row, which causes it to prefer a scan on report_date rather than on v_code. The reason the problem comes and goes is that any given ANALYZE run might or might not happen across one of the outliers. When it doesn't, you get a histogram that leads to reasonably accurate estimates. There are a couple of things you could do about this. One is to increase the statistics target for report_date (see ALTER TABLE SET STATISTICS) so that a finer-grained histogram is generated for the report_date column. The other thing, which is more work but probably the best answer in the long run, is to fix the outliers, which I imagine must be incorrect entries. You could perhaps put a constraint on report_date to prevent bogus entries from sneaking in in future. It looks like increasing the stats target would be worth doing also, if you make many queries using ranges of report_date. regards, tom lane ------- End of Forwarded Message
pgsql-general by date: