Re: Optimizer improvements: to do or not to do? - Mailing list pgsql-hackers

From Ron Mayer
Subject Re: Optimizer improvements: to do or not to do?
Date
Msg-id 45083D08.8040309@cheapcomplexdevices.com
Whole thread Raw
In response to Re: Optimizer improvements: to do or not to do?  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Optimizer improvements: to do or not to do?  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
Simon Riggs wrote:
> On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
> 
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

"Not sufficient for some types of data" would have been more fair.

I speculate that an new additional stat of "average # of unique values for a column within a block"
would go a long way to helping my worst queries.


It's common here for queries to vastly overestimate the
number of pages that would need to be read because
postgresql's guess at the correlation being practically 0
despite the fact that the distinct values for any given
column are closely packed on a few pages.

Our biggest tables (180G or so) are mostly spatial data with columns
like "City" "State" "Zip" "County" "Street" "School District", "Police
Beat", "lat/long" etc; and we cluster the table on zip,street.

Note that practically all the rows for any single value of any
of the columns will lay in the same few blocks.  However the
calculated "correlation" being low because the total ordering
of the other values doesn't match that of zip codes.  This
makes the optimizer vastly overestimate the cost of index
scans because it guesses that most of the table will need
to be read, even though in reality just a few pages are needed.


If someone does look at the correlation calculations, I hope
this type of data gets considered as well.


I speculate that a new stat of "average # of unique values for a column within a block"
could be useful here in addition to correlation.  For most
all my columns in my big table, this stat would be 1 or 2;
which I think would be a useful hint that despite a low
"correlation", the distinct values are indeed packed together
in blocks.   That way the optimizer can see that a
smaller number of pages would need to be accessed than
correlation alone would suggest.

Does this make sense, or am I missing something.


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: CVS commit messages and backpatching
Next
From: Peter Eisentraut
Date:
Subject: Re: CVS commit messages and backpatching