Re: Optimizer improvements: to do or not to do? - Mailing list pgsql-hackers

From	Ron Mayer
Subject	Re: Optimizer improvements: to do or not to do?
Date	September 13, 2006 17:16:42
Msg-id	45083D08.8040309@cheapcomplexdevices.com
In response to	Re: Optimizer improvements: to do or not to do? (Simon Riggs <simon@2ndquadrant.com>)
Responses	Re: Optimizer improvements: to do or not to do? (Gregory Stark <stark@enterprisedb.com>)
List	pgsql-hackers

Simon Riggs wrote:
> On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
> 
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

"Not sufficient for some types of data" would have been more fair.

I speculate that an additional stat of "average # of unique values
for a column within a block" would go a long way toward helping my
worst queries.

It's common here for the planner to vastly overestimate the
number of pages a query would need to read, because
PostgreSQL's guess at the correlation is practically 0
despite the fact that the distinct values for any given
column are closely packed on a few pages.

Our biggest tables (180G or so) are mostly spatial data with columns
like "City" "State" "Zip" "County" "Street" "School District", "Police
Beat", "lat/long" etc; and we cluster the table on zip,street.

Note that practically all the rows for any single value of any
of the columns will lie in the same few blocks.  However, the
calculated "correlation" is low because the total ordering
of the other values doesn't match that of zip codes.  This
makes the optimizer vastly overestimate the cost of index
scans because it guesses that most of the table will need
to be read, even though in reality just a few pages are needed.
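
A minimal sketch of this effect on synthetic data, not on the real
tables. Everything in it is an assumption made for illustration: the
block size, the row counts, the interleaved county numbering, and the
use of a plain Pearson correlation between row position and column
value as a stand-in for the planner's correlation statistic.

```python
# Hypothetical table: 100 counties, 500 rows each, stored contiguously
# (as if clustered on zip), 100 rows per block.
ROWS_PER_BLOCK = 100
ROWS_PER_COUNTY = 500
N_COUNTIES = 100


def county_for_group(g):
    """Map the g-th physical group of rows to a county number.

    The numbering alternates low, high, low, high (0, 99, 1, 98, ...),
    so the county ordering does not follow the physical ordering.
    """
    return g // 2 if g % 2 == 0 else N_COUNTIES - 1 - g // 2


def build_column():
    """Return the county value of every row, in physical order."""
    return [county_for_group(g)
            for g in range(N_COUNTIES)
            for _ in range(ROWS_PER_COUNTY)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def blocks_touched(column, value):
    """Count the distinct blocks that hold at least one row with this value."""
    return len({pos // ROWS_PER_BLOCK
                for pos, v in enumerate(column) if v == value})


column = build_column()
total_blocks = len(column) // ROWS_PER_BLOCK
correlation = pearson(range(len(column)), column)

print(f"correlation with physical order: {correlation:.3f}")  # close to 0
print(f"blocks holding county 42: {blocks_touched(column, 42)}"
      f" of {total_blocks}")
```

In this toy layout the correlation comes out close to 0, yet every
county sits in 5 of the 500 blocks, which is the mismatch described
above.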

If someone does look at the correlation calculations, I hope
this type of data gets considered as well.

I speculate that a new stat of "average # of unique values for a
column within a block" could be useful here in addition to
correlation.  For almost all the columns in my big table, this stat
would be 1 or 2, which I think would be a useful hint that despite
a low "correlation", the distinct values are indeed packed together
in blocks.  That way the optimizer could see that a
smaller number of pages would need to be accessed than
correlation alone would suggest.
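
A rough sketch of how such a stat might be computed, assuming a
column is available as a list of values in physical row order. The
function name avg_distinct_per_block and both sample layouts are
hypothetical; nothing here corresponds to existing PostgreSQL code.

```python
def avg_distinct_per_block(column, rows_per_block):
    """Average number of distinct column values found in each block."""
    blocks = [column[i:i + rows_per_block]
              for i in range(0, len(column), rows_per_block)]
    return sum(len(set(block)) for block in blocks) / len(blocks)


N_VALUES = 40
ROWS_PER_VALUE = 250
ROWS_PER_BLOCK = 100

# Packed layout: each value occupies one contiguous run of rows, but the
# runs appear in a scrambled order (7 is coprime with 40, so this is a
# permutation of 0..39), which keeps correlation with row order low.
order = [(v * 7) % N_VALUES for v in range(N_VALUES)]
packed = [v for v in order for _ in range(ROWS_PER_VALUE)]

# Scattered layout: the same values spread round-robin over the table.
scattered = [i % N_VALUES for i in range(len(packed))]

print(avg_distinct_per_block(packed, ROWS_PER_BLOCK))     # 1.2
print(avg_distinct_per_block(scattered, ROWS_PER_BLOCK))  # 40.0
```

Both layouts have a low correlation with physical order, but the
per-block stat separates them: about 1 for the packed layout versus
40 for the scattered one.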

Does this make sense, or am I missing something?
