Re: Optimizer improvements: to do or not to do? - Mailing list pgsql-hackers

From	Gregory Stark
Subject	Re: Optimizer improvements: to do or not to do?
Date	September 12, 2006 09:54:05
Msg-id	877j09v2a8.fsf@enterprisedb.com
In response to	Re: Optimizer improvements: to do or not to do? (Simon Riggs <simon@2ndquadrant.com>)
Responses	Re: Optimizer improvements: to do or not to do? (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Simon Riggs <simon@2ndquadrant.com> writes:

>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

There's been some previous discussion about how "correlation" was not really
what we wanted to be measuring. But that discussion was with regard to
cross-column "correlation". In that case we're trying to predict how selective
a clause will be: if we read x% of the table due to a restriction on X, what
percentage of the values of Y will be represented?

In this case I think we do need to know correlation or something like it.
That's because what we're trying to predict is how close to sequential the i/o
accesses will be. If there's no correlation between index order and disk order
then they'll be random. If they're highly correlated then accesses will be
close to sequential.
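
As a rough illustration of the point above, here is a minimal Python sketch
that computes a plain Pearson correlation between index order and heap block
number. It is not the planner's actual computation; the function names and
the toy table (1000 tuples, 10 per block) are assumptions made up for the
example.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def index_heap_correlation(heap_blocks):
    """heap_blocks[i] is the heap block holding the i-th tuple in index order."""
    return pearson(range(len(heap_blocks)), heap_blocks)


# Hypothetical table: 1000 tuples, 10 tuples per heap block.
ordered = [i // 10 for i in range(1000)]   # index order matches disk order
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)         # index order unrelated to disk order

print(index_heap_correlation(ordered))
print(index_heap_correlation(shuffled))
```

A value near 1 suggests an index scan would visit heap blocks in nearly
ascending order; a value near 0 suggests the visits would look random.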

It's possible there's some sort of "block-wise correlated" measure which would
be even better for our needs. We don't care if all the high values are towards
the start and low values towards the end as long as each section is in order,
for example.
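
One possible shape for such a measure, purely as a sketch: count how often
the next tuple in index order lives on the same heap block as the previous
one, or on the block immediately after it. The name sequential_step_fraction
and the toy layout are hypothetical; nothing like this exists in the planner
as far as this message states.

```python
import random


def sequential_step_fraction(heap_blocks):
    """Fraction of steps in index order that stay on the same heap block
    or move to the very next one.

    heap_blocks[i] is the heap block holding the i-th tuple in index order.
    """
    steps = len(heap_blocks) - 1
    if steps <= 0:
        return 1.0
    sequential = sum(
        1
        for prev, cur in zip(heap_blocks, heap_blocks[1:])
        if 0 <= cur - prev <= 1
    )
    return sequential / steps


# Hypothetical layout: the high half of the key range sits at the start of
# the table and the low half at the end, but each half is in order.
swapped_halves = list(range(500, 1000)) + list(range(0, 500))

# Same blocks visited in a random order.
scattered = list(range(1000))
random.Random(0).shuffle(scattered)

print(sequential_step_fraction(swapped_halves))
print(sequential_step_fraction(scattered))
```

In the swapped-halves layout only the single jump between the two sections
counts as non-sequential, which is the property a global correlation
coefficient fails to capture.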

It's also possible that we could use something like what you describe to
predict how many physical i/os will happen altogether. If the table is highly
clustered but disordered, then the i/o will be random access, but the cache
will be more effective than if the table is highly correlated but not
clustered (though it would take a large table to make that possible, I think).
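
To make the cache effect concrete, here is a small simulation sketch that
counts physical reads for a sequence of heap block accesses. It assumes a
simple LRU cache, which is a simplification chosen for the example and not a
description of PostgreSQL's buffer replacement; the function name, table
size, and cache size are all hypothetical.

```python
import random
from collections import OrderedDict


def count_physical_reads(heap_blocks, cache_size):
    """Count cache misses for block accesses under a simple LRU cache.

    heap_blocks is the sequence of heap blocks touched, in index order.
    """
    cache = OrderedDict()
    misses = 0
    for block in heap_blocks:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # physical read
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return misses


rng = random.Random(0)

# Clustered but disordered: blocks are visited in random order, but all 10
# tuples on a block are adjacent in index order.
block_order = list(range(100))
rng.shuffle(block_order)
clustered = [block for block in block_order for _ in range(10)]

# Not clustered: every tuple in index order lands on an arbitrary block.
scattered = [rng.randrange(100) for _ in range(1000)]

print(count_physical_reads(clustered, cache_size=10))
print(count_physical_reads(scattered, cache_size=10))
```

Both sequences make 1000 tuple fetches over the same 100 blocks; only the
clustered one reads each block from disk exactly once.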

In short, I think what's needed is for someone to review a lot of different
stats metrics for correlation and clustering and do some analysis of how each
would be useful for cost modelling.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
