Re: Optimizer improvements: to do or not to do? - Mailing list pgsql-hackers

From Say42
Subject Re: Optimizer improvements: to do or not to do?
Date
Msg-id 1158058116.453148.224620@i42g2000cwa.googlegroups.com
In response to Optimizer improvements: to do or not to do?  ("Say42" <andrews42@yandex.ru>)
Responses Re: Optimizer improvements: to do or not to do?  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
Simon Riggs wrote:
> This type of work is 90% analysis, 10% coding. You'll need to do a lot
> of investigation, lots of discussion and listening.

I absolutely agree with you, and I am not about to rush into coding
right now. First of all I'm going to dig through the PG sources,
READMEs and so on. It's a good school for both coding and
understanding DBMS internals.

> > That's what I want to do:
> > 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

Sometimes indexCorrelation is even misleading. There are many examples
of overestimating index scan cost (data is well clustered but not
ordered, so correlation is low) and some cases of underestimation when
tuples look well ordered with a high degree of correlation, yet the
index scan actually causes random page fetches (1-3-2-4-6-5, for
example; on a server without RAID that is VERY slow, 25 times slower
than a bitmap index scan). If we had a dedicated clustering measure we
could estimate the page count more precisely.
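To illustrate the gap between correlation and actual access order (a toy sketch of my own, not PostgreSQL's estimator), the fetch sequence 1-3-2-4-6-5 above has a high Pearson correlation with heap position, yet not a single consecutive pair of fetches is sequential:

```python
# Toy illustration (not PostgreSQL's cost model): a page fetch order can
# correlate strongly with heap position while still forcing random I/O.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

order = [1, 3, 2, 4, 6, 5]            # page fetch order from the example
positions = list(range(1, 7))

r = pearson(positions, order)         # ~0.886: looks "well correlated"
sequential = sum(b == a + 1 for a, b in zip(order, order[1:]))
print(r, sequential)                  # high r, yet 0 sequential fetches
```

A correlation-based estimator would treat this scan as mostly sequential; a clustering measure that looks at adjacent fetches would not.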
The next step could be to introduce 'ordering' as a measure of how
sequentially pages are accessed. Without 'ordering', all we can
assume is that pages are fetched in random order. Anyhow, if index access
cost is overestimated we can set random_page_cost = 2. (Does that hold
in a production database with a smart RAID array?)
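The workaround above works because, in a rough simplification of the cost formulas (my own sketch, not the actual code in costsize.c), the random-I/O component of an index scan estimate scales linearly with random_page_cost:

```python
# Rough sketch (assumed simplification, not PostgreSQL's real formula):
# the random-I/O component of an index scan estimate scales linearly
# with the random_page_cost setting.
def index_scan_io_cost(pages_fetched, random_page_cost):
    return pages_fetched * random_page_cost

pages = 500
print(index_scan_io_cost(pages, 4.0))  # 2000.0 with the default setting
print(index_scan_io_cost(pages, 2.0))  # 1000.0 after halving the knob
```

So lowering the knob halves the penalty globally, for every index scan, which is exactly why it is a blunt instrument compared to a better per-index estimate.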
Moreover, I think the problem is more complex. Under the assumption that
index access is always random, we run into another problem: overestimation
of an index scan on the master table. If it is small enough, PG can choose
a seq scan instead of an index scan even when the latter is actually much
cheaper because of caching. That is why caching should be taken into
account when calculating join costs.
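A toy sketch of that caching point (an assumed simplified model, not PG's actual join costing): if the inner table fits in cache, repeated inner scans cost far less than the "every access is random I/O" view predicts.

```python
# Simplified nested-loop page-fetch sketch (assumed model, not
# PostgreSQL's code): compare the pessimistic "full fetch per outer row"
# estimate with one where the inner table stays cached after first use.
def nestloop_pages_no_cache(outer_rows, inner_pages):
    # pessimistic view: the inner table is re-read for every outer row
    return outer_rows * inner_pages

def nestloop_pages_cached(outer_rows, inner_pages, cache_pages):
    if inner_pages <= cache_pages:
        return inner_pages           # read once, served from cache after
    return outer_rows * inner_pages  # does not fit: assume the worst case

outer_rows, inner_pages, cache = 1000, 50, 1024
print(nestloop_pages_no_cache(outer_rows, inner_pages))      # 50000
print(nestloop_pages_cached(outer_rows, inner_pages, cache)) # 50
```

A three-orders-of-magnitude gap like this is enough to flip the planner's choice between a seq scan and an index-based nested loop.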

> > 2. Consider caching of inner table in a nested loops join during
> > estimation total cost of the join.

> I'd work on one thing at a time and go into it deeply.

Good news. So I'm very interested in what you think about my ideas.
Are they wrong, or too naive?


