Re: RFC: planner statistics in 7.2 - Mailing list pgsql-hackers

From Philip Warner
Subject Re: RFC: planner statistics in 7.2
Date
Msg-id 3.0.5.32.20010424015647.03ca35e0@mail.rhyme.com.au
Whole thread Raw
In response to Re: RFC: planner statistics in 7.2  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
At 10:10 23/04/01 -0400, Tom Lane wrote:
>>> All that we're discussing here is one specific parameter in the cost
>>> estimation for an indexscan, viz, the extent to which the table ordering
>>> agrees with the index ordering. 
>
>> This does not necessarily follow. A table ordering need not follow the sort
>> order of an index for the index to have a low indexscan cost. All that is
>> required is that most of the rows referred to by an index node must reside
>> in a page or pages that will be read by one IO. eg. a table that has a
>> sequence based ID, with, say 20% of rows updated, will work nicely with an
>> indexscan on the ID, even though it has never been clustered. 
>
>Right, what matters is the extent of correlation between table ordering
>and index ordering, not how it got to be that way.

No; *local* ordering needs to *roughly* match. Global ordering and
record-by-record ordering don't matter.

For example, for a table with an ID field, the rows may be stored as (where
--- indicates a mythical page)

-----
5
9
6
7
-----
1
10
2
3
-----
4
8
11
12
-----

A sorted index may have nodes pointers to (1,2,3), (4,5,6), (7,8,9) and
(10,11,12). The first node would take 1 IO, then each of the others would
take 2. This would give a much more reasonable estimate for the indexscan
costs (assuming a random sample is adequate).


>> What I'm suggesting is that if you look at a random sample of index nodes,
>> you should be able to get a statistically valid estimate of the 'clumping'
>> of the data pointed to by the index. 
>
>And I'm saying that you don't actually have to look at the index in
>order to compute the very same estimate. 

No. Not given the above.


>The only property of the index
>that matters is its sort order; if you assume you know the right sort
>order (and in practice there's usually only one interesting possibility
>for a column) then you can compute the correlation just by looking at
>the table.

This is true, but only if you are strictly interested in sort order, which
I don't think we are.


>Andreas correctly points out that this approach doesn't extend very well
>to multi-column or functional indexes, but I'm willing to punt on those
>for the time being ...

My approach should work with arbitrary indexes.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: RFC: planner statistics in 7.2
Next
From: Adam Rose
Date:
Subject: Re: row name length