Re: Does "correlation" mislead the optimizer on large - Mailing list pgsql-performance

From Ron Mayer
Subject Re: Does "correlation" mislead the optimizer on large
Msg-id Pine.LNX.4.44.0301241138070.986-100000@localhost.localdomain
In response to Re: Does "correlation" mislead the optimizer on large  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
List pgsql-performance
On Fri, 24 Jan 2003, Stephan Szabo wrote:
>
> I think it's a clumping effect.


Yup, I think that's exactly the effect.

A proposal... (yes, I'm volunteering if people point me in the right
direction)... would be to have a "pluggable" set of analyze functions, so that a
huge database that runs analyze infrequently could choose a very slow
analyze that might work better for its data.

I see no reason the different analyze functions would need to be compiled into
the source code... they could probably exist as PL/pgSQL functions.

The one thing compiling it in would help with is knowing
the exact number of tuples on each individual page, but I guess
reltuples/relpages from pg_class is a good estimate.


> For example, I made a table (ordered)  with 20 values of a, 50 values of b
> (each showing up in each a) and 100 values of c (not used, just means 100
> rows for each (a,b) combination. It's got 541 pages it looks like. Analyze
> sets the correlation to about 0.08 on the table and so a query like:
> select * from test1 where b=1; prefers a sequence scan (1791 vs 2231)
> while the index scan actually performs about 5 times better.

That sounds like the same situation I was in. If my logic is right, this
means you had about 184 tuples/page (20*50*100/541), so it looks to me
like for each "a" you get half a page where "b=1".
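Spelling that arithmetic out (just my back-of-the-envelope numbers, nothing
read from PostgreSQL itself):

```python
# Stephan's table: 20 values of a, 50 of b, 100 of c, in 541 pages.
reltuples = 20 * 50 * 100                       # total rows = 100000
relpages = 541                                  # pages reported for the table
tuples_per_page = reltuples / relpages          # ~184.8 tuples per page

rows_per_ab = 100                               # one (a,b) combination = 100 rows
pages_per_ab = rows_per_ab / tuples_per_page    # ~0.54, i.e. half a page
```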

If 'c' had 200 values, I think you'd get an even bigger speedup,
because right now half of each b=1 page is still "wasted" on b=2 values.

If 'c' had 10000 values, I think you'd get a slightly bigger speedup
still, because you'd have so many b=1 pages next to each other that you'd
benefit from more sequential disk access.


> I guess the reason is that in general, the index scan *really* is reading
> something on the order of 40 pages rather than the much larger estimate
> (I'd guess something on the order of say 300-400?  I'm not sure how to
> find that except by trying to reverse engineer the estimate number),

Or by adding a printf()... I think it'd be in cost_index() in costsize.c.

> because pretty much each value of a will probably have 1 or 2 pages with
> b=1.
>
> I'm not really sure how to measure that, however.


As I said... I'm happy to volunteer and experiment if people point
me in a good direction.

    Ron


