Re: Does "correlation" mislead the optimizer on large - Mailing list pgsql-performance

From Ron Mayer
Subject Re: Does "correlation" mislead the optimizer on large
Date
Msg-id Pine.LNX.4.44.0301241123310.986-100000@localhost.localdomain
Whole thread Raw
In response to Re: Does "correlation" mislead the optimizer on large tables?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Fri, 24 Jan 2003, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> >   On a large tables, I think the "correlation" pg_stats field as calculated
> >   by "vacuum analyze" or "analyze" can mislead the optimizer.
>
> If you look in the pghackers archives, you will find some discussion
> about changing the equation that cost_index() uses to estimate the
> impact of correlation on indexscan cost.  The existing equation is
> ad-hoc and surely wrong, but so far no one's proposed a replacement
> that can be justified any better.  If you've got such a replacement
> then we're all ears...

I've got a very slow one (full table scan perl script) that helps
my database...  I don't know if it's a good general purpose solution.

That's why I'm asking if the concept is good here.  :-)


> >   In particular, if I have a large table t with columns 'a','b','c', etc,
> >   and I cluster the table as follows:
> >     create table t_ordered as select * from t order by a,b;
> >     vacuum analyze t_ordered;
> >   Column "b" will (correctly) get a very low "correlation" in
> >   the pg_stats table -- but I think the optimizer would do better
> >   assuming a high correlation because similar 'b' values are still
> >   grouped closely on the same disk pages.
>
> How would that be?  They'll be separated by the stride of 'a'.


In the case of date/time (for the queries I showed) the issue was
that 'a's were not at all unique so I had data like this:

     dat     |  time    | value
 ------------|----------|--------------------------------
 2002-01-01  | 00:00:00 | whatever
 2002-01-01  | 00:00:00 |
 2002-01-01  | 00:00:00 |
 2002-01-01  | 00:00:01 |
 2002-01-01  | 00:00:01 |  [many pages of 12am]
 2002-01-01  | 00:00:01 |
 2002-01-01  | 00:00:01 |
 ... thousands more rows....
 2002-01-01  | 00:00:59 |
 2002-01-01  | 00:01:00 | [many pages of 1am]
 ... tens of thousands of rows.
 2002-01-01  | 23:59:59 |
 2002-01-01  | 23:59:59 |
 2002-01-01  | 23:59:59 | [many pages of 11pm]
 2002-01-02  | 00:00:00 | [many *MORE* pages of 12am]
 2002-01-02  | 00:00:00 |
 2002-01-02  | 00:00:00 |
 ... tens of thousands of rows...
 2002-01-02  | 23:59:59 | [many pages of 11pm]
 2002-01-03  | 00:00:00 | [many *MORE* pages of 12am]
 ... millions more rows ...


A similar problem actually shows up again in the dimention tables
of my database; where I bulk load many pages at a time (which can
easily be ordered to give a good correlation for a single load) ...
but then the next week's data gets appended to the end.

    id | value
     ------|----------------------------------
     1 | aalok mehta                      [many pages of all 'a's]
     2 | aamir khan
     3 | aaron beall
       | [...]
      6234 | axel rose
      6234 | austin wolf
      6123 | barbara boxer                    [many pages of all 'b's]
       | [...]
    123456 | young
    123457 | zebra
       | [...data loaded later..]
    123458 | aaron whatever                   [more pages of all 'a's]
    123458 | aaron something else
    123458 | aaron something else
           | [...]
    512344 | zelany


In this case I get many clustered blocks of "a" values, but these
clustered blocks happen at many different times across the table.


> It seems likely to me that a one-dimensional correlation statistic may
> be inadequate, but I haven't seen any proposals for better stats.

The idea is it walks the whole table and looks for more local
correlations and replaces the correlation value with a "good"
value if values "close" to each other on the disk are similar.

This way a single "correlation" value still works ... so I didn't
have to change the optimizer logic, just the "analyze" logic.


Basically if data within each block is highly correlated, it doesn't
matter as much (yeah, I now the issue about sequential reads vs. random
reads).


    Ron


pgsql-performance by date:

Previous
From: Josh Berkus
Date:
Subject: Mount options for Ext3?
Next
From: Ron Mayer
Date:
Subject: Re: Does "correlation" mislead the optimizer on large