Thread: new correlation metric

new correlation metric

From
Jeff Davis
Date:
Currently, we use correlation to estimate the I/O costs of an index
scan. However, this has some problems:

http://archives.postgresql.org/pgsql-hackers/2008-01/msg01084.php

I worked with Nathan Boley to come up with what we think is a better
metric for measuring this cost. It is based on the number of times in
the ordered sample that you have to physically backtrack (i.e. the data
value increases, but the physical position is earlier).

For example, if the table's physical order is

6 7 8 9 10 1 2 3 4 5

then the new metric will see one backtrack -- after reading 5 it
backtracks to the beginning to read 6. The old code would have produced
a correlation near -0.5. The new code will yield a "correlation" of 8/9.
The old code would have produced a run_cost of 1/4*min_IO_cost +
3/4*max_IO_cost, while the new code produces a run_cost of
8/9*min_IO_cost + 1/9*max_IO_cost.
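
To make that concrete, the computation over the sample boils down to
something like this (just a sketch to illustrate the idea, not the
actual patch code, and the names are made up):

/*
 * positions[] holds the physical position (sample row number) of each
 * sampled value, after sorting the sample by data value.  Count the
 * transitions that move backwards and return the fraction that don't.
 */
static double
backtrack_fraction(const int *positions, int nvalues)
{
    int     backtracks = 0;
    int     i;

    for (i = 1; i < nvalues; i++)
    {
        /* value increased, but physical position moved backwards */
        if (positions[i] < positions[i - 1])
            backtracks++;
    }

    if (nvalues < 2)
        return 1.0;
    return (double) (nvalues - 1 - backtracks) / (double) (nvalues - 1);
}

For the example above, the positions in value order are 6 7 8 9 10 1 2
3 4 5, giving one backtrack in nine transitions, i.e. 8/9.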

Attached is a patch. We replaced "correlation", but left the name the
same (even though that's now a misnomer). We can change it if needed.

Comments?

Regards,
    Jeff Davis

Attachment

Re: new correlation metric

From
Martijn van Oosterhout
Date:
On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
> I worked with Nathan Boley to come up with what we think is a better
> metric for measuring this cost. It is based on the number of times in
> the ordered sample that you have to physically backtrack (i.e. the data
> value increases, but the physical position is earlier).
>
> For example, if the table's physical order is
>
> 6 7 8 9 10 1 2 3 4 5

How does it deal with a case like the following:

1 6 2 7 3 8 4 9 5 10  (interleaving)

ISTM that your code will overestimate the cost whereas the old code
wouldn't have done too badly.

I think the code is in the right direction, but I think what you want
is some kind of estimate of "given I've looked for tuple X, how many
tuples in the next k pages are near this one". Unfortunately I don't see
a way of calculating it other than a full simulation.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: new correlation metric

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
>> I worked with Nathan Boley to come up with what we think is a better
>> metric for measuring this cost.

> I think the code is in the right direction, but I think what you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one".

ISTM that some experimental studies would be required to justify any
proposal in this area.

Having said that ... one of the things I never liked about the existing
code is that it pays no attention to block-boundary effects.  It doesn't
matter to an indexscan how badly tuples within a single block are
misordered --- what matters is how many block reads you have to do.
So I think that any changes here ought to include measuring the
correlation on the basis of block numbers not tuple numbers.  But what's
not too clear to me is whether our sampling methods would mess that up.
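
For illustration only (an untested sketch, names invented here): the
counting could be done over the sampled tuples' heap block numbers, so
that intra-block misordering is ignored and what we estimate is closer
to the number of block fetches an ordered scan would need:

/*
 * blknos[] holds the heap block number (BlockNumber, per storage/block.h)
 * of each sampled tuple, in data-value order.  Adjacent tuples on the
 * same block are free; everything else counts as another block fetch.
 */
static int
sample_block_fetches(const BlockNumber *blknos, int nvalues)
{
    int     fetches = 0;
    int     i;

    for (i = 0; i < nvalues; i++)
    {
        if (i == 0 || blknos[i] != blknos[i - 1])
            fetches++;
    }
    return fetches;
}
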
        regards, tom lane


Re: new correlation metric

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
>> I worked with Nathan Boley to come up with what we think is a better
>> metric for measuring this cost. It is based on the number of times in
>> the ordered sample that you have to physically backtrack (i.e. the data
>> value increases, but the physical position is earlier).
>>
>> For example, if the table's physical order is
>>
>> 6 7 8 9 10 1 2 3 4 5
> 
> How does it deal with a case like the following:
> 
> 1 6 2 7 3 8 4 9 5 10  (interleaving)
> 
> ISTM that your code will overestimate the cost whereas the old code
> wouldn't have done too badly.

Another similar case is where the table is almost in order, but not
quite. For example, take an ordered table and swap each adjacent pair
of values:

2 1 4 3 6 5 8 7 10 9

Distributions like that would happen naturally if you have a logging
application that receives timestamped events from elsewhere. The events
would come in roughly timestamp order, but not quite, because of
different delays, for example. For all practical purposes, that is just
as good as a perfectly sorted table.

However, the patch actually operates on the *sample*, not the real 
physical order. So I think it would actually work well for my example, 
because if you take just a few values from that sequence at random, the 
sample is likely to be in good order. Not sure about the interleaved case.

I wonder how the proposed measurement behaves wrt. the sample size vs.
table size ratio. Intuitively, it feels like it becomes less sensitive
to disorder as the table size grows and (sample size)/(table size)
becomes smaller. Which is not good, I believe.

> I think the code is in the right direction, but I think what you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunately I don't see
> a way of calculating it other than a full simulation.

Yeah, it would be nice to figure out something, as the current method 
sure isn't perfect.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: new correlation metric

From
Greg Stark
Date:
I haven't looked at the patch yet -- I'm actually on a train now. I'm
sorry if these questions are answered in the patch.

I think there are three questions here:
A) what metric are we going to use
B) how do we estimate the metric given a sample
C) how do we draw the conclusions we need based on this metric

I looked into this topic recently and found papers on estimating two
metrics based on samples: longest sorted subsequence and number of
sorted subsequences. The latter sounds like it might be equivalent to
what you have.

I didn't find any papers on drawing conclusions from either of these
metrics, though.

I think we can just look at block numbers to avoid intra-block disorder
confusing us. It does mean we effectively have a much smaller sample,
though.

I don't think the other objection is worth worrying about, though. If
our sample looks like 1 5 2 6 3 7 4 8, it may be that readahead will
save us in that particular instance, but it's clear the table isn't
well clustered, and it's just luck that those blocks were intermixed in
that very particular way. The rest of the table is unlikely to be so
lucky.


greg

On 26 Oct 2008, at 03:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Martijn van Oosterhout <kleptog@svana.org> writes:
>> On Sun, Oct 26, 2008 at 01:38:02AM -0700, Jeff Davis wrote:
>>> I worked with Nathan Boley to come up with what we think is a better
>>> metric for measuring this cost.
>
>> I think the code is in the right direction, but I think want you want
>> is some kind of estimate of "given I've looked for tuple X, how many
>> tuples in the next k pages are near this one".
>
> ISTM that some experimental studies would be required to justify any
> proposal in this area.
>
> Having said that ... one of the things I never liked about the existing
> code is that it pays no attention to block-boundary effects.  It doesn't
> matter to an indexscan how badly tuples within a single block are
> misordered --- what matters is how many block reads you have to do.
> So I think that any changes here ought to include measuring the
> correlation on the basis of block numbers not tuple numbers.  But what's
> not too clear to me is whether our sampling methods would mess that up.
>
>            regards, tom lane


Re: new correlation metric

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> I think the code is in the right direction, but I think what you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunately I don't see
> a way of calculating it other than a full simulation.

I wonder whether we ought to rethink the problem entirely.  In
particular, the notion of associating correlation stats with particular
columns doesn't seem amazingly useful when you get right down to it.
We're wasting our time calculating the correlation of a column that has
no index; and if the column is part of a multicolumn index it's not that
easy to figure out what the index correlation is from the per-column
numbers.  Not to mention functional and partial indexes.

So it occurs to me that maybe we should forget about per-column
correlations altogether, and instead try directly to calculate *per
index* correlations.  You could do this now by doing an index-only scan
and looking at the series of tuple block numbers that come back.
However, there's no obvious way to do any sampling in that approach
--- you can't read random subsets of the index without cooperation from
the index AM.

So the direction I'm headed in is to imagine that we should add an
"analyze" entry point to the index AM API, with the definition that it
will be called once per index per ANALYZE, and is in charge of putting
useful stats into someplace or other.  We might need to invent some
other catalog besides pg_statistic if we want to represent per-index
properties like correlation.  A minimal solution would be to add a
correlation column to pg_class or pg_index, but that doesn't scale well
if you imagine that different index AMs might want different sorts of
stats.
        regards, tom lane


Re: new correlation metric

From
Jeff Davis
Date:
On Sun, 2008-10-26 at 12:44 -0400, Tom Lane wrote:
> I wonder whether we ought to rethink the problem entirely.
[...]

What you say makes a lot of sense.

We would have to take a sample of index leaf pages, but I think we
could get a really useful number from it. For BTree, we can just read
the values in order, and maintain 3 counts:

* same_count: this TID points to the same page as the last one did
* near_count: points to a page that is close enough that readahead or
caching will help
* far_count: points to a page where we don't have a reason to think
readahead or caching will be a major factor

Then, we could use a formula like:

run_cost = min_IO_cost
         + (max_IO_cost - min_IO_cost) * (far_count/total_count)
         + small_factor * (max_IO_cost - min_IO_cost) * (near_count/total_count)

small_factor should be one minus the fraction of the near_count tuples
that we expect to find in cache (because they were recently read or
fetched by readahead).
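
Spelled out as a rough sketch (invented names; the small_factor handling
in particular is up for grabs):

/*
 * same_count/near_count/far_count come from walking the sampled btree
 * leaf pages in key order and classifying each heap TID relative to
 * the previous one.  small_factor is one minus the fraction of "near"
 * fetches we expect to be satisfied by cache or readahead.
 */
static Cost
estimate_run_cost(Cost min_IO_cost, Cost max_IO_cost,
                  double same_count, double near_count, double far_count,
                  double small_factor)
{
    double  total_count = same_count + near_count + far_count;
    Cost    spread = max_IO_cost - min_IO_cost;

    if (total_count <= 0)
        return min_IO_cost;

    return min_IO_cost
        + spread * (far_count / total_count)
        + spread * small_factor * (near_count / total_count);
}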

I think from these numbers most of the concerns are taken into account.
Heikki's almost-in-order case is solved because we could recognize that
the pages are close enough to be in cache still.

Interleaving might still cause a lower estimate, but it seems like any
optimistic estimate we could make for that case is pretty risky. If the
caching effects or readahead aren't as helpful as we expected, it would
mean very bad performance.

Of course, I agree that we need some empirical testing.

> useful stats into someplace or other.  We might need to invent some
> other catalog besides pg_statistic if we want to represent per-index
> properties like correlation.  A minimal solution would be to add a
> correlation column to pg_class or pg_index, but that doesn't scale well
> if you imagine that different index AMs might want different sorts of
> stats.

Why can't we just use pg_statistic with the starelid set to the index
oid?

Regards,
    Jeff Davis



Re: new correlation metric

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2008-10-26 at 12:44 -0400, Tom Lane wrote:
>> We might need to invent some
>> other catalog besides pg_statistic if we want to represent per-index
>> properties like correlation.

> Why can't we just use pg_statistic with the starelid set to the index
> oid?

Well, because pg_statistic is built for per-column stats.  You'd have to
invent some value for staattnum, which would be problematic for views
like pg_stats that expect it to join to a valid pg_attribute row;
and you'd have useless columns like stanullfrac and stadistinct.

There's no problem with using pg_statistic for stats that correspond to
individual index columns (and in fact we do that already); but ISTM
the point here is that correlation/ordering is about the index as a
whole, not any one column of it.
        regards, tom lane


Re: new correlation metric

From
Ron Mayer
Date:
Jeff Davis wrote:
> Currently, we use correlation to estimate the I/O costs of an index
> scan. However, this has some problems:

It certainly helps some cases.

Without the patch, the little test script below ends up picking the
third-fastest plan (a seq scan) for the query below, instead of the
faster bitmap scan or the even faster index scan.
With the patch, it picks the fastest plan: the index scan.

Without Patch     Estimated_cost     Actual_Time
Index Scan        39638.36           331ms
Bitmap Scan       22218.43           415ms
Seq Scan          20125.83           595ms

With Patch        Estimated_cost     Actual_Time
Index Scan        17684.18           333ms
Bitmap Scan       22110.60           400ms
Seq Scan          20117.51           573ms

I was somewhat surprised that the bitmap cost estimates didn't
also change much.  Wouldn't the estimated # of data blocks
read for the bitmap be roughly the same as for the index?

And yes, I know that table's a contrived one that is almost
ideal for this patch - but I have some large clustered-by-zip
address tables where I can find queries that show similar results.

Back in 8.0 I cared a lot since I had a number of real-world
queries picking Seq-Scans instead of Index-Scans.   With 8.3,
though, AFAICT the vast majority of my similar real-world queries
pick the bitmap scans which in practice are pretty close in speed
to the index scans.




======================================================================
-- [1] Test script variation from this 2005 thread:
--    http://archives.postgresql.org/pgsql-hackers/2005-02/msg00298.php

create temporary table tmp1mil as
  select * from
    (select generate_series as a from generate_series(0,9)) as a,
    (select generate_series as b from generate_series(0,9)) as b,
    (select generate_series as c from generate_series(0,9)) as c,
    (select generate_series as d from generate_series(0,9)) as d,
    (select generate_series as e from generate_series(0,9)) as e,
    (select generate_series as f from generate_series(0,9)) as f
  order by a,b,c,d,e,f;
 
create index tmp1mil__c on tmp1mil(c);
vacuum analyze tmp1mil;
select * from pg_stats where tablename='tmp1mil';

\timing
explain select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5; -- 615 ms seqscan

set enable_seqscan = false;
explain select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5; -- 425 ms bitmapscan

set enable_bitmapscan to false;
explain select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5;
select count(*) from tmp1mil where c<5; -- 342 ms indexscan


Re: new correlation metric

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> ... I was somewhat surprised that the bitmap cost estimates didn't
> also change much.  Wouldn't the estimated # of data blocks
> read for the bitmap be roughly the same as for the index?

By definition, a bitmap scan's cost isn't affected by index order
correlation.
        regards, tom lane


Re: new correlation metric

From
Ron Mayer
Date:
Tom Lane wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> ...bitmap cost estimates didn't also change much....
> 
> By definition, a bitmap scan's cost isn't affected by index order
> correlation.

No?   I think I understand that for index scans the correlation
influenced how many data pages are estimated to get sucked in.

Wouldn't a bitmap scan using a single index also fetch roughly
the same number of data pages as an index scan?

I'm not complaining, since 8.3's doing great on all my real-world
queries.   And sorry for my naive questions - feel free to tell
me to just read the code if this is something I should be able
to figure out myself.



Re: new correlation metric

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Tom Lane wrote:
>> By definition, a bitmap scan's cost isn't affected by index order
>> correlation.

> No?   I think I understand that for index scans the correlation
> influenced how many data pages are estimated to get sucked in.

No, it's not about that, it's about whether we're likely to have to read
any of those pages more than once.  The bitmap processing implicitly
sorts all the tuples-to-be-read into physical TID order, so the heap
access is always linear no matter how badly correlated the index is.
        regards, tom lane


Re: new correlation metric

From
"Brendan Jurd"
Date:
Hi Jeff,

I've been assigned to do an initial review of your "new correlation
metric" patch.

If I'm grokking the thread, it looks like Tom suggested a substantial
change in the approach (targeting per-index correlation rather than
per-column) [1], and although you agreed with the spirit of his
suggestion [2], there hasn't been a new version of the patch submitted
in response.

The result is, I'm not really sure how I should direct my effort here.
Is it worth me reviewing the patch as it stands, or should I hold off
until a new version has been prepared, incorporating Tom's comments?

Cheers,
BJ

[1] http://archives.postgresql.org/pgsql-hackers/2008-10/msg01284.php
[2] http://archives.postgresql.org/pgsql-hackers/2008-10/msg01287.php


Re: new correlation metric

From
Jeff Davis
Date:
On Mon, 2008-11-03 at 18:33 +1100, Brendan Jurd wrote:
> If I'm grokking the thread, it looks like Tom suggested a substantial
> change in the approach (targeting per-index correlation rather than
> per-column) [1], and although you agreed with the spirit of his
> suggestion [2], there hasn't been a new version of the patch submitted
> in response.

Yes, that's correct.

Due to the substantial changes from the original version, I don't think
our patch counts as being finished before November 1st. We're continuing
to work on it, but I think that the earliest we'll be ready for review
is at the end of the week.

We don't want to hold anything up, so feel free to move on to another
patch. If you still have time to review when we have a better patch,
we'd appreciate your feedback even if it's too late for 8.4.

Regards,
    Jeff Davis



Re: new correlation metric

From
"Brendan Jurd"
Date:
On Tue, Nov 4, 2008 at 4:21 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> We don't want to hold anything up, so feel free to move on to another
> patch. If you still have time to review when we have a better patch,
> we'd appreciate your feedback even if it's too late for 8.4.
>

No worries, thanks Jeff.

I've moved your patch to "Returned with Feedback" in the commitfest
for now.  I'd be happy to do an initial review of the next version of
the patch whenever it is ready.

Cheers,
BJ