Home > mailing lists

Re: new correlation metric - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Re: new correlation metric
Date	October 26, 2008 17:08:39
Msg-id	1225051712.5795.1.camel@localhost.localdomain Whole thread Raw
In response to	Re: new correlation metric (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: new correlation metric Re: new correlation metric
List	pgsql-hackers

Tree view

On Sun, 2008-10-26 at 12:44 -0400, Tom Lane wrote:
> I wonder whether we ought to rethink the problem entirely.
[...]

What you say makes a lot of sense.

We would have to take a sample of index leaf pages, but I think we
could get a really useful number from it. For BTree, we can just read
the values in order, and maintain 3 counts:

* same_count: this TID points to the same page as the last one did
* near_count: points to a page that is close enough that readahead or
caching will help
* far_count: points to a page where we don't have a reason to think
readahead or caching will be a major factor

Then, we could use a formula like:

run_cost = min_IO_cost + (max_IO_cost - min_IO_cost)*(far_count/total_count)         + small_factor*(max_IO_cost -
min_IO_cost)*(near_count/total_count)

small_factor should be the one minus the percentage of the near_count
that we expect to find in cache (because it was recently read or due to
readahead).

I think from these numbers most of the concerns are taken into account.
Heikki's almost-in-order case is solved because we could recognize that
the pages are close enough to be in cache still.

Interleaving might still cause a lower estimate, but it seems like any
optimistic estimate we could make for that case is pretty risky. If the
caching effects or readahead aren't as helpful as we expected, it would
mean very bad performance.

Of course, I agree that we need some empiral testing.

> useful stats into someplace or other.  We might need to invent some
> other catalog besides pg_statistic if we want to represent per-index
> properties like correlation.  A minimal solution would be to add a
> correlation column to pg_class or pg_index, but that doesn't scale well
> if you imagine that different index AMs might want different sorts of
> stats.

Why can't we just use pg_statistic with the starelid set to the index
oid?

Regards,Jeff Davis

pgsql-hackers by date:

From: Tom Lane
Date: 26 October 2008, 16:25:02
Subject: Re: BufferAccessStrategy for bulk insert

From: "Ian Caulfield"
Date: 26 October 2008, 17:08:55
Subject: Re: array_agg and array_accum (patch)

Re: new correlation metric - Mailing list pgsql-hackers

Previous

Next