Re: Optimizer improvements: to do or not to do? - Mailing list pgsql-hackers

From: Ron Mayer
Subject: Re: Optimizer improvements: to do or not to do?
Msg-id: 45087708.4060306@cheapcomplexdevices.com
In response to: Re: Optimizer improvements: to do or not to do?  (Gregory Stark <stark@enterprisedb.com>)
Responses: Re: Optimizer improvements: to do or not to do?  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Optimizer improvements: to do or not to do?  (Joshua Reich <josh@root.net>)
List: pgsql-hackers
Gregory Stark wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> ...vastly overestimate the number of pages .. because postgresql's guess 
>> at the correlation being practically 0 despite the fact that the distinct 
>> values for any given column are closely packed on a few pages.
> 
> I think we need a serious statistics jock to pipe up with some standard
> metrics that do what we need. Otherwise we'll never have a solid footing for
> the predictions we make and will never know how much we can trust them.

Do we know if any such people participate/lurk on this list, or
if the conversation should go elsewhere?

> That said I'm now going to do exactly what I just said we should stop doing
> and brain storm about an ad-hoc metric that might help:
> 
> I wonder if what we need is something like: sort the sampled values by value
> and count up the average number of distinct blocks per value. That might let
> us predict how many pages a fetch of a specific value would retrieve. Or
> perhaps we need a second histogram where the quantities are of distinct pages
> rather than total records.

Either of these sounds like it might be an improvement over correlation
itself for estimating the number of pages a scan would need to read.  Would
it be relatively easy or hard for a programmer not too familiar with the
code to experiment with these ideas?  Where would be a good place to start
looking?
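
As a very rough first cut, it seems like the "distinct blocks per value"
number could be measured straight from the heap with something like the
query below.  The table and column names are made up, and the ctid-to-point
cast is just a quick hack to pull the block number out of the tuple id:

  -- Hypothetical table "addresses" with a column "city"; compute the
  -- average number of distinct heap blocks each value is spread across.
  SELECT avg(blocks_per_value) AS avg_blocks_per_value
  FROM (
      SELECT city,
             count(DISTINCT (ctid::text::point)[0]) AS blocks_per_value
      FROM addresses
      GROUP BY city
  ) AS per_value;

Presumably ANALYZE would do something along those lines over its sample
rather than over the whole table.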

> We might also need a separate "average number of n-block spans per value"
> metric to predict how sequential the i/o will be in addition to how many pages
> will be fetched.

I'm wildly guessing that the number of pages itself is a bigger factor
than the sequential/random nature of the I/O.  For example, if I do a
query for data from a particular small city, I'd only need dozens of
pages, not many thousands.
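
A quick way to sanity-check that guess, for anyone with such a table handy,
is to count the distinct heap blocks holding one value directly (again,
table, column, and value here are made up):

  SELECT count(DISTINCT (ctid::text::point)[0]) AS pages_holding_value
  FROM addresses
  WHERE city = 'Sebastopol';

If the table was loaded in roughly city order, I'd expect that number to be
a few dozen even though the correlation across the whole table looks like 0.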

OTOH, it'd be neat to know whether that's actually true.  Is there any
good way to make something like EXPLAIN ANALYZE show both the expected
and actual number of pages and number of seeks?
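
Short of that, one crude way I know of to see the actual page count for a
single query is the statistics collector: reset the counters, run the query
in isolation, and look at the per-table block counts (this assumes
stats_block_level is on, nothing else is touching the table, and you allow
for the collector's reporting delay):

  SELECT pg_stat_reset();
  -- run the query of interest here, then:
  SELECT heap_blks_read + heap_blks_hit AS heap_pages,
         idx_blks_read  + idx_blks_hit  AS index_pages
  FROM pg_statio_user_tables
  WHERE relname = 'addresses';

That still doesn't say anything about how many of those reads were seeks
versus sequential, though.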

