Re: Multi-pass planner - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Multi-pass planner
Date
Msg-id CAMkU=1y13w5oYaJ_mwJ-dPpZAugOUTWfffgAKRa=5Po5gbznCA@mail.gmail.com
Whole thread Raw
In response to Re: Multi-pass planner  (Greg Stark <stark@mit.edu>)
Responses Re: Multi-pass planner
List pgsql-hackers
On Wed, Apr 3, 2013 at 6:40 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Aug 21, 2009 at 6:54 PM, decibel <decibel@decibel.org> wrote:
Would it? Risk seems like it would just be something along the lines of the high-end of our estimate. I don't think confidence should be that hard either. IE: hard-coded guesses have a low confidence. Something pulled right out of most_common_vals has a high confidence.

I wouldn't be so sure of that.  I've run into cases where all of the frequencies pulled out of most_common_vals are off by orders of magnitude.  The problem is that if ANALYZE only samples 1/1000th of the table, and it sees a value twice, it assumes the value is present 2000 times in the table, even when it was only in the table twice.  Now, for any given value that occurs twice in the table, it is very unlikely for both of those to end up in the sample. But when you have millions of distinct values which each occur twice (or some low number of time), it is a near certainty that several of them are going to end with both instances in the sample.  Those few ones that get "lucky" are of course going to end up in the most_common_vals list.   

Since the hashjoin estimates cost depending on the frequency of the most common value, having this be systematically off by a factor of 1000 is rather unfortunate.

The problem here is that the sample size which is adequate for getting a good estimate of the histograms (which is what controls the sample size currently) is not adequate for getting a good estimate of most_common_vals.  Cranking up the statistics_target would give a better estimates of most_common_vals, but at the expense of having a needlessly large histogram, which slows down planning.  There is currently no knob to crank up the sample size for the sake of most common values, but then prune the histogram back down for storage.

Cheers,

Jeff

pgsql-hackers by date:

Previous
From: Daniel Farina
Date:
Subject: Re: confusing message about archive failures
Next
From: Claudio Freire
Date:
Subject: Re: Multi-pass planner