Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From pgsql@mohawksoft.com
Subject Re: Query optimizer 8.0.1 (and 8.0)
Msg-id 16552.24.91.171.78.1107870482.squirrel@mail.mohawksoft.com
In response to Query optimizer 8.0.1 (and 8.0)  (pgsql@mohawksoft.com)
List pgsql-hackers
> pgsql@mohawksoft.com wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size is
>> not a good base for the analyzer, but that the sample size should be
>> based
>> on the size of the table or some calculation of its deviation.
>>
>
>    Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?

I have evidence. If you look through some of the messages in this thread,
you'll see how a sample size of 10000 provides enough data points to
create stats the planner can use.
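
For anyone following along, the knob that controls the sample size is the
per-column statistics target; ANALYZE samples roughly 300 rows per unit of
the target. A minimal sketch (the table and column names here are
hypothetical):

    -- raise the per-column statistics target, then re-analyze
    ALTER TABLE tiger_addresses ALTER COLUMN zip SET STATISTICS 1000;
    ANALYZE tiger_addresses;

    -- or raise the default for the whole session before analyzing
    SET default_statistics_target = 1000;
    ANALYZE tiger_addresses;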

>
> I also do a lot with the complete Census/TIGER database.

Cool, have any code for Mapserver?

>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name.  Like this:
>
>      Alabama    36101 [hundreds of pages with zip's in 36***]
>      Alaska     99686 [hundreds of pages with zip's in 9****]
>      Arizona    85701 [hundreds of pages with zip's in 857**]
>
> Note that the zip codes are *NOT* sequential.
>
> The "correlation" statistic sees that the Zip codes are not
> sequential; so it makes the *HORRIBLE* assumption that they
> are scattered randomly across the disk.

It is my theory that this is because there are too few data points with
which to properly characterize the nature of the data.
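
For what it's worth, the single number ANALYZE computes is visible in the
pg_stats view, so you can see exactly what the planner is working from. A
quick sketch, with hypothetical table/column names:

    -- correlation near 1.0 means the column is physically ordered on disk;
    -- near 0 the planner assumes matching rows are scattered randomly
    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'tiger_addresses'
      AND attname IN ('state_fips', 'zip');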

>
> In reality, even though there's no total ordering of the
> zip codes; any given zip code only exists on a couple
> disk pages; so index scans would be the right choice.
I totally agree.
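
One crude way to check that claim on a reasonably recent server (names are
hypothetical; the text/point cast is just a hack to pull the block number
out of ctid):

    -- count the distinct heap pages that hold one zip code
    SELECT count(DISTINCT (ctid::text::point)[0]) AS heap_pages
    FROM tiger_addresses
    WHERE zip = '36101';

If that comes back as a handful of pages while the planner expects
thousands, the correlation statistic is the culprit.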
>
>
> But the single correlation parameter is not sufficient
> to let the optimizer know this.
>
> No matter how large a sample size you choose, ANALYZE
> will correctly see that Zip codes and State FIPS codes
> are non-correlated, and the optimizer will overestimate
> the # of pages an index scan will need.
>
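
(One way to put a number on that overestimate, again with hypothetical
names: run the query both ways and compare the planner's estimates against
the actual row counts and timings.)

    -- let the planner choose, then force the index path for comparison
    EXPLAIN ANALYZE SELECT * FROM tiger_addresses WHERE zip = '36101';

    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT * FROM tiger_addresses WHERE zip = '36101';
    SET enable_seqscan = on;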

I tried to create an analogy in another post, and TIGER is a perfect
example of the analogy.

Think of the difference between an oscilloscope and a spectrum analyzer.
The current sampling code works more like an oscilloscope: it assumes a
fairly normalized distribution of data, and given that assumption it works
perfectly fine.

When a scope is presented with an audio signal, the trace looks like
gibberish, showing almost no correlation. When you view the same signal in
the frequency domain, as with a spectrum analyzer, you can see clear
patterns in the signal.

Now, fortunately, we don't need any sort of absolute visualization of the
data in TIGER; we only need to see that the data has many subtle local
trends rather than one fairly even overall distribution. That's why a
larger sample works.

If we could do anything, I would add more statistics to the database: a
standard deviation and maybe a sliding-window deviation. The standard
deviation might be pretty high, whereas a sliding window would show less
localized deviation, and less localized deviation would favor index scans.
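
To make that concrete, here is a rough illustration (it assumes a numeric
zip column and a reasonably recent server; all names are hypothetical): a
large global deviation combined with a small average per-page deviation
means each value is locally clustered, which is exactly the case where an
index scan wins.

    -- global deviation of the column
    SELECT stddev(zip) AS global_stddev
    FROM tiger_addresses;

    -- average deviation within each heap page
    SELECT avg(page_stddev) AS avg_local_stddev
    FROM (
        SELECT stddev(zip) AS page_stddev
        FROM tiger_addresses
        GROUP BY (ctid::text::point)[0]   -- one group per heap page
    ) per_page;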

Anyway, like I said, I think the expectation that the data is fairly
normalized or evenly distributed works very well for data acquired over
time. It is data like TIGER, loaded in a multiple-field order (i.e. state,
zipr, zipl), that has complex patterns in the secondary sort fields which
can't be detected with too small a sample.

>
> PS: I pointed out workarounds in my earlier posting
> in this thread.  Yes, I'm using the same TIGER data
> you are.
>

Cool.


