Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers

From Tzahi Fadida
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 016b01c51243$3599ae20$0b00a8c0@llord
In response to Re: Query optimizer 8.0.1 (and 8.0)  (pgsql@mohawksoft.com)
List pgsql-hackers
Just my 2 cents.
I am not a statistics expert, but besides increasing the sample size and
assuming things about the distribution, I understand you want to get more
information on what distribution the data actually represents.
The usual problem with these things is that the data needs to be sorted
on the index key, and that can take a while, at least for the one time you
want to find out what the distribution is. Example: for cumulative
distributions you first sort the data (I am talking about scalars, but
other key types can probably work too) and run it sequentially through a
KS (Kolmogorov-Smirnov) test. (There are other tests, but this one is good
for general cases.) The test can be run against all kinds of cumulative
distributions -- normal, etc. You then get a feel for how close the data
is to the selected distribution, with a statistic that can be rejected at
the 0.01, 0.05, 0.1, ... significance levels.
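To make that concrete, here is a minimal sketch in Python (illustration
only, not anything from the backend): sort a sample, build its empirical
CDF, and compute the Kolmogorov-Smirnov D statistic against a normal CDF.
The sample data are made up, and fitting the normal's mean/stddev from the
same sample strictly calls for the Lilliefors correction, which is glossed
over here.

import math
import random

def normal_cdf(x, mean, std):
    # CDF of a normal distribution, via the error function.
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    # Largest gap between the empirical CDF of the sorted sample and cdf.
    data = sorted(sample)          # the KS test needs the data ordered
    n = len(data)
    d = 0.0
    for i, x in enumerate(data):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

# Hypothetical sample standing in for the values of an indexed column.
sample = [random.gauss(50000, 12000) for _ in range(500)]
mean = sum(sample) / len(sample)
std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (len(sample) - 1))

d = ks_statistic(sample, lambda x: normal_cdf(x, mean, std))

# Rough large-sample critical value at the 0.05 level is about 1.36/sqrt(n).
critical = 1.36 / math.sqrt(len(sample))
print("D = %.4f, reject normality at 0.05: %s" % (d, d > critical))

The same loop works against any candidate CDF (uniform, exponential, ...),
which is the "all kinds of cumulative distributions" part above.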

Anyway, it can be done; however, I am not sure how much better it is than
plain histograms built from a random() sample with a uniform-distribution
assumption, or than just increasing the sample size and being done with it.
Again, I am talking about the general/common cases.

Regards, Tzahi.

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org 
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of 
> pgsql@mohawksoft.com
> Sent: Wednesday, February 09, 2005 3:46 PM
> To: Ron Mayer
> Cc: Mark Kirkwood; Tom Lane; Ron Mayer; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)
> 
> 
> I wrote a message called "One Big trend vs multiple smaller 
> trends in table statistics" that, I think, explains what 
> we've been seeing.
> 
> 
> > pgsql@mohawksoft.com wrote:
> >>
> >> In this case, the behavior observed could be changed by altering the 
> >> sample size for a table. I submit that an arbitrary fixed sample size 
> >> is not a good base for the analyzer, but that the sample size should 
> >> be based on the size of the table or some calculation of its 
> >> deviation.
> >>
> >
> >    Mark,
> >
> > Do you have any evidence that the Sample Size had anything to do with 
> > the performance problem you're seeing?
> 
> Sample size is only a bandaid for the issue, however, more 
> samples always provide more information.
> 
> 
> >
> > I also do a lot with the complete Census/TIGER database.
> >
> > Every problem I have with the optimizer comes down to the fact that 
> > the data is loaded (and ordered on disk) by State/County FIPS codes, 
> > and then queried by zip-code or by city name.  Like this:
> >
> >      Alabama    36101 [hundreds of pages with zip's in 36***]
> >      Alaska     99686 [hundreds of pages with zip's in 9****]
> >      Arizona    85701 [hundreds of pages with zip's in 855**]
> >
> > Note that the zip codes are *NOT* sequential.
> 
> Again, read "One Big Trend..." and let me know what you 
> think. I think it describes exactly the problem that we see.
> 
> For now, the solution that works for me is to seriously up 
> the value of "targrows" in analyze.c. It makes it take 
> longer, and while the stats are not "correct" because they 
> are not designed to detect these sorts of patterns, a larger 
> sample allows them to be "less wrong" enough to give a better 
> hint to the planner.
> 



