Re: [HACKERS] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-performance
From | Josh Berkus |
---|---|
Subject | Re: [HACKERS] Bad n_distinct estimation; hacks suggested? |
Date | |
Msg-id | 200504270825.17550.josh@agliodbs.com Whole thread Raw |
In response to | Bad n_distinct estimation; hacks suggested? (Josh Berkus <josh@agliodbs.com>) |
Responses |
Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
(Mischa Sandberg <mischa.sandberg@telus.net>)
Re: [HACKERS] Bad n_distinct estimation; hacks suggested? (Markus Schaber <schabi@logix-tt.com>) |
List | pgsql-performance |
Mischa, > >Perhaps I can save you some time (yes, I have a degree in Math). If I > >understand correctly, you're trying extrapolate from the correlation > >between a tiny sample and a larger sample. Introducing the tiny sample > >into any decision can only produce a less accurate result than just > >taking the larger sample on its own; GIGO. Whether they are consistent > >with one another has no relationship to whether the larger sample > >correlates with the whole population. You can think of the tiny sample > >like "anecdotal" evidence for wonderdrugs. Actually, it's more to characterize how large of a sample we need. For example, if we sample 0.005 of disk pages, and get an estimate, and then sample another 0.005 of disk pages and get an estimate which is not even close to the first estimate, then we have an idea that this is a table which defies analysis based on small samples. Wheras if the two estimates are < 1.0 stdev apart, we can have good confidence that the table is easily estimated. Note that this doesn't require progressively larger samples; any two samples would work. > I'm with Tom though in being very wary of solutions that require even > one-off whole table scans. Maybe we need an additional per-table > statistics setting which could specify the sample size, either as an > absolute number or as a percentage of the table. It certainly seems that > where D/N ~ 0.3, the estimates on very large tables at least are way way > out. Oh, I think there are several other cases where estimates are way out. Basically the estimation method we have doesn't work for samples smaller than 0.10. > Or maybe we need to support more than one estimation method. Yes, actually. We need 3 different estimation methods: 1 for tables where we can sample a large % of pages (say, >= 0.1) 1 for tables where we sample a small % of pages but are "easily estimated" 1 for tables which are not easily estimated by we can't afford to sample a large % of pages. If we're doing sampling-based estimation, I really don't want people to lose sight of the fact that page-based random sampling is much less expensive than row-based random sampling. We should really be focusing on methods which are page-based. -- Josh Berkus Aglio Database Solutions San Francisco
pgsql-performance by date: