Re: [HACKERS] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-performance

From Mischa Sandberg
Subject Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date
Msg-id 1114701696.4270ff80d577c@webmail.telus.net
Whole thread Raw
In response to Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
Quoting Josh Berkus <josh@agliodbs.com>:

> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accurate result than just
> > >taking the larger sample on its own; GIGO. Whether they are consistent
> > >with one another has no relationship to whether the larger sample
> > >correlates with the whole population. You can think of the tiny sample
> > >like "anecdotal" evidence for wonderdrugs.
>
> Actually, it's more to characterize how large of a sample we need.  For
> example, if we sample 0.005 of disk pages, and get an estimate, and then
> sample another 0.005 of disk pages and get an estimate which is not even
> close to the first estimate, then we have an idea that this is a table
which
> defies analysis based on small samples.   Wheras if the two estimates
are <
> 1.0 stdev apart, we can have good confidence that the table is easily
> estimated.  Note that this doesn't require progressively larger
samples; any
> two samples would work.

We're sort of wandering away from the area where words are a good way
to describe the problem. Lacking a common scratchpad to work with,
could I suggest you talk to someone you consider has a background in
stats, and have them draw for you why this doesn't work?

About all you can get out of it is, if the two samples are
disjunct by a stddev, yes, you've demonstrated that the union
of the two populations has a larger stddev than either of them;
but your two stddevs are less info than the stddev of the whole.
Breaking your sample into two (or three, or four, ...) arbitrary pieces
and looking at their stddevs just doesn't tell you any more than what
you start with.

--
"Dreams come true, not free." -- S.Sondheim, ITW


pgsql-performance by date:

Previous
From: Mischa Sandberg
Date:
Subject: Re: Suggestions for a data-warehouse migration routine
Next
From: Marko Ristola
Date:
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?