Home > mailing lists

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

From	Josh Berkus
Subject	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date	April 24, 2005 15:31:49
Msg-id	200504241130.50218.josh@agliodbs.com Whole thread Raw
In response to	Re: [PERFORM] Bad n_distinct estimation; hacks suggested? (Andrew Dunstan <andrew@dunslane.net>)
List	pgsql-hackers

Tree view

Andrew,

> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).

Yes, I think that's the failing.   Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable) with a sample size of 25% or more (which
is infeasable in any large table).    The formula does work for populations
where D/N is much lower, say 0.01.  So overall it seems to only work for 1/4
of cases; those where n/N is large and D/N is low.   And, annoyingly, that's
probably the population where accurate estimation is least crucial, as it
consists mostly of small tables.

I've just developed (not original, probably, but original to *me*) a formula
that works on populations where n/N is very small and D/N is moderate (i.e.
0.1 to 0.4):

N * (d/n)^(sqrt(N/n))

However, I've tested it only on (n/N < 0.005 and D/N > 0.1 and D/N < 0.4)
populations, and only 3 of them to boot.   I'd appreciate other people trying
it on their own data populations, particularly very different ones, like D/N
> 0.7 or D/N < 0.01.

Further, as Andrew points out we presumably do page sampling rather than
purely random sampling so I should probably read the paper he referenced.
Working on it now ....

--
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-hackers by date:

From: "Joshua D. Drake"
Date: 24 April 2005, 15:01:42
Subject: Re: Constant WAL replay

From: Tom Lane
Date: 24 April 2005, 15:42:22
Subject: Old-style OR indexscan slated for destruction

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

Previous

Next