Manfred Koizar <mkoi-pg@aon.at> writes:
> Random sampling is more like "every possible sample is equally likely to
> be collected", and two-stage sampling doesn't satisfy this condition.
Okay, I finally see the point here: in the limit as the number of pages
B goes to infinity, you'd expect the probability that each tuple in your
sample came from a different page to go to 1. But this doesn't happen
in the two-stage sampling method: the probability doesn't increase
beyond the value it would have for B=n. On the average each sample page
would supply one tuple, but the odds that this holds *exactly* would be
pretty low.
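Just to put a rough number on that, treating the B pages as equally likely
bins (uniform tuple density, an idealization): the chance that n truly
random tuples all land on different pages is the birthday-problem product
prod(1 - i/B) for i < n, which climbs toward 1 as B grows, while the
two-stage method can't do better than the B = n value.  A quick Python
sketch, with a deliberately tiny n so the numbers stay printable:

    from math import exp, log

    def p_all_distinct(n, B):
        # P(n uniformly chosen tuples all fall on different pages),
        # treating the B pages as equally likely bins.
        return exp(sum(log(1.0 - i / B) for i in range(n)))

    n = 100                     # illustrative only; a real sample is larger
    for B in (n, 10 * n, 100 * n, 1000 * n):
        print(B, p_all_distinct(n, B))

    # Climbs from ~1e-42 at B = n to ~0.95 at B = 1000n under true random
    # sampling; two-stage sampling stays pinned near the B = n figure.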
However, the existing sampling method has glaring flaws of its own,
in particular having to do with the fact that a tuple whose slot is
preceded by N empty slots is N times more likely to be picked than one
that has no empty-slot predecessors. The fact that the two-stage
method artificially constrains the sample to come from only n pages
seems like a minor problem by comparison; I'd happily accept it to get
rid of the empty-slot bias.
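For what it's worth, here's a toy sketch of one rule that produces that
kind of skew: pick a random slot and, if it's empty, advance to the next
live tuple.  That rule (and the wraparound) is just my stand-in for
illustration, not the actual code in analyze.c:

    import random
    from collections import Counter

    # A made-up page layout: '.' is an empty slot, letters are live tuples.
    # 'c' has two empty slots in front of it, 'f' has three, 'a' and 'd' none.
    slots = ['a', '.', '.', 'c', 'd', '.', '.', '.', 'f']

    def pick(slots):
        # Pick a random slot; if it's empty, advance (wrapping) to the
        # next live tuple.
        i = random.randrange(len(slots))
        while slots[i] == '.':
            i = (i + 1) % len(slots)
        return slots[i]

    print(Counter(pick(slots) for _ in range(100000)))
    # Roughly 1 : 3 : 1 : 4 for a : c : d : f, i.e. each tuple is hit in
    # proportion to one plus the number of empty slots just before it.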
A possible compromise is to limit the number of pages sampled to
something a bit larger than n, perhaps 2n or 3n. I don't have a feeling
for the shape of the different-pages probability function; would this
make a significant difference, or would it just waste cycles?
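Here's a crude Monte Carlo of that same all-on-different-pages question
when the sample is confined to k*n pages; the page size, trial count, and
tiny n below are all made up for illustration:

    import random

    def p_distinct_two_stage(n, k, tuples_per_page=50, trials=20000):
        # Restrict the sample to k*n pages, draw n tuples uniformly from
        # those pages, and count how often every tuple comes from a
        # different page.
        pages = k * n
        hits = 0
        for _ in range(trials):
            sample = random.sample(range(pages * tuples_per_page), n)
            if len({t // tuples_per_page for t in sample}) == n:
                hits += 1
        return hits / trials

    n = 20                      # tiny, just to make the trend visible
    for k in (1, 2, 3, 10, 100):
        print(k, p_distinct_two_stage(n, k))

With a realistically large n the k = 1, 2, or 3 cases would all compute to
essentially zero on this particular yardstick, so the extra pages wouldn't
move the needle there; whether that's the yardstick that matters is of
course the question.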
regards, tom lane