Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id 4FACEEFD0200002500047BB3@gw.wicourts.gov
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> [ uniformly sample the TID space defined as (1..P, 1..M) ]
> 
>> Shouldn't that get us the randomly chosen sample we're looking
>> for?  Is there a problem you think this ignores?
> 
> Not sure.  The issue that I'm wondering about is that the line
> number part of the space is not uniformly populated, ie, small
> line numbers are much more likely to exist than large ones.  (In
> the limit that density goes to zero, when you pick M much too
> large.)  It's not clear to me whether this gives an unbiased
> probability of picking real tuples, as opposed to hypothetical
> TIDs.
I'm convinced it generates a random sample, since every tuple in the
*possible* space has an equal chance of selection, and we ignore the
ones which don't exist or aren't visible, each of the remaining
(existing, visible) tuples is as likely to be chosen as any other. 
I'm that's not a formal proof, but I'm sure I could work one up.
> Another issue is efficiency.  In practical cases you'll have to
> greatly overestimate M compared to the typical actual-number-of-
> tuples-per-page, which will lead to a number of target TIDs N
> that's much larger than necessary, which will make the scan slow
> --- I think in practice you'll end up doing a seqscan or something
> that might as well be one, because unless S is *really* tiny it'll
> hit just about every page.  We can have that today without months
> worth of development effort, using the "WHERE random() < S"
> technique.
I think you've probably got me there.  The technique you describe
should be equally random, and  thinking through the work involved in
each, the random() < S approach seems almost certain to be cheaper. 
Is it worth supporting the TABLESAMPLE BERNOULLI option as syntactic
sugar for that?
-Kevin


pgsql-hackers by date:

Previous
From: Florian Pflug
Date:
Subject: Re: Gsoc2012 idea, tablesample
Next
From: Fujii Masao
Date:
Subject: Re: Can pg_trgm handle non-alphanumeric characters?