Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Gsoc2012 idea, tablesample
Msg-id 4FACE43E0200002500047B95@gw.wicourts.gov
In response to Re: Gsoc2012 idea, tablesample  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Gsoc2012 idea, tablesample
Re: Gsoc2012 idea, tablesample
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you're willing to accept that the quality of the results
> depends on having up-to-date stats, then I'd suggest (1) use the
> planner's existing technology to estimate the number of rows in
> the table; (2) multiply by sampling factor you want to get a
> desired number of sample rows; (3) use ANALYZE's existing
> technology to acquire that many sample rows.  While the ANALYZE
> code isn't perfect with respect to the problem of nonuniform TID
> density, it certainly will be a lot better than pretending that
> that problem doesn't exist.

That is basically what we've been talking about for SYSTEM samples,
which I think we've agreed should be finished and committed before
work starts on the BERNOULLI option.  Nobody is pretending that the
problem doesn't exist.  Let me restate the approach that I think
solves the problem well, since it would be easy to have missed what
I was responding to amid all the various suggestions on the thread.

The user has asked for a BERNOULLI sample of some fraction S of the
table.  We determine the maximum tuple index M which could be used
for any ctid in the table.  We get an estimate of the number of
pages P in the table using whatever is the best available technique.
If you assume that the number of tuples (visible or not) is T, the
approximate number of tuples we want to randomly select is S*T.  We
will be probing random page numbers 1..P for random tuple indexes
1..M.  So how many random probes by ctid does that require? The
chance of a hit on each probe is ((T/P)/M) -- the average number of
tuples per page divided by the number of tuple index values allowed.
So we need (S*T)/((T/P)/M) probes.  Simplifying, that winds up being
S*M*P: the product of the sample fraction, the maximum tuple index
on a page, and the number of pages.  (A calculation some may have
jumped to as intuitively obvious.)
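
For example, with made-up numbers: a 1% sample (S = 0.01) of a table
with P = 10,000 pages, taking M = 291 (the largest tuple index
possible with the stock 8K page size), works out to
0.01 * 291 * 10,000 = 29,100 probes.
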
So let's call the number of probes N.  We randomly generate N
distinct ctid values, where each is a random page number 1..P and a
random index 1..M.  We attempt to read each of these in block number
order, not following HOT chains.  For each, if the tuple exists and
is visible, it is part of our result set.
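
To make the mechanics concrete, here's a small standalone C sketch.
This is not backend code: the table, the per-page fill, and the
"tuple exists and is visible" test are all simulated, and S, P, and
M are made-up numbers.  It computes N = S*M*P, generates that many
distinct random ctids, visits them in block-number order, and counts
the hits:

/*
 * Throwaway simulation of the probing scheme described above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct
{
    int     block;      /* page number, 0-based here */
    int     offset;     /* tuple index within the page, 1..M */
} Ctid;

static int
ctid_cmp(const void *a, const void *b)
{
    const Ctid *x = a;
    const Ctid *y = b;

    if (x->block != y->block)
        return (x->block > y->block) - (x->block < y->block);
    return (x->offset > y->offset) - (x->offset < y->offset);
}

int
main(void)
{
    double  S = 0.01;   /* requested sample fraction (1%) */
    int     P = 10000;  /* estimated number of pages */
    int     M = 291;    /* maximum tuple index on any page */
    long    N = (long) (S * M * P + 0.5);   /* number of probes: S*M*P */
    int    *tuples_on_page = malloc(sizeof(int) * P);
    Ctid   *probes = malloc(sizeof(Ctid) * N);
    char   *seen = calloc((size_t) P * M, 1);   /* keeps ctids distinct */
    long    T = 0;      /* real tuple count; the sampler never uses it */
    long    i, hits = 0;

    srand((unsigned) time(NULL));

    /* Fake table with uneven page fill: 0..79 visible tuples per page. */
    for (i = 0; i < P; i++)
    {
        tuples_on_page[i] = rand() % 80;
        T += tuples_on_page[i];
    }

    /* Generate N distinct random ctids: page 0..P-1, tuple index 1..M. */
    for (i = 0; i < N; i++)
    {
        long    pos;

        do
        {
            pos = (long) (rand() % P) * M + rand() % M;
        } while (seen[pos]);
        seen[pos] = 1;
        probes[i].block = (int) (pos / M);
        probes[i].offset = (int) (pos % M) + 1;
    }

    /* Visit the probes in block-number order and count the visible hits. */
    qsort(probes, N, sizeof(Ctid), ctid_cmp);
    for (i = 0; i < N; i++)
    {
        if (probes[i].offset <= tuples_on_page[probes[i].block])
            hits++;
    }

    printf("probes = %ld, sampled tuples = %ld, S*T = %.0f\n",
           N, hits, S * T);

    free(tuples_on_page);
    free(probes);
    free(seen);
    return 0;
}

Run a few times, the sampled-tuple count lands near S*T even though
the pages are filled very unevenly.
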
Since T cancels out of that equation, we don't need to worry about
estimating it.  Results will be correct for any value of M which is
*at least* as large as the maximum tuple index in the table,
although values of M larger than that will degrade performance.  The
same holds true for the number of pages.

Shouldn't that get us the randomly chosen sample we're looking for?
Is there a problem you think this ignores?

Credit where credit is due: I *think* this is merely my restatement
of something Florian was suggesting, building on a suggestion from
Robert.

-Kevin

