Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers
From: Kevin Grittner
Subject: Re: Gsoc2012 idea, tablesample
Date:
Msg-id: 4FACE43E0200002500047B95@gw.wicourts.gov
In response to: Re: Gsoc2012 idea, tablesample (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Gsoc2012 idea, tablesample; Re: Gsoc2012 idea, tablesample
List: pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> If you're willing to accept that the quality of the results
> depends on having up-to-date stats, then I'd suggest (1) use the
> planner's existing technology to estimate the number of rows in
> the table; (2) multiply by sampling factor you want to get a
> desired number of sample rows; (3) use ANALYZE's existing
> technology to acquire that many sample rows. While the ANALYZE
> code isn't perfect with respect to the problem of nonuniform TID
> density, it certainly will be a lot better than pretending that
> that problem doesn't exist.

That is basically what we've been talking about for SYSTEM samples, which I think we've agreed should be finished and committed before programming on the BERNOULLI option. Nobody is pretending that the problem doesn't exist.

Let me restate the option I think solves the problem well, since it would be easy to have missed what I was responding to amid all the various suggestions on the thread.

The user has asked for a BERNOULLI sample S of some percent of the table. We determine the maximum tuple index M which could be used for any ctid in the table. We get an estimate of the number of pages P in the table using whatever is the best available technique. If you assume that the number of tuples (visible or not) is T, the approximate number of tuples we want to randomly select is S*T.

We will be probing random page numbers 1..P for random tuple indexes 1..M. So how many random probes by ctid does that require? The chance of a hit on each probe is (T/P)/M -- the average number of tuples per page divided by the number of tuple index values allowed. So we need (S*T)/((T/P)/M) probes. Simplifying, that winds up being S*M*P: the product of the sample size as a percentage, the maximum tuple index on a page, and the number of pages. (A result some may have jumped to as intuitively obvious.)

So let's call the number of probes N. We randomly generate N distinct ctid values, where each is a random page number 1..P paired with a random index 1..M. We attempt to read each of these in block number order, not following HOT chains. For each, if the tuple exists and is visible, it is part of our result set.

Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table, although values of M larger than that will degrade performance. The same holds true for the number of pages.

Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores?

Credit where credit is due: I *think* this is merely my restatement of something Florian was suggesting, building on a suggestion from Robert.

-Kevin
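[Editor's note: to make the probe-count arithmetic and the probing loop concrete, here is a minimal standalone C sketch of the scheme described above. It is not PostgreSQL code: the values of S, M, and P are made-up examples, and fetch_tuple_if_visible() is a hypothetical stub standing in for whatever visibility-checked heap fetch (one that does not follow HOT chains) a real implementation would use.]

/*
 * Standalone sketch of the BERNOULLI-by-random-ctid-probes scheme.
 * Assumed example values throughout; fetch_tuple_if_visible() is a
 * hypothetical stub, not a real PostgreSQL API.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

typedef struct { long page; int idx; } Probe;

/* N = S * M * P: the tuple count T cancels out of
 * (S*T) / ((T/P)/M), so no estimate of T is needed. */
static long
probe_count(double S, long M, long P)
{
    return (long) (S * M * P + 0.5);   /* e.g. 0.01 * 200 * 10000 = 20000 */
}

static int
probe_cmp(const void *a, const void *b)
{
    const Probe *pa = a, *pb = b;
    if (pa->page != pb->page)
        return (pa->page < pb->page) ? -1 : 1;
    return pa->idx - pb->idx;
}

/* Hypothetical stand-in for a visibility-checked heap fetch:
 * pretend roughly a quarter of probes land on a visible tuple. */
static bool
fetch_tuple_if_visible(long page, int idx)
{
    (void) page; (void) idx;
    return rand() % 4 == 0;
}

int
main(void)
{
    double S = 0.01;    /* requested sample fraction (assumed)   */
    long   M = 200;     /* max tuple index on any page (assumed) */
    long   P = 10000;   /* estimated number of pages (assumed)   */

    long   n = probe_count(S, M, P);
    Probe *probes = malloc(n * sizeof(Probe));
    long   distinct = 0;

    srand((unsigned) time(NULL));

    /* Generate N *distinct* ctids: fill with random (page, index)
     * pairs, sort, squeeze out duplicates, and top up the tail with
     * fresh random pairs until all N entries are unique. */
    while (distinct < n)
    {
        for (long i = distinct; i < n; i++)
        {
            probes[i].page = rand() % P + 1;   /* page number 1..P */
            probes[i].idx  = rand() % M + 1;   /* tuple index 1..M */
        }
        qsort(probes, n, sizeof(Probe), probe_cmp);
        distinct = 1;
        for (long i = 1; i < n; i++)
            if (probe_cmp(&probes[i], &probes[distinct - 1]) != 0)
                probes[distinct++] = probes[i];
    }

    /* Probe in block-number order; each existing, visible tuple
     * joins the sample. */
    long sampled = 0;
    for (long i = 0; i < n; i++)
        if (fetch_tuple_if_visible(probes[i].page, probes[i].idx))
            sampled++;

    printf("probes = %ld, tuples in sample = %ld\n", n, sampled);
    free(probes);
    return 0;
}

One convenient side effect of the duplicate-removal sort in this sketch is that it also leaves the probes in block-number order, so generating distinct ctids and visiting them in physical order fall out of a single step.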