Thread: Re: Gsoc2012 idea, tablesample
On Mon, 16 Apr 2012 23:17:25 -0700, Heikki Linnakangas wrote: > 1. We probably don't want the SQL syntax to be added to the > grammar. This should be written as an extension, using custom > functions as the API, instead of extra SQL syntax. I can't find the discussion about this, have any pointer ? I've found a patch of 2007 by Gavin Sherry implementing the SQL 2003 TABLESAMPLE syntax. May be a good starting point ? http://www.neilconway.org/talks/hacking/ --strk; ,------o-. | __/ | Delivering high quality PostGIS 2.0 ! | / 2.0 | http://strk.keybit.net - http://vizzuality.com`-o------'
[point splintered in quoting re-joined] Florian Pflug wrote: > On May10, 2012, at 18:36 , Kevin Grittner wrote: >> Robert Haas wrote: >> >>> I wonder if you could do this with something akin to the Bitmap >>> Heap Scan machinery. Populate a TID bitmap with a bunch of >>> randomly chosen TIDs, fetch them all in physical order >>> and if you don't get as many rows as you need, rinse and repeat >>> until you do. >> >> If you get too many, it is important that you read all the way to >> the end and then randomly omit some of them. While a bit of a >> bother, that's pretty straightforward and should be pretty fast, >> assuming you're not, like, an order of magnitude high. > > Why is that? From a statistical point of view it shouldn't matter > whether you pick N random samples, or pick M >= N random samples an > then randomly pick N from M. (random implying uniformly distributed > here). That sounds to me like exactly what what Robert and I both said. While passing the heap with the bitmap, if you get to the number you want you don't stop there -- you read all of them ("M" in your parlance) and randomly drop M minus N of them. Or, if you prefer, I guess you could *pick* N of them. I don't see a logical difference. >> But falling short is tougher; making up the difference could be an >> iterative process, which could always wind up with having you read >> all tuples in the table without filling your sample. > > But the likelihood of that happening is extremely low, no? That depends. What if someone just did a mass delete and your statistics aren't yet up to date when they ask to pick a relatively large percentage of the rows. > Unless the sampling percentage is very high Or the statistics are not current. I agree, this shouldn't happen often, but we can never know, going in, whether it *is* the case. You *could* always wind up needing to read the entire table, and still not hit the initially-targeted number of rows. Now, arguably you could use data gleaned from each pass to adjust the target or inform the size of the next pass. My point is that "we selected too few" is a lot more complicated that the "we selected too many" case. > but that case isn't of much practical importance anyway. It's important that it's handled in some sane way when it happens. And it will happen. > But something else comes to mind. Does the standard permit samples > taken with the BERNOULLI method to contain the same tuple multiple > times? I'm pretty sure not. That would be nonsensical. > If not, any kind of TID-based approach will have to all previously > fetched TIDs, which seems doable but unfortunate... Right. You would always need to ignore a duplicate random choice in any one cycle of generating ctid values; and if you are iterating because you fell short, you would also need to ignore values from previous iterations. And OR your new bitmap against the previous ones to save for the next iteration, if needed. I never said it couldn't be done; it's just very fussy, and you want to avoid a very large number of iterations in the case that someone deleted 99.99% of your rows right before you ran your sample and before autovacuum caught up. -Kevin
> Florian Pflug wrote: >> On May10, 2012, at 18:36 , Kevin Grittner wrote: >>> Robert Haas wrote: >>> >>>> I wonder if you could do this with something akin to the Bitmap >>>> Heap Scan machinery. Populate a TID bitmap with a bunch of >>>> randomly chosen TIDs, fetch them all in physical order >>>> and if you don't get as many rows as you need, rinse and repeat >>>> until you do. >>> >>> If you get too many, it is important that you read all the way to >>> the end and then randomly omit some of them. While a bit of a >>> bother, that's pretty straightforward and should be pretty fast, >>> assuming you're not, like, an order of magnitude high. >> >> Why is that? From a statistical point of view it shouldn't matter >> whether you pick N random samples, or pick M >= N random samples an >> then randomly pick N from M. (random implying uniformly distributed >> here). > > That sounds to me like exactly what what Robert and I both said. > While passing the heap with the bitmap, if you get to the number you > want you don't stop there -- you read all of them ("M" in your > parlance) and randomly drop M minus N of them. Or, if you prefer, I > guess you could *pick* N of them. I don't see a logical difference. What I meant to say was "and drop the last M minus N", not "and randomly drop M minus N". Which, of course, makes my statement utterly wrong since the tuples are sorted by TID so you'd penalize tuples with a larger TID. Duh! Sorry for the noise, guys... >>> But falling short is tougher; making up the difference could be an >>> iterative process, which could always wind up with having you read >>> all tuples in the table without filling your sample. >> >> But the likelihood of that happening is extremely low, no? > > That depends. What if someone just did a mass delete and your > statistics aren't yet up to date when they ask to pick a relatively > large percentage of the rows. > >> Unless the sampling percentage is very high > > Or the statistics are not current. I agree, this shouldn't happen > often, but we can never know, going in, whether it *is* the case. > You *could* always wind up needing to read the entire table, and > still not hit the initially-targeted number of rows. Now, arguably > you could use data gleaned from each pass to adjust the target or > inform the size of the next pass. My point is that "we selected too > few" is a lot more complicated that the "we selected too many" case. > >> but that case isn't of much practical importance anyway. > > It's important that it's handled in some sane way when it happens. > And it will happen. Hm. Maybe one can get rid of these sorts of problems by factoring in the expected density of the table beforehand and simply accepting that the results will be inaccurate if the statistics are outdated? One could, for example, simply pick N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage where AvgLiveTuplesPerPage := #Tuples / #Pages random TIDs, fetch the live ones, and return them. I'm not totally sure whether this approach is sensible to non-uniformity in the tuple to line-pointer assignment, though. >> But something else comes to mind. Does the standard permit samples >> taken with the BERNOULLI method to contain the same tuple multiple >> times? > > I'm pretty sure not. That would be nonsensical. > >> If not, any kind of TID-based approach will have to all previously >> fetched TIDs, which seems doable but unfortunate... > > Right. You would always need to ignore a duplicate random choice in > any one cycle of generating ctid values; and if you are iterating > because you fell short, you would also need to ignore values from > previous iterations. And OR your new bitmap against the previous > ones to save for the next iteration, if needed. I never said it > couldn't be done; it's just very fussy, and you want to avoid a very > large number of iterations in the case that someone deleted 99.99% of > your rows right before you ran your sample and before autovacuum > caught up. Actually, thinking more about this, if the approach sketched above turns out to work, then one might be able to get away without remembering previously computed TIDs, thus removing a lot of complexity. Say, for example, you simply start out with a single random TID tid[0]. The you produce the sequence of "random" TIDs by setting tid[n+1] = tid[n] + random(1 <= x <= 2*D-1) where D is the expected distance between one TID from the sample set and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once tid[n] >) #TIDs). This will give you approximately uniformly distributed TIDs, I think, and will even return them in physical order. The 100$ question is, by how much does this violate the independence requirement, i.e. how far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart? Some search through the literature should be able to answer that. Should the dependency turn out to be too high, one might be able to lower it by scaling up N by a factor q, and then discarding each generated TID with probability 1/q. This amounts to a gradual switch to a simple seqscan + bernoulli-experiment algorithm, so for large q the dependency between P(tuple X picked) and P(tuple Y picked) should go to zero - or at least so I think. best regards, Florian Pflug
[ Sorry for the self-reply ] On May11, 2012, at 13:17 , Florian Pflug wrote: > Actually, thinking more about this, if the approach sketched above > turns out to work, then one might be able to get away without remembering > previously computed TIDs, thus removing a lot of complexity. > > Say, for example, you simply start out with a single random TID tid[0]. > The you produce the sequence of "random" TIDs by setting > > tid[n+1] = tid[n] + random(1 <= x <= 2*D-1) > > where D is the expected distance between one TID from the sample set > and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once > tid[n] >) #TIDs). This will give you approximately uniformly distributed > TIDs, I think, and will even return them in physical order. The 100$ question > is, by how much does this violate the independence requirement, i.e. how > far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart? > Some search through the literature should be able to answer that. Actually, one can easily do better then that. Say you've determined that to pick each tuple with probability p_tup, you need to pick each TID with probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability of a single TID being live). Now, looping over all TIDs and picking each with probability p_tid is, of course, just a another (more error-prone) way of iterating over all tuples and picking each with probability p_tup. But instead of looping, you can jump from from picked TID to the next. Observe that, after having picked tid X, the probability of picking X+i as the next tid is (1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs and then pick the i-th one. Thus, the following algorithm returns a sorted list of TIDs which should be uniformly distributed and independent (or so I think, at least) 1) Starting with a random tid tid[0], chosen such that P(tid[0] = i) = (1 - p_tid)^i * p_tid 2) Setting tid[n+1] = tid[n] + d, which d > 0 chosen such that P(d = i) = (1 - p_tid)^(i-1) * p_tid 3) Abort if max(tid) is >= #TIDs. This all hinges on the ability to produce a sufficient accurate estimate of the TID density p_tup/p_tid, of course. best regards, Florian Pflug
Florian Pflug <fgp@phlo.org> wrote: > Maybe one can get rid of these sorts of problems by factoring in > the expected density of the table beforehand and simply accepting > that the results will be inaccurate if the statistics are > outdated? > > One could, for example, simply pick > > N := SamplingPercentage * MaxTuplesPerPage / > AvgLiveTuplesPerPage > > where > > AvgLiveTuplesPerPage := #Tuples / #Pages > > random TIDs, fetch the live ones, and return them. To clarify, I read this as using reltuples and relpages for the table, and returning only tuples which are visible according to the query's snapshot. (i.e., I think you used "live" to mean two different things there.) Unless I'm missing something, I think that works for percentage selection, which is what the standard talks about, without any need to iterate through addition samples. Good idea! We don't need to do any second pass to pare down initial results, either. This greatly simplifies coding while providing exactly what the standard requires. > I'm not totally sure whether this approach is sensible to > non-uniformity in the tuple to line-pointer assignment, though. I think we can solve that by going high enough with tuple numbers to reach the highest tuple ID that might be in use in the table, and *not* following HOT chains. (If we follow HOT chains, we could have several distinct ctid values which returned the same tuple.) Or am I thinking about this incorrectly? > [more complex alternatives] I really think your first suggestion covers it perfectly; these more complex techniques don't seem necessary to me. -Kevin
Florian Pflug <fgp@phlo.org> writes: > This all hinges on the ability to produce a sufficient accurate estimate of the > TID density p_tup/p_tid, of course. I think that's the least of its problems. AFAICS this analysis ignores (1) the problem that the TID space is nonuniform, ie we don't know how many tuples in each page until we look; (2) the problem that we don't know the overall number of tuples beforehand. I'm not sure that there is any way to deal with (1) fully without examining every single page, but algorithms that assume that the TIDs are numbered linearly are broken before they start. regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Florian Pflug <fgp@phlo.org> wrote: >> Maybe one can get rid of these sorts of problems by factoring in >> the expected density of the table beforehand and simply accepting >> that the results will be inaccurate if the statistics are >> outdated? > Unless I'm missing something, I think that works for percentage > selection, which is what the standard talks about, without any need > to iterate through addition samples. Good idea! We don't need to > do any second pass to pare down initial results, either. This > greatly simplifies coding while providing exactly what the standard > requires. >> I'm not totally sure whether this approach is sensible to >> non-uniformity in the tuple to line-pointer assignment, though. If you're willing to accept that the quality of the results depends on having up-to-date stats, then I'd suggest (1) use the planner's existing technology to estimate the number of rows in the table; (2) multiply by sampling factor you want to get a desired number of sample rows; (3) use ANALYZE's existing technology to acquire that many sample rows. While the ANALYZE code isn't perfect with respect to the problem of nonuniform TID density, it certainly will be a lot better than pretending that that problem doesn't exist. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you're willing to accept that the quality of the results > depends on having up-to-date stats, then I'd suggest (1) use the > planner's existing technology to estimate the number of rows in > the table; (2) multiply by sampling factor you want to get a > desired number of sample rows; (3) use ANALYZE's existing > technology to acquire that many sample rows. While the ANALYZE > code isn't perfect with respect to the problem of nonuniform TID > density, it certainly will be a lot better than pretending that > that problem doesn't exist. That is basically what we've been talking about for SYSTEM samples, which I think we've agreed should be finished and committed before programming on the BERNOULLI option. Nobody is pretending that the problem doesn't exist. Let me restate the option I'm thinking solves the problem well, since it would be easy to have missed what I was responding to with all the various suggestions on the thread. The user has asked for a BERNOULLI sample S of some percent of the table. We determine the maximum tuple index M which could be used for any ctid in the table. We get an estimate of the number of pages P in the table using whatever is the best available technique. If you assume that the number of tuples (visible or not) is T, the approximate number of tuples we want to randomly select is S*T. We will be probing random page numbers 1..P for random tuple indexes 1..M. So how many random probes by ctid does that require? The chance of a hit on each probe is ((T/P)/M) -- the average number of tuples per page divided by the number of tuple index values allowed. So we need (S*T)/((T/P)/M) probes. Simplifying that winds up being S*M*P the product of the sample size as a percentage, the maximum tuple index on a page, and the number of pages. (A calculation some may have jumped to as intuitively obvious.) So let's call the number of probes N. We randomly generate N distinct ctid values, where each is a random page number 1..P and a random index 1..M. We attempt to read each of these in block number order, not following HOT chains. For each, if the tuple exists and is visible, it is part of our result set. Since T cancels out of that equation, we don't need to worry about estimating it. Results will be correct for any value of M which is *at least* as large as the maximum tuple index in the table, although values of M larger than that will degrade performance. The same holds true for the number of pages. Shouldn't that get us the randomly chosen sample we're looking for? Is there a problem you think this ignores? Credit where credit is due: I *think* this is merely my restatement of something Florian was suggesting, building on a suggestion from Robert. -Kevin
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > We > will be probing random page numbers 1..P for random tuple indexes > 1..M. So how many random probes by ctid does that require? The > chance of a hit on each probe is ((T/P)/M) -- the average number of > tuples per page divided by the number of tuple index values allowed. > So we need (S*T)/((T/P)/M) probes. Simplifying that winds up being > S*M*P the product of the sample size as a percentage, the maximum > tuple index on a page, and the number of pages. (A calculation some > may have jumped to as intuitively obvious.) > > So let's call the number of probes N. We randomly generate N > distinct ctid values, where each is a random page number 1..P and a > random index 1..M. We attempt to read each of these in block number > order, not following HOT chains. For each, if the tuple exists and > is visible, it is part of our result set. > > Since T cancels out of that equation, we don't need to worry about > estimating it. Results will be correct for any value of M which is > *at least* as large as the maximum tuple index in the table, > although values of M larger than that will degrade performance. The > same holds true for the number of pages. The trouble is, AFAICS, that you can't bound M very well without scanning the whole table. I mean, it's bounded by theoretical limit, but that's it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > [ uniformly sample the TID space defined as (1..P, 1..M) ] > Shouldn't that get us the randomly chosen sample we're looking for? > Is there a problem you think this ignores? Not sure. The issue that I'm wondering about is that the line number part of the space is not uniformly populated, ie, small line numbers are much more likely to exist than large ones. (In the limit that density goes to zero, when you pick M much too large.) It's not clear to me whether this gives an unbiased probability of picking real tuples, as opposed to hypothetical TIDs. Another issue is efficiency. In practical cases you'll have to greatly overestimate M compared to the typical actual-number-of-tuples-per-page, which will lead to a number of target TIDs N that's much larger than necessary, which will make the scan slow --- I think in practice you'll end up doing a seqscan or something that might as well be one, because unless S is *really* tiny it'll hit just about every page. We can have that today without months worth of development effort, using the "WHERE random() < S" technique. regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > The trouble is, AFAICS, that you can't bound M very well without > scanning the whole table. I mean, it's bounded by theoretical > limit, but that's it. What would the theoretical limit be? (black size - page header size - minimum size of one tuple) / item pointer size? So, on an 8KB page, somewhere in the neighborhood of 1350? Hmm. If that's right, that would mean a 1% random sample would need 13.5 probes per page, meaning there wouldn't tend to be a lot of pages missed. Still, the technique for getting a random sample seems sound, unless someone suggests something better. Maybe we just want to go straight to a seqscan to get to the pages we want to probe rather than reading just the ones on the "probe list" in physical order? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> The trouble is, AFAICS, that you can't bound M very well without >> scanning the whole table. I mean, it's bounded by theoretical >> limit, but that's it. > What would the theoretical limit be? (black size - page header size > - minimum size of one tuple) / item pointer size? So, on an 8KB > page, somewhere in the neighborhood of 1350? Your math is off --- I get something less than 300, even if the tuples are assumed to be empty of data. (Header size 24 bytes, plus 4-byte line pointer, so at least 28 bytes per tuple, so at most 292 on an 8K page.) But you still end up probing just about every page for a 1% sample. regards, tom lane
On May11, 2012, at 17:20 , Robert Haas wrote: > On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner >> Since T cancels out of that equation, we don't need to worry about >> estimating it. Results will be correct for any value of M which is >> *at least* as large as the maximum tuple index in the table, >> although values of M larger than that will degrade performance. The >> same holds true for the number of pages. > > The trouble is, AFAICS, that you can't bound M very well without > scanning the whole table. I mean, it's bounded by theoretical limit, > but that's it. Hm, but maybe Kevin's observation that the actual value of M shouldn't matter as long as it's large enough helps here. What if you start out with M=1 and generate your first TID. After reading the page, but before returning a tuple, you check if M is indeed an upper bound on the tuple indices. If it isn't, you increase M, recompute N (the number of probes), determine a new random tuple index, and return the tuple (if it is live). Would that introduce bias? I'd think not, because scaling up N shouldn't, per Kevin's argument, change the probability of a TID being picked. So increasing N in the middle of a scan should neither penalize nor favour tuples which were already returned compared to those which will follow, no? (And yes, even if there don't turn out to be any obvious holes in this argument, it requires more formal proof that I was able to give here before being turned into code. Or at the very least excessive testing which all kinds of data) best regards, Florian Pflug
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> [ uniformly sample the TID space defined as (1..P, 1..M) ] > >> Shouldn't that get us the randomly chosen sample we're looking >> for? Is there a problem you think this ignores? > > Not sure. The issue that I'm wondering about is that the line > number part of the space is not uniformly populated, ie, small > line numbers are much more likely to exist than large ones. (In > the limit that density goes to zero, when you pick M much too > large.) It's not clear to me whether this gives an unbiased > probability of picking real tuples, as opposed to hypothetical > TIDs. I'm convinced it generates a random sample, since every tuple in the *possible* space has an equal chance of selection, and we ignore the ones which don't exist or aren't visible, each of the remaining (existing, visible) tuples is as likely to be chosen as any other. I'm that's not a formal proof, but I'm sure I could work one up. > Another issue is efficiency. In practical cases you'll have to > greatly overestimate M compared to the typical actual-number-of- > tuples-per-page, which will lead to a number of target TIDs N > that's much larger than necessary, which will make the scan slow > --- I think in practice you'll end up doing a seqscan or something > that might as well be one, because unless S is *really* tiny it'll > hit just about every page. We can have that today without months > worth of development effort, using the "WHERE random() < S" > technique. I think you've probably got me there. The technique you describe should be equally random, and thinking through the work involved in each, the random() < S approach seems almost certain to be cheaper. Is it worth supporting the TABLESAMPLE BERNOULLI option as syntactic sugar for that? -Kevin
On May11, 2012, at 16:03 , Kevin Grittner wrote: >> [more complex alternatives] > > I really think your first suggestion covers it perfectly; these more > complex techniques don't seem necessary to me. The point of the more complex techniques (especially the algorithm in my second mail, the "reply to self") was simply to optimize the generation of a random, uniformly distributed, unique and sorted list of TIDs. The basic idea is to make sure we generate the TIDs in physical order, and thus automatically ensure that they are unique. The reduces the memory (or disk) requirement to O(1) instead of O(n), and (more importantly, actually) makes the actual implementation much simpler. best regards, Florian Pflug
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> The trouble is, AFAICS, that you can't bound M very well without >> scanning the whole table. I mean, it's bounded by theoretical >> limit, but that's it. > > What would the theoretical limit be? MaxHeapTuplesPerPage? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Robert Haas <robertmhaas@gmail.com> wrote: >>> The trouble is, AFAICS, that you can't bound M very well without >>> scanning the whole table. I mean, it's bounded by theoretical >>> limit, but that's it. >> >> What would the theoretical limit be? > > MaxHeapTuplesPerPage? What about dead line pointers without corresponding tuples? -Kevin
On Fri, May 11, 2012 at 6:16 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> MaxHeapTuplesPerPage? > > What about dead line pointers without corresponding tuples? Actually we don't allow there to be more than MaxHeapTuplesPerPage line pointers even if some of them are dead line pointers. I think the argument then was that there could be bugs in code that isn't expecting more so perhaps that doesn't justify baking that into more places when we could instead be removing that assumption. Also, if I absorbed enough of this conversation skimming backwards it seems the algorithm would be very inefficient with such a conservative estimate. On a table with normal width rows it would mean usually picking tuples that don't exist and having to try again. As Tom pointed out that would mean even a small sample would have to read nearly the whole table to find the sample. -- greg