Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers
| From | Florian Pflug |
|---|---|
| Subject | Re: Gsoc2012 idea, tablesample |
| Date | |
| Msg-id | 7DE33781-4A07-4D3C-8E29-B33D37ADF51B@phlo.org |
| In response to | Re: Gsoc2012 idea, tablesample ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
| Responses | Re: Gsoc2012 idea, tablesample; Re: Gsoc2012 idea, tablesample |
| List | pgsql-hackers |
> Florian Pflug wrote:
>> On May 10, 2012, at 18:36 , Kevin Grittner wrote:
>>> Robert Haas wrote:
>>>
>>>> I wonder if you could do this with something akin to the Bitmap
>>>> Heap Scan machinery. Populate a TID bitmap with a bunch of
>>>> randomly chosen TIDs, fetch them all in physical order
>>>> and if you don't get as many rows as you need, rinse and repeat
>>>> until you do.
>>>
>>> If you get too many, it is important that you read all the way to
>>> the end and then randomly omit some of them. While a bit of a
>>> bother, that's pretty straightforward and should be pretty fast,
>>> assuming you're not, like, an order of magnitude high.
>>
>> Why is that? From a statistical point of view it shouldn't matter
>> whether you pick N random samples, or pick M >= N random samples and
>> then randomly pick N from M. (random implying uniformly distributed
>> here).
>
> That sounds to me like exactly what Robert and I both said.
> While passing the heap with the bitmap, if you get to the number you
> want you don't stop there -- you read all of them ("M" in your
> parlance) and randomly drop M minus N of them. Or, if you prefer, I
> guess you could *pick* N of them. I don't see a logical difference.

What I meant to say was "and drop the last M minus N", not "and
randomly drop M minus N". Which, of course, makes my statement utterly
wrong, since the tuples are sorted by TID and you would therefore
penalize tuples with a larger TID. Duh! Sorry for the noise, guys...

>>> But falling short is tougher; making up the difference could be an
>>> iterative process, which could always wind up with having you read
>>> all tuples in the table without filling your sample.
>>
>> But the likelihood of that happening is extremely low, no?
>
> That depends. What if someone just did a mass delete and your
> statistics aren't yet up to date when they ask to pick a relatively
> large percentage of the rows?
>
>> Unless the sampling percentage is very high
>
> Or the statistics are not current. I agree, this shouldn't happen
> often, but we can never know, going in, whether it *is* the case.
> You *could* always wind up needing to read the entire table, and
> still not hit the initially-targeted number of rows. Now, arguably
> you could use data gleaned from each pass to adjust the target or
> inform the size of the next pass. My point is that "we selected too
> few" is a lot more complicated than the "we selected too many" case.
>
>> but that case isn't of much practical importance anyway.
>
> It's important that it's handled in some sane way when it happens.
> And it will happen.

Hm. Maybe one can get rid of these sorts of problems by factoring in
the expected density of the table beforehand and simply accepting that
the results will be inaccurate if the statistics are outdated?

One could, for example, simply pick

  N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage

where

  AvgLiveTuplesPerPage := #Tuples / #Pages

random TIDs, fetch the live ones, and return them.

I'm not totally sure whether this approach is sensitive to
non-uniformity in the tuple-to-line-pointer assignment, though.
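To make the arithmetic concrete, here is a minimal standalone sketch in plain C of one reading of the formula above (it deliberately avoids PostgreSQL internals; the statistics and names such as `n_pages`, `n_live_tuples`, and `sampling_fraction` are made up for illustration). It scales the requested sample up by MaxTuplesPerPage / AvgLiveTuplesPerPage and draws that many random (block, offset) pairs, which a real implementation would then fetch and filter down to live tuples:

```c
/* Sketch of density-adjusted random TID sampling; illustrative values only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
    /* Hypothetical table statistics (a real implementation would take these
     * from the relation's reltuples/relpages figures). */
    const long   n_pages = 10000;            /* #Pages  */
    const long   n_live_tuples = 500000;     /* #Tuples */
    const int    max_tuples_per_page = 291;  /* ~MaxHeapTuplesPerPage, 8 kB pages */
    const double sampling_fraction = 0.01;   /* 1% sample requested */

    /* AvgLiveTuplesPerPage := #Tuples / #Pages */
    double avg_live_per_page = (double) n_live_tuples / n_pages;

    /* Over-sample line pointers by MaxTuplesPerPage / AvgLiveTuplesPerPage so
     * that the expected number of *live* tuples hit is
     * sampling_fraction * n_live_tuples. */
    long n_random_tids = (long) (sampling_fraction * n_live_tuples
                                 * max_tuples_per_page / avg_live_per_page);

    srandom((unsigned int) time(NULL));
    printf("drawing %ld candidate TIDs\n", n_random_tids);

    for (long i = 0; i < n_random_tids; i++)
    {
        /* Modulo bias is ignored for brevity. */
        long block  = random() % n_pages;
        int  offset = 1 + (int) (random() % max_tuples_per_page);

        /* A real implementation would fetch (block, offset) here, skip dead
         * or unused line pointers, and return the live tuples it finds. */
        if (i < 5)
            printf("candidate ctid (%ld,%d)\n", block, offset);
    }
    return 0;
}
```

With up-to-date statistics this yields, in expectation, the requested fraction of the live tuples in a single pass; the price is exactly the inaccuracy accepted above when the statistics are stale.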
>> But something else comes to mind. Does the standard permit samples
>> taken with the BERNOULLI method to contain the same tuple multiple
>> times?
>
> I'm pretty sure not. That would be nonsensical.
>
>> If not, any kind of TID-based approach will have to track all
>> previously fetched TIDs, which seems doable but unfortunate...
>
> Right. You would always need to ignore a duplicate random choice in
> any one cycle of generating ctid values; and if you are iterating
> because you fell short, you would also need to ignore values from
> previous iterations. And OR your new bitmap against the previous
> ones to save for the next iteration, if needed. I never said it
> couldn't be done; it's just very fussy, and you want to avoid a very
> large number of iterations in the case that someone deleted 99.99% of
> your rows right before you ran your sample and before autovacuum
> caught up.

Actually, thinking more about this, if the approach sketched above
turns out to work, then one might be able to get away without
remembering previously computed TIDs, thus removing a lot of
complexity.

Say, for example, you simply start out with a single random TID
tid[0]. Then you produce the sequence of "random" TIDs by setting

  tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)

where D is the expected distance between one TID from the sample set
and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop
once tid[n] > #TIDs.)

This will give you approximately uniformly distributed TIDs, I think,
and will even return them in physical order. The $100 question is, by
how much does this violate the independence requirement, i.e. how far
are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
Some search through the literature should be able to answer that.

Should the dependency turn out to be too high, one might be able to
lower it by scaling up N by a factor q, and then discarding each
generated TID with probability 1/q. This amounts to a gradual switch
to a simple seqscan + bernoulli-experiment algorithm, so for large q
the dependency between P(tuple X picked) and P(tuple Y picked) should
go to zero -- or at least so I think.

best regards,
Florian Pflug
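As a rough standalone sketch of the sequential "random walk" TID generator with 1/q thinning proposed above (plain C again; TIDs are treated as plain integers 1..n_tids, names such as `n_tids`, `sample_size`, and `q` are made up for the example, and the average step is chosen so that about sample_size TIDs survive the thinning):

```c
/* Sketch of the sequential random-walk TID generator with 1/q thinning.
 * Illustrative only: TIDs are plain integers, no PostgreSQL APIs are used. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
    const long n_tids = 1000000;  /* total number of candidate TIDs (#TIDs) */
    const long sample_size = 100; /* N: desired number of sampled TIDs */
    const long q = 4;             /* thinning factor; q = 1 disables thinning */

    /* Average step of the walk: scaling the target up by q shrinks the gap
     * between visited TIDs; the 1/q acceptance test below brings the
     * expected number kept back to roughly sample_size. */
    long avg_step = n_tids / (sample_size * q);
    if (avg_step < 1)
        avg_step = 1;             /* degenerates into a plain seqscan */

    srandom((unsigned int) time(NULL));

    long tid = 1 + random() % (2 * avg_step - 1);
    long kept = 0;

    while (tid <= n_tids)
    {
        /* Keep this TID with probability 1/q -- the gradual switch towards
         * a per-tuple Bernoulli experiment as q grows. */
        if (random() % q == 0)
        {
            kept++;
            if (kept <= 5)
                printf("sampled tid %ld\n", tid);
        }
        /* Advance by a uniform step in [1, 2*avg_step - 1]; its mean is
         * avg_step, and the visited TIDs stay in physical order. */
        tid += 1 + random() % (2 * avg_step - 1);
    }

    printf("kept %ld TIDs (target was about %ld)\n", kept, sample_size);
    return 0;
}
```

For q = 1 this is the pure random walk with average step n_tids / sample_size; larger q visits more candidate TIDs but keeps each with probability 1/q, trading extra fetches for weaker dependence between picks.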