On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Qi,
>
> * Qi Huang (huangqiyx@hotmail.com) wrote:
>> > Doing it 'right' certainly isn't going to be simply taking what Neil did
>> > and updating it, and I understand Tom's concerns about having this be
>> > more than a hack on seqscan, so I'm a bit nervous that this would turn
>> > into something bigger than a GSoC project.
>>
>> As Christopher Browne mentioned, for this sampling method, it is not
>> possible without scanning the whole data set. It improves the sampling
>> quality but increases the sampling cost. I think it should also be used
>> only for some special sampling types, not for general. The general
>> sampling methods, as in the SQL standard, should have only SYSTEM and
>> BERNOULLI methods.
>
> I'm not sure what sampling method you're referring to here. I agree
> that we need to be looking at implementing the specific sampling methods
> listed in the SQL standard. How much information is provided in the
> standard about the requirements placed on these sampling methods? Does
> the SQL standard only define SYSTEM and BERNOULLI? What do the other
> databases support? What does SQL say the requirements are for 'SYSTEM'?

Well, there may be cases where the quality of the sample isn't
terribly important; it just needs to be "reasonable."

I browsed an article on the SYSTEM/BERNOULLI representations; they
both amount to simple picks of tuples.
- BERNOULLI implies picking tuples with a specified probability.
- SYSTEM implies picking pages with a specified probability. (I think
this goes wrong in ways that'll be fairly biased, given that tuples
mayn't be of uniform size, particularly if Slightly Smaller strings
stay in the main pages, whilst Slightly Larger strings get TOASTed...)
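
To make the distinction concrete, here is a rough Python sketch of the
two picks as I read them. It is not how any database implements them;
the "table" is just a list of pages, each a list of tuples, and the
function names and the `rng` parameter are made up for illustration.

```python
import random

def bernoulli_sample(pages, p, rng):
    """Keep each tuple independently with probability p.

    Every page is visited, so the cost is that of a full scan.
    """
    return [t for page in pages for t in page if rng.random() < p]

def system_sample(pages, p, rng):
    """Keep each page with probability p, then take every tuple on it.

    Only the chosen pages need to be read, but tuples arrive in
    page-sized clusters, and pages may hold different numbers of tuples.
    """
    sample = []
    for page in pages:
        if rng.random() < p:
            sample.extend(page)
    return sample

if __name__ == "__main__":
    rng = random.Random(2012)
    # Hypothetical table: pages with wide rows hold fewer tuples.
    tuples_per_page = [rng.choice([2, 10, 50]) for _ in range(200)]
    pages = [[(pno, i) for i in range(n)]
             for pno, n in enumerate(tuples_per_page)]
    print(len(bernoulli_sample(pages, 0.1, rng)))
    print(len(system_sample(pages, 0.1, rng)))
```

Whether the page-level clustering amounts to the bias I'm worried
about depends on how the data is laid out; the sketch only shows where
the two methods differ.
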
I get the feeling that this is a somewhat "magical" feature (in that
users haven't much hope of understanding in what ways the results are
deterministic), magical enough that anyone serious about their result
sets is likely to be unhappy using either SYSTEM or BERNOULLI.

Possibly the forms of sampling that people *actually* need, most of
the time, are more like Dollar Unit Sampling, which is pretty
deterministic, in ways that mandate that it be rather expensive
(e.g., guaranteeing a Seq Scan).
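
For anyone who hasn't met Dollar Unit Sampling, a minimal sketch of
one common variant (fixed-interval selection over cumulative amounts)
follows. The function name, the use of integer cents, and the `rng`
parameter are my own assumptions for illustration. Note that it needs
the grand total and a running sum over every row, which is why I'd
expect a Seq Scan.

```python
import random

def dollar_unit_sample(amounts, n, rng):
    """Pick rows with probability proportional to their amount.

    amounts: non-negative integers (say, cents), one per row.
    n: number of selection points to spread over the total.
    Returns the indices of the selected rows, in scan order.
    """
    total = sum(amounts)              # first full pass over the data
    if n <= 0 or total < n:
        raise ValueError("need 0 < n <= total amount")
    interval = total // n             # one selection point per interval
    next_hit = rng.randrange(interval)  # the only random choice made
    picked = []
    running = 0
    hits = 0
    for idx, amount in enumerate(amounts):   # second full pass
        running += amount
        if hits < n and next_hit < running:
            picked.append(idx)        # this row contains a selection point
            # A large row may swallow several points; count it only once.
            while hits < n and next_hit < running:
                next_hit += interval
                hits += 1
    return picked

if __name__ == "__main__":
    invoices = [100, 0, 5000, 250, 50, 4600]
    print(dollar_unit_sample(invoices, 4, random.Random(17)))
```

Once the starting point is chosen, everything else is determined, and
any row at least as large as the interval is certain to be selected.
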
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"