On 2015-08-06 22:25, Josh Berkus wrote:
> On 08/06/2015 01:19 PM, Simon Riggs wrote:
>> For me, the docs seem exactly correct. The mathematical implications of
>> that just aren't recorded explicitly.
>
> Well, for the SELECT page, all we need is the following (one changed
> sentence):
>
> The SYSTEM method is significantly faster than the BERNOULLI method when
> small sampling percentages are specified, but it may return a
> less-random sample of the table as a result of clustering effects, and
> may return a highly variable number of results for very small sample sizes.
>
BTW this was one of the motivations for making tsm_system_rows contrib
module, that one will give you exact number of tuples while still doing
page level sampling. But since it does linear probing it's only useful
if you want those really small amounts of tuples because it will always
do random I/O even if you are scanning large part of the table.
>>
>> I will try to reword or add something to make it clear that this can
>> return a variable number of blocks and thus produces a result with
>> greater variability in the number of rows returned.
>>
>> It's documented on the SELECT page only; plus there is a whole new
>> section on writing tablesample functions.
>
> Seems like it would be nice to have more detailed user docs somewhere
> which explain the sampling algos we have, especially if we get more in
> the future. Not sure where would be appropriate for that, though.
>
> If there is no appropriate place, I'll just write a blog.
>
There is a blog post on 2ndQ blog page which tries to describe the
sampling methods visually, not sure if it's more obvious from that or
not. It's somewhat broken on planet though (only title there).
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services