Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Date
Msg-id 55C3C2CC.2090409@agliodbs.com
Whole thread Raw
In response to Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 08/06/2015 01:19 PM, Simon Riggs wrote:
> On 6 August 2015 at 21:14, Josh Berkus <josh@agliodbs.com
> <mailto:josh@agliodbs.com>> wrote:
> 
>     On 08/06/2015 01:10 PM, Simon Riggs wrote:
>     > Given, user-stated probability of accessing a block of P and N total
>     > blocks, there are a few ways to implement block sampling.
>     >
>     > 1. Test P for each block individually. This gives a range of possible
>     > results, with 0 blocks being possible outcome, though decreasing in
>     > probability as P increases for fixed N. This is the same way BERNOULLI
>     > works, we just do it for blocks rather than rows.
>     >
>     > 2. We calculate P/N at start of scan and deliver this number blocks by
>     > random selection from N available blocks.
> 
> 
> (My mistake, that would be P*N) 
>  
> 
>     > At present we do (1), exactly as documented. (2) is slightly harder
>     > since we'd need to track which blocks have been selected already so we
>     > can use a random selection with no replacement algorithm. On a table
>     > with uneven distribution of rows this would still return a variable
>     > sample size, so it didn't seem worth changing.
> 
>     Aha, thanks!
> 
>     So, seems like this is just a doc issue? That is, we just need to
>     document that using SYSTEM on very small sample sizes may return
>     unexpected numbers of results ... and maybe also how the algorithm
>     actually works.
> 
> 
> For me, the docs seem exactly correct. The mathematical implications of
> that just aren't recorded explicitly.

Well, for the SELECT page, all we need is the following (one changed
sentence):

The SYSTEM method is significantly faster than the BERNOULLI method when
small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects, and
may return a highly variable number of results for very small sample sizes.

> 
> I will try to reword or add something to make it clear that this can
> return a variable number of blocks and thus produces a result with
> greater variability in the number of rows returned.
> 
> It's documented on the SELECT page only; plus there is a whole new
> section on writing tablesample functions.

Seems like it would be nice to have more detailed user docs somewhere
which explain the sampling algos we have, especially if we get more in
the future.  Not sure where would be appropriate for that, though.

If there is no appropriate place, I'll just write a blog.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: Bug? Small samples in TABLESAMPLE SYSTEM returns zero rows
Next
From: Peter Geoghegan
Date:
Subject: Re: 9.5 release notes