Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Florian Pflug
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id 87B3326E-03B2-4724-8FF8-2C07AA0FC854@phlo.org
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Florian Pflug <fgp@phlo.org>)
Responses Re: Gsoc2012 idea, tablesample
List pgsql-hackers
[ Sorry for the self-reply ]

On May11, 2012, at 13:17 , Florian Pflug wrote:
> Actually, thinking more about this, if the approach sketched above
> turns out to work, then one might be able to get away without remembering
> previously computed TIDs, thus removing a lot of complexity.
>
> Say, for example, you simply start out with a single random TID tid[0].
> The you produce the sequence of "random" TIDs by setting
>
>  tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)
>
> where D is the expected distance between one TID from the sample set
> and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once
> tid[n] >) #TIDs). This will give you approximately uniformly distributed
> TIDs, I think, and will even return them in physical order. The 100$ question
> is, by how much does this violate the independence requirement, i.e. how
> far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
> Some search through the literature should be able to answer that.

Actually, one can easily do better then that. Say you've determined that
to pick each tuple with probability p_tup, you need to pick each TID with
probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability
of a single TID being live). Now, looping over all TIDs and picking each with
probability p_tid is, of course, just a another (more error-prone) way of
iterating over all tuples and picking each with probability p_tup. But instead
of looping, you can jump from from picked TID to the next. Observe that, after
having picked tid X, the probability of picking X+i as the next tid is
(1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs
and then pick the i-th one. Thus, the following algorithm returns a sorted list
of TIDs which should be uniformly distributed and independent (or so I think,
at least)
 1) Starting with a random tid tid[0], chosen such that      P(tid[0] = i) = (1 - p_tid)^i * p_tid
 2) Setting tid[n+1] = tid[n] + d, which d > 0 chosen such that      P(d = i) = (1 - p_tid)^(i-1) * p_tid
 3) Abort if max(tid) is >= #TIDs.

This all hinges on the ability to produce a sufficient accurate estimate of the
TID density p_tup/p_tid, of course.

best regards,
Florian Pflug



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Draft release notes complete
Next
From: Sandro Santilli
Date:
Subject: Re: Gsoc2012 idea, tablesample