Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Qi Huang
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id BAY159-W399366318A30988BCF81FDA33C0@phx.gbl
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Ants Aasma <ants@cybertec.at>)
List pgsql-hackers
> Date: Wed, 18 Apr 2012 02:45:09 +0300
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
> From: ants@cybertec.at
> To: cbbrowne@gmail.com
> CC: sfrost@snowman.net; pgsql-hackers@postgresql.org

> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> > Well, there may be cases where the quality of the sample isn't
> > terribly important, it just needs to be "reasonable."
> >
> > I browsed an article on the SYSTEM/BERNOULLI representations; they
> > both amount to simple picks of tuples.
> >
> > - BERNOULLI implies picking tuples with a specified probability.
> >
> > - SYSTEM implies picking pages with a specified probability.  (I think
> > we mess with this in ways that'll be fairly biased in view that tuples
> > mayn't be of uniform size, particularly if Slightly Smaller strings
> > stay in the main pages, whilst Slightly Larger strings get TOASTed...) 

Looking at the definition of BERNOULLI method and it means to scan all the tuples, I always have a question. What is the difference of using BERNOULLI method with using "select * .... where rand() < 0.1"? They will both go through all the tuples and cost a seq-scan. 
If the answer to the above question is "no difference", I have one proposal for another method of BERNOULLI. For a relation, we can have all their tuples assigned an unique and continuous ID( we may use ctid or others). Then for each number in the set of IDs, we assign a random number and check whether that is smaller than the sampling percentage. If it is smaller, we retrieve the tuple corresponding to that ID. This method will not seq scan all the tuples, but it can sample by picking tuples.
Thanks 

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore




pgsql-hackers by date:

Previous
From: "Greg Sabino Mullane"
Date:
Subject: Re: Bug tracker tool we need
Next
From: Robert Haas
Date:
Subject: Re: Bug tracker tool we need