Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Qi Huang
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id BAY159-W2300077743BFF39B90434BA3160@phx.gbl
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Ants Aasma <ants@cybertec.at>)
Responses Re: Gsoc2012 idea, tablesample
List pgsql-hackers
Hi, All
    Thanks for your ideas on the implementation of TABLESAMPLE. I have a summary below of the high level requirements from the -hacker thread till now. Please give further comment and if I missed any point, please fell free to add. 

1. Build a new type of node, as I should not use SEQSCAN node, which will be scanning the whole table. One of the aims of TABLESAMPLE is to avoid scanning every row and thus save time. 

2. use TIDSCAN to directly access tuples. The below way of using ctid proposed by Kevin looks good.

-One technique which might be suitably random without reading the
-whole table would be to figure out a maximum block number and tuple
-ID for the table, and generate a series of random ctid values to
-read. If the tuple doesn't exist or is not visible to the snapshot,
-you ignore it and continue, until you have read the requisite number
-of rows. You could try to generate them in advance and sort them by
-block number, but then you need to solve the problems of what to do
-if that set of ctids yields too many rows or too few rows, both of
-which have sticky issues.
--Kevin
   I think this technique could be considered as an implementation algo for BERNOULLI method. It looks that it could still reduce a lot of cost compared to just assign random number to every tuple and then retrieve. 

3. Adding specific sampling, like dollar unit sampling, or geographic index support, which meets specific requirement, but costs more. 
-I have a gut feeling that Dollar Unit Sampling and other weighted
-samples can be done with a similar approach of uniformly sampling
-blocks and then running weighted reservoir sampling [1] over the
-result.
-Cheers,
--Ants Aasma

  Aasma proposed the above method for doing these problems, of using the reservoir weighted random sampling. 

4. Implement the way ANALYZE is doing sampling. 
-I'm looking for a way to fetch random samples these days so I confirm
-the need for a quick way to fetch the same sample that "analyze"
-command fetches but at SQL level.
---strk; 
    Proposed by Sandro Santilli, we can look at the way ANALYZE is doing the sampling and try to implement it in TABLESAMPLE. 



The above are the four points I summarized from the -hackers discussion. I put a brief introduction the this gSoc project on Postgres Wiki. You are welcomed to check and modify to enhance it. 


I will add the ideas formalized in this mailing thread to the wiki page once our discussion has been completed. 


Thanks.


Best Regards
Huang Qi Victor
Computer Science of National University of Singapore

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Draft release notes complete
Next
From: Magnus Hagander
Date:
Subject: Re: Draft release notes complete