Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Qi Huang
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id BAY159-W354843E3FF73CC06DD86F0A3230@phx.gbl
Whole thread Raw
In response to Gsoc2012 idea, tablesample  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Gsoc2012 idea, tablesample  (Sandro Santilli <strk@keybit.net>)
List pgsql-hackers
Hi, Heikki

> 1. We probably don't want the SQL syntax to be added to the grammar. 
> This should be written as an extension, using custom functions as the 
> API, instead of extra SQL syntax.

> 2. It's not very useful if it's just a dummy replacement for "WHERE 
> random() < ?". It has to be more advanced than that. Quality of the 
> sample is important, as is performance. There was also an interesting 
> idea of on implementing monetary unit sampling.


> Another idea that Robert Haas suggested was to add support doing a TID 
> scan for a query like "WHERE ctid< '(501,1)'". That's not enough work 
> for GSoC project on its own, but could certainly be a part of it.


   Based on the discussion these days and my understanding, I don't see much change to be made in my proposal. For the 3 points you raised, the first one and the last one are still not clear. Especially the first point, I see that some people think making the SQL syntax into grammar is first choice. For the second point, the SQL standard 2003 defines two methods for sampling, SYSTEM and BERNOULLI. I think there might be possible quality refinement for them. For the optimization statistics, I have an idea of using it to assign different sampling percentages to different pages, but I'm not sure about the detail yet, I need to see into and learn the optimization statistics (this part is mentioned by Neil in his idea, so I think there should be way of using it). Also there might be enhance on specific sampling, like monetary unit sampling or the geographic indexes sampling. I can do this part(sampling quality improvement) as research based project. We can still discuss deeper to see whether these can be done and how we can do them. 

I post my current amended proposal below. The changes are in red color. 

-----------------------------------------------------------------------------------------------------------------------

Project Details: 

Neil Conway has come up with an implementation at 2007 while he gave a talk of introducing to hacking in PostgreSQL. The code was just for demo purpose and was incomplete. It was not integrated into PostgreSQL patch. The PostgreSQL now is quite different from 2007. To implement this query, I need to understand Neil’s implementation and use the general idea to implement in the most up-to-date PostgreSQL release. In the end, I need to test the implementation till it can be released in the next patch. 

I will also explore possible ways of further enhancing the sampling quality, like using optimization statistics to produce more accurate sample. There is also suggestions that I can integrate different sampling types into this query, like for accounting data, I can use "monetary unit sampling" to get the result, or implement the geographic indexes sampling. These specific types of sampling might require at least a sequence scan on data, which means slow down the sampling speed, but sampling quality will enhance greatly and be very useful in some specific fields. 


List of features:      

1. TABLESAMPLE using select, delete, and update

2. SYSTEM method

3. REPEATABLE support

4. BERNOULLI method

5. Use optimizer statistics to produce a more accurate sample

6. non-integer sample percentage and repeat seed

7. sampling quality enhancement

4, 5 and 6 are not included in Neil’s implementation.

For 5, we can use optimizer statistics to refine the algorithm for the random number selection of pages or rows. The sample produced shall be more accurate.


Inch-stones: 

1.  Conduct the basic features' implementation, able to query TABLESAMPLE clause using select, SYSTEM, with different combination of SQL queries. 

2.  Implementation of other basic features, REPEATABLE and BERNOULLI.

3.  Improvement implementation. Support for using optimizer statistics to produce more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement

 

Project Schedule: 

1.     From April 23rd-May 10th: learning and understanding.

2.     From Mid May- Mid June: implement simple TABLESAMPLE clause, with SYSTEM method, and no REPEATABLE support. And do testing.

3.     Mid June-Mid July: implement other supports, like REPEATABLE clause, and BERNOULLI method, and do testing. Improvement 5 and 6 are also implemented now. 

4.     Mid July- Mid Aug: Explore ways of improving sampling quality should be done at period 2 and 3. This period will be used to implement those ideas. 

----------------------------------------------------------------------------------------------------------------------- 







Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Plan stability versus near-exact ties in cost estimates
Next
From: "Albe Laurenz"
Date:
Subject: Re: Plan stability versus near-exact ties in cost estimates