Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers
From | Qi Huang |
---|---|
Subject | Re: Gsoc2012 idea, tablesample |
Date | |
Msg-id | BAY159-W354843E3FF73CC06DD86F0A3230@phx.gbl Whole thread Raw |
In response to | Gsoc2012 idea, tablesample (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
Responses |
Re: Gsoc2012 idea, tablesample
|
List | pgsql-hackers |
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of on implementing monetary unit sampling.
>
>
> Another idea that Robert Haas suggested was to add support doing a TID
> scan for a query like "WHERE ctid< '(501,1)'". That's not enough work
> for GSoC project on its own, but could certainly be a part of it.
>
Project Details:
Neil Conway has come up with an implementation at 2007 while he gave a talk of introducing to hacking in PostgreSQL. The code was just for demo purpose and was incomplete. It was not integrated into PostgreSQL patch. The PostgreSQL now is quite different from 2007. To implement this query, I need to understand Neil’s implementation and use the general idea to implement in the most up-to-date PostgreSQL release. In the end, I need to test the implementation till it can be released in the next patch.
I will also explore possible ways of further enhancing the sampling quality, like using optimization statistics to produce more accurate sample. There is also suggestions that I can integrate different sampling types into this query, like for accounting data, I can use "monetary unit sampling" to get the result, or implement the geographic indexes sampling. These specific types of sampling might require at least a sequence scan on data, which means slow down the sampling speed, but sampling quality will enhance greatly and be very useful in some specific fields.
List of features:
1. TABLESAMPLE using select, delete, and update
2. SYSTEM method
3. REPEATABLE support
4. BERNOULLI method
5. Use optimizer statistics to produce a more accurate sample
6. non-integer sample percentage and repeat seed
7. sampling quality enhancement
4, 5 and 6 are not included in Neil’s implementation.
For 5, we can use optimizer statistics to refine the algorithm for the random number selection of pages or rows. The sample produced shall be more accurate.
Inch-stones:
1. Conduct the basic features' implementation, able to query TABLESAMPLE clause using select, SYSTEM, with different combination of SQL queries.
2. Implementation of other basic features, REPEATABLE and BERNOULLI.
3. Improvement implementation. Support for using optimizer statistics to produce more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement.
Project Schedule:
1. From April 23rd-May 10th: learning and understanding.
2. From Mid May- Mid June: implement simple TABLESAMPLE clause, with SYSTEM method, and no REPEATABLE support. And do testing.
3. Mid June-Mid July: implement other supports, like REPEATABLE clause, and BERNOULLI method, and do testing. Improvement 5 and 6 are also implemented now.
4. Mid July- Mid Aug: Explore ways of improving sampling quality should be done at period 2 and 3. This period will be used to implement those ideas.
-----------------------------------------------------------------------------------------------------------------------
Best Regards and Thanks
pgsql-hackers by date: