Re: TABLESAMPLE patch - Mailing list pgsql-hackers
From: Petr Jelinek
Subject: Re: TABLESAMPLE patch
Msg-id: 55272050.8080607@2ndquadrant.com
In response to: Re: TABLESAMPLE patch (Peter Eisentraut <peter_e@gmx.net>)
Responses: Re: TABLESAMPLE patch; Re: TABLESAMPLE patch
List: pgsql-hackers
On 09/04/15 21:30, Peter Eisentraut wrote:
>
> In the SQL standard, the TABLESAMPLE clause is attached to a table
> expression (<table primary>), which includes table functions,
> subqueries, CTEs, etc.  In the proposed patch, it is attached to a table
> name, allowing only an ONLY clause.  So this is a significant deviation.
>

I wouldn't call something that implements a subset of the standard a deviation, especially when other major databases have chosen the same approach (AFAIK the only database that supports sampling over anything besides physical relations is Oracle, and their sampling works slightly differently from what the standard describes).

> Obviously, doing block sampling on a physical table is a significant use
> case, but we should be clear about which restrictions and tradeoffs we
> are making now and in the future, especially if we are going to present
> extension interfaces.  The fact that physical tables are interchangeable
> with other relation types, at least in data-reading contexts, is a
> feature worth preserving.

Yes, but I don't think there is anything that prevents us from adding this in the future. The sampling scan could be made to read both directly from the heap and from an executor subnode, which is doable even if it won't be extremely pretty (though it should be easy to encapsulate into two internal interfaces, as the heap reading is already encapsulated into one internal interface). Another approach would be to have two different executor nodes - SamplingScan and SamplingFilter - and let the planner pick one depending on the source of the TABLESAMPLE clause.

The extension API currently consists mainly of:

nextblock - gets the next block id to read from the heap
nexttuple - gets the next tuple id to read from the current block
examinetuple - lets the extension decide whether the tuple should indeed be returned (this one is optional)

For the executor-node reading we'd probably just call examinetuple, as there are no block ids or tuple ids there.
This makes the API look slightly schizophrenic, but on the other hand it gives the plugins control over how the physical relation is read when that is indeed the source. And I guess we could let the plugin specify whether it supports the heap access (nextblock/nexttuple); if it doesn't, the planner would always choose SamplingFilter over a sequential scan for a physical relation instead of SamplingScan. All of this can be added without breaking compatibility with what is currently proposed for commit.

The reasons why we need the nextblock and nexttuple interfaces and the ability to read the heap directly are a) block sampling can't be done by reading from another executor node, and b) performance.

> It may be worth thinking about some examples of other sampling methods,
> in order to get a better feeling for whether the interfaces are appropriate.

There is one additional method which exists purely for testing the interface; it uses a column value to decide whether the tuple should be returned (which is obviously useless in practice, since you can do the same with WHERE - it just shows how to use the interface). I would like to eventually have something that's time limited rather than size limited, for example. I didn't think much about other sampling algorithms, but Simon proposed some and they should work with the current API.

> Earlier in the thread, someone asked about supporting specifying a
> number of rows instead of percents.  While not essential, that seems
> pretty useful, but I wonder how that could be implemented later on if we
> take the approach that the argument to the sampling method can be an
> arbitrary quantity that is interpreted only by the method.

Well, you can take two approaches to this: either allow some specific set of keywords that can be used to specify the limit, or let the sampling methods interpret the parameters. I believe the latter is more flexible.
There is nothing stopping somebody from writing a sampling method which takes the limit as a number of rows, or anything else. Also, for BERNOULLI to work correctly you'd need to convert the number of rows to a fraction of the table anyway (which is exactly what the one database that has this feature does internally), and then it's no different from passing (SELECT 100/reltuples*number_of_rows FROM tablename) as a parameter.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services