On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
> On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk@keybit.net> wrote:
> > I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> > using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
>
> Among other things, this would enable user-space implementation of
> tablesample. Given the operator =~(tid, int) that matches the page
> number and planner/executor integration so that it results in a TID
> scan, you would need the following functions:
>
> random_pages(tbl regclass, samples int) returns int[]
> aggregate function:
> reservoir_sample(item anyelement, samples int) returns anyarray
>
> Implementations for both of the functions could be adapted from analyze.c.
>
> Then tablesample could be implemented with the following query:
> SELECT (SELECT reservoir_sample(some_table, 50) AS samples
> FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> FROM random_pages('some_table', 50) AS rnd_pgtids;
>
> Actually, now that I think about it, it could actually be implemented
> without any modifications to core at some cost to efficiency.
> random_pages would have to return tid[] that contains for each
> generated pagenumber all possible tids on that page.
This is exactly what I'm after.
I've actually started crafting such a TableSample function and I'm in the
process to refine the signature so your suggested interface above is
very useful, thanks !
But I don't understand the reservoir_sample call, what is it supposed to do ?
And how flexibly "anyarray" return would be ? Could you return arbitrary
typed rowtypes from it ?
> By making the building blocks available users get more flexibility.
> The downside would be that we can't automatically make better sampling
> methods available.
One approach doesn't preclude the other. TABLESAMPLE will still be useful,
also for SQL compliance.
--strk;
,------o-. | __/ | Delivering high quality PostGIS 2.0 ! | / 2.0 | http://strk.keybit.net -
http://vizzuality.com`-o------'