Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Sandro Santilli
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id 20120424064926.GE7891@gnash
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Ants Aasma <ants@cybertec.at>)
Responses Re: Gsoc2012 idea, tablesample
List pgsql-hackers
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
> On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk@keybit.net> wrote:
> > I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> > using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
> 
> Among other things, this would enable user-space implementation of
> tablesample. Given the operator =~(tid, int) that matches the page
> number and planner/executor integration so that it results in a TID
> scan, you would need the following functions:
> 
> random_pages(tbl regclass, samples int) returns int[]
> aggregate function:
> reservoir_sample(item anyelement, samples int) returns anyarray
> 
> Implementations for both of the functions could be adapted from analyze.c.
> 
> Then tablesample could be implemented with the following query:
> SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>    FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> FROM random_pages('some_table', 50) AS rnd_pgtids;
> 
> Actually, now that I think about it, it could actually be implemented
> without any modifications to core at some cost to efficiency.
> random_pages would have to return tid[] that contains for each
> generated pagenumber all possible tids on that page.

This is exactly what I'm after.
I've actually started crafting such a TableSample function and I'm in the
process to refine the signature so your suggested interface above is 
very useful, thanks !

But I don't understand the reservoir_sample call, what is it supposed to do ?
And how flexibly "anyarray" return would be ? Could you return arbitrary 
typed rowtypes from it ?

> By making the building blocks available users get more flexibility.
> The downside would be that we can't automatically make better sampling
> methods available.

One approach doesn't preclude the other. TABLESAMPLE will still be useful,
also for SQL compliance.

--strk; 
 ,------o-.  |   __/  |    Delivering high quality PostGIS 2.0 ! |  / 2.0 |    http://strk.keybit.net -
http://vizzuality.com`-o------'
 



pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments
Next
From: Heikki Linnakangas
Date:
Subject: Re: [BUG] Checkpointer on hot standby runs without looking checkpoint_segments