Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id CA+CSw_twJkJ0P45oGNXRJ4RFfknEuuR4nVd=GqXUGw4gGm-HUA@mail.gmail.com
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Sandro Santilli <strk@keybit.net>)
Responses Re: Gsoc2012 idea, tablesample  (Sandro Santilli <strk@keybit.net>)
List pgsql-hackers
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk@keybit.net> wrote:
> I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

Among other things, this would enable user-space implementation of
tablesample. Given the operator =~(tid, int) that matches the page
number and planner/executor integration so that it results in a TID
scan, you would need the following functions:

random_pages(tbl regclass, samples int) returns int[]
aggregate function:
reservoir_sample(item anyelement, samples int) returns anyarray

Implementations for both of the functions could be adapted from analyze.c.

Then tablesample could be implemented with the following query:
SELECT (SELECT reservoir_sample(some_table, 50) AS samples  FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
FROM random_pages('some_table', 50) AS rnd_pgtids;

Actually, now that I think about it, it could actually be implemented
without any modifications to core at some cost to efficiency.
random_pages would have to return tid[] that contains for each
generated pagenumber all possible tids on that page.

By making the building blocks available users get more flexibility.
The downside would be that we can't automatically make better sampling
methods available.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: psql omits row count under "\x auto"
Next
From: Robert Haas
Date:
Subject: Re: Patch: add timing of buffer I/O requests