Re: Gsoc2012 idea, tablesample - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id CA+CSw_up_aWm1LOp4A4RWqPPb56hUuWS8+OMSu3xU78KwXChhQ@mail.gmail.com
Whole thread Raw
In response to Re: Gsoc2012 idea, tablesample  (Sandro Santilli <strk@keybit.net>)
Responses Re: Gsoc2012 idea, tablesample  (Qi Huang <huangqiyx@hotmail.com>)
List pgsql-hackers
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli <strk@keybit.net> wrote:
> On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
>> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
>
>> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>> >    FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
>> > FROM random_pages('some_table', 50) AS rnd_pgtids;
>>
>> But I don't understand the reservoir_sample call, what is it supposed to do ?
>
> Ok got it, that was probably to avoid:
>
>  ERROR:  more than one row returned by a subquery used as an expression

No, it's to avoid bias towards tuples on more sparsely populated
pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027

> The advanced TID operator would be for random_tids to only return pages rather
> than full tids...

Exactly. But when mainly IO bound (ie. sampling from a large table on
spinning rust) the overhead of probing with TID scan as opposed to
sequentially scanning the pages should be small enough. When CPU bound
I suspect that the function call machinery overhead for
reservoir_sample is going to become a large issue, so a built in
tablesample also has an edge there.


Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


pgsql-hackers by date:

Previous
From: Michael Meskes
Date:
Subject: Re: ECPG FETCH readahead
Next
From: "Albe Laurenz"
Date:
Subject: Re: Foreign table scan estimates