On 03/21/2012 11:49 AM, Robert Haas wrote:
> On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> Robert Haas<robertmhaas@gmail.com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not. Maybe you could do it with
>>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>>> sample the table without reading all of it first, so that seems to
>>> miss the point.
>> I think actually the traditional locution is more like
>> WHERE random()< constant
>> where the constant is the fraction of the table you want. And yeah,
>> the presumption is that you'd like it to not actually read every row.
>> (Though unless the sampling density is quite a bit less than 1 row
>> per page, it's not clear how much you're really going to win.)
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".
>
> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.
>
> rhaas=# explain select * from randomtext where ctid>= '(500,1)' and
> ctid< '(501,1)';
> QUERY PLAN
> --------------------------------------------------------------------
> Seq Scan on randomtext (cost=0.00..111764.90 rows=25000 width=31)
> Filter: ((ctid>= '(500,1)'::tid) AND (ctid< '(501,1)'::tid))
> (2 rows)
>
> The last time this came up for me was when I was trying to find which
> row in a large table as making the SELECT blow up; but it seems like
> it could be used to implement a poor man's sampling method, too... it
> would be nicer, in either case, to be able to specify the block
> numbers you'd like to be able to read, rather than bounding the CTID
> from both ends as in the above example.
That would rapidly get unmanageable when you wanted lots of pages.
Maybe we could do something like a pagenum pseudovar, or a wildcard
match for ctid against '(123,*)'.
cheers
andrew