Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Gsoc2012 Idea --- Social Network database schema
Date
Msg-id 4F69FAF5.8000809@dunslane.net
Whole thread Raw
In response to Re: Gsoc2012 Idea --- Social Network database schema  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

On 03/21/2012 11:49 AM, Robert Haas wrote:
> On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Robert Haas<robertmhaas@gmail.com>  writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not.  Maybe you could do it with
>>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>>> sample the table without reading all of it first, so that seems to
>>> miss the point.
>> I think actually the traditional locution is more like
>>         WHERE random()<  constant
>> where the constant is the fraction of the table you want.  And yeah,
>> the presumption is that you'd like it to not actually read every row.
>> (Though unless the sampling density is quite a bit less than 1 row
>> per page, it's not clear how much you're really going to win.)
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".
>
> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.
>
> rhaas=# explain select * from randomtext where ctid>= '(500,1)' and
> ctid<  '(501,1)';
>                               QUERY PLAN
> --------------------------------------------------------------------
>   Seq Scan on randomtext  (cost=0.00..111764.90 rows=25000 width=31)
>     Filter: ((ctid>= '(500,1)'::tid) AND (ctid<  '(501,1)'::tid))
> (2 rows)
>
> The last time this came up for me was when I was trying to find which
> row in a large table as making the SELECT blow up; but it seems like
> it could be used to implement a poor man's sampling method, too... it
> would be nicer, in either case, to be able to specify the block
> numbers you'd like to be able to read, rather than bounding the CTID
> from both ends as in the above example.


That would rapidly get unmanageable when you wanted lots of pages.


Maybe we could do something like a pagenum pseudovar, or a wildcard 
match for ctid against '(123,*)'.

cheers

andrew



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Command Triggers
Next
From: Tom Lane
Date:
Subject: Re: Gsoc2012 Idea --- Social Network database schema