Re: Selecting K random rows - efficiently! - Mailing list pgsql-general

From vincent
Subject Re: Selecting K random rows - efficiently!
Date
Msg-id 24973.84.107.170.210.1193735283.squirrel@webmail.xs4all.nl
Whole thread Raw
In response to Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
Responses Re: Selecting K random rows - efficiently!
List pgsql-general
> 2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
>> In article <ffnid8$1q2t$1@news.hub.org>, cluster  <skrald@amossen.dk>
>> wrote:
>> % > How important is true randomness?
>> %
>> % The goal is an even distribution but currently I have not seen any way
>> % to produce any kind of random sampling efficiently. Notice the word
>>
>> How about generating the ctid randomly? You can get the number of pages
>> from pg_class and estimate the number of rows either using the number
>> of tuples in pg_class or just based on what you know about the data.
>> Then just generate two series of random numbers, one from 0 to the
>> number
>> of pages and the other from 1 to the number of rows per page, and keep
>> picking rows until you have enough numbers. Assuming there aren't too
>> many dead tuples and your estimates are good, this should retrieve n
>> rows
>> with roughly n look-ups.
>>
>> If your estimates are low, there will be tuples which can never be
>> selected,
>> and so far as I know, there's no way to construct a random ctid in a
>> stock
>> postgres database, but apart from that it seems like a good plan. If
>> efficiency is important, you could create a C function which returns a
>> series of random tids and join on that.
>> --
>>
>
> SELECT id, ...
>    FROM data
>   WHERE id = ANY(ARRAY(
>                        SELECT (random()*max_id)::int
>                           FROM generate_series(1,20)))
>   LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between
zero and the maximum id. That's not truely random, nor is it completely
safe if there are lots of gaps in the values of id. For example, if the
lowest id  is 50000 and the highest is 50040, this will be very likely to
generate lots of numbers below 50000 and find no records at all.


pgsql-general by date:

Previous
From: "Pavel Stehule"
Date:
Subject: Re: Selecting K random rows - efficiently!
Next
From: Alexis Beuraud
Date:
Subject: Checking empty array