Re: Selecting K random rows - efficiently! - Mailing list pgsql-general

From Martijn van Oosterhout
Subject Re: Selecting K random rows - efficiently!
Date
Msg-id 20071024130811.GC29030@svana.org
In response to Re: Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
Responses Re: Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
List pgsql-general
On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: How do I sample a subset of size
> K efficiently? A query like
>
>    SAMPLE 1000 OF
>    (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution
most algorithms require you to either know how many rows there are, or
require you to scan the whole table (or index).
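The full-scan family of algorithms the post alludes to can be sketched as classic reservoir sampling (Algorithm R), which reads every row once but never needs the total count up front. This is an illustrative sketch, not something from the original thread; the function name and interface are my own:

```python
import random

def reservoir_sample(rows, k, rng=random):
    """Return k uniformly random rows from an iterable of unknown length.

    Algorithm R: a single sequential scan, O(k) memory, and no need to
    know the row count in advance. After processing i rows, each row
    seen so far is in the sample with probability k / i.
    """
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # evicting a uniformly chosen current member.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Applied to a table scan (e.g. a cursor over `SELECT * FROM mydata WHERE <some condition>`), this gives a truly uniform sample at the cost of reading every matching row.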

With some simplifying assumptions, you can try extracting them from an
index, with the caveat that if your index is unbalanced in any way, the
selection won't be "random".
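The index caveat can be made concrete with a toy model: pick a uniformly random value in the key range and take the nearest indexed key at or above it. Rows that sit after large gaps in the key distribution get picked disproportionately often, which is exactly why the result is not truly random. The helper below is a hypothetical illustration, not code from the thread:

```python
import bisect
import random

def probe_sample(sorted_keys, k, lo, hi, rng=random):
    """Approximate sampling via random probes into a sorted key set.

    Mimics descending an index at a uniformly random key value drawn
    from [lo, hi): each probe returns the first key >= the probe value.
    If the keys are unevenly distributed, a key following a large gap
    "absorbs" the whole gap's probability mass, so the sample is biased.
    """
    picks = []
    n = len(sorted_keys)
    for _ in range(k):
        v = rng.uniform(lo, hi)
        i = bisect.bisect_left(sorted_keys, v)
        picks.append(sorted_keys[i % n])  # wrap around if the probe lands past the end
    return picks
```

With keys clustered at the low end of the range, e.g. `[0, 1, 2, 3, 999]` over `[0, 1000)`, almost every probe falls into the gap and returns 999, showing the skew the post warns about.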

> should return 1000 random rows from the select statement so that two
> consecutive evaluations of the query would only with very little
> probability return the same 1000 rows.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL)

Presumably your table is very much bigger than that, in which case I
suppose the not-entirely-random selection is unlikely to play much of a
role.

Search the archives; solutions have been proposed before, though
they probably aren't very quick...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

