Re: Selecting K random rows - efficiently! - Mailing list pgsql-general

From Martijn van Oosterhout
Subject Re: Selecting K random rows - efficiently!
Date
Msg-id 20071024130811.GC29030@svana.org
In response to Re: Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
Responses Re: Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
List pgsql-general
On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: How do I sample a subset of size
> K efficiently? A query like
>
>    SAMPLE 1000 OF
>    (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution
most algorithms require you to either know how many rows there are, or
require you to scan the whole table (or index).
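The full-scan family of algorithms the post alludes to can be sketched as classic reservoir sampling (Algorithm R), which reads every row once but never needs the total count up front. This is an illustrative sketch, not something from the original thread; the function name and interface are my own:

```python
import random

def reservoir_sample(rows, k, rng=random):
    """Return k uniformly random rows from an iterable of unknown length.

    Algorithm R: a single sequential scan, O(k) memory, and no need to
    know the row count in advance. After processing i rows, each row
    seen so far is in the sample with probability k / i.
    """
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # evicting a uniformly chosen current member.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Applied to a table scan (e.g. a cursor over `SELECT * FROM mydata WHERE <some condition>`), this gives a truly uniform sample at the cost of reading every matching row.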

With some simplifying assumptions, you can try extracting them from an
index, with the caveat that if your index is unbalanced in any way, the
selection won't be "random".
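The index caveat can be made concrete with a toy model: pick a uniformly random value in the key range and take the nearest indexed key at or above it. Rows that sit after large gaps in the key distribution get picked disproportionately often, which is exactly why the result is not truly random. The helper below is a hypothetical illustration, not code from the thread:

```python
import bisect
import random

def probe_sample(sorted_keys, k, lo, hi, rng=random):
    """Approximate sampling via random probes into a sorted key set.

    Mimics descending an index at a uniformly random key value drawn
    from [lo, hi): each probe returns the first key >= the probe value.
    If the keys are unevenly distributed, a key following a large gap
    "absorbs" the whole gap's probability mass, so the sample is biased.
    """
    picks = []
    n = len(sorted_keys)
    for _ in range(k):
        v = rng.uniform(lo, hi)
        i = bisect.bisect_left(sorted_keys, v)
        picks.append(sorted_keys[i % n])  # wrap around if the probe lands past the end
    return picks
```

With keys clustered at the low end of the range, e.g. `[0, 1, 2, 3, 999]` over `[0, 1000)`, almost every probe falls into the gap and returns 999, showing the skew the post warns about.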

> should return 1000 random rows from the select statement so that two
> consecutive evaluations of the query would only with very little
> probability return the same 1000 rows.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL)

Presumably your table is very much bigger than that, in which case I
suppose the not-entirely-random selection is unlikely to play much of a
role.

Search the archives; solutions have been proposed before, though
they probably aren't very quick...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

