Selecting K random rows - efficiently! - Mailing list pgsql-general

From cluster
Subject Selecting K random rows - efficiently!
Date
Msg-id ffn044$29uq$1@news.hub.org
Whole thread Raw
Responses Re: Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
Re: Selecting K random rows - efficiently!  (Paul Tillotson <spam1011@adelphia.net>)
Re: Selecting K random rows - efficiently!  ("Scott Marlowe" <scott.marlowe@gmail.com>)
List pgsql-general
It has been suggested [1] that a good way to select K random rows
efficiently from a table is to
   1) add a new column, random_number, to the table
      and initialize it with random()
   2) perform the following query:
        SELECT *
        FROM mydata
        WHERE random_number >= (SELECT RANDOM() OFFSET 0)
        ORDER BY random_number ASC LIMIT <K>
      Here <K> is the number of random rows to pick. E.g. 100.

The benefit in this solution is that the "random_number" column can be
indexed, allowing the query to be performed using a simple and fast
index scan.

However, there is a couple of major drawbacks in this method:

1) The smaller "random_number" is, the less likely is it that it will be
picked when using this method. Example: A random_number close to zero
will only have a very small probability to be selected. The solution is
to reassign "random_number" every now and then in order to "even out"
the selection probabilities over time.
PROBLEM: If the number of rows are large (e.g. 200.000 or even a million
or more), the update query:
    UPDATE mydata SET random_number = random();
might be very resource demanding, take a lot of time and pretty much
slow down all other transactions because it eats up all resources.

2) The query to select K random rows orders the rows by "random_number"
and selects the first K rows that have "random_number" larger than some
other random number R. But what happens if there is less than K rows
with random_value >= R? How can the rest of the random rows be picked
efficiently?

Any ideas on how to deal with these drawbacks?

References:
[1] http://tinyurl.com/tyg4m

pgsql-general by date:

Previous
From: Gábor Farkas
Date:
Subject: deadlock detected, only selects (not select-for-update)
Next
From: Laurent ROCHE
Date:
Subject: Re : pg_dump auto login