Thread: Selecting K random rows - efficiently!
It has been suggested [1] that a good way to select K random rows efficiently from a table is to

1) add a new column, random_number, to the table and initialize it with random()

2) perform the following query:

   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT <K>

Here <K> is the number of random rows to pick, e.g. 100.

The benefit of this solution is that the "random_number" column can be indexed, allowing the query to be performed using a simple and fast index scan. However, there are a couple of major drawbacks to this method:

1) The smaller "random_number" is, the less likely it is that the row will be picked: a random_number close to zero has only a very small probability of being selected. The solution is to reassign "random_number" every now and then in order to even out the selection probabilities over time. PROBLEM: If the number of rows is large (e.g. 200,000 or even a million or more), the update query

   UPDATE mydata SET random_number = random();

might be very resource-demanding, take a lot of time, and pretty much slow down all other transactions because it eats up all resources.

2) The query to select K random rows orders the rows by "random_number" and selects the first K rows that have "random_number" larger than some other random number R. But what happens if there are fewer than K rows with random_number >= R? How can the rest of the random rows be picked efficiently?

Any ideas on how to deal with these drawbacks?

References:
[1] http://tinyurl.com/tyg4m
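For concreteness, a sketch of the setup behind [1], with assumed names; the windowed UPDATE at the end is one common way to soften drawback 1 by re-randomizing a slice at a time:

   -- One-time setup for the indexed random-column approach:
   ALTER TABLE mydata ADD COLUMN random_number double precision;
   UPDATE mydata SET random_number = random();
   CREATE INDEX mydata_random_number_idx ON mydata (random_number);

   -- K = 100 rows via a fast index scan (OFFSET 0 keeps the subquery
   -- from being flattened, so random() is evaluated once per query):
   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT random() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT 100;

   -- Mitigating drawback 1: re-randomize in windows of an assumed
   -- integer primary key "id" instead of one huge UPDATE:
   UPDATE mydata SET random_number = random()
   WHERE id BETWEEN 1 AND 10000;   -- repeat for successive windows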
Another way to look at the problem is: how do I sample a subset of size K efficiently? A query like

   SAMPLE 1000 OF
   (SELECT * FROM mydata WHERE <some condition>)

should return 1000 random rows from the select statement, such that two consecutive evaluations of the query would return the same 1000 rows only with very small probability. (Yes, I know that "SAMPLE 1000 OF" is not valid SQL.)
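As a historical aside: later PostgreSQL releases (9.5 and up) added standard syntax quite close to this wish, though it samples a base table rather than an arbitrary subquery and takes a percentage rather than an exact K:

   -- Roughly 1% of rows, sampled row by row:
   SELECT * FROM mydata TABLESAMPLE BERNOULLI (1);

   -- Block-level sampling: faster but less uniform; REPEATABLE fixes the seed.
   SELECT * FROM mydata TABLESAMPLE SYSTEM (1) REPEATABLE (42);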
On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: how do I sample a subset of size
> K efficiently? A query like
>
>    SAMPLE 1000 OF
>    (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution, most algorithms require you either to know how many rows there are, or to scan the whole table (or index). With some simplifying assumptions, you can try extracting the rows from an index, with the caveat that if your index is unbalanced in any way, the selection won't be "random".

> should return 1000 random rows from the select statement, such that two
> consecutive evaluations of the query would return the same 1000 rows
> only with very small probability.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL.)

Presumably your table is very much bigger than that, in which case I suppose the not-entirely-random selection is unlikely to play much of a role.

Search the archives; there have been solutions proposed before, though they probably aren't very quick...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
> How important is true randomness?

The goal is an even distribution, but so far I have not seen any way to produce any kind of random sampling efficiently. Notice the word "efficiently". The naive way of taking a random sample of size K,

   SELECT * FROM mydata ORDER BY random() LIMIT <K>

is clearly not an option for performance reasons. It shouldn't be necessary to explain why. :-)

> Search the archives; there have been solutions proposed before, though
> they probably aren't very quick...

As the subject suggests, performance really matters, and searching the archives only turns up poor solutions (my first post explains why).
cluster wrote:
> It has been suggested [1] that a good way to select K random rows
> efficiently from a table is to
> [...]
>
> 1) The smaller "random_number" is, the less likely it is that the row
> will be picked: a random_number close to zero has only a very small
> probability of being selected.

When the above query returns L rows (where L < K), you need to append the first K - L rows from the table, simulating a "ring" with no start or end. (Conveniently, this also solves the problem of not finding K rows because the random start value was too large.)

The second set of rows can certainly be fetched using a second SELECT statement. Whether this can be computed efficiently as a single SELECT statement I am not sure, but you might try something like this:

   (SELECT 1 AS seq, *
    FROM mydata
    WHERE random_number >= (SELECT RANDOM() OFFSET 0)
    ORDER BY random_number ASC
    LIMIT <K>)
   UNION ALL
   (SELECT 2 AS seq, *
    FROM mydata
    ORDER BY random_number ASC
    LIMIT <K>)
   ORDER BY seq ASC, random_number ASC
   LIMIT <K>;

This should give each row an equal chance of being selected while requiring the database to fetch at most 2 * K rows.

Regards,
Paul Tillotson
Here's how I would do it. This assumes a static table that doesn't change a lot.

1: Find the row count n of the table.
2: Randomly assign 1 through n to the rows. How to do this is a whole other post.
3: Create a sequence. If you always need 10 or 100 random rows, set the increment to that number. Set it to cycle at the size of the table.
4: select nextval('sequence') => nv and use it in a select:

   select * from myrandomtable where id between nv and nv+100; -- or whatever your increment is

There are refinements to this. The advantages, with a static data set, are that you can cluster on the randomized id and get chunks of the random dataset VERY quickly, and you won't repeat results until you start over. You can re-randomize the table every x hours, days or weeks to meet your needs. If you don't want to re-randomize it during the day, just load the random data set into the table however many times you need so that it won't roll over until the next day/week etc...

Does that make sense? If your data changes all the time, you've got a more difficult problem to deal with.
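A minimal sketch of this recipe under stated assumptions: the names myrandomtable and rand_walk are illustrative, the table is assumed to hold 200,000 rows served in chunks of 100, and the random 1..n assignment (the "whole other post" of step 2) is done here with row_number() over a random ordering, which needs a release with window functions:

   -- Step 2: copy the data with ids 1..n assigned in random order.
   CREATE TABLE myrandomtable AS
       SELECT row_number() OVER (ORDER BY random()) AS rid, d.*
       FROM mydata d;
   CREATE INDEX myrandomtable_rid_idx ON myrandomtable (rid);

   -- Step 3: a cycling sequence that advances by the sample size.
   CREATE SEQUENCE rand_walk INCREMENT BY 100 MINVALUE 1 MAXVALUE 200000 CYCLE;

   -- Step 4: each call returns the next chunk of 100 pre-shuffled rows.
   SELECT m.*
   FROM (SELECT nextval('rand_walk') AS nv) AS s
   JOIN myrandomtable AS m ON m.rid BETWEEN s.nv AND s.nv + 99;

Clustering the table on rid (CLUSTER myrandomtable USING myrandomtable_rid_idx) then turns each chunk into a contiguous read, which is where the speed comes from.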
As far as I can tell, all of the proposed solutions lack sample independence. Take the OP's suggested approach of doing something like this:

   SELECT *
   FROM mydata
   WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY mydata.random_number ASC
   LIMIT 100

All you're doing is picking random =subsequences= from the same permutation of the original data. This is not the same as a random sample. That is, if rows A and B are adjacent in the permutation, then if A is in the sample, B will also be in it with very high probability, depending on the size of the sample. Another way of saying this is that the first element of the sample is selected randomly; the rest are completely deterministic. In a true random sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true random samples, because it makes a new permutation every time. If you want to use the random_number column approach, then you need to do the same. You can accomplish this by sampling from the original permutation repeatedly, doing the above with LIMIT 1 as many times as you need. Yes, this is more costly, but TANSTAAFL. As is often observed, it's easy to create the appearance of randomness, harder to accomplish in reality.

- John Burger
  MITRE
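A sketch of the repeated single-row draw described above, reusing the random_number column and index from earlier in the thread; the caller runs it K times and must discard and redraw duplicates itself:

   -- One independent draw per execution; run K times from the application.
   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT random() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT 1;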
> All you're doing is picking random =subsequences= from the same
> permutation of the original data.

You have some good points in your reply. I am very much aware of the non-random behavior you point out for the "static random-value column" approach, but at least it is fast, which is a requirement. :-( However, since the lifetime of the individual rows is short, the behaviour is luckily sufficiently random for my specific purpose.

I furthermore realize that the only way to get truly random samples is to ORDER BY random(), but this is unacceptably slow for large data sets. Even though it is not trivial at all, there ARE algorithms out there [1,2] for picking random subsets from a result set, but these are (sadly) not implemented in PostgreSQL.

References:
[1] http://portal.acm.org/citation.cfm?id=304206
[2] http://compstat.chonbuk.ac.kr/Sisyphus/CurrentStudy/Sampling/vldb86.pdf
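Pending a built-in implementation, one of the classic single-pass techniques from this literature, reservoir sampling (Algorithm R), can be hand-rolled in PL/pgSQL. A sketch under stated assumptions: the function name is made up, the mydata table is hard-coded, and it still reads every row, though it avoids the full sort and keeps only K rows in memory:

   -- Illustrative reservoir sample of k rows from mydata (Algorithm R).
   -- Single pass, no sort; needs a release supporting composite arrays.
   CREATE OR REPLACE FUNCTION reservoir_sample(k integer)
   RETURNS SETOF mydata AS $$
   DECLARE
       rec       mydata%ROWTYPE;
       reservoir mydata[] := '{}';
       n         integer  := 0;   -- rows seen so far
       j         integer;
   BEGIN
       FOR rec IN SELECT * FROM mydata LOOP
           n := n + 1;
           IF n <= k THEN
               reservoir := array_append(reservoir, rec);  -- fill phase
           ELSE
               j := 1 + floor(random() * n)::integer;      -- uniform in 1..n
               IF j <= k THEN
                   reservoir[j] := rec;   -- replace with probability k/n
               END IF;
           END IF;
       END LOOP;
       FOR j IN 1 .. least(n, k) LOOP
           RETURN NEXT reservoir[j];
       END LOOP;
   END;
   $$ LANGUAGE plpgsql;

   -- Usage: SELECT * FROM reservoir_sample(1000);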
In article <ffnid8$1q2t$1@news.hub.org>, cluster <skrald@amossen.dk> wrote:
% > How important is true randomness?
%
% The goal is an even distribution, but so far I have not seen any way
% to produce any kind of random sampling efficiently. Notice the word

How about generating the ctid randomly? You can get the number of pages from pg_class, and estimate the number of rows per page either from the number of tuples in pg_class or just based on what you know about the data. Then generate two series of random numbers, one from 0 to the number of pages and the other from 1 to the number of rows per page, and keep picking rows until you have enough. Assuming there aren't too many dead tuples and your estimates are good, this should retrieve n rows with roughly n look-ups.

If your estimates are low, there will be tuples which can never be selected, and so far as I know, there's no way to construct a random ctid in a stock postgres database, but apart from that it seems like a good plan. If efficiency is important, you could create a C function which returns a series of random tids and join on that.
--
Patrick TJ McPhee
North York  Canada
ptjm@interlog.com
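For what it's worth, PostgreSQL does accept a text-to-tid cast, so at least the lookup half of this idea can be sketched without a C function. A minimal sketch, assuming a table named mydata; the random (page, slot) pairs are best generated in the application, because a simple constant equality on ctid is what lets the planner use a fast TID scan:

   -- Planner statistics give the (estimated) page and tuple counts:
   SELECT relpages, reltuples FROM pg_class WHERE relname = 'mydata';

   -- Fetch the tuple at one chosen position, e.g. page 123, line
   -- pointer 4; this returns zero rows if the slot is empty or dead,
   -- so the caller keeps drawing fresh positions until it has n rows.
   SELECT * FROM mydata WHERE ctid = '(123,4)'::tid;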
2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
> In article <ffnid8$1q2t$1@news.hub.org>, cluster <skrald@amossen.dk> wrote:
> [...]
> How about generating the ctid randomly? [...] If efficiency is
> important, you could create a C function which returns a series of
> random tids and join on that.

SELECT id, ...
FROM data
WHERE id = ANY(ARRAY(
    SELECT (random()*max_id)::int
    FROM generate_series(1,20)))
LIMIT 1;

-- max_id is external constant

Pavel Stehule
> 2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
> [...]
>
> SELECT id, ...
> FROM data
> WHERE id = ANY(ARRAY(
>     SELECT (random()*max_id)::int
>     FROM generate_series(1,20)))
> LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between zero and the maximum id. That's not truly random, nor is it completely safe if there are lots of gaps in the values of id. For example, if the lowest id is 50000 and the highest is 50040, this will be very likely to generate lots of numbers below 50000 and find no records at all.
2007/10/30, vincent <vinny@xs4all.nl>:
> [...]
> That selects records where the id is one of twenty random numbers between
> zero and the maximum id. That's not truly random, nor is it completely
> safe if there are lots of gaps in the values of id. For example, if the
> lowest id is 50000 and the highest is 50040, this will be very likely to
> generate lots of numbers below 50000 and find no records at all.

There is only one fully safe way: ORDER BY random() LIMIT 1.

If you know that your ids are not uniformly distributed starting at zero, you have to modify the generator:

   SELECT random()*(max_id - min_id) + min_id

This solution is far from ideal, but it is fast and for some purposes good enough (web shops, etc.).

Pavel
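Putting the two adjustments together, a sketch of the gap-tolerant variant; the numbers 50000 and 50040 are vincent's example range, and in practice min_id/max_id would be fetched first (or cached):

   -- Look up the real id range (cheap given an index on id):
   SELECT min(id) AS min_id, max(id) AS max_id FROM data;

   -- Draw candidate ids inside [min_id, max_id]; over-sampling 20
   -- compensates for gaps, and LIMIT 1 keeps a single survivor.
   SELECT id
   FROM data
   WHERE id = ANY(ARRAY(
       SELECT (random() * (50040 - 50000) + 50000)::int
       FROM generate_series(1, 20)))
   LIMIT 1;

Even then, which of several matching rows LIMIT 1 returns depends on scan order rather than chance, so the result is only approximately random, as Pavel concedes.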