Home > mailing lists

Re: Performance of ORDER BY RANDOM to select random rows? - Mailing list pgsql-general

From	Sergey Konoplev
Subject	Re: Performance of ORDER BY RANDOM to select random rows?
Date	August 9, 2013 00:43:56
Msg-id	CAL_0b1v-wu0NCuo96B_6BSBrRgWQS+Yw3Ry5mUEAVtVVtuKx-w@mail.gmail.com Whole thread
In response to	Performance of ORDER BY RANDOM to select random rows? (Victor Hooi <victorhooi@yahoo.com>)
Responses	Re: Performance of ORDER BY RANDOM to select random rows?
List	pgsql-general

Tree view

On Wed, Aug 7, 2013 at 7:01 PM, Victor Hooi <victorhooi@yahoo.com> wrote:
> also seems to suggest that using ORDER BY RANDOM() will perform poorly on
> Postgres.
>
> I'm just wondering if this is still the case?
>
> I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDERY
> BY RANDOM did not seem substantially to generating random integers in Python
> and picking those out (and handling non-existent rows).
>
> Has Postgres's behaviour for ORDER BY RANDOM change sometime recently?

Unfortunately, It has not. However, there always is a workaround. You
can get a random results fast by WITH RECURSIVE query.

WITH RECURSIVE r AS (
    WITH b AS (SELECT min(id), max(id) FROM table1)
    (
        SELECT id, min, max, array[]::integer[] AS a, 0 AS n
        FROM table1, b
        WHERE id > min + (max - min) * random()
        LIMIT 1
    ) UNION ALL (
        SELECT t.id, min, max, a || t.id, r.n + 1 AS n
        FROM table1 AS t, r
        WHERE
            t.id > min + (max - min) * random() AND
            t.id <> all(a) AND
            r.n + 1 < 10
        LIMIT 1
    )
)
SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The general idea is that we get a random value between min(id) and
max(id) and then get the first row with id bigger than this value.
Then we repeat until we get 10 of such rows, checking that this id has
not been retrieved earlier.

Surely, the probability of appearing one or another value in the
result depends on the distribution of id values in the table, but in
the most cases I faced it works good.

I had an idea to play with pg_stats.histogram_bounds to work around
the described issue, but it was never so critical for tasks I solved.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

pgsql-general by date:

From: Tom Lane
Date: 09 August 2013, 00:42:11
Subject: Re: How To Install Extension Via Script File?

From: Tom Lane
Date: 09 August 2013, 00:49:32
Subject: Re: Postgres won't start

Re: Performance of ORDER BY RANDOM to select random rows? - Mailing list pgsql-general

Previous

Next