Thread: Bad advice in FAQ 4.1.

Bad advice in FAQ 4.1.

From
"Pavel Stehule"
Date:
Hello

I found a lot of slow queries in some databases I checked, and they were based
on advice 4.1 from the FAQ:

To SELECT a random row, use:
    SELECT col
    FROM tab
    ORDER BY random()
    LIMIT 1;

It's robust but slow on bigger tables. Can we add some better solutions?

Regards
Pavel Stehule


Re: Bad advice in FAQ 4.1.

From
"Nikolay Samokhvalov"
Date:
Hubert recently posted his thoughts on this topic: http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

I've encountered this problem several times in web development, and every time found that the best (in terms of performance) solution is to use some pseudo-random approach (such as ">= random() limit 1", "limit 1 offset random()*N", or even pre-caching rows on the app side).
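
Roughly like this, for the first two (table and column names are only
illustrative; the second variant assumes an indexed integer "id" with few
gaps, so the result is only approximately uniform):

    -- "offset" variant: the application picks a random offset in [0, N),
    -- where N is the (approximate) row count of the table.
    SELECT col
    FROM tab
    OFFSET 123          -- random()*N computed on the app side
    LIMIT 1;

    -- ">= random()" variant: one short index scan instead of reading
    -- the whole table.
    SELECT col
    FROM tab
    WHERE id >= (SELECT (random() * max(id))::int FROM tab)
    ORDER BY id
    LIMIT 1;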


--
Best regards,
Nikolay

Re: Bad advice in FAQ 4.1.

From
"Pavel Stehule"
Date:
2007/10/9, Nikolay Samokhvalov <samokhvalov@gmail.com>:
> Hubert recently posted his thoughts on this topic:
> http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
>
> I've encountered this problem several times in web development, and every
> time found that the best (in terms of performance) solution is to use some
> pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> random()*N", or even pre-caching rows on the app side).
>

I know this article, but we can't link from the FAQ to a private (unstable)
blog; it would have to be an article on techdoc.postgresql.org.

Pavel



Re: Bad advice in FAQ 4.1.

From
Gregory Stark
Date:
"Nikolay Samokhvalov" <samokhvalov@gmail.com> writes:

> Hubert recently posted his thoughts on this topic:
> http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
>
> I've encountered this problem several times in web development, and every
> time found that the best (in terms of performance) solution is to use some
> pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> random()*N", or even pre-caching rows on the app side).

"ORDER BY random() LIMIT 1" should be faster in 8.3 due to the bounded-sort
optimization. It should be basically the same as the two options above as far
as how many comparisons are done and how much memory is used. It does have to
call random() for every record whereas the solutions above only call random()
once.

But I think all of these are basically the same to a first approximation.
They all have to do a scan of all the records being considered.
If you want something faster you need a solution which can use an index to
scan only the target record. There are ways of doing that but they require
some application knowledge.
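
One such way, for example, is a single index lookup against a dense integer
primary key (only a sketch with illustrative names; it assumes you know the
ids run roughly 1..max with few gaps):

    -- Pick a random id and fetch exactly that row through the index.
    -- If the id falls in a gap, no row comes back and the application
    -- simply retries.
    SELECT col
    FROM tab
    WHERE id = (SELECT 1 + (random() * (max(id) - 1))::int FROM tab);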

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Bad advice in FAQ 4.1.

From
"Pavel Stehule"
Date:
2007/10/9, Gregory Stark <stark@enterprisedb.com>:
> "Nikolay Samokhvalov" <samokhvalov@gmail.com> writes:
>
> > Hubert recently posted his thoughts on this topic:
> > http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
> >
> > I've encountered this problem several times in web development, and every
> > time found that the best (in terms of performance) solution is to use some
> > pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> > random()*N", or even pre-caching rows on the app side).
>
> "ORDER BY random() LIMIT 1" should be faster in 8.3 due to the bounded-sort
> optimization. It should be basically the same as the two options above as far
> as how many comparisons are done and how much memory is used. It does have to
> call random() for every record whereas the solutions above only call random()
> once.
>
> But I think all of these are basically the same to a first approximation.
> They all have to do a scan of all the records being considered.
> If you want something faster you need a solution which can use an index to
> scan only the target record. There are ways of doing that but they require
> some application knowledge.
>

It always needs a seq scan :(, and it takes space in the buffer cache. A
solution based on a randomly generated PK is much faster. I work with one of
my customers who shows random products out of 10K products on every page of
an eShop, and he cannot understand that ORDER BY random() LIMIT is a bad
trick, because this trick is in the PostgreSQL FAQ.
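
The random-PK idea for that case looks roughly like this (names are
illustrative): the application picks a few random ids from the known
1..10000 range and fetches them through the primary-key index, accepting
that ids falling into gaps simply return no row:

    SELECT product_id, name
    FROM products
    WHERE product_id IN (4711, 812, 9034);  -- ids generated in the application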

Pavel


Re: Bad advice in FAQ 4.1.

From
Martijn van Oosterhout
Date:
On Tue, Oct 09, 2007 at 06:40:23PM +0200, Pavel Stehule wrote:
> It always needs a seq scan :(, and it takes space in the buffer cache. A
> solution based on a randomly generated PK is much faster. I work with one of
> my customers who shows random products out of 10K products on every page of
> an eShop, and he cannot understand that ORDER BY random() LIMIT is a bad
> trick, because this trick is in the PostgreSQL FAQ.

It's the only trick that works in all situations, though. There are of
course faster methods, but they require information about the distribution
of the values, the type, PKs, indexes, etc.

The standard does have provisions for extracting samples from tables, but
they're not implemented.
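
For reference, the standard's sample clause looks roughly like this (again,
not something PostgreSQL implements, so this is only what the syntax would
be):

    -- SQL-standard TABLESAMPLE: roughly a 1% Bernoulli sample of the table.
    SELECT col
    FROM tab TABLESAMPLE BERNOULLI (1);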

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Bad advice in FAQ 4.1.

From
"Pavel Stehule"
Date:
2007/10/9, Martijn van Oosterhout <kleptog@svana.org>:
> On Tue, Oct 09, 2007 at 06:40:23PM +0200, Pavel Stehule wrote:
> > It always needs a seq scan :(, and it takes space in the buffer cache. A
> > solution based on a randomly generated PK is much faster. I work with one of
> > my customers who shows random products out of 10K products on every page of
> > an eShop, and he cannot understand that ORDER BY random() LIMIT is a bad
> > trick, because this trick is in the PostgreSQL FAQ.
>
> It's the only trick that works in all situations, though. There are of
> course faster methods, but they require information about the distribution
> of the values, the type, PKs, indexes, etc.
>
> The standard does have provisions for extracting samples from tables, but
> they're not implemented.
>

I agree. I don't want to remove it from the FAQ. I would like to add a note
that sometimes it is necessary to find another trick.

Regards
Pavel