Re: Random resultset retrieving -> performance bottleneck - Mailing list pgsql-sql

From Cédric Dufour
Subject Re: Random resultset retrieving -> performance bottleneck
Date
Msg-id NDBBIFNBODNADCAOFDOAAEJOCDAA.cedric.dufour@freesurf.ch
Whole thread Raw
In response to Re: Random resultset retrieving -> performance bottleneck  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
List pgsql-sql
> -----Original Message-----
> From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> Sent: Saturday, August 03, 2002 12:10
> To: Cédric Dufour; pgsql-sql@postgresql.org
> Subject: Re: [SQL] Random resultset retrieving -> performance bottleneck
>
>
> > I'm running into a performance problem when considering the following
> > scenario: I have a fairly large table (1mio rows) related to
> other smaller
> > tables (between 100 and 10000 rows) and would like to retrieve
> the joined
> > data (through a view) in random order. In order to do so, the main table
> > contains a 'Random' field (which is updated on a regular basis, in order
> to
> > re-randomize the data set), on which an index is created:
>
> Have you tried adding ORDER BY RANDOM() onto your select query?
>
> Chris
>
Yes, but the problem remains the same: even tough a LIMIT clause is given,
the full set seems to be evaluated before the sort occurs (and it is even
slower than storing a random value in the table itself, since random() has
to be computed for each row instead or retrieving the pre-computed one).
Thus...

SELECT my_view ORDER BY random() LIMIT 100 takes many minutes (actually,
after 10 minutes running, I gave up and canceled the query), while...
SELECT my_view LIMIT 100 takes only 200 milliseconds

So far, I managed to re-order the rows in random, turning off all triggers
and forced the planner to use only 'hash' joins (thus forcing it to make
'seq scan' and keeping it from re-ordering the rows based on the indexes).

Now, turning off 'nestloop' and 'mergejoin' (in order to have only 'hash'
joins) had a rather disorienting side-effect -> See thread '[GENERAL] b1 OR
b2 <-> ( CASE WHEN b1 THE true ELSE b2 END ): performance bottleneck on
logical OR' from this month (Tom, here is the reason why I copy you this
mail, so you understand where my messing around with the planner comes from
;-) )

So I turned all joins sorts ('hash' AND 'nestloop' and 'mergejoin') on again
and suceeded having the planner do what I want on this particular query by
surrounding the WHERE expression by a ( CASE ... END ):

SELECT ... WHERE exp -> SELECT ... WHERE ( CASE WHEN exp THEN true ELSE
false END )

Thus having the planner add a 'seq scan' of the randomized table just before
returning the resultset:

Limit  (cost=17623.97..3321903.97 rows=25 width=7855) (actual
time=2659.00..2732.00 rows=100 loops=1) ->  Hash Join  (cost=17623.97..3321923.80 rows=25 width=7855) (actual
time=2659.00..2730.00 rows=101 loops=1)       ->  Hash Join  (cost=17429.64..3321729.33 rows=25 width=7819)
(actual time=2651.00..2698.00 rows=101 loops=1)             ->  Hash Join  (cost=17235.92..3321535.49 rows=25
width=7783)
(actual time=2646.00..2681.00 rows=101 loops=1)                   ->  Hash Join  (cost=17041.36..3321315.62 rows=5000
width=7226) (actual time=2635.00..2663.00 rows=101 loops=1)                         ->  Hash Join
(cost=16847.45..3316059.18
rows=1000006 width=6669) (actual time=2629.00..2652.00 rows=101 loops=1)                               ->  Hash Join
(cost=16795.64..3311007.27
rows=1000006 width=4839) (actual time=2627.00..2648.00 rows=101 loops=1)                                     ->  Hash
Join
(cost=16743.82..3305955.36 rows=1000006 width=4831) (actual
time=2626.00..2640.00 rows=101 loops=1)
**********                                  ->  Seq Scan on tb_item item
(cost=0.00..34708.06 rows=1000006 width=2723) (actual time=847.00..853.00
rows=487 loops=1)                                           ->  Hash
(cost=14110.82..14110.82 rows=10001 width=2108) (actual
time=1775.00..1775.00 rows=0 loops=1)                                                 ->  Hash Join
(cost=13858.17..14110.82 rows=10001 width=2108) (actual
time=1041.00..1742.00 rows=10001 loops=1)                                                       ->  Hash Join
(cost=13806.36..14008.94 rows=10001 width=1310) (actual
time=1040.00..1488.00 rows=10001 loops=1)
[...]

This looks real dirty to me and I don't understand why I obtain what I
want... but 'la fin justifie les moyens', that is my query runs in
approximately 2-3 seconds and I obtain a randomized resultset ;-)

Would there be a 'by the book' way to tell the planner to preserve the order
of a given table ?

Thx for your help,

Cedric Dufour




pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: problem fetching currval of sequence
Next
From: "Sander Steffann"
Date:
Subject: Re: VACUUM not doing its job?