Thread: Random resultset retrieving -> performance bottleneck

Random resultset retrieving -> performance bottleneck

From
Cédric Dufour
Date:
Hello to all of you,

I'm running into a performance problem when considering the following
scenario: I have a fairly large table (1mio rows) related to other smaller
tables (between 100 and 10000 rows) and would like to retrieve the joined
data (through a view) in random order. In order to do so, the main table
contains a 'Random' field (which is updated on a regular basis, in order to
re-randomize the data set), on which an index is created:

*****
* SCHEMA
*****

CREATE TABLE tb_Small AS (   PK integer UNIQUE NOT NULL   PRIMARY KEY (PK)
);

CREATE TABLE tb_Large AS (   Random integer DEFAULT( CAST( 1000000*random() AS integer ) ) NOT NULL,   FK_Small integer
NOTNULL,   PK integer UNIQUE NOT NULL,   FOREIGN KEY (FK_Small) REFERENCES tb_Small (PK),   PRIMARY KEY (PK)
 
);

CREATE INDEX ix_Large__Random ON tb_Large (Random);

CREATE TABLE tb_Foo AS (   FK_Large integer NOT NULL,   PK integer UNIQUE NOT NULL,   FOREIGN KEY (FK_Large) REFERENCES
tb_Large(PK) DEFERRABLE,   PRIMARY KEY (PK)
 
);

CREATE VIEW vw_Large AS
SELECT   tb_Small.*,   tb_Large.*
FROM   tb_Small
INNER JOIN   tb_Large   ON ( tb_Small.PK = tb_Large.FK_Small );

NOTA BENE: My production view actually involves much more inner- or
left-joined tables that this simple example

Here are the various querying scenario and the related performance problem
(my production view actually involves much more inner- or left-joined tables
that the scenarios below, simplified for the sake of clarity)

*****
* 1.
*****
CREATE VIEW vw_Large AS
SELECT   *
FROM   tb_Small AS Small
INNER JOIN   tb_Large AS Large   ON ( Small.PK = Large.FK_Small );

SELECT * FROM vw_Large ORDER BY Random LIMIT 50;
-> The slowest way (~60 time units), since the entire view is evaluated
before it is sorted properly (the index on the 'Random' being ignored)

SELECT * FROM vw_Large WHERE Small.PK = <any> ORDER BY Random LIMIT 50;
-> Quicker (~5 time units), since the evaluated view is smaller (cf. the
WHERE clause) before it is sorted properly

*****
* 2.
*****
CREATE VIEW vw_Large AS
SELECT   *
FROM   tb_Small AS Small
INNER JOIN   ( SELECT * FROM tb_Large ORDER BY Random ) AS Large   ON ( Small.PK = Large.FK_Small );

SELECT * FROM vw_Large LIMIT 50;
-> Much quicker (~15 time units), since the ordering is achieved on the
table itself, using the index, before the view is evaluated

SELECT * FROM vw_Large WHERE Small.PK = <any> LIMIT 50;
-> Slow (~15 time units), since the ordering is achieved on the entire
table, despite the WHERE clause


*****
* POSSIBLE SOLUTIONS AND PROBLEMS
*****
Since the second approach seems to give better results, the idea was to
reorder (cluster) the 'Large' table regurlarly - like once a day -, so as to
have a randomized data set and avoid the ORDER BY clause (this is the way I
achieved VERY GOOD performance on MS SQL Server [~1 time unit], for exactly
the same scenario). In order to do so, one might:

*****
* 1.
*****
CLUSTER ix_Large__Random TABLE tb_Large;

-> Would be ideal... but OIDS, CONTRAINTS AND INDEXES ARE LOST AND ALL
RELATED VIEWS/FUNCTIONS WON'T WORK ANYLONGER...

*****
* 2.
*****
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
CREATE TEMP TABLE tmp_Large AS SELECT * FROM tb_Table;
DELETE FROM tb_Large; -- won't work; RI violation on foreign key
'tb_Foo(FK_Large)'
INSERT INTO tb_Large SELECT * FROM tb_Table ORDER BY Random;
DROP TABLE tmp_Large;
COMMIT;

-> Would preserve oids, constraints and indexes... BUT DELETE IS IMPOSSIBLE,
BECAUSE REFERENTIAL INTEGRITY IS VIOLATED ON FOREIGN KEY 'FK_Large' IN TABLE
'tb_Foo', despite the SET CONSTRAINTS ALL DEFERRED clause

*****
* HELP !!!
*****
Would anyone have a solution to this (general) randomization problem ?
Is there a way to turn RI off during the transaction ?
Is there another way to reorder (cluster) the table without having
oids/constraints/indexes or RI problems ?

Any clues would be very much appreciated ;-)

Cédric Dufour




Re: Random resultset retrieving -> performance bottleneck

From
Stephan Szabo
Date:
On Thu, 1 Aug 2002, [iso-8859-1] C�dric Dufour wrote:

> *****
> * 2.
> *****
> BEGIN;
> SET CONSTRAINTS ALL DEFERRED;
> CREATE TEMP TABLE tmp_Large AS SELECT * FROM tb_Table;
> DELETE FROM tb_Large; -- won't work; RI violation on foreign key
> 'tb_Foo(FK_Large)'
> INSERT INTO tb_Large SELECT * FROM tb_Table ORDER BY Random;
> DROP TABLE tmp_Large;
> COMMIT;
>
> -> Would preserve oids, constraints and indexes... BUT DELETE IS IMPOSSIBLE,
> BECAUSE REFERENTIAL INTEGRITY IS VIOLATED ON FOREIGN KEY 'FK_Large' IN TABLE
> 'tb_Foo', despite the SET CONSTRAINTS ALL DEFERRED clause

Yeah, there's been a bug that should now be patched for upcoming 7.3
that caused this to fail.  I believe you should be able to find the
patch if you search -patches since it was pretty recent.  It might take a
little work to patch to a previous version, but it shouldn't be too hard.

Failing that, you can turn off all triggers (look at the output of
a data only pg_dump for queries to turn off/on trigger).



Re: Random resultset retrieving -> performance bottleneck

From
"Christopher Kings-Lynne"
Date:
> I'm running into a performance problem when considering the following
> scenario: I have a fairly large table (1mio rows) related to other smaller
> tables (between 100 and 10000 rows) and would like to retrieve the joined
> data (through a view) in random order. In order to do so, the main table
> contains a 'Random' field (which is updated on a regular basis, in order
to
> re-randomize the data set), on which an index is created:

Have you tried adding ORDER BY RANDOM() onto your select query?

Chris




Re: Random resultset retrieving -> performance bottleneck

From
Cédric Dufour
Date:
> -----Original Message-----
> From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
> Sent: Saturday, August 03, 2002 12:10
> To: Cédric Dufour; pgsql-sql@postgresql.org
> Subject: Re: [SQL] Random resultset retrieving -> performance bottleneck
>
>
> > I'm running into a performance problem when considering the following
> > scenario: I have a fairly large table (1mio rows) related to
> other smaller
> > tables (between 100 and 10000 rows) and would like to retrieve
> the joined
> > data (through a view) in random order. In order to do so, the main table
> > contains a 'Random' field (which is updated on a regular basis, in order
> to
> > re-randomize the data set), on which an index is created:
>
> Have you tried adding ORDER BY RANDOM() onto your select query?
>
> Chris
>
Yes, but the problem remains the same: even tough a LIMIT clause is given,
the full set seems to be evaluated before the sort occurs (and it is even
slower than storing a random value in the table itself, since random() has
to be computed for each row instead or retrieving the pre-computed one).
Thus...

SELECT my_view ORDER BY random() LIMIT 100 takes many minutes (actually,
after 10 minutes running, I gave up and canceled the query), while...
SELECT my_view LIMIT 100 takes only 200 milliseconds

So far, I managed to re-order the rows in random, turning off all triggers
and forced the planner to use only 'hash' joins (thus forcing it to make
'seq scan' and keeping it from re-ordering the rows based on the indexes).

Now, turning off 'nestloop' and 'mergejoin' (in order to have only 'hash'
joins) had a rather disorienting side-effect -> See thread '[GENERAL] b1 OR
b2 <-> ( CASE WHEN b1 THE true ELSE b2 END ): performance bottleneck on
logical OR' from this month (Tom, here is the reason why I copy you this
mail, so you understand where my messing around with the planner comes from
;-) )

So I turned all joins sorts ('hash' AND 'nestloop' and 'mergejoin') on again
and suceeded having the planner do what I want on this particular query by
surrounding the WHERE expression by a ( CASE ... END ):

SELECT ... WHERE exp -> SELECT ... WHERE ( CASE WHEN exp THEN true ELSE
false END )

Thus having the planner add a 'seq scan' of the randomized table just before
returning the resultset:

Limit  (cost=17623.97..3321903.97 rows=25 width=7855) (actual
time=2659.00..2732.00 rows=100 loops=1) ->  Hash Join  (cost=17623.97..3321923.80 rows=25 width=7855) (actual
time=2659.00..2730.00 rows=101 loops=1)       ->  Hash Join  (cost=17429.64..3321729.33 rows=25 width=7819)
(actual time=2651.00..2698.00 rows=101 loops=1)             ->  Hash Join  (cost=17235.92..3321535.49 rows=25
width=7783)
(actual time=2646.00..2681.00 rows=101 loops=1)                   ->  Hash Join  (cost=17041.36..3321315.62 rows=5000
width=7226) (actual time=2635.00..2663.00 rows=101 loops=1)                         ->  Hash Join
(cost=16847.45..3316059.18
rows=1000006 width=6669) (actual time=2629.00..2652.00 rows=101 loops=1)                               ->  Hash Join
(cost=16795.64..3311007.27
rows=1000006 width=4839) (actual time=2627.00..2648.00 rows=101 loops=1)                                     ->  Hash
Join
(cost=16743.82..3305955.36 rows=1000006 width=4831) (actual
time=2626.00..2640.00 rows=101 loops=1)
**********                                  ->  Seq Scan on tb_item item
(cost=0.00..34708.06 rows=1000006 width=2723) (actual time=847.00..853.00
rows=487 loops=1)                                           ->  Hash
(cost=14110.82..14110.82 rows=10001 width=2108) (actual
time=1775.00..1775.00 rows=0 loops=1)                                                 ->  Hash Join
(cost=13858.17..14110.82 rows=10001 width=2108) (actual
time=1041.00..1742.00 rows=10001 loops=1)                                                       ->  Hash Join
(cost=13806.36..14008.94 rows=10001 width=1310) (actual
time=1040.00..1488.00 rows=10001 loops=1)
[...]

This looks real dirty to me and I don't understand why I obtain what I
want... but 'la fin justifie les moyens', that is my query runs in
approximately 2-3 seconds and I obtain a randomized resultset ;-)

Would there be a 'by the book' way to tell the planner to preserve the order
of a given table ?

Thx for your help,

Cedric Dufour