Random resultset retrieving -> performance bottleneck - Mailing list pgsql-sql

From Cédric Dufour
Subject Random resultset retrieving -> performance bottleneck
Date
Msg-id NDBBIFNBODNADCAOFDOAKEIOCDAA.cedric.dufour@freesurf.ch
Whole thread Raw
Responses Re: Random resultset retrieving -> performance bottleneck  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
List pgsql-sql
Hello to all of you,

I'm running into a performance problem when considering the following
scenario: I have a fairly large table (1mio rows) related to other smaller
tables (between 100 and 10000 rows) and would like to retrieve the joined
data (through a view) in random order. In order to do so, the main table
contains a 'Random' field (which is updated on a regular basis, in order to
re-randomize the data set), on which an index is created:

*****
* SCHEMA
*****

CREATE TABLE tb_Small AS (   PK integer UNIQUE NOT NULL   PRIMARY KEY (PK)
);

CREATE TABLE tb_Large AS (   Random integer DEFAULT( CAST( 1000000*random() AS integer ) ) NOT NULL,   FK_Small integer
NOTNULL,   PK integer UNIQUE NOT NULL,   FOREIGN KEY (FK_Small) REFERENCES tb_Small (PK),   PRIMARY KEY (PK)
 
);

CREATE INDEX ix_Large__Random ON tb_Large (Random);

CREATE TABLE tb_Foo AS (   FK_Large integer NOT NULL,   PK integer UNIQUE NOT NULL,   FOREIGN KEY (FK_Large) REFERENCES
tb_Large(PK) DEFERRABLE,   PRIMARY KEY (PK)
 
);

CREATE VIEW vw_Large AS
SELECT   tb_Small.*,   tb_Large.*
FROM   tb_Small
INNER JOIN   tb_Large   ON ( tb_Small.PK = tb_Large.FK_Small );

NOTA BENE: My production view actually involves much more inner- or
left-joined tables that this simple example

Here are the various querying scenario and the related performance problem
(my production view actually involves much more inner- or left-joined tables
that the scenarios below, simplified for the sake of clarity)

*****
* 1.
*****
CREATE VIEW vw_Large AS
SELECT   *
FROM   tb_Small AS Small
INNER JOIN   tb_Large AS Large   ON ( Small.PK = Large.FK_Small );

SELECT * FROM vw_Large ORDER BY Random LIMIT 50;
-> The slowest way (~60 time units), since the entire view is evaluated
before it is sorted properly (the index on the 'Random' being ignored)

SELECT * FROM vw_Large WHERE Small.PK = <any> ORDER BY Random LIMIT 50;
-> Quicker (~5 time units), since the evaluated view is smaller (cf. the
WHERE clause) before it is sorted properly

*****
* 2.
*****
CREATE VIEW vw_Large AS
SELECT   *
FROM   tb_Small AS Small
INNER JOIN   ( SELECT * FROM tb_Large ORDER BY Random ) AS Large   ON ( Small.PK = Large.FK_Small );

SELECT * FROM vw_Large LIMIT 50;
-> Much quicker (~15 time units), since the ordering is achieved on the
table itself, using the index, before the view is evaluated

SELECT * FROM vw_Large WHERE Small.PK = <any> LIMIT 50;
-> Slow (~15 time units), since the ordering is achieved on the entire
table, despite the WHERE clause


*****
* POSSIBLE SOLUTIONS AND PROBLEMS
*****
Since the second approach seems to give better results, the idea was to
reorder (cluster) the 'Large' table regurlarly - like once a day -, so as to
have a randomized data set and avoid the ORDER BY clause (this is the way I
achieved VERY GOOD performance on MS SQL Server [~1 time unit], for exactly
the same scenario). In order to do so, one might:

*****
* 1.
*****
CLUSTER ix_Large__Random TABLE tb_Large;

-> Would be ideal... but OIDS, CONTRAINTS AND INDEXES ARE LOST AND ALL
RELATED VIEWS/FUNCTIONS WON'T WORK ANYLONGER...

*****
* 2.
*****
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
CREATE TEMP TABLE tmp_Large AS SELECT * FROM tb_Table;
DELETE FROM tb_Large; -- won't work; RI violation on foreign key
'tb_Foo(FK_Large)'
INSERT INTO tb_Large SELECT * FROM tb_Table ORDER BY Random;
DROP TABLE tmp_Large;
COMMIT;

-> Would preserve oids, constraints and indexes... BUT DELETE IS IMPOSSIBLE,
BECAUSE REFERENTIAL INTEGRITY IS VIOLATED ON FOREIGN KEY 'FK_Large' IN TABLE
'tb_Foo', despite the SET CONSTRAINTS ALL DEFERRED clause

*****
* HELP !!!
*****
Would anyone have a solution to this (general) randomization problem ?
Is there a way to turn RI off during the transaction ?
Is there another way to reorder (cluster) the table without having
oids/constraints/indexes or RI problems ?

Any clues would be very much appreciated ;-)

Cédric Dufour




pgsql-sql by date:

Previous
From: "Chad R. Larson"
Date:
Subject: Re: [ADMIN] 3-tier
Next
From: Jochem van Dieten
Date:
Subject: Re: Please Help me