Thread: Random resultset retrieving -> performance bottleneck
Hello to all of you, I'm running into a performance problem when considering the following scenario: I have a fairly large table (1mio rows) related to other smaller tables (between 100 and 10000 rows) and would like to retrieve the joined data (through a view) in random order. In order to do so, the main table contains a 'Random' field (which is updated on a regular basis, in order to re-randomize the data set), on which an index is created: ***** * SCHEMA ***** CREATE TABLE tb_Small AS ( PK integer UNIQUE NOT NULL PRIMARY KEY (PK) ); CREATE TABLE tb_Large AS ( Random integer DEFAULT( CAST( 1000000*random() AS integer ) ) NOT NULL, FK_Small integer NOTNULL, PK integer UNIQUE NOT NULL, FOREIGN KEY (FK_Small) REFERENCES tb_Small (PK), PRIMARY KEY (PK) ); CREATE INDEX ix_Large__Random ON tb_Large (Random); CREATE TABLE tb_Foo AS ( FK_Large integer NOT NULL, PK integer UNIQUE NOT NULL, FOREIGN KEY (FK_Large) REFERENCES tb_Large(PK) DEFERRABLE, PRIMARY KEY (PK) ); CREATE VIEW vw_Large AS SELECT tb_Small.*, tb_Large.* FROM tb_Small INNER JOIN tb_Large ON ( tb_Small.PK = tb_Large.FK_Small ); NOTA BENE: My production view actually involves much more inner- or left-joined tables that this simple example Here are the various querying scenario and the related performance problem (my production view actually involves much more inner- or left-joined tables that the scenarios below, simplified for the sake of clarity) ***** * 1. ***** CREATE VIEW vw_Large AS SELECT * FROM tb_Small AS Small INNER JOIN tb_Large AS Large ON ( Small.PK = Large.FK_Small ); SELECT * FROM vw_Large ORDER BY Random LIMIT 50; -> The slowest way (~60 time units), since the entire view is evaluated before it is sorted properly (the index on the 'Random' being ignored) SELECT * FROM vw_Large WHERE Small.PK = <any> ORDER BY Random LIMIT 50; -> Quicker (~5 time units), since the evaluated view is smaller (cf. the WHERE clause) before it is sorted properly ***** * 2. ***** CREATE VIEW vw_Large AS SELECT * FROM tb_Small AS Small INNER JOIN ( SELECT * FROM tb_Large ORDER BY Random ) AS Large ON ( Small.PK = Large.FK_Small ); SELECT * FROM vw_Large LIMIT 50; -> Much quicker (~15 time units), since the ordering is achieved on the table itself, using the index, before the view is evaluated SELECT * FROM vw_Large WHERE Small.PK = <any> LIMIT 50; -> Slow (~15 time units), since the ordering is achieved on the entire table, despite the WHERE clause ***** * POSSIBLE SOLUTIONS AND PROBLEMS ***** Since the second approach seems to give better results, the idea was to reorder (cluster) the 'Large' table regurlarly - like once a day -, so as to have a randomized data set and avoid the ORDER BY clause (this is the way I achieved VERY GOOD performance on MS SQL Server [~1 time unit], for exactly the same scenario). In order to do so, one might: ***** * 1. ***** CLUSTER ix_Large__Random TABLE tb_Large; -> Would be ideal... but OIDS, CONTRAINTS AND INDEXES ARE LOST AND ALL RELATED VIEWS/FUNCTIONS WON'T WORK ANYLONGER... ***** * 2. ***** BEGIN; SET CONSTRAINTS ALL DEFERRED; CREATE TEMP TABLE tmp_Large AS SELECT * FROM tb_Table; DELETE FROM tb_Large; -- won't work; RI violation on foreign key 'tb_Foo(FK_Large)' INSERT INTO tb_Large SELECT * FROM tb_Table ORDER BY Random; DROP TABLE tmp_Large; COMMIT; -> Would preserve oids, constraints and indexes... BUT DELETE IS IMPOSSIBLE, BECAUSE REFERENTIAL INTEGRITY IS VIOLATED ON FOREIGN KEY 'FK_Large' IN TABLE 'tb_Foo', despite the SET CONSTRAINTS ALL DEFERRED clause ***** * HELP !!! ***** Would anyone have a solution to this (general) randomization problem ? Is there a way to turn RI off during the transaction ? Is there another way to reorder (cluster) the table without having oids/constraints/indexes or RI problems ? Any clues would be very much appreciated ;-) Cédric Dufour
On Thu, 1 Aug 2002, [iso-8859-1] C�dric Dufour wrote: > ***** > * 2. > ***** > BEGIN; > SET CONSTRAINTS ALL DEFERRED; > CREATE TEMP TABLE tmp_Large AS SELECT * FROM tb_Table; > DELETE FROM tb_Large; -- won't work; RI violation on foreign key > 'tb_Foo(FK_Large)' > INSERT INTO tb_Large SELECT * FROM tb_Table ORDER BY Random; > DROP TABLE tmp_Large; > COMMIT; > > -> Would preserve oids, constraints and indexes... BUT DELETE IS IMPOSSIBLE, > BECAUSE REFERENTIAL INTEGRITY IS VIOLATED ON FOREIGN KEY 'FK_Large' IN TABLE > 'tb_Foo', despite the SET CONSTRAINTS ALL DEFERRED clause Yeah, there's been a bug that should now be patched for upcoming 7.3 that caused this to fail. I believe you should be able to find the patch if you search -patches since it was pretty recent. It might take a little work to patch to a previous version, but it shouldn't be too hard. Failing that, you can turn off all triggers (look at the output of a data only pg_dump for queries to turn off/on trigger).
> I'm running into a performance problem when considering the following > scenario: I have a fairly large table (1mio rows) related to other smaller > tables (between 100 and 10000 rows) and would like to retrieve the joined > data (through a view) in random order. In order to do so, the main table > contains a 'Random' field (which is updated on a regular basis, in order to > re-randomize the data set), on which an index is created: Have you tried adding ORDER BY RANDOM() onto your select query? Chris
> -----Original Message----- > From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au] > Sent: Saturday, August 03, 2002 12:10 > To: Cédric Dufour; pgsql-sql@postgresql.org > Subject: Re: [SQL] Random resultset retrieving -> performance bottleneck > > > > I'm running into a performance problem when considering the following > > scenario: I have a fairly large table (1mio rows) related to > other smaller > > tables (between 100 and 10000 rows) and would like to retrieve > the joined > > data (through a view) in random order. In order to do so, the main table > > contains a 'Random' field (which is updated on a regular basis, in order > to > > re-randomize the data set), on which an index is created: > > Have you tried adding ORDER BY RANDOM() onto your select query? > > Chris > Yes, but the problem remains the same: even tough a LIMIT clause is given, the full set seems to be evaluated before the sort occurs (and it is even slower than storing a random value in the table itself, since random() has to be computed for each row instead or retrieving the pre-computed one). Thus... SELECT my_view ORDER BY random() LIMIT 100 takes many minutes (actually, after 10 minutes running, I gave up and canceled the query), while... SELECT my_view LIMIT 100 takes only 200 milliseconds So far, I managed to re-order the rows in random, turning off all triggers and forced the planner to use only 'hash' joins (thus forcing it to make 'seq scan' and keeping it from re-ordering the rows based on the indexes). Now, turning off 'nestloop' and 'mergejoin' (in order to have only 'hash' joins) had a rather disorienting side-effect -> See thread '[GENERAL] b1 OR b2 <-> ( CASE WHEN b1 THE true ELSE b2 END ): performance bottleneck on logical OR' from this month (Tom, here is the reason why I copy you this mail, so you understand where my messing around with the planner comes from ;-) ) So I turned all joins sorts ('hash' AND 'nestloop' and 'mergejoin') on again and suceeded having the planner do what I want on this particular query by surrounding the WHERE expression by a ( CASE ... END ): SELECT ... WHERE exp -> SELECT ... WHERE ( CASE WHEN exp THEN true ELSE false END ) Thus having the planner add a 'seq scan' of the randomized table just before returning the resultset: Limit (cost=17623.97..3321903.97 rows=25 width=7855) (actual time=2659.00..2732.00 rows=100 loops=1) -> Hash Join (cost=17623.97..3321923.80 rows=25 width=7855) (actual time=2659.00..2730.00 rows=101 loops=1) -> Hash Join (cost=17429.64..3321729.33 rows=25 width=7819) (actual time=2651.00..2698.00 rows=101 loops=1) -> Hash Join (cost=17235.92..3321535.49 rows=25 width=7783) (actual time=2646.00..2681.00 rows=101 loops=1) -> Hash Join (cost=17041.36..3321315.62 rows=5000 width=7226) (actual time=2635.00..2663.00 rows=101 loops=1) -> Hash Join (cost=16847.45..3316059.18 rows=1000006 width=6669) (actual time=2629.00..2652.00 rows=101 loops=1) -> Hash Join (cost=16795.64..3311007.27 rows=1000006 width=4839) (actual time=2627.00..2648.00 rows=101 loops=1) -> Hash Join (cost=16743.82..3305955.36 rows=1000006 width=4831) (actual time=2626.00..2640.00 rows=101 loops=1) ********** -> Seq Scan on tb_item item (cost=0.00..34708.06 rows=1000006 width=2723) (actual time=847.00..853.00 rows=487 loops=1) -> Hash (cost=14110.82..14110.82 rows=10001 width=2108) (actual time=1775.00..1775.00 rows=0 loops=1) -> Hash Join (cost=13858.17..14110.82 rows=10001 width=2108) (actual time=1041.00..1742.00 rows=10001 loops=1) -> Hash Join (cost=13806.36..14008.94 rows=10001 width=1310) (actual time=1040.00..1488.00 rows=10001 loops=1) [...] This looks real dirty to me and I don't understand why I obtain what I want... but 'la fin justifie les moyens', that is my query runs in approximately 2-3 seconds and I obtain a randomized resultset ;-) Would there be a 'by the book' way to tell the planner to preserve the order of a given table ? Thx for your help, Cedric Dufour