Thread: Selecting a random row
Hi! I have to select a random row from a table where primary key isn't continuous (some rows have been deleted). Postgres just seems to do something strange with my method. -- -- Use the order by desc limit 1 -trick to get maximum value -- CREATE OR REPLACE FUNCTION max_uid() RETURNS int4 AS 'SELECT uid FROM users WHERE status = ''a'' ORDER BY uid DESC LIMIT 1' LANGUAGE 'sql'; -- -- Choose a random point between 0 and max_uid and select the first -- value from the bigger part -- CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS 'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >= cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1' LANGUAGE 'sql'; -- -- testing and looks good -- galleria=> SELECT max_uid(); max_uid --------- 126263 -- -- testing... -- galleria=> SELECT random_uid(), random_uid(), random_uid(), random_uid(), random_uid(); random_uid | random_uid | random_uid | random_uid | random_uid ------------+------------+------------+------------+------------ 322 | 601 | 266 | 427 | 369 ... but what is this? Values seem to vary from 0 to ~1000. Not from 0 to 126263!! How about doing some manual work... -- -- Testing split point selection -- galleria=> SELECT cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER); int4 ------- 43279 -- -- And inserting split point manually -- galleria=> SELECT uid FROM users u WHERE u.status = 'a' AND uid >= 43279 ORDER BY uid ASC LIMIT 1; uid ------- 43284 Works just fine! Is there any explanation for this strange behavior or are there better ways to select a random row? I'm using PG 8.0 b2. Plan for the query is: Limit (cost=0.00..5.19 rows=1 width=4) -> Index Scan using users_pkey on users u (cost=0.00..69145.26 rows=13329 width=4) Filter: ((status = 'a'::bpchar) AND (uid >= ((((max_uid() - 1))::double precision * random()))::integer)) |\__/| ( oo ) Kari Lavikka - tuner@bdb.fi __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _ ""
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thursday 04 November 2004 12:36, Kari Lavikka wrote: > Is there any explanation for this strange behavior or are there better > ways to select a random row? How about SELECT ...whatever... ORDER BY random() LIMIT 1; Mit freundlichem Gruß / With kind regards Holger Klawitter - -- lists <at> klawitter <dot> de -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpKG4gZmt4wCcD/ZC KHBlBl1+5FZ4pgqkZlyzWQA= =MrrE -----END PGP SIGNATURE-----
Works but is too slooow. Shuffling whole table and selecting the first row is not the way to go in this case. Limit (cost=5340.74..5340.74 rows=1 width=4) -> Sort (cost=5340.74..5440.70 rows=39986 width=4) Sort Key: random() -> Seq Scan on users (cost=0.00..2284.37 rows=39986 width=4) Filter: (status = 'a'::bpchar) |\__/| ( oo ) Kari Lavikka - tuner@bdb.fi __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _ "" On Thu, 4 Nov 2004, Holger Klawitter wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Thursday 04 November 2004 12:36, Kari Lavikka wrote: > > Is there any explanation for this strange behavior or are there better > > ways to select a random row? > > How about > > SELECT ...whatever... ORDER BY random() LIMIT 1; > > Mit freundlichem Gruß / With kind regards > Holger Klawitter > - -- > lists <at> klawitter <dot> de > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.2.2 (GNU/Linux) > > iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpKG4gZmt4wCcD/ZC > KHBlBl1+5FZ4pgqkZlyzWQA= > =MrrE > -----END PGP SIGNATURE----- > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
IIRC, this was discussed a few times on this list, searching the archives might get you some results. AFAIR, the only way to do it efficiently is to have a column specially assigned for this purpose, and populate it with random numbers in a big range. The column should be indexed to assure fast access based on it. Then you can select the first row with that column's value bigger (or smaller if you like) than a random number in the same range. HTH, Csaba. On Thu, 2004-11-04 at 14:34, Kari Lavikka wrote: > Works but is too slooow. Shuffling whole table and selecting the first > row is not the way to go in this case. > > Limit (cost=5340.74..5340.74 rows=1 width=4) > -> Sort (cost=5340.74..5440.70 rows=39986 width=4) > Sort Key: random() > -> Seq Scan on users (cost=0.00..2284.37 rows=39986 width=4) > Filter: (status = 'a'::bpchar) > > |\__/| > ( oo ) Kari Lavikka - tuner@bdb.fi > __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _ > "" > > On Thu, 4 Nov 2004, Holger Klawitter wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > On Thursday 04 November 2004 12:36, Kari Lavikka wrote: > > > Is there any explanation for this strange behavior or are there better > > > ways to select a random row? > > > > How about > > > > SELECT ...whatever... ORDER BY random() LIMIT 1; > > > > Mit freundlichem Gruß / With kind regards > > Holger Klawitter > > - -- > > lists <at> klawitter <dot> de > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG v1.2.2 (GNU/Linux) > > > > iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpKG4gZmt4wCcD/ZC > > KHBlBl1+5FZ4pgqkZlyzWQA= > > =MrrE > > -----END PGP SIGNATURE----- > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Replying to myself.. Actually I found an answer. If a I wrap the split point selection to subquery then the range of results is from 0 to maximum value (~120k in this case) galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >= (select cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT 1; uid ------- 91937 (1 row) Limit (cost=1.73..3.53 rows=1 width=4) InitPlan -> Result (cost=1.71..1.73 rows=1 width=0) InitPlan -> Limit (cost=0.00..1.71 rows=1 width=4) -> Index Scan Backward using users_pkey on users (cost=0.00..68423.70 rows=39986 width=4) Filter: (status = 'a'::bpchar) -> Index Scan using users_pkey on users u (cost=0.00..23983.04 rows=13329 width=4) Index Cond: (uid >= $1) Filter: (status = 'a'::bpchar) However, without the additional nothing doing subquery the range of results is something like 0 to ~1000 which is of course wrong. galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >= cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1; uid ----- 587 (1 row) Examining the query plan reveals that without subquery the random comparison is made for each row. That causes a kind of "premature selection". galleria=> explain SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >= cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1; ------------------------------------------------------------------------------------------------------------ Limit (cost=1.71..6.89 rows=1 width=4) InitPlan -> Limit (cost=0.00..1.71 rows=1 width=4) -> Index Scan Backward using users_pkey on users (cost=0.00..68423.70 rows=39986 width=4) Filter: (status = 'a'::bpchar) -> Index Scan using users_pkey on users u (cost=0.00..69042.18 rows=13329 width=4) Filter: ((status = 'a'::bpchar) AND (uid >= (((($0 - 1))::double precision * random()))::integer)) (7 rows) Well, it works now. Thanks anyway ;) |\__/| ( oo ) Kari Lavikka - tuner@bdb.fi - (050) 380 3808 __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _ ""
Kari Lavikka <tuner@bdb.fi> writes: > -- > -- Choose a random point between 0 and max_uid and select the first > -- value from the bigger part > -- > CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS > 'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >= > cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid > ASC LIMIT 1' > LANGUAGE 'sql'; This isn't going to do what you think because the random() function is re-evaluated at every row of the table. (For that matter, so is max_uid(), which means performance would suck even if it worked ...) I'd suggest rewriting in plpgsql so you can assign the (max_uid-1)*random() expression to a variable and then just use the variable in the SELECT. regards, tom lane
Kari, Why not select count(*) from the table and multiply it by a true 0.0 - 1.0 pseudo random number generator? Then adjust the outcome for the range of uids. If the uids (or some other column) are contiguous starting at 0, this would be a snap. Rick Tom Lane <tgl@sss.pgh.pa.us> To: Kari Lavikka <tuner@bdb.fi> Sent by: cc: pgsql-general@postgresql.org pgsql-general-owner@pos Subject: Re: [GENERAL] Selecting a random row tgresql.org 11/04/2004 10:25 AM Kari Lavikka <tuner@bdb.fi> writes: > -- > -- Choose a random point between 0 and max_uid and select the first > -- value from the bigger part > -- > CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS > 'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >= > cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid > ASC LIMIT 1' > LANGUAGE 'sql'; This isn't going to do what you think because the random() function is re-evaluated at every row of the table. (For that matter, so is max_uid(), which means performance would suck even if it worked ...) I'd suggest rewriting in plpgsql so you can assign the (max_uid-1)*random() expression to a variable and then just use the variable in the SELECT. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings
From: "Kari Lavikka" <tuner@bdb.fi> > > Actually I found an answer. If a I wrap the split point selection to > subquery then the range of results is from 0 to maximum value (~120k in > this case) > > galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >= > (select cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid > DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT 1; > uid > ------- > 91937 > (1 row) Tthe problem with this is that this is not very random. If the uids 30000 to 39999 have been missing, but the uids are more or less contiguous apart from that, the uid 40000 would be 10000 times more likely to be selected than average. Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1 could be practical. gnari
"gnari" <gnari@simnet.is> writes: > Tthe problem with this is that this is not very random. > If the uids 30000 to 39999 have been missing, but > the uids are more or less contiguous apart from that, > the uid 40000 would be 10000 times more likely to be selected > than average. There is some discussion of selecting random rows in the list archives. My recollection is that we decided the only way to be fast *and* truly random was to dedicate an extra column to be a random key. regards, tom lane
> Tthe problem with this is that this is not very random. > If the uids 30000 to 39999 have been missing, but > the uids are more or less contiguous apart from that, > the uid 40000 would be 10000 times more likely to be selected > than average. There are some gaps but distribution of them is quite uniform. And results seem to be random enuff for this particular purpose. > Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1 > could be practical. Something like OFFSET (random() * 10) could be used for additional randomness of course. |\__/| ( oo ) Kari Lavikka - tuner@bdb.fi __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _ ""