Thread: ORDER BY random() LIMIT 1 slowness
I have a query where i just want to randomly pick out one row of the table. The query as I am running it looks like: SELECT * FROM poetry ORDER BY random() LIMIT 1; There are only roughly 35,000 rows of data and there is no way that I have found to specify what is randomly being ordered, I assume it's picking the primary key. The explain output looks like: QUERY PLAN: Limit (cost=49279.75..49279.75 rows=1 width=1062) (actual time=8503.83..8503.84 rows=1 loops=1) -> Sort (cost=49279.75..49365.68 rows=34375 width=1062) (actual time=8503.82..8503.83 rows=2 loops=1) Sort Key: random() -> Seq Scan on poetry (cost=0.00..5029.75 rows=34375 width=1062) (actual time=0.12..2503.35 rows=34376 loops=1) Total runtime: 8526.31 msec I have different variations on the runtime, all between 5000 and 15000 msec. Anyone have any ideas on how to speed this up? I've thought about doing a count query to get the total count and then use limit 1 offset #, where offset # is a number supplied by my program, but it would be preferred to do this in query. Thanks, Gavin
"Gavin M. Roy" <gmr@justsportsusa.com> writes: > SELECT * FROM poetry ORDER BY random() LIMIT 1; > [ is slow for 35000 rows ] Yeah. Basically this query is implemented as (a) select all 35000 rows of "poetry"; (b) compute a random() value for each row; (c) sort by the random() values; (d) take the first row, discard the rest. The good news: this gives you a pretty-durn-random selection. The bad news: you didn't really care about choosing a random ordering of the other 34999 rows, but it computed one anyway. This problem's been discussed before, but I've not seen any really ideal answer. SQL is not designed to offer unpredictable results ;-) If you have a reasonably-uniformly-distributed index key on the rows, you can try something like select * from poetry where key > (scale * random() + min) order by key limit 1; (where scale and min are chosen to make the result cover the range of index keys). Given an index on "key" this should pick a result quickly via an indexscan+limit plan. However, if the distribution of keys isn't real uniform then you won't get real random results --- keys just above larger gaps will be selected with unfairly high probability. You also have to know the min and max keys accurately. I recall some folks have suggested generating an actual random column (ie, something initialized with "default random()" and suitably indexed) but this seems like a lot of overhead ... you'd have to be mighty concerned with the exactness of your random selection to put up with that, I think. There are some other suggestions in the archives, IIRC. regards, tom lane
Gavin, Assuming that you have a serial column rand on poetry and you did not delete any row, here is my suggestion: CREATE TABLE poetry ( rand SERIAL, ... ); SELECT * FROM poetry WHERE rand = ( SELECT int8( curval( 'poetry_rand_seq') * random())); JLL Tom Lane wrote: > > "Gavin M. Roy" <gmr@justsportsusa.com> writes: > > SELECT * FROM poetry ORDER BY random() LIMIT 1; > > [ is slow for 35000 rows ] > > Yeah. Basically this query is implemented as > (a) select all 35000 rows of "poetry"; > (b) compute a random() value for each row; > (c) sort by the random() values; > (d) take the first row, discard the rest. > > The good news: this gives you a pretty-durn-random selection. The bad > news: you didn't really care about choosing a random ordering of the > other 34999 rows, but it computed one anyway. > > This problem's been discussed before, but I've not seen any really > ideal answer. SQL is not designed to offer unpredictable results ;-) > > If you have a reasonably-uniformly-distributed index key on the rows, > you can try something like > > select * from poetry where key > (scale * random() + min) > order by key limit 1; > > (where scale and min are chosen to make the result cover the range of > index keys). Given an index on "key" this should pick a result quickly > via an indexscan+limit plan. However, if the distribution of keys isn't > real uniform then you won't get real random results --- keys just above > larger gaps will be selected with unfairly high probability. You also > have to know the min and max keys accurately. > > I recall some folks have suggested generating an actual random column > (ie, something initialized with "default random()" and suitably indexed) > but this seems like a lot of overhead ... you'd have to be mighty > concerned with the exactness of your random selection to put up with > that, I think. > > There are some other suggestions in the archives, IIRC. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Gavin M. Roy" <gmr@justsportsusa.com> writes: > > SELECT * FROM poetry ORDER BY random() LIMIT 1; > > [ is slow for 35000 rows ] > > Yeah. Basically this query is implemented as > (a) select all 35000 rows of "poetry"; > (b) compute a random() value for each row; > (c) sort by the random() values; > (d) take the first row, discard the rest. If you can generate a random value from your application layer you could do select * from poetry LIMIT 1 OFFSET <random value> Can offset values be placeholders in prepared queries? If not then this has that disadvantage. -- greg
----- Original Message ----- From: "Jean-Luc Lachance" <jllachan@nsd.ca> Sent: Tuesday, December 17, 2002 5:04 PM > Gavin, > > Assuming that you have a serial column rand on poetry and you did not > delete any row, > here is my suggestion: > > CREATE TABLE poetry ( rand SERIAL, ... ); > > SELECT * FROM poetry WHERE rand = ( > SELECT int8( curval( 'poetry_rand_seq') * random())); Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's) bound to session and has no meaning before the first call to nextval()? 7.2.1 says the following; has it changed in 7.3(.*)? ---------------------------- cut here ------------------------------ tir=> create sequence test_seq; CREATE tir=> select currval('test_seq'); ERROR: test_seq.currval is not yet defined in this session tir=> select nextval('test_seq'); nextval --------- 1 (1 row) tir=> select currval('test_seq'); currval --------- 1 (1 row) ---------------------------- cut here ------------------------------ G. -- while (!asleep()) sheep++; ---------------------------- cut here ------------------------------
=?iso-8859-1?Q?SZUCS_G=E1bor?= <surrano@mailbox.hu> writes: >> CREATE TABLE poetry ( rand SERIAL, ... ); >> >> SELECT * FROM poetry WHERE rand = ( >> SELECT int8( curval( 'poetry_rand_seq') * random())); > Mmmm... It usually doesn't work for me. Yeah ... better would be >> SELECT * FROM poetry WHERE rand = ( >> SELECT int8( (select last_value from poetry_rand_seq) * random())); Personally though, I'd skip the sequence entirely and do create table poetry (..., rand float8 default random()); create index on poetry.rand select * from poetry where rand > random() order by rand limit 1; A difficulty with either of these approaches is that the system won't optimize comparisons involving random() into indexscans. To get around that, you'd have to hide the random() call inside a user-defined function that is (bogusly) marked cachable (or in 7.3, "stable" would be the best choice). At the moment I think it'd also work to stick the random() call inside a subselect, but the UDF approach is less likely to get broken by future changes. regards, tom lane
Gabor, You are right about the missing 'r', but I think you missed my point. You should modify your table so that it has a serial field and reload it. JLL P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may work under 7.3 SZUCS Gábor wrote: > > ----- Original Message ----- > From: "Jean-Luc Lachance" <jllachan@nsd.ca> > Sent: Tuesday, December 17, 2002 5:04 PM > > > Gavin, > > > > Assuming that you have a serial column rand on poetry and you did not > > delete any row, > > here is my suggestion: > > > > CREATE TABLE poetry ( rand SERIAL, ... ); > > > > SELECT * FROM poetry WHERE rand = ( > > SELECT int8( curval( 'poetry_rand_seq') * random())); > > Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's) > bound to session and has no meaning before the first call to nextval()? > 7.2.1 says the following; has it changed in 7.3(.*)? > > ---------------------------- cut here ------------------------------ > tir=> create sequence test_seq; > CREATE > tir=> select currval('test_seq'); > ERROR: test_seq.currval is not yet defined in this session > tir=> select nextval('test_seq'); > nextval > --------- > 1 > (1 row) > > tir=> select currval('test_seq'); > currval > --------- > 1 > (1 row) > ---------------------------- cut here ------------------------------ > > G. > -- > while (!asleep()) sheep++; > > ---------------------------- cut here ------------------------------ > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Dear Jean-Luc, I don't think my simplified example missed any of your solution's features. The essence, in my eyes, is that it has nothing to do with tables. It's only related to sequences. In short, you _cannot_ use currval() in any single _session_ until you use nextval() in the same session, even if you created the sequence in the very same session. Using a serial field in a table or using the sequence directly is indifferent. Or I'm missing something here. As for Tom's solution: ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> Sent: Wednesday, December 18, 2002 4:56 PM > Personally though, I'd skip the sequence entirely and do > > create table poetry (..., > rand float8 default random()); > create index on poetry.rand > > select * from poetry where rand > random() order by rand limit 1; I'm not sure it's as flat as a random number should be. I have some relation to mathematics but can't see it clearly right now. I fear it's more likely a normal distribution, not linear (or whatsits called). But if I needed something like this, I'd be happy with this solution anyway. G. -- while (!asleep()) sheep++; ---------------------------- cut here ------------------------------ ----- Original Message ----- From: "Jean-Luc Lachance" <jllachan@nsd.ca> Sent: Wednesday, December 18, 2002 5:55 PM Gabor, You are right about the missing 'r', but I think you missed my point. You should modify your table so that it has a serial field and reload it. JLL P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may work under 7.3
OK Gabor, I'm the one who misunderstood. To me, it seem to be a bug (or at least a mis-feature) that one cannot call currval() before calling nextval(). Does anyone know why it should be like this? JLL SZUCS Gábor wrote: > > Dear Jean-Luc, > > I don't think my simplified example missed any of your solution's features. > The essence, in my eyes, is that it has nothing to do with tables. It's only > related to sequences. > > In short, you _cannot_ use currval() in any single _session_ until you use > nextval() in the same session, even if you created the sequence in the very > same session. Using a serial field in a table or using the sequence directly > is indifferent. > > Or I'm missing something here. > > As for Tom's solution: > > ----- Original Message ----- > From: "Tom Lane" <tgl@sss.pgh.pa.us> > Sent: Wednesday, December 18, 2002 4:56 PM > > > Personally though, I'd skip the sequence entirely and do > > > > create table poetry (..., > > rand float8 default random()); > > create index on poetry.rand > > > > select * from poetry where rand > random() order by rand limit 1; > > I'm not sure it's as flat as a random number should be. I have some relation > to mathematics but can't see it clearly right now. I fear it's more likely a > normal distribution, not linear (or whatsits called). But if I needed > something like this, I'd be happy with this solution anyway. > > G. > -- > while (!asleep()) sheep++; > > ---------------------------- cut here ------------------------------ > ----- Original Message ----- > From: "Jean-Luc Lachance" <jllachan@nsd.ca> > Sent: Wednesday, December 18, 2002 5:55 PM > > Gabor, > > You are right about the missing 'r', but I think you missed my point. > You should modify your table so that it has a serial field and reload > it. > > JLL > > P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may > work under 7.3 > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote: > OK Gabor, > > I'm the one who misunderstood. > > To me, it seem to be a bug (or at least a mis-feature) that one cannot > call currval() before calling nextval(). > > Does anyone know why it should be like this? It doesn't make sense to call currval() if you haven't called nextval() before. The meaning of currval() is "the value that was last assigned to you". If you haven't called nextval(), there isn't a value assigned to you. If you want to know what was the last value the sequence gave to anyway, SELECT last_value FROM sequence. But be aware that this is non-transaction safe, non-isolatable, non-anything. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Entristecido, Wutra echa a Freyr a rodar y a nosotros al mar" (cancion de Las Barreras)
Alvara, But instead of returning an error, currval() should return last_value if nextval() was not called (with all the caveat of couse). I think it would be more usefull that way. JLL Alvaro Herrera wrote: > > On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote: > > OK Gabor, > > > > I'm the one who misunderstood. > > > > To me, it seem to be a bug (or at least a mis-feature) that one cannot > > call currval() before calling nextval(). > > > > Does anyone know why it should be like this? > > It doesn't make sense to call currval() if you haven't called nextval() > before. The meaning of currval() is "the value that was last assigned > to you". If you haven't called nextval(), there isn't a value assigned > to you. > > If you want to know what was the last value the sequence gave to anyway, > SELECT last_value FROM sequence. But be aware that this is > non-transaction safe, non-isolatable, non-anything. > > -- > Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) > "Entristecido, Wutra > echa a Freyr a rodar > y a nosotros al mar" (cancion de Las Barreras)
On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > OK Gabor, > > I'm the one who misunderstood. > > To me, it seem to be a bug (or at least a mis-feature) that one cannot > call currval() before calling nextval(). > > Does anyone know why it should be like this? First, read this page: http://developer.postgresql.org/docs/postgres/functions-sequence.html which answers a bit of that question. the real issue with sequences is that in order to be transactionally safe, they have to live outside of all transactions. The purpose of the sequence manipulation functions is to interact with sequence's in ways that ensure that the same sequence number is never used by two different transactions. Let me illustrate with a pair of concurrent transactions, A and B: A: begin; B: begin; A: select currval('seq'); <- client stores this value B: select currval('seq'); <- ditto A: insert into table (name, id) values ('john',idnum); B: insert into table (name, id) values ('sue',idnum); A: commit; B: commit; See the problem with the above? It's why you can't use currval to get the sequence number if you haven't called nextval, setval, or some other fuction that has changed the sequence, and why using it will cause an error. Let's fix the above queries: (seq=20) A: begin; B: begin; A: select nextval('seq'); <- client doesn't store this (but could) B: select nextval('seq'); <- client stores 22 A: insert into table (name, id) values ('john',currval('seq')); B: insert into table (name, id) values ('sue',idnum); A: commit; B: commit; All is well. Note that if A: were to roll back, B would still complete, but we would have a hole in our sequence for number 21. this is normal. The price we pay for having sequences be safe in transactions is that they live outside of transactions, and the functions that provide the interface are what are transactionally aware.
On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > Alvara, > > But instead of returning an error, currval() should return last_value if > nextval() was not called (with all the caveat of couse). I think it > would be more usefull that way. no, that would be like walking around with a gun pointed at your foot, to quote Tom Lane. See my post on transactions and such. Remember that everything in Postgresql is designed to make transactions safe. currval working without a nextval or setval before it is dangerous in the exterme to transactions.
Scott, I understand it all. If a programmer understand that currval() return the last_(used)_value and did not himself call nextval() he should be aware of the caveat. I did not want to make a big fuss of it. I will just use select last_value myself since I am already aware of the caveat. :) JLL "scott.marlowe" wrote: > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > > > Alvara, > > > > But instead of returning an error, currval() should return last_value if > > nextval() was not called (with all the caveat of couse). I think it > > would be more usefull that way. > > no, that would be like walking around with a gun pointed at your foot, to > quote Tom Lane. > > See my post on transactions and such. Remember that everything in > Postgresql is designed to make transactions safe. currval working without > a nextval or setval before it is dangerous in the exterme to transactions.
But the problem is that we'd immediately be innundated on the pgsql-general list by people who were using currval() in an unsafe way but didn't know it until they went into production and watched their application develop serious issues. I.e. they have a right to expect the sequence functions to operate in a transaction safe manner. It's not something that is likely to ever change, which is, imnsho, a good thing. On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > Scott, > > I understand it all. > > If a programmer understand that currval() return the last_(used)_value > and did not himself call nextval() he should be aware of the caveat. > > I did not want to make a big fuss of it. I will just use select > last_value myself since I am already aware of the caveat. :) > > JLL > > > "scott.marlowe" wrote: > > > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > > > > > Alvara, > > > > > > But instead of returning an error, currval() should return last_value if > > > nextval() was not called (with all the caveat of couse). I think it > > > would be more usefull that way. > > > > no, that would be like walking around with a gun pointed at your foot, to > > quote Tom Lane. > > > > See my post on transactions and such. Remember that everything in > > Postgresql is designed to make transactions safe. currval working without > > a nextval or setval before it is dangerous in the exterme to transactions. > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
Hi, If you need the last gobal value of the sequence just query the sequence directly. select last_value, increment_by from xyz_seq; last_value | increment_by ------------+-------------- 355 | 1 (1 row) or just select last_value from xyz_seq; last_value ------------ 355 (1 row) I think that sequences are great in postgresql after going from oracle to mysql to postgresql. Regards, Simon scott.marlowe wrote: > But the problem is that we'd immediately be innundated on the > pgsql-general list by people who were using currval() in an unsafe way > but didn't know it until they went into production and watched their > application develop serious issues. I.e. they have a right to expect > the sequence functions to operate in a transaction safe manner. It's > not something that is likely to ever change, which is, imnsho, a good > thing. > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > > > >> Scott, >> >> I understand it all. >> >> If a programmer understand that currval() return the last_(used)_value >> and did not himself call nextval() he should be aware of the caveat. >> >> I did not want to make a big fuss of it. I will just use select >> last_value myself since I am already aware of the caveat. :) >> >> JLL >> >> >> "scott.marlowe" wrote: >> >> >>> On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: >>> >>> >>> >>>> Alvara, >>>> >>>> But instead of returning an error, currval() should return >>>> last_value if >>>> nextval() was not called (with all the caveat of couse). I think it >>>> would be more usefull that way. >>>> >>> >>> no, that would be like walking around with a gun pointed at your >>> foot, to >>> quote Tom Lane. >>> >>> See my post on transactions and such. Remember that everything in >>> Postgresql is designed to make transactions safe. currval working >>> without >>> a nextval or setval before it is dangerous in the exterme to >>> transactions. >>> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >> >> > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > > >
Have another suggestion then -- add lastval(). We would have: nextval() transaction safe currval() transaction safe (nextval() must be called first) lastval() transaction unsafe JLL "scott.marlowe" wrote: > > But the problem is that we'd immediately be innundated on the > pgsql-general list by people who were using currval() in an unsafe way but > didn't know it until they went into production and watched their > application develop serious issues. I.e. they have a right to expect the > sequence functions to operate in a transaction safe manner. It's not > something that is likely to ever change, which is, imnsho, a good thing. > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > > > Scott, > > > > I understand it all. > > > > If a programmer understand that currval() return the last_(used)_value > > and did not himself call nextval() he should be aware of the caveat. > > > > I did not want to make a big fuss of it. I will just use select > > last_value myself since I am already aware of the caveat. :) > > > > JLL > > > > > > "scott.marlowe" wrote: > > > > > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote: > > > > > > > Alvara, > > > > > > > > But instead of returning an error, currval() should return last_value if > > > > nextval() was not called (with all the caveat of couse). I think it > > > > would be more usefull that way. > > > > > > no, that would be like walking around with a gun pointed at your foot, to > > > quote Tom Lane. > > > > > > See my post on transactions and such. Remember that everything in > > > Postgresql is designed to make transactions safe. currval working without > > > a nextval or setval before it is dangerous in the exterme to transactions. > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > >