Thread: selecting random row values in postgres
Hi all,<br /><br />I'm trying to write a query to select random values from a set of 'GROUP BY'<br />....see the scenariobelow to understand the problem here (the actual problem cannot be discussed here so i'm taking an example scenario)<br /><br />Assume there is a table<br /><br />id | name | year_of_birth<br /><br />query: I want to select foreach year_of_birth a random name.<br /><br />--> so i do a group by year_of_birth, now i have a set of names, is there any function to select just one name from these set of names. <br />The current approach i'm using to solve this problemis <br /><br />1) getting these names in a single string using a custom function 'group_concat'<br />2) Convert thesingle string into an array <br />3) use postgresql random function to generate a random number <br />4) us the randomnumber to select a element from the array previously created.<br /><br clear="all" />The solution is there but it'skinda hack, is there any other better way of solving this problem.<br /><br /><br />Thanks, <br />Sumeet
<br /><br />On 2/24/07, Sumeet <<a href="mailto:asumeet@gmail.com">asumeet@gmail.com</a>> wrote:<br />> Hi all,<br/>> <br />> I'm trying to write a query to select random values from a set of 'GROUP<br />> BY' <br />>....see the scenario below to understand the problem here (the actual<br />> problem cannot be discussed here soi'm taking an example scenario) <br />> <br />> Assume there is a table<br />> <br />> id | name | year_of_birth<br />> <br />> query: I want to select for each year_of_birth a random name.<br /><br />Dear Sumeet<br/><br />postgresql DISTINCT ON may be of help , but its not standard sql.<br /><br />regds<br />mallah.<br /><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;">tradein_clients=>SELECT * from temp.test;</span><br style="font-family: courier new,monospace;" /><spanstyle="font-family: courier new,monospace;">+----+------+-----+</span><br style="font-family: courier new,monospace;"/><span style="font-family: courier new,monospace;">| id | name | yob |</span><br style="font-family: couriernew,monospace;" /><span style="font-family: courier new,monospace;"> +----+------+-----+</span><br style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;">| 1 | A | 2 |</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;"> | 2 | B | 2 |</span><br style="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;">| 3| C | 2 |</span><br style="font-family: courier new,monospace;" /><span style="font-family: couriernew,monospace;"> | 4 | D | 1 |</span><br style="font-family: courier new,monospace;" /><span style="font-family:courier new,monospace;">| 5 | E | 1 |</span><br style="font-family: courier new,monospace;" /><spanstyle="font-family: courier new,monospace;"> | 6 | F | 1 |</span><br style="font-family: courier new,monospace;"/><span style="font-family: courier new,monospace;">+----+------+-----+</span><br style="font-family: couriernew,monospace;" /><span style="font-family: courier new,monospace;"> (6 rows)</span><br style="font-family: couriernew,monospace;" /><br style="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;">tradein_clients=>SELECT distinct on (yob) id,name,yob from temp.test order by yob,random();</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;">+----+------+-----+</span><brstyle="font-family: courier new,monospace;" /><span style="font-family: couriernew,monospace;"> | id | name | yob |</span><br style="font-family: courier new,monospace;" /><span style="font-family:courier new,monospace;">+----+------+-----+</span><br style="font-family: courier new,monospace;" /><spanstyle="font-family: courier new,monospace;"> | 5 | E | 1 |</span><br style="font-family: courier new,monospace;"/><span style="font-family: courier new,monospace;">| 1 | A | 2 |</span><br style="font-family: couriernew,monospace;" /><span style="font-family: courier new,monospace;"> +----+------+-----+</span><br style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;">(2 rows)</span><br style="font-family:courier new,monospace;" /><br style="font-family: courier new,monospace;" /><span style="font-family:courier new,monospace;">tradein_clients=> SELECT distinct on (yob) id,name,yob from temp.test orderby yob,random();</span><br style="font-family: courier new,monospace;" /><span style="font-family: courier new,monospace;">+----+------+-----+</span><br style="font-family: courier new,monospace;" /><span style="font-family: couriernew,monospace;">| id | name | yob |</span><br style="font-family: courier new,monospace;" /><span style="font-family:courier new,monospace;"> +----+------+-----+</span><br style="font-family: courier new,monospace;" /><spanstyle="font-family: courier new,monospace;">| 4 | D | 1 |</span><br style="font-family: courier new,monospace;"/><span style="font-family: courier new,monospace;"> | 1 | A | 2 |</span><br style="font-family: couriernew,monospace;" /><span style="font-family: courier new,monospace;">+----+------+-----+</span><br style="font-family:courier new,monospace;" /><span style="font-family: courier new,monospace;"> (2 rows)</span><br style="font-family:courier new,monospace;" /><br style="font-family: courier new,monospace;" /><br style="font-family: couriernew,monospace;" /><br style="font-family: courier new,monospace;" /><br />> <br />> --> so i do a group byyear_of_birth, now i have a set of names, is there <br />> any function to select just one name from these set of names.<br />> The current approach i'm using to solve this problem is <br />> <br />> 1) getting these names ina single string using a custom function <br />> 'group_concat'<br />> 2) Convert the single string into an array<br />> 3) use postgresql random function to generate a random number <br />> 4) us the random number to selecta element from the array previously <br />> created.<br />> <br />> The solution is there but it's kinda hack,is there any other better way of<br />> solving this problem.<br />> <br />> <br />> Thanks, <br />>Sumeet <br />
Thanks Buddy, really appreciate ur help on this
....problem solved...
Is there any way this query can be optimized...i'm running it on a huge table with joins
- Sumeet
--
Thanks,
Sumeet Ambre
Master of Information Science Candidate,
Indiana University.
....problem solved...
Is there any way this query can be optimized...i'm running it on a huge table with joins
- Sumeet
On 2/23/07, Rajesh Kumar Mallah <mallah.rajesh@gmail.com> wrote:
On 2/24/07, Sumeet <asumeet@gmail.com> wrote:
> Hi all,
>
> I'm trying to write a query to select random values from a set of 'GROUP
> BY'
> ....see the scenario below to understand the problem here (the actual
> problem cannot be discussed here so i'm taking an example scenario)
>
> Assume there is a table
>
> id | name | year_of_birth
>
> query: I want to select for each year_of_birth a random name.
Dear Sumeet
postgresql DISTINCT ON may be of help , but its not standard sql.
regds
mallah.
tradein_clients=> SELECT * from temp.test;
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 1 | A | 2 |
| 2 | B | 2 |
| 3 | C | 2 |
| 4 | D | 1 |
| 5 | E | 1 |
| 6 | F | 1 |
+----+------+-----+
(6 rows)
tradein_clients=> SELECT distinct on (yob) id,name,yob from temp.test order by yob,random();
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 5 | E | 1 |
| 1 | A | 2 |
+----+------+-----+
(2 rows)
tradein_clients=> SELECT distinct on (yob) id,name,yob from temp.test order by yob,random();
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 4 | D | 1 |
| 1 | A | 2 |
+----+------+-----+
(2 rows)
>
> --> so i do a group by year_of_birth, now i have a set of names, is there
> any function to select just one name from these set of names.
> The current approach i'm using to solve this problem is
>
> 1) getting these names in a single string using a custom function
> 'group_concat'
> 2) Convert the single string into an array
> 3) use postgresql random function to generate a random number
> 4) us the random number to select a element from the array previously
> created.
>
> The solution is there but it's kinda hack, is there any other better way of
> solving this problem.
>
>
> Thanks,
> Sumeet
--
Thanks,
Sumeet Ambre
Master of Information Science Candidate,
Indiana University.
On 2/24/07, Sumeet <asumeet@gmail.com> wrote:
Dear Sumeet,
if order by is not done there is no certainty about which row gets
selected. usually same row keeps getting selected. but if you want
a really random rows to come order by is required.
if you need certainty about the particular row being selected
order by a non random() column is required.
regds
mallah.
got it....I just figured out that i dont need the ORDER BY clause even the first row selected by the 'DISTINCT ON' would solve the problem.
Dear Sumeet,
if order by is not done there is no certainty about which row gets
selected. usually same row keeps getting selected. but if you want
a really random rows to come order by is required.
if you need certainty about the particular row being selected
order by a non random() column is required.
regds
mallah.
Thanks for all you help
-Sumeet.On 2/23/07, Sumeet <asumeet@gmail.com> wrote:Thanks Buddy, really appreciate ur help on this
....problem solved...
Is there any way this query can be optimized...i'm running it on a huge table with joins
- Sumeet--On 2/23/07, Rajesh Kumar Mallah <mallah.rajesh@gmail.com> wrote:
On 2/24/07, Sumeet <asumeet@gmail.com> wrote:
> Hi all,
>
> I'm trying to write a query to select random values from a set of 'GROUP
> BY'
> ....see the scenario below to understand the problem here (the actual
> problem cannot be discussed here so i'm taking an example scenario)
>
> Assume there is a table
>
> id | name | year_of_birth
>
> query: I want to select for each year_of_birth a random name.
Dear Sumeet
postgresql DISTINCT ON may be of help , but its not standard sql.
regds
mallah.
tradein_clients=> SELECT * from temp.test;
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 1 | A | 2 |
| 2 | B | 2 |
| 3 | C | 2 |
| 4 | D | 1 |
| 5 | E | 1 |
| 6 | F | 1 |
+----+------+-----+
(6 rows)
tradein_clients=> SELECT distinct on (yob) id,name,yob from temp.test order by yob,random();
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 5 | E | 1 |
| 1 | A | 2 |
+----+------+-----+
(2 rows)
tradein_clients=> SELECT distinct on (yob) id,name,yob from temp.test order by yob,random();
+----+------+-----+
| id | name | yob |
+----+------+-----+
| 4 | D | 1 |
| 1 | A | 2 |
+----+------+-----+
(2 rows)
>
> --> so i do a group by year_of_birth, now i have a set of names, is there
> any function to select just one name from these set of names.
> The current approach i'm using to solve this problem is
>
> 1) getting these names in a single string using a custom function
> 'group_concat'
> 2) Convert the single string into an array
> 3) use postgresql random function to generate a random number
> 4) us the random number to select a element from the array previously
> created.
>
> The solution is there but it's kinda hack, is there any other better way of
> solving this problem.
>
>
> Thanks,
> Sumeet
Thanks,
Sumeet Ambre
Master of Information Science Candidate,
Indiana University.
--
Thanks,
Sumeet Ambre
Master of Information Science Candidate,
Indiana University.
Sumeet wrote: > Thanks Buddy, really appreciate ur help on this > > ....problem solved... > > Is there any way this query can be optimized...i'm running it on a > huge table with joins ORDER BY rand() is rather slow on large datasets, since the db has to actually generate a random value for each row in the table, before being able use it to sort by. Preferable ways to do this include f.ex: SELECT max(id) FROM table; SELECT * FROM table WHERE id > [rand value between 0 and max(id)] ORDER BY id LIMIT 1; This means you need to execute 2 queries, and it can also be a good idea to somehow cache the number of rows/largest ID of the table, for quicker performence. You can find an interesting discussion on this topic at http://thedailywtf.com/Comments/Finding_Random_Rows.aspx (yeah, I know. thedayilywtf.com isn't normally what I'd use as a reference for anything :-) ) -- Tommy
Tommy Gildseth wrote: > Sumeet wrote: >> Thanks Buddy, really appreciate ur help on this >> >> ....problem solved... >> >> Is there any way this query can be optimized...i'm running it on a >> huge table with joins > > ORDER BY rand() is rather slow on large datasets, since the db has to > actually generate a random value for each row in the table, before being > able use it to sort by. Preferable ways to do this include f.ex: > SELECT max(id) FROM table; > SELECT * FROM table WHERE id > [rand value between 0 and max(id)] ORDER > BY id LIMIT 1; I'd have thought that in most cases the slowness would be due more to the sort than the cost of generating the random numbers. Your pair of queries neatly avoid doing the sort, but it does raise the question, does it matter if not all rows have the same chance of being picked? As an extreme example in the above, if there are id's 1-100 and id = 100000 in the table, the last one will almost always be picked. I wonder if the random aggregate that Josh Berkus wrote could be adapted to return a record type rather than a random value of a single column? Its big advantage is that it requires no sorting and all entries are equiprobable. Unfortunately I'm a bit short of time at this end of a Friday to do much but give you the URL: http://www.powerpostgresql.com/Random_Aggregate HTH, Geoff