Thread: Getting random rows from a table
This email has been sitting in my drafts folder since Sept 20th, 2008. Almost 4 years! Getting it off my chest now. I am attaching 2 files for this solution I developed, because I don't know which is the correct one (probably both are, with some difference in implementation), but I don't have the time or energy to verify that now. I am pretty sure the randomization works, although it is a tad expensive to get random rows.
Although the procedure's parameter names are pretty self-descriptive, I'll explain them in brief here (an illustrative call follows the list):
p_schemaname : name of the schema where the table resides
p_tablename : name of the table you want to get random rows from
p_columns : column list (AFAIR, these can be expressions too)
p_where : the WHERE clause you wish to apply, if any.
p_numrows : how many rows you want in the result.
p_maxretries : how many times to retry when we can't find a row, before giving up; null implies 'retry forever'
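For illustration, a call spelling out all six parameters might look like the following (the table, columns, and WHERE clause here are made up; note also that the sample session further down passes only five arguments, so presumably one of the two attached files omits p_where):

SELECT *
  FROM get_random_rows( 'public', 'mytab', '{col_a,col_b}',
                        'col_a > 0', 10, 100 )
    AS ( a int, b text );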
Hi All,
For one of my small experiments (which is obviously backed by PG), I needed to get a set of random rows from a table. Googling around gave me some pointers, but they were all either not convenient (needed adding a column) or not performant enough (a sort on a huge resultset!); some had both problems. On my test table of about 90 MB containing 1 million rows, these solutions clearly did not perform well!
One of the solutions I thought of, and which worked for me too, was using the 'Synchronized Sequential Scans' feature of 8.3. You make one of your connections do sequential scans on the target table in a loop (just do a count(*) on that table in a loop). And when you want to select, say, 5 rows from that table, you just fire 'SELECT * FROM mytab LIMIT 5'. Depending on where the other, constantly-looping sequential scan is, you will get 5 rows from a random location in your table.
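In concrete terms, the setup is just two sessions (mytab is a stand-in name; synchronize_seqscans must be on, which is the default in 8.3):

-- Session 1: run this repeatedly (e.g. from a shell loop) so that a
-- sequential scan is always in flight somewhere in the table
SELECT count(*) FROM mytab;

-- Session 2: this scan synchronizes with session 1's current position,
-- so the 5 rows come from wherever that scan happens to be right now
SELECT * FROM mytab LIMIT 5;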
The problem with this approach is that you will always get the first rows from whichever database block you hit. So, in effect, you will almost never see the rows which lie at the end of a block (unless your LIMIT is high enough, or all the rows before them are dead).
So I developed another solution, which might work for many cases; and in cases where it doesn't, the code can easily be extended/modified to suit any query type.
Attached is the file containing the definition of the plpgsql function get_random_rows(), which we can use to get a specified number of random rows. This function returns truly random rows from the given table. Here are two invocations of this function on a test table:
postgres=> explain analyze select * from get_random_rows( null, 'url', '{url}', 100, null ) as ( a varchar );
NOTICE: Number of misses: 17
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on get_random_rows (cost=0.00..260.00 rows=1000 width=32) (actual time=193.790..194.269 rows=100 loops=1)
Total runtime: 195.017 ms
(2 rows)
postgres=> explain analyze select * from get_random_rows( null, 'url', '{url}', 100, null ) as ( a varchar );
NOTICE: Number of misses: 30
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on get_random_rows (cost=0.00..260.00 rows=1000 width=32) (actual time=246.101..246.714 rows=100 loops=1)
Total runtime: 247.452 ms
(2 rows)
postgres=>
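The attached file isn't reproduced inline here, but one plausible shape for it, assuming it works by probing random ctids and counting probes that find no visible row (which would explain the 'Number of misses' notice), is sketched below. All names and details beyond the parameter list above are guesses, not necessarily what the attachment actually does:

-- Untested sketch, NOT the attached file: sample rows by probing
-- random (block, line) ctids with TID scans, retrying on misses.
CREATE OR REPLACE FUNCTION get_random_rows(
    p_schemaname text,    -- null means rely on search_path
    p_tablename  text,
    p_columns    text[],  -- columns/expressions to return
    p_where      text,    -- extra filter; null means none
    p_numrows    int,
    p_maxretries int      -- null means retry forever
) RETURNS SETOF record AS $$
DECLARE
    v_rel    text := CASE WHEN p_schemaname IS NULL
                          THEN quote_ident(p_tablename)
                          ELSE quote_ident(p_schemaname) || '.'
                               || quote_ident(p_tablename) END;
    v_pages  int;
    v_tid    tid;
    v_hit    int;
    v_found  int := 0;
    v_misses int := 0;
    r        record;
BEGIN
    -- relpages is only an estimate, but good enough to pick blocks from
    SELECT relpages INTO v_pages FROM pg_class WHERE oid = v_rel::regclass;
    IF coalesce(v_pages, 0) = 0 THEN
        RETURN;  -- empty (or never-analyzed) table: nothing to sample
    END IF;

    WHILE v_found < p_numrows LOOP
        -- random (block, line) pair; 200 is a rough upper bound on heap
        -- tuples in an 8K page, so many probes will land on nothing
        v_tid := ('(' || trunc(random() * v_pages)::int || ','
                      || (1 + trunc(random() * 200)::int) || ')')::tid;

        EXECUTE 'SELECT ' || array_to_string(p_columns, ', ')
             || ' FROM '  || v_rel
             || ' WHERE ctid = $1 AND (' || coalesce(p_where, 'true') || ')'
          INTO r USING v_tid;
        GET DIAGNOSTICS v_hit = ROW_COUNT;

        IF v_hit = 1 THEN
            v_found := v_found + 1;
            RETURN NEXT r;
        ELSE
            v_misses := v_misses + 1;
            EXIT WHEN p_maxretries IS NOT NULL AND v_misses >= p_maxretries;
        END IF;
    END LOOP;

    RAISE NOTICE 'Number of misses: %', v_misses;
    RETURN;
END;
$$ LANGUAGE plpgsql;

A probe "misses" whenever the random ctid points past the last line pointer of a page or at a dead tuple, which would account for the miss counts of 17 and 30 in the two runs above.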
--
Gurjeet Singh
On 01/09/2012 05:23, Gurjeet Singh wrote:
> This email has been sitting in my drafts folder since Sept 20th, 2008.
> Almost 4 years! Getting it off my chest now. I am attaching 2 files for
> this solution I developed, because I don't know which is the correct one

Better late than never! :-) One for the wiki maybe?

Ray.

--
Raymond O'Donnell :: Galway :: Ireland
rod@iol.ie