Thread: Selecting K random rows - efficiently!
It has been suggested [1] that a good way to select K random rows efficiently from a table is to

1) add a new column, random_number, to the table and initialize it with random()

2) perform the following query:

   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT <K>

Here <K> is the number of random rows to pick, e.g. 100.

The benefit of this solution is that the "random_number" column can be indexed, allowing the query to be performed using a simple and fast index scan. However, there are a couple of major drawbacks to this method:

1) The smaller "random_number" is, the less likely it is that the row will be picked: a random_number close to zero has only a very small probability of being selected. The solution is to reassign "random_number" every now and then in order to even out the selection probabilities over time. PROBLEM: If the number of rows is large (e.g. 200,000 or even a million or more), the update query

   UPDATE mydata SET random_number = random();

might be very resource-demanding, take a lot of time, and pretty much slow down all other transactions because it eats up all resources.

2) The query to select K random rows orders the rows by "random_number" and selects the first K rows that have "random_number" larger than some other random number R. But what happens if there are fewer than K rows with random_number >= R? How can the rest of the random rows be picked efficiently?

Any ideas on how to deal with these drawbacks?

References:
[1] http://tinyurl.com/tyg4m
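For concreteness, a sketch of the setup behind [1], with assumed names; the windowed UPDATE at the end is one common way to soften drawback 1 by re-randomizing a slice at a time:

   -- One-time setup for the indexed random-column approach:
   ALTER TABLE mydata ADD COLUMN random_number double precision;
   UPDATE mydata SET random_number = random();
   CREATE INDEX mydata_random_number_idx ON mydata (random_number);

   -- K = 100 rows via a fast index scan (OFFSET 0 keeps the subquery
   -- from being flattened, so random() is evaluated once per query):
   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT random() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT 100;

   -- Mitigating drawback 1: re-randomize in windows of an assumed
   -- integer primary key "id" instead of one huge UPDATE:
   UPDATE mydata SET random_number = random()
   WHERE id BETWEEN 1 AND 10000;   -- repeat for successive windows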
Another way to look at the problem is: how do I sample a subset of size K efficiently? A query like

   SAMPLE 1000 OF
   (SELECT * FROM mydata WHERE <some condition>)

should return 1000 random rows from the select statement, such that two consecutive evaluations of the query would return the same 1000 rows only with very small probability. (Yes, I know that "SAMPLE 1000 OF" is not valid SQL.)
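As a historical aside: later PostgreSQL releases (9.5 and up) added standard syntax quite close to this wish, though it samples a base table rather than an arbitrary subquery and takes a percentage rather than an exact K:

   -- Roughly 1% of rows, sampled row by row:
   SELECT * FROM mydata TABLESAMPLE BERNOULLI (1);

   -- Block-level sampling: faster but less uniform; REPEATABLE fixes the seed.
   SELECT * FROM mydata TABLESAMPLE SYSTEM (1) REPEATABLE (42);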
On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: how do I sample a subset of size
> K efficiently? A query like
>
>    SAMPLE 1000 OF
>    (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution, most algorithms require you either to know how many rows there are, or to scan the whole table (or index). With some simplifying assumptions, you can try extracting the rows from an index, with the caveat that if your index is unbalanced in any way, the selection won't be "random".

> should return 1000 random rows from the select statement, such that two
> consecutive evaluations of the query would return the same 1000 rows
> only with very small probability.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL.)

Presumably your table is very much bigger than that, in which case I suppose the not-entirely-random selection is unlikely to play much of a role.

Search the archives; there have been solutions proposed before, though they probably aren't very quick...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
> How important is true randomness?

The goal is an even distribution, but so far I have not seen any way to produce any kind of random sampling efficiently. Notice the word "efficiently". The naive way of taking a random sample of size K,

   SELECT * FROM mydata ORDER BY random() LIMIT <K>

is clearly not an option for performance reasons. It shouldn't be necessary to explain why. :-)

> Search the archives; there have been solutions proposed before, though
> they probably aren't very quick...

As the subject suggests, performance really matters, and searching the archives only turns up poor solutions (my first post explains why).
cluster wrote:
> It has been suggested [1] that a good way to select K random rows
> efficiently from a table is to
> [...]
>
> 1) The smaller "random_number" is, the less likely it is that the row
> will be picked: a random_number close to zero has only a very small
> probability of being selected.

When the above query returns L rows (where L < K), you need to append the first K - L rows from the table, simulating a "ring" with no start or end. (Conveniently, this also solves the problem of not finding K rows because the random start value was too large.)

The second set of rows can certainly be fetched using a second SELECT statement. Whether this can be computed efficiently as a single SELECT statement I am not sure, but you might try something like this:

   (SELECT 1 AS seq, *
    FROM mydata
    WHERE random_number >= (SELECT RANDOM() OFFSET 0)
    ORDER BY random_number ASC
    LIMIT <K>)
   UNION ALL
   (SELECT 2 AS seq, *
    FROM mydata
    ORDER BY random_number ASC
    LIMIT <K>)
   ORDER BY seq ASC, random_number ASC
   LIMIT <K>;

This should give each row an equal chance of being selected while requiring the database to fetch at most 2 * K rows.

Regards,
Paul Tillotson
Here's how I would do it. This assumes a static table that doesn't change a lot.

1: Find the row count n of the table.
2: Randomly assign 1 through n to the rows. How to do this is a whole other post.
3: Create a sequence. If you always need 10 or 100 random rows, set the increment to that number. Set it to cycle at the size of the table.
4: select nextval('sequence') => nv and use it in a select:

   select * from myrandomtable where id between nv and nv+100; -- or whatever your increment is

There are refinements to this. The advantages, with a static data set, are that you can cluster on the randomized id and get chunks of the random dataset VERY quickly, and you won't repeat results until you start over. You can re-randomize the table every x hours, days or weeks to meet your needs. If you don't want to re-randomize it during the day, just load the random data set into the table however many times you need so that it won't roll over until the next day/week etc...

Does that make sense? If your data changes all the time, you've got a more difficult problem to deal with.
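A minimal sketch of this recipe under stated assumptions: the names myrandomtable and rand_walk are illustrative, the table is assumed to hold 200,000 rows served in chunks of 100, and the random 1..n assignment (the "whole other post" of step 2) is done here with row_number() over a random ordering, which needs a release with window functions:

   -- Step 2: copy the data with ids 1..n assigned in random order.
   CREATE TABLE myrandomtable AS
       SELECT row_number() OVER (ORDER BY random()) AS rid, d.*
       FROM mydata d;
   CREATE INDEX myrandomtable_rid_idx ON myrandomtable (rid);

   -- Step 3: a cycling sequence that advances by the sample size.
   CREATE SEQUENCE rand_walk INCREMENT BY 100 MINVALUE 1 MAXVALUE 200000 CYCLE;

   -- Step 4: each call returns the next chunk of 100 pre-shuffled rows.
   SELECT m.*
   FROM (SELECT nextval('rand_walk') AS nv) AS s
   JOIN myrandomtable AS m ON m.rid BETWEEN s.nv AND s.nv + 99;

Clustering the table on rid (CLUSTER myrandomtable USING myrandomtable_rid_idx) then turns each chunk into a contiguous read, which is where the speed comes from.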
As far as I can tell, all of the proposed solutions lack sample independence. Take the OP's suggested approach of doing something like this:

   SELECT *
   FROM mydata
   WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY mydata.random_number ASC
   LIMIT 100

All you're doing is picking random =subsequences= from the same permutation of the original data. This is not the same as a random sample. That is, if rows A and B are adjacent in the permutation, then if A is in the sample, B will also be in it with very high probability, depending on the size of the sample. Another way of saying this is that the first element of the sample is selected randomly; the rest are completely deterministic. In a true random sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true random samples, because it makes a new permutation every time. If you want to use the random_number column approach, then you need to do the same. You can accomplish this by sampling from the original permutation repeatedly, doing the above with LIMIT 1 as many times as you need. Yes, this is more costly, but TANSTAAFL. As is often observed, it's easy to create the appearance of randomness, harder to accomplish in reality.

- John Burger
  MITRE
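A sketch of the repeated single-row draw described above, reusing the random_number column and index from earlier in the thread; the caller runs it K times and must discard and redraw duplicates itself:

   -- One independent draw per execution; run K times from the application.
   SELECT *
   FROM mydata
   WHERE random_number >= (SELECT random() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT 1;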
> All you're doing is picking random =subsequences= from the same
> permutation of the original data.

You have some good points in your reply. I am very much aware of the non-random behavior you point out for the "static random-value column" approach, but at least it is fast, which is a requirement. :-( However, since the lifetime of the individual rows is short, the behaviour is luckily sufficiently random for my specific purpose.

I furthermore realize that the only way to get truly random samples is to ORDER BY random(), but this is unacceptably slow for large data sets. Even though it is not trivial at all, there ARE algorithms out there [1,2] for picking random subsets from a result set, but these are (sadly) not implemented in PostgreSQL.

References:
[1] http://portal.acm.org/citation.cfm?id=304206
[2] http://compstat.chonbuk.ac.kr/Sisyphus/CurrentStudy/Sampling/vldb86.pdf
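Pending a built-in implementation, one of the classic single-pass techniques from this literature, reservoir sampling (Algorithm R), can be hand-rolled in PL/pgSQL. A sketch under stated assumptions: the function name is made up, the mydata table is hard-coded, and it still reads every row, though it avoids the full sort and keeps only K rows in memory:

   -- Illustrative reservoir sample of k rows from mydata (Algorithm R).
   -- Single pass, no sort; needs a release supporting composite arrays.
   CREATE OR REPLACE FUNCTION reservoir_sample(k integer)
   RETURNS SETOF mydata AS $$
   DECLARE
       rec       mydata%ROWTYPE;
       reservoir mydata[] := '{}';
       n         integer  := 0;   -- rows seen so far
       j         integer;
   BEGIN
       FOR rec IN SELECT * FROM mydata LOOP
           n := n + 1;
           IF n <= k THEN
               reservoir := array_append(reservoir, rec);  -- fill phase
           ELSE
               j := 1 + floor(random() * n)::integer;      -- uniform in 1..n
               IF j <= k THEN
                   reservoir[j] := rec;   -- replace with probability k/n
               END IF;
           END IF;
       END LOOP;
       FOR j IN 1 .. least(n, k) LOOP
           RETURN NEXT reservoir[j];
       END LOOP;
   END;
   $$ LANGUAGE plpgsql;

   -- Usage: SELECT * FROM reservoir_sample(1000);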
In article <ffnid8$1q2t$1@news.hub.org>, cluster <skrald@amossen.dk> wrote:
% > How important is true randomness?
%
% The goal is an even distribution, but so far I have not seen any way
% to produce any kind of random sampling efficiently. Notice the word

How about generating the ctid randomly? You can get the number of pages from pg_class, and estimate the number of rows per page either from the number of tuples in pg_class or just based on what you know about the data. Then generate two series of random numbers, one from 0 to the number of pages and the other from 1 to the number of rows per page, and keep picking rows until you have enough. Assuming there aren't too many dead tuples and your estimates are good, this should retrieve n rows with roughly n look-ups.

If your estimates are low, there will be tuples which can never be selected, and so far as I know, there's no way to construct a random ctid in a stock postgres database, but apart from that it seems like a good plan. If efficiency is important, you could create a C function which returns a series of random tids and join on that.
--
Patrick TJ McPhee
North York  Canada
ptjm@interlog.com
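For what it's worth, PostgreSQL does accept a text-to-tid cast, so at least the lookup half of this idea can be sketched without a C function. A minimal sketch, assuming a table named mydata; the random (page, slot) pairs are best generated in the application, because a simple constant equality on ctid is what lets the planner use a fast TID scan:

   -- Planner statistics give the (estimated) page and tuple counts:
   SELECT relpages, reltuples FROM pg_class WHERE relname = 'mydata';

   -- Fetch the tuple at one chosen position, e.g. page 123, line
   -- pointer 4; this returns zero rows if the slot is empty or dead,
   -- so the caller keeps drawing fresh positions until it has n rows.
   SELECT * FROM mydata WHERE ctid = '(123,4)'::tid;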
2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
> In article <ffnid8$1q2t$1@news.hub.org>, cluster <skrald@amossen.dk> wrote:
> [...]
> How about generating the ctid randomly? [...] If efficiency is
> important, you could create a C function which returns a series of
> random tids and join on that.

SELECT id, ...
FROM data
WHERE id = ANY(ARRAY(
    SELECT (random()*max_id)::int
    FROM generate_series(1,20)))
LIMIT 1;

-- max_id is external constant

Pavel Stehule
> 2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
> [...]
>
> SELECT id, ...
> FROM data
> WHERE id = ANY(ARRAY(
>     SELECT (random()*max_id)::int
>     FROM generate_series(1,20)))
> LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between zero and the maximum id. That's not truly random, nor is it completely safe if there are lots of gaps in the values of id. For example, if the lowest id is 50000 and the highest is 50040, this will be very likely to generate lots of numbers below 50000 and find no records at all.
2007/10/30, vincent <vinny@xs4all.nl>:
> [...]
> That selects records where the id is one of twenty random numbers between
> zero and the maximum id. That's not truly random, nor is it completely
> safe if there are lots of gaps in the values of id. For example, if the
> lowest id is 50000 and the highest is 50040, this will be very likely to
> generate lots of numbers below 50000 and find no records at all.

There is only one fully safe way: ORDER BY random() LIMIT 1.

If you know that your ids are not uniformly distributed starting at zero, you have to modify the generator:

   SELECT random()*(max_id - min_id) + min_id

This solution is far from ideal, but it is fast and for some purposes good enough (web shops, etc.).

Pavel
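Putting the two adjustments together, a sketch of the gap-tolerant variant; the numbers 50000 and 50040 are vincent's example range, and in practice min_id/max_id would be fetched first (or cached):

   -- Look up the real id range (cheap given an index on id):
   SELECT min(id) AS min_id, max(id) AS max_id FROM data;

   -- Draw candidate ids inside [min_id, max_id]; over-sampling 20
   -- compensates for gaps, and LIMIT 1 keeps a single survivor.
   SELECT id
   FROM data
   WHERE id = ANY(ARRAY(
       SELECT (random() * (50040 - 50000) + 50000)::int
       FROM generate_series(1, 20)))
   LIMIT 1;

Even then, which of several matching rows LIMIT 1 returns depends on scan order rather than chance, so the result is only approximately random, as Pavel concedes.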