Re: Obtaining random rows from a result set - Mailing list pgsql-general
From | Alban Hertroys |
---|---|
Subject | Re: Obtaining random rows from a result set |
Date | |
Msg-id | 5F2804AD-A18A-4791-8BAE-4203615E5ED4@magproductions.nl Whole thread Raw |
In response to | Re: Obtaining random rows from a result set (Martijn van Oosterhout <kleptog@svana.org>) |
Responses |
Re: Obtaining random rows from a result set
Re: Obtaining random rows from a result set |
List | pgsql-general |
On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote: > On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote: >> Examples: >> * random(maxrows) would return random rows from the resultset. >> * median() would return the rows in the middle of the result set >> (this >> would require ordering to be meaningful). > > It would be possible to write an aggregate that returns a single > random > value from a set. The algorithm is something like: > > n = 1 > v = null > for each row > if random() < 1/n: > v = value of row > n = n + 1 > > return v Doesn't this always return the first record, since random() is always less than 1/1? I don't think this method has a linear distribution, but then again I don't understand what 'value of row' refers to... > It does require a seqscan though. I doubt that a seqscan can be entirely avoided to fetch random rows from a set, at least not until the last random result has been returned, _unless_ the number of matching records would be known before starting taking random samples. > If you're asking for 5 random rows > you probably mean 5 random but distinct rows, which is different to > just running the above set 5 times in parallel. Indeed, that is one of the distinctions that need some thought for my original preposition. I left it out, as it's an implementation detail (an important one, admittedly). > I don't know if there's a similar method for median... I'm not entirely sure, but I think your method is the only one suggested that doesn't involve calculating random() a million times (for a million records) to return 5 (random) records. My suggestion involved a way to calculate random() only when retrieving records from the result set (only 5 times for a million records in this case). For a linearly distributed random set it does require knowing the number of records in the set though, an estimate would make it non- linear (although only a little bit if accurate enough). OTOH, I'm starting to think that the last sort step of an order by can be postponed to the result set fetching cycle under the conditions that: - the ordering expression is unrelated to the records involved, and - only a fraction of the total number of records will be returned. (Which is somewhat similar to the condition for an index being more efficient than a seqscan, btw) Comparing records with each other for something not related seems a waste of effort, while the result set has already been determined (just not ordered in any particular way), am I right? With that change (postponing sorting) my original ORDER BY random() LIMIT 5 would perform quite adequately, I think - it'd only involve calculating random() at least 5 times, not as often as the number of records in the result set. Or is order by random() acting as some kind of shuffling method? Is that a requirement to get a linearly distributed set to randomly draw from? I can see how it wouldn't be linear if you'd start randomly comparing records from the beginning of the result set... (Which would be the logical method if you don't know the size of the set before hand) I thought of another solution (with only a few calculations of random ()) that can be deployed in existing versions of PG, using a set- returning function with a scrolling cursor that accepts the query string as input like this (in pseudoish-code): ---- create function random(text _query, integer _limit) returns set volatile as $$ DECLARE _cur cursor; _cnt bigint; _idx integer; _rowpos bigint; _rec record; BEGIN open _cur for execute query; fetch forward all into _rec; -- select total nr of records into _cnt for _idx in 1.._limit loop _rowpos := random() * _cnt; fetch absolute _rowpos into _rec; return next _rec; end loop; return; END; $$ language 'plpgsql'; ---- This method could return the same record twice though, I'll need to build in some accounting for used up rowpos'es. Would it be more efficient than the usual methods? Sorry for the brain dump, I tried to get everything into this single message. I hope it is at least comprehensible and useful, or interesting or at least mildly amusing if not. Regards, Alban Hertroys magproductions b.v. !DSPAM:737,46d93d9b289906550616460!
pgsql-general by date: