Thread: random record from small set
I am trying to retrieve a random record (according to a chance
attribute) from a small set of records, each with a "chance" attribute.
This may eventually be somewhat of a performance concern, so I'd like to
make sure I'm doing this right.

Here's what I have so far:

create table r1 (
        i int,
        chance numeric
)

create or replace function randrec() returns int as $$
        $res = spi_exec_query('select i,chance from r1');
        $r = rand;
        $accum = 0;
        $i = 0;
        while($accum < $r) {
                $accum += $res->{rows}[$i++]->{chance}
        }
        return $res->{rows}[$i-1]->{i};
$$ language plperl;

test=# select * from r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30

That seems to work, in that out of 10k times, I got the following
numbers of each:
1  2479
2  1959
3  1522
4  950
5  3090

But I have a few questions:
* Am I right to use NUMERIC for the chance attribute?
* Does perl's arithmetic leave me with the chance that those numeric
  values don't add up to 1.00 (and in this case that could mean an
  infinite loop)?
* In my design I'll need a constraint trigger making sure that the
  numbers add up to 1.00. Will that be a performance problem for
  operations on the table that don't modify the chance attribute?
* Is there a better way?
* Does spi_exec_query pull the entire result set into memory at once? Is
  there a point at which performance could be a serious problem if there
  are a large number of items to select among?

Regards,
        Jeff Davis
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?

I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others.  I used the
sample data you posted and the PL/pgSQL function shown later in
this message.

> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?

I'd suggest looping through the records so you can't possibly end
up in an infinite loop.

> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.

If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.

> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?

Any trigger that you didn't otherwise need will cause a performance
hit.  I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement.  On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance.  I'd suggest testing different methods
under expected conditions and seeing which has the lowest impact.

> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?

I think it does.  I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:

CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
    r      double precision := random();
    accum  double precision := 0.0;
    row    record;
BEGIN
    FOR row IN SELECT * FROM r1 LOOP
        accum := accum + row.chance;
        IF accum >= r THEN
            EXIT;
        END IF;
    END LOOP;

    RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT * FROM r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30

SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
 i | count
---+-------
 1 |  2467
 2 |  1939
 3 |  1536
 4 |  1016
 5 |  3042
(5 rows)
Time: 3300.710 ms

Here are the results using the PL/Perl function you posted:

SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
 i | count
---+-------
 1 |  2501
 2 |  2040
 3 |  1463
 4 |   994
 5 |  3002
(5 rows)
Time: 8765.584 ms

I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/
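To make the statement-level option above concrete, a minimal sketch of
such a check might look like the following (the function and trigger
names are only illustrative, and the 1e-9 tolerance is an arbitrary
choice to sidestep the double precision equality issue mentioned above):

CREATE OR REPLACE FUNCTION check_chance_sum() RETURNS trigger AS $$
DECLARE
    total double precision;
BEGIN
    SELECT coalesce(sum(chance), 0.0) INTO total FROM r1;
    -- Compare against 1.0 with a small tolerance rather than strict
    -- equality, since a sum of doubles rarely lands exactly on 1.0.
    IF abs(total - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chance values sum to %, expected 1.0', total;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_check_sum
    AFTER INSERT OR UPDATE OR DELETE ON r1
    FOR EACH STATEMENT EXECUTE PROCEDURE check_chance_sum();

Note that loading the rows one INSERT at a time would trip this check
until the sum reaches 1.0, so it may make more sense to create the
trigger only after the initial load, or to use a deferred constraint
trigger instead.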
Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double precision values,
since it doesn't affect the probability significantly anyway. As far as
the constraint trigger, I will see if it becomes a problem before I
worry about its performance. As far as whether those values add up to
1.0, I'll just check to make sure it's fairly close to 1.0 :)

The only real difference that I saw was that I didn't notice much
difference if the underlying table's chance attribute was double
precision vs. numeric. I used your generate_series()-based query.
Perhaps that was because I was using ALTER TABLE to modify r1's
"chance" attribute. One thing that I noticed there was that I had to
CREATE OR REPLACE the plpgsql function for that to work. Perhaps it
was a bug? Here's a test case:

test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$
DECLARE
    accum double precision := 0.0;
    row record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ language plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2);
INSERT 17778 1
test=# select err();
 err
-----
   2
(1 row)

test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
 err
-----
  10
(1 row)

And if you keep playing around with the type and values you can get
other errors like:

ERROR:  type "double precision" value out of range: underflow
CONTEXT:  PL/pgSQL function "err" line 1 at assignment

Or:

ERROR:  invalid memory alloc request size 4294967290
CONTEXT:  PL/pgSQL function "randrec" line 7 at assignment

If any more information would be helpful, someone let me know. It
looks a little like a bug; perhaps we should throw an error when
dependent functions are called after the underlying types have
changed? Or I suppose if we can recognize that, it might as well
recompile the function and proceed without error.

Regards,
        Jeff Davis
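For the record, that "fairly close to 1.0" sanity check can be a single
one-off query as well; a minimal sketch (the 1e-9 tolerance is again an
arbitrary choice):

-- Returns true when the chances sum to 1.0 within the tolerance.
SELECT abs(sum(chance) - 1.0) <= 1e-9 AS sum_ok FROM r1;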
And what about another data representation, like

create table r1 (
        i int,
        chance_from numeric,
        chance_to numeric
)

With that you can select one random row in a single select, for instance:

select * from r1 where chance_from <= $rnd and chance_to > $rnd;

I see these advantages:
- Only one select.
- Indices can improve performance if r1 has many rows.
and this disadvantage:
- Tricky updates.
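With the sample probabilities posted earlier, that layout might be
populated like this (the boundary values are simply running totals of
the posted chances, and the 0.5321 literal just stands in for the
random draw):

insert into r1 values (1, 0.00, 0.25);
insert into r1 values (2, 0.25, 0.45);
insert into r1 values (3, 0.45, 0.60);
insert into r1 values (4, 0.60, 0.70);
insert into r1 values (5, 0.70, 1.00);

-- Each row owns the half-open interval [chance_from, chance_to),
-- so exactly one row matches any draw in [0, 1).
select * from r1 where chance_from <= 0.5321 and chance_to > 0.5321;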
Or

create table r1 (
        i int,
        chance_from numeric
)

and

select * from r1 where chance_from <= $rnd order by chance_from desc limit 1;

which can be updated more easily... Just ideas, I have never tested it...

Jan Poslusny wrote:
> And what about another data representation, like
>
> create table r1 (
>         i int,
>         chance_from numeric,
>         chance_to numeric
> )
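For that variant, an index on chance_from can let the descending scan
stop after a single row; a minimal sketch with the same boundary values
as above (the index name and the 0.5321 literal are only illustrative):

create index r1_chance_from_idx on r1 (chance_from);

insert into r1 values (1, 0.00);
insert into r1 values (2, 0.25);
insert into r1 values (3, 0.45);
insert into r1 values (4, 0.60);
insert into r1 values (5, 0.70);

-- 0.5321 stands in for a random() draw; the row with the largest
-- chance_from not exceeding the draw is the one selected.
select * from r1 where chance_from <= 0.5321
order by chance_from desc limit 1;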
> Here's what I have so far:

If you go that route, make sure you check for edge cases, such as
reaching the end of the rows without hitting your number:

  while($accum < $r) {
    die qq{Ran out of rows!\n} if ! defined $res->{rows}[$i];

Also, your query should be "select i,chance from r1 ORDER BY random()"
else you are getting back the same order each time (until a row is
changed), which certainly reduces the randomness.

Anyway, here's another solution, which shifts as much work as possible
off of the actual random row call, and uses a trigger to keep things in
sync. I switched the 'chance' from 0.25 to 25 (numeric to int) to make
things easier to read.

UPDATE r1 SET chance = chance*100;

ALTER TABLE r1 ALTER COLUMN chance TYPE INTEGER;

CREATE TABLE r2 (id integer);

CREATE OR REPLACE FUNCTION r1_cleanup() RETURNS trigger LANGUAGE plpgsql AS
$$
DECLARE
  mychance integer;
BEGIN

IF TG_OP = 'DELETE' THEN
  DELETE FROM r2 WHERE id = OLD.i;
ELSE
  IF TG_OP = 'UPDATE' THEN
    DELETE FROM r2 WHERE id = OLD.i OR id = NEW.i;
  END IF;
  SELECT chance FROM r1 WHERE i = NEW.i INTO mychance;
  LOOP
    mychance := mychance - 1;
    EXIT WHEN mychance < 0;
    INSERT INTO r2 VALUES (NEW.i);
  END LOOP;
END IF;

RETURN NULL;

END;
$$;

CREATE TRIGGER r1_trigger AFTER INSERT OR UPDATE OR DELETE ON r1
  FOR EACH ROW EXECUTE PROCEDURE r1_cleanup();

UPDATE r1 SET i=i;  -- To initially populate r2

SELECT id FROM r2 ORDER BY random() LIMIT 1;  -- repeat as needed

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200502152252
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
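To compare this approach against the earlier timings, the lookup can be
wrapped in a small function and fed through the same generate_series()
tally used earlier in the thread; a minimal sketch (randrec2 is only an
illustrative name):

CREATE OR REPLACE FUNCTION randrec2() RETURNS integer AS $$
    -- Pick one id from r2; because each i appears 'chance' times,
    -- the ids come back with the desired probabilities.
    SELECT id FROM r2 ORDER BY random() LIMIT 1;
$$ LANGUAGE sql VOLATILE;

SELECT i, count(*)
FROM (SELECT randrec2() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;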