Re: random record from small set - Mailing list pgsql-general
From:           Michael Fuhr
Subject:        Re: random record from small set
Date:
Msg-id:         20050215051843.GA40527@winnie.fuhr.org
In response to: random record from small set (Jeff Davis <jdavis-pgsql@empires.org>)
Responses:      Re: random record from small set
List:           pgsql-general
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?

I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others.  I used the sample
data you posted and the PL/pgSQL function shown later in this message.

> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?

I'd suggest looping through the records so you can't possibly end up
in an infinite loop.

> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.

If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.

> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?

Any trigger that you didn't otherwise need will cause a performance
hit.  I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement.  On the other hand, if
you make a lot of updates that don't modify the chance attribute, then
you might want to try a row-level trigger that skips the check when
NEW.chance = OLD.chance.  I'd suggest testing different methods under
expected conditions and seeing which has the lowest impact.

> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?

I think it does.  I ran some tests with the following PL/pgSQL function
and got significantly faster times than with PL/Perl, especially as the
data set grew:

CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
    r      double precision := random();  -- target point in [0, 1)
    accum  double precision := 0.0;       -- running sum of chances
    row    record;
BEGIN
    -- Accumulate chances until we pass r.  If the chances sum to
    -- less than 1.00 we simply fall out of the loop and return the
    -- last row, so an infinite loop is impossible.
    FOR row IN SELECT * FROM r1 LOOP
        accum := accum + row.chance;
        IF accum >= r THEN
            EXIT;
        END IF;
    END LOOP;

    RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT * FROM r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30

SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
 i | count
---+-------
 1 |  2467
 2 |  1939
 3 |  1536
 4 |  1016
 5 |  3042
(5 rows)

Time: 3300.710 ms

Here are the results using the PL/Perl function you posted:

SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER BY i;
 i | count
---+-------
 1 |  2501
 2 |  2040
 3 |  1463
 4 |   994
 5 |  3002
(5 rows)

Time: 8765.584 ms

I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
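A minimal sketch of the statement-level approach described above,
assuming the table is named r1 with columns i and chance, and comparing
against 1.00 within a small tolerance rather than with the equality
operator (untested; the names and the tolerance are placeholders):

CREATE FUNCTION check_chance_sum() RETURNS trigger AS $$
DECLARE
    total double precision;
BEGIN
    SELECT sum(chance) INTO total FROM r1;
    -- Compare within a tolerance instead of testing equality,
    -- since a double precision sum rarely hits 1.00 exactly.
    IF total IS NULL OR abs(total - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chance values sum to %, not 1.00', total;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_chance_sum
    AFTER INSERT OR UPDATE OR DELETE ON r1
    FOR EACH STATEMENT
    EXECUTE PROCEDURE check_chance_sum();

Since this fires once per statement, any rebalancing of the chances
has to happen in a single statement or the check will reject the
intermediate state.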
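And a sketch of the row-level variant that skips the check when the
chance attribute is untouched (same caveats; the sum check itself is
the same as above):

CREATE FUNCTION check_chance_row() RETURNS trigger AS $$
DECLARE
    total double precision;
BEGIN
    -- Skip the sum check when an UPDATE didn't touch chance.
    -- NEW is only examined for UPDATEs; it isn't defined for DELETE.
    IF TG_OP = 'UPDATE' THEN
        IF NEW.chance = OLD.chance THEN
            RETURN NULL;
        END IF;
    END IF;

    SELECT sum(chance) INTO total FROM r1;
    IF total IS NULL OR abs(total - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chance values sum to %, not 1.00', total;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_chance_row
    AFTER INSERT OR UPDATE OR DELETE ON r1
    FOR EACH ROW
    EXECUTE PROCEDURE check_chance_row();

For multi-row statements this runs the sum once per modified row, so
it only wins if most of your updates leave chance alone.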
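Jeff's PL/Perl function isn't reproduced in this message; a rough
equivalent of the same accumulate-until-r loop, purely for illustration
(a sketch, not necessarily what was posted), would be something like:

CREATE FUNCTION randrec_perl() RETURNS integer AS $$
    # spi_exec_query fetches the entire result set into memory at
    # once, which is why memory use grows with the table.
    my $rv = spi_exec_query('SELECT i, chance FROM r1');
    my $r = rand();
    my $accum = 0.0;
    my $row;
    foreach my $rec (@{$rv->{rows}}) {
        $row = $rec;
        $accum += $rec->{chance};
        last if $accum >= $r;
    }
    return $row->{i};
$$ LANGUAGE plperl;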