Re: random record from small set - Mailing list pgsql-general
From | Jeff Davis |
---|---|
Subject | Re: random record from small set |
Date | |
Msg-id | 1108448114.354.229.camel@jeff |
In response to | Re: random record from small set (Michael Fuhr <mike@fuhr.org>) |
List | pgsql-general |
Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As for the
constraint trigger, I will see whether it becomes a problem before I
worry about its performance. As for whether those values add up to 1.0,
I'll just check to make sure the sum is fairly close to 1.0 :)

The only real difference from your results was that I didn't notice much
of a gap between double precision and numeric for the underlying table's
chance attribute, using your generate_series()-based query. Perhaps that
was because I was using ALTER TABLE to modify r1's "chance" attribute.
One thing I noticed there was that I had to CREATE OR REPLACE the plpgsql
function for that to work. Perhaps it's a bug? Here's a test case:

test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$
DECLARE
    accum double precision := 0.0;
    row record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ language plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2);
INSERT 17778 1
test=# select err();
 err
-----
   2
(1 row)

test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
 err
-----
  10
(1 row)

And if you keep playing around with the type and values you can get
other errors like:

ERROR:  type "double precision" value out of range: underflow
CONTEXT:  PL/pgSQL function "err" line 1 at assignment

Or:

ERROR:  invalid memory alloc request size 4294967290
CONTEXT:  PL/pgSQL function "randrec" line 7 at assignment

If any more information would be helpful, let me know. It looks a little
like a bug; perhaps we should throw an error when dependent functions are
called after the underlying types have changed? Or, if we can recognize
that situation, we might as well recompile the function and proceed
without error.

Regards,
	Jeff Davis
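P.S. In case it's useful, here is the rough shape of the statement-level
trigger you suggest below -- just an untested sketch, assuming the table
is r1 with a double precision "chance" column, with made-up names
(check_chance_sum, r1_chance_sum) and an arbitrary 1e-9 tolerance instead
of the equality operator, for the reason you give:

CREATE OR REPLACE FUNCTION check_chance_sum() RETURNS trigger AS $$
DECLARE
    total double precision;
BEGIN
    -- Recompute the sum after every statement that touches r1.
    SELECT sum(chance) INTO total FROM r1;
    IF total IS NULL OR abs(total - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chance values sum to %, not 1.0', total;
    END IF;
    RETURN NULL;  -- return value is ignored for statement-level AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_chance_sum
    AFTER INSERT OR UPDATE OR DELETE ON r1
    FOR EACH STATEMENT
    EXECUTE PROCEDURE check_chance_sum();

If that turns out to be too slow for updates that don't touch chance, the
row-level variant that skips the check when NEW.chance = OLD.chance would
be the next thing to try.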
On Mon, 2005-02-14 at 22:18 -0700, Michael Fuhr wrote:
> On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
> >
> > * Am I right to use NUMERIC for the chance attribute?
>
> I ran tests with numeric, real, and double precision; double precision
> was consistently about 10% faster than the others.  I used the
> sample data you posted and the PL/pgSQL function shown later in
> this message.
>
> > * Does perl's arithmetic leave me with the chance that those numeric
> > values don't add up to 1.00 (and in this case that could mean an
> > infinite loop)?
>
> I'd suggest looping through the records so you can't possibly end
> up in an infinite loop.
>
> > * In my design I'll need a constraint trigger making sure that the
> > numbers add up to 1.00.
>
> If the sum must be exactly 1.00 then be careful if you use double
> precision -- if you test with the equality operator then the check
> might fail because the sum is 0.9999999987.
>
> > Will that be a performance problem for operations on the table that
> > don't modify the chance attribute?
>
> Any trigger that you didn't otherwise need will cause a performance
> hit.  I'd expect a statement-level AFTER trigger to have the lowest
> impact since it would run only once per statement, whereas a row-level
> trigger might run multiple times per statement.  On the other hand,
> if you make a lot of updates that don't modify the chance attribute,
> then you might want to try a row-level trigger that skips the check
> when NEW.chance = OLD.chance.  I'd suggest testing different
> methods under expected conditions and see which has the lowest impact.
>
> > * Is there a better way?
>
> > * Does spi_exec_query pull the entire result set into memory at once?
>
> I think it does.  I ran some tests with the following PL/pgSQL
> function and got significantly faster times than with PL/Perl,
> especially as the data set grew:
>
> CREATE FUNCTION randrec() RETURNS integer AS $$
> DECLARE
>     r double precision := random();
>     accum double precision := 0.0;
>     row record;
> BEGIN
>     FOR row IN SELECT * FROM r1 LOOP
>         accum := accum + row.chance;
>         IF accum >= r THEN
>             EXIT;
>         END IF;
>     END LOOP;
>
>     RETURN row.i;
> END;
> $$ LANGUAGE plpgsql VOLATILE;
>
> SELECT * FROM r1;
>  i | chance
> ---+--------
>  1 |   0.25
>  2 |   0.20
>  3 |   0.15
>  4 |   0.10
>  5 |   0.30
>
> SELECT i, count(*)
> FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2467
>  2 |  1939
>  3 |  1536
>  4 |  1016
>  5 |  3042
> (5 rows)
> Time: 3300.710 ms
>
> Here are the results using the PL/Perl function you posted:
>
> SELECT i, count(*)
> FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2501
>  2 |  2040
>  3 |  1463
>  4 |   994
>  5 |  3002
> (5 rows)
> Time: 8765.584 ms
>
> I ran each query several times and those times were typical of both.
> With a data set of 100 records, the PL/pgSQL function ran in about
> 14 seconds, while the PL/Perl function took around 65 seconds.
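P.P.S. For the archives, the ALTER TABLE I mention above was along these
lines for each type I compared (a sketch from memory, not a copy-paste of
my session):

ALTER TABLE r1 ALTER COLUMN chance TYPE double precision;
-- rerun the generate_series() query above, then switch back:
ALTER TABLE r1 ALTER COLUMN chance TYPE numeric;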