Re: random record from small set - Mailing list pgsql-general

From Jeff Davis
Subject Re: random record from small set
Date
Msg-id 1108448114.354.229.camel@jeff
In response to Re: random record from small set  (Michael Fuhr <mike@fuhr.org>)
List pgsql-general
Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As for
the constraint trigger, I will see whether it becomes a problem before
I worry about its performance. As for whether those values add up to
1.0, I'll just check that the sum is fairly close to 1.0 :)
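
A quick note on that "fairly close" check: with double precision it
shouldn't be an equality test. A minimal sketch against the r1 table
from your example below, with a tolerance I picked arbitrarily:

```sql
-- Sanity check: the chances should sum to (nearly) 1.0.
-- The 1e-6 tolerance is an arbitrary choice for illustration.
SELECT abs(sum(chance) - 1.0) < 1e-6 AS close_enough FROM r1;
```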

The only real difference I saw was that it didn't matter much whether
the underlying table's chance attribute was double precision or
numeric. I used your generate_series()-based query.

Perhaps that was because I was using ALTER TABLE to modify r1's "chance"
attribute. One thing I noticed there was that I had to CREATE OR
REPLACE the plpgsql function for the type change to take effect.
Perhaps it's a bug?
Here's a test case:

test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$
DECLARE
    accum double precision := 0.0;
    row   record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ language plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2);
INSERT 17778 1
test=# select err();
 err
-----
   2
(1 row)
test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
 err
-----
  10
(1 row)

And if you keep playing around with the type and values you can get
other errors like:
ERROR:  type "double precision" value out of range: underflow
CONTEXT:  PL/pgSQL function "err" line 1 at assignment
Or:
ERROR:  invalid memory alloc request size 4294967290
CONTEXT:  PL/pgSQL function "randrec" line 7 at assignment

If any more information would be helpful, let me know. It looks a
little like a bug; perhaps we should throw an error when dependent
functions are called after the underlying types have changed? Or, if
we can recognize that case, the function might as well be recompiled
so the call can proceed without error.
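
For anyone who runs into the same thing: the workaround that worked for
me was simply to re-issue the (identical) function definition after the
ALTER TABLE, which makes plpgsql drop its stale cached plan:

```sql
-- Re-defining the function (same body) after the ALTER TABLE clears
-- the cached plan that still assumes the old column type.
CREATE OR REPLACE FUNCTION err() RETURNS int AS $$
DECLARE
    accum double precision := 0.0;
    row   record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ LANGUAGE plpgsql;
```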

Regards,
    Jeff Davis


On Mon, 2005-02-14 at 22:18 -0700, Michael Fuhr wrote:
> On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
> >
> > * Am I right to use NUMERIC for the chance attribute?
>
> I ran tests with numeric, real, and double precision; double precision
> was consistently about 10% faster than the others.  I used the
> sample data you posted and the PL/pgSQL function shown later in
> this message.
>
> > * Does perl's arithmetic leave me with the chance that those numeric
> > values don't add up to 1.00 (and in this case that could mean an
> > infinite loop)?
>
> I'd suggest looping through the records so you can't possibly end
> up in an infinite loop.
>
> > * In my design I'll need a constraint trigger making sure that the
> > numbers add up to 1.00.
>
> If the sum must be exactly 1.00 then be careful if you use double
> precision -- if you test with the equality operator then the check
> might fail because the sum is 0.9999999987.
>
> > Will that be a performance problem for operations on the table that
> > don't modify the chance attribute?
>
> Any trigger that you didn't otherwise need will cause a performance
> hit.  I'd expect a statement-level AFTER trigger to have the lowest
> impact since it would run only once per statement, whereas a row-level
> trigger might run multiple times per statement.  On the other hand,
> if you make a lot of updates that don't modify the chance attribute,
> then you might want to try a row-level trigger that skips the check
> when NEW.chance = OLD.chance.  I'd suggest testing different
> methods under expected conditions and see which has the lowest impact.
>
> > * Is there a better way?
> > * Does spi_exec_query pull the entire result set into memory at once?
>
> I think it does.  I ran some tests with the following PL/pgSQL
> function and got significantly faster times than with PL/Perl,
> especially as the data set grew:
>
> CREATE FUNCTION randrec() RETURNS integer AS $$
> DECLARE
>     r      double precision := random();
>     accum  double precision := 0.0;
>     row    record;
> BEGIN
>     FOR row IN SELECT * FROM r1 LOOP
>         accum := accum + row.chance;
>         IF accum >= r THEN
>             EXIT;
>         END IF;
>     END LOOP;
>
>     RETURN row.i;
> END;
> $$ LANGUAGE plpgsql VOLATILE;
>
> SELECT * FROM r1;
>  i | chance
> ---+--------
>  1 |   0.25
>  2 |   0.20
>  3 |   0.15
>  4 |   0.10
>  5 |   0.30
>
> SELECT i, count(*)
> FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2467
>  2 |  1939
>  3 |  1536
>  4 |  1016
>  5 |  3042
> (5 rows)
> Time: 3300.710 ms
>
> Here are the results using the PL/Perl function you posted:
>
> SELECT i, count(*)
> FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2501
>  2 |  2040
>  3 |  1463
>  4 |   994
>  5 |  3002
> (5 rows)
> Time: 8765.584 ms
>
> I ran each query several times and those times were typical of both.
> With a data set of 100 records, the PL/pgSQL function ran in about
> 14 seconds, while the PL/Perl function took around 65 seconds.
>

