Re: random record from small set - Mailing list pgsql-general

From Michael Fuhr
Subject Re: random record from small set
Date
Msg-id 20050215051843.GA40527@winnie.fuhr.org
Whole thread Raw
In response to random record from small set  (Jeff Davis <jdavis-pgsql@empires.org>)
Responses Re: random record from small set
List pgsql-general
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?

I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others.  I used the
sample data you posted and the PL/pgSQL function shown later in
this message.

> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?

I'd suggest looping through the records so you can't possibly end
up in an infinite loop.

> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.

If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.

> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?

Any trigger that you didn't otherwise need will cause a performance
hit.  I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement.  On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance.  I'd suggesting testing different
methods under expected conditions and see which has the lowest impact.

> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?

I think it does.  I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:

CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
    r      double precision := random();
    accum  double precision := 0.0;
    row    record;
BEGIN
    FOR row IN SELECT * FROM r1 LOOP
        accum := accum + row.chance;
        IF accum >= r THEN
            EXIT;
        END IF;
    END LOOP;

    RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT * FROM r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30

SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
 i | count
---+-------
 1 |  2467
 2 |  1939
 3 |  1536
 4 |  1016
 5 |  3042
(5 rows)
Time: 3300.710 ms

Here are the results using the PL/Perl function you posted:

SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
 i | count
---+-------
 1 |  2501
 2 |  2040
 3 |  1463
 4 |   994
 5 |  3002
(5 rows)
Time: 8765.584 ms

I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Lost rows/data corruption?
Next
From: "Andrew Hall"
Date:
Subject: Re: Lost rows/data corruption?