Thread: random record from small set

random record from small set

From

Jeff Davis

Date:

15 February 2005, 02:19:04

I am trying to retrieve a random record (according to a chance
attribute) from a small set of records, each with a "chance" attribute.
This may eventually be somwhat of a performance concern, so I'd like to
make sure I'm doing this right.

Here's what I have so far:

create table r1 (
   i int,
   chance numeric
)
create or replace function randrec() returns int as $$
    $res = spi_exec_query('select i,chance from r1');
    $r = rand;
    $accum = 0;
    $i = 0;
    while($accum < $r) {
        $accum += $res->{rows}[$i++]->{chance}
    }
    return $res->{rows}[$i-1]->{i};
$$ language plperl;

test=# select * from r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30


That seems to work, in that out of 10k times, I got the following
numbers of each:
1 2479
2 1959
3 1522
4 950
5 3090

But I have a few questions:
* Am I right to use NUMERIC for the chance attribute?
* Does perl's arithmetic leave me with the chance that those numeric
values don't add up to 1.00 (and in this case that could mean an
infinite loop)?
* In my design I'll need a constraint trigger making sure that the
numbers add up to 1.00. Will that be a performance problem for
operations on the table that don't modify the chance attribute?
* Is there a better way?
* Does spi_exec_query pull the entire result set into memory at once? Is
there a point at which performance could be a serious problem if there
are a large number of items to select among?

Regards,
    Jeff Davis

Re: random record from small set

From

Michael Fuhr

Date:

15 February 2005, 05:18:57

On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?

I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others.  I used the
sample data you posted and the PL/pgSQL function shown later in
this message.

> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?

I'd suggest looping through the records so you can't possibly end
up in an infinite loop.

> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.

If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.

> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?

Any trigger that you didn't otherwise need will cause a performance
hit.  I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement.  On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance.  I'd suggesting testing different
methods under expected conditions and see which has the lowest impact.

> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?

I think it does.  I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:

CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
    r      double precision := random();
    accum  double precision := 0.0;
    row    record;
BEGIN
    FOR row IN SELECT * FROM r1 LOOP
        accum := accum + row.chance;
        IF accum >= r THEN
            EXIT;
        END IF;
    END LOOP;

    RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT * FROM r1;
 i | chance
---+--------
 1 |   0.25
 2 |   0.20
 3 |   0.15
 4 |   0.10
 5 |   0.30

SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
 i | count
---+-------
 1 |  2467
 2 |  1939
 3 |  1536
 4 |  1016
 5 |  3042
(5 rows)
Time: 3300.710 ms

Here are the results using the PL/Perl function you posted:

SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
 i | count
---+-------
 1 |  2501
 2 |  2040
 3 |  1463
 4 |   994
 5 |  3002
(5 rows)
Time: 8765.584 ms

I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: random record from small set

From

Jeff Davis

Date:

15 February 2005, 06:15:47

Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As far as
the constraint trigger, I will see if it becomes a problem before I
worry about its performance. As far as whether those values add up to
1.0, I'll just check to make sure it's fairly close to 1.0 :)

The only real difference that I saw was that I didn't notice much
difference if the underlying table's chance attribute was double
precision vs. numeric. I used your generate_series()-based query.

Perhaps that was because I was using ALTER TABLE to modify r1's "chance"
attribute. One thing that I noticed there was that I had to CREATE OR
REPLACE the plpgsql function for that to work. Perhaps it was a bug?
Here's a test case:

test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$ DECLARE accum
double precision := 0.0; row record; BEGIN FOR row IN SELECT i FROM t1
LOOP accum := accum + row.i; END LOOP; RETURN row.i; END; $$ language
plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2); INSERT 17778 1
test=# select err();
 err
-----
   2
(1 row)
test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
 err
-----
  10
(1 row)

And if you keep playing around with the type and values you can get
other errors like:
ERROR:  type "double precision" value out of range: underflow
CONTEXT:  PL/pgSQL function "err" line 1 at assignment
Or:
ERROR:  invalid memory alloc request size 4294967290
CONTEXT:  PL/pgSQL function "randrec" line 7 at assignment

If any more information would be helpful someone let me know. It looks a
little like a bug; perhaps we should throw an error when dependent
functions are called after the underlying types have changed? Or I
suppose if we can recognize that, it might as well recompile the
function and proceed without error.

Regards,
    Jeff Davis


On Mon, 2005-02-14 at 22:18 -0700, Michael Fuhr wrote:
> On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
> >
> > * Am I right to use NUMERIC for the chance attribute?
>
> I ran tests with numeric, real, and double precision; double precision
> was consistently about 10% faster than the others.  I used the
> sample data you posted and the PL/pgSQL function shown later in
> this message.
>
> > * Does perl's arithmetic leave me with the chance that those numeric
> > values don't add up to 1.00 (and in this case that could mean an
> > infinite loop)?
>
> I'd suggest looping through the records so you can't possibly end
> up in an infinite loop.
>
> > * In my design I'll need a constraint trigger making sure that the
> > numbers add up to 1.00.
>
> If the sum must be exactly 1.00 then be careful if you use double
> precision -- if you test with the equality operator then the check
> might fail because the sum is 0.9999999987.
>
> > Will that be a performance problem for operations on the table that
> > don't modify the chance attribute?
>
> Any trigger that you didn't otherwise need will cause a performance
> hit.  I'd expect a statement-level AFTER trigger to have the lowest
> impact since it would run only once per statement, whereas a row-level
> trigger might run multiple times per statement.  On the other hand,
> if you make a lot of updates that don't modify the chance attribute,
> then you might want to try a row-level trigger that skips the check
> when NEW.chance = OLD.chance.  I'd suggesting testing different
> methods under expected conditions and see which has the lowest impact.
>
> > * Is there a better way?
> > * Does spi_exec_query pull the entire result set into memory at once?
>
> I think it does.  I ran some tests with the following PL/pgSQL
> function and got significantly faster times than with PL/Perl,
> especially as the data set grew:
>
> CREATE FUNCTION randrec() RETURNS integer AS $$
> DECLARE
>     r      double precision := random();
>     accum  double precision := 0.0;
>     row    record;
> BEGIN
>     FOR row IN SELECT * FROM r1 LOOP
>         accum := accum + row.chance;
>         IF accum >= r THEN
>             EXIT;
>         END IF;
>     END LOOP;
>
>     RETURN row.i;
> END;
> $$ LANGUAGE plpgsql VOLATILE;
>
> SELECT * FROM r1;
>  i | chance
> ---+--------
>  1 |   0.25
>  2 |   0.20
>  3 |   0.15
>  4 |   0.10
>  5 |   0.30
>
> SELECT i, count(*)
> FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2467
>  2 |  1939
>  3 |  1536
>  4 |  1016
>  5 |  3042
> (5 rows)
> Time: 3300.710 ms
>
> Here are the results using the PL/Perl function you posted:
>
> SELECT i, count(*)
> FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
> GROUP BY i
> ORDER by i;
>  i | count
> ---+-------
>  1 |  2501
>  2 |  2040
>  3 |  1463
>  4 |   994
>  5 |  3002
> (5 rows)
> Time: 8765.584 ms
>
> I ran each query several times and those times were typical of both.
> With a data set of 100 records, the PL/pgSQL function ran in about
> 14 seconds, while the PL/Perl function took around 65 seconds.
>

Re: random record from small set

From

Jan Poslusny

Date:

15 February 2005, 16:20:29

And what about another data representation like

create table r1 (
   i int,
   chance_from numeric,
   chance_to numeric
)
, you can select one random row in one select, for instance
select * from r1 where chance_from <= $rnd and chance_to > $rnd;

I see these advantages
- Only one select.
- Indices can improve performance if r1 has many rows.

and disadvantage
- Tricky update


Jeff Davis wrote:

>I am trying to retrieve a random record (according to a chance
>attribute) from a small set of records, each with a "chance" attribute.
>This may eventually be somwhat of a performance concern, so I'd like to
>make sure I'm doing this right.
>
>Here's what I have so far:
>
>create table r1 (
>   i int,
>   chance numeric
>)
>create or replace function randrec() returns int as $$
>    $res = spi_exec_query('select i,chance from r1');
>    $r = rand;
>    $accum = 0;
>    $i = 0;
>    while($accum < $r) {
>        $accum += $res->{rows}[$i++]->{chance}
>    }
>    return $res->{rows}[$i-1]->{i};
>$$ language plperl;
>
>test=# select * from r1;
> i | chance
>---+--------
> 1 |   0.25
> 2 |   0.20
> 3 |   0.15
> 4 |   0.10
> 5 |   0.30
>
>
>That seems to work, in that out of 10k times, I got the following
>numbers of each:
>1 2479
>2 1959
>3 1522
>4 950
>5 3090
>
>But I have a few questions:
>* Am I right to use NUMERIC for the chance attribute?
>* Does perl's arithmetic leave me with the chance that those numeric
>values don't add up to 1.00 (and in this case that could mean an
>infinite loop)?
>* In my design I'll need a constraint trigger making sure that the
>numbers add up to 1.00. Will that be a performance problem for
>operations on the table that don't modify the chance attribute?
>* Is there a better way?
>* Does spi_exec_query pull the entire result set into memory at once? Is
>there a point at which performance could be a serious problem if there
>are a large number of items to select among?
>
>Regards,
>    Jeff Davis
>
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>
>

Re: random record from small set

From

Jan Poslusny

Date:

15 February 2005, 16:25:22

Or
create table r1 (
  i int,
  chance_from numeric
)
and
select * from r1 where chance_from <= $rnd order by chance_from desc
limit 1;
which can be easier updated...
Just ideas, I has never tested it...

Jan Poslusny wrote:

> And what about another data representation like
>
> create table r1 (
>   i int,
>   chance_from numeric,
>   chance_to numeric
> )
> , you can select one random row in one select, for instance
> select * from r1 where chance_from <= $rnd and chance_to > $rnd;
>
> I see these advantages
> - Only one select.
> - Indices can improve performance if r1 has many rows.
> and disadvantage
> - Tricky update
>
>
> Jeff Davis wrote:
>
>> I am trying to retrieve a random record (according to a chance
>> attribute) from a small set of records, each with a "chance" attribute.
>> This may eventually be somwhat of a performance concern, so I'd like to
>> make sure I'm doing this right.
>>
>> Here's what I have so far:
>>
>> create table r1 (
>>   i int,
>>   chance numeric
>> )
>> create or replace function randrec() returns int as $$
>>     $res = spi_exec_query('select i,chance from r1');
>>     $r = rand;
>>     $accum = 0;
>>     $i = 0;
>>     while($accum < $r) {
>>         $accum += $res->{rows}[$i++]->{chance}
>>     }
>>     return $res->{rows}[$i-1]->{i};
>> $$ language plperl;
>>
>> test=# select * from r1;
>> i | chance
>> ---+--------
>> 1 |   0.25
>> 2 |   0.20
>> 3 |   0.15
>> 4 |   0.10
>> 5 |   0.30
>>
>>
>> That seems to work, in that out of 10k times, I got the following
>> numbers of each:
>> 1 2479
>> 2 1959
>> 3 1522
>> 4 950
>> 5 3090
>>
>> But I have a few questions:
>> * Am I right to use NUMERIC for the chance attribute?
>> * Does perl's arithmetic leave me with the chance that those numeric
>> values don't add up to 1.00 (and in this case that could mean an
>> infinite loop)?
>> * In my design I'll need a constraint trigger making sure that the
>> numbers add up to 1.00. Will that be a performance problem for
>> operations on the table that don't modify the chance attribute?
>> * Is there a better way?
>> * Does spi_exec_query pull the entire result set into memory at once? Is
>> there a point at which performance could be a serious problem if there
>> are a large number of items to select among?
>>
>> Regards,
>>     Jeff Davis
>>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 2: you can get off all lists at once with the unregister command
>>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>>
>>
>>
>

Re: random record from small set

From

"Greg Sabino Mullane"

Date:

16 February 2005, 03:52:22

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> Here's what I have so far:

If you go that route, make sure you check for edge cases, such
as reaching the end of the rows without hitting your number:

        while($accum < $r) {
                die qq{Ran out of rows!\n} if ! defined $res->{rows}[$i];

Also, your query should be "select i,chance from r1 ORDER BY random()"
else you are getting back the same order each time (until a row is
changed) which certainly reduces the randomness.

Anyway, here's another solution, which shifts as much work as possible
off of the actual random row call, and uses a trigger to keep things
in sync. I switched the 'chance' from 0.25 to 25 (numeric to int) to make
things easier to read.

UPDATE r1 SET chance = chance*100;
ALTER TABLE r1 ALTER COLUMN chance TYPE INTEGER;

CREATE TABLE r2(integer);

CREATE OR REPLACE FUNCTION r1_cleanup() RETURNS trigger LANGUAGE plpgsql AS
$$
DECLARE
  mychance integer;
BEGIN
IF TG_OP = 'DELETE' THEN
  DELETE FROM r2 WHERE id = OLD.i;
ELSE
  IF TG_OP = 'UPDATE' THEN
    DELETE FROM r2 WHERE id = OLD.i or id = NEW.i;
  END IF;
  SELECT chance FROM r1 WHERE i=NEW.i INTO mychance;
        LOOP
          mychance := mychance - 1;
                EXIT WHEN mychance < 0;
                INSERT INTO r2 VALUES (NEW.i);
        END LOOP;
END IF;
RETURN NULL;
END;
$$;

CREATE TRIGGER r1_trigger AFTER INSERT or UPDATE or DELETE ON r1
FOR EACH ROW EXECUTE PROCEDURE r1_cleanup();

UPDATE r1 SET i=i; -- To initially populate r2

SELECT id FROM r2 ORDER BY random() LIMIT 1; -- repeat as needed


- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200502152252
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCEsOvvJuQZxSWSsgRAjysAJ9X3JpMfuXV2ST049bhCWuJOp6Y1ACg/sNx
PXqxVlfvlsKMTBDDhsh3BmU=
=7/IE
-----END PGP SIGNATURE-----