Thread: Recursive CTEs and randomness - is there something I'm missing?

Recursive CTEs and randomness - is there something I'm missing?

From
Pól Ua Laoínecháin
Date:
Hi all,

I'm trying to generate a series of random strings (capital A-Z)
between 2 and 5 characters long (say, 10^6).

I'm using a recursive CTE to achieve this.

A fiddle is available here:

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=206a0c522e853f043c7e633d19852de2

The SQL:


WITH RECURSIVE rand (num, md, a_2_s) AS
(
  SELECT
    1,
    MD5(RANDOM()::TEXT),
    ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
                          FROM GENERATE_SERIES(1, 5)), '')
  UNION
    SELECT num + 1,
    MD5(RANDOM()::TEXT),
    ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
                          FROM GENERATE_SERIES(1, 5)), '')
  FROM rand
  WHERE num < 5
)
SELECT * FROM rand;

A typical result is shown below:

1  974ee059a1902e5ca1ec73c91275984b     GYXYS
2  6cf5a974d5859eae23cdb9c310e3a3bf       YFDPT
3  fa6be95eb720fe6f80c7c8fb6ba11171         YFDPT
4  fa54913b0bb43de0025b153fd71a5334      YFDPT
5  523fab9bdc6c4c51a89e0d901273fb69       YFDPT

Now, the interesting thing is the ARRAY_TO_STRING.

The fact that the last 4 are identical is not a coincidence. If I put
100 in the GENERATE_SERIES, I still get the same result, the first and
second records are different, but ALL subsequent instances of the
ARRAY_TO_STRING are identical! You can test this on the fiddle!

Now, I'm puzzled by this, since the MD5 records are ALWAYS different.

I would be grateful if anybody could explain this - I need the
ARRAY_TO_STRING, because an MD5 cannot be guaranteed to have at least
5 letters.

Should you require any further information, please don't hesitate to contact me.

TIA and rgs,


Pól...



Re: Recursive CTEs and randomness - is there something I'm missing?

From
Tom Lane
Date:
=?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp@tcd.ie> writes:
> The SQL:

> WITH RECURSIVE rand (num, md, a_2_s) AS
> (
>   SELECT
>     1,
>     MD5(RANDOM()::TEXT),
>     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>                           FROM GENERATE_SERIES(1, 5)), '')
>   UNION
>     SELECT num + 1,
>     MD5(RANDOM()::TEXT),
>     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>                           FROM GENERATE_SERIES(1, 5)), '')
>   FROM rand
>   WHERE num < 5
> )
> SELECT * FROM rand;

> A typical result is shown below:

> 1  974ee059a1902e5ca1ec73c91275984b     GYXYS
> 2  6cf5a974d5859eae23cdb9c310e3a3bf       YFDPT
> 3  fa6be95eb720fe6f80c7c8fb6ba11171         YFDPT
> 4  fa54913b0bb43de0025b153fd71a5334      YFDPT
> 5  523fab9bdc6c4c51a89e0d901273fb69       YFDPT

> The fact that the last 4 are identical is not a coincidence. If I put
> 100 in the GENERATE_SERIES, I still get the same result, the first and
> second records are different, but ALL subsequent instances of the
> ARRAY_TO_STRING are identical!

Yeah, it's weird.  A look at EXPLAIN VERBOSE offers some insight:

                                                       QUERY PLAN


------------------------------------------------------------------------------------------------------------------------
 CTE Scan on rand  (cost=4.75..5.37 rows=31 width=68)
   Output: rand.num, rand.md, rand.a_2_s
   CTE rand
     ->  Recursive Union  (cost=0.13..4.75 rows=31 width=68)
           ->  Result  (cost=0.13..0.15 rows=1 width=68)
                 Output: 1, md5((random())::text), array_to_string($1, ''::text)
                 InitPlan 1 (returns $1)
                   ->  Function Scan on pg_catalog.generate_series  (cost=0.00..0.13 rows=5 width=32)
                         Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                         Function Call: generate_series(1, 5)
           ->  WorkTable Scan on rand rand_1  (cost=0.13..0.40 rows=3 width=68)
                 Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
                 Filter: (rand_1.num < 5)
                 InitPlan 2 (returns $2)
                   ->  Function Scan on pg_catalog.generate_series generate_series_1  (cost=0.00..0.13 rows=5 width=32)
                         Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                         Function Call: generate_series(1, 5)
(17 rows)

The ARRAY sub-selects are being done as initplans, not subplans,
which means they're only done once not once per row.  This is correct
so far as the planner is concerned because those sub-selects are
"uncorrelated", ie they use no values from the outer query.

There is room to argue that because the sub-selects contain volatile
functions, they ought not be optimized into initplans.  We have
traditionally not considered that, however, and I'm afraid that a
lot of people have written queries that depend on it.  For example,
there's lore out there that replacing
    WHERE mycol < random()
with
    WHERE mycol < (SELECT random())
will freeze the random() result as a single value rather than
computing a new value for each row, which sometimes is what you
need.  These days, better practice would be to put the random()
call in a CTE, but there's still a lot of code out there that
does it as above.

For your immediate problem, since you don't care that much
(I suppose) about exactly how the strings are generated, you
could fix the issue by making the sub-selects depend on
"num" somehow.  Or possibly there's a way to do it without
a sub-select.  On the whole this looks like a mighty expensive
way to generate a random string, so I'd be inclined to look
for other implementations.

            regards, tom lane



Re: Recursive CTEs and randomness - is there something I'm missing?

From
Adrian Klaver
Date:
On 2/28/20 9:45 AM, Tom Lane wrote:
> =?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp@tcd.ie> writes:
>> The SQL:
> 
>> WITH RECURSIVE rand (num, md, a_2_s) AS
>> (
>>    SELECT
>>      1,
>>      MD5(RANDOM()::TEXT),
>>      ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>>                            FROM GENERATE_SERIES(1, 5)), '')
>>    UNION
>>      SELECT num + 1,
>>      MD5(RANDOM()::TEXT),
>>      ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>>                            FROM GENERATE_SERIES(1, 5)), '')
>>    FROM rand
>>    WHERE num < 5
>> )
>> SELECT * FROM rand;
> 
>> A typical result is shown below:
> 
>> 1  974ee059a1902e5ca1ec73c91275984b     GYXYS
>> 2  6cf5a974d5859eae23cdb9c310e3a3bf       YFDPT
>> 3  fa6be95eb720fe6f80c7c8fb6ba11171         YFDPT
>> 4  fa54913b0bb43de0025b153fd71a5334      YFDPT
>> 5  523fab9bdc6c4c51a89e0d901273fb69       YFDPT
> 
>> The fact that the last 4 are identical is not a coincidence. If I put
>> 100 in the GENERATE_SERIES, I still get the same result, the first and
>> second records are different, but ALL subsequent instances of the
>> ARRAY_TO_STRING are identical!
> 
> Yeah, it's weird.  A look at EXPLAIN VERBOSE offers some insight:
> 
>                                                         QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------
>   CTE Scan on rand  (cost=4.75..5.37 rows=31 width=68)
>     Output: rand.num, rand.md, rand.a_2_s
>     CTE rand
>       ->  Recursive Union  (cost=0.13..4.75 rows=31 width=68)
>             ->  Result  (cost=0.13..0.15 rows=1 width=68)
>                   Output: 1, md5((random())::text), array_to_string($1, ''::text)
>                   InitPlan 1 (returns $1)
>                     ->  Function Scan on pg_catalog.generate_series  (cost=0.00..0.13 rows=5 width=32)
>                           Output: chr((('65'::double precision + round((random() * '25'::double
precision))))::integer)
>                           Function Call: generate_series(1, 5)
>             ->  WorkTable Scan on rand rand_1  (cost=0.13..0.40 rows=3 width=68)
>                   Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
>                   Filter: (rand_1.num < 5)
>                   InitPlan 2 (returns $2)
>                     ->  Function Scan on pg_catalog.generate_series generate_series_1  (cost=0.00..0.13 rows=5
width=32)
>                           Output: chr((('65'::double precision + round((random() * '25'::double
precision))))::integer)
>                           Function Call: generate_series(1, 5)
> (17 rows)
> 
> The ARRAY sub-selects are being done as initplans, not subplans,
> which means they're only done once not once per row.  This is correct
> so far as the planner is concerned because those sub-selects are
> "uncorrelated", ie they use no values from the outer query.
> 
> There is room to argue that because the sub-selects contain volatile
> functions, they ought not be optimized into initplans.  We have
> traditionally not considered that, however, and I'm afraid that a
> lot of people have written queries that depend on it.  For example,
> there's lore out there that replacing
>     WHERE mycol < random()
> with
>     WHERE mycol < (SELECT random())
> will freeze the random() result as a single value rather than
> computing a new value for each row, which sometimes is what you
> need.  These days, better practice would be to put the random()
> call in a CTE, but there's still a lot of code out there that
> does it as above.
> 
> For your immediate problem, since you don't care that much
> (I suppose) about exactly how the strings are generated, you
> could fix the issue by making the sub-selects depend on
> "num" somehow.  Or possibly there's a way to do it without
> a sub-select.  On the whole this looks like a mighty expensive
> way to generate a random string, so I'd be inclined to look
> for other implementations.

An off the cuff Python solution:

CREATE OR REPLACE FUNCTION public.upper_random()
  RETURNS character varying
  LANGUAGE plpythonu
AS $function$
from string import ascii_uppercase as au
import random
return  ''.join(random.sample(au,5))
$function$

SELECT
     num,
     MD5(RANDOM()::TEXT),
     upper_random()
FROM
    generate_series(1, 5) AS t(num);

  num |               md5                | upper_random
-----+----------------------------------+--------------
    1 | 5896a9e3efa53027873d7999e58904ae | TRGEB
    2 | 9f9677c32a64b6eae73759a69e1acfff | TFQHG
    3 | 5aefda5b498215065e01ba697d79caee | KYBZS
    4 | 1605b7fa54fef9bdc5c49f9b79810e07 | EUDML
    5 | 4ba59880c1c67bca1d1f184bda5350b6 | RFPGO



> 
>             regards, tom lane
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com