Re: Recursive CTEs and randomness - is there something I'm missing? - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Recursive CTEs and randomness - is there something I'm missing?
Date
Msg-id f3d0e5dc-33fc-0ec1-caa7-d30f042a02bc@aklaver.com
Whole thread Raw
In response to Re: Recursive CTEs and randomness - is there something I'm missing?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On 2/28/20 9:45 AM, Tom Lane wrote:
> =?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp@tcd.ie> writes:
>> The SQL:
> 
>> WITH RECURSIVE rand (num, md, a_2_s) AS
>> (
>>    SELECT
>>      1,
>>      MD5(RANDOM()::TEXT),
>>      ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>>                            FROM GENERATE_SERIES(1, 5)), '')
>>    UNION
>>      SELECT num + 1,
>>      MD5(RANDOM()::TEXT),
>>      ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>>                            FROM GENERATE_SERIES(1, 5)), '')
>>    FROM rand
>>    WHERE num < 5
>> )
>> SELECT * FROM rand;
> 
>> A typical result is shown below:
> 
>> 1  974ee059a1902e5ca1ec73c91275984b     GYXYS
>> 2  6cf5a974d5859eae23cdb9c310e3a3bf       YFDPT
>> 3  fa6be95eb720fe6f80c7c8fb6ba11171         YFDPT
>> 4  fa54913b0bb43de0025b153fd71a5334      YFDPT
>> 5  523fab9bdc6c4c51a89e0d901273fb69       YFDPT
> 
>> The fact that the last 4 are identical is not a coincidence. If I put
>> 100 in the GENERATE_SERIES, I still get the same result, the first and
>> second records are different, but ALL subsequent instances of the
>> ARRAY_TO_STRING are identical!
> 
> Yeah, it's weird.  A look at EXPLAIN VERBOSE offers some insight:
> 
>                                                         QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------
>   CTE Scan on rand  (cost=4.75..5.37 rows=31 width=68)
>     Output: rand.num, rand.md, rand.a_2_s
>     CTE rand
>       ->  Recursive Union  (cost=0.13..4.75 rows=31 width=68)
>             ->  Result  (cost=0.13..0.15 rows=1 width=68)
>                   Output: 1, md5((random())::text), array_to_string($1, ''::text)
>                   InitPlan 1 (returns $1)
>                     ->  Function Scan on pg_catalog.generate_series  (cost=0.00..0.13 rows=5 width=32)
>                           Output: chr((('65'::double precision + round((random() * '25'::double
precision))))::integer)
>                           Function Call: generate_series(1, 5)
>             ->  WorkTable Scan on rand rand_1  (cost=0.13..0.40 rows=3 width=68)
>                   Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
>                   Filter: (rand_1.num < 5)
>                   InitPlan 2 (returns $2)
>                     ->  Function Scan on pg_catalog.generate_series generate_series_1  (cost=0.00..0.13 rows=5
width=32)
>                           Output: chr((('65'::double precision + round((random() * '25'::double
precision))))::integer)
>                           Function Call: generate_series(1, 5)
> (17 rows)
> 
> The ARRAY sub-selects are being done as initplans, not subplans,
> which means they're only done once not once per row.  This is correct
> so far as the planner is concerned because those sub-selects are
> "uncorrelated", ie they use no values from the outer query.
> 
> There is room to argue that because the sub-selects contain volatile
> functions, they ought not be optimized into initplans.  We have
> traditionally not considered that, however, and I'm afraid that a
> lot of people have written queries that depend on it.  For example,
> there's lore out there that replacing
>     WHERE mycol < random()
> with
>     WHERE mycol < (SELECT random())
> will freeze the random() result as a single value rather than
> computing a new value for each row, which sometimes is what you
> need.  These days, better practice would be to put the random()
> call in a CTE, but there's still a lot of code out there that
> does it as above.
> 
> For your immediate problem, since you don't care that much
> (I suppose) about exactly how the strings are generated, you
> could fix the issue by making the sub-selects depend on
> "num" somehow.  Or possibly there's a way to do it without
> a sub-select.  On the whole this looks like a mighty expensive
> way to generate a random string, so I'd be inclined to look
> for other implementations.

An off the cuff Python solution:

CREATE OR REPLACE FUNCTION public.upper_random()
  RETURNS character varying
  LANGUAGE plpythonu
AS $function$
from string import ascii_uppercase as au
import random
return  ''.join(random.sample(au,5))
$function$

SELECT
     num,
     MD5(RANDOM()::TEXT),
     upper_random()
FROM
    generate_series(1, 5) AS t(num);

  num |               md5                | upper_random
-----+----------------------------------+--------------
    1 | 5896a9e3efa53027873d7999e58904ae | TRGEB
    2 | 9f9677c32a64b6eae73759a69e1acfff | TFQHG
    3 | 5aefda5b498215065e01ba697d79caee | KYBZS
    4 | 1605b7fa54fef9bdc5c49f9b79810e07 | EUDML
    5 | 4ba59880c1c67bca1d1f184bda5350b6 | RFPGO



> 
>             regards, tom lane
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Recursive CTEs and randomness - is there something I'm missing?
Next
From: "Peter J. Holzer"
Date:
Subject: Re: trouble making PG use my Perl