Re: random() (was Re: New GUC to sample log queries) - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: random() (was Re: New GUC to sample log queries)
Msg-id alpine.DEB.2.21.1812270833190.32444@lancre
In response to random() (was Re: New GUC to sample log queries)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: random() (was Re: New GUC to sample log queries)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hello all,

> I am not sure I buy the argument that this is a security hazard, but
> there are other reasons to question the use of random() here, some of
> which you stated yourself above.  Another one is that using random()
> for internal purposes interferes with applications' possible use of
> drandom() and setseed(), ie an application depending on getting a
> particular random series would see different behavior depending on
> whether this GUC is active or not.
>
> Another idea, which would be a lot less prone to breakage by
> add-on code, is to change drandom() and setseed() to themselves
> use pg_erand48() with a private seed.

My random thoughts about random, erand48, etc., which may be slightly off
topic; sorry if this is the case.

The word "random" is a misnomer for these pseudo-random generators, so
that "strong" has to be used for higher quality generators:-(

On Linux, random() is advertised with a period of around 2**36. Its
internal state is 8 to 256 bytes (default unclear, probably 8 bytes);
however, seeding with srandom() provides only 32 bits, which is a drawback.

The pg_erand48 code looks like crumbs from the 70's optimized for 16-bit
architectures (which it probably is not, but not going directly to 64 or
128 bits looks like a missed opportunity). Its internal state is 48 bits,
as its name implies, and its period is probably around 2**48, which is
2**12 better than the previous case: not an extraordinary achievement.

Initial seeding of any pseudo-random generator should NEVER use only pid &
time, which are too predictable, as already noted on the thread. It
should use a strong random source if available, and maybe some backup, e.g.
hashing logs. I think that this should be directly implemented, maybe with
some provision to set the seed manually for debugging purposes, although
with time-dependent features that may use random I'm not sure how far this
would go.
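
As a rough illustration of that seeding policy, here is a minimal sketch.
It assumes a POSIX system where /dev/urandom stands in for the "strong
random source". The names seed_from_strong_source(), mix64() and
initial_seed() are hypothetical, not existing PostgreSQL functions, and the
weak fallback mixes time, pid and CPU clock rather than hashing logs.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fill *seed from /dev/urandom; returns 1 on success. */
static int
seed_from_strong_source(uint64_t *seed)
{
    FILE   *f = fopen("/dev/urandom", "rb");
    size_t  n;

    if (f == NULL)
        return 0;
    n = fread(seed, sizeof(*seed), 1, f);
    fclose(f);
    return n == 1;
}

/* splitmix64 mixing step, used here only to scramble the fallback inputs. */
static uint64_t
mix64(uint64_t z)
{
    z += UINT64_C(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
    return z ^ (z >> 31);
}

/*
 * Hypothetical entry point: a manual seed (for debugging) wins, then the
 * strong source, then a weak fallback that is only better than nothing.
 */
static uint64_t
initial_seed(const uint64_t *debug_seed)
{
    uint64_t seed;

    if (debug_seed != NULL)
        return *debug_seed;
    if (seed_from_strong_source(&seed))
        return seed;

    /* Weak fallback: these inputs are predictable, as noted above. */
    seed = mix64((uint64_t) time(NULL));
    seed = mix64(seed ^ (uint64_t) getpid());
    seed = mix64(seed ^ (uint64_t) clock());
    return seed;
}
```

Passing a non-NULL debug_seed gives a reproducible run; passing NULL takes
the strong source when it can be read.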

Also, I would suggest centralizing and abstracting the implementation of a
default pseudo-random generator so that its actual internal size and
quality can be changed. That would mean renaming pg_erand48 and hiding
its state size, maybe along the lines of:

   // extractors
   void pg_random_bytes(int nbytes, char *where_to_put_them);

   uint32 pg_random_32(void);
   uint64 pg_random_48(void);
   uint64 pg_random_64(void);
   ...

   // dynamic?
   int pg_random_state_size(void); // in bytes
   // or static?
   #define PG_RANDOM_STATE_SIZE 6 // bytes

   // get/set state
   bool pg_random_get_state(uchar *state(, int size|[PG_RANDOM_STATE_SIZE]));
   bool pg_random_set_state(const uchar *state...);
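
To make the idea of a hidden state concrete, here is a minimal sketch of
part of that interface on top of the 48-bit linear congruential recurrence
of the erand48 family (multiplier 0x5DEECE66D, increment 0xB). The
function names come from the outline above; the single global state, the
explicit size argument, the little-endian byte layout and the use of
stdint types instead of PostgreSQL's uint32/uint64 are assumptions made
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PG_RANDOM_STATE_SIZE 6      /* bytes, i.e. a 48-bit state */

/* Private state: callers only ever see it as opaque bytes. */
static uint64_t rng_state = 0;

/* One step of the 48-bit recurrence, reduced modulo 2^48. */
static uint64_t
pg_random_48(void)
{
    rng_state = (rng_state * UINT64_C(0x5DEECE66D) + UINT64_C(0xB))
        & UINT64_C(0xFFFFFFFFFFFF);
    return rng_state;
}

/* Keep the 32 high-order bits, which are the better ones for an LCG. */
static uint32_t
pg_random_32(void)
{
    return (uint32_t) (pg_random_48() >> 16);
}

/* Export the state as bytes; fails if the caller's size does not match. */
static bool
pg_random_get_state(unsigned char *state, int size)
{
    if (size != PG_RANDOM_STATE_SIZE)
        return false;
    for (int i = 0; i < size; i++)
        state[i] = (unsigned char) (rng_state >> (8 * i));
    return true;
}

/* Restore a state previously exported by pg_random_get_state(). */
static bool
pg_random_set_state(const unsigned char *state, int size)
{
    if (size != PG_RANDOM_STATE_SIZE)
        return false;
    rng_state = 0;
    for (int i = 0; i < size; i++)
        rng_state |= (uint64_t) state[i] << (8 * i);
    return true;
}
```

Because callers only handle PG_RANDOM_STATE_SIZE bytes, swapping in a
generator with a larger state would change the constant and the function
bodies but not the call sites.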

Given the typical hardware a postgres instance runs on, I would shop
around for a pseudo-random generator which takes advantage of 64-bit
operations, and not go below 64-bit seeds, or possibly 128.
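
One candidate of that kind, shown only as a sketch and not as a
recommendation made in this thread, is xoroshiro128**: it keeps 128 bits
of state and uses nothing but 64-bit shifts, rotates, xors and multiplies.
The names prng128_state, prng128_seed() and prng128_next() are
hypothetical, and expanding a 64-bit seed with splitmix64 is an assumption
about how the state would be initialized.

```c
#include <stdint.h>

/* Hypothetical 128-bit generator state (not an existing PostgreSQL type). */
typedef struct prng128_state
{
    uint64_t s0;
    uint64_t s1;
} prng128_state;

static inline uint64_t
rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

/* splitmix64 step: expands a 64-bit seed into well-mixed state words. */
static uint64_t
splitmix64_next(uint64_t *x)
{
    uint64_t z = (*x += UINT64_C(0x9e3779b97f4a7c15));

    z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
    return z ^ (z >> 31);
}

/* Fill both state words from one 64-bit seed. */
static void
prng128_seed(prng128_state *st, uint64_t seed)
{
    st->s0 = splitmix64_next(&seed);
    st->s1 = splitmix64_next(&seed);
}

/* xoroshiro128** step: returns 64 bits and advances the state. */
static uint64_t
prng128_next(prng128_state *st)
{
    uint64_t s0 = st->s0;
    uint64_t s1 = st->s1;
    uint64_t result = rotl64(s0 * 5, 7) * 9;

    s1 ^= s0;
    st->s0 = rotl64(s0, 24) ^ s1 ^ (s1 << 16);
    st->s1 = rotl64(s1, 37);
    return result;
}
```

A full 128-bit seed could be supported by setting s0 and s1 directly, as
long as they are not both zero.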

If a strong random source is available but considered too costly, so that
a (weak) linear congruential algorithm must be used, a possible compromise
is to reseed from the strong source every few thousand or million draws,
or with a time periodicity, e.g. every few minutes, or maybe through some
configurable option.
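
The draw-count variant of that compromise could look like the sketch
below. The strong source is abstracted as a callback so that its cost and
origin stay outside the generator; reseeding_rng, entropy_fn and
demo_counting_source() are hypothetical names, and the 64-bit LCG
constants are the ones Knuth gives for MMIX. The counting source exists
only to make the reseeding observable and is obviously not strong.

```c
#include <stdint.h>

/* Hypothetical callback returning 64 bits from the strong source. */
typedef uint64_t (*entropy_fn) (void);

typedef struct reseeding_rng
{
    uint64_t    state;              /* weak LCG state */
    uint64_t    draws_left;         /* draws before the next reseed */
    uint64_t    reseed_interval;    /* configurable period, in draws */
    entropy_fn  strong_source;
} reseeding_rng;

static void
reseeding_rng_init(reseeding_rng *r, entropy_fn src, uint64_t interval)
{
    r->strong_source = src;
    r->reseed_interval = interval ? interval : 1;   /* avoid a zero period */
    r->state = src();
    r->draws_left = r->reseed_interval;
}

static uint32_t
reseeding_rng_next(reseeding_rng *r)
{
    if (r->draws_left == 0)
    {
        /* Pay for the strong source only once per interval. */
        r->state = r->strong_source();
        r->draws_left = r->reseed_interval;
    }
    r->draws_left--;

    /* Cheap 64-bit LCG step; return the high-order half. */
    r->state = r->state * UINT64_C(6364136223846793005)
        + UINT64_C(1442695040888963407);
    return (uint32_t) (r->state >> 32);
}

/* Stand-in source that counts its calls; NOT a strong source. */
static uint64_t demo_source_calls = 0;

static uint64_t
demo_counting_source(void)
{
    return ++demo_source_calls;
}
```

A time-based period would replace the draw counter with a timestamp
comparison in the same place.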

A cheap security enhancement is to combine different fast generators
so that if one becomes weak at some point, the combination does not.
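
One simple way to combine generators is to XOR their outputs, sketched
here with a 64-bit LCG and Marsaglia's xorshift64. The choice of these two
generators, the XOR combination and the names combined_rng,
combined_init() and combined_next() are assumptions for illustration; the
sketch also assumes the two states are seeded independently, and it does
not by itself make the result cryptographically strong.

```c
#include <stdint.h>

/* Hypothetical combined state: one word per underlying generator. */
typedef struct combined_rng
{
    uint64_t lcg;   /* state of the linear congruential generator */
    uint64_t xs;    /* state of the xorshift generator, never zero */
} combined_rng;

/* 64-bit LCG step (Knuth's MMIX constants). */
static uint64_t
lcg_next(uint64_t *s)
{
    *s = *s * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
    return *s;
}

/* Marsaglia xorshift64 step with the (13, 7, 17) shift triple. */
static uint64_t
xorshift_next(uint64_t *s)
{
    uint64_t x = *s;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *s = x;
    return x;
}

static void
combined_init(combined_rng *r, uint64_t seed_a, uint64_t seed_b)
{
    r->lcg = seed_a;
    r->xs = seed_b ? seed_b : 1;    /* xorshift is stuck at zero otherwise */
}

/* Advance both generators and XOR their outputs. */
static uint64_t
combined_next(combined_rng *r)
{
    return lcg_next(&r->lcg) ^ xorshift_next(&r->xs);
}
```

The cost per draw is roughly the sum of the two underlying steps, which
stays small for generators of this kind.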

-- 
Fabien.

