Thread: Generating random unique alphanumeric IDs
Does anyone know a way to generate a random and unique lowercase alphanumeric ID (preferably without using 0, 1, o or i to prevent problems with users manually typing the ID) using SQL without resorting to a prerendered table or using GUIDs.
For example, if I were to ask for an ID of 5 characters, something like the following would be returned:
hn21o
8sp2j
9wwun
m7z02
Notice that I don't mean hexadecimal values either. This would preferrably not resort to trying to generate the ID, then checking for a clash, and if there is one, do it again, although that could do as I can't think of how the ideal solution of a ID hashing algorithm would be possible.
Any ideas?
For example, if I were to ask for an ID of 5 characters, something like the following would be returned:
hn21o
8sp2j
9wwun
m7z02
Notice that I don't mean hexadecimal values either. This would preferrably not resort to trying to generate the ID, then checking for a clash, and if there is one, do it again, although that could do as I can't think of how the ideal solution of a ID hashing algorithm would be possible.
Any ideas?
Thanks
Thom
On Sun, Aug 16, 2009 at 12:07:27PM +0100, Thom Brown wrote: > Does anyone know a way to generate a random and unique lowercase > alphanumeric ID If you want it to be unique then it's not going to be random. The easiest way to keep it from producing duplicates is to have some monotonically increasing component. If you're OK with code/people retrying the occasional duplicate then you're going to be relying on statistical guarantees and you should look at "birthday attacks" to see how often this is going to happen. > Notice that I don't mean hexadecimal values either. This would preferrably > not resort to trying to generate the ID, then checking for a clash, and if > there is one, do it again, although that could do as I can't think of how > the ideal solution of a ID hashing algorithm would be possible. The following is the obvious PGSQL code, you'd obviously need something else to stop duplicates. SELECT array_to_string(array(( SELECT SUBSTRING('abcdefghjklmnpqrstuvwxyz23456789' FROM mod((random()*32)::int, 32)+1 FOR 1) FROM generate_series(1,5))),''); As this only generates five characters and each character can be one of 32 values, you've got about 33554432 choices and you'd have a 50% chance of getting a duplicate after 7240 values. This assumes I wrote the above code correctly. It's also not amazing because PG's random number generator is defined to return a value between 0 and 1 inclusive, it's generally much more useful if it runs from 0 to less than 1 and would mean that I wouldn't need the "mod" above and would remove the (slight) biasing towards choosing 'a'. -- Sam http://samason.me.uk/
The following is the obvious PGSQL code, you'd obviously need somethingelse to stop duplicates.
SELECT array_to_string(array((
SELECT SUBSTRING('abcdefghjklmnpqrstuvwxyz23456789'
FROM mod((random()*32)::int, 32)+1 FOR 1)
FROM generate_series(1,5))),'');
As this only generates five characters and each character can be one of
32 values, you've got about 33554432 choices and you'd have a 50% chance
of getting a duplicate after 7240 values. This assumes I wrote the
above code correctly. It's also not amazing because PG's random number
generator is defined to return a value between 0 and 1 inclusive, it's
generally much more useful if it runs from 0 to less than 1 and would
mean that I wouldn't need the "mod" above and would remove the (slight)
biasing towards choosing 'a'.
That does actually work! I'm not sure why you're saying that there's a 50% chance of duplication after 7240 values though. With 33 million combinations, I would have thought that duplications would become equally likely at the 16,777,216 mark.
I hadn't thought of coding it the way you did, which is an interesting way of approaching it!
Thom
On Sun, 16 Aug 2009 12:48:39 +0100 Sam Mason <sam@samason.me.uk> wrote: > On Sun, Aug 16, 2009 at 12:07:27PM +0100, Thom Brown wrote: > > Does anyone know a way to generate a random and unique lowercase > > alphanumeric ID > > If you want it to be unique then it's not going to be random. The > easiest way to keep it from producing duplicates is to have some > monotonically increasing component. If you're OK with code/people > retrying the occasional duplicate then you're going to be relying > on statistical guarantees and you should look at "birthday > attacks" to see how often this is going to happen. > > > Notice that I don't mean hexadecimal values either. This would > > preferrably not resort to trying to generate the ID, then > > checking for a clash, and if there is one, do it again, although > > that could do as I can't think of how the ideal solution of a ID > > hashing algorithm would be possible. Sometimes ago Daniel Verite posted an implementation of a fiestel cipher in plpgsql. I'm happily using it to generate pseudo-random hex strings. -- Ivan Sergio Borgonovo http://www.webthatworks.it
On Sun, Aug 16, 2009 at 12:57:34PM +0100, Thom Brown wrote: > > SELECT array_to_string(array(( > > SELECT SUBSTRING('abcdefghjklmnpqrstuvwxyz23456789' > > FROM mod((random()*32)::int, 32)+1 FOR 1) > > FROM generate_series(1,5))),''); I've just had a look and PG does actually seem to be returning values as I'd expect, i.e. 0 <= n < 1. So the following would work: floor(random()*32)::int+1 instead of the mod hack. Distribution looks reasonably flat (this is good): char %occurances 1 3.1222 2 3.1329 3 3.1269 4 3.1236 5 3.1233 6 3.1310 7 3.1226 8 3.1298 9 3.1229 10 3.1294 11 3.1192 12 3.1249 13 3.1267 14 3.1236 15 3.1190 16 3.1279 17 3.1232 18 3.1218 19 3.1314 20 3.1091 21 3.1337 22 3.1239 23 3.1184 24 3.1347 25 3.1205 26 3.1160 27 3.1219 28 3.1344 29 3.1118 30 3.1256 31 3.1408 32 3.1255 > I'm not sure why you're saying that there's a 50% > chance of duplication after 7240 values though. With 33 million > combinations, I would have thought that duplications would become equally > likely at the 16,777,216 mark. No, that's why I pointed out birthday attacks---collisions happen much more often than you'd expect. Get 23 people in a room and you have a 50% chance of two people having the same birthday--not 150 people. This is why it's called the birthday attack and it's one of the basic tests for hash functions--any bias in their output will shrink this number even further. -- Sam http://samason.me.uk/
Thom Brown wrote: >> I'm not sure why you're saying that there's a 50% >> chance of duplication after 7240 values though. With 33 million >> combinations, I would have thought that duplications would become equally >> likely at the 16,777,216 mark. Basic probability. <http://en.wikipedia.org/wiki/Birthday_attack> <http://en.wikipedia.org/wiki/Birthday_problem> Sam Mason wrote: > No, that's why I pointed out birthday attacks---collisions happen much > more often than you'd expect. Get 23 people in a room and you have a > 50% chance of two people having the same birthday--not 150 people. This > is why it's called the birthday attack and it's one of the basic tests > for hash functions--any bias in their output will shrink this number > even further. Taking the birthday example, the chance that no two people in a group of size n < 365 have the same birthday (irrespective of year) is (365-0)/365 x (365-1)/365 x (365-2)/365 x (365-3)/365 ... x (365 - (n-1))/365 As each term in the expression is less than one, their multiplication together rapidly approaches zero. That means the probability of two or more people having the same birthday rapidly approaches one. The 50% point is a bit under n=23. Substitute 33 million for 365 for the OP's problem. -- Lew
> I would have thought that duplications would become equally likely at the > 16,777,216 mark. Yes, at that point you're as likely to get a duplicate as a unique one--every time you do it. You're likely to see your first duplicate long before that point. In fact, it would be extremely unlikely to get to that point without having generated any duplicates. -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice
On Aug 16, 2009, at 5:07 AM, Thom Brown wrote:
Does anyone know a way to generate a random and unique lowercase alphanumeric ID (preferably without using 0, 1, o or i to prevent problems with users manually typing the ID) using SQL without resorting to a prerendered table or using GUIDs.
For example, if I were to ask for an ID of 5 characters, something like the following would be returned:
hn21o
8sp2j
9wwun
m7z02
Notice that I don't mean hexadecimal values either. This would preferrably not resort to trying to generate the ID, then checking for a clash, and if there is one, do it again, although that could do as I can't think of how the ideal solution of a ID hashing algorithm would be possible.
One way is to use a LFSR (linear feedback shift register function). I haven't used one in a long time but I recall generating pseudo random numbers that are guaranteed not to repeat after billions of iterations. It's very fast as well. Then translate the resulting integer into the character sequence of your choosing. Here is a reference: http://en.wikipedia.org/wiki/Linear_feedback_shift_register
Bob Gobeille
Hewlett Packard
Open Source Program Office
(and http://fossology.org)
Ivan Sergio Borgonovo escribió: > Sometimes ago Daniel Verite posted an implementation of a fiestel > cipher in plpgsql. It's in the wiki, in the Snippets area. wiki.postgresql.org/wiki/Snippets (pseudo encrypt or something like that I think it's called) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Sun, Aug 16, 2009 at 04:53:01PM -0600, Bob Gobeille wrote: > One way is to use a LFSR (linear feedback shift register function). I > haven't used one in a long time but I recall generating pseudo random > numbers that are guaranteed not to repeat after billions of > iterations. It's very fast as well. Then translate the resulting > integer into the character sequence of your choosing. Here is a > reference: http://en.wikipedia.org/wiki/Linear_feedback_shift_register Not sure if this is very applicable; LFSRs can have a very long period, or interval before they repeat (i.e. their internal state is the same as it was before) but individual numbers *will* be repeated. The performance claims tend only to apply to hardware implementations, there are much faster pseudo-random number generators available for software. The fastest one I found recently is a SIMD implementation of the "Mersenne Twister" called SFMT[1]. -- Sam http://samason.me.uk/ [1] http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/
On Sun, Aug 16, 2009 at 6:12 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > It's in the wiki, in the Snippets area. > wiki.postgresql.org/wiki/Snippets > (pseudo encrypt or something like that I think it's called) Here's a simple 255 value linear feedback shift register. It's nothing fancy, but works as an example. It's not any kind of a secure sequence, but can be handy for generating pseudo random codes for things like identifiers that need to not be sequential. create table lfsr (b bit(8)); insert into lfsr values ('10100011'); create or replace function lf() returns bit(8) language sql as $$ update lfsr set b=(select ((substring(b,1,1)#substring(b,3,1)#substring(b,4,1)#substring(b,5,1)))::bit(8)>>7|(b<<1) from lfsr) ; select b from lfsr $$; create table l (b bit(8), i int); insert into l select lf(),generate_series(1,255); select count(distinct(b)) from l; select b, count(b) from l group by b having count(b) > 1; insert into l select lf(),generate_series(1,1); select b, count(b) from l group by b having count(b) > 1;
Thom Brown wrote: > This would preferrably not resort to trying to generate the ID, then > checking for a clash, and if there is one, do it again, although that could > do as I can't think of how the ideal solution of a ID hashing algorithm > would be possible. As suggested upthread, this function: http://wiki.postgresql.org/wiki/Pseudo_encrypt can be used to generate integers without collision. But the output range is 32 bits, and expressing 2^32 values with 32 letters produces strings of 7 characters, not 5. I don't think that the strings can be shortened in a post-processing stage without producing the collisions that were to be avoided in the first place. However, changing the function to reduce its output range is possible. Going down from 32 bits to 30 has been discussed already: http://archives.postgresql.org/pgsql-general/2009-07/msg00194.php 2^30 happens to be exactly equal to 32^6, so this modified version would produce exactly all the possible strings of 6 characters out of 32 without any collision. If you really want no more than 5 characters, you'd need to go down to 24 bits, but then the number of different outputs would be only about 16 millions. Best regards, -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
On 2009-08-17, Sam Mason <sam@samason.me.uk> wrote: > On Sun, Aug 16, 2009 at 04:53:01PM -0600, Bob Gobeille wrote: >> One way is to use a LFSR (linear feedback shift register function). I >> haven't used one in a long time but I recall generating pseudo random >> numbers that are guaranteed not to repeat after billions of >> iterations. It's very fast as well. Then translate the resulting >> integer into the character sequence of your choosing. Here is a >> reference: http://en.wikipedia.org/wiki/Linear_feedback_shift_register > > Not sure if this is very applicable; LFSRs can have a very long period, > or interval before they repeat ideally (2^n)-1 (i.e. their internal state is the same as > it was before) but individual numbers *will* be repeated. numbers will not be repeated intil the state wraps if the number returned represents the entire state of the LFSR. for the OP's problem this means building a LFSR with n=5c (where c is the number of charactes in the serial code, and n is the number of bits in the LFSR state) and then taking a single LFSR result and peeling off 5 bits at a time and using each 5 to make each charcter in the result. (or anternately clocking c 5 bit numbers out of the LFSR) for each result. > The performance claims tend only to apply to hardware implementations, > there are much faster pseudo-random number generators available for > software. The fastest one I found recently is a SIMD implementation of > the "Mersenne Twister" called SFMT[1]. LFSR can be optimised using look-up tables same as CRC32 is (CRC32 is an LFSR with an input as well as an output) MT is too big (too much state), linear congrutential is probably a better approach when a software generated non-revisiting sequence is wanted. for good results use a prime m less than 32^c and an r near sqrt(32^c)
On Mon, Aug 17, 2009 at 12:17:29PM +0000, Jasen Betts wrote: > On 2009-08-17, Sam Mason <sam@samason.me.uk> wrote: > (i.e. their internal state is the same as > > it was before) but individual numbers *will* be repeated. > > numbers will not be repeated intil the state wraps if the number > returned represents the entire state of the LFSR. Huh, not thought of that! I'd only considered them for crypto purposes where this sort of thing would be bad as you'd be able to know exactly what's coming next. I'd never considered that this could be a useful property before! -- Sam http://samason.me.uk/
In article <20090816122526.GW5407@samason.me.uk>, Sam Mason <sam@samason.me.uk> writes: > I've just had a look and PG does actually seem to be returning values as > I'd expect, i.e. 0 <= n < 1. That's what everyone would expect. If it's really implemented like that the documentation is wrong, isn't it?
On Mon, Aug 17, 2009 at 04:00:54PM +0200, Harald Fuchs wrote: > In article <20090816122526.GW5407@samason.me.uk>, > Sam Mason <sam@samason.me.uk> writes: > > > I've just had a look and PG does actually seem to be returning values as > > I'd expect, i.e. 0 <= n < 1. > > That's what everyone would expect. If it's really implemented like > that the documentation is wrong, isn't it? Yup, the code is doing the right thing. I submitted a patch to the docs and Tom committed a, somewhat more readable, version of it last night. http://archives.postgresql.org/pgsql-committers/2009-08/msg00197.php -- Sam http://samason.me.uk/
On Mon, 17 Aug 2009 12:37:33 +0200 "Daniel Verite" <daniel@manitou-mail.org> wrote: > http://archives.postgresql.org/pgsql-general/2009-07/msg00194.php As an exercise I wrote the decrypt version create or replace function feistel_encrypt(value int) returns int as $$ declare l1 int; l2 int; r1 int; r2 int; i int:=0; begin l1:= (value >> 16) & 65535; r1:= value & 65535; while i<3 loop l2:=r1; r2:=l1 # ((((1366.0 * r1+150889)%714025)/714025.0)*32767)::int; l1:=l2; r1:=r2; i:=i+1; end loop; return ((l1::bigint<<16) + r1); end; $$ language plpgsql strict immutable; create or replace function feistel_decrypt(value int) returns int as $$ declare l1 int; l2 int; r1 int; r2 int; i int:=0; begin l2:= (value >> 16) & 65535; r2:= value & 65535; while i<3 loop r1=l2; l1:=r2#((((1366.0*l2+150889)%714025)/714025.0)*32767)::int; l2:=l1; r2:=r1; i:=i+1; end loop; return ((l2::bigint<<16) + r2); end; $$ language plpgsql strict immutable; so that 10 = feistel_decrypt(feistel_encrypt(10)) Since I'm then converting to_hex to shorten the string I was thinking to add some more bits of randomness since eg. to_hex(10) = 'a' In the line of select lpad( to_hex(feistel_encrypt(10)),7 , to_hex((rand()*2^31)::int) ); I was wondering if there is any better way to get alphanumeric random string quickly. Since uniqueness is assured by passing a sequence to fesitel_encrypt, I just need turning into to alphanumeric quickly. -- Ivan Sergio Borgonovo http://www.webthatworks.it
Since I'm then converting to_hex to shorten the string I was
thinking to add some more bits of randomness since eg.
to_hex(10) = 'a'
In the line of
select lpad(
to_hex(feistel_encrypt(10)),7 , to_hex((rand()*2^31)::int)
);
I was wondering if there is any better way to get alphanumeric
random string quickly. Since uniqueness is assured by passing a
sequence to fesitel_encrypt, I just need turning into to
alphanumeric quickly.
This appears a lot more tricky than I had originally anticipated! I may be misunderstanding your example, but by alphanumeric, I mean beyond hex (i.e. a-z and possibly uppcase too).
I've looked into LFSR, but I'm afraid it goes over my head. But what Jason Betts said seems to summarise what I'm after: "for the OP's problem this means building a LFSR with n=5c (where c is the number of charactes in the serial code, and n is the number of bits in the LFSR state) and then taking a single LFSR result and peeling off 5 bits at a time and using each 5 to make each charcter in the result."
If this results in an unpredictable and non-duplicating loop of generated sets of characters, that would be ideal. Would a parallel for this be a 5-character code possibly transcoded from a 6-character GUID/UUID? (a-h + j+n + p-z + A-H + J-N + P+Z + 2-9 = 56 possible characters, 56^5 = 550,731,776, 550,731,776 / 16 (hex character set) ^ 6 (characters) = just over 32.), so wouldn't actually use up all possible combinations. :/
Thom
On Thu, 20 Aug 2009 13:34:51 +0100 Thom Brown <thombrown@gmail.com> wrote: Correcting myself. a) it is a bad idea to pad an hex with an hex... so I should still find a quick way to change representation to [g-z] for the padding characters... or just pad with a constant string. select lpad( to_hex(feistel_encrypt(10)),8 , 'mjkitlh') ); b) this if from int (signed) to int (signed). begin; create or replace function feistel_encrypt(value int) returns int as $$ declare l1 int; l2 int; r1 int; r2 int; i int:=0; begin l1:= (value >> 16) & 65535; r1:= value & 65535; while i<3 loop l2:=r1; r2:=l1#((((1366.0 *r1+150889)%714025)/714025.0)*32767)::int; l1:=l2; r1:=r2; i:=i+1; end loop; return ((l1 << 16) | r1); end; $$ language plpgsql strict immutable; create or replace function feistel_decrypt(value int) returns int as $$ declare l1 int; l2 int; r1 int; r2 int; i int:=0; begin l2:= (value >> 16) & 65535; r2:= value & 65535; while i<3 loop r1=l2; l1:=r2#((((1366.0*l2+150889)%714025)/714025.0)*32767)::int; l2:=l1; r2:=r1; i:=i+1; end loop; return ((l2 << 16) | r2); end; $$ language plpgsql strict immutable; commit; select * from feistel_decrypt(feistel_encrypt((2^31-1)::int)) union select * from feistel_decrypt(feistel_encrypt((-2^31)::int)) union select * from feistel_decrypt(feistel_encrypt((0)::int)) union select * from feistel_decrypt(feistel_encrypt((-1)::int)) ; > This appears a lot more tricky than I had originally anticipated! > I may be misunderstanding your example, but by alphanumeric, I > mean beyond hex (i.e. a-z and possibly uppcase too). me too... but to_hex was there and a quick trick to shorten the string and get rid of a sign. > I've looked into LFSR, but I'm afraid it goes over my head. But There is too much dust on my copy of "Concrete Mathematics" still by popular culture (read wikipedia) it is said that LFSR are not cryptographically safe, while making 4 loops and choosing a suitable F, Feistel cypher is. Then it is just a problem of "shrinking the string" or representing it in another base... and that may result in some "waste". 5 bits are 32 char... you actually have more chars available even just considering a subset of ASCII. Picking 5 bits from LFSR algo isn't that different than converting to hex feistel cipher as I see it. The main advantage of hex over ASCII is that ints map very well to hex (no waste) and that to_hex has good chance to perform better than any plpgsql function. Since I'm generating "gift codes" It wouldn't look nice if I present the user with A as a gift code... And that's going to happen as soon as I'll have generated 232798491 gift codes. (/me wondering which is the smaller number with a corresponding one digit hex(fiestel()) transform.)[1]. So just to make gift codes look nicer I thought about padding them with some furter random noise... but the way initially described is not going to work. Variants could be to concat with something [^a-f0-9] (eg '-') and then padding with hex random noise A -> -A -> (random noise)-A I don't know if it is worth since it is another round of lpad. Even if I'm currently not overly concerned by performances I'm working with plpgsql and while I think that writing something to change base representation to an int can be done... it will be slow and ugly. If I was working with pgperl (?) I'd just google for some perl receipt. Given the premises I'll just embellish the hex with some padding. But if you really need to use letters and be compact and such... I think you're just looking for changing the base of your wathever-pseudo-random algorithm. That's a common problem you may just have to adapt to plpgsql. [1] select s.i, feistel_decrypt(s.i) from generate_series(0,16) as s(i) order by feistel_decrypt(s.i) did it in a hurry... didn't check -- Ivan Sergio Borgonovo http://www.webthatworks.it
Thom Brown wrote: > If this results in an unpredictable and non-duplicating loop of generated > sets of characters, that would be ideal. Would a parallel for this be a > 5-character code possibly transcoded from a 6-character GUID/UUID? (a-h + > j+n + p-z + A-H + J-N + P+Z + 2-9 = 56 possible characters, 56^5 = > 550,731,776, 550,731,776 / 16 (hex character set) ^ 6 (characters) = just > over 32.), so wouldn't actually use up all possible combinations. :/ 56^5 is the number of strings that you can form with 5 letters of the 56 letters alphabet. But independantly of that , what is the minimum number of different values that you truly need? Best regards, -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org