Re: Hostnames, IDNs, Punycode and Unicode Case Folding - Mailing list pgsql-general
From | David G Johnston |
---|---|
Subject | Re: Hostnames, IDNs, Punycode and Unicode Case Folding |
Date | |
Msg-id | 1419895402599-5832360.post@n5.nabble.com Whole thread Raw |
In response to | Re: Hostnames, IDNs, Punycode and Unicode Case Folding (Andy Colson <andy@squeakycode.net>) |
Responses |
Re: Hostnames, IDNs, Punycode and Unicode Case Folding
|
List | pgsql-general |
Andy Colson wrote > On 12/29/2014 4:36 PM, Mike Cardwell wrote: >> I'd like to store hostnames in a postgres database and I want to fully >> support >> IDNs (Internationalised Domain Names) >> >> I want to be able to recover the original representation of the hostname, >> so I >> can't just encode it with punycode and then store the ascii result. For >> example, >> these two are the same hostnames thanks to unicode case folding [1]: >> >> tesst.ëxämplé.com >> teßt.ëxämplé.com >> >> They both encode in punycode to the same thing: >> >> xn--tesst.xmpl.com-cib7f2a >> >> Don't believe me, then try visiting any domain with two s's in, whilst >> replacing >> the s's with ß's. E.g: >> >> ericßon.com >> nißan.com >> americanexpreß.com >> >> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no >> idea >> which of those two hostnames was the original representation. >> >> The trouble is, if I store the unicode representation of a hostname >> instead, >> then when I run queries with conditions like: >> >> WHERE hostname='nißan.com' >> > > _IF_ Postgres had a punycode function, then you could use: > WHERE punycode(hostname) = punycode('nißan.com') If the OP wraps what he is doing up into a function that is what you end up getting: a memoized punycode function. http://en.wikipedia.org/wiki/Memoization It has to be defined as volatile but basically write the function to check for the provided input on the indexed table and if it doesn't exist the function will calculate the punycode value and store it onto the table before returning the punycode value to the caller. The question then becomes how to perform the actual punycode conversion. The API within PostgreSQL becomes that single function and you store only the original hostname in the user-data area. Whenever you need to access the punycode version you call the function. The only downside is that because the function is volatile you cannot create a functional index using it. It may be worth writing both an immutable conversion function and a memoization wrapper so you can index using the former while using the memoized version normally. I'm getting a little outside my experience level here as it seems there should be a better way to define a volatile function that is deterministic so that you get the best of both worlds. Just because a function modifies the database doesn't mean that its output value is not solely a function of its inputs - and that all side effects are of a logging or caching nature. David J. -- View this message in context: http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832360.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
pgsql-general by date: