Re: Hostnames, IDNs, Punycode and Unicode Case Folding - Mailing list pgsql-general

From David G Johnston
Subject Re: Hostnames, IDNs, Punycode and Unicode Case Folding
Date
Msg-id 1419895402599-5832360.post@n5.nabble.com
Whole thread Raw
In response to Re: Hostnames, IDNs, Punycode and Unicode Case Folding  (Andy Colson <andy@squeakycode.net>)
Responses Re: Hostnames, IDNs, Punycode and Unicode Case Folding
List pgsql-general
Andy Colson wrote
> On 12/29/2014 4:36 PM, Mike Cardwell wrote:
>> I'd like to store hostnames in a postgres database and I want to fully
>> support
>> IDNs (Internationalised Domain Names)
>>
>> I want to be able to recover the original representation of the hostname,
>> so I
>> can't just encode it with punycode and then store the ascii result. For
>> example,
>> these two are the same hostnames thanks to unicode case folding [1]:
>>
>>    tesst.ëxämplé.com
>>    teßt.ëxämplé.com
>>
>> They both encode in punycode to the same thing:
>>
>>    xn--tesst.xmpl.com-cib7f2a
>>
>> Don't believe me, then try visiting any domain with two s's in, whilst
>> replacing
>> the s's with ß's. E.g:
>>
>>    ericßon.com
>>    nißan.com
>>    americanexpreß.com
>>
>> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no
>> idea
>> which of those two hostnames was the original representation.
>>
>> The trouble is, if I store the unicode representation of a hostname
>> instead,
>> then when I run queries with conditions like:
>>
>>    WHERE hostname='nißan.com'
>>
>
> _IF_ Postgres had a punycode function, then you could use:
> WHERE punycode(hostname) = punycode('nißan.com')

If the OP wraps what he is doing up into a function that is what you end up
getting: a memoized punycode function.

http://en.wikipedia.org/wiki/Memoization

It has to be defined as volatile but basically write the function to check
for the provided input on the indexed table and if it doesn't exist the
function will calculate the punycode value and store it onto the table
before returning the punycode value to the caller.

The question then becomes how to perform the actual punycode conversion.

The API within PostgreSQL becomes that single function and you store only
the original hostname in the user-data area.  Whenever you need to access
the punycode version you call the function.

The only downside is that because the function is volatile you cannot create
a functional index using it.  It may be worth writing both an immutable
conversion function and a memoization wrapper so you can index using the
former while using the memoized version normally.

I'm getting a little outside my experience level here as it seems there
should be a better way to define a volatile function that is deterministic
so that you get the best of both worlds.  Just because a function modifies
the database doesn't mean that its output value is not solely a function of
its inputs - and that all side effects are of a logging or caching nature.

David J.



--
View this message in context:
http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832360.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


pgsql-general by date:

Previous
From: David Johnston
Date:
Subject: Re: Rollback on include error in psql
Next
From: Adrian Klaver
Date:
Subject: Re: Rollback on include error in psql