Hostnames, IDNs, Punycode and Unicode Case Folding - Mailing list pgsql-general

From Mike Cardwell
Subject Hostnames, IDNs, Punycode and Unicode Case Folding
Date
Msg-id 20141229223642.GA19102@glue.grepular.com
Whole thread Raw
Responses Re: Hostnames, IDNs, Punycode and Unicode Case Folding  (Andy Colson <andy@squeakycode.net>)
Re: Hostnames, IDNs, Punycode and Unicode Case Folding  (Andrew Sullivan <ajs@crankycanuck.ca>)
Re: Hostnames, IDNs, Punycode and Unicode Case Folding  (Jonathan Vanasco <postgres@2xlp.com>)
List pgsql-general
I'd like to store hostnames in a postgres database and I want to fully support
IDNs (Internationalised Domain Names)

I want to be able to recover the original representation of the hostname, so I
can't just encode it with punycode and then store the ascii result. For example,
these two are the same hostnames thanks to unicode case folding [1]:

  tesst.ëxämplé.com
  teßt.ëxämplé.com

They both encode in punycode to the same thing:

  xn--tesst.xmpl.com-cib7f2a

Don't believe me, then try visiting any domain with two s's in, whilst replacing
the s's with ß's. E.g:

  ericßon.com
  nißan.com
  americanexpreß.com

So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
which of those two hostnames was the original representation.

The trouble is, if I store the unicode representation of a hostname instead,
then when I run queries with conditions like:

  WHERE hostname='nißan.com'

that wont pull out rows where hostname='nissan.com'. Also, if I create a unique
index on the hostname field, or even on lower(hostname), that wont stop the
same hostname being inserted more than once, with slightly different
representations.

So the system I've settled with is storing both the originally supplied
representation, *and* the lower cased punycode encoded version in a separate
column for indexing/search. This seems really hackish to me though.

It seems to me that Postgres would benefit from a native hostname type and/or
a pair of punycode encode/decode functions. And perhaps even a simple unicode
case folding function. With the end result that these return TRUE:

  unicode_case_fold('ß') = 'ss'

  'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com')

  punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com'

A native type would also be able to apply suitable constraints, e.g a maximum
length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
on a label, etc.

I initially created a pair of functions for using punycode using plperl and the
IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled
with duplicating the data to sacrifice disk space and complexity, for speed.

I'm new to Postgres, and to this list, so if there is a better way for me to
submit this suggestion or solve my problem, please point me in the right
direction.

[1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regards,

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

pgsql-general by date:

Previous
From: David Johnston
Date:
Subject: Re: Rollback on include error in psql
Next
From: Adrian Klaver
Date:
Subject: Re: Rollback on include error in psql