Thread: Hostnames, IDNs, Punycode and Unicode Case Folding

Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

29 December 2014, 22:36:49

I'd like to store hostnames in a postgres database and I want to fully support
IDNs (Internationalised Domain Names)

I want to be able to recover the original representation of the hostname, so I
can't just encode it with punycode and then store the ascii result. For example,
these two are the same hostnames thanks to unicode case folding [1]:

  tesst.ëxämplé.com
  teßt.ëxämplé.com

They both encode in punycode to the same thing:

  xn--tesst.xmpl.com-cib7f2a

Don't believe me, then try visiting any domain with two s's in, whilst replacing
the s's with ß's. E.g:

  ericßon.com
  nißan.com
  americanexpreß.com

So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
which of those two hostnames was the original representation.

The trouble is, if I store the unicode representation of a hostname instead,
then when I run queries with conditions like:

  WHERE hostname='nißan.com'

that wont pull out rows where hostname='nissan.com'. Also, if I create a unique
index on the hostname field, or even on lower(hostname), that wont stop the
same hostname being inserted more than once, with slightly different
representations.

So the system I've settled with is storing both the originally supplied
representation, *and* the lower cased punycode encoded version in a separate
column for indexing/search. This seems really hackish to me though.

It seems to me that Postgres would benefit from a native hostname type and/or
a pair of punycode encode/decode functions. And perhaps even a simple unicode
case folding function. With the end result that these return TRUE:

  unicode_case_fold('ß') = 'ss'

  'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com')

  punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com'

A native type would also be able to apply suitable constraints, e.g a maximum
length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
on a label, etc.

I initially created a pair of functions for using punycode using plperl and the
IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled
with duplicating the data to sacrifice disk space and complexity, for speed.

I'm new to Postgres, and to this list, so if there is a better way for me to
submit this suggestion or solve my problem, please point me in the right
direction.

[1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regards,

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Andy Colson

Date:

29 December 2014, 22:52:04

On 12/29/2014 4:36 PM, Mike Cardwell wrote:
> I'd like to store hostnames in a postgres database and I want to fully support
> IDNs (Internationalised Domain Names)
>
> I want to be able to recover the original representation of the hostname, so I
> can't just encode it with punycode and then store the ascii result. For example,
> these two are the same hostnames thanks to unicode case folding [1]:
>
>    tesst.ëxämplé.com
>    teßt.ëxämplé.com
>
> They both encode in punycode to the same thing:
>
>    xn--tesst.xmpl.com-cib7f2a
>
> Don't believe me, then try visiting any domain with two s's in, whilst replacing
> the s's with ß's. E.g:
>
>    ericßon.com
>    nißan.com
>    americanexpreß.com
>
> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
> which of those two hostnames was the original representation.
>
> The trouble is, if I store the unicode representation of a hostname instead,
> then when I run queries with conditions like:
>
>    WHERE hostname='nißan.com'
>

_IF_ Postgres had a punycode function, then you could use:
WHERE punycode(hostname) = punycode('nißan.com')


-Andy

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

David G Johnston

Date:

29 December 2014, 23:23:27

Andy Colson wrote
> On 12/29/2014 4:36 PM, Mike Cardwell wrote:
>> I'd like to store hostnames in a postgres database and I want to fully
>> support
>> IDNs (Internationalised Domain Names)
>>
>> I want to be able to recover the original representation of the hostname,
>> so I
>> can't just encode it with punycode and then store the ascii result. For
>> example,
>> these two are the same hostnames thanks to unicode case folding [1]:
>>
>>    tesst.ëxämplé.com
>>    teßt.ëxämplé.com
>>
>> They both encode in punycode to the same thing:
>>
>>    xn--tesst.xmpl.com-cib7f2a
>>
>> Don't believe me, then try visiting any domain with two s's in, whilst
>> replacing
>> the s's with ß's. E.g:
>>
>>    ericßon.com
>>    nißan.com
>>    americanexpreß.com
>>
>> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no
>> idea
>> which of those two hostnames was the original representation.
>>
>> The trouble is, if I store the unicode representation of a hostname
>> instead,
>> then when I run queries with conditions like:
>>
>>    WHERE hostname='nißan.com'
>>
>
> _IF_ Postgres had a punycode function, then you could use:
> WHERE punycode(hostname) = punycode('nißan.com')

If the OP wraps what he is doing up into a function that is what you end up
getting: a memoized punycode function.

http://en.wikipedia.org/wiki/Memoization

It has to be defined as volatile but basically write the function to check
for the provided input on the indexed table and if it doesn't exist the
function will calculate the punycode value and store it onto the table
before returning the punycode value to the caller.

The question then becomes how to perform the actual punycode conversion.

The API within PostgreSQL becomes that single function and you store only
the original hostname in the user-data area.  Whenever you need to access
the punycode version you call the function.

The only downside is that because the function is volatile you cannot create
a functional index using it.  It may be worth writing both an immutable
conversion function and a memoization wrapper so you can index using the
former while using the memoized version normally.

I'm getting a little outside my experience level here as it seems there
should be a better way to define a volatile function that is deterministic
so that you get the best of both worlds.  Just because a function modifies
the database doesn't mean that its output value is not solely a function of
its inputs - and that all side effects are of a logging or caching nature.

David J.

--
View this message in context:
http://postgresql.nabble.com/Hostnames-IDNs-Punycode-and-Unicode-Case-Folding-tp5832350p5832360.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

29 December 2014, 23:51:01

* on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote:

>>>    WHERE hostname='nißan.com'
>>>
>>
>> _IF_ Postgres had a punycode function, then you could use:
>> WHERE punycode(hostname) = punycode('nißan.com')
>
> If the OP wraps what he is doing up into a function that is what you end up
> getting: a memoized punycode function.
>
> http://en.wikipedia.org/wiki/Memoization
>
> It has to be defined as volatile but basically write the function to check
> for the provided input on the indexed table and if it doesn't exist the
> function will calculate the punycode value and store it onto the table
> before returning the punycode value to the caller.

I'm not sure all that is necessary. It could be quite a simple function,
like the lower() function. So what I would do is this:

CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname)));

That would prevent adding more than one representation for the same hostname
to the column. And I if I wanted to do a fast, indexed search where I could
supply any representation of the hostname as input, I would just do:

WHERE lower(punycode_encode(hostname)) = lower(punycode_encode('any-representation'))

There doesn't need to be any extra table storage for the punycode encoded
version.

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Andrew Sullivan

Date:

30 December 2014, 00:00:24

On Mon, Dec 29, 2014 at 11:50:54PM +0000, Mike Cardwell wrote:
>
> CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname)));

This wouldn't work to get the original back if oyu have any IDNA2003
data, because puncode-encoding the UTF-8 under IDNA2003 and then
punycode-decoding it doesn't always result in the same label.  See my
other message.

Did I mention that IDNA is a mess?

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

David G Johnston

Date:

30 December 2014, 00:01:51

On Mon, Dec 29, 2014 at 4:51 PM, Mike Cardwell [via PostgreSQL] <[hidden email]> wrote:

* on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote:

>>> WHERE hostname='nißan.com'
>>>
>>
>> _IF_ Postgres had a punycode function, then you could use:
>> WHERE punycode(hostname) = punycode('nißan.com')
>
> If the OP wraps what he is doing up into a function that is what you end up
> getting: a memoized punycode function.
>
> http://en.wikipedia.org/wiki/Memoization
>
> It has to be defined as volatile but basically write the function to check
> for the provided input on the indexed table and if it doesn't exist the
> function will calculate the punycode value and store it onto the table
> before returning the punycode value to the caller.
I'm not sure all that is necessary. It could be quite a simple function,
like the lower() function. So what I would do is this:

CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname)));

That would prevent adding more than one representation for the same hostname
to the column.

Except two different hostname can resolve to the same punycode_encode(hostname) value so the unique index won't work.

It was also mentioned that using the Perl encoding function was non-performant; which is why caching the data into a memoization table has value.

WHERE lower(punycode_encode(hostname)) = lower(punycode_encode('any-representation'))

I'm not for knowing the rules of punycode but I'm not seeing what value lower() provides here...

There doesn't need to be any extra table storage for the punycode encoded
version.

David J.

View this message in context: Re: Hostnames, IDNs, Punycode and Unicode Case Folding
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Tom Lane

Date:

30 December 2014, 00:09:07

David G Johnston <david.g.johnston@gmail.com> writes:
> It was also mentioned that using the Perl encoding function was
> non-performant; which is why caching the data into a memoization table has
> value.

I find it hard to believe that the standards folk would have chosen a
hostname encoding method that was so inefficient that memoization using
a database table would outperform just recomputing the encoding.  What
I suspect the above really means is "man, this Perl code sucks".

            regards, tom lane

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

30 December 2014, 00:19:03

* on the Mon, Dec 29, 2014 at 07:00:05PM -0500, Andrew Sullivan wrote:

>> CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname)));
>
> This wouldn't work to get the original back if oyu have any IDNA200
> data, because puncode-encoding the UTF-8 under IDNA2003 and the
> punycode-decoding it doesn't always result in the same label.  See my
> other message.

The original is the thing that is stored in the database. I wouldn't need to
do any conversion to get the original back. In my example I am storing
the original and creating an index on the punycode version.

This is exactly the same method that we commonly use for performing case
insensitive text searches using lower() indexes.

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Andrew Sullivan

Date:

30 December 2014, 00:22:28

[resending, because somehow this got routed through my work address the first time.]

Hi,

I didn't have time to write a short note, so I wrote a long one
instead.  Sorry.

On Mon, Dec 29, 2014 at 10:36:42PM +0000, Mike Cardwell wrote:
> can't just encode it with punycode and then store the ascii result. For example,
> these two are the same hostnames thanks to unicode case folding [1]:
>
>   tesst.ëxämplé.com
>   teßt.ëxämplé.com

Well, in IDNA2003 they're the same.  In IDNA2008 (RFC 5890 and suite),
they're not the same.  In UTS46, they're kind of the same, because
pre-lookup processing maps one of them to the other (it depends which
mode you're in which way the mapping goes, which is just fantastic
because you can't tell at the server which mode the client is in.
IDNA is an unholy mess); but the lookup is still done using the
IDNA2008 rules, approximately.

> They both encode in punycode to the same thing:
>
>   xn--tesst.xmpl.com-cib7f2a

Under no circumstances should they encode to that.  IDNA, either 2003
or 2008, is label by label, not for entire domain names.  (The labels
are the things between the dots in a presentation-format FQDN.)

So, under IDNA2003, you should get tesst.xn--xmpl-loa7ai.com for both
of them.  Under IDNA2008, you should get that also for the first of
them, and xn--tet-6ka.xn--xmpl-loa7ai.com for the second.  Most
clients, however, are probably going to want to do something UTS46ish,
but there's no way to guarantee that the IDNA2003 and IDNA2008 labels
go together (this is called "variants", and you do not want to know
how much horror that has caused).

Worse, of course, you also don't know that what you have isn't just
raw UTF-8 in the label, but let's boil one ocean at a time.

> Don't believe me, then try visiting any domain with two s's in, whilst replacing
> the s's with ß's. E.g:
>
>   ericßon.com
>   nißan.com
>   americanexpreß.com

This depends entirely on which version of IDNA you're using.  Many
browsers right now officially do IDNA2003.  Unfortunately, they don't
_actually_ do that either because IDNA2003 is nominally nailed to
Unicode version 3.2, and there's approximately zero chance that a
running computer on the Internet these days is using such an old
Unicode version.

IDNA2008, by the way, wasn't something we did (I am one of the people
you can blame for this) for fun.  The very problem you're noting in
IDNA2003 is one of the things we were trying to fix.  Under IDNA2003,
the Unicode-Punycode-Unicode round trip could lose data.  Under
IDNA2008, this is fixed: every A-label (the xn--Punycodehere version)
corresponds to exactly one U-label (the Unicode representation) and
conversely.  (It follows from this that lots of Unicode strings aren't
U-labels, because there are a lot of rules about what can be a
U-label.  For instance, capital letters aren't allowed, because
they're not stable under caseFold.  This is all in the IDNA2008 RFCs,
but they're not an easy read.  We tried.)

> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
> which of those two hostnames was the original representation.

None.  What you need to do is split the name on label boundaries
(which is hard, because believe it or not "." is a valid character in
the DNS, but splitting on the "." character is probably as good as you
can do here.  But look for escaped ones).  Then you can check for
validity under IDNA2008 and IDNA2003, and then you can run it through
Punycode.

ICANN, for whatever it's worth, is using IDNA2008 rules for top-level
domains and as part of its IDNA guidelines, so over time the actually
_registered_ names are going to be either IDNA2008 or, at worst,
UTS46.

> The trouble is, if I store the unicode representation of a hostname instead,
> then when I run queries with conditions like:
>
>   WHERE hostname='nißan.com'
>
> that wont pull out rows where hostname='nissan.com'.

Right.  If I were doing this, I think I'd probably create two
functions, one to do IDNA2003 and one to do IDNA2008.  Then I'd put a
functional index on it for both cases.  Eventually, you'll be able to
drop the IDNA2003 lookup because everything will conform to IDNA2003.
(Note that the WHATWG, which the W3C is going to listen to but which
is not a W3C WG, appears to be trying to undo that; but IDNA2003 is
irredeemably broken.  So there is a mess here brewing.)

> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

Well, see above.  The other way I'd do it is to store _both_ the
IDNA2003 punycode and also the IDNA2008 A-label.  The reason I hate
this is because the lookup is insanely complicated.

> It seems to me that Postgres would benefit from a native hostname type and/or
> a pair of punycode encode/decode functions.

A pair won't work.

> And perhaps even a simple unicode case folding function.

Unicode case folding is _way_ more complicated than you seem to be
thinking here, and importantly has some nasty edge conditions.  For
instance, the natural uppercase of the lowercase sharp s, ß, that
we've been talking about now turns out to be capital sharp S, ẞ
(that's U+1E9E in case you can't see it).  That is not, however, the
uppercase, because the case folding rules in earlier versions of
Unicode (which didn't have U+1E9E) was SS, and the stability rules
require that things not break across versions.  (There are other
problems like this.  For instance, the upper case of é in French is
officially E, and in Québecois is officially É.  And then there's the
Turkic dotless-i and dotted-i rules.)

To do case folding really according to what people expect, you need to
be locale sensitive.  Since the DNS has no locale information, we
couldn't do that in IDNA, so the answer the first time was naïve case
folding (along the lines of the Unicode standard caseFold file), and
the second time to leave the case folding to the user agent, on the
principle that it has a hope of knowing the locale.

>  With the end result that these return TRUE:
>
>   unicode_case_fold('ß') = 'ss'

But that's false.  What's really going on there is that the Unicode
case fold of ß is SS, and that case folded again is ss.

> A native type would also be able to apply suitable constraints, e.g a maximum
> length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
> limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
> on a label, etc.

You seem to want a bunch of label constraints, not all of which are
related to IDNA. I think it would be better to break these up into a
small number of functions.  As it happens, I have a colleague at Dyn
who I think has some need of some of this too, and so it might be
worth spinning up a small project to try to get generic functions:
to_idna2003, to_idna2008, check_ldh, split_labels, and so on.  If this
seems possibly interesting for collaboration, let me know & I'll try
to put together the relevant people.

Best regards,

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Andrew Sullivan

Date:

30 December 2014, 00:26:04

On Tue, Dec 30, 2014 at 12:18:58AM +0000, Mike Cardwell wrote:
>
> This is exactly the same method that we commonly use for performing case
> insensitive text searches using lower() indexes.

Hmm.  How did you get the original, then?  If you have the original
Unicode version, why don't you switch to IDNA2008 publication rules,
which are way more reliable?  In that case, you do have a 1:1 lookup
and you shouldn't have a problem.

If you need variants, then you have a different problem, but that
actually can be specified for the much narrower range of UTF-8
permissible under IDNA2008.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

30 December 2014, 00:26:17

* on the Mon, Dec 29, 2014 at 05:01:44PM -0700, David G Johnston wrote:

>> CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames
>> (lower(punycode_encode(hostname)));
>>
>> That would prevent adding more than one representation for the same
>> hostname
>> to the column.
>
> Except two different hostname can resolve to the same
> punycode_encode(hostname) value

That's exactly what I'm taking advantage of...

> so the unique index won't work.

I think you misunderstand what I'm suggesing. I'm suggesting storing the
unicode version in the database, and then creating a unique index on the
result of the punycode function to prevent more than one representation
of the same hostname being inserted.

> It was also mentioned that using the Perl encoding function was
> non-performant; which is why caching the data into a memoization table has
> value.

It was non-performant because it involved unnecessarily starting a Perl
interpreter. Not because the algorithm its self is slow.

> WHERE lower(punycode_encode(hostname)) =
> > lower(punycode_encode('any-representation'))
> >
>
> I'm not for knowing the rules of punycode but I'm not seeing what value
> lower() provides here...

Case insensitive matching. So that "EXAMPLE.COM" = "example.com"

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

30 December 2014, 00:48:25

* on the Mon, Dec 29, 2014 at 07:22:21PM -0500, Andrew Sullivan wrote:

>> can't just encode it with punycode and then store the ascii result. For example,
>> these two are the same hostnames thanks to unicode case folding [1]:
>>
>>   tesst.ëxämplé.com
>>   teßt.ëxämplé.com
>
> Well, in IDNA2003 they're the same.  In IDNA2008 (RFC 5890 and suite),
> they're not the same.  In UTS46, they're kind of the same, because
> pre-lookup processing maps one of them to the other (it depends which
> mode you're in which way the mapping goes, which is just fantastic
> because you can't tell at the server which mode the client is in.
> IDNA is an unholy mess); but the lookup is still done using the
> IDNA2008 rules, approximately.

Heh. And I just thought I was finally starting to get to grips with this stuff.

>> They both encode in punycode to the same thing:
>>
>>   xn--tesst.xmpl.com-cib7f2a
>
> Under no circumstances should they encode to that.

Eurgh, you're right. The library I'm using does actually do it right, I just
forgot to split on the dot and encode each label separately when writing the
examples for this email. Sorry for confusing matters.

[snip lots of useful and interesting information]

> You seem to want a bunch of label constraints, not all of which are
> related to IDNA. I think it would be better to break these up into a
> small number of functions.  As it happens, I have a colleague at Dyn
> who I think has some need of some of this too, and so it might be
> worth spinning up a small project to try to get generic functions:
> to_idna2003, to_idna2008, check_ldh, split_labels, and so on.  If this
> seems possibly interesting for collaboration, let me know & I'll try
> to put together the relevant people.

Those functions would be very useful to me. I know a bit of C, but probably not
enough to produce an acceptable patch. If there are people who would also find
these functions useful, and people motivated to implement them, that would
be great...

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Mike Cardwell

Date:

30 December 2014, 00:53:47

* on the Mon, Dec 29, 2014 at 07:25:59PM -0500, Andrew Sullivan wrote:

>> This is exactly the same method that we commonly use for performing case
>> insensitive text searches using lower() indexes.
>
> Hmm.  How did you get the original, then?

The "original" in my case, is the hostname which the end user supplied.
Essentially, when I display it back to them, I want to make sure it is
displayed the same way that it was when they originally submitted it.

> If you have the original
> Unicode version, why don't you switch to IDNA2008 publication rules,
> which are way more reliable?  In that case, you do have a 1:1 lookup
> and you shouldn't have a problem.

I was unaware of the different versions of IDNA. I basically started using
the Perl module IDNA::Punycode in my project and assumed that this was the
only type. Seems like I need to do some more reading.

--
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4

Attachment

signature.asc

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Andrew Sullivan

Date:

30 December 2014, 01:53:21

On Tue, Dec 30, 2014 at 12:53:42AM +0000, Mike Cardwell wrote:
> > Hmm.  How did you get the original, then?
>
> The "original" in my case, is the hostname which the end user supplied.
> Essentially, when I display it back to them, I want to make sure it is
> displayed the same way that it was when they originally submitted it.

Ah.  This gets even better™ for you, then, because whereas in IDNA2003
you can pass it an old fashioned LDH name (letter, digit, hypen),
IDNA2008 treats those as _outside_ the spec.  So basically, you first
have to get a label and determine whether it is LDH or not (you can do
this by checking for any octet outside the LDH range) and then you can
decide which way to process it.  In IDNA2003, the punycode output from
an LDH label turns out always to be the LDH label.  The reason for
this is that you're supposed to validate that a U-label is really a
U-label before registering in IDNA2008, and lots of perfectly good LDH
labels (like EXAMPLE) are not valid under IDNA2008 because of upper
case.

(If by now you think that maybe it's time for this DNS thing to get
replaced, you have company.)

> I was unaware of the different versions of IDNA. I basically started using
> the Perl module IDNA::Punycode in my project and assumed that this was the
> only type. Seems like I need to do some more reading.

Yeah, this is all made much harder by the fact that several IDN
libraries still do 2003.  Here is one that many people are using for
IDNA2008:
<https://gitorious.org/libidn2/libidn2/source/0d6b5c0a9f1e4a9742c5ce32b6241afb4910cae1:>
It's GPLv3, though, which brings its own issues.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Hostnames, IDNs, Punycode and Unicode Case Folding

From

Jonathan Vanasco

Date:

05 January 2015, 20:46:54

On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote:

> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

I actually do the same exact thing and don't think it's hackish.  I actually really like being able to see the punycode
nextto the human representation on simple db pulls.  It's barely more disk space and really useful.