Re: chr() is still too loose about UTF8 code points - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: chr() is still too loose about UTF8 code points
Date
Msg-id 537646AC.2020201@dunslane.net
In response to Re: chr() is still too loose about UTF8 code points  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> Quite some time ago, we made the chr() function accept Unicode code points
>> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
>> string.  It was pointed out to me though that RFC3629 restricted the
>> original definition of UTF8 to only allow code points up to U+10FFFF (for
>> compatibility with UTF16).  While that might not be something we feel we
>> need to follow exactly, pg_utf8_islegal implements the checking algorithm
>> specified by RFC3629, and will therefore reject points above U+10FFFF.
>>
>> This means you can use chr() to create values that will be rejected on
>> dump and reload:
>>
>> u8=# create table tt (f1 text);
>> CREATE TABLE
>> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
>> INSERT 0 1
>> u8=# select * from tt;
>>   f1
>> ----
>>
>> (1 row)
>>
>> u8=# \copy tt to 'junk'
>> COPY 1
>> u8=# \copy tt from 'junk'
>> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
>> CONTEXT:  COPY tt, line 1
>> LOCATION:  report_invalid_encoding, wchar.c:2011
>>
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff.  Should we back-patch that, or just do it in HEAD?
>
> +1 for back-patching. A value that cannot be restored is bad, and I 
> can't imagine any legitimate use case for producing a Unicode 
> character larger than U+10FFFF with chr(x), when the rest of the 
> system doesn't handle it. Fully supporting such values might be 
> useful, but that's a different story.
>
>

My understanding is that U+10FFFF is the highest legal Unicode code 
point anyway. So this is really just tightening our routines to make 
sure we don't produce an invalid value. We won't be disallowing anything 
that is legal Unicode.
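
To make the arithmetic concrete, here is a rough Python sketch (not the
PostgreSQL code; the helper names encode_4byte and is_legal_code_point are
made up for illustration). It packs a value into the 4-byte UTF8 bit
pattern without any upper-bound check, which is roughly what the loose
chr() behaviour amounts to, and then applies the RFC3629 limit separately.
It only looks at the upper bound and ignores other rules such as
surrogates.

```python
# Illustrative sketch only; these helpers are hypothetical and are not
# taken from the PostgreSQL source tree.

MAX_UNICODE = 0x10FFFF  # highest code point permitted by RFC3629


def encode_4byte(cp: int) -> bytes:
    """Pack cp into the 4-byte UTF8 bit pattern (11110xxx 10xxxxxx x3).

    Deliberately accepts anything that fits in 21 bits, so values above
    U+10FFFF are encoded even though RFC3629 forbids them.
    """
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("value does not use the 4-byte form")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # continuation bytes carry 6 bits each
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_legal_code_point(cp: int) -> bool:
    """Upper-bound check only: True if cp is within the RFC3629 range."""
    return 0 <= cp <= MAX_UNICODE


if __name__ == "__main__":
    for cp in (0x10FFFF, 0x1FFFFF):
        print(hex(cp), encode_4byte(cp).hex(" "), is_legal_code_point(cp))
```

For 0x1FFFFF the packed bytes are f7 bf bf bf, the same sequence shown in
the COPY error quoted above, and the range check rejects it. U+10FFFF
itself still passes, so nothing that is legal Unicode is lost.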

cheers

andrew


