Thread: chr() is still too loose about UTF8 code points

chr() is still too loose about UTF8 code points

From: Tom Lane
Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.
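
[Illustration, not from the thread: a minimal sketch of the code-point
ceiling RFC 3629 imposes. The real byte-level check is pg_utf8_islegal
in wchar.c; this hypothetical predicate is not its code.]

#include <stdbool.h>

/*
 * RFC 2279 allowed 4-byte UTF8 sequences up to U+1FFFFF; RFC 3629
 * caps UTF8 at U+10FFFF so that every code point stays representable
 * in UTF16.  Hypothetical check for the tighter bound:
 */
static bool
within_rfc3629_ceiling(unsigned int codepoint)
{
    return codepoint <= 0x10FFFF;
}

Under that bound, chr(2097151), i.e. U+1FFFFF, must be rejected even
though it fits in four bytes.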

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1
----
 
(1 row)

u8=# \copy tt to 'junk' 
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011
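
[Illustration, not from the thread: the standard 4-byte UTF8 bit layout
shows where those bytes come from. This sketch is a generic pre-RFC3629
encoder for this one code point, not PostgreSQL's unicode_to_utf8().]

#include <stdio.h>

int
main(void)
{
    unsigned int cp = 0x1FFFFF;     /* all 21 payload bits set */
    unsigned char buf[4];

    buf[0] = 0xF0 | (cp >> 18);             /* 11110xxx -> 0xF7 */
    buf[1] = 0x80 | ((cp >> 12) & 0x3F);    /* 10xxxxxx -> 0xBF */
    buf[2] = 0x80 | ((cp >> 6) & 0x3F);     /* 10xxxxxx -> 0xBF */
    buf[3] = 0x80 | (cp & 0x3F);            /* 10xxxxxx -> 0xBF */

    /* prints: 0xf7 0xbf 0xbf 0xbf, the sequence COPY rejected */
    printf("0x%02x 0x%02x 0x%02x 0x%02x\n",
           buf[0], buf[1], buf[2], buf[3]);
    return 0;
}

RFC 3629's grammar stops at lead byte 0xF4, which is why the 0xF7 lead
byte fails validation on reload.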

I think this probably means we need to change chr() to reject code points
above 10ffff.  Should we back-patch that, or just do it in HEAD?
        regards, tom lane



Re: chr() is still too loose about UTF8 code points

From: Heikki Linnakangas
On 05/16/2014 06:05 PM, Tom Lane wrote:
> Quite some time ago, we made the chr() function accept Unicode code points
> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
> string.  It was pointed out to me though that RFC3629 restricted the
> original definition of UTF8 to only allow code points up to U+10FFFF (for
> compatibility with UTF16).  While that might not be something we feel we
> need to follow exactly, pg_utf8_islegal implements the checking algorithm
> specified by RFC3629, and will therefore reject points above U+10FFFF.
>
> This means you can use chr() to create values that will be rejected on
> dump and reload:
>
> u8=# create table tt (f1 text);
> CREATE TABLE
> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
> INSERT 0 1
> u8=# select * from tt;
>   f1
> ----
>
> (1 row)
>
> u8=# \copy tt to 'junk'
> COPY 1
> u8=# \copy tt from 'junk'
> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
> CONTEXT:  COPY tt, line 1
> LOCATION:  report_invalid_encoding, wchar.c:2011
>
> I think this probably means we need to change chr() to reject code points
> above 10ffff.  Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I 
can't imagine any legitimate use case for producing a Unicode character 
larger than U+10FFFF with chr(x), when the rest of the system doesn't 
handle it. Fully supporting such values might be useful, but that's a 
different story.

- Heikki



Re: chr() is still too loose about UTF8 code points

From: Andrew Dunstan
On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> Quite some time ago, we made the chr() function accept Unicode code 
>> points
>> up to U+1FFFFF, which is the largest value that will fit in a 4-byte 
>> UTF8
>> string.  It was pointed out to me though that RFC3629 restricted the
>> original definition of UTF8 to only allow code points up to U+10FFFF 
>> (for
>> compatibility with UTF16).  While that might not be something we feel we
>> need to follow exactly, pg_utf8_islegal implements the checking 
>> algorithm
>> specified by RFC3629, and will therefore reject points above U+10FFFF.
>>
>> This means you can use chr() to create values that will be rejected on
>> dump and reload:
>>
>> u8=# create table tt (f1 text);
>> CREATE TABLE
>> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
>> INSERT 0 1
>> u8=# select * from tt;
>>   f1
>> ----
>>
>> (1 row)
>>
>> u8=# \copy tt to 'junk'
>> COPY 1
>> u8=# \copy tt from 'junk'
>> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 
>> 0xbf 0xbf
>> CONTEXT:  COPY tt, line 1
>> LOCATION:  report_invalid_encoding, wchar.c:2011
>>
>> I think this probably means we need to change chr() to reject code 
>> points
>> above 10ffff.  Should we back-patch that, or just do it in HEAD?
>
> +1 for back-patching. A value that cannot be restored is bad, and I 
> can't imagine any legitimate use case for producing a Unicode 
> character larger than U+10FFFF with chr(x), when the rest of the 
> system doesn't handle it. Fully supporting such values might be 
> useful, but that's a different story.
>
>

My understanding is that U+10FFFF is the highest legal Unicode code 
point anyway. So this is really just tightening our routines to make 
sure we don't produce an invalid value. We won't be disallowing anything 
that is legal Unicode.

cheers

andrew



Re: chr() is still too loose about UTF8 code points

From: Tom Lane
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff.  Should we back-patch that, or just do it in HEAD?

> +1 for back-patching. A value that cannot be restored is bad, and I 
> can't imagine any legitimate use case for producing a Unicode character 
> larger than U+10FFFF with chr(x), when the rest of the system doesn't 
> handle it. Fully supporting such values might be useful, but that's a 
> different story.

Well, AFAICT "the rest of the system" does handle any code point up to
U+1FFFFF.  It's only pg_utf8_islegal that's being picky.  So another
possible answer is to weaken the check in pg_utf8_islegal.  However,
that could create interoperability concerns with other software, and
as you say the use-case for larger values seems pretty thin.

Actually, after re-reading the spec there's more to it than this:
chr() will allow creating utf8 sequences that correspond to the
surrogate-pair codes, which are expressly disallowed in UTF8 by
the RFCs.  Maybe we should apply pg_utf8_islegal to the result
string rather than duplicating its checks?
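
[Illustration, not from the thread: validating the result against
pg_utf8_islegal would effectively restrict chr() to Unicode scalar
values. A hypothetical predicate for that combined rule, not
PostgreSQL code:]

#include <stdbool.h>

static bool
is_unicode_scalar_value(unsigned int codepoint)
{
    /* U+D800..U+DFFF are UTF16 surrogates, forbidden in UTF8 by RFC 3629 */
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
        return false;
    return codepoint <= 0x10FFFF;   /* RFC 3629 ceiling */
}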

BTW, there are various places that have comments or ifdef'd-out code
anticipating possible future support of 5- or 6-byte UTF8 sequences,
which were specified in RFC2279 but then rescinded by RFC3629.
I guess as a matter of cleanup we should think about removing that
stuff.
        regards, tom lane



Re: chr() is still too loose about UTF8 code points

From: Noah Misch
On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
> I think this probably means we need to change chr() to reject code points
> above 10ffff.  Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:

http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: chr() is still too loose about UTF8 code points

From: Tom Lane
Noah Misch <noah@leadboat.com> writes:
> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff.  Should we back-patch that, or just do it in HEAD?

> The compatibility risks resemble those associated with the fixes for bug
> #9210, so I recommend HEAD only:

> http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above
10ffff go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points.  It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard.  So now I agree
that this is sounding more like a major-version-only behavioral change.
        regards, tom lane



Re: chr() is still too loose about UTF8 code points

From: David G Johnston
Tom Lane-2 wrote
> Noah Misch <noah@leadboat.com> writes:
>> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>>> I think this probably means we need to change chr() to reject code
>>> points
>>> above 10ffff.  Should we back-patch that, or just do it in HEAD?
> 
>> The compatibility risks resemble those associated with the fixes for bug
>> #9210, so I recommend HEAD only:
> 
>> http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com

> 
> While I'd be willing to ignore that risk so far as code points above
> 10ffff go, if we want pg_utf8_islegal to be happy then we will also
> have to reject surrogate-pair code points.  It's not beyond the realm
> of possibility that somebody is intentionally generating such code
> points with chr(), despite the dump/reload hazard.  So now I agree
> that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this does fall into a
grey area, does 9.4 qualify for this bug fix?

David J.






Re: chr() is still too loose about UTF8 code points

From: Tom Lane
David G Johnston <david.g.johnston@gmail.com> writes:
> Tom Lane-2 wrote
>> While I'd be willing to ignore that risk so far as code points above
>> 10ffff go, if we want pg_utf8_islegal to be happy then we will also
>> have to reject surrogate-pair code points.  It's not beyond the realm
>> of possibility that somebody is intentionally generating such code
>> points with chr(), despite the dump/reload hazard.  So now I agree
>> that this is sounding more like a major-version-only behavioral change.

> I would tend to agree on principle - though since this does fall into a
> grey area, does 9.4 qualify for this bug fix?

I don't think it's too late to change this in 9.4.  The discussion was
about whether to back-patch.
        regards, tom lane