Thread: chr() is still too loose about UTF8 code points
Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
 f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10ffff.  Should we back-patch that, or just do it in HEAD?

			regards, tom lane
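A minimal sketch of the kind of check being proposed above, for illustration
only (the variable name, error code, and message wording are assumptions,
not a committed patch). For background: a 4-byte UTF8 sequence carries
3+6+6+6 = 21 payload bits, which is where the existing 0x1FFFFF ceiling
comes from, while RFC3629 caps valid code points at 0x10FFFF.

    /*
     * Hypothetical tightened range check for chr() under UTF8 encoding.
     * Anything above U+10FFFF encodes to bytes that pg_utf8_islegal()
     * will reject on dump and reload, so refuse to produce it at all.
     */
    if (cvalue > 0x0010FFFF)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("requested character too large for encoding: %d",
                        cvalue)));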
On 05/16/2014 06:05 PM, Tom Lane wrote:
> Quite some time ago, we made the chr() function accept Unicode code points
> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
> string. It was pointed out to me though that RFC3629 restricted the
> original definition of UTF8 to only allow code points up to U+10FFFF (for
> compatibility with UTF16). While that might not be something we feel we
> need to follow exactly, pg_utf8_islegal implements the checking algorithm
> specified by RFC3629, and will therefore reject points above U+10FFFF.
>
> This means you can use chr() to create values that will be rejected on
> dump and reload:
>
> u8=# create table tt (f1 text);
> CREATE TABLE
> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
> INSERT 0 1
> u8=# select * from tt;
>  f1
> ----
>
> (1 row)
>
> u8=# \copy tt to 'junk'
> COPY 1
> u8=# \copy tt from 'junk'
> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
> CONTEXT:  COPY tt, line 1
> LOCATION:  report_invalid_encoding, wchar.c:2011
>
> I think this probably means we need to change chr() to reject code points
> above 10ffff. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I can't
imagine any legitimate use case for producing a Unicode character larger
than U+10FFFF with chr(x), when the rest of the system doesn't handle it.
Fully supporting such values might be useful, but that's a different story.

- Heikki
On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> Quite some time ago, we made the chr() function accept Unicode code points
>> up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
>> string. It was pointed out to me though that RFC3629 restricted the
>> original definition of UTF8 to only allow code points up to U+10FFFF (for
>> compatibility with UTF16). While that might not be something we feel we
>> need to follow exactly, pg_utf8_islegal implements the checking algorithm
>> specified by RFC3629, and will therefore reject points above U+10FFFF.
>>
>> This means you can use chr() to create values that will be rejected on
>> dump and reload:
>>
>> u8=# create table tt (f1 text);
>> CREATE TABLE
>> u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
>> INSERT 0 1
>> u8=# select * from tt;
>>  f1
>> ----
>>
>> (1 row)
>>
>> u8=# \copy tt to 'junk'
>> COPY 1
>> u8=# \copy tt from 'junk'
>> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
>> CONTEXT:  COPY tt, line 1
>> LOCATION:  report_invalid_encoding, wchar.c:2011
>>
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff. Should we back-patch that, or just do it in HEAD?
>
> +1 for back-patching. A value that cannot be restored is bad, and I
> can't imagine any legitimate use case for producing a Unicode
> character larger than U+10FFFF with chr(x), when the rest of the
> system doesn't handle it. Fully supporting such values might be
> useful, but that's a different story.

My understanding is that U+10FFFF is the highest legal Unicode code point
anyway. So this is really just tightening our routines to make sure we
don't produce an invalid value. We won't be disallowing anything that is
legal Unicode.

cheers

andrew
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff. Should we back-patch that, or just do it in HEAD?

> +1 for back-patching. A value that cannot be restored is bad, and I
> can't imagine any legitimate use case for producing a Unicode character
> larger than U+10FFFF with chr(x), when the rest of the system doesn't
> handle it. Fully supporting such values might be useful, but that's a
> different story.

Well, AFAICT "the rest of the system" does handle any code point up to
U+1FFFFF. It's only pg_utf8_islegal that's being picky. So another
possible answer is to weaken the check in pg_utf8_islegal. However,
that could create interoperability concerns with other software, and as
you say the use-case for larger values seems pretty thin.

Actually, after re-reading the spec there's more to it than this: chr()
will allow creating utf8 sequences that correspond to the surrogate-pair
codes, which are expressly disallowed in UTF8 by the RFCs. Maybe we
should apply pg_utf8_islegal to the result string rather than
duplicating its checks?

BTW, there are various places that have comments or ifdefd-out code
anticipating possible future support of 5- or 6-byte UTF8 sequences,
which were specified in RFC2279 but then rescinded by RFC3629. I guess
as a matter of cleanup we should think about removing that stuff.

			regards, tom lane
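A hedged sketch of the "apply pg_utf8_islegal to the result" idea floated
above. unicode_to_utf8(), pg_utf8_mblen(), and pg_utf8_islegal() are
existing backend routines in the wchar.c area; the surrounding scaffolding
(the cvalue variable and the error report) is assumed for illustration.

    /*
     * Encode the code point, then run the bytes through the same
     * validator that COPY uses, so chr() can never emit a sequence that
     * pg_utf8_islegal() would later reject.  This catches both code
     * points above U+10FFFF and the surrogate range U+D800..U+DFFF,
     * whose encodings (0xed 0xa0 0x80 through 0xed 0xbf 0xbf) fail the
     * RFC3629 checks.
     */
    unsigned char buf[4];       /* unicode_to_utf8 writes at most 4 bytes */
    int         len;

    unicode_to_utf8((pg_wchar) cvalue, buf);
    len = pg_utf8_mblen(buf);
    if (!pg_utf8_islegal(buf, len))
        ereport(ERROR,
                (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
                 errmsg("requested character not valid for encoding: %d",
                        cvalue)));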
On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
> I think this probably means we need to change chr() to reject code points
> above 10ffff. Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:
http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes:
> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points
>> above 10ffff. Should we back-patch that, or just do it in HEAD?

> The compatibility risks resemble those associated with the fixes for bug
> #9210, so I recommend HEAD only:
> http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above
10ffff go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points. It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard. So now I agree
that this is sounding more like a major-version-only behavioral change.

			regards, tom lane
Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>>> I think this probably means we need to change chr() to reject code points
>>> above 10ffff. Should we back-patch that, or just do it in HEAD?
>
>> The compatibility risks resemble those associated with the fixes for bug
>> #9210, so I recommend HEAD only:
>> http://www.postgresql.org/message-id/flat/20140220043940.GA3064539@tornado.leadboat.com
>
> While I'd be willing to ignore that risk so far as code points above
> 10ffff go, if we want pg_utf8_islegal to be happy then we will also
> have to reject surrogate-pair code points. It's not beyond the realm
> of possibility that somebody is intentionally generating such code
> points with chr(), despite the dump/reload hazard. So now I agree
> that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this does fall in a
grey area, does 9.4 qualify for this bug-fix?

David J.
David G Johnston <david.g.johnston@gmail.com> writes:
> Tom Lane wrote:
>> While I'd be willing to ignore that risk so far as code points above
>> 10ffff go, if we want pg_utf8_islegal to be happy then we will also
>> have to reject surrogate-pair code points. It's not beyond the realm
>> of possibility that somebody is intentionally generating such code
>> points with chr(), despite the dump/reload hazard. So now I agree
>> that this is sounding more like a major-version-only behavioral change.

> I would tend to agree on principle - though since this does fall in a
> grey area, does 9.4 qualify for this bug-fix?

I don't think it's too late to change this in 9.4. The discussion was
about whether to back-patch.

			regards, tom lane