Thread: PostgreSQL fails to convert decomposed utf-8 to other encodings

PostgreSQL fails to convert decomposed utf-8 to other encodings

From
Craig Ringer
Date:
There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

PostgreSQL doesn't normalize utf-8 to precomposed form before converting,
so decomposed utf-8 is not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
 ?column?
----------
 á
(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
 ?column?
----------
 á
(1 row)


regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)


This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)


... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.
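
For anyone who can't see the difference on screen, the two spellings differ only at the byte level, which is where the 0xcc 0x81 in the error message comes from. A minimal sketch in Python (used purely for illustration here; it is not part of the original report):

```python
# Decomposed: 'a' (U+0061) followed by COMBINING ACUTE ACCENT (U+0301)
decomposed = "\u0061\u0301"
# Precomposed: LATIN SMALL LETTER A WITH ACUTE (U+00E1)
precomposed = "\u00e1"

decomposed_bytes = decomposed.encode("utf-8")
precomposed_bytes = precomposed.encode("utf-8")

print(decomposed_bytes.hex(" "))   # 61 cc 81
print(precomposed_bytes.hex(" "))  # c3 a1
```

The trailing cc 81 is the combining accent on its own, and that code point has no LATIN1 equivalent.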


We should've probably been normalizing decomposed sequences to
precomposed as part of utf-8 validation wherever 'text' input occurs,
but it's too late for that now as DBs in the wild will contain
decomposed chars. Instead, conversion functions need to normalize
decomposed chars to precomposed before converting from utf-8 to another
encoding.
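
A rough sketch of that idea, in Python rather than in the backend's C conversion code: compose to NFC first, then convert. The helper name `convert_to_latin1` is hypothetical and only imitates what convert_to(..., 'iso-8859-1') might do under this proposal; it is not how PostgreSQL is implemented.

```python
import unicodedata


def convert_to_latin1(text: str) -> bytes:
    """Hypothetical helper: compose to NFC, then encode as ISO-8859-1."""
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("iso-8859-1")


decomposed = "\u0061\u0301"  # 'a' + combining acute accent

# Without normalization the combining accent has no LATIN1 equivalent.
try:
    decomposed.encode("iso-8859-1")
    raw_conversion_failed = False
except UnicodeEncodeError:
    raw_conversion_failed = True

print(raw_conversion_failed)          # True
print(convert_to_latin1(decomposed))  # b'\xe1'
```

Sequences with no precomposed equivalent in the target encoding would still fail after NFC, which seems like the right outcome.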

Comments?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

From
Tom Lane
Date:
Craig Ringer <craig@2ndquadrant.com> writes:
> There's a bug in encoding conversions from utf-8 to other encodings that
> results in corrupt output if decomposed utf-8 is used.

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected.  But I think there was
some intentional choice to not check \u escapes fully.

            regards, tom lane

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

From
Craig Ringer
Date:
On 08/06/2014 09:14 AM, Tom Lane wrote:
> We don't actually support "decomposed" utf8; if there is any bug here,
> it's that the input you show isn't rejected.  But I think there was
> some intentional choice to not check \u escapes fully.

Combining characters (i.e. decomposed utf-8 form, for chars where there
is a combined equivalent) are part of utf-8. They're not an optional add-on.

So if Pg doesn't support them, it doesn't fully support utf-8. Which is
fine as far as it goes, but must be documented as a limitation at
minimum. (I'll deal with that).

It also means that you get fun anomalies like:

regress=> SELECT 'á' = 'á';
 ?column?
----------
 f
(1 row)

which is IMO insane.
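
The same comparison outside the database, as a sketch in Python: comparing code points says the two strings differ, while comparing after NFC normalization treats them as equal.

```python
import unicodedata

decomposed = "\u0061\u0301"   # 'a' + combining acute accent
precomposed = "\u00e1"        # precomposed a-acute

# Plain comparison works on code points, so the two spellings differ.
raw_equal = decomposed == precomposed

# Normalizing both sides to NFC first makes them compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)

print(raw_equal, nfc_equal)  # False True
```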

Not only that, but we can't reject decomposed forms, because they will
already exist in live installs. That'd break dump and reload of such
installs and cause exciting problems with pg_upgrade.

The "we'll just reject part of utf-8" opportunity has flown. It needs to
be documented as a bug in existing versions, and I guess given that I'm
the one complaining I get to see if I can find a sane fix for 9.5...

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

From
Craig Ringer
Date:
On 08/06/2014 11:54 AM, Craig Ringer wrote:
> On 08/06/2014 09:14 AM, Tom Lane wrote:
>> We don't actually support "decomposed" utf8; if there is any bug here,
>> it's that the input you show isn't rejected.  But I think there was
>> some intentional choice to not check \u escapes fully.
>
> Combining characters (i.e. decomposed utf-8 form, for chars where there
> is a combined equivalent) are part of utf-8. They're not an optional add-on.

... though we can advertise partial Unicode support, saying that we
support UTF-8 for UCS (ISO 10646-1:2000 Annex D / RFC 3629)
implementation level 1 only, requiring Normalization Form C (NFC) input.

Given that Pg doesn't seem to understand \xf8 or \xfc utf-8 chars, so it
doesn't cover the full utf-8 range, it doesn't look like it meets Level
1 either. So it supports "mostly-utf8".

With level 1 we should really _reject_ combining chars, but can't do
that w/o breaking BC.


I guess I should turn this:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

into a regression test.


Possibly also parts of this:

http://www.columbia.edu/~fdc/utf8/

though it's more oriented toward rendering.


It's worth noting that Konsole and Thunderbird had no issues with
combining chars when I was testing this.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

From
Tatsuo Ishii
Date:
> On 08/06/2014 09:14 AM, Tom Lane wrote:
>> We don't actually support "decomposed" utf8; if there is any bug here,
>> it's that the input you show isn't rejected.  But I think there was
>> some intentional choice to not check \u escapes fully.
>
> Combining characters (i.e. decomposed utf-8 form, for chars where there
> is a combined equivalent) are part of utf-8. They're not an optional add-on.
>
> So if Pg doesn't support them, it doesn't fully support utf-8. Which is
> fine as far as it goes, but must be documented as a limitation at
> minimum. (I'll deal with that).
>
> It also means that you get fun anomalies like:
>
> regress=> SELECT 'á' = 'á';
>  ?column?
> ----------
>  f
> (1 row)
>
> which is IMO insane.
>
> Not only that, but we can't reject decomposed forms, because they will
> already exist in live installs. That'd break dump and reload of such
> installs and cause exciting problems with pg_upgrade.
>
> The "we'll just reject part of utf-8" opportunity has flown. It needs to
> be documented as a bug in existing versions, and I guess given that I'm
> the one complaining I get to see if I can find a sane fix for 9.5...

I'm not sure what you mean by decomposed utf8 because there's no such
a thing in the Unicode standard. Maybe you mean "composite character"
or "precomposed character"?

Anyway, in my understanding to handle composite characters, we should do
"Unicode normalization" in the first place. There's 4 types of
normalization:

NFD (Normalization Form Canonical Decomposition)
NFC (Normalization Form Canonical Composition)
NFKD (Normalization Form Compatibility Decomposition)
NFKC (Normalization Form Compatibility Composition)

I don't know how we could implement one of these without major
performance degradation.

Also some composite characters can be decomposed but after composed
again, they do not return to the original form of composite characters
(round trip conversion is impossible). Such characters are called
"Composition Exclusion" (see
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt).
I have no idea how to deal with the issue.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
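
To make the four normalization forms and the composition-exclusion round trip concrete, here is a small sketch using Python's standard unicodedata module. The sample characters are chosen for illustration only: U+FB01 (the "fi" ligature) shows the canonical/compatibility split, and U+0958 (DEVANAGARI LETTER QA) is one entry from CompositionExclusions.txt.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def code_points(text: str) -> list[str]:
    """Render a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]


# a-acute: the D forms split it, the C forms keep it precomposed.
a_acute = {form: code_points(unicodedata.normalize(form, "\u00e1"))
           for form in FORMS}

# The fi ligature: only the compatibility (K) forms rewrite it to 'f' + 'i'.
ligature = {form: unicodedata.normalize(form, "\ufb01") for form in FORMS}

# Composition exclusion: decomposing and recomposing does not round-trip.
original = "\u0958"
decomposed = unicodedata.normalize("NFD", original)
recomposed = unicodedata.normalize("NFC", decomposed)

print(a_acute["NFD"], a_acute["NFC"])
print(ligature["NFC"] == "\ufb01", ligature["NFKC"] == "fi")
print(code_points(decomposed), recomposed == original)
```

For U+0958 the NFC result stays as the two-code-point sequence, so a conversion that normalizes to NFC would not give back the code point that was originally stored.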