PostgreSQL fails to convert decomposed utf-8 to other encodings - Mailing list pgsql-bugs

From Craig Ringer
Subject PostgreSQL fails to convert decomposed utf-8 to other encodings
Date
Msg-id 53E179E1.3060404@2ndquadrant.com
Responses Re: PostgreSQL fails to convert decomposed utf-8 to other encodings
List pgsql-bugs
There's a bug in encoding conversions from UTF-8 to other encodings that
results in errors or corrupt output if decomposed UTF-8 is used.

PostgreSQL doesn't normalize UTF-8 to precomposed form before converting,
so decomposed UTF-8 is not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
 ?column?
----------
 á
(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
 ?column?
----------
 á
(1 row)


regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)


This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR:  character with byte sequence 0xcc 0x81 in encoding "UTF8" has no equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
 convert_to
------------
 \xe1
(1 row)


... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.
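
To check which form a given string is actually in, it can help to dump
its UTF-8 bytes. The following is only a rough client-side sketch in
Python, not anything PostgreSQL does; the helper name utf8_hex is made up
for illustration. The two strings render identically but are different
byte sequences, and 0xcc 0x81 is the combining acute accent that the
error message complains about.

```python
# Two spellings of the same visible character.
decomposed = "\u0061\u0301"   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"        # LATIN SMALL LETTER A WITH ACUTE


def utf8_hex(s: str) -> str:
    """Return the UTF-8 encoding of s as space-separated hex bytes.

    Hypothetical helper, used here only for illustration.
    """
    return " ".join(f"0x{b:02x}" for b in s.encode("utf-8"))


print(utf8_hex(decomposed))       # 0x61 0xcc 0x81
print(utf8_hex(precomposed))      # 0xc3 0xa1
print(decomposed == precomposed)  # False: same glyph, different code points
```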


We probably should have been normalizing decomposed sequences to
precomposed as part of UTF-8 validation wherever 'text' input occurs,
but it's too late for that now, as DBs in the wild will contain
decomposed characters. Instead, conversion functions need to normalize
decomposed characters to precomposed before converting from UTF-8 to
another encoding.
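
As a rough illustration of the "normalize, then convert" idea, here is a
minimal sketch in Python rather than in the backend's C conversion code.
It assumes NFC is the normalization wanted, and the function name
convert_to_latin1 is hypothetical. It is not a proposed patch, just the
behaviour I'd expect from the conversion functions.

```python
import unicodedata


def convert_to_latin1(text: str) -> bytes:
    """Compose the text to NFC, then encode it as ISO-8859-1.

    Hypothetical helper: it sketches the proposed behaviour on the client
    side and is not how PostgreSQL's conversion functions are written.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("iso-8859-1")


decomposed = "\u0061\u0301"   # 'a' + combining acute accent
precomposed = "\u00e1"        # precomposed a-acute

# Without normalization, the combining accent has no Latin-1 equivalent.
try:
    decomposed.encode("iso-8859-1")
    failed_without_normalization = False
except UnicodeEncodeError:
    failed_without_normalization = True

print(failed_without_normalization)    # True
print(convert_to_latin1(decomposed))   # b'\xe1'
print(convert_to_latin1(precomposed))  # b'\xe1'
```

With normalization in front of the conversion, both spellings produce the
same single byte, matching the \xe1 that convert_to() returns for the
precomposed input above.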

Comments?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
