BUG #13785: Postgresql encoding screw-up

From
ntpt@seznam.cz
Date:
The following bug has been logged on the website:

Bug reference:      13785
Logged by:          ntpt
Email address:      ntpt@seznam.cz
PostgreSQL version: 9.4.5
Operating system:   linux
Description:

Hi, all


I have used PostgreSQL for several years, but after a recent hardware
upgrade I ran into a problem:

pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in
encoding "UTF8" has no equivalent in encoding "WIN1250"
It is strange. First there was a database with latin2 encoding.
An application connected to this database with "set client_encoding to
win1250" and manipulated data.
Then the database was dumped with pg_dump -E UTF8.
Then it was restored with pg_restore into a UTF8-encoded database on
another cluster.
Then the application connected to the new database with "set
client_encoding to win1250", and the query failed.
How, in this scenario, could invalid characters reach the database?
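The final failing step is easy to reproduce outside PostgreSQL. A minimal
Python sketch, with Python's cp1250 codec standing in for the server's
WIN1250 conversion:

```python
# U+0096 is the character that the LATIN2 -> UTF8 dump produced from the
# raw 0x96 byte.  WIN1250 has no mapping for U+0096, so converting it
# back fails -- the same class of failure the server reports.
try:
    '\u0096'.encode('cp1250')
except UnicodeEncodeError as e:
    print(e)
```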


The whole thread is here:

http://postgresql.nabble.com/Query-failed-ERROR-character-with-byte-sequence-0xc2-0x96-in-encoding-quot-UTF8-quot-has-no-equivale-td5875048.html

I have examined this situation. I am afraid there is a major design flaw
or bug here that can screw up a lot of databases.


Please look at https://cs.wikipedia.org/wiki/Windows-1250 (in Czech only,
but the comparison table "Srovnání s ISO 8859-2" ("Comparison with
ISO 8859-2") is self-explanatory).

In a situation where the database has latin2 encoding but the client sets
client_encoding to win1250 and manipulates data: because some characters
in cp1250 (win1250) are not present in the latin2 encoding, PostgreSQL
stores the values of those characters "as is", i.e. without charset
translation. For example 0x96 (EN DASH in win1250) is stored as 0x96
(undefined in latin2), at least in version 9.0.
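The mismatch between the two character sets can be checked with standard
codec tables. A small Python sketch (Python's names for these encodings
are cp1250 and iso8859-2):

```python
# 0x96 is EN DASH (U+2013) in WIN1250 ...
assert b'\x96'.decode('cp1250') == '\u2013'

# ... but Latin-2 (ISO 8859-2) has no EN DASH at all, so there is no
# byte the server could translate 0x96 into:
try:
    '\u2013'.encode('iso8859-2')
except UnicodeEncodeError:
    print('EN DASH has no LATIN2 representation')
```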


If you read from that database with client_encoding set to win1250, these
"as is" characters are returned "as is", and from the point of view of
the client application everything works as expected.

But if you need to migrate and transcode the database to utf8 in the
recommended way (pg_dump -E utf8), things go weird.


Because there is no character 0x96 in latin2, the transcoder to utf8 does
not know how to treat this character and leaves it "as is", producing the
character \u0096 in the output.
But if you look at the table "Mapování do Unikódu" ("Mapping to Unicode")
in the wiki article mentioned above, the proper Unicode code point for
this character in WIN1250 is \u2013. (And PostgreSQL cannot know this,
because it does not know which code page these "as is" characters came
from.)
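The two interpretations of the same stored byte can be compared directly.
A small Python sketch (Python's iso8859-2 codec also fills the 0x80-0x9F
range with control characters, as described):

```python
# Decoding the raw byte as Latin-2 yields the C1 control U+0096, since
# the ISO 8859 family is conventionally combined with the ISO 6429
# control codes in the 0x80-0x9F range:
assert b'\x96'.decode('iso8859-2') == '\u0096'

# Decoding the same byte as the client's real encoding, WIN1250, yields
# the character the user actually typed:
assert b'\x96'.decode('cp1250') == '\u2013'   # EN DASH
```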


So if you now restore it into a new utf8 database, some characters will
carry the "unicodized" raw values of characters that were not present in
the original database encoding, but were present in the original client
encoding. The result is the error described in my mailing list post.


So everybody who needs to migrate to utf8, in a situation where the
original database encoding has no representation for some characters in
the client encodings, ends up like this: with a working, but screwed-up
and non-transferable database.

I think the safe practice would be: pg_dump with -E set to the encoding
used by the client application, and then restore into a newly created
utf8 database. This should at least be mentioned as the safe way in the
docs.

But in an environment where clients use multiple encodings that have more
characters than the database encoding, the database is screwed forever:
it will work, but it cannot be repaired and migrated to another encoding.
Correct me if I am wrong.

PS: excuse my bad English

Re: BUG #13785: Postgresql encoding screw-up

From
Feike Steenbergen
Date:
Hi,

> there is a major design flaw or bug

I feel your pain, but how is this a bug? Once the character that cannot be
mapped to latin2 is stored, there is no information about the
source-encoding
(win1250) of this character available anymore. Any client connecting
(whether your application or pg_dump) will get that character "as is".

I don't see a general way around this, other than rejecting characters
that do not fit in the target character set.

> where client use multiple encodings that have more characters then
> database encoding, the database is screwed forever

The allowed conversions from LATIN2 to other encodings are quite limited
(MULE_INTERNAL, UTF8, WIN1250), see:
http://www.postgresql.org/docs/9.4/static/multibyte.html#AEN35768

If the clients using different encodings all touch the same data, the
data is already dirty; the migration only brings it to light.

If the clients all touch different parts of the data, the data can be
safely migrated by exporting each distinct part in its correct encoding
and then importing it with that encoding into the target database with
UTF8 encoding.

> I thik that safe practice would be: Pg_dum with -E as used by client
> applicaton and then restore to newly created utf8 database. It should
> be mentioned as safe way in the doc, at least

This looks safe to me: you export the unknown-character data in its
original encoding, thereby making the characters known again. If you then
import this into UTF8, it will be encoded correctly, because both the
source (WIN1250) and the target (UTF8) can encode these characters.
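That both source and target can encode the character is easy to verify.
A small Python sketch of the lossless round trip (cp1250 again standing
in for WIN1250):

```python
raw = b'\x96'                   # the byte as the client wrote it
text = raw.decode('cp1250')     # interpret it as WIN1250: EN DASH
utf8 = text.encode('utf-8')     # re-encode for the UTF8 database
assert utf8 == b'\xe2\x80\x93'  # the correct UTF-8 bytes for U+2013

# And the application can keep reading it back as WIN1250:
assert utf8.decode('utf-8').encode('cp1250') == raw
```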

regards,

Feike Steenbergen

Re: BUG #13785: Postgresql encoding screw-up

From
"Peter J. Holzer"
Date:
On 2015-11-26 12:15:58 +0000, ntpt@seznam.cz wrote:
> Because there is no character 0x96 in latin2 , transcoder to utf8 does not
> know  the recipe how treat this chracter  - and leave it "as is" producing
> \u0096 character in output .

Actually, it does know. While the standard ISO-8859-2[1] only defines
the printable characters, those are commonly combined with the control
characters from ISO 6429, which does define a control code 0x96 (SPA).
The Unicode standard also defines two blocks of control characters and
they have the same code points: 0x0D (CR) is translated to U+000D, 0x1B
(ESC) is translated to U+001B and 0x96 (SPA) is translated to U+0096.

So the "problem" here isn't that PostgreSQL doesn't know how to
translate an ISO-8859-2 0x96 into unicode (if that was the case, it
could reject it, forcing the user to fix the client configuration), but
that it does know how to convert it and therefore does it - even if it
is almost certainly wrong (when did you ever need an SPA character?).

    hp

PS: I would write a script which fixes the wrong characters in situ.
    That takes a bit of scripting, but:
    * You probably can't guarantee that all your clients are fixed,
      so the problem may crop up again
    * So you want to be able to find out when that happens and fix it
      again without taking the db down.
    * A script like that can be expanded to fix other encoding errors,
      too (e.g. UTF-8 double-encoding, ISO-8859-2 vs. ISO-8859-1, ...)
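A minimal sketch of such an in-situ fix in Python, assuming (and this is
the risky part, to be verified against your actual data) that every stray
C1 control character in U+0080..U+009F really originated as a WIN1250
byte that was stored "as is":

```python
def fix_win1250_mojibake(s: str) -> str:
    """Reinterpret stray C1 control characters as WIN1250.

    Assumes every code point in U+0080..U+009F came from a WIN1250 byte
    stored "as is" in a LATIN2 database and carried verbatim into UTF8.
    """
    out = []
    for ch in s:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Recover the original byte, then decode it as WIN1250.
                ch = ch.encode('latin-1').decode('cp1250')
            except UnicodeDecodeError:
                pass  # byte has no WIN1250 meaning; leave it alone
        out.append(ch)
    return ''.join(out)

print(fix_win1250_mojibake('foo \u0096 bar'))  # foo – bar
```

The real script would pull the affected text columns over a UTF8
connection, apply something like this, and write the rows back.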

[1] I actually looked at ECMA-94, but they should be identical.

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/