BUG #13785: Postgresql encoding screw-up - Mailing list pgsql-bugs

From ntpt@seznam.cz
Subject BUG #13785: Postgresql encoding screw-up
Date
Msg-id 20151126121558.2760.44255@wrigleys.postgresql.org
Responses Re: BUG #13785: Postgresql encoding screw-up  (Feike Steenbergen <feikesteenbergen@gmail.com>)
Re: BUG #13785: Postgresql encoding screw-up  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13785
Logged by:          ntpt
Email address:      ntpt@seznam.cz
PostgreSQL version: 9.4.5
Operating system:   linux
Description:

Hi all,

I have been using PostgreSQL for several years, but after a recent hardware upgrade I ran into a problem:

pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in
encoding "UTF8" has no equivalent in encoding "WIN1250"
This is strange. The sequence of events was:

1. There was a database with LATIN2 encoding.
2. An application connected to this database with "set client_encoding to win1250" and manipulated data.
3. The database was dumped with pg_dump -E UTF8.
4. The dump was restored with pg_restore into a UTF8-encoded database on another cluster.
5. The application connected to the new database with "set client_encoding to win1250" - and the query failed with the error above.

How could invalid characters reach the database in this scenario?


The whole thread is here:

http://postgresql.nabble.com/Query-failed-ERROR-character-with-byte-sequence-0xc2-0x96-in-encoding-quot-UTF8-quot-has-no-equivale-td5875048.html

I have examined this situation. I am afraid there is a major design flaw or
bug that can corrupt a lot of databases.


Please look at https://cs.wikipedia.org/wiki/Windows-1250 (in Czech only);
the comparison table "Srovnání s ISO 8859-2" ("Comparison with ISO 8859-2") is self-explanatory.

Consider a situation where the database has LATIN2 encoding but the client
sets client_encoding to win1250 and manipulates data. Because some characters
in cp1250 (win1250) are not present in the LATIN2 encoding, PostgreSQL stores
the byte values of those characters "as is", i.e. without charset translation.
For example, 0x96 (EN DASH in win1250) is stored as 0x96 (undefined in LATIN2),
at least in version 9.0.
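The two readings of that stored 0x96 byte can be reproduced with Python's codec tables (a sketch of the encoding arithmetic only, not of PostgreSQL internals; note that Python's iso8859-2 codec maps the C1 range 0x80-0x9F to the corresponding Unicode control characters):

```python
raw = b'\x96'  # the byte the client sent under client_encoding = win1250

# Interpreted as cp1250 (what the client meant): EN DASH, U+2013
print(raw.decode('cp1250'))            # prints an en dash

# Interpreted as ISO 8859-2 / LATIN2 (what the database believes):
# an unprintable C1 control character, U+0096, not a dash at all
print(repr(raw.decode('iso8859-2')))   # prints '\x96'
```

The same byte thus means two entirely different characters depending on which encoding is assumed, which is the root of everything that follows.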


If you read from that database with client_encoding set to win1250, these "as
is" characters are returned "as is", and from the point of view of the client
application everything works as expected.

But if you need to migrate and transcode the database to UTF8 the recommended
way (pg_dump -E utf8), things go wrong.


Because there is no character 0x96 in LATIN2, the transcoder to UTF8 does not
know how to treat this character and leaves it "as is", producing the
character U+0096 in the output.
But if you look at the table "Mapování do Unikódu" ("Mapping to Unicode") on
the wiki page mentioned above, the proper Unicode code point for this
character in WIN1250 is U+2013. (And PostgreSQL cannot know that, because it
does not know which code page these "as is" bytes came from.)


So if you now restore it into a new UTF8 database, some characters will carry
the "unicodized" raw byte values of characters that were not present in the
original database encoding but were present in the original client encoding.
The result is the error described at the top of this post.


So everyone who needs to migrate to UTF8 from a database whose encoding has no
representation for some characters in the client encodings ends up like this:
with a working, but corrupted and non-transferable database.

I think the safe practice would be: pg_dump with -E set to the encoding used
by the client application, and then restore into a newly created UTF8
database. At the very least, this should be mentioned in the documentation as
the safe way.
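The difference between the two migration paths can be sketched in Python (again a codec-level simulation under the assumption that the stored bytes were written "as is" by a cp1250 client): dumping via LATIN2 preserves the meaningless U+0096, while dumping in the encoding the client actually used yields a real EN DASH that survives the round trip back to win1250.

```python
stored = b'\x96'  # byte stored "as is" in the LATIN2 database

# Unsafe path (pg_dump -E UTF8 against the LATIN2 database):
# transcodes via LATIN2 and preserves the meaningless U+0096.
unsafe = stored.decode('iso8859-2').encode('utf-8')   # b'\xc2\x96'

# Safe path (dump in the encoding the client actually used,
# then load into the new UTF8 database): a real EN DASH, U+2013.
safe = stored.decode('cp1250').encode('utf-8')        # b'\xe2\x80\x93'

# Only the safe result converts back to win1250 for the client.
assert safe.decode('utf-8').encode('cp1250') == stored
```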

But in an environment where clients use multiple encodings that have more
characters than the database encoding, the database is corrupted forever: it
will keep working, but it cannot be repaired and migrated to another encoding.
Correct me if I am wrong.

PS: excuse my bad English
