Re: Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no equivalent in encoding "WIN1250" - Mailing list pgsql-general

From Albe Laurenz
Subject Re: Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no equivalent in encoding "WIN1250"
Date
Msg-id A737B7A37273E048B164557ADEF4A58B50FEAC5D@ntex2010i.host.magwien.gv.at
In response to Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no equivalent in encoding "WIN1250"  (NTPT <NTPT@seznam.cz>)
List pgsql-general
NTPT wrote:
> I need help.
> 
> pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no
> equivalent in encoding "WIN1250"
> 
> It is strange. First there was a database with latin2 encoding.
> 
> An application connected to this database with "set client encoding to win1250" and manipulated data
> 
> then database was dumped with pg_dump -E UTF8
> 
> then database was restored pg_restore on another cluster in database with UTF8 encoding
> 
> then application connect to new database with "set client encoding to win1250"
> 
> and - query failed
> 
> 
> How could invalid characters reach the database in this scenario?
> 
> And how to solve this? The error message is not very useful, because it does not provide any hint (at least
> column and row)

I can reproduce that, and I think it is a bug.

In Windows-1250, hex 96 is Unicode code point U+2013, an "en dash".

1) You enter this byte into a Latin 2 database with client_encoding WIN1250,
   and it gets stored as hex 96 in the database.

2) You dump this database with -E UTF8 and get hex C2 96 in the dump
   (the UTF-8 encoding of U+0096, a control character, not an en dash).

3) You restore this database to a new UTF8 database, the data end up
   as hex C2 96.

4) You query with client_encoding WIN1250 and get the error you quote.
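
The four steps above can be traced byte by byte in a short sketch. This is
only an illustration: it uses Python's cp1250 and iso8859_2 codecs as
stand-ins for PostgreSQL's WIN1250 and LATIN2 conversions, and it assumes,
as described in step 1, that the byte is stored unchanged.

```python
# Python codecs stand in for PostgreSQL's conversion routines (assumption).
raw = b"\x96"                            # byte sent by the WIN1250 client

intended = raw.decode("cp1250")          # what the client meant: U+2013 EN DASH
stored = raw                             # step 1: byte kept as-is in the LATIN2 database
as_latin2 = stored.decode("iso8859_2")   # read as Latin 2: U+0096, a control character
dumped = as_latin2.encode("utf-8")       # steps 2 and 3: the bytes C2 96

# Step 4: converting the restored data back to WIN1250 fails,
# because Windows-1250 has no byte for U+0096.
try:
    dumped.decode("utf-8").encode("cp1250")
    step4_fails = False
except UnicodeEncodeError:
    step4_fails = True

print(repr(intended), repr(as_latin2), dumped.hex(), step4_fails)
```

The en dash is lost in step 1 already; every later step just carries the
control character along faithfully.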

Now I think that the bug is in step 1).
Wikipedia says that hex 96 is undefined in Latin 2
(https://en.wikipedia.org/wiki/ISO/IEC_8859-2),
so instead of storing this byte, PostgreSQL should have complained that it
cannot be converted to Latin 2, since there is no "en dash" defined
in Latin 2.
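
To see which other bytes are in the same situation, here is a rough sketch
that lists the Windows-1250 bytes in the range 0x80 to 0x9F whose characters
have no Latin 2 equivalent. It relies on Python's cp1250 and iso8859_2
codecs, which are assumed here to match PostgreSQL's notion of the two
encodings; the variable name no_equivalent is made up for the example.

```python
# Collect Windows-1250 bytes (0x80..0x9F) that cannot be expressed in Latin 2.
no_equivalent = []
for byte in range(0x80, 0xA0):
    try:
        char = bytes([byte]).decode("cp1250")
    except UnicodeDecodeError:
        continue  # byte is unassigned in Windows-1250 itself
    try:
        char.encode("iso8859_2")
    except UnicodeEncodeError:
        no_equivalent.append(byte)  # character exists, but not in Latin 2

print([f"0x{b:02x}" for b in no_equivalent])
```

Hex 96 shows up in that list, while a byte such as hex 8A (S with caron)
does not, since that letter exists in both encodings.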

The bug seems to be in
backend/utils/mb/conversion_procs/latin2_and_win1250/latin2_and_win1250.c,
function win12502mic().
I think that the entries in win1250_2_iso88592 corresponding to undefined characters
should be 0x00 to produce an error.
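
As a toy model of that idea (not the actual PostgreSQL code), the sketch
below builds a byte lookup table in which 0x00 means "no equivalent" and
compares a pass-through table with a strict one. The names build_table,
convert and NO_EQUIVALENT are hypothetical, and the pass-through behaviour
is an assumption chosen to mimic what was observed in step 1.

```python
NO_EQUIVALENT = 0x00  # sentinel entry: the byte cannot be converted


def build_table(pass_through_undefined):
    """Build a WIN1250 -> LATIN2 table for bytes 0x80..0xFF (toy model)."""
    table = [NO_EQUIVALENT] * 128
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode("cp1250")
            table[byte - 0x80] = char.encode("iso8859_2")[0]
        except UnicodeError:
            # No Latin 2 equivalent: either copy the byte or leave the sentinel.
            if pass_through_undefined:
                table[byte - 0x80] = byte
    return table


def convert(data, table):
    """Convert bytes with the table, raising on sentinel entries."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)  # ASCII is identical in both encodings
            continue
        mapped = table[byte - 0x80]
        if mapped == NO_EQUIVALENT:
            raise ValueError(f"byte 0x{byte:02x} has no equivalent in LATIN2")
        out.append(mapped)
    return bytes(out)


lenient = build_table(pass_through_undefined=True)
strict = build_table(pass_through_undefined=False)
print(convert(b"\x96", lenient))   # the byte slips through unchanged
```

With the strict table, convert(b"\x96", strict) raises an error at insert
time, which is where the problem should have been reported.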

Yours,
Laurenz Albe
