Thread: BUG #4890: Allow insert character has no equivalent in "LATIN2"

BUG #4890: Allow insert character has no equivalent in "LATIN2"

From
"saint"
Date:
The following bug has been logged online:

Bug reference:      4890
Logged by:          saint
Email address:      saint@akpa.pl
PostgreSQL version: 8.4 RC1
Operating system:   Windows XP
Description:        Allow insert character has no equivalent in "LATIN2"
Details:

Database encoding: "LATIN2"

set client_encoding to 'UTF8';
insert into public.test(col) values('‰');

ERROR: character 0xe280b0 of encoding "UTF8" has no equivalent in "LATIN2"

That error is correct. However, if client_encoding is 'WIN1250', the same insert succeeds:

set client_encoding to 'WIN1250';
insert into public.test(col) values('‰');

Query returned successfully: 1 row affected, 0 ms execution time.

Re: BUG #4890: Allow insert character has no equivalent in "LATIN2"

From
Craig Ringer
Date:
On Sat, 2009-06-27 at 19:20 +0000, saint wrote:

> set client_encoding to 'WIN1250';
> insert into public.test(col) values('‰');

You're lying to the server about the client encoding in one or both
cases. I can't know which without knowing what program you're talking to
the server with and how it's set up.

'SET client_encoding' tells the server what encoding to expect incoming
data in. It doesn't change what encoding the client sends that data in.
If your client has a different default encoding to that of the server,
you can inform the server that the client will be sending differently
encoded data.

In other words, you can't use 'SET client_encoding' to change what
encoding the client uses, only how the server interprets the bytes it
gets from the client.
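
To make the point above concrete, here is a minimal sketch in Python. It uses Python's built-in codecs as a stand-in for the server's conversion routines (an assumption for illustration, not PostgreSQL code), and the variable names are hypothetical. The same byte on the wire yields a different character depending only on which encoding is declared:

```python
# Illustration only: Python's codecs stand in for the server's conversion
# routines; this is not PostgreSQL code.

# The client really encodes its text as ISO-8859-2 (LATIN2).
wire_bytes = "\u0105".encode("iso8859_2")   # LATIN SMALL LETTER A WITH OGONEK

# Declaring the true encoding recovers the intended character.
declared_latin2 = wire_bytes.decode("iso8859_2")

# Declaring WIN1250 instead does not change the bytes that were sent;
# it only changes how the receiver reads them.
declared_win1250 = wire_bytes.decode("cp1250")

print(wire_bytes, repr(declared_latin2), repr(declared_win1250))
```

The bytes never change; only the interpretation does, which is why a mismatched 'SET client_encoding' silently stores the wrong character.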


--
Craig Ringer

Re: BUG #4890: Allow insert character has no equivalent in "LATIN2"

From
Craig Ringer
Date:
(Please reply to the list, not just to me)

I'm not sure about this so far. Re the specific issue you mention of
conversion between cp1250 and latin-2 (ISO-8859-2) the Unicode tables
at:

  http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT

appear to agree - there's no PER MILLE in ISO-8859-2.
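
The same check can be sketched in Python, assuming its "iso8859_2" and "cp1250" codecs follow the Unicode mapping tables (they are an independent implementation, not PostgreSQL's tables). The helper name has_equivalent is hypothetical:

```python
# Sketch only: relies on Python's codec tables, not on PostgreSQL's.

def has_equivalent(ch, encoding):
    """Return True if `ch` can be represented in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

per_mille = "\u2030"  # PER MILLE SIGN

in_latin2 = has_equivalent(per_mille, "iso8859_2")
in_win1250 = has_equivalent(per_mille, "cp1250")

print("ISO-8859-2:", in_latin2)
print("WIN1250:   ", in_win1250, per_mille.encode("cp1250"))
```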

With a UTF-8 database, Pg correctly doesn't accept PER MILLE as a valid
ISO-8859-2 char:

-- Connecting with unicode (utf-8) client
CREATE TABLE test (x text);
INSERT INTO test(x) VALUES ('‰');

SET client_encoding='iso-8859-2';
SELECT * from test;
ERROR:  character 0xe280b0 of encoding "UTF8" has no equivalent in
"LATIN2"

If the encoding is set to WIN1250 Pg outputs the appropriate byte. So
it's doing the right thing in each individual case where a UTF-8 DB is
concerned.

Your problem, though, is that if you connect to a LATIN2 database with a
WIN1250 client and INSERT a string containing the per-mille glyph, Pg
accepts it and it should not. If it does, indeed, accept it, then I
agree that's a bug.

I haven't tested with a LATIN2 database as I'd have to re-initdb and the
machine I'm working on has semi-useful databases on it. What you're
saying makes sense, though, presuming your client really is sending
win1250 per-mille (byte 0x89).


I'd still like to know how you're setting your client encoding. You
can't just run "SET client_encoding='win1250'" - you must tell the
client program, or the terminal it runs in, to use the appropriate
encoding as well. Otherwise when you paste the per-mille character
you'll see the right glyph, but the SERVER will interpret the bytes it
receives as characters in the encoding you specified.

So, if you're using a utf-8 terminal, that means that the terminal will
send 0xe2 0x80 0xb0 for per-mille, which when interpreted as win1250
becomes â€°, so that's what the server thinks you sent it.

In that case, though, you'd find that the euro symbol, which isn't
defined in latin-2, will cause an error:

ERROR:  character 0xe282ac of encoding "UTF8" has no equivalent in
"LATIN2"
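
The mismatch scenario just described can be reproduced with a short Python sketch. Python's codecs are assumed here as an approximation of the conversions involved; the helper missing_in_latin2 is hypothetical and not part of any library:

```python
# Sketch only: Python codecs approximate the conversions discussed above;
# PostgreSQL uses its own conversion tables.
PER_MILLE = "\u2030"

# A UTF-8 terminal sends three bytes for the per-mille sign.
wire_bytes = PER_MILLE.encode("utf-8")

# With client_encoding declared as WIN1250, each byte is read as one character.
seen_by_server = wire_bytes.decode("cp1250")

def missing_in_latin2(text):
    """Return the characters of `text` that ISO-8859-2 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso8859_2")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

rejected = missing_in_latin2(seen_by_server)

print([f"U+{ord(ch):04X}" for ch in seen_by_server])
print([f"U+{ord(ch):04X}" for ch in rejected])
```

Of the three characters the server believes it received, only the euro sign has no LATIN2 equivalent, which matches the error shown above.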




--
Craig Ringer

Re: BUG #4890: Allow insert character has no equivalent in "LATIN2"

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Your problem, though, is that if you connect to a LATIN2 database with a
> WIN1250 client and INSERT a string containing the per-mille glyph, Pg
> accepts it and it should not. If it does, indeed, accept it, then I
> agree that's a bug.

The table in win12502mic() in latin2_and_win1250.c just translates
0x89 to 0x89, which is wrong according to your comments (it should
have a zero entry for characters with no LATIN2 equivalent).

The table looks to have quite a few one-to-one conversions, so I am
wondering whether this is the only bug in it.  Anyone want to go through
the rest of it?
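
One way to start such an audit, sketched in Python: list every high byte of WIN1250 whose character has no ISO-8859-2 equivalent, so those are the entries one would expect to be zero in the table. This assumes Python's "cp1250" and "iso8859_2" codecs as the reference; it does not read latin2_and_win1250.c, and the function name is hypothetical.

```python
# Sketch only: uses Python's codec tables as the reference mapping.
# It does not inspect the PostgreSQL source.

def win1250_bytes_without_latin2_equivalent():
    """Map each WIN1250 byte (0x80-0xFF) lacking a LATIN2 equivalent to its character."""
    missing = {}
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode("cp1250")
        except UnicodeDecodeError:
            continue  # byte is unassigned in Python's cp1250 table
        try:
            ch.encode("iso8859_2")
        except UnicodeEncodeError:
            missing[byte] = ch
    return missing

missing = win1250_bytes_without_latin2_equivalent()
for byte, ch in sorted(missing.items()):
    print(f"0x{byte:02x}  U+{ord(ch):04X}")
```

Each byte printed could then be compared by hand against the corresponding table entry.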

            regards, tom lane

Re: BUG #4890: Allow insert character has no equivalent in "LATIN2"

From
Robert Świętochowski
Date:
On 13-07-2009 at 19:58:50, Craig Ringer <craig@postnewspapers.com.au>
wrote:

> (Please reply to the list, not just to me)
>

I'm sorry.

> I'd still like to know how you're setting your client encoding. You
> can't just run "SET client_encoding='win1250'" - you must tell the
> client program, or the terminal it runs in, to use the appropriate
> encoding as well. Otherwise when you paste the per-mille character
> you'll see the right glyph, but the SERVER will interpret the bytes it
> receives as characters in the encoding you specified.
>
> So, if you're using a utf-8 terminal, that means that the terminal will
> send 0xe2 0x80 0xb0 for per-mille, which when interpreted as win1250
> becomes â€°, so that's what the server thinks you sent it.
>
> In that case, though, you'd find that the euro symbol, which isn't
> defined in latin-2, will cause an error:
>
> ERROR:  character 0xe282ac of encoding "UTF8" has no equivalent in
> "LATIN2"
>

I didn't use a terminal; I used pgAdmin III for the first test.
You're right, that was not a good idea, but somehow the
validation of invalid characters worked properly.
For later tests I used a python-psycopg2 script.

--
Robert Świętochowski



