Thread: BUG #4890: Allow insert character has no equivalent in "LATIN2"
The following bug has been logged online:

Bug reference:      4890
Logged by:          saint
Email address:      saint@akpa.pl
PostgreSQL version: 8.4 RC1
Operating system:   Windows XP
Description:        Allow insert character has no equivalent in "LATIN2"
Details:

Database encoding: "LATIN2"

  set client_encoding to 'UTF8';
  insert into public.test(col) values('‰');

  ERROR: character 0xe280b0 of encoding "UTF8" has no equivalent in "LATIN2"

The error is correct in this case, but if client_encoding is 'WIN1250':

  set client_encoding to 'WIN1250';
  insert into public.test(col) values('‰');

  Query returned successfully: 1 row affected, 0 ms execution time.
On Sat, 2009-06-27 at 19:20 +0000, saint wrote:
> set client_encoding to 'WIN1250';
> insert into public.test(col) values('‰');

You're lying to the server about the client encoding in one or both cases. I can't know which without knowing what program you're talking to the server with and how it's set up.

'SET client_encoding' tells the server what encoding to expect incoming data in. It doesn't change what encoding the client sends that data in. If your client has a different default encoding to that of the server, you can inform the server that the client will be sending differently encoded data.

In other words, you can't use 'SET client_encoding' to change what encoding the client uses, only how the server interprets the bytes it gets from the client.

--
Craig Ringer
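The distinction between the bytes a client sends and the encoding the server is told to expect can be illustrated without a database. What follows is only a minimal sketch: it assumes Python's built-in "cp1250" codec is a fair stand-in for the server's WIN1250 interpretation, and the helper name misread is hypothetical.

```python
# Sketch only: Python's "cp1250" codec stands in for the server's view of
# WIN1250. The helper name `misread` is hypothetical.

def misread(text, actual_encoding, declared_encoding):
    """Encode text the way the client really does, then decode the bytes
    the way a server would if it were told a different client_encoding."""
    raw = text.encode(actual_encoding)
    return raw, raw.decode(declared_encoding)

# U+2030 PER MILLE SIGN, sent by a UTF-8 client that declared WIN1250.
sent, seen = misread("\u2030", "utf-8", "cp1250")

print(sent.hex())                    # e280b0
print([hex(ord(c)) for c in seen])   # ['0xe2', '0x20ac', '0xb0']
```

One character on the client side turns into three on the server side (a-circumflex, euro sign, degree sign), which is the kind of mismatch that SET client_encoding alone cannot prevent.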
(Please reply to the list, not just to me)

I'm not sure about this so far.

Regarding the specific issue you mention, conversion between cp1250 and latin-2 (ISO-8859-2): the Unicode tables at:

  http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT

appear to agree - there's no PER MILLE in ISO-8859-2.

With a UTF-8 database, Pg correctly doesn't accept PER MILLE as a valid ISO-8859-2 char:

  -- Connecting with unicode (utf-8) client
  CREATE TABLE test (x text);
  INSERT INTO test(x) VALUES ('‰');
  SET client_encoding='iso-8859-2';
  SELECT * from test;
  ERROR: character 0xe280b0 of encoding "UTF8" has no equivalent in "LATIN2"

If the encoding is set to WIN1250, Pg outputs the appropriate byte. So it's doing the right thing in each individual case where a UTF-8 DB is concerned.

Your problem, though, is that if you connect to a LATIN2 database with a WIN1250 client and INSERT a string containing the per-mille glyph, Pg accepts it and it should not. If it does, indeed, accept it, then I agree that's a bug.

I haven't tested with a LATIN2 database as I'd have to re-initdb and the machine I'm working on has semi-useful databases on it. What you're saying makes sense, though, presuming your client really is sending win1250 per-mille (byte 0x89).

I'd still like to know how you're setting your client encoding. You can't just run "SET client_encoding='win1250'" - you must tell the client program, or the terminal it runs in, to use the appropriate encoding as well. Otherwise when you paste the per-mille character you'll see the right glyph, but the server will interpret the bytes it receives as characters in the encoding you specified.

So, if you're using a utf-8 terminal, that means that the terminal will send 0xe2 0x80 0xb0 for per-mille, which when interpreted as win1250 becomes ‰ , so that's what the server thinks you sent it.

In that case, though, you'd find that the euro symbol, which isn't defined in latin-2, will cause an error:

  ERROR: character 0xe282ac of encoding "UTF8" has no equivalent in "LATIN2"

--
Craig Ringer
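The claims about which characters exist in which encoding can be cross-checked outside PostgreSQL. This is a rough sketch that assumes Python's "cp1250" and "iso8859_2" codecs follow the same Unicode mapping tables referenced above; the helper has_equivalent is a hypothetical name, and nothing here exercises the server's own conversion code.

```python
# Sketch only: Python codecs are used as a reference for the mappings;
# the server's own conversion tables are not involved.

PER_MILLE = "\u2030"   # U+2030 PER MILLE SIGN
EURO = "\u20ac"        # U+20AC EURO SIGN

def has_equivalent(ch, encoding):
    """Return True if the character can be represented in the encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(PER_MILLE.encode("cp1250"))             # b'\x89'
print(has_equivalent(PER_MILLE, "iso8859_2")) # False
print(EURO.encode("cp1250"))                  # b'\x80'
print(has_equivalent(EURO, "iso8859_2"))      # False
```

Under that assumption, a correct WIN1250 to LATIN2 conversion would have to reject both byte 0x89 and byte 0x80.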
Craig Ringer <craig@postnewspapers.com.au> writes:
> Your problem, though, is that if you connect to a LATIN2 database with a
> WIN1250 client and INSERT a string containing the per-mille glyph, Pg
> accepts it and it should not. If it does, indeed, accept it, then I
> agree that's a bug.

The table in win12502mic() in latin2_and_win1250.c just translates 0x89 to 0x89, which is wrong according to your comments (it should have a zero entry for characters with no LATIN2 equivalent). The table looks to have quite a few one-to-one conversions, so I am wondering whether this is the only bug in it. Anyone want to go through the rest of it?

regards, tom lane
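One way to go through the rest of the table is to generate a reference list of WIN1250 bytes that have no LATIN2 counterpart and compare it by hand against the entries in latin2_and_win1250.c. The sketch below assumes Python's "cp1250" and "iso8859_2" codecs match the Unicode mapping files; the function name is hypothetical, and the script does not read or parse the PostgreSQL source.

```python
import unicodedata

# Sketch only: Python's codecs serve as the reference mapping. The output is
# meant for manual comparison with the C table, which is not parsed here.

def win1250_bytes_without_latin2():
    """List (byte, character) pairs for high WIN1250 bytes whose character
    cannot be encoded in ISO-8859-2."""
    missing = []
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode("cp1250")
        except UnicodeDecodeError:
            continue  # byte is unassigned in WIN1250 itself
        try:
            ch.encode("iso8859_2")
        except UnicodeEncodeError:
            missing.append((byte, ch))
    return missing

for byte, ch in win1250_bytes_without_latin2():
    name = unicodedata.name(ch, "<unnamed>")
    print(f"0x{byte:02x}  U+{ord(ch):04X}  {name}")
```

Each byte printed is one where, by this reference, the conversion table should hold a zero entry rather than a one-to-one mapping.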
On 13-07-2009 at 19:58:50, Craig Ringer <craig@postnewspapers.com.au> wrote:

> (Please reply to the list, not just to me)
>

I'm sorry.

> I'd still like to know how you're setting your client encoding. You
> can't just run "SET client_encoding='win1250'" - you must tell the
> client program, or the terminal it runs in, to use the appropriate
> encoding as well. Otherwise when you paste the per-mille character
> you'll see the right glyph, but the server will interpret the bytes it
> receives as characters in the encoding you specified.
>
> So, if you're using a utf-8 terminal, that means that the terminal will
> send 0xe2 0x80 0xb0 for per-mille, which when interpreted as win1250
> becomes ‰ , so that's what the server thinks you sent it.
>
> In that case, though, you'd find that the euro symbol, which isn't
> defined in latin-2, will cause an error:
>
> ERROR: character 0xe282ac of encoding "UTF8" has no equivalent in
> "LATIN2"
>

I didn't use a terminal. I used pgAdmin III for the first test, and you're right, this was not a good idea, but somehow the validation of invalid characters works properly. For later tests I used a python-psycopg2 script.

--
Robert Świętochowski
__
AKPA Polska Press Sp. z o.o., 02-221 Warszawa, ul. Zbąszyńska 5, NIP 521-10-00-270
District Court for the Capital City of Warsaw, 13th Commercial Division, KRS 0000023945; share capital PLN 516,900.00 (fully paid up).