Thread: Unicode problem again
I have the following error:
Postgres 8.3 via psycopg 1.1.21 and zope 2.10.
ProgrammingError Error Value: ERROR: character 0xe28099 of encoding "UTF8" has no equivalent in "LATIN1" select distinct teachers.teacherid,teachers.teacherid as thisteacherid,teachers.initials, reports.reporttext,reports.reportid,subjects.subjectid ,subjects.name, subjects.abbreviation,reports.academicyear,teachers.firstname as first,teachers.surname as second,classes.classid from classes, reports,teachers,subjects where reports.classid=classes.classid and reports.teacherid=teachers.teacherid and reports.studentid=4247 and classes.subjectid=subjects.subjectid and reports.classid=5626 and reports.teacherid=30
I have changed client_encoding to Latin1 to get over errors caused by having the database in UTF8 and users trying to enter special characters like £ signs.
Unfortunately, it seems there are already UTF8 encodings in the DB that have no equivalent in Latin1 from before the change.
How can I get over this problem, and still allow special characters, ie have no error reports.
Regards
Garry
Garry Saddington wrote: > I have the following error: > > Postgres 8.3 via psycopg 1.1.21 and zope 2.10. > > ProgrammingError Error Value: ERROR: character 0xe28099 of encoding "UTF8" has no equivalent in "LATIN1" select distinct [...] This is UNICODE 0x2019, a "right single quotation mark". This is a "Windows character" - the only non-UNICODE codepages I know that contain this character are the Microsoft codepages. Microsoft programs are known to automagically change ASCII characters to characters like that, so a frequent source of such characters is copy & paste from a Microsoft text processor. > I have changed client_encoding to Latin1 to get over errors > caused by having the database in UTF8 and users trying to > enter special characters like £ signs. > > Unfortunately, it seems there are already UTF8 encodings in > the DB that have no equivalent in Latin1 from before the change. > > How can I get over this problem, and still allow special > characters, ie have no error reports. If you want to allow *all* special characters, you will have to use UNICODE (and a pretty comprehensive font). You could check if all software that you use supports UNICODE. By using LATIN1 (or any other non-UNICODE codepage) you allow *some* special characters. In that case you should not allow all characters into your database. You'll have to check data at entry time. If you are confident that you will never need any non-LATIN1 characters in your database, you could create the database with LATIN1 encoding; that way there will be an error message at data entry time. If you know that all your data is from and for Windows, you could also use encoding WIN1252 throughout. Yours, Laurenz Albe
On Tue, Jun 24, 2008 at 09:16:37AM +0200, Albe Laurenz wrote: > Garry Saddington wrote: > > ProgrammingError Error Value: ERROR: character 0xe28099 of > > encoding "UTF8" has no equivalent in "LATIN1" select distinct > [...] > > This is UNICODE 0x2019, a "right single quotation mark". > > This is a "Windows character" - the only non-UNICODE codepages I > know that contain this character are the Microsoft codepages. [...] > > > I have changed client_encoding to Latin1 to get over errors > > caused by having the database in UTF8 and users trying to > > enter special characters like £ signs. > > > > Unfortunately, it seems there are already UTF8 encodings in > > the DB that have no equivalent in Latin1 from before the change. Your input data seems to have a mix of encodings: sometimes you're getting pound signs in a non-UTF-8 encoding, but if characters like <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when client_encoding was set to UTF8 then at least some data must have been in UTF-8. If you're not certain that all data will be in the same encoding then you might need to attempt to detect the encoding and set client_encoding accordingly or convert the data to a common encoding in the application before inserting it (I've had to do this, sometimes on a line-by-line basis). Setting client_encoding has implications for display as well as for input: if the displaying application expects data in one encoding but you give it data in a different encoding then non-ASCII characters might not display correctly. -- Michael Fuhr
Michael Fuhr wrote: > > > ProgrammingError Error Value: ERROR: character 0xe28099 of > > > encoding "UTF8" has no equivalent in "LATIN1" select distinct > > [...] > > > > This is UNICODE 0x2019, a "right single quotation mark". > > > > This is a "Windows character" - the only non-UNICODE codepages I > > know that contain this character are the Microsoft codepages. > [...] > > > > > I have changed client_encoding to Latin1 to get over errors > > > caused by having the database in UTF8 and users trying to > > > enter special characters like £ signs. > > > > > > Unfortunately, it seems there are already UTF8 encodings in > > > the DB that have no equivalent in Latin1 from before the change. > > Your input data seems to have a mix of encodings: sometimes you're > getting pound signs in a non-UTF-8 encoding, but if characters like > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when > client_encoding was set to UTF8 then at least some data must have > been in UTF-8. Sorry, but that's not true. That character is 0x9s in WINDOWS-1252. So it could have been that client_encoding was (correctly) set to WIN1252 and the quotation mark was entered as a single byte character. Yours, Laurenz Albe
On Thu, Jun 26, 2008 at 03:31:01PM +0200, Albe Laurenz wrote: > Michael Fuhr wrote: > > Your input data seems to have a mix of encodings: sometimes you're > > getting pound signs in a non-UTF-8 encoding, but if characters like > > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when > > client_encoding was set to UTF8 then at least some data must have > > been in UTF-8. > > Sorry, but that's not true. > That character is 0x9s in WINDOWS-1252. I think you mean 0x92. > So it could have been that client_encoding was (correctly) set to WIN1252 > and the quotation mark was entered as a single byte character. Yes, *if* client_encoding was set to win1252. However, in the following thread Garry said that he was getting encoding errors when entering the pound sign that were resolved by changing client_encoding (I suggested latin1, latin9, or win1252; he doesn't say which he used): http://archives.postgresql.org/pgsql-general/2008-06/msg00526.php If client_encoding had been set to win1252 then Garry wouldn't have gotten encoding errors when entering the pound sign because that character is 0xa3 in win1252 (also in latin1 and latin9). So either applications are setting client_encoding to different values, sometimes correctly and sometimes incorrectly (Garry, do you know if that could be happening?), or the data is sometimes in different encodings. If the data is being entered via a web application then the latter seems more likely, at least in my experience (I've had to deal with exactly this problem recently). -- Michael Fuhr
On Thursday 26 June 2008 15:41, Michael Fuhr wrote: > On Thu, Jun 26, 2008 at 03:31:01PM +0200, Albe Laurenz wrote: > > Michael Fuhr wrote: > > > Your input data seems to have a mix of encodings: sometimes you're > > > getting pound signs in a non-UTF-8 encoding, but if characters like > > > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when > > > client_encoding was set to UTF8 then at least some data must have > > > been in UTF-8. > > > > Sorry, but that's not true. > > That character is 0x9s in WINDOWS-1252. > > I think you mean 0x92. > > > So it could have been that client_encoding was (correctly) set to WIN1252 > > and the quotation mark was entered as a single byte character. > > Yes, *if* client_encoding was set to win1252. However, in the > following thread Garry said that he was getting encoding errors > when entering the pound sign that were resolved by changing > client_encoding (I suggested latin1, latin9, or win1252; he doesn't > say which he used): > > http://archives.postgresql.org/pgsql-general/2008-06/msg00526.php > > If client_encoding had been set to win1252 then Garry wouldn't have > gotten encoding errors when entering the pound sign because that > character is 0xa3 in win1252 (also in latin1 and latin9). So either > applications are setting client_encoding to different values, > sometimes correctly and sometimes incorrectly (Garry, do you know > if that could be happening?), or the data is sometimes in different > encodings. If the data is being entered via a web application This is the case and so I need some way to tell the browser to send the correct encoding - still researching. regards Garry > then > the latter seems more likely, at least in my experience (I've had > to deal with exactly this problem recently). > > -- > Michael Fuhr