Thread: Unicode problem again

Unicode problem again

From
Garry Saddington
Date:

I have the following error:

Postgres 8.3 via psycopg 1.1.21 and zope 2.10.

ProgrammingError Error Value: ERROR: character 0xe28099 of encoding "UTF8" has no equivalent in "LATIN1" select distinct teachers.teacherid,teachers.teacherid as thisteacherid,teachers.initials, reports.reporttext,reports.reportid,subjects.subjectid ,subjects.name, subjects.abbreviation,reports.academicyear,teachers.firstname as first,teachers.surname as second,classes.classid from classes, reports,teachers,subjects where reports.classid=classes.classid and reports.teacherid=teachers.teacherid and reports.studentid=4247 and classes.subjectid=subjects.subjectid and reports.classid=5626 and reports.teacherid=30

I have changed client_encoding to Latin1 to get over errors caused by having the database in UTF8 and users trying to enter special characters like £ signs.

Unfortunately, it seems there are already UTF8 encodings in the DB that have no equivalent in Latin1 from before the change.

How can I get over this problem, and still allow special characters, ie have no error reports.

Regards

Garry

Re: Unicode problem again

From
"Albe Laurenz"
Date:
Garry Saddington wrote:
> I have the following error:
>
> Postgres 8.3 via psycopg 1.1.21 and zope 2.10.
>
> ProgrammingError Error Value: ERROR: character 0xe28099 of encoding "UTF8" has no equivalent in "LATIN1" select
distinct 
[...]

This is UNICODE 0x2019, a "right single quotation mark".

This is a "Windows character" - the only non-UNICODE codepages I
know that contain this character are the Microsoft codepages.

Microsoft programs are known to automagically change ASCII
characters to characters like that, so a frequent source of
such characters is copy & paste from a Microsoft text processor.

> I have changed client_encoding to Latin1 to get over errors
> caused by having the database in UTF8 and users trying to
> enter special characters like £ signs.
>
> Unfortunately, it seems there are already UTF8 encodings in
> the DB that have no equivalent in Latin1 from before the change.
>
> How can I get over this problem, and still allow special
> characters, ie have no error reports.

If you want to allow *all* special characters, you will have to
use UNICODE (and a pretty comprehensive font).
You could check if all software that you use supports UNICODE.

By using LATIN1 (or any other non-UNICODE codepage) you allow
*some* special characters. In that case you should not allow all
characters into your database.
You'll have to check data at entry time.
If you are confident that you will never need any non-LATIN1
characters in your database, you could create the database
with LATIN1 encoding; that way there will be an error message at
data entry time.

If you know that all your data is from and for Windows, you could
also use encoding WIN1252 throughout.

Yours,
Laurenz Albe

Re: Unicode problem again

From
Michael Fuhr
Date:
On Tue, Jun 24, 2008 at 09:16:37AM +0200, Albe Laurenz wrote:
> Garry Saddington wrote:
> > ProgrammingError Error Value: ERROR: character 0xe28099 of
> > encoding "UTF8" has no equivalent in "LATIN1" select distinct
> [...]
>
> This is UNICODE 0x2019, a "right single quotation mark".
>
> This is a "Windows character" - the only non-UNICODE codepages I
> know that contain this character are the Microsoft codepages.
[...]
>
> > I have changed client_encoding to Latin1 to get over errors
> > caused by having the database in UTF8 and users trying to
> > enter special characters like £ signs.
> >
> > Unfortunately, it seems there are already UTF8 encodings in
> > the DB that have no equivalent in Latin1 from before the change.

Your input data seems to have a mix of encodings: sometimes you're
getting pound signs in a non-UTF-8 encoding, but if characters like
<U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when
client_encoding was set to UTF8 then at least some data must have
been in UTF-8.  If you're not certain that all data will be in the
same encoding then you might need to attempt to detect the encoding
and set client_encoding accordingly or convert the data to a common
encoding in the application before inserting it (I've had to do
this, sometimes on a line-by-line basis).

Setting client_encoding has implications for display as well as for
input: if the displaying application expects data in one encoding
but you give it data in a different encoding then non-ASCII characters
might not display correctly.

--
Michael Fuhr

Re: Unicode problem again

From
"Albe Laurenz"
Date:
Michael Fuhr wrote:
> > > ProgrammingError Error Value: ERROR: character 0xe28099 of
> > > encoding "UTF8" has no equivalent in "LATIN1" select distinct
> > [...]
> >
> > This is UNICODE 0x2019, a "right single quotation mark".
> >
> > This is a "Windows character" - the only non-UNICODE codepages I
> > know that contain this character are the Microsoft codepages.
> [...]
> >
> > > I have changed client_encoding to Latin1 to get over errors
> > > caused by having the database in UTF8 and users trying to
> > > enter special characters like £ signs.
> > >
> > > Unfortunately, it seems there are already UTF8 encodings in
> > > the DB that have no equivalent in Latin1 from before the change.
>
> Your input data seems to have a mix of encodings: sometimes you're
> getting pound signs in a non-UTF-8 encoding, but if characters like
> <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when
> client_encoding was set to UTF8 then at least some data must have
> been in UTF-8.

Sorry, but that's not true.
That character is 0x9s in WINDOWS-1252.

So it could have been that client_encoding was (correctly) set to WIN1252
and the quotation mark was entered as a single byte character.

Yours,
Laurenz Albe

Re: Unicode problem again

From
Michael Fuhr
Date:
On Thu, Jun 26, 2008 at 03:31:01PM +0200, Albe Laurenz wrote:
> Michael Fuhr wrote:
> > Your input data seems to have a mix of encodings: sometimes you're
> > getting pound signs in a non-UTF-8 encoding, but if characters like
> > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when
> > client_encoding was set to UTF8 then at least some data must have
> > been in UTF-8.
>
> Sorry, but that's not true.
> That character is 0x9s in WINDOWS-1252.

I think you mean 0x92.

> So it could have been that client_encoding was (correctly) set to WIN1252
> and the quotation mark was entered as a single byte character.

Yes, *if* client_encoding was set to win1252.  However, in the
following thread Garry said that he was getting encoding errors
when entering the pound sign that were resolved by changing
client_encoding (I suggested latin1, latin9, or win1252; he doesn't
say which he used):

http://archives.postgresql.org/pgsql-general/2008-06/msg00526.php

If client_encoding had been set to win1252 then Garry wouldn't have
gotten encoding errors when entering the pound sign because that
character is 0xa3 in win1252 (also in latin1 and latin9). So either
applications are setting client_encoding to different values,
sometimes correctly and sometimes incorrectly (Garry, do you know
if that could be happening?), or the data is sometimes in different
encodings.  If the data is being entered via a web application then
the latter seems more likely, at least in my experience (I've had
to deal with exactly this problem recently).

--
Michael Fuhr

Re: Unicode problem again

From
Garry Saddington
Date:
On Thursday 26 June 2008 15:41, Michael Fuhr wrote:
> On Thu, Jun 26, 2008 at 03:31:01PM +0200, Albe Laurenz wrote:
> > Michael Fuhr wrote:
> > > Your input data seems to have a mix of encodings: sometimes you're
> > > getting pound signs in a non-UTF-8 encoding, but if characters like
> > > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when
> > > client_encoding was set to UTF8 then at least some data must have
> > > been in UTF-8.
> >
> > Sorry, but that's not true.
> > That character is 0x9s in WINDOWS-1252.
>
> I think you mean 0x92.
>
> > So it could have been that client_encoding was (correctly) set to WIN1252
> > and the quotation mark was entered as a single byte character.
>
> Yes, *if* client_encoding was set to win1252.  However, in the
> following thread Garry said that he was getting encoding errors
> when entering the pound sign that were resolved by changing
> client_encoding (I suggested latin1, latin9, or win1252; he doesn't
> say which he used):
>
> http://archives.postgresql.org/pgsql-general/2008-06/msg00526.php
>
> If client_encoding had been set to win1252 then Garry wouldn't have
> gotten encoding errors when entering the pound sign because that
> character is 0xa3 in win1252 (also in latin1 and latin9). So either
> applications are setting client_encoding to different values,
> sometimes correctly and sometimes incorrectly (Garry, do you know
> if that could be happening?), or the data is sometimes in different
> encodings.  If the data is being entered via a web application
This is the case and so I need some way to tell the browser to send the
correct encoding - still researching.
regards
Garry


> then
> the latter seems more likely, at least in my experience (I've had
> to deal with exactly this problem recently).
>
> --
> Michael Fuhr