Thread: encoding question

encoding question

From
"Ben K."
Date:
Hi,

My main server is on Solaris (LANG=C), postgresql 8.1.0, with encoding
SQL_ASCII.

I'm trying to set up a backup server on linux (LANG=en_US.UTF-8), with
encoding UTF8.

I thought UTF8 was safe if all I have is ascii, since there was a relevant
discussion in this list a short while ago, but I'm getting this error, for
a few records (out of tens of thousands in that database).

===============
ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
CONTEXT:  COPY tblcoursesinfo, line 30, column title: "Seminar in<85>
..."
===============

The dump was created by 8.1.0's pg_dump with no options.

That line in the dump has the following contents.

Text:

n<85>       Knowledg

Binary:

02e2100: 6e85 2020 2020 2020 2020 2020 2020 2020  n.
02e2110: 2020 2020 2020 2020 2020 2020 2020 2020
02e2120: 2020 2020 2020 2020 2020 2020 2020 2020
02e2130: 2020 2020 2020 2020 2020 2020 2020 2020
02e2140: 2020 2020 2020 2020 2020 2020 2020 2020
02e2150: 2020 2020 2020 2020 2020 2020 2020 2020
02e2160: 2020 2020 2020 2020 2020 2020 2020 2020
02e2170: 2020 2020 2020 2009 4b6e 6f77 6c65 6467         .Knowledg

Where could this be coming from? Would appreciate help.


Regards,

Ben K.
Developer
http://benix.tamu.edu

Re: encoding question

From
Ivo Rossacher
Date:
Am Montag, 20. März 2006 23.56 schrieb Ben K.:
> Hi,
>
> My main server is on Solaris (LANG=C), postgresql 8.1.0, with encoding
> SQL_ASCII.

SQL_ASCII is not an encoding. It marks that there is no encoding and therefore
no check of the stored byte codes. This means that the client is fully
responsible for the correctness of the byte codes. If the client does no
checks on the byte codes it is possible for the user to enter non ascii
characters which will create trubles when sent to the unicode server.

You can find several different solutions for this kind of issues in the
archives. Which to use depends on the details of your problem.

Best Regards
Ivo

>
> I'm trying to set up a backup server on linux (LANG=en_US.UTF-8), with
> encoding UTF8.
>
> I thought UTF8 was safe if all I have is ascii, since there was a relevant
> discussion in this list a short while ago, but I'm getting this error, for
> a few records (out of tens of thousands in that database).
>
> ===============
> ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
> CONTEXT:  COPY tblcoursesinfo, line 30, column title: "Seminar in<85>
> ..."
> ===============
>
> The dump was created by 8.1.0's pg_dump with no options.
>
> That line in the dump has the following contents.
>
> Text:
>
> n<85>       Knowledg
>
> Binary:
>
> 02e2100: 6e85 2020 2020 2020 2020 2020 2020 2020  n.
> 02e2110: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2120: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2130: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2140: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2150: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2160: 2020 2020 2020 2020 2020 2020 2020 2020
> 02e2170: 2020 2020 2020 2009 4b6e 6f77 6c65 6467         .Knowledg
>
> Where could this be coming from? Would appreciate help.
>
>
> Regards,
>
> Ben K.
> Developer
> http://benix.tamu.edu
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

Re: encoding question

From
Tom Lane
Date:
"Ben K." <bkim@coe.tamu.edu> writes:
> I thought UTF8 was safe if all I have is ascii,

It is, but evidently what you have is not all ASCII.

> ERROR:  invalid UTF-8 byte sequence detected near byte 0x85

Looks to me like it might have been meant as LATIN1 or one of
the other single-byte ASCII-extension encodings.

If you think your data should be all ASCII, then you have some
data-cleaning to do.

            regards, tom lane

Re: encoding question

From
"Ben K."
Date:
>> ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
> Looks to me like it might have been meant as LATIN1 or one of
> the other single-byte ASCII-extension encodings.

Thanks. Indeed it has non-ascii and wouldn't be covered by SQL_ASCII, I
see now.

I never suspected there'd be non-ascii in the data since we do cleansing
before script-loading the data, but we use other input methods too, so am
not sure where they came from.

I didn't specify encoding when doing initdb when upgrading to 8.1.0, and
think it was where I could have prevented this problem, but I'm not sure.

I'm suspecting so because of this article (At least for locale C - since I
did not specify encoding and got UTF on linux with en_US.UTF-8). Is it
valid for 8.1.0?

http://www.commandprompt.com/ppbook/x17149
"ENCODING = encoding
...
If the ENCODING keyword is unspecified, PostgreSQL will create a
database using its default encoding. This is usually SQL_ASCII, though
it may have been set to a different default during the initial
configuration of PostgreSQL (see Chapter 2 for more on default
encoding)."

And I'm getting this from pgAdmin III. I guess this is the reason why
you all say avoid SQL_ASCII?

"Database encoding The database ... is created to store data using
the SQL_ASCII encoding. This encoding is defined for 7 bit characters
only; the meaning of characters with the 8th bit set (non-ASCII
characters 127-255) is not defined. Consequently, it is not possible for
the server to convert the data to other encodings. If you're storing
non-ASCII data in the database, you're strongly encouraged to use a
proper database encoding representing your locale character set to take
benefit from the automatic conversion to different client encodings when
needed. If you store non-ASCII data in an SQL_ASCII database, you may
encounter weird characters written to or read from the database, caused
by code conversion problems. This may cause you a lot of headache when
accessing the database using different client programs and drivers. For
most installations, Unicode (UTF8) encoding will provide the most
flexible capabilities."

Could anyone comment if the method in this url is valid and reasonably
safe? (At this time the problem seems almost harmless except for a few
records not being loaded, but it'll need to be fixed.)

http://archives.postgresql.org/pgsql-general/2004-02/msg01192.php

dump database, recode the dump, drop database, restore from recoded dump

Especially, any experience with recode vs. manual inspection ?

I'm just reasoning from pieces of information. I'd appreciate any advices
or experiences.



Regards,

Ben K.
Developer
http://benix.tamu.edu

Re: encoding question

From
"Ben K."
Date:
I just wanted to add that when I created the same database with -E
SQL_ASCII on my linux box, the dump was loaded fine. I created another
database without -E and observed the same invalid encoding problem.

On the face value this seems to solve the problem at least superficially.

I'd like to check the data validity, and the easiest way seems to be to
dump the data again from the linux box and compare with the original.

Is there a way to compare between any two databases online? (like running
a script checking row counts and schema) If I run crc on the concat of all
fields in a row, and if the crc matches, would it be reasonably
sufficient? Is there a stronger validation method?


Thanks.

Ben K.
Developer
http://benix.tamu.edu

Re: encoding question

From
Ivo Rossacher
Date:
Am Dienstag, 21. März 2006 21.14 schrieb Ben K.:
> I just wanted to add that when I created the same database with -E
> SQL_ASCII on my linux box, the dump was loaded fine. I created another
> database without -E and observed the same invalid encoding problem.

This is not really surprising since SQL_ASCII does not check the coding unlike
all other encodings.

>
> On the face value this seems to solve the problem at least superficially.

The more interesting question is, what is your application doing with the non
ASCII characters within your database. The answer to this question will tell
you what the correct contents would be.

>
> I'd like to check the data validity, and the easiest way seems to be to
> dump the data again from the linux box and compare with the original.

Your application defines what is valid. Even if you know that the dump would
be the same it would not tell you anything about the validity of the data. So
the better check would be to check with the application(s) connecting to both
servers and work with some records which do contain non ASCII characters.
If both servers do give the same results with your application(s) you most
possible got the coding right.

>
> Is there a way to compare between any two databases online? (like running
> a script checking row counts and schema) If I run crc on the concat of all
> fields in a row, and if the crc matches, would it be reasonably
> sufficient? Is there a stronger validation method?

Since any general method for comparing database contents (I don't know of such
a tool) would use it's own drivers and setup, it will probably not get the
same result as the test with your client applications.

The bottom line is that only a encoding set at the server level will make
clear what the meaning of non ASCII characters is. The server can then deal
with the conversion between the server and the client encoding so that the
different clients can work even with different internal encodings.
With SQL_ASCII only the client application knows. This kind of setup needs a
lot of care during setup to get consistent data, especially when several
different applications are used.
The drawback of selecting an encoding is a little performance penalty. However
in my databases I could not measure any difference. I have to say here that
my data does not have a lot of strings in. So it is definitly not a good test
case for this. Since there are several different clients with different
languages using my databases, I do use unicode as encoding. This works
without any problem for me.

Best regards
Ivo

>
>
> Thanks.
>
> Ben K.
> Developer
> http://benix.tamu.edu
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

Re: encoding question

From
"Aftab Alam"
Date:
unsubscribe

Regards,





-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Ben K.
Sent: Tuesday, March 21, 2006 10:09 PM
To: Tom Lane
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] encoding question


>> ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
> Looks to me like it might have been meant as LATIN1 or one of
> the other single-byte ASCII-extension encodings.

Thanks. Indeed it has non-ascii and wouldn't be covered by SQL_ASCII, I
see now.

I never suspected there'd be non-ascii in the data since we do cleansing
before script-loading the data, but we use other input methods too, so am
not sure where they came from.

I didn't specify encoding when doing initdb when upgrading to 8.1.0, and
think it was where I could have prevented this problem, but I'm not sure.

I'm suspecting so because of this article (At least for locale C - since I
did not specify encoding and got UTF on linux with en_US.UTF-8). Is it
valid for 8.1.0?

http://www.commandprompt.com/ppbook/x17149
"ENCODING = encoding
...
If the ENCODING keyword is unspecified, PostgreSQL will create a
database using its default encoding. This is usually SQL_ASCII, though
it may have been set to a different default during the initial
configuration of PostgreSQL (see Chapter 2 for more on default
encoding)."

And I'm getting this from pgAdmin III. I guess this is the reason why
you all say avoid SQL_ASCII?

"Database encoding The database ... is created to store data using
the SQL_ASCII encoding. This encoding is defined for 7 bit characters
only; the meaning of characters with the 8th bit set (non-ASCII
characters 127-255) is not defined. Consequently, it is not possible for
the server to convert the data to other encodings. If you're storing
non-ASCII data in the database, you're strongly encouraged to use a
proper database encoding representing your locale character set to take
benefit from the automatic conversion to different client encodings when
needed. If you store non-ASCII data in an SQL_ASCII database, you may
encounter weird characters written to or read from the database, caused
by code conversion problems. This may cause you a lot of headache when
accessing the database using different client programs and drivers. For
most installations, Unicode (UTF8) encoding will provide the most
flexible capabilities."

Could anyone comment if the method in this url is valid and reasonably
safe? (At this time the problem seems almost harmless except for a few
records not being loaded, but it'll need to be fixed.)

http://archives.postgresql.org/pgsql-general/2004-02/msg01192.php

dump database, recode the dump, drop database, restore from recoded dump

Especially, any experience with recode vs. manual inspection ?

I'm just reasoning from pieces of information. I'd appreciate any advices
or experiences.



Regards,

Ben K.
Developer
http://benix.tamu.edu

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: encoding question

From
"Aftab Alam"
Date:
unsubscribe

Regards,





-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org]On Behalf Of Ivo Rossacher
Sent: Wednesday, March 22, 2006 3:08 AM
To: pgsql-admin@postgresql.org
Cc: Ben K.
Subject: Re: [ADMIN] encoding question


Am Dienstag, 21. März 2006 21.14 schrieb Ben K.:
> I just wanted to add that when I created the same database with -E
> SQL_ASCII on my linux box, the dump was loaded fine. I created another
> database without -E and observed the same invalid encoding problem.

This is not really surprising since SQL_ASCII does not check the coding
unlike
all other encodings.

>
> On the face value this seems to solve the problem at least superficially.

The more interesting question is, what is your application doing with the
non
ASCII characters within your database. The answer to this question will tell
you what the correct contents would be.

>
> I'd like to check the data validity, and the easiest way seems to be to
> dump the data again from the linux box and compare with the original.

Your application defines what is valid. Even if you know that the dump would
be the same it would not tell you anything about the validity of the data.
So
the better check would be to check with the application(s) connecting to
both
servers and work with some records which do contain non ASCII characters.
If both servers do give the same results with your application(s) you most
possible got the coding right.

>
> Is there a way to compare between any two databases online? (like running
> a script checking row counts and schema) If I run crc on the concat of all
> fields in a row, and if the crc matches, would it be reasonably
> sufficient? Is there a stronger validation method?

Since any general method for comparing database contents (I don't know of
such
a tool) would use it's own drivers and setup, it will probably not get the
same result as the test with your client applications.

The bottom line is that only a encoding set at the server level will make
clear what the meaning of non ASCII characters is. The server can then deal
with the conversion between the server and the client encoding so that the
different clients can work even with different internal encodings.
With SQL_ASCII only the client application knows. This kind of setup needs a
lot of care during setup to get consistent data, especially when several
different applications are used.
The drawback of selecting an encoding is a little performance penalty.
However
in my databases I could not measure any difference. I have to say here that
my data does not have a lot of strings in. So it is definitly not a good
test
case for this. Since there are several different clients with different
languages using my databases, I do use unicode as encoding. This works
without any problem for me.

Best regards
Ivo

>
>
> Thanks.
>
> Ben K.
> Developer
> http://benix.tamu.edu
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend