Thread: encoding question
Hi,

My main server is on Solaris (LANG=C), PostgreSQL 8.1.0, with encoding SQL_ASCII. I'm trying to set up a backup server on Linux (LANG=en_US.UTF-8), with encoding UTF8.

I thought UTF8 was safe if all I have is ASCII, since there was a relevant discussion on this list a short while ago, but I'm getting this error for a few records (out of tens of thousands in that database):

===============
ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
CONTEXT:  COPY tblcoursesinfo, line 30, column title: "Seminar in<85> ..."
===============

The dump was created by 8.1.0's pg_dump with no options. That line in the dump has the following contents.

Text:

n<85>    Knowledg

Binary:

02e2100: 6e85 2020 2020 2020 2020 2020 2020 2020  n.
02e2110: 2020 2020 2020 2020 2020 2020 2020 2020
02e2120: 2020 2020 2020 2020 2020 2020 2020 2020
02e2130: 2020 2020 2020 2020 2020 2020 2020 2020
02e2140: 2020 2020 2020 2020 2020 2020 2020 2020
02e2150: 2020 2020 2020 2020 2020 2020 2020 2020
02e2160: 2020 2020 2020 2020 2020 2020 2020 2020
02e2170: 2020 2020 2020 2009 4b6e 6f77 6c65 6467  .Knowledg

Where could this be coming from? I would appreciate any help.

Regards,

Ben K.
Developer
http://benix.tamu.edu
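The stray bytes can be located in the dump before restoring it; here is a minimal sketch using standard Unix tools (dump_sample.txt is a simulated file standing in for the real dump):

```shell
# Simulate a dump line containing the stray 0x85 byte:
printf 'Seminar in\x85\tKnowledg\n' > dump_sample.txt

# List line numbers containing any non-ASCII byte (0x80-0xFF).
# LC_ALL=C makes the bracketed byte range behave predictably:
LC_ALL=C grep -n "$(printf '[\x80-\xff]')" dump_sample.txt

# Inspect the exact bytes around a hit:
od -c dump_sample.txt
```

Running this against the real dump file gives the line numbers of every record that would fail the UTF8 check on restore.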
On Monday, 20 March 2006 at 23:56, Ben K. wrote:
> Hi,
>
> My main server is on Solaris (LANG=C), postgresql 8.1.0, with encoding
> SQL_ASCII.

SQL_ASCII is not really an encoding. It marks that there is no encoding, and therefore no check of the stored byte codes. This means the client is fully responsible for the correctness of those bytes. If the client does no checks on them, it is possible for users to enter non-ASCII characters, which will create trouble when the data is sent to a Unicode server.

You can find several different solutions for this kind of issue in the archives. Which to use depends on the details of your problem.

Best regards
Ivo

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
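The validation a UTF8 server performs (and SQL_ASCII skips) can be reproduced client-side with iconv; a minimal sketch, using a simulated field file:

```shell
# A field value containing the stray byte, as stored unchecked under SQL_ASCII:
printf 'Seminar in\x85 Knowledg' > field.bin

# A UTF8 server applies essentially the same validation iconv performs here;
# an SQL_ASCII server would accept the bytes without looking at them:
if iconv -f UTF-8 -t UTF-8 field.bin > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"   # 0x85 cannot appear outside a multibyte sequence
fi
# prints "invalid UTF-8"
```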
"Ben K." <bkim@coe.tamu.edu> writes: > I thought UTF8 was safe if all I have is ascii, It is, but evidently what you have is not all ASCII. > ERROR: invalid UTF-8 byte sequence detected near byte 0x85 Looks to me like it might have been meant as LATIN1 or one of the other single-byte ASCII-extension encodings. If you think your data should be all ASCII, then you have some data-cleaning to do. regards, tom lane
>> ERROR: invalid UTF-8 byte sequence detected near byte 0x85
> Looks to me like it might have been meant as LATIN1 or one of
> the other single-byte ASCII-extension encodings.

Thanks. Indeed it has non-ASCII data that wouldn't be caught by SQL_ASCII, I see now. I never suspected there'd be non-ASCII in the data, since we do cleansing before script-loading it, but we use other input methods too, so I'm not sure where it came from.

I didn't specify an encoding when running initdb while upgrading to 8.1.0, and I think that is where I could have prevented this problem, but I'm not sure. I suspect so because of this article (at least for locale C, since I did not specify an encoding and got UTF8 on Linux with en_US.UTF-8). Is it valid for 8.1.0?

http://www.commandprompt.com/ppbook/x17149

"ENCODING = encoding ... If the ENCODING keyword is unspecified, PostgreSQL will create a database using its default encoding. This is usually SQL_ASCII, though it may have been set to a different default during the initial configuration of PostgreSQL (see Chapter 2 for more on default encoding)."

And I'm getting this from pgAdmin III. I guess this is the reason you all say to avoid SQL_ASCII?

"Database encoding: The database ... is created to store data using the SQL_ASCII encoding. This encoding is defined for 7-bit characters only; the meaning of characters with the 8th bit set (non-ASCII characters 127-255) is not defined. Consequently, it is not possible for the server to convert the data to other encodings. If you're storing non-ASCII data in the database, you're strongly encouraged to use a proper database encoding representing your locale character set, to benefit from the automatic conversion to different client encodings when needed. If you store non-ASCII data in an SQL_ASCII database, you may encounter weird characters written to or read from the database, caused by code conversion problems. This may cause you a lot of headache when accessing the database using different client programs and drivers. For most installations, Unicode (UTF8) encoding will provide the most flexible capabilities."

Could anyone comment on whether the method in this URL is valid and reasonably safe? (At this time the problem seems almost harmless, apart from a few records not being loaded, but it will need to be fixed.)

http://archives.postgresql.org/pgsql-general/2004-02/msg01192.php

dump database, recode the dump, drop database, restore from recoded dump

In particular, does anyone have experience with recode vs. manual inspection? I'm just reasoning from pieces of information, and I'd appreciate any advice or experiences.

Regards,

Ben K.
Developer
http://benix.tamu.edu
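The dump-recode-restore method from that URL can be sketched as follows. The recode step is shown runnably on a simulated dump; the pg_dump/dropdb/createdb/psql steps are shown as comments since they need a live server, and the filenames and database name are hypothetical. Note the recode is only safe if every non-ASCII byte really is WINDOWS-1252 (or whatever single encoding you pick):

```shell
# Step 1 (on the old server, not executed here):
#   pg_dump mydb > olddb.dump
# Simulated dump containing one stray 0x85 byte:
printf 'COPY tblcoursesinfo (title) FROM stdin;\nSeminar in\x85\tKnowledg\n\\.\n' > olddb.dump

# Step 2: recode, assuming the stray bytes are WINDOWS-1252:
iconv -f WINDOWS-1252 -t UTF-8 olddb.dump > olddb.utf8.dump

# Step 3: verify the recoded dump is clean UTF-8 (exit status 0 means valid):
iconv -f UTF-8 -t UTF-8 olddb.utf8.dump > /dev/null && echo "dump is valid UTF-8"

# Step 4 (on the new server, not executed here):
#   dropdb mydb && createdb -E UTF8 mydb && psql mydb < olddb.utf8.dump
```

The verification in step 3 is the part that makes this reasonably safe: if any byte survives that is still not valid UTF-8, iconv fails before you drop anything.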
I just wanted to add that when I created the same database with -E SQL_ASCII on my Linux box, the dump loaded fine. I created another database without -E and observed the same invalid-encoding problem. On the face of it this seems to solve the problem, at least superficially.

I'd like to check the data's validity, and the easiest way seems to be to dump the data again from the Linux box and compare it with the original.

Is there a way to compare any two databases online? (Like running a script that checks row counts and schema.) If I run a CRC over the concatenation of all fields in a row, and the CRCs match, would that be reasonably sufficient? Is there a stronger validation method?

Thanks.

Ben K.
Developer
http://benix.tamu.edu
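The per-row checksum idea can be approximated at the file level: dump both databases, sort the dumps (COPY row order is not guaranteed to match between servers), and compare hashes. A minimal sketch with simulated dump files (the pg_dump command and filenames are illustrative):

```shell
# On each server (not executed here):
#   pg_dump -a mydb > server_a.dump
# Simulated dumps: the same rows in a different physical order.
printf 'row1\trowdata A\nrow2\trowdata B\n' > server_a.dump
printf 'row2\trowdata B\nrow1\trowdata A\n' > server_b.dump

# Order-insensitive comparison: sort each dump, then hash it.
a=$(sort server_a.dump | md5sum | cut -d' ' -f1)
b=$(sort server_b.dump | md5sum | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "contents match"
# prints "contents match"
```

A matching hash over all rows is at least as strong as a per-row CRC match; a mismatch can then be narrowed down with diff on the sorted dumps.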
On Tuesday, 21 March 2006 at 21:14, Ben K. wrote:
> I just wanted to add that when I created the same database with -E
> SQL_ASCII on my linux box, the dump was loaded fine. I created another
> database without -E and observed the same invalid encoding problem.

This is not really surprising, since SQL_ASCII, unlike all other encodings, does not check the coding.

> On the face value this seems to solve the problem at least superficially.

The more interesting question is what your application is doing with the non-ASCII characters in your database. The answer to that question will tell you what the correct contents would be.

> I'd like to check the data validity, and the easiest way seems to be to
> dump the data again from the linux box and compare with the original.

Your application defines what is valid. Even if you knew the dumps were identical, that would not tell you anything about the validity of the data. A better check would be to test with the application(s) connecting to both servers, working with some records that do contain non-ASCII characters. If both servers give the same results with your application(s), you most probably got the coding right.

> Is there a way to compare between any two databases online? (like running
> a script checking row counts and schema) If I run crc on the concat of all
> fields in a row, and if the crc matches, would it be reasonably
> sufficient? Is there a stronger validation method?

Since any general tool for comparing database contents (I don't know of such a tool) would use its own drivers and setup, it would probably not give the same result as a test with your client applications.

The bottom line is that only an encoding set at the server level makes clear what the meaning of non-ASCII characters is. The server can then handle the conversion between the server and client encodings, so that different clients can work even with different internal encodings. With SQL_ASCII, only the client application knows. That kind of setup needs a lot of care to get consistent data, especially when several different applications are used.

The drawback of selecting an encoding is a small performance penalty. However, in my databases I could not measure any difference; I should say that my data does not contain many strings, so it is definitely not a good test case for this. Since several different clients with different languages use my databases, I use Unicode as the encoding, and it works without any problem for me.

Best regards
Ivo
unsubscribe

Regards,

-----Original Message-----
From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Ben K.
Sent: Tuesday, March 21, 2006 10:09 PM
To: Tom Lane
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] encoding question
unsubscribe

Regards,

-----Original Message-----
From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Ivo Rossacher
Sent: Wednesday, March 22, 2006 3:08 AM
To: pgsql-admin@postgresql.org
Cc: Ben K.
Subject: Re: [ADMIN] encoding question