Re: encoding question - Mailing list pgsql-admin
From | Ben K. |
---|---|
Subject | Re: encoding question |
Date | |
Msg-id | Pine.GSO.4.64.0603211015260.16852@coe.tamu.edu |
In response to | Re: encoding question (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: encoding question
Re: encoding question |
List | pgsql-admin |
>> ERROR: invalid UTF-8 byte sequence detected near byte 0x85

> Looks to me like it might have been meant as LATIN1 or one of
> the other single-byte ASCII-extension encodings.

Thanks. Indeed, the data has non-ASCII characters and wouldn't be covered by SQL_ASCII, I see now. I never suspected there'd be non-ASCII in the data, since we do cleansing before script-loading it, but we use other input methods too, so I'm not sure where the characters came from.

I didn't specify an encoding when running initdb during the upgrade to 8.1.0, and I think that is where I could have prevented this problem, but I'm not sure. I suspect so because of this article (at least for locale C, since I did not specify an encoding and got UTF on Linux with en_US.UTF-8). Is it valid for 8.1.0?

http://www.commandprompt.com/ppbook/x17149

"ENCODING = encoding ... If the ENCODING keyword is unspecified, PostgreSQL will create a database using its default encoding. This is usually SQL_ASCII, though it may have been set to a different default during the initial configuration of PostgreSQL (see Chapter 2 for more on default encoding)."

And I'm getting this from pgAdmin III. I guess this is the reason why you all say to avoid SQL_ASCII?

"Database encoding The database ... is created to store data using the SQL_ASCII encoding. This encoding is defined for 7 bit characters only; the meaning of characters with the 8th bit set (non-ASCII characters 127-255) is not defined. Consequently, it is not possible for the server to convert the data to other encodings. If you're storing non-ASCII data in the database, you're strongly encouraged to use a proper database encoding representing your locale character set to take benefit from the automatic conversion to different client encodings when needed. If you store non-ASCII data in an SQL_ASCII database, you may encounter weird characters written to or read from the database, caused by code conversion problems.
This may cause you a lot of headache when accessing the database using different client programs and drivers. For most installations, Unicode (UTF8) encoding will provide the most flexible capabilities."

Could anyone comment on whether the method in this URL is valid and reasonably safe? (At this time the problem seems almost harmless except for a few records not being loaded, but it'll need to be fixed.)

http://archives.postgresql.org/pgsql-general/2004-02/msg01192.php

dump database, recode the dump, drop database, restore from recoded dump

In particular, does anyone have experience with recode vs. manual inspection?

I'm just reasoning from pieces of information. I'd appreciate any advice or experience.

Regards,

Ben K.
Developer
http://benix.tamu.edu
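
For the "recode vs. manual inspection" question, here is a minimal Python sketch of the inspection half, under stated assumptions: the dump is a plain-text file read as bytes, and the source encoding is one of the single-byte candidates suggested above. The function names `find_invalid_utf8_lines` and `recode_to_utf8` and the sample data are hypothetical, made up for illustration; they are not part of PostgreSQL or of the recode utility.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(dump: bytes) -> List[Tuple[int, bytes]]:
    """Return (1-based line number, raw line) for each line that is not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(dump.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad


def recode_to_utf8(dump: bytes, source_encoding: str) -> bytes:
    """Recode the whole dump, ASSUMING every byte of it is in source_encoding."""
    return dump.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # Made-up sample: ASCII rows plus one stray 0x85 byte, as in the error above.
    sample = b"id\tnote\n1\tplain ascii\n2\twait\x85 what\n"

    for lineno, line in find_invalid_utf8_lines(sample):
        print(lineno, line)
        # Show how the offending line reads under each candidate encoding,
        # so a person can judge which interpretation was intended.
        for candidate in ("latin-1", "cp1252"):
            print("  ", candidate, repr(line.decode(candidate)))
```

Listing the offending lines first keeps the inspection manageable when only a few records are affected, and shows which candidate encoding gives a sensible reading before anything is converted. Note that recoding the whole file is only safe if no part of it is already multi-byte UTF-8; otherwise those sequences would be encoded a second time.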