Thread: Character encoding problems
Sorry for the duplicate postings. I have only received one reply so far, and that was a suggestion to post to this forum.

I am trying to build a database to support natural language processing from a variety of data files posted on the internet. Many of them are identified as using UTF-8 encoding. Some of these are dictionary files from WinEdt. Some are from an Open Source multi-lingual health care package. When I try to build a table from several of the different languages I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x82

I checked the database encoding and it is indeed set to UTF8. I tried to create databases using a variety of other encoding types, such as WIN1252, and I got the same error message from all of them except SQL_ASCII. When I created the database using SQL_ASCII I received the warning that the database could only store 7-bit data. When I loaded the data into this database I did not get any errors, and when I look at the data it seems to be the same as in the original text file.

Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the WordNet dataset that happily uses the UTF8 encoding?

Bruce
On 12/08/11 7:54 PM, Bruce Clay wrote:
> Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the WordNet dataset that happily uses the UTF8 encoding?

Some of your input data may be in other encodings, not UTF8, for instance LATIN1. If you can identify these, and use SET CLIENT_ENCODING=... at the appropriate times, you should be able to import from the various data sources. Otherwise, you might have to run the data through some sort of filter before you feed it to postgres, I dunno.

I'm pretty sure 0x82 on its own is not a valid byte sequence in UTF8; it can only appear as a continuation byte after a lead byte.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast
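
For what it's worth, the "filter" idea can be sketched in a few lines of Python. This is only an illustration, not a tested import tool: the helper names (decodes_as, first_working_encoding, to_utf8), the candidate list and the sample bytes are all made up for the example. It shows that a lone 0x82 is rejected as UTF-8, and that the same byte means different characters in different single-byte code pages, so the guessed encoding still has to be checked by eye.

```python
# Illustrative sketch only; the helper names below are hypothetical.

# Candidate source encodings to try, in order. This list is an assumption:
# adjust it to whatever the data files are suspected to use.
CANDIDATES = ("utf-8", "cp1252", "cp850")


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes under encoding without error."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def first_working_encoding(raw: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes raw, or None.

    A successful decode does not prove the guess is right: single-byte
    encodings accept almost any byte, so the result needs a visual check.
    """
    for name in candidates:
        if decodes_as(raw, name):
            return name
    return None


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw from source_encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


sample = b"caf\x82"  # made-up word containing a lone 0x82 byte

print(decodes_as(sample, "utf-8"))       # False: lone 0x82 is rejected
print(first_working_encoding(sample))    # cp1252 (first candidate that decodes)
print(ascii(sample.decode("cp1252")))    # 0x82 is U+201A, a low quotation mark
print(ascii(sample.decode("cp850")))     # 0x82 is U+00E9, e with acute accent
print(to_utf8(sample, "cp850"))          # b'caf\xc3\xa9'
```

Note that the first encoding that decodes cleanly here (cp1252) turns 0x82 into a punctuation mark, while the DOS code page cp850 turns it into an accented letter, which is more plausible for a word list. Once the real source encoding is known, either convert the file to UTF-8 first as above, or leave the file alone and tell postgres the encoding with SET CLIENT_ENCODING before loading.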