Thread: Character encoding problems

Character encoding problems

From
Bruce Clay
Date:
Sorry for the duplicate postings.  I have only received one reply so far, and that was a suggestion to post to this
forum.

I am trying to build a database to support natural language processing from a variety of data files posted on the
internet. Many of them are identified as using UTF-8 encoding.  Some of these are dictionary files for WinEdt. Some are
from an Open Source multi-lingual health care package.

When I try to build a table from several of the different languages I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x82

I checked the database encoding and it is indeed set to UTF-8. I tried to create databases using a variety of other
encoding types such as WIN1252 and others, and I got the same error message from all of them except SQL_ASCII.

When I created the database using SQL_ASCII I received the warning that the database could only store 7-bit data. When
I loaded the data in this database I did not have any errors, and when I look at the data it seems to be the same as in
the original text file.

Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the
WordNet dataset that happily uses the UTF8 encoding?

Bruce

Re: Character encoding problems

From
John R Pierce
Date:
On 12/08/11 7:54 PM, Bruce Clay wrote:
> Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the
> WordNet dataset that happily uses the UTF8 encoding?

some of your input data may be in other encodings, not UTF8, for
instance, LATIN1.  if you can identify these, and use SET
CLIENT_ENCODING=...  at the appropriate times, you should be able to
import from the various data sources.
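
To make the suggestion above concrete, here is a minimal sketch (not from the original post) of one way to pick a client_encoding per input file before loading it. It assumes Python 3; the helper names guess_client_encoding and set_statement and the candidate list are hypothetical, and the guess is only a heuristic: a single-byte codec such as LATIN1 accepts any byte, so a successful decode does not prove the file really uses that encoding.

```python
# Hypothetical helper: map a file's raw bytes to a PostgreSQL client_encoding name.
# The candidate list is an assumption; reorder or extend it for the encodings
# you actually expect in your data files.
CANDIDATES = [
    ("utf-8", "UTF8"),       # strict: rejects stray bytes such as a lone 0x82
    ("cp1252", "WIN1252"),   # rejects only the few bytes Windows-1252 leaves undefined
    ("latin-1", "LATIN1"),   # accepts every byte, so it is the last resort
]


def guess_client_encoding(raw: bytes) -> str:
    """Return the first candidate encoding name that decodes `raw` without error."""
    for py_codec, pg_name in CANDIDATES:
        try:
            raw.decode(py_codec)
        except UnicodeDecodeError:
            continue
        return pg_name
    raise ValueError("no candidate encoding matched")


def set_statement(raw: bytes) -> str:
    """Build the SET statement to issue before loading this file's data."""
    return f"SET client_encoding TO '{guess_client_encoding(raw)}';"
```

With the database itself created as UTF8, the server then converts each file from the declared client encoding on the way in. It is still worth eyeballing a few accented words after the load, since a wrong guess loads cleanly but stores the wrong characters.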

otherwise, you might have to run the data through some sort of filter
before you feed it to postgres, I dunno. I'm pretty sure a lone 0x82 byte
is not valid in UTF8 (it can only appear as a continuation byte inside a
multibyte sequence).
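
A rough sketch of such a filter, again assuming Python 3 (the function names and the cp1252 default are assumptions, not anything from the original posts). It also shows why 0x82 trips the UTF8 check: on its own the byte does not decode as UTF-8, while Windows-1252 reads it as a low quotation mark and the DOS code page 437 reads it as an accented e, so the right source encoding depends on where the file came from.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return `raw` as UTF-8 bytes.

    Data that is already valid UTF-8 is passed through untouched; anything
    else is assumed (an assumption, check your files) to be in
    `source_encoding` and is transcoded.
    """
    if is_valid_utf8(raw):
        return raw
    return raw.decode(source_encoding).encode("utf-8")


# The byte from the error message, interpreted three ways:
lone_byte = b"\x82"
as_utf8_ok = is_valid_utf8(lone_byte)   # False: not a complete UTF-8 sequence
as_win1252 = lone_byte.decode("cp1252") # U+201A, single low-9 quotation mark
as_cp437 = lone_byte.decode("cp437")    # U+00E9, e with acute accent
```

One way to use it would be to read each data file in binary mode, pass the bytes through to_utf8 with the encoding you believe the file uses, and write the result to a new file that is then loaded with client_encoding left at UTF8.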


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast