Character encoding problems - Mailing list pgsql-general

From	Bruce Clay
Subject	Character encoding problems
Date	December 9, 2011 03:35:56
Msg-id	35b888aa-eac8-4b23-9f17-a04feb58854b@Mariah
Responses	Re: Character encoding problems
List	pgsql-general

Sorry for the duplicate postings.  I have only received one reply so far, and that was a suggestion to post to this
forum.

I am trying to build a database to support natural language processing from a variety of data files posted on the
internet. Many of them are identified as using UTF-8 encoding.  Some of these are dictionary files for WinEdt. Some are
from an Open Source multi-lingual health care package.

When I try to build a table from several of the different languages I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x82

I checked the database encoding and it is indeed set to UTF8. I tried creating databases using a variety of other
encoding types, such as WIN1252, and I got the same error message from all of them except SQL_ASCII.
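
A minimal sketch of how the offending byte could be located in a source file before loading it, assuming Python 3 is
available. The function name and the sample bytes are hypothetical and only illustrate the check; a real run would
read the bytes from one of the word-list files instead.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte
        return exc.start, data[exc.start]
    return None


# Hypothetical sample: ASCII text followed by a lone 0x82 byte,
# the same byte value the server reported.
sample = b"caf\x82\n"
result = find_invalid_utf8(sample)
if result is None:
    print("valid UTF-8")
else:
    offset, value = result
    print(f"invalid byte 0x{value:02x} at offset {offset}")
```

A byte of 0x82 on its own can never start a UTF-8 character, so a hit like this suggests the file was written in some
single-byte encoding rather than UTF-8, whatever its label says.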

When I created the database using SQL_ASCII I received a warning that the database could only store 7-bit data. When
I loaded the data into this database I did not get any errors, and when I look at the data it seems to be the same as in
the original text file.
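
Since SQL_ASCII stores bytes above 127 without validating or converting them, the load succeeding says nothing about
which encoding the files really use. A small sketch, assuming Python 3 and a hypothetical list of candidate code pages,
of how the reported byte 0x82 would read under each of them:

```python
# Candidate encodings are an assumption; the real source encoding is unknown.
CANDIDATES = ("cp1252", "cp850", "cp437", "latin-1")


def decode_candidates(raw: bytes, encodings=CANDIDATES):
    """Map each candidate encoding to the text it yields for the given bytes."""
    decoded = {}
    for name in encodings:
        try:
            decoded[name] = raw.decode(name)
        except UnicodeDecodeError:
            decoded[name] = None  # byte sequence is not valid in this encoding
    return decoded


for name, text in decode_candidates(b"\x82").items():
    print(f"{name:8} {text!r}")
```

Comparing the decoded characters against words that are known to be correct in the source files is one way to narrow
down which encoding the files were actually saved in.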

Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the
WordNet dataset, which happily uses the UTF8 encoding?

Bruce
