Re: invalid byte sequence for encoding "UTF8" - Mailing list pgsql-general

From Gregory Stark
Subject Re: invalid byte sequence for encoding "UTF8"
Msg-id 87bq9ccao5.fsf@oxford.xeocode.com
In response to invalid byte sequence for encoding "UTF8"  (Glyn Astill <glynastill@yahoo.co.uk>)
List pgsql-general
[Generally it's not a good idea to start a new thread by responding to an
existing one; it confuses people and makes it more likely that your question
will be missed.]


"Glyn Astill" <glynastill@yahoo.co.uk> writes:

> Hi People,
>
> I've set up a postgres 8.2 server and have a database set up with UTF8
> encoding. I intend to read some of our legacy data into the table;
> this legacy data is in ASCII format, and as far as I know is 8 bit
> ASCII.

ASCII is a 7-bit encoding. If you have bytes with the high bit set then you
have something else. Can you give any examples of characters with the high bit
set and what you think they represent?
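
One quick way to answer that is to scan a sample of the exported data for
bytes of 0x80 and above. The following is only a sketch in Python; the sample
record is made up for illustration and does not come from the DataFlex files.

```python
def high_bit_bytes(data: bytes) -> list:
    """Return (offset, byte value) pairs for every byte outside 7-bit ASCII."""
    return [(offset, value) for offset, value in enumerate(data) if value >= 0x80]


# Hypothetical legacy record; real data would be read from the export in
# binary mode, e.g. open(path, "rb").read().
sample = b"caf\xe9 cr\xe8me"

for offset, value in high_bit_bytes(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

The byte values it reports, together with what the characters are supposed to
be, are usually enough to work out which 8-bit encoding the data really uses.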

> We have a migration tool from mertechdata.com to convert these files
> that are in a DataFlex format into our postgres tables.
>
> Some files convert over okay, and some come up with the error message
> 'invalid byte sequence for encoding "UTF8"'. The files that come up
> with the error are created correctly and so are their indexes, but as
> soon as we come to insert the data we get this error.

This error indicates that you are trying to import data with client_encoding
set to UTF8 but the data isn't actually UTF8 and contains invalid byte
sequences for UTF8.
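
To see the same kind of failure outside the database, you can ask Python to
decode the bytes as UTF-8. This is a rough approximation only: the server's
validation is not identical to Python's decoder, and the function name below
is my own.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending bytes) for the first invalid UTF-8 sequence,
    or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None


# 0xe9 is "e acute" in Latin-1, but in UTF-8 it starts a three-byte sequence,
# so the space that follows it is not a valid continuation byte.
print(first_invalid_utf8(b"caf\xe9 x"))
print(first_invalid_utf8("caf\u00e9 x".encode("utf-8")))
```

Rows that come back as None here are at least well-formed UTF-8; rows that do
not are the ones likely to trigger the error on insert.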

If your migration toolkit lets you set the client encoding separately from the
server encoding then you can set the client encoding to match your data and
the server encoding to the encoding you want the server to use.
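
If the tool does not expose this but lets you run SQL on its connection, the
setting can be changed per session with SET client_encoding. A minimal sketch,
assuming a Python DB-API style connection object; the helper name is
hypothetical and LATIN1 is only a guess at what the legacy data is.

```python
import re


def set_client_encoding(conn, encoding: str) -> None:
    """Tell the server which encoding this session's client data is in.

    `conn` is assumed to be any DB-API connection to PostgreSQL.
    """
    # Encoding names are plain identifiers such as LATIN1 or WIN1252; reject
    # anything else rather than splice it into the SQL statement.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", encoding):
        raise ValueError(f"suspicious encoding name: {encoding!r}")
    cur = conn.cursor()
    cur.execute(f"SET client_encoding TO '{encoding}'")
    cur.close()
```

With that in place, a call like set_client_encoding(conn, "LATIN1") before the
inserts lets the server convert the incoming bytes to the database encoding.
The PGCLIENTENCODING environment variable does the same job for libpq-based
clients.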

Otherwise you'll have to recode the data to UTF8 or whatever encoding you want
the data to be. There are tools to do this (GNU "recode", for example).
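
If you would rather script the conversion, the same recoding is a decode
followed by an encode. A sketch, assuming the source encoding is known; the
latin-1 default and the cp850 example are guesses, not something established
about the DataFlex data.

```python
def recode_to_utf8(data: bytes, source_encoding: str = "latin-1") -> bytes:
    """Reinterpret bytes from a known legacy encoding and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# The same character, "e acute", is a different byte in different encodings.
print(recode_to_utf8(b"caf\xe9"))           # Latin-1 input
print(recode_to_utf8(b"caf\x82", "cp850"))  # DOS code page 850 input
```

Picking the wrong source encoding will not raise an error for most 8-bit
encodings; it just produces the wrong characters, so check a few known values
after converting.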


> Are there any more flexible formats we could use? I noticed we have
> Latin 1-10 and ISO formats. Is there any reason why we shouldn't use
> these?

Well, there are pros and cons. The 1-byte ISO formats will be more space
efficient and also allow some CPU optimizations so they perform somewhat
better. But if you ever need to store a character which doesn't fit in the
encoding you'll be stuck. Postgres doesn't support using multiple encodings in
the same database (or effectively even in the same initdb cluster).
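
Both sides of that trade-off are easy to see with Python's codecs standing in
for the database encodings. This is illustrative only; the helper name is
mine, and it says nothing about how the server stores data internally.

```python
def fits(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


word = "caf\u00e9"
print(len(word.encode("latin-1")))  # one byte per character
print(len(word.encode("utf-8")))    # the accented letter takes two bytes

# The euro sign is missing from Latin-1 but present in Latin-9 (ISO 8859-15).
euro = "\u20ac"
print(fits(euro, "latin-1"), fits(euro, "iso8859-15"), fits(euro, "utf-8"))
```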

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
