Thread: invalid UTF-8 byte sequence detected

invalid UTF-8 byte sequence detected

From
"Markus Wollny"
Date:
Hello!

I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple
oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: 

<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte 0xb4
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT:  COPY board_message, line 1125662, column text: "HI

        Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon▒s viel..."
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT:  COPY board_message (board_id, thread_id,
father_id,message_id, user_id, title, signat 
ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank,
user_rank_description,user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created,
poll_id,idxfti) FROM stdin; 

<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte 0x98
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT:  COPY kidszone_tournament2005_user, line 427,
columnphone: "02302▒74" 
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT:  COPY kidszone_tournament2005_user (id,
first_name,last_name, adress, birthday, phone, 
 email, place, permission, ude, ude_box, invited) FROM stdin;

There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it has
meworried and leaves me with some questions: 

1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encoded
databases,data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but still we
cannotreally make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenient
ifwe had to check each and every text input for character set conformance in the application, so is there a way to
ascertain"sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that (I
usedcustom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-output
thatgenerates a standard SQL-script), although it too throws the error. 

2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the
tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error.
Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is
theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import,
soI can accomplish a full restore? 

Kind regards,

   Markus

Re: invalid UTF-8 byte sequence detected

From
Bruce Momjian
Date:
I am also confused how invalid UTF8 sequences got into your database.
It shouldn't have been possible.

---------------------------------------------------------------------------

Markus Wollny wrote:
> Hello!
>
> I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple
oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: 
>
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte
0xb4
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT:  COPY board_message, line 1125662, column text:
"HI
>
>         Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon?s viel..."
> <2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT:  COPY board_message (board_id, thread_id,
father_id,message_id, user_id, title, signat 
> ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status,
user_rank,user_rank_description, user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login,
user_created,poll_id, idxfti) FROM stdin; 
>
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte
0x98
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT:  COPY kidszone_tournament2005_user, line 427,
columnphone: "02302?74" 
> <2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT:  COPY kidszone_tournament2005_user (id,
first_name,last_name, adress, birthday, phone, 
>  email, place, permission, ude, ude_box, invited) FROM stdin;
>
> There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it
hasme worried and leaves me with some questions: 
>
> 1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8
encodeddatabases, data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but
stillwe cannot really make 100% sure that all clients behave as expected; on the other hand, it would be extremely
inconvenientif we had to check each and every text input for character set conformance in the application, so is there
away to ascertain "sane" data via some database-setting? pg_restore does throw this error and indeed terminates after
that(I used custom dump format for pg_dump), psql on the other hand just continues with the import (using a
pgdumpall-outputthat generates a standard SQL-script), although it too throws the error. 
>
> 2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the
tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error.
Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is
theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import,
soI can accomplish a full restore? 
>
> Kind regards,
>
>    Markus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: invalid UTF-8 byte sequence detected

From
"Markus Wollny"
Date:
Hi!

> -----Ursprüngliche Nachricht-----
> Von: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
> Gesendet: Dienstag, 15. November 2005 19:46
> An: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Betreff: Re: [GENERAL] invalid UTF-8 byte sequence detected
>
>
> I am also confused how invalid UTF8 sequences got into your database.
> It shouldn't have been possible.
>

Our databases were originally encoded in ASCII, though they should have been LATIN9 - or UNICODE; this has been
remedieda long time ago using iconv on the dumps; our PostgreSQL-version then was 7.4 and we converted databases and
dumpsto UTF-8. Maybe the invalid byte sequences have been entered sometimes during our migration from ODBC to JDBC
whileour encoding was still a mess - though I would have thought that this should have been resolved by
dump&iconv&restorethen. However, I do suspect that the cause of the issue was really more or less a bug in PostgreSQL
<8.1,which accepted certain illegal byte sequences. I our case, I found that the re-import of the dump errored out on
ISO-8859-1encoded backticks (´) - certain mournfully misled individuals do use this "degu"-character instead of the
apostropheeven tough it's more difficult to type on a german keyboard layout. And quite wrong, too. 

Anyway, I found some reference in the hackers-list that encoding-consistency for Unicode has been tightened down (see
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00972.php). Both a solution and a suggestion have been posted
inthis thread; Christopher Kings-Lynne has suggested to include a reference to this issue in the 'upgrading/back
compatibiliy'section for these release notes - I do strongly second his suggestion :) 

The suggested solution was to feed the plain dump again through iconv; however at least on my systems (Debian Sarge)
iconvdidn't like my >5GB sized dump files. So in order to successfully reimport the dumps, I had to "split
--line-bytes=650m"the SQL-file, pass the parts through iconv -c -f UTF8 -t UTF8 and concatenate them back into one file
again.There were no more errors on feeding the dump back into psql and I didn't come across any missing data during my
tests,so this has definitely done the trick for me. 

As 8.1 has tightened down encoding-consistency for Unicode, I believe that the databases should be safe from any
illegalbyte-sequences in text-fields from now on.  

Kind regards

   Markus