invalid UTF-8 byte sequence detected - Mailing list pgsql-general

From Markus Wollny
Subject invalid UTF-8 byte sequence detected
Date
Msg-id 28011CD60FB1724DBA4442E38277F6262857D2@hermes.computec.de
Whole thread Raw
Responses Re: invalid UTF-8 byte sequence detected  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-general
Hello!

I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple
oftimes over certain errors in text-fields that lead to error-messages during import of the dump like these: 

<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte 0xb4
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT:  COPY board_message, line 1125662, column text: "HI

        Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon▒s viel..."
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT:  COPY board_message (board_id, thread_id,
father_id,message_id, user_id, title, signat 
ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank,
user_rank_description,user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created,
poll_id,idxfti) FROM stdin; 

<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR:  invalid UTF-8 byte sequence detected near byte 0x98
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT:  COPY kidszone_tournament2005_user, line 427,
columnphone: "02302▒74" 
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT:  COPY kidszone_tournament2005_user (id,
first_name,last_name, adress, birthday, phone, 
 email, place, permission, ude, ude_box, invited) FROM stdin;

There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it has
meworried and leaves me with some questions: 

1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encoded
databases,data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but still we
cannotreally make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenient
ifwe had to check each and every text input for character set conformance in the application, so is there a way to
ascertain"sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that (I
usedcustom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-output
thatgenerates a standard SQL-script), although it too throws the error. 

2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the
tablewhere this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error.
Soa plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is
theresome kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import,
soI can accomplish a full restore? 

Kind regards,

   Markus

pgsql-general by date:

Previous
From: "Andrus"
Date:
Subject: Re: Best way to use indexes for partial match at beginning
Next
From: Richard Huxton
Date:
Subject: Re: Best way to use indexes for partial match at beginning